Model Fit Operator
The ModelFit operator lets you define multiple high-dimensional models and use them to classify data. Models are specified in a space spanned by selected variables. For example, a model specified over the variables [v1, v2, v3] is a collection of 3-dimensional points, with each component corresponding to one of the three variables. Once all models are defined, the distance from each model to every spatial location in the data is calculated as the distance from that location to the model's closest point. The model with the shortest distance is that location's classifying model. The operator creates two variables: one containing the integer tag of the classifying model, and another containing the calculated distance.
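The classification rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the operator's source code: the function name `classify`, the sample models, and the use of Euclidean distance are assumptions made for the example.

```python
import math

def classify(location, models):
    """Return (tag, distance) for the model closest to `location`.

    `models` maps an integer tag (1, 2, ...) to a list of points, each with
    one component per selected variable. The distance to a model is the
    distance to its closest point. Euclidean distance is assumed here.
    A tag of 0 means no model was defined.
    """
    best_tag, best_dist = 0, math.inf
    for tag, points in models.items():
        dist = min(math.dist(location, p) for p in points)
        if dist < best_dist:
            best_tag, best_dist = tag, dist
    return best_tag, best_dist

# Two hypothetical models over the variable space [v1, v2, v3].
models = {
    1: [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    2: [(5.0, 5.0, 5.0)],
}

tag, dist = classify((1.0, 1.0, 2.0), models)
print(tag, dist)  # 1 1.0
```

The two returned values correspond to the two variables the operator creates: the integer tag of the classifying model and the distance to it.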
Parts of the Operator
- Add Model: Any number of models may be added/renamed and each appears in the left table of the operator GUI. The right side of the interface is specific to a single model and updates on the selection of any model in the left table.
- Add Variable/Add Point: Models are defined by selecting variables (i.e. a variable space) and specifying some number of points in that space. The buttons for these additions affect the table on the right for any selected model.
- Input Values: The values of the points may be entered into the right table at any time during the definition process.
- Select Input Space: Distances may be calculated in a variety of spaces; 4 are currently supported. Input values may be entered in any of these spaces, and the space they are entered in is specified here:
- Variable – the original space of the dataset.
- 0-1 – the data is normalized to be in the range [0, 1].
- Log – the base 2 log is taken of the data.
- Probability – the data is converted using its cumulative distribution function; distance in this space corresponds to an area under a density curve.
- Select Calculation Space: The space in which the distances are calculated.
- Select Distance Metric: Choose one of 3 implemented distance metrics: Euclidean (2-norm), Manhattan (1-norm) and Maximum (infinity-norm). Going from the 1-norm through the 2-norm to the infinity-norm, these metrics become progressively more sensitive to a single component in the calculation. The Input Space, Calculation Space, and Distance Metric settings are specific to each defined model and must be set for each one.
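As a rough illustration of the spaces and metrics listed above, the sketch below implements each transform and each norm in plain Python. The function names are invented for this example, and the Probability transform uses an empirical cumulative distribution (the fraction of samples less than or equal to a value), which is an assumption about how such a conversion could be done rather than a description of the operator's internals.

```python
import math
from bisect import bisect_right

def to_unit(values):
    """0-1 space: rescale so the minimum maps to 0 and the maximum to 1.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def to_log2(values):
    """Log space: base 2 logarithm (values must be positive)."""
    return [math.log2(v) for v in values]

def to_probability(values):
    """Probability space: empirical CDF, the fraction of samples <= v."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]

def distance(a, b, metric):
    """Distance between two points under the named metric."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if metric == "manhattan":   # 1-norm
        return sum(diffs)
    if metric == "euclidean":   # 2-norm
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "maximum":     # infinity-norm
        return max(diffs)
    raise ValueError(f"unknown metric: {metric}")

values = [1.0, 2.0, 4.0, 8.0]
print(to_log2(values))         # [0.0, 1.0, 2.0, 3.0]
print(to_probability(values))  # [0.25, 0.5, 0.75, 1.0]

a, b = (0.0, 0.0), (3.0, 4.0)
for metric in ("manhattan", "euclidean", "maximum"):
    print(metric, distance(a, b, metric))  # 7.0, 5.0, 4.0
```

For the component differences (3, 4), the largest component accounts for 4/7 of the Manhattan distance, 4/5 of the Euclidean distance, and all of the Maximum distance, which is the sensitivity ordering described above.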
- Open sresa1b_ncar_ccsm3_0_run1_200001.nc (http://www.unidata.ucar.edu/software/netcdf/examples/files.html) linked to from: (https://wci.llnl.gov/codes/visit/datafiles.html). We'll be looking at two variables from this dataset, pr and tas. Pseudocolor plots of these variables follow:
- Add a Pseudocolor plot of operators->ModelFit->model
- Add a model, and rename the model to “Max pr”
- Add a variable to the model and choose pr from the “Scalars” menu. Then enter the maximum value of pr (0.0003) into the model definition table.
- Press the “Apply” button, then draw the plot. Make sure the Pseudocolor plot range is from min: 0 to max: 1. The model variable will contain a 1 everywhere that the model classifies the data, and a 0 where no model does. With the default (Hot) color table, the resulting image should be all red. This is because with only one model, it is trivially true that the one model is the closest of all defined models to each point in the data.
- There is a way, however, to define a range for any input value, forcing models to be good fits. In the table, replace the input value with the following: 0.00015-0.00030. This tells the operator that unless a point's value falls within that range, the model should not even be considered a reasonable classifier. Notice that the resulting image now contains 3 colors. Blue is everywhere the model is not the classifier, and red is where it is. The green is an interpolation artifact caused by the fact that we have discrete data and a continuous color scale. The following image has a 2-color discrete color scale, which resolves this problem. As a sanity check, compare this image with a Pseudocolor plot of the pr variable, and use the Threshold operator to look at only the values between 0.00015 and 0.00030.
- Revert to the original Max pr (input value 0.0003), and add a second model. Name the second model Max tas, add the tas variable, and enter its maximum value (309).
- Make sure the Pseudocolor plot range is now at least [0, 2], to allow for the integer tag of the second model (Max tas), then draw the plot. Again assuming the default Hot color table, the plot is all green, meaning that Max pr classifies every location. This is because the two variables are on different scales. Also, 309 is only the approximate maximum of tas: the true maximum differs from it by about 0.1, which is more than the distance between the two farthest values of pr.
- To normalize scales, change the Calculation Space to 0-1 for both models. Now the plot is overwhelmingly red, but both models are classifying some of the data. This tells us that, once scales are removed, the overall range of temperature is much smaller than that of precipitation.
- To achieve maximum representation in the output of all models, let's normalize both scales and distributions: change the Calculation Space to Probability. Here is the output:
- Finally, let's look at the distance variable. Remember, the classifying model at a point is chosen by minimum distance, and that distance is stored in the distance variable. The following image is the result of changing the plotted variable from ModelFit->model to ModelFit->distance with the same two models. The color table has been changed to an inverted grayscale: this is the “Gray” choice in the Pseudocolor plot attributes menu, inverted (with a checkbox) since the lowest distances are the best fits. These are the distances for the classification in the previous step (Calculation Space set to Probability).
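The effect of the calculation space and of input ranges in the walkthrough above can be reproduced on a handful of numbers. In the sketch below, only the model inputs 0.0003 (Max pr) and 309 (Max tas) and the range string 0.00015-0.00030 come from the walkthrough. The sample values, the assumed minima, and all function names are hypothetical, and the range parsing assumes non-negative bounds.

```python
# Model inputs from the walkthrough: 0.0003 (Max pr) and 309 (Max tas).
# The minima and the samples below are made up for illustration.
PR_MIN, PR_MAX = 0.0, 0.0003
TAS_MIN, TAS_MAX = 220.0, 309.0

samples = [(0.00001, 300.0), (0.00028, 240.0), (0.00005, 308.9)]  # (pr, tas)

def to_unit(value, lo, hi):
    """Map a value into the 0-1 space."""
    return (value - lo) / (hi - lo)

def classify(pr, tas, normalized):
    """Return 1 if Max pr is the closer model, 2 if Max tas is."""
    if normalized:
        d_pr = abs(to_unit(pr, PR_MIN, PR_MAX) - 1.0)
        d_tas = abs(to_unit(tas, TAS_MIN, TAS_MAX) - 1.0)
    else:
        d_pr = abs(pr - PR_MAX)
        d_tas = abs(tas - TAS_MAX)
    return 1 if d_pr <= d_tas else 2

def classify_with_range(pr, text):
    """Single-model case with a range input such as '0.00015-0.00030':
    tag 1 inside the range, 0 (no classifying model) outside it."""
    lo, hi = (float(part) for part in text.split("-"))
    return 1 if lo <= pr <= hi else 0

raw_tags = [classify(pr, tas, normalized=False) for pr, tas in samples]
unit_tags = [classify(pr, tas, normalized=True) for pr, tas in samples]
range_tags = [classify_with_range(pr, "0.00015-0.00030") for pr, _ in samples]

print(raw_tags)    # [1, 1, 1]
print(unit_tags)   # [2, 1, 2]
print(range_tags)  # [0, 1, 0]
```

In the original variable space every pr distance is at most 0.0003, so Max pr wins everywhere, even for the sample whose tas is within 0.1 of 309. After rescaling to 0-1, both models classify some of the samples.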