Missing Data Implementation

This page describes the current method that VisIt uses to handle missing data values. Various I/O libraries are able to annotate arrays with a special value that indicates missing data. When the special missing data value appears in a data array, the client code that processes the data is supposed to ignore the missing data values. VisIt removes missing data values from the visualization so they are not drawn.

Representing missing data

The missing data value has various names depending on the I/O library that was used. The NETCDF convention for missing data seems to have bled into other libraries such as HDF4/5, so we will examine that convention here. For more information, see the NETCDF attribute conventions. To paraphrase, a few different attributes can be present on a variable to represent missing data.

  • _FillValue : The value that was written to the array where no other data values were written.
  • missing_value : The value that represents missing data.
  • valid_min : Every array value below valid_min is missing data.
  • valid_max : Every array value above valid_max is missing data.
  • valid_range : Every array value outside of the specified range is missing data.
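A reader that encounters several of these attributes on one variable has to collapse them into a single missing data rule. The sketch below shows one plausible precedence (range rules over single values, missing_value over _FillValue); the `AttrSet`, `MissingRule`, and `DetermineMissingRule` names are hypothetical illustrations, not part of VisIt or the NETCDF API.

```cpp
// Hypothetical sketch: map NETCDF-style attributes found on a variable to
// one missing data rule. The precedence chosen here is an assumption.
enum MissingType { None, Value, ValidMin, ValidMax, ValidRange };

struct AttrSet
{
    bool   hasMissingValue, hasFillValue, hasValidMin, hasValidMax, hasValidRange;
    double missingValue, fillValue, validMin, validMax, validRange[2];
};

struct MissingRule
{
    MissingType type;
    double      data[2];
};

MissingRule DetermineMissingRule(const AttrSet &a)
{
    MissingRule r = { None, {0., 0.} };
    if (a.hasValidRange)
    {
        r.type = ValidRange;
        r.data[0] = a.validRange[0];
        r.data[1] = a.validRange[1];
    }
    else if (a.hasValidMin && a.hasValidMax)
    {
        // A min and a max together behave like a range.
        r.type = ValidRange;
        r.data[0] = a.validMin;
        r.data[1] = a.validMax;
    }
    else if (a.hasValidMin)   { r.type = ValidMin; r.data[0] = a.validMin; }
    else if (a.hasValidMax)   { r.type = ValidMax; r.data[0] = a.validMax; }
    else if (a.hasMissingValue) { r.type = Value;  r.data[0] = a.missingValue; }
    else if (a.hasFillValue)    { r.type = Value;  r.data[0] = a.fillValue; }
    return r;
}
```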

Exposing missing data

In order to expose the various missing data values to VisIt so they can be used, we add some new fields to the avtScalarMetaData object. These fields are populated in a reader's PopulateDatabaseMetaData method using the missing data attributes for each variable. Variables with no missing data can indicate MissingData_None to tell VisIt that no missing data was detected, though this is the default behavior for scalar variables.

 typedef enum { 
    MissingData_None,
    MissingData_Value,
    MissingData_Valid_Min, 
    MissingData_Valid_Max,
    MissingData_Valid_Range
 } MissingData;

 MissingData missingDataType;
 double      missingData[2];

The MissingData_Value, MissingData_Valid_Min, and MissingData_Valid_Max cases each store a single value in missingData[0]. The MissingData_Valid_Range case stores the min and max range values in missingData[0] and missingData[1], respectively.
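The interpretation of the missingDataType/missingData pair can be sketched as a predicate on a single value. `IsMissing` is a hypothetical helper written for illustration, not a VisIt function; the enum mirrors the one shown above.

```cpp
// Sketch: test one value against the missing data rule described by
// (missingDataType, missingData[2]). IsMissing is illustrative only.
enum MissingData
{
    MissingData_None,
    MissingData_Value,
    MissingData_Valid_Min,
    MissingData_Valid_Max,
    MissingData_Valid_Range
};

bool IsMissing(double v, MissingData type, const double md[2])
{
    switch (type)
    {
    case MissingData_Value:       return v == md[0];            // exact match
    case MissingData_Valid_Min:   return v < md[0];             // below valid_min
    case MissingData_Valid_Max:   return v > md[0];             // above valid_max
    case MissingData_Valid_Range: return v < md[0] || v > md[1];// outside range
    default:                      return false;                 // MissingData_None
    }
}
```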

The NETCDF reader exposes missing data based on the values of attributes on the file's variables. Here is an example of how to expose missing data (each case below is an alternative; a variable uses only one missing data type):

avtScalarMetaData *smd = new avtScalarMetaData;

// Use an exact value
double values[2] = {1.e20, 1.e20};
smd->SetMissingData(values);
smd->SetMissingDataType(avtScalarMetaData::MissingData_Value);

// Use a valid min = 0.
double values2[2] = {0., 0.};
smd->SetMissingData(values2);
smd->SetMissingDataType(avtScalarMetaData::MissingData_Valid_Min);

// Use a valid max = 100
double values3[2] = {100., 100.};
smd->SetMissingData(values3);
smd->SetMissingDataType(avtScalarMetaData::MissingData_Valid_Max);

// Use a valid range [0., 100.]
double values4[2] = {0., 100.};
smd->SetMissingData(values4);
smd->SetMissingDataType(avtScalarMetaData::MissingData_Valid_Range);

avtMissingDataFilter

Missing data are handled in the avtMissingDataFilter class. The filter's role is to examine the variables in the contract; if their metadata declares missing data, the filter creates an array called avtMissingData on the dataset and optionally removes cells or nodes that have missing data. The avtMissingData array is an unsigned char array in which 0 denotes a real value and a non-zero value denotes missing data. The pipeline's data request indicates the desired treatment of missing data; the possible values are: ignore, identify, and remove.

  • The ignore mode means that even if variables have missing data, the missing data rules are ignored and the filter's input is returned as the output.
  • The identify mode causes VisIt to create the avtMissingData array and add it to the dataset, just to mark where the missing data values occur in the dataset.
  • The remove mode causes VisIt to identify the missing data values and also remove them, serving up a subset of the data.

For various implementation reasons, mostly related to pick, avtMissingDataFilter currently occurs twice in the pipeline. The first instance comes before the expression evaluator filter (EEF), so it has the full list of variables that must be read from the database; this instance creates the avtMissingData array. The second instance comes after the EEF so it can actually remove the cells if missing data removal is desired. Performing both operations before the EEF in a single filter execution caused problems with pick because the mesh connectivity differed depending on the requested variable list. Separating missing data handling into two stages solved the pick issues.

I had originally implemented missing data support inside of avtGenericDatabase but I chose to move it into a filter when I could not easily make it work with material selection. By moving the missing data calculations into a filter, values with missing data are removed after material selection, which is what I wanted. Putting such a stage in the database itself is complicated by the avtDataTree API, so a filter was a more natural fit.

Spreadsheet

The Spreadsheet plot alters the contract to tell the pipeline to identify the missing data values instead of removing them. The Spreadsheet plot uses the avtMissingData array that is passed down the pipeline to show missing data values in a different background color.

Other considerations

  • Since avtMissingDataFilter runs before the EEF, I was not sure how this would affect CMFE expressions; it might make pos_cmfe necessary more often. Now that the pre-EEF stage only creates the avtMissingData array, these issues are probably solved.
  • Should there be an option to ignore any missing data values, by not inserting the avtMissingData filter?
  • It could be argued that avtMissingDataFilter should request a missing data mask for each variable from the plugin via a GetMetaData call that invokes the plugin's GetAuxiliaryData method. The reason is that the plugin reads both the data and the missing data values in their native precision, and a bitwise comparison is only reliable at that native precision. The current implementation sees data after they have passed through the database and the transform manager, which may alter the precision of the data, so values in the data arrays may no longer exactly match the doubles stored for the missing data in the metadata. A related problem occurs when data are stored as short but are scaled and offset to reproduce float or double values while the missing data value remains in the original short precision. In that case, the missing data value must be scaled and offset using the same parameters as the real data, so it could be advantageous to create the missing data mask before any transformations take place. These issues are less critical when valid_min, valid_max, or valid_range are used for the missing data rule because bitwise comparison is not involved.
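The precision concern in the last bullet can be illustrated with the 1.e20 fill value used earlier: 1.e20 is exactly representable as a double but not as a float, so a value that round-trips through float no longer compares equal to the double stored in the metadata. The helper names below are illustrative only.

```cpp
// Illustration of the precision pitfall: an exact-value missing data test
// can fail once data have been converted to a narrower type.
bool SurvivesFloatRoundTrip(double missingValue)
{
    float  f = static_cast<float>(missingValue);   // narrowing conversion
    double d = static_cast<double>(f);             // widen back
    return d == missingValue;                      // exact match preserved?
}

// Shorts reproduced as doubles via scale/offset: the missing data value
// must be transformed with the same parameters as the real data.
double ScaleOffset(short s, double scale, double offset)
{
    return static_cast<double>(s) * scale + offset;
}
```

For example, SurvivesFloatRoundTrip(1.e20) is false, while small exactly-representable values such as 0. survive; this is why building the mask at the data's native precision is attractive.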