QC-aware transformations #703

Open
maxwelllevin opened this issue Aug 8, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@maxwelllevin
Contributor

xarray provides powerful resample and groupby methods for transforming data onto a target coordinate grid, but it has no concept of QC variables, so any transformations applied (e.g., mean, nearest) are performed naively and can end up using data that has been flagged as bad.

I think ACT would be a great place to host extensions to xarray's methods that do account for QC values in transformations. The proposed interface would be a series of methods that mirror the transformation types offered by the ARM Data Integrator (ADI) made available to ARM users in the PCM Interface:

  • nearest neighbor
  • bilinear interpolation
  • bin averaging
  • auto (picks between interpolation and averaging based on bin size) -- optional

The ADI library makes a few key decisions for QC-aware transformations that I think should be mirrored here:

  • data values QC'd as bad are excluded from consideration in the transformation
    • for averaging this is equivalent to marking these as NaN
    • for nearest neighbor (interpolation), the nearest non-bad point(s) are used
  • for averaging, if >threshold % of values in a bin are bad, then the output value is set to missing
    • also if >other threshold are indeterminate, output value is also set to missing
    • there are reasonable defaults set for each (I think these are 50% and 80%, respectively)
    • PCM/ADI goes a step further and lets you customize this for each variable, but I'm not sure if the functionality/complexity trade-off is worth it for ACT
  • optionally, an output summary QC variable is generated for each QC'd input variable
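The nearest-neighbor rule in the list above can be sketched in plain Python. This is only an illustration: `nearest_good_index` is a hypothetical helper (not an ADI or ACT function), and the `tolerance` argument stands in for the `tolerance` parameter of the proposed API.

```python
from typing import Optional, Sequence


def nearest_good_index(
    times: Sequence[float],
    is_bad: Sequence[bool],
    target: float,
    tolerance: float,
) -> Optional[int]:
    """Return the index of the closest non-bad sample within tolerance, else None.

    Hypothetical helper for illustration only.
    """
    best = None
    best_dist = None
    for i, (t, bad) in enumerate(zip(times, is_bad)):
        if bad:
            continue  # samples QC'd as bad are never candidates
        dist = abs(t - target)
        if dist <= tolerance and (best_dist is None or dist < best_dist):
            best, best_dist = i, dist
    return best


times = [0.0, 10.0, 20.0, 30.0]  # minutes
is_bad = [False, True, False, False]

# The closest sample to t=12 (index 1) is bad, so the next-closest good one is used.
idx = nearest_good_index(times, is_bad, target=12.0, tolerance=15.0)
```

Returning `None` when no good point lies within the tolerance leaves it to the caller to write a missing value for that output point.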

I think this could be implemented as a method applied to the xarray DatasetResample / DatasetGroupBy object returned by ds.resample / ds.groupby, e.g.:

# Proposed API

import act
import xarray as xr


ds = act.io.armfiles.read_netcdf(...)

ds.resample(time="30min").apply(
    act.qc.transform.NearestNeighbor(tolerance="15min")
)

ds.resample(time="30min").apply(
    act.qc.transform.Interpolate(method="linear")
)

ds.groupby("time.hour").apply(
    act.qc.transform.BinAverage(
        bad_threshold=0.5,
        indeterminate_threshold=0.8,
        add_transform_qc=False,  # maybe also a roll-up QC option like PCM (4 bits instead of 10+)
    ),
)

The transform functions/classes (NearestNeighbor, Interpolate, BinAverage) should take and return xarray Dataset objects. The input passed by the apply method contains all the points in the given bin and the output is expected to be a 0-coord Dataset with scalar values for each data variable (metadata included).
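A rough sketch of what a Dataset-in, Dataset-out transform could look like is shown here. Everything in it is an assumption made for illustration: the `BinAverage` class is not an existing ACT class, the `qc_<name>` naming is assumed, and the `bad_mask` / `indeterminate_mask` arguments assume a simple two-bit QC layout. It uses xarray's `map`, which the xarray documentation encourages over the older `apply` alias.

```python
import numpy as np
import pandas as pd
import xarray as xr


class BinAverage:
    """Hypothetical QC-aware bin average (illustration only, not an ACT class)."""

    def __init__(self, bad_threshold=0.5, indeterminate_threshold=0.8,
                 bad_mask=0b01, indeterminate_mask=0b10):
        self.bad_threshold = bad_threshold
        self.indeterminate_threshold = indeterminate_threshold
        self.bad_mask = bad_mask                      # assumed bit layout
        self.indeterminate_mask = indeterminate_mask  # assumed bit layout

    def __call__(self, ds: xr.Dataset) -> xr.Dataset:
        out = {}
        for name in ds.data_vars:
            if str(name).startswith("qc_"):
                continue  # QC variables are consumed, not averaged
            values = ds[name].values.astype(float)
            qc_name = f"qc_{name}"
            if qc_name in ds:
                qc = ds[qc_name].values
            else:
                qc = np.zeros(values.shape, dtype=int)
            bad = (qc & self.bad_mask) != 0
            indeterminate = (qc & self.indeterminate_mask) != 0
            too_flagged = (
                bad.mean() > self.bad_threshold
                or indeterminate.mean() > self.indeterminate_threshold
            )
            # Missing output if too many inputs are flagged or none are usable.
            result = np.nan if (too_flagged or bad.all()) else values[~bad].mean()
            out[name] = xr.DataArray(result, attrs=ds[name].attrs)
        return xr.Dataset(out, attrs=ds.attrs)


time = pd.date_range("2023-08-08", periods=6, freq="10min")
ds = xr.Dataset(
    {
        "temp": ("time", [1.0, 2.0, 100.0, 4.0, 5.0, 6.0]),
        "qc_temp": ("time", np.array([0, 0, 1, 0, 0, 0])),
    },
    coords={"time": time},
)

# Each 30-minute bin is passed to the callable as its own Dataset.
out = ds.resample(time="30min").map(BinAverage())
```

In this toy data the flagged value of 100.0 is excluded from the first bin, so the averages are computed from the remaining good points only.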

I'm totally open to any changes/feedback. This could probably use several iterations of revisions to make it easier for users. Let me know what you think!

@kenkehoe
Contributor

I wonder if applying the QC before the transformation would be the best method. Xarray will apply the same methods to QC data as to the other data. This results in integer dtypes being upcast to float, with values that are no longer whole numbers. We could encourage applying QC in the desired way before performing any transformations.
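A small standard-library illustration of the problem: averaging bit-packed QC integers yields a value that no longer maps to any combination of tests. The bitwise OR shown for comparison is just one possible way to combine flags, included as an assumption rather than anything settled in this thread.

```python
from functools import reduce
from operator import or_
from statistics import mean

# Three bit-packed QC values in one bin: no tests failed, bit 2 set, bit 4 set.
qc_flags = [0b0000, 0b0010, 0b1000]  # 0, 2, 8

naive = mean(qc_flags)               # fractional value, not a valid bit mask
combined = reduce(or_, qc_flags, 0)  # keeps every bit that was set in the bin
```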

@maxwelllevin
Contributor Author

@kenkehoe Yeah, I think we're more or less on the same page there.

QC masks should be applied before the transformation(s) to mask out bad values and drop the original QC variables so you don't wind up with nonsense averages or interpolations of bitpacked QC values. So far this is all within the realm of user-written code and maybe an example or two of how to do this using ACT.
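As a minimal sketch of that user-side step using plain xarray (not ACT's own filtering helpers): the variable names and the choice of bit 1 as the "bad" bit are assumptions for this toy dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2023-08-08", periods=6, freq="10min")
ds = xr.Dataset(
    {
        "temp": ("time", [1.0, 2.0, 100.0, 4.0, 5.0, 6.0]),
        "qc_temp": ("time", np.array([0, 0, 1, 0, 0, 0], dtype="int32")),
    },
    coords={"time": time},
)

BAD_MASK = 0b1  # assumption: bit 1 carries the 'Bad' assessment

# Boolean mask of samples whose QC value has the bad bit set.
bad = (ds["qc_temp"] & BAD_MASK) != 0

# Naive resample averages the flagged value along with the good ones.
naive = ds["temp"].resample(time="30min").mean()

# Masking first turns flagged samples into NaN, which mean() skips by default.
qc_aware = ds["temp"].where(~bad).resample(time="30min").mean()
```

Selecting only `ds["temp"]` also leaves the QC variable out of the result, so no averaged QC integers end up in the output.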

What I am suggesting is an extension of this which also provides output QC values that make sense given the transformation type, kind of like how ARM does it (examples of bitpacked transform QC below). For ARM, we assume that any QC check with a 'Bad' assessment should automatically get masked out, and if more than a certain threshold of points are bad for a given bin, then the output for that bin should also get QC'd as bad and masked out. There's similar logic for 'Indeterminate' assessments.

Detailed ARM QC

E.g., in the DOD interface for a particular variable you can add an ancillary variable transform -> bitpacked -> detail, and you wind up with these bits for a new QC variable:

qc_aps_total_N_conc(time):int
    long_name = Quality check results on variable: Aerosol number concentration from integrated size distribution, APS
    units = 1
    standard_name = quality_flag
    description = This variable contains bit-packed integer values, where each bit represents a QC test on the data. Non-zero bits indicate the QC condition given in the description for those bits; a value of 0 (no bits set) indicates the data has not failed any QC tests.
    flag_method = bit
    bit_1_description = QC_BAD:  Transformation could not finish, value set to missing_value.
    bit_1_assessment = Bad
    bit_1_comment = An example that will trip this bit is if all values are bad or outside range.
    bit_2_description = QC_INDETERMINATE:  Some, or all, of the input values used to create this output value had a QC assessment of Indeterminate.
    bit_2_assessment = Indeterminate
    bit_3_description = QC_INTERPOLATE:  Indicates a non-standard interpolation using points other than the two that bracket the target index was applied.
    bit_3_assessment = Indeterminate
    bit_3_comment = An example of why this may occur is if one or both of the nearest points was flagged as bad.  Applies only to interpolate transformation method.
    bit_4_description = QC_EXTRAPOLATE:  Indicates extrapolation is performed out from two points on the same side of the target index.
    bit_4_assessment = Indeterminate
    bit_4_comment = This occurs because the input grid does not span the output grid, or because all the points within range and on one side of the target were flagged as bad.  Applies only to the interpolate transformation method.
    bit_5_description = QC_NOT_USING_CLOSEST:  Nearest good point is not the nearest actual point.
    bit_5_assessment = Indeterminate
    bit_5_comment = Applies only to subsample transformation method.
    bit_6_description = QC_SOME_BAD_INPUTS:  Some, but not all, of the inputs in the averaging window were flagged as bad and excluded from the transform.
    bit_6_assessment = Indeterminate
    bit_6_comment = Applies only to the bin average transformation method.
    bit_7_description = QC_ZERO_WEIGHT:  The weights for all the input points to be averaged for this output bin were set to zero.
    bit_7_assessment = Indeterminate
    bit_7_comment = The output "average" value is set to zero, independent of the value of the input.  Applies only to bin average transformation method.
    bit_8_description = QC_OUTSIDE_RANGE:  No input samples exist in the transformation region, value set to missing_value.
    bit_8_assessment = Bad
    bit_8_comment = Nearest good bracketing points are farther away than the "range" transform parameter if transformation is done using the interpolate or subsample method, or "width" if a bin average transform is applied.  Test can also fail if more than half an input bin is extrapolated beyond the first or last point of the input grid.
    bit_9_description = QC_ALL_BAD_INPUTS:  All the input values in the transformation region are bad, value set to missing_value.
    bit_9_assessment = Bad
    bit_9_comment = The transformation could not be completed. Values in the output grid are set to -9999 and the QC_BAD bit is also set.
    bit_10_description = QC_BAD_STD:  Standard deviation over averaging interval is greater than limit set by transform parameter std_bad_max.
    bit_10_assessment = Bad
    bit_10_comment = Applies only to the bin average transformation method.
    bit_11_description = QC_INDETERMINATE_STD:  Standard deviation over averaging interval is greater than limit set by transform parameter std_ind_max.
    bit_11_assessment = Indeterminate
    bit_11_comment = Applies only to the bin average transformation method.
    bit_12_description = QC_BAD_GOODFRAC:  Fraction of good and indeterminate points over averaging interval are less than limit set by transform parameter goodfrac_bad_min.
    bit_12_assessment = Bad
    bit_12_comment = Applies only to the bin average transformation method.
    bit_13_description = QC_INDETERMINATE_GOODFRAC:  Fraction of good and indeterminate points over averaging interval is less than limit set by transform parameter goodfrac_ind_min.
    bit_13_assessment = Indeterminate
    bit_13_comment = Applies only to the bin average transformation method.
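To make the listing above concrete, here is a small decoder for the detailed transform QC. The bit names and assessments are copied from the metadata; the assumption that bit N corresponds to integer value 2**(N-1) and the `decode` helper itself are illustrative.

```python
# Bit number -> (name, assessment), taken from the metadata listing.
DETAILED_BITS = {
    1: ("QC_BAD", "Bad"),
    2: ("QC_INDETERMINATE", "Indeterminate"),
    3: ("QC_INTERPOLATE", "Indeterminate"),
    4: ("QC_EXTRAPOLATE", "Indeterminate"),
    5: ("QC_NOT_USING_CLOSEST", "Indeterminate"),
    6: ("QC_SOME_BAD_INPUTS", "Indeterminate"),
    7: ("QC_ZERO_WEIGHT", "Indeterminate"),
    8: ("QC_OUTSIDE_RANGE", "Bad"),
    9: ("QC_ALL_BAD_INPUTS", "Bad"),
    10: ("QC_BAD_STD", "Bad"),
    11: ("QC_INDETERMINATE_STD", "Indeterminate"),
    12: ("QC_BAD_GOODFRAC", "Bad"),
    13: ("QC_INDETERMINATE_GOODFRAC", "Indeterminate"),
}


def decode(qc_value: int) -> list:
    """Return the names of all bits set in a bit-packed QC value.

    Assumes bit N is stored as 2**(N-1); hypothetical helper.
    """
    return [
        name
        for bit, (name, _assessment) in DETAILED_BITS.items()
        if qc_value & (1 << (bit - 1))
    ]


# 257 = 1 + 256: bit 1 and bit 9 set together, as the bit_9 comment describes.
names = decode(257)
```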

Summary ARM QC

E.g., using the summary transform QC (transform -> bitpacked -> summary):

qc_aps_total_N_conc(time):int
    long_name = Quality check results on variable: Aerosol number concentration from integrated size distribution, APS
    units = 1
    standard_name = quality_flag
    description = This variable contains bit-packed integer values, where each bit represents a QC test on the data. Non-zero bits indicate the QC condition given in the description for those bits; a value of 0 (no bits set) indicates the data has not failed any QC tests.
    flag_method = bit
    bit_1_description = Transformation could not finish (all values bad or outside range, etc.), value set to missing_value.
    bit_1_assessment = Bad
    bit_2_description = Transformation resulted in an indeterminate outcome.
    bit_2_assessment = Indeterminate
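One way the detailed bits might be rolled up into this two-bit summary is to collapse them by assessment. This is a guess at the mapping for illustration; the actual ADI roll-up may differ, and `summarize` is a hypothetical helper. Bit N is again assumed to be stored as 2**(N-1).

```python
# Detailed bit numbers grouped by their assessment in the detailed listing.
BAD_BITS = (1, 8, 9, 10, 12)
INDETERMINATE_BITS = (2, 3, 4, 5, 6, 7, 11, 13)


def bits_to_mask(bits) -> int:
    """Build an integer mask from 1-based bit numbers."""
    mask = 0
    for bit in bits:
        mask |= 1 << (bit - 1)
    return mask


BAD_MASK = bits_to_mask(BAD_BITS)
INDETERMINATE_MASK = bits_to_mask(INDETERMINATE_BITS)


def summarize(detailed: int) -> int:
    """Collapse a detailed QC value into the two-bit summary (assumed mapping)."""
    summary = 0
    if detailed & BAD_MASK:
        summary |= 0b01  # summary bit 1: Bad
    if detailed & INDETERMINATE_MASK:
        summary |= 0b10  # summary bit 2: Indeterminate
    return summary


# Detailed bits 6 (some bad inputs) and 10 (bad std) set together.
example = summarize((1 << 5) | (1 << 9))
```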

Moving forward with this is probably opening a whole can of worms, as there are a bunch of things to scope out, but in general I think this could be a very nice feature for ACT and its users. I'd be open to collaboration here.

@kenkehoe
Contributor

@maxwelllevin I can see arguments for both sides of this discussion on how much to do for the user. ARM's ability to provide correct and meaningful QC is not 100%. There are plenty of cases where limits are not set correctly for a long period of time. This results in good data being labeled as Indeterminate and Indeterminate data being labeled as Bad. I think we need to provide code/examples for both cases to ensure users can get the results that work best with their analysis.

I think you are suggesting we create a method to convert b level data to s level data? I'm down with that. Should be pretty simple to implement. Plus it will show off ACT's superior technology.

@AdamTheisen
Collaborator

@kenkehoe @maxwelllevin thanks for the great discussion so far on this! I'd also like to get thoughts from @mgrover1 and @zssherman. I want to go back to the original request, which included the items below. I think we could easily take care of A with what Ken is thinking and develop an act.qc.prepare method. B is more complicated, and that is where the act.qc.transform method that @maxwelllevin proposed would be valuable.

A. data values QC'd as bad are excluded from consideration in the transformation

  • for averaging this is equivalent to marking these as NaN
  • for nearest neighbor (interpolation), the nearest non-bad point(s) are used

B. for averaging, if >threshold % of values in a bin are bad, then the output value is set to missing

  • also if >other threshold are indeterminate, output value is also set to missing
  • there are reasonable defaults set for each (I think these are 50% and 80%, respectively)
  • PCM/ADI goes a step further and lets you customize this for each variable, but I'm not sure if the functionality/complexity trade-off is worth it for ACT
  • optionally, an output summary QC variable is generated for each QC'd input variable

Overall, I see 3 needs here:

  1. Documentation to show how users can prepare their data before using any transformations
  2. A method to prepare the data prior to any transformation that cleans up the data based on the QC
  3. A method that transforms the data and adds value in the QC flagging as noted in 2.

Does that sound right?

@maxwelllevin
Contributor Author

@AdamTheisen yeah, that all sounds right to me. I think "3. A method that transforms the data and adds value in the QC flagging" is more challenging than it first appears. I'd definitely recommend discussing and narrowing the scope on that for the first pass.

@mgrover1
Collaborator

That sounds reasonable @AdamTheisen! I agree with @maxwelllevin about being clear about the scope.

@zssherman
Collaborator

@AdamTheisen Sounds reasonable as well to me!

AdamTheisen added the enhancement (New feature or request) label and removed the V2.0.0 label on Oct 11, 2023
@AdamTheisen
Collaborator

Note, we added an example in #734 on how users should filter with ACT before applying xarray transformations like resample.
