
NetCDF valid_min/_max/_range do not mask datasets and do not get scaled #8359

Open
claytharrison opened this issue Oct 23, 2023 · 2 comments


claytharrison commented Oct 23, 2023

What is your issue?

When reading a netCDF dataset with decode_cf and mask_and_scale set to True, Xarray uses the scale_factor/add_offset and _FillValue/missing_value attributes of each variable to apply the proper masking and scaling. However, from what I can tell, it does not handle certain other common attributes when masking, in particular valid_max, valid_min, and valid_range. I can't find any direct statement of this behavior in the Xarray documentation or by searching this repository, but I encountered it myself and found a mention in the documentation for the xcube package (which relates to zarr rather than netCDF, but is the only mention I could find).

It is nontrivial to handle this as a user, because the scale_factor attribute is (rightfully) removed from the variable's attributes on read when mask_and_scale is True. Since valid_min/_max/_range are stored in the same domain as the packed data when the conventions are followed (i.e. unscaled if there is a scale_factor), it becomes complicated to use them for masking after the fact.
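
To illustrate, here is a minimal sketch of the mismatch; the variable name and all attribute values below are made up:

```python
import numpy as np
import xarray as xr

# Packed variable with NUG/CF-style attributes (hypothetical values).
packed = xr.Dataset(
    {
        "t": (
            "x",
            np.array([-32768, 0, 10000, 30000], dtype="int16"),
            {
                "scale_factor": 0.01,
                "add_offset": 273.15,
                "_FillValue": np.int16(-32768),
                "valid_min": np.int16(-20000),
                "valid_max": np.int16(20000),
            },
        )
    }
)

decoded = xr.decode_cf(packed, mask_and_scale=True)
print(decoded["t"].values)  # unpacked floats, _FillValue masked to NaN
print(decoded["t"].attrs)   # valid_min/valid_max are still packed int16 values

# The packed value 30000 lies above valid_max but survives decoding, and
# comparing the unpacked floats against the still-packed threshold is
# meaningless (here it masks nothing at all):
decoded["t"].where(decoded["t"] <= decoded["t"].attrs["valid_max"])
```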

I can only find one discussion (#822) on whether these attributes should or should not be handled by Xarray. In that thread, it was brought up that 1) netCDF4-python doesn't handle this on their end, 2) this doesn't really matter from a technical standpoint anyway because Xarray uses its own logic for scaling, and 3) apparently, they are not directly part of the CF conventions, but rather the NUG convention.

However, netCDF4-python does mask values outside valid_min/_max/_range when opening a dataset (Unidata/netcdf4-python#670), so I feel it would be natural to do the same in Xarray, at least when decode_cf and mask_and_scale are both True. Additionally, according to the netCDF attribute conventions, "generic applications should treat values outside the valid range as missing". I'm not sure any of this was the case back in 2016 when this was last discussed.

I propose that mask_and_scale should (optionally?) mask values which are invalid according to these attributes. If there are reasons not to, then perhaps, at least, valid_min/_max/_range could be transformed by scale_factor and add_offset when scaling is applied to the rest of the dataset, so that users can easily create the relevant masks themselves.
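
For illustration, a minimal sketch of the kind of manual masking a user has to do today; the file name, variable name, and the presence of the relevant attributes are all hypothetical, and it relies on the packing parameters being available in .encoding:

```python
import xarray as xr

ds = xr.open_dataset("example.nc", mask_and_scale=True)  # hypothetical file
var = ds["t"]  # hypothetical variable

# After decoding, the packing parameters live in .encoding, not .attrs.
scale = var.encoding.get("scale_factor", 1.0)
offset = var.encoding.get("add_offset", 0.0)

# valid_min/valid_max are still in the packed domain, so unpack them first.
vmin = var.attrs["valid_min"] * scale + offset
vmax = var.attrs["valid_max"] * scale + offset

masked = var.where((var >= vmin) & (var <= vmax))
```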

claytharrison added the needs triage label on Oct 23, 2023

welcome bot commented Oct 23, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

dcherian added the topic-backends and topic-CF conventions labels and removed the needs triage label on Oct 23, 2023

kmuehlbauer commented Feb 6, 2024

@claytharrison Sorry for the massive delay here. This somehow slipped through the cracks.

Thanks for the detailed problem description.

For xarray I currently see only three solutions/workarounds for handling this type of packed data:

  1. scale_factor and/or add_offset are saved within the variable's encoding dict. Users can transform valid_min/valid_max/valid_range with those and create the appropriate masks themselves.
  2. Do 1. as part of decoding, i.e. scale/offset the valid_* attributes, and reverse the process on encoding.
  3. Add an actual_range attribute (if not already present) when decoding, derived from the valid_* attributes. actual_range should have the type intended for the unpacked data.

See this section of the CF Conventions for details: Missing data, valid and actual range of data.

The simplest solution, but the least user friendly, is 1. Solution 2 is too involved and error prone. Solution 3 would be less invasive and the most user friendly. There might be other solutions that I do not have on the list right now.

I'd favour solution 3, which conforms to the standard, is user friendly, and is relatively easy to handle in the encoding step.
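
A rough sketch of how 3. could look, written here as a plain helper operating on an already decoded variable rather than as actual decoding machinery (attribute handling is simplified):

```python
import numpy as np
import xarray as xr


def add_actual_range(var: xr.DataArray) -> xr.DataArray:
    """Derive an unpacked actual_range from packed valid_* attributes.

    Assumes scale_factor/add_offset were moved to var.encoding during
    decoding and that the valid_* attributes are in the packed domain.
    """
    if "actual_range" in var.attrs:
        return var

    scale = var.encoding.get("scale_factor", 1.0)
    offset = var.encoding.get("add_offset", 0.0)

    if "valid_range" in var.attrs:
        vmin, vmax = var.attrs["valid_range"]
    elif "valid_min" in var.attrs and "valid_max" in var.attrs:
        vmin, vmax = var.attrs["valid_min"], var.attrs["valid_max"]
    else:
        return var

    # actual_range should have the type intended for the unpacked data.
    var.attrs["actual_range"] = np.asarray([vmin, vmax]) * scale + offset
    return var
```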
