NetCDF valid_min/_max/_range do not mask datasets and do not get scaled #8359

Open
@claytharrison

What is your issue?

When reading a netCDF dataset with decode_cf and mask_and_scale set to True, Xarray uses the scale_factor, add_offset, and _FillValue/missing_value attributes of each variable in the dataset to apply the proper masking and scaling. However, from what I can tell, it does not handle certain other common attributes when masking, in particular valid_max, valid_min, and valid_range. I can't find any direct statement of this behavior in the Xarray documentation or by searching this repository, but I encountered the behavior myself and found a mention in the documentation for the xcube package (this relates to zarr rather than netCDF, but it is the only mention I could find).

It is nontrivial to handle this as a user, because you (rightfully) lose the scale_factor attribute on read when mask_and_scale is True. Since valid_min/_max/_range are stored in the same domain as the packed data if conventions are followed (i.e. unscaled if there is a scale_factor), it becomes complicated to use them for masking after the fact: the decoded values are in the scaled domain, while the limits are not.
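To make the mismatch concrete, here is a minimal NumPy-only sketch of the manual workaround, assuming the user still has access to scale_factor and add_offset (in Xarray these appear to be kept in the variable's `.encoding` after decoding, but that is worth checking for your version). The function names and the sample values are hypothetical and not part of any library.

```python
import numpy as np


def unpack_valid_limits(valid_min, valid_max, scale_factor=1.0, add_offset=0.0):
    """Convert packed-domain valid limits into the unpacked (scaled) domain."""
    lo = valid_min * scale_factor + add_offset
    hi = valid_max * scale_factor + add_offset
    # A negative scale_factor reverses the order of the limits.
    return min(lo, hi), max(lo, hi)


def mask_unpacked(values, valid_min, valid_max, scale_factor=1.0, add_offset=0.0):
    """Replace already-scaled values outside the valid limits with NaN."""
    lo, hi = unpack_valid_limits(valid_min, valid_max, scale_factor, add_offset)
    values = np.asarray(values, dtype=float)
    # Comparing in the scaled domain can be sensitive to float rounding
    # exactly at the limits; comparing packed integers avoids that.
    return np.where((values >= lo) & (values <= hi), values, np.nan)


# Hypothetical variable: int16 packed data with valid_min=0, valid_max=250.
packed = np.array([-5, 0, 100, 250, 300], dtype=np.int16)
scale_factor, add_offset = 0.5, 10.0

# This is what the user sees after mask_and_scale has been applied.
decoded = packed * scale_factor + add_offset

masked = mask_unpacked(decoded, 0, 250, scale_factor, add_offset)
```

Here the first and last elements fall outside the scaled limits (10.0 to 135.0) and become NaN, while the three values in between are kept.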

I can only find one discussion (#822) on whether these attributes should or should not be handled by Xarray. In that thread, it was brought up that 1) netCDF4-python doesn't handle this on its end, 2) this doesn't really matter from a technical standpoint anyway because Xarray uses its own logic for scaling, and 3) apparently, these attributes are not directly part of the CF conventions, but rather of the NUG (NetCDF Users Guide) conventions.

However, netCDF4-python does mask values outside valid_min/_max/_range when opening a dataset (Unidata/netcdf4-python#670), so I feel it would be natural to do the same in Xarray, at least when decode_cf and mask_and_scale are both True. Additionally, according to the netCDF attribute conventions, "generic applications should treat values outside the valid range as missing". I'm not sure any of this was the case back in 2016 when this was last discussed.

I propose that mask_and_scale should (optionally?) mask values which are invalid according to these attributes. If there are reasons not to, then perhaps, at least, valid_min/_max/_range could be transformed by scale_factor and add_offset when scaling is applied to the rest of the dataset, so that users can easily create the relevant masks themselves.
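As a rough illustration of the first option, the sketch below masks in the packed domain and only then applies scaling, so the valid limits never need to be transformed. This is plain NumPy operating on a dict of attributes, not Xarray's actual decoding code; the function name `decode_with_valid_range` and the sample attributes are assumptions made up for the example.

```python
import numpy as np


def decode_with_valid_range(packed, attrs):
    """Mask packed values using _FillValue and valid_* attributes, then scale.

    `attrs` is a plain dict of netCDF-style attributes (hypothetical input).
    """
    packed = np.asarray(packed)
    invalid = np.zeros(packed.shape, dtype=bool)

    if "_FillValue" in attrs:
        invalid |= packed == attrs["_FillValue"]

    # Use valid_range if present, otherwise fall back to valid_min/valid_max.
    if "valid_range" in attrs:
        vmin, vmax = attrs["valid_range"]
    else:
        vmin = attrs.get("valid_min")
        vmax = attrs.get("valid_max")

    # The limits are compared against the packed (unscaled) values.
    if vmin is not None:
        invalid |= packed < vmin
    if vmax is not None:
        invalid |= packed > vmax

    scaled = packed.astype(np.float64) * attrs.get("scale_factor", 1.0)
    scaled = scaled + attrs.get("add_offset", 0.0)
    scaled[invalid] = np.nan
    return scaled


# Hypothetical packed variable and attributes.
sample = np.array([-32768, -10, 0, 200, 500], dtype=np.int16)
sample_attrs = {
    "_FillValue": -32768,
    "valid_range": (0, 400),
    "scale_factor": 0.5,
    "add_offset": 2.0,
}
result = decode_with_valid_range(sample, sample_attrs)
```

In this example the fill value and the two out-of-range values end up as NaN, and only the packed values 0 and 200 are scaled. Whether such masking should be on by default or opt-in is exactly the open question raised above.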
