Description
What is your issue?
When reading a netCDF dataset with `decode_cf` and `mask_and_scale` set to `True`, Xarray uses the `scale_factor` and `_FillValue`/`missing_value` attributes of each variable in the dataset to apply the proper masking and scaling. However, from what I can tell, it does not handle certain other common attributes when masking, in particular: `valid_max`, `valid_min`, and `valid_range`. I can't find any direct statement of this behavior in the Xarray documentation or by searching this repository, but I encountered the behavior myself and found a mention in the documentation for the xcube package (this relates to zarr rather than netCDF but is the only mention I could find).
It is nontrivial to handle this as a user, because you (rightfully) lose the `scale_factor` attribute on read when `mask_and_scale` is `True`. Since `valid_min`/`_max`/`_range` are stored in the same domain as the packed data if conventions are followed (i.e. unscaled if there is a `scale_factor`), it becomes complicated to use them for masking after the fact.
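
To illustrate what handling this by hand involves, here is a rough sketch that masks in the packed domain first and only then applies the scaling. It uses plain NumPy on an array plus an attribute dictionary; the function name `mask_then_scale` and the sample values are made up for illustration, and this is not how Xarray implements its decoding.

```python
import numpy as np


def mask_then_scale(packed, attrs):
    """Mask packed values outside the valid range, then unpack them.

    Hypothetical helper: `attrs` is a plain mapping holding the
    variable's netCDF attributes as stored on disk.
    """
    packed = np.asarray(packed)

    # Prefer valid_range; otherwise fall back to valid_min / valid_max.
    if "valid_range" in attrs:
        vmin, vmax = attrs["valid_range"]
    else:
        vmin = attrs.get("valid_min")
        vmax = attrs.get("valid_max")

    # Build the mask while the data are still packed, because the
    # valid_* attributes are expressed in the packed domain.
    invalid = np.zeros(packed.shape, dtype=bool)
    if "_FillValue" in attrs:
        invalid |= packed == attrs["_FillValue"]
    if vmin is not None:
        invalid |= packed < vmin
    if vmax is not None:
        invalid |= packed > vmax

    # Unpack, then replace the invalid entries with NaN.
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    unpacked = packed.astype("float64") * scale + offset
    return np.where(invalid, np.nan, unpacked)


# Made-up example data: one fill value, one value below valid_min,
# two valid values, one value above valid_max.
packed = np.array([-32768, -5, 0, 150, 300], dtype="int16")
attrs = {
    "_FillValue": -32768,
    "scale_factor": 0.1,
    "add_offset": 20.0,
    "valid_min": 0,
    "valid_max": 200,
}
result = mask_then_scale(packed, attrs)
```

With Xarray, the packed values and untouched attributes would come from opening the file with `mask_and_scale=False`, which means giving up the built-in decoding for that variable and redoing it manually.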
I can only find one discussion (#822) on whether these attributes should or should not be handled by Xarray. In that thread, it was brought up that 1) netCDF4-python doesn't handle this on their end, 2) this doesn't really matter from a technical standpoint anyway because Xarray uses its own logic for scaling, and 3) apparently, they are not directly part of the CF conventions, but rather the NUG convention.
However, netCDF4-python does mask values outside `valid_min`/`_max`/`_range` when opening a dataset (Unidata/netcdf4-python#670), so I feel it would be natural to do the same in Xarray, at least when `decode_cf` and `mask_and_scale` are both `True`. Additionally, according to the netCDF attribute conventions, "generic applications should treat values outside the valid range as missing". I'm not sure any of this was the case back in 2016 when this was last discussed.
I propose that `mask_and_scale` should (optionally?) mask values which are invalid according to these attributes. If there are reasons not to, then perhaps, at least, `valid_min`/`_max`/`_range` could be transformed by `scale_factor` and `add_offset` when scaling is applied to the rest of the dataset, so that users can easily create the relevant masks themselves.
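
As a sketch of the fallback option, the valid bounds could be carried into the unpacked domain with the same linear transform used for the data. The helper name `unpacked_valid_bounds` is hypothetical and the attribute values are invented; it only assumes the usual `unpacked = packed * scale_factor + add_offset` relationship.

```python
def unpacked_valid_bounds(attrs):
    """Return (lower, upper) valid bounds in the unpacked domain.

    Hypothetical helper: a bound is None when the corresponding
    attribute is absent.
    """
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)

    # Prefer valid_range; otherwise fall back to valid_min / valid_max.
    if "valid_range" in attrs:
        lo, hi = attrs["valid_range"]
    else:
        lo, hi = attrs.get("valid_min"), attrs.get("valid_max")

    # Apply the same linear transform that unpacks the data.
    bounds = [None if b is None else b * scale + offset for b in (lo, hi)]

    # A negative scale_factor reverses the ordering, so the packed
    # lower bound becomes the unpacked upper bound and vice versa.
    if scale < 0:
        bounds.reverse()
    return tuple(bounds)


# Made-up attributes for illustration.
bounds = unpacked_valid_bounds(
    {"scale_factor": 0.5, "add_offset": 10.0, "valid_min": 0, "valid_max": 200}
)
flipped = unpacked_valid_bounds(
    {"scale_factor": -0.5, "add_offset": 10.0, "valid_range": [0, 200]}
)
```

A user could then mask decoded data directly against these bounds, without needing access to the original packed values.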