Description
Some (large) data providers are writing NetCDF-4-extended files but using an _Unsigned
attribute to indicate that data stored with a signed type should be interpreted as unsigned bytes.
Background: Unidata/netcdf4-python#656
From the background discussion above, it is my understanding that xarray does not honor the attribute because it’s not a part of the CF spec and is only mentioned as a proposed attribute in the NetCDF Best Practices, and because "xarray wants the Variable dtype to be the same as the dtype of the data returned."
Taking the above as given, xarray users who encounter such variables have to do the following after reading the data:
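# dtype is the unsigned counterpart of the stored signed type, e.g. 'uint8' for int8 data
scale_factor = data.encoding['scale_factor']
add_offset = data.encoding['add_offset']
# Undo the scaling to recover the raw integers, then cast them to the unsigned type
unscale = ((data - add_offset)/scale_factor).data.astype(dtype).astype('float64')
# Reapply the scaling to the corrected values
fixed = unscale * scale_factor + add_offset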
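The un-scaling step can be skipped by turning off automatic masking and scaling when the file is opened (mask_and_scale=False in xarray.open_dataset), so that the raw integers are returned directly.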
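With masking and scaling turned off, the correction reduces to reinterpreting the stored bytes and applying the scaling once. Here is a minimal sketch using only NumPy, assuming the values arrive as a signed integer array; the function name to_unsigned_scaled and the sample values are hypothetical and not part of xarray or netCDF4-python:

```python
import numpy as np

def to_unsigned_scaled(raw, scale_factor=1.0, add_offset=0.0):
    """Reinterpret signed integers as unsigned, then apply the scaling."""
    raw = np.asarray(raw)
    # Pick the unsigned type with the same width, e.g. int8 -> uint8.
    unsigned_dtype = np.dtype(f"u{raw.dtype.itemsize}")
    # view() reinterprets the same bytes, so -1 stored as int8 reads as 255.
    return raw.view(unsigned_dtype) * scale_factor + add_offset

# Hypothetical raw values as they would be read with scaling disabled.
raw = np.array([-128, -1, 0, 127], dtype="int8")
print(to_unsigned_scaled(raw, scale_factor=0.5, add_offset=10.0))
```

Working on the raw integers avoids the float round trip through the already-scaled values.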
In order to automate the above process while still being able to use the functionality of Dataset, one approach might be to automatically perform the above steps on some known list of variables, and then reassign those variables to the Dataset. The downside is the need to read all variables up front, which could be expensive when processing large datasets where not all variables are needed.
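The eager approach could look roughly like the sketch below. It is not an xarray feature: fix_unsigned_variables and the variable name bt are hypothetical, and the sketch assumes the data were opened with mask_and_scale=False, so that scale_factor and add_offset are still in attrs rather than in encoding.

```python
import numpy as np
import xarray as xr

SCALING_KEYS = ("_Unsigned", "scale_factor", "add_offset")

def fix_unsigned_variables(ds, names):
    """Return a copy of ds with the named variables decoded as unsigned."""
    fixed = ds.copy()
    for name in names:
        var = ds[name]
        raw = var.values  # this loads the whole variable into memory
        unsigned = raw.view(np.dtype(f"u{raw.dtype.itemsize}"))
        values = (unsigned * var.attrs.get("scale_factor", 1.0)
                  + var.attrs.get("add_offset", 0.0))
        # Keep the descriptive attributes, drop the ones already applied.
        attrs = {k: v for k, v in var.attrs.items() if k not in SCALING_KEYS}
        fixed[name] = (var.dims, values, attrs)
    return fixed

# Hypothetical in-memory stand-in for a file read without decoding.
ds = xr.Dataset({"bt": (("x",), np.array([-1, 0, 127], dtype="int8"),
                        {"_Unsigned": "true", "scale_factor": 0.5,
                         "add_offset": 10.0, "units": "K"})})
print(fix_unsigned_variables(ds, ["bt"])["bt"].values)
```

The call to var.values is where the up-front read happens, which is the cost described above.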
Is there another approach that would preserve lazy data loading, for instance by providing pre/post hooks for transformation functions at the __getitem__ stage? Is there something I could do to help document that as a best practice?
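To make the question concrete, here is a rough illustration of what a transformation applied at the __getitem__ stage might look like. It is a plain Python wrapper, not an existing xarray hook; the class name LazyUnsigned is hypothetical, and the source is assumed to be any object that supports slicing and returns signed integers.

```python
import numpy as np

class LazyUnsigned:
    """Wrap an indexable signed-integer source; fix values only when indexed."""

    def __init__(self, source, scale_factor=1.0, add_offset=0.0):
        self.source = source  # e.g. an on-disk variable that supports slicing
        self.scale_factor = scale_factor
        self.add_offset = add_offset

    def __getitem__(self, key):
        # Only the requested slice is read from the underlying source.
        raw = np.asarray(self.source[key])
        unsigned = raw.view(np.dtype(f"u{raw.dtype.itemsize}"))
        return unsigned * self.scale_factor + self.add_offset

# Hypothetical stand-in for a variable read without masking and scaling.
source = np.array([-1, 0, 127], dtype="int8")
lazy = LazyUnsigned(source, scale_factor=0.5, add_offset=10.0)
print(lazy[0:2])
```

Nothing is read or converted until an index is requested, so variables that are never accessed cost nothing.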