Best practice when the _Unsigned attribute is present in NetCDF files #1444

Closed
@deeplycloudy

Some (large) data providers are writing NetCDF-4 classic files but using an _Unsigned attribute to indicate that a signed integer type should be interpreted as unsigned.

Background: Unidata/netcdf4-python#656

From the background discussion above, it is my understanding that xarray does not honor the attribute because it’s not a part of the CF spec, is only mentioned as a proposed attribute in the NetCDF Best Practices, and because "xarray wants the Variable dtype to be the same as the dtype of the data returned."

Taking the above as a given, it is necessary for xarray users encountering such variables to do the following after reading the data:

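import numpy as np

scale_factor = data.encoding['scale_factor']
add_offset = data.encoding['add_offset']
# Recover the packed integers; rounding avoids truncation from float error.
# int16/uint16 are assumed here: use the signed/unsigned pair matching the file.
packed = np.round((data - add_offset) / scale_factor).data.astype('int16')
fixed = packed.astype('uint16').astype('float64') * scale_factor + add_offset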

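The un-scaling step can be skipped by turning off automatic masking and scaling (mask_and_scale=False in xarray.open_dataset), so that the packed integers are read as stored.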
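With masking and scaling turned off, the reinterpretation reduces to viewing the stored signed integers as unsigned and then applying the scale and offset by hand. A minimal sketch, assuming the raw values are already in a NumPy array; `unpack_unsigned` is a hypothetical helper name, not part of xarray or netCDF4-python, and fill values are not handled:

```python
import numpy as np


def unpack_unsigned(raw, scale_factor=1.0, add_offset=0.0):
    """Reinterpret signed packed integers as unsigned, then scale them.

    Hypothetical helper for illustration; _FillValue is ignored.
    """
    raw = np.asarray(raw)
    if raw.dtype.kind != "i":
        raise TypeError("expected a signed integer array, got %s" % raw.dtype)
    # Same width and byte order, unsigned kind (e.g. '<i2' -> '<u2').
    unsigned_dtype = np.dtype(raw.dtype.str.replace("i", "u"))
    # view() reinterprets the bits without copying or converting values.
    unsigned = raw.view(unsigned_dtype)
    return unsigned.astype("float64") * scale_factor + add_offset


packed = np.array([-1, 0, 127], dtype="int8")
values = unpack_unsigned(packed, scale_factor=2.0, add_offset=1.0)
```

When a file is opened with mask_and_scale=False, scale_factor and add_offset are found in the variable's attrs rather than in its encoding.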

In order to automate the above process while still being able to use the functionality of Dataset, one approach might be to automatically perform the above steps on some known list of variables, and then reassign those variables to the Dataset. The downside is the need to read all variables up front, which could be expensive when processing large datasets where not all variables are needed.
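The eager approach described above might look roughly like the following. This is a sketch under stated assumptions, not an xarray feature: `decode_unsigned` is a hypothetical name, the Dataset is assumed to have been opened with mask_and_scale=False so that _Unsigned, scale_factor and add_offset are still in attrs, and fill values are ignored. Note that `.values` loads each listed variable into memory, which is exactly the cost mentioned above.

```python
import numpy as np
import xarray as xr


def decode_unsigned(ds, names):
    """Return a new Dataset with the listed _Unsigned variables decoded.

    Hypothetical helper; assumes the Dataset was opened undecoded.
    """
    fixed = {}
    for name in names:
        var = ds[name]
        attrs = dict(var.attrs)
        flag = str(attrs.pop("_Unsigned", "false")).lower()
        if flag != "true" or var.dtype.kind != "i":
            continue  # leave variables without the attribute untouched
        unsigned_dtype = np.dtype("u%d" % var.dtype.itemsize)
        # Integer-to-integer casts wrap, so -1 becomes the maximum value.
        unsigned = var.values.astype(unsigned_dtype)  # reads the data
        scale = attrs.pop("scale_factor", 1.0)
        offset = attrs.pop("add_offset", 0.0)
        fixed[name] = xr.DataArray(
            unsigned.astype("float64") * scale + offset,
            dims=var.dims,
            coords=var.coords,
            attrs=attrs,
        )
    return ds.assign(fixed)


# In-memory stand-in for a file opened with mask_and_scale=False.
ds = xr.Dataset(
    {
        "rad": (
            ("x",),
            np.array([-1, 0, 1], dtype="int16"),
            {"_Unsigned": "true", "scale_factor": 0.5, "add_offset": 10.0},
        ),
        "other": (("x",), np.array([1, 2, 3], dtype="int16")),
    }
)
decoded = decode_unsigned(ds, ["rad", "other"])
```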

Is there another approach that would preserve lazy data loading, for instance by providing pre/post hooks for transformation functions at the __getitem__ stage? Is there something I could do to help document that as a best practice?
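To make the question concrete, a hook at the __getitem__ stage could be as small as a wrapper that converts only the slice being requested. The class below is purely illustrative: `LazyUnsigned` is a hypothetical name and not an existing xarray hook, and the wrapped source is assumed to be anything indexable that returns signed integers (a NumPy array here, but in principle an on-disk variable).

```python
import numpy as np


class LazyUnsigned:
    """Hypothetical wrapper that decodes _Unsigned data on indexing only."""

    def __init__(self, source, scale_factor=1.0, add_offset=0.0):
        self.source = source
        self.scale_factor = scale_factor
        self.add_offset = add_offset

    def __getitem__(self, key):
        # Only the requested slice is read from the underlying source.
        raw = np.asarray(self.source[key])
        if raw.dtype.kind != "i":
            raise TypeError("expected signed integers, got %s" % raw.dtype)
        unsigned = raw.astype(np.dtype("u%d" % raw.dtype.itemsize))
        return unsigned.astype("float64") * self.scale_factor + self.add_offset


stored = np.array([-2, -1, 0, 1], dtype="int16")
lazy = LazyUnsigned(stored, scale_factor=0.1, add_offset=5.0)
subset = lazy[1:3]  # conversion happens here, for two elements only
```

If dask is installed, opening the file with a chunks argument also defers ordinary xarray arithmetic until the values are computed, which may give the same effect without a custom wrapper.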
