Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for netCDF4.EnumType #8147

Merged
merged 45 commits into from
Jan 17, 2024
Merged

Conversation

bzah
Copy link
Contributor

@bzah bzah commented Sep 5, 2023

This pull request add support for enums on netcdf4 backend.

Enum were added in netCDF4-python in 1.2.0 (September 2015).
In the netcdf format, they are defined as types and can be use across the dataset to type variable when on creation.
They are meant to be an alternative to flag_values, flag_meanings.

This pull request makes it possible for xarray to read existing enums in a file, convert them into flag_values/flag_meanings and save them as enums when an special encoding flag is filled in.

TODO:

  • Add implementation for other backends ? Will be added in follow-up PR

xarray/conventions.py Outdated Show resolved Hide resolved
@bzah
Copy link
Contributor Author

bzah commented Sep 6, 2023

According to netCDF4 example here, to deal with missing values in variables based having a meaning explained by an enum, the enum declaration should contain a mapping between fillvalue and "Missing" or equivalent value.

But the netCDF C and python libraries do not enforce this.
So we can end up with clunky variables that contains missing_values but which are invalid values according to the enum declaration. The example below illustrate this

import netCDF4 as nc

ds = nc.Dataset("./toto.nc", "w")
my_enum = ds.createEnumType("u1","my_enum",{"a":0, "b":1})
ds.createDimension("time", 10)
my_var = ds.createVariable("my_var", my_enum, "time") 
# no fill_value defined above
my_var[:].data 
# my_var[:] is full of 255, which is the default fillvalue for unsigne byte (u1) 

Related issue to Unidata/netcdf-c#982, in particular Unidata/netcdf-c#982 (comment)

In this PR, this causes to_netcdf to crash.

@bzah bzah changed the title WIP: naive implementation of enum WIP: Add enum support Sep 6, 2023
@benbovy
Copy link
Member

benbovy commented Sep 6, 2023

Thanks @bzah for working on this! I’m probably not the most qualified to review this PR as I’m not familiar with the netCDF4 Enum data type, though.

Validate with xarray team if it's ok to add attribute to Variable and DataArray

This seems a bit too invasive to me for such an edge case. Propagating variable metadata (attrs and encoding) in xarray operations is already complicated and is often source of bugs. Extending the xarray.Variable class with two additional attributes to track will make the situation worse.

Since this is specific to the netCDF4 format, a less invasive approach would be to implement a class EnumArray(VariableCoder) (see xarray/coding/variables.py) and plug it into xarray's builtin netcdf backend. I see two options for decoding an enum netcdf variable:

  1. return an xarray.Variable with integer (e.g., np.uint8) dtype and extract the enum_dict as an item of the variable attrs (or encoding?).
  2. wrap the netcdf4 variable as a custom, explicitly indexed (duck) array (see xarray/core/indexing.py for a few classes inheriting from ExplicitlyIndexedNDArrayMixin)

@bzah
Copy link
Contributor Author

bzah commented Sep 6, 2023

Hi @benbovy and thanks for the comment. I guessed it would be a bit spicy to add attributes to Variable, but I wanted to have a naive working approach to enable discussions.
Note that enums are also available in HDF5 so we could implement an enum handler for some other backends. I don't think enums are part of zarr specification though.

TBH, I don't really like using encoding and attrs. I can see that it is convenient to put everything into theses two dictionaries, but this is not tidy enough for me. I like when I can look at a class definition and know what can be there and what cannot. With attrs or encoding this is all hidden IMHO.

I like the ExplicitlyIndexedNDArrayMixin idea. I will try to make something out of it.

@dcherian
Copy link
Contributor

dcherian commented Sep 8, 2023

(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file.

(2) would be a lot more involved. What kind of operations would you like to see take advantage of the enum dictionary?

@jhamman
Copy link
Member

jhamman commented Sep 8, 2023

(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file.

+1 on this. Don't want to push you too hard off (2) but (1) would have been my recommended approach.

TBH, I don't really like using encoding and attrs.

I'd be interested to get more of your rational on this. We've been discussing making encoding a read-only field (outside of the backend constructors). For more control over the content of attrs, you may want to look at xarray-dataclasses or xarray-schema.

@kmuehlbauer
Copy link
Contributor

(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file.

+1 on that, too. I'm having this on the plate for h5netcdf anyway, so would be good to coordinate.

@bzah
Copy link
Contributor Author

bzah commented Sep 11, 2023

Understood, thanks for the feedback everyone. I will then try implementing 1):

return an xarray.Variable with integer (e.g., np.uint8) dtype and extract the enum_dict as an item of the variable attrs (or encoding?).

I should be able to find time for that this week.

@jhamman, my though was that attrs and encoding can be filled with basically anything (can they ?) and it may be hard to keep track of what may be in there. Whereas having dedicated properties at class level make it obvious what is the purpose of each attribute. But maybe it's just my java instinct that is tickling.
Thanks for sharing xarray-dataclasses and xarray-schema, this looks very interesting to me.

Abel Aoun added 3 commits September 12, 2023 15:21
Reuse instead of duplicating function.
Remove attempts to workaround the fill_value issues.
@bzah
Copy link
Contributor Author

bzah commented Sep 13, 2023

Ok I have a simple working implementation for enums. I still have unit tests to fix and to add though.
However, I know that it will not work for datasets that already have enums out there.
There is indeed a hole in the specification and implementation of HDF5 and netcdf-c that make it possible to create clunky datasets.

Basically, you can create datasets with fill_values outside the enum range and they are considered valid by HDF5.
For example when creating a dataset with:

    import netCDF4 as nc
    import xarray as xr

    clouds_ds = nc.Dataset("clouds_ds__explicit_fill_value.nc", "w")
    cloud_type = clouds_ds.createEnumType("u1","cloud_type", {"clear": 0, "cloudy":1})
    clouds_ds.createDimension("time", size=10)
    clouds_ds.createVariable(
                "clouds",
                cloud_type,
                "time",
                fill_value=255, # or None, same result
            )
    clouds_ds["clouds"][0] = 1
    print(clouds_ds["clouds"][:].data)
    # [out] [ 1 255 255 255 255 255 255 255 255 255 ]
    clouds_ds.close()

netCDF4 lets you create a variable with fill_value outside the enum range but

  • ncdump will crash if you do ncdump clouds_ds__explicit_fill_value.nc
  • And with the current PR's implementation xarray will fail to write it too:
xr_ds = xr.open_dataset("clouds_ds__enumed_fill_value.nc")
xr_ds.to_netcdf("xr_clouds-clouds_ds__enumed_fill_value.nc")
# --> throws an exception because xarray unmasks the values 
#       and try to push 255 (fill_value) in the resulting netCDF file.

It's worth noting that data producer may be tempted to avoid specifying a missing values in the enum definition if they believe it will always be filled with something. But I believe it should be discouraged.

Possible workarounds

When reading a netCDF with enums, if fill_value (either in attributes or from the mask) is not in the enum possible values and there are missing values, then:

    1. Add a dedicated enum key value pair for the fill_value. Like {"_Undefined": <fill_value>}. Thus modifying the enum declaration.
    1. Or, keep track of the mask of missing values and re-apply it at writing time (assuming it's still valid).
    1. Or, forbid opening the file as we consider it invalid, and perhaps have a flag to fallback to i.

I don;t like i. because we loose by simply opening a file and rewriting a copy, the content would not be identical.
ii. seems bad too because if the variable is modified by the user we can't know where to apply the mask.
iii. would be the best IMO, in particular it might force data producer that use xarray to think about missing_values when designing enums.

Relevant discussion on netcdf-c: Unidata/netcdf-c#982

Do you have suggestions ?

@kmuehlbauer
Copy link
Contributor

@bzah Thanks for this first wip implementation. I'll try to review over the next days.

@kmuehlbauer
Copy link
Contributor

Hi @bzah, yes indeed, the default netcdf fill_value issue is a tricky one. There is a general discussion in #2742 with quite some offspring issues.

In general xarray has a relaxed view when it comes to reading non-standard/broken/mismatched (you name it) files. If it is readable, xarray should be able to import it. As netcdf4-python is able to read those files users will expect xarray to ingest it too.

So I'd add another point to your above list:

  • iv. do nothing on read (just read), then raise on write with a meaningful error

As this affects only existing files

  • which have default_fill value activated and are not written completely
  • which set a _FillValue outside the enum-range

we might get away without too much hassle.

For the overall approach we could think to create flag_meanings and flag_values attributes on read (established CF convention) in conjunction with a new key enum in encoding (encoding["enum"]="my_enum_name"). On write, those attributes should be removed and the netCDF4.EnumType be created instead.

PROs:

  • users current workflows with flag_meanings/flag_values will automagically work for the new netCDF4.EnumType too
  • even if encoding is lost while processing, data will still be written out in a meaningful way (flag_meanings and flag_values)
  • can be meaningful serialized to other backends, which do not have a notion of enum
  • users could use the new netCDF4.EnumType as output just by setting encoding["enum"]="my_enum_name" for their existing variables (if they have flag_meanings and flag_values available)
  • no need to add dict to valid_types for the backend API
  • no need to invent new naming scheme (enum_name, enum_meaning)
  • backwards and roundtrip compatible (encoding["dtype"] isn't touched)

CONs:

  • I'm surely biased, but I can't immediately see any

Also note this comment from @samain-eum cf-convention/discuss#238 (comment) where EUMETSAT is following a similar path.

WDYT @bzah?

@bzah
Copy link
Contributor Author

bzah commented Sep 14, 2023

Hi @kmuehlbauer, thanks for linking the fill_value issue, interesting read. I would be willing to try to open a PR to fix that (fetching the mask, getting the implicit fill_value and making it explicit in attrs) once I'm done with Enums.

Regarding:

iv. do nothing on read (just read), then raise on write with a meaningful error

Looks better than raising an error on read, but might be frustrating for user if they do all their modifications and get an error on calling to_netcdf.
In particular if the opening, processing and saving is done by a library, then it would be the role of this library to reject or fix the file.
Maybe printing a warning at the opening of the file and having a flag has_valid_enum_fill_value would help ?

As for:

create flag_meanings and flag_values attributes on read [...] [and] encoding["enum"]="my_enum_name". On write, those attributes should be removed and the netCDF4.EnumType be created instead.

I like the idea, it's simpler than mine and from xarray point of view looks elegant and make it easy to be CF compliant.
In particular, we avoid some of inconsistencies described by @samain-eum because xarray users will not have access to the EnumType, so they can only modify flag_meanings and flag_values to update the enum declaration.

However, one enumType may be used in several variables, possibly in several groups. In my opinion, the biggest isuse with flag_meanings and flag_values is how to synchronize them across variables when one change:
Say I have an enum like: safety: {"safe": 0, "unsafe": 1}.
As this is a very generic enum, it can be used in several groups and variables. We would be tempted to declare it once in a parent group, just like you would do with a dimension.
However, if our xarray user want to add a level "very_unsafe": 2, it would be tedious to keep the consistency for all the flag_meanings and flag_values in other variables.
Even without groups, in the same Dataset we could have multiple DataArray which rely on the same enum and we would have the same consistency issue.
Whereas, the EnumType in netCDF4 declared once and modifying it would update every variable typed with it.

One solution could be to have these <enum_name>_flag_values and <enum_name>_flag_meanings at Dataset level but it's no longer CF compliant.

Note that my implementation has the same consistency problem as what you are suggesting.

@kmuehlbauer
Copy link
Contributor

Hi @kmuehlbauer, thanks for linking the fill_value issue, interesting read. I would be willing to try to open a PR to fix that (fetching the mask, getting the implicit fill_value and making it explicit in attrs) once I'm done with Enums.

This has been tried before, but there was no conclusion on how to handle this, see #5680 (comment) and ff.

Regarding:

iv. do nothing on read (just read), then raise on write with a meaningful error

Looks better than raising an error on read, but might be frustrating for user if they do all their modifications and get an error on calling to_netcdf. In particular if the opening, processing and saving is done by a library, then it would be the role of this library to reject or fix the file. Maybe printing a warning at the opening of the file and having a flag has_valid_enum_fill_value would help ?

Warnings might be overlooked or disabled by users. They might help a bit, though. The users might be frustrated because their source data is broken, but an explicit error message raised by xarray describing the problem should help them to fix things before another write attempt. And, just to note that, without that the error would be thrown by netCDF4-python (as is now).

As for:

create flag_meanings and flag_values attributes on read [...] [and] encoding["enum"]="my_enum_name". On write, those attributes should be removed and the netCDF4.EnumType be created instead.

I like the idea, it's simpler than mine and from xarray point of view looks elegant and make it easy to be CF compliant. In particular, we avoid some of inconsistencies described by @samain-eum because xarray users will not have access to the EnumType, so they can only modify flag_meanings and flag_values to update the enum declaration.

I'm interested in how to modify an existing enum. I could not find anything about that in the netCDF4-python docs.

However, one enumType may be used in several variables, possibly in several groups. In my opinion, the biggest isuse with flag_meanings and flag_values is how to synchronize them across variables when one change: Say I have an enum like: safety: {"safe": 0, "unsafe": 1}. As this is a very generic enum, it can be used in several groups and variables. We would be tempted to declare it once in a parent group, just like you would do with a dimension. However, if our xarray user want to add a level "very_unsafe": 2, it would be tedious to keep the consistency for all the flag_meanings and flag_values in other variables. Even without groups, in the same Dataset we could have multiple DataArray which rely on the same enum and we would have the same consistency issue. Whereas, the EnumType in netCDF4 declared once and modifying it would update every variable typed with it.

This is one thing which might be a problem, but the backends can do this kind of discovery (eg. for dimensions IIRC). I have doubt's that the enum type can be updated without touching/rewriting all connected variables. The enum type is directly written to the hdf5 dataset (see h5dump), beside being declared as DATATYPE.

DATATYPE "my_enum" H5T_ENUM {
      H5T_STD_I16LE;
      "a"                1;
      "b"                2;
   };
 DATASET "my_var" {
    DATATYPE  H5T_ENUM {
       H5T_STD_I16LE;
       "a"                1;
       "b"                2;
    }
  :
  :

Looking at this, I'm also interested how netcdf-c maps the declared DATATYPE to already existing DATASETs with that DATATYPE? And why netcdf decided to have named enum types as obviously every DATASET has the enum type attached to itself? I'll do some experimenting myself also for the implementation in h5netcdf.

Another interesting note is that h5py adds a little metadata to it's numpy dtype to mark it as enum:

import h5py
dt = h5py.enum_dtype({"RED": 0, "GREEN": 1, "BLUE": 42}, basetype='i')
print(dt.type)
print(dt.metadata)
<class 'numpy.int32'>
{'enum': {'RED': 0, 'GREEN': 1, 'BLUE': 42}}

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Jan 11, 2024

I've added my suggestions and also removed the test file which slipped in. Needed to special case h5netcdf for now until upstream has added this feature (named enum types).

To make it more explicit I had to add NativeEnumCoder to add the dtype metadata in the encoding step. We still can have dedicated CFEnumCoder later (flags* stuff).

With the metadata trick this works also for h5netcdf backend (with invalid_netcdf=True). It writes a transient enum to the dataset (as netCDF4 does) but does not create the named enum type in the file. This has to be implemented in h5netcdf.

I always have trouble with typing, so appreciate any help with this. Beside the typing this is ready. @bzah is this also good to go from your side? We might have to add some note to the io-docs.

@kmuehlbauer kmuehlbauer self-requested a review January 11, 2024 09:17
@bzah
Copy link
Contributor Author

bzah commented Jan 11, 2024

Many thanks @kmuehlbauer for these improvements. I will have a look at mypy issues.

attributes = {k: var.getncattr(k) for k in var.ncattrs()}
data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
encoding: dict[str, Any] = {}
Copy link
Contributor Author

@bzah bzah Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TypedDict for encoding and its possible values would be cleaner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS this is taken care of in in #8520. So if #8520 goes in first, we should change here.

@kmuehlbauer
Copy link
Contributor

@dcherian I think this is ready for final round of review. Can't get the one windows run to work, error seems unrelated. Thanks @bzah for fixing the typing.

@kmuehlbauer kmuehlbauer changed the title Add enum support Add support for netCDF4.EnumType Jan 11, 2024
Copy link
Contributor

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beside the one doubled same this is LGTM.

xarray/backends/netCDF4_.py Outdated Show resolved Hide resolved
@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label Jan 14, 2024
attributes = {k: var.getncattr(k) for k in var.ncattrs()}
data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
encoding: dict[str, Any] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS this is taken care of in in #8520. So if #8520 goes in first, we should change here.

@kmuehlbauer kmuehlbauer reopened this Jan 16, 2024
@kmuehlbauer
Copy link
Contributor

@dcherian I'm about to merge this, but I'm a bit unsure about the failing test. It looks like its not related, but every once in a while all tests succeed. Is this a flaky thing and we can merge away? Thanks!

@kmuehlbauer kmuehlbauer merged commit d20ba0d into pydata:main Jan 17, 2024
26 checks passed
@kmuehlbauer
Copy link
Contributor

Thanks @bzah for sticking with us and pushing this through!

@bzah bzah deleted the enh/add-enum-support branch January 17, 2024 08:09
@bzah
Copy link
Contributor Author

bzah commented Jan 17, 2024

Many thanks @kmuehlbauer , @dcherian and others for making it possible !

@dcherian
Copy link
Contributor

Thanks @bzah and @kmuehlbauer . It'd be nice to follow up with some docs on how to create a new variable that gets encoded to Enum on write.

@kmuehlbauer
Copy link
Contributor

@dcherian Definitely! I'm about to release EnumType feature in h5netcdf the next days. I'll open a PR with the necessary changes on the xarray side and will add documentation appropriately. Thanks for pointing that out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for netcdf4 enum
6 participants