Add support for netCDF4.EnumType #8147
According to the netCDF4 example here, to handle missing values in variables whose values are explained by an enum, the enum declaration should contain a mapping between the fill value and "Missing" or an equivalent value. But the netCDF C and Python libraries do not enforce this.
import netCDF4 as nc
ds = nc.Dataset("./toto.nc", "w")
my_enum = ds.createEnumType("u1","my_enum",{"a":0, "b":1})
ds.createDimension("time", 10)
my_var = ds.createVariable("my_var", my_enum, "time")
# no fill_value defined above
my_var[:].data
# my_var[:] is full of 255, which is the default fill value for unsigned byte (u1)
Related issue: Unidata/netcdf-c#982, in particular Unidata/netcdf-c#982 (comment). In this PR, this causes |
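The recommendation above (give the fill value its own enum member) can be sketched in plain Python, without touching netCDF4. This is only an illustration: the helper name `with_missing_member` and the member name "missing" are assumptions, not part of netCDF4 or xarray; 255 is the default u1 fill value mentioned above.

```python
def with_missing_member(enum_dict, fill_value, name="missing"):
    """Return a copy of enum_dict that also maps `name` to fill_value.

    Hypothetical helper for illustration; neither netCDF4 nor xarray provides it.
    """
    if fill_value in enum_dict.values():
        # The fill value already has a meaning, so there is nothing to add.
        return dict(enum_dict)
    if name in enum_dict:
        raise ValueError(f"{name!r} is already used for another value")
    return {**enum_dict, name: fill_value}


# Extend the enum from the example above so that 255 is an explicit member.
my_enum = with_missing_member({"a": 0, "b": 1}, fill_value=255)
print(my_enum)
```

The resulting dictionary could then be passed to `createEnumType` in place of the original one, so that unwritten elements map to a named member.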
Thanks @bzah for working on this! I’m probably not the most qualified to review this PR as I’m not familiar with the netCDF4 Enum data type, though.
This seems a bit too invasive to me for such an edge case. Propagating variable metadata ( Since this is specific to the netCDF4 format, a less invasive approach would be to implement a class
|
Hi @benbovy and thanks for the comment. I guessed it would be a bit spicy to add attributes to Variable, but I wanted to have a naive working approach to enable discussions. TBH, I don't really like using encoding and attrs. I can see that it is convenient to put everything into these two dictionaries, but this is not tidy enough for me. I like it when I can look at a class definition and know what can be there and what cannot. With attrs or encoding this is all hidden IMHO. I like the ExplicitlyIndexedNDArrayMixin idea. I will try to make something out of it. |
(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file. (2) would be a lot more involved. What kind of operations would you like to see take advantage of the enum dictionary? |
+1 on this. Don't want to push you too hard off (2) but (1) would have been my recommended approach.
I'd be interested to get more of your rationale on this. We've been discussing making |
+1 on that, too. I'm having this on the plate for h5netcdf anyway, so would be good to coordinate. |
Understood, thanks for the feedback everyone. I will then try implementing 1):
I should be able to find time for that this week. @jhamman, my thought was that attrs and encoding can be filled with basically anything (can they?) and it may be hard to keep track of what may be in there. Whereas having dedicated properties at class level makes the purpose of each attribute obvious. But maybe it's just my Java instinct that is tickling. |
Reuse instead of duplicating function.
Ok I have a simple working implementation for enums. I still have unit tests to fix and to add though. Basically, you can create datasets with fill_values outside the enum range and they are considered valid by HDF5.
import netCDF4 as nc
import xarray as xr
clouds_ds = nc.Dataset("clouds_ds__explicit_fill_value.nc", "w")
cloud_type = clouds_ds.createEnumType("u1","cloud_type", {"clear": 0, "cloudy":1})
clouds_ds.createDimension("time", size=10)
clouds_ds.createVariable(
"clouds",
cloud_type,
"time",
fill_value=255, # or None, same result
)
clouds_ds["clouds"][0] = 1
print(clouds_ds["clouds"][:].data)
# [out] [ 1 255 255 255 255 255 255 255 255 255 ]
clouds_ds.close()
netCDF4 lets you create a variable with fill_value outside the enum range but
xr_ds = xr.open_dataset("clouds_ds__enumed_fill_value.nc")
xr_ds.to_netcdf("xr_clouds-clouds_ds__enumed_fill_value.nc")
# --> throws an exception because xarray unmasks the values
# and tries to push 255 (fill_value) into the resulting netCDF file.
It's worth noting that data producers may be tempted to avoid specifying a missing value in the enum definition if they believe the variable will always be filled with something. But I believe this should be discouraged.
Possible workarounds
When reading a netCDF file with enums, if fill_value (either in attributes or from the mask) is not among the enum's possible values and there are missing values, then:
I don't like i. because simply opening a file and rewriting a copy would not produce identical content. Relevant discussion on netcdf-c: Unidata/netcdf-c#982
Do you have suggestions? |
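The condition described above (stored values, such as an implicit fill value, that have no enum member) can be detected with a few lines of plain Python. This is a sketch under assumptions: `values_outside_enum` is a hypothetical helper, and the data is a plain list standing in for the variable's raw values.

```python
def values_outside_enum(values, enum_dict):
    """Return the sorted stored values that have no enum member."""
    allowed = set(enum_dict.values())
    return sorted(set(values) - allowed)


cloud_type = {"clear": 0, "cloudy": 1}
data = [1, 255, 255, 255]  # 255 is the implicit u1 fill value

unmapped = values_outside_enum(data, cloud_type)
if unmapped:
    # A reader could warn, raise, or extend the enum at this point.
    print(f"values without an enum member: {unmapped}")
```

What to do once such values are found (warn, raise, or add a member) is exactly the open design question in this thread; the sketch only covers the detection step.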
@bzah Thanks for this first wip implementation. I'll try to review over the next days. |
Hi @bzah, yes indeed, the default netcdf fill_value issue is a tricky one. There is a general discussion in #2742 with quite some offspring issues. In general xarray has a relaxed view when it comes to reading non-standard/broken/mismatched (you name it) files. If it is readable, xarray should be able to import it. As netcdf4-python is able to read those files users will expect xarray to ingest it too. So I'd add another point to your above list:
As this affects only existing files
we might get away without too much hassle. For the overall approach we could think to create PROs:
CONs:
Also note this comment from @samain-eum cf-convention/discuss#238 (comment) where EUMETSAT is following a similar path. WDYT @bzah? |
Hi @kmuehlbauer, thanks for linking the fill_value issue, interesting read. I would be willing to try to open a PR to fix that (fetching the mask, getting the implicit fill_value and making it explicit in attrs) once I'm done with Enums. Regarding:
Looks better than raising an error on read, but it might be frustrating for users if they make all their modifications and get an error on calling As for:
I like the idea, it's simpler than mine and, from xarray's point of view, it looks elegant and makes it easy to be CF compliant. However, one enumType may be used in several variables, possibly in several groups. In my opinion, the biggest issue with flag_meanings and flag_values is how to synchronize them across variables when one changes: One solution could be to have these Note that my implementation has the same consistency problem as what you are suggesting. |
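The mapping between an enum dictionary and the CF flag_values / flag_meanings attributes discussed here can be sketched as follows. The function names are hypothetical; the only convention relied on is that CF stores flag_meanings as a single blank-separated string, so member names containing spaces would not round-trip.

```python
def enum_to_flags(enum_dict):
    """Convert {"name": value} into CF-style flag attributes."""
    items = sorted(enum_dict.items(), key=lambda kv: kv[1])
    return {
        "flag_values": [value for _, value in items],
        "flag_meanings": " ".join(name for name, _ in items),
    }


def flags_to_enum(attrs):
    """Rebuild the enum dictionary from CF-style flag attributes."""
    names = attrs["flag_meanings"].split()
    values = list(attrs["flag_values"])
    if len(names) != len(values):
        raise ValueError("flag_meanings and flag_values differ in length")
    return dict(zip(names, values))


cloud_type = {"clear": 0, "cloudy": 1, "missing": 255}
attrs = enum_to_flags(cloud_type)
print(attrs)
print(flags_to_enum(attrs) == cloud_type)
```

Because each variable would carry its own copy of these attributes, the synchronization concern raised above still applies: nothing in this sketch keeps several variables sharing one enum consistent.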
This has been tried before, but there was no conclusion on how to handle this, see #5680 (comment) and ff.
Warnings might be overlooked or disabled by users. They might help a bit, though. The users might be frustrated because their source data is broken, but an explicit error message raised by xarray describing the problem should help them to fix things before another write attempt. And, just to note that, without that the error would be thrown by netCDF4-python (as is now).
I'm interested in how to modify an existing enum. I could not find anything about that in the netCDF4-python docs.
This is one thing which might be a problem, but the backends can do this kind of discovery (e.g. for dimensions IIRC). I have doubts that the enum type can be updated without touching/rewriting all connected variables. The enum type is directly written to the hdf5 dataset (see h5dump), besides being declared as DATATYPE.
Looking at this, I'm also interested in how netcdf-c maps the declared DATATYPE to already existing DATASETs with that DATATYPE. And why did netcdf decide to have named enum types when, apparently, every DATASET has the enum type attached to itself? I'll do some experimenting myself, also for the implementation in h5netcdf. Another interesting note is that h5py adds a little metadata to its numpy dtype to mark it as an enum:
import h5py
dt = h5py.enum_dtype({"RED": 0, "GREEN": 1, "BLUE": 42}, basetype='i')
print(dt.type)
print(dt.metadata)
<class 'numpy.int32'>
{'enum': {'RED': 0, 'GREEN': 1, 'BLUE': 42}} |
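The same kind of dtype metadata can be attached with NumPy alone, which may help when experimenting without h5py installed. This is a minimal sketch using the `metadata` argument of `np.dtype`; the "enum" key simply mirrors the h5py output shown above and is not something NumPy itself interprets.

```python
import numpy as np

enum_dict = {"RED": 0, "GREEN": 1, "BLUE": 42}

# Attach the enum mapping as dtype metadata; the base type stays int32.
dt = np.dtype("i4", metadata={"enum": enum_dict})

print(dt.type)
print(dict(dt.metadata))

# A plain dtype carries no metadata at all.
plain = np.dtype("i4")
print(plain.metadata)
```

A backend could check `dt.metadata` for an "enum" entry to decide whether a variable should be written as an enum type. Metadata is not guaranteed to survive every NumPy operation, so it is probably safest to read it from the dtype that was stored explicitly rather than from derived arrays.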
I've added my suggestions and also removed the test file which slipped in. Needed to special case h5netcdf for now until upstream has added this feature (named enum types). To make it more explicit I had to add With the metadata trick this works also for I always have trouble with typing, so appreciate any help with this. Besides the typing this is ready. @bzah is this also good to go from your side? We might have to add some note to the io-docs. |
Many thanks @kmuehlbauer for these improvements. I will have a look at mypy issues. |
attributes = {k: var.getncattr(k) for k in var.ncattrs()}
data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
encoding: dict[str, Any] = {}
A TypedDict for encoding and its possible values would be cleaner.
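A rough sketch of what such a TypedDict could look like is given below. The class name `VariableEncoding` and the selection of keys are assumptions made for illustration; this is not a type defined by xarray, and the real set of encoding keys is larger and backend dependent.

```python
from typing import Any, TypedDict


class VariableEncoding(TypedDict, total=False):
    """Hypothetical typed view of a variable's encoding dictionary."""

    dtype: Any        # e.g. a numpy dtype, possibly carrying enum metadata
    _FillValue: Any   # value written for missing data
    zlib: bool        # whether compression is enabled
    complevel: int    # compression level


# total=False makes every key optional, so an empty dict is valid.
encoding: VariableEncoding = {}
encoding["dtype"] = "u1"
encoding["_FillValue"] = 255
print(encoding)
```

With this in place a type checker such as mypy would flag a misspelled or unknown key, which is the kind of discoverability asked for earlier in the thread, while the runtime object remains an ordinary dict.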
Beside the one doubled same
this is LGTM.
@dcherian I'm about to merge this, but I'm a bit unsure about the failing test. It looks like it's not related, but every once in a while all tests succeed. Is this a flaky thing and we can merge away? Thanks! |
Thanks @bzah for sticking with us and pushing this through! |
Many thanks @kmuehlbauer , @dcherian and others for making it possible ! |
Thanks @bzah and @kmuehlbauer . It'd be nice to follow up with some docs on how to create a new variable that gets encoded to Enum on write. |
@dcherian Definitely! I'm about to release EnumType feature in h5netcdf the next days. I'll open a PR with the necessary changes on the xarray side and will add documentation appropriately. Thanks for pointing that out! |
This pull request adds support for enums on the netcdf4 backend.
Enums were added to netCDF4-python in 1.2.0 (September 2015).
In the netCDF format, they are defined as types and can be used across the dataset to type variables on creation.
They are meant to be an alternative to flag_values and flag_meanings.
This pull request makes it possible for xarray to read existing enums in a file, convert them into flag_values/flag_meanings, and save them as enums when a special encoding flag is filled in.
TODO:
whats-new.rst