When creating a DataTree from a Dataset with path-like variable, subgroups are expected to be created #311
Comments
Note: this issue was generated from a notebook. You can use it to reproduce the bug locally.
Thank you for spotting this!
Yes, or a simpler solution would be just to raise an error if someone tries to pass a Dataset which has path-like variable names. We are then basically assuming that (a) any Dataset passed is supposed to represent a group, and (b) a single group cannot have path-like variable names. I think there is also an argument to be made with round-tripping. I think we want this property to hold always:
ds  # start with a dataset
dt = DataTree(ds)
new_ds = dt.to_dataset()  # expect new_ds == ds
If ds contains path-like variable names that are then automatically turned into different groups, this property won't hold. If we just forbid it, we can make sure it works in all other cases.
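The round-trip argument can be illustrated with a toy model. This is only a sketch under loud assumptions: a "dataset" is a plain dict of variable name to value, a "tree" is a dict of group path to dataset, and the functions `tree_from_dataset_splitting`, `tree_from_dataset_strict` and `root_to_dataset` are hypothetical names invented for this illustration, not part of datatree or xarray.

```python
# Toy model only: plain dicts stand in for Dataset and DataTree.
# None of these functions exist in datatree or xarray.

def tree_from_dataset_splitting(ds):
    """Hypothetical constructor that turns path-like names into groups."""
    tree = {"/": {}}
    for name, value in ds.items():
        group, _, short_name = name.rpartition("/")
        key = ("/" + group) if group else "/"
        tree.setdefault(key, {})[short_name] = value
    return tree


def tree_from_dataset_strict(ds):
    """Hypothetical constructor that rejects path-like names instead."""
    offending = [name for name in ds if "/" in name]
    if offending:
        raise ValueError(f"path-like variable names are not allowed: {offending}")
    return {"/": dict(ds)}


def root_to_dataset(tree):
    """Toy analogue of converting the root node back to a dataset."""
    return dict(tree["/"])


flat = {"a": 1, "b": 2}
pathy = {"group/subgroup/var": 3}

# Ordinary names round-trip under both constructors.
assert root_to_dataset(tree_from_dataset_strict(flat)) == flat
assert root_to_dataset(tree_from_dataset_splitting(flat)) == flat

# With automatic splitting, the variable moves into a subgroup,
# so the root converts back to an empty dataset and the round trip breaks.
roundtrip_ok = root_to_dataset(tree_from_dataset_splitting(pathy)) == pathy
```

In this toy model the strict constructor either round-trips or raises, which is the property argued for above; the splitting constructor silently returns something different from its input.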
Hello, thanks for your answer! Indeed, the round-trip capability seems the cleanest default behaviour. A Dataset is supposed to be a group. I think what I needed already exists: it's Line 1035 in 0afaa6c
I can have a look at fixing the bug (forbidding Datasets containing path-like variable names), and I can suggest an error message like this to help users who probably wanted this "auto-creating-group" behaviour but did not know about
(if you have an idea to shorten and improve it I am interested!)
Here is what I really wanted to do when creating the issue:
from pathlib import PurePosixPath
# Note: paths represent variable names; the mapping should only contain groups, hence `.parent`.
# Note 2: the named DataArray is renamed to keep only the name part of the path.
mapping_path_to_dataarray = {
    PurePosixPath(str(varname)).parent: xda.rename(PurePosixPath(str(varname)).name)
    for varname, xda in xds.items()
}
mapping_path_to_dataarray
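As a side note on the comprehension above, here is a small standard-library sketch of how `PurePosixPath` splits a name into a group path and a short variable name. The variable names are made up for illustration and stand in for the keys of `xds`. It also shows a caveat, which is an observation about Python dicts rather than something stated in the thread: keying a dict on `.parent` alone means two variables in the same group share a key, and only the last one survives.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical variable names, standing in for the keys of `xds`.
varnames = ["group/subgroup/a", "group/subgroup/b", "top"]

# How each name splits into (group path, short variable name).
splits = {
    name: (PurePosixPath(name).parent, PurePosixPath(name).name)
    for name in varnames
}

# Keying by parent alone keeps only the last variable of each group ...
by_parent_last_wins = {parent: short for parent, short in splits.values()}

# ... so collecting the short names per group avoids losing any of them.
by_parent = defaultdict(list)
for parent, short in splits.values():
    by_parent[parent].append(short)
```

With a single path-like variable, as in this issue, the collision cannot occur. Note also that a name without slashes has `PurePosixPath(".")` as its parent, which may need separate handling as the root group.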
Though this logic will work, it seems a little over-complicated to me.
xdt = dt.DataTree.from_dict(mapping_path_to_dataarray)
xdt if INTERACTIVE else print(xdt)
Note: Pylance is not happy with the use of a dict keyed by PurePosixPath. It only expects str keys.
Reconverting the DataTree gives an empty Dataset, as the root contains only groups:
reconverted = xdt.to_dataset()
reconverted if INTERACTIVE else print(reconverted)
Reconverting the subgroup containing the variable is successful:
reconverted = xdt["group/subgroup"].to_dataset()
reconverted if INTERACTIVE else print(reconverted)
Complement: After re-reading all of this, I think the most natural behaviour from a user perspective is the one of
xdt = dt.DataTree()
for varname, xda in xds.items():
    xdt[varname] = xda
The only issue I have is that this makes the for loop unavoidable, as it contains an assignment (imperative, not declarative).
The over-complicated solution...
mapping_path_to_dataarray = {
    PurePosixPath(str(varname)).parent: xda.rename(PurePosixPath(str(varname)).name)
    for varname, xda in xds.items()
}
...was only to get a dict comprehension rather than a for loop with assignments.
That looks good - I think long error messages are good if they are adding more context, like yours is. Only thing I would add is an explicit list of the offending variable names.
The Posix part is to forbid backslashes, only allowing forward slashes. Maybe we can forbid those at runtime whilst relaxing the type though.
Why? What do you dislike about the "one dataset per group" logic?
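The two suggestions in this comment (listing the offending variable names, and rejecting backslashes at runtime while accepting a looser key type) could look roughly like the sketch below. Everything here is hypothetical: the function names, the `PathLike` alias and the wording of the messages are invented for illustration and are not the error message proposed earlier in the thread nor the code that was eventually merged.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Union

# Hypothetical relaxed key type: plain strings are accepted as well.
PathLike = Union[str, PurePosixPath]


def find_path_like_names(names: Iterable[str]) -> List[str]:
    """Return the names that contain a forward slash."""
    return [name for name in names if "/" in str(name)]


def check_variable_names(names: Iterable[str]) -> None:
    """Raise with an explicit list of offending names, if there are any."""
    offending = find_path_like_names(names)
    if offending:
        raise ValueError(
            f"Variable names containing slashes are not allowed: {offending}. "
            "A Dataset is expected to represent a single group."
        )


def normalize_group_path(path: PathLike) -> PurePosixPath:
    """Accept str or PurePosixPath, but reject backslashes at runtime."""
    text = str(path)
    if "\\" in text:
        raise ValueError(f"Backslashes are not allowed in group paths: {text!r}")
    return PurePosixPath(text)
```

The check is done on the string form of the key, so the type annotation can stay permissive while the forward-slash-only rule is still enforced when the code runs.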
Okay for adding the offending variable names! I completely agree with the "one dataset per group" logic. This is more from a usage perspective, where I find these 3 lines to be the simplest way to convert a Dataset containing paths to a DataTree:
xdt = dt.DataTree()
for varname, xda in xds.items():
    xdt[varname] = xda
The user does not have to think that much: the assignment will take care of creating groups and renaming the DataArrays from the source Dataset accordingly. The only thing I quite dislike is having to write the for loop and assignments, without a declarative, comprehension-like way to achieve what these lines do that does not become verbose. But I must admit this is mainly about aesthetics rather than a true need, so I don't think anything more needs to be done on this topic!
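To make concrete what "the assignment will take care of creating groups and renaming" means, here is a toy stand-in written with the standard library only. `ToyNode` is a hypothetical class invented for this sketch; it is not the datatree implementation and only mimics the path-splitting behaviour described in the thread.

```python
from pathlib import PurePosixPath


class ToyNode:
    """Toy stand-in for a tree node; not the datatree implementation."""

    def __init__(self):
        self.children = {}   # group name -> ToyNode
        self.variables = {}  # short variable name -> value

    def __setitem__(self, key, value):
        path = PurePosixPath(str(key))
        node = self
        # Walk (and create) every intermediate group named in the path.
        for part in path.parent.parts:
            node = node.children.setdefault(part, ToyNode())
        # Store the value under the last path component only.
        node.variables[path.name] = value

    def __getitem__(self, key):
        path = PurePosixPath(str(key))
        node = self
        for part in path.parent.parts:
            node = node.children[part]
        if path.name in node.variables:
            return node.variables[path.name]
        return node.children[path.name]


# Hypothetical flat mapping with one path-like name and one plain name.
flat = {"group/subgroup/var": 1.0, "top": 2.0}
tree = ToyNode()
for varname, value in flat.items():
    tree[varname] = value
```

After the loop, the root holds only `top`, while `var` lives under `group/subgroup`, which mirrors the empty-root observation made earlier in the thread.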
I personally think this little loop is fine - it's clear, explicit and shouldn't be buggy (once we fix the bug you reported in this issue!). If many other people appear to say that they would like a special function for ingesting this type of non-compliant dataset then we could revisit the idea.
Closing in favour of pydata/xarray#9339
Test Data Initialization
A Dataset containing a single variable, with a name containing slashes, representing a path.
This flat Dataset, containing a path-like variable name, is expected to produce groups and subgroups
once injected into a DataTree.
Unfortunately, this does not happen. Instead, it produces a flat DataTree with a single variable
that has an illegal name (containing slashes).
This is not only cosmetic. Indeed, trying to access this malformed variable name will result in an error:
The expected behaviour would be the one obtained by using __setitem__:
Technical Hints
__setitem__ wraps the key into a NodePath:
datatree/datatree/datatree.py
Line 923 in 0afaa6c
Probably this section of the DataTree initialization logic would need to be adapted:
datatree/datatree/datatree.py
Line 408 in 0afaa6c