Migrate datatree.py module into xarray.core. #8789
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
xarray/core/datatree.py
Outdated
@@ -77,7 +73,7 @@
 # """


-T_Path = Union[str, NodePath]
+T_Path = str | NodePath
Looks like `mypy` and the Python 3.9 tests don't like this change. I can revert it if that sounds best?
Yes that syntax won't work on 3.9 so we can't use it for now.
I think you can add the pragma at the top? `from __future__ import annotations`. Maybe it's technically a pragma.
I think it was already there, so I'm guessing the custom type here is not covered by `__future__.annotations` in the same way that type hints are?
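A minimal sketch of why the import does not rescue the alias, assuming a stand-in `NodePath` class and a hypothetical `resolve` helper (neither is xarray's real code). `from __future__ import annotations` only defers the evaluation of annotations; a module-level assignment such as `T_Path = ...` is an ordinary expression evaluated at import time, so `str | NodePath` in that position still needs Python 3.10 or newer.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import Union


class NodePath(PurePosixPath):
    """Stand-in for the real NodePath; illustrative only."""


# An alias is a normal runtime expression, so it must use syntax that the
# oldest supported interpreter can evaluate. Union[...] works on Python 3.9.
T_Path = Union[str, NodePath]


# Hypothetical helper. Its annotations are stored as strings and are not
# evaluated at import time, so the `|` syntax is harmless here even on 3.9.
def resolve(path: str | NodePath) -> NodePath:
    return NodePath(path)


# Shows the stored, unevaluated annotation text.
print(resolve.__annotations__["path"])
```

The alias and the annotation look alike in the source, but only the annotation is covered by the deferred-evaluation import.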
Okay - that looks better but still not perfect.
I'm working on the `mypy` issues still present, including things like annotations in the test files. I see some other things that I will try to debug tomorrow.
that's what I get for guessing without looking.
Just two random things I forgot to mention earlier.
Thanks - I added these changes in: 2c5e54c
There are still some things to work out, but I wanted to update the PR with some changes, to get some feedback and guidance.
@@ -169,33 +166,32 @@ def update(self, other) -> NoReturn:
         )

     # FIXME https://github.com/python/mypy/issues/7328
-    @overload
-    def __getitem__(self, key: Mapping) -> Dataset:  # type: ignore[misc]
+    @overload  # type: ignore[override]
There's a few new `type: ignore[override]` annotations on `DatasetView` methods. @flamingbear and I spent a while today trying to disentangle things here.
Essentially, the issue comes down to `DatasetView` inheriting from `Dataset`, which uses `typing.Self` as a return type in a few places. This would mean `DatasetView` would have a return type of `DatasetView`. But, instead, these return types are explicitly a `Dataset`. (mypy has some examples mentioning that subclasses shouldn't have more generalised return types than their parents, which seems related - thanks @flamingbear for the reference)
I'm not sure that just ignoring the error is the best thing to do, but I don't think we had a better idea for an implementation. I think @TomNicholas has mentioned a future step of reworking the class inheritance, but I'm not sure if that would also cover this.
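A toy reproduction of the pattern described above, assuming simplified stand-in classes and a hypothetical `copy` method (this is not xarray's actual code). The parent promises `Self`, which for the subclass means `DatasetView`; the subclass deliberately widens that to `Dataset`, which is the kind of override a checker such as mypy is expected to flag.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # typing.Self needs Python 3.11+; importing it only for the type checker
    # keeps this sketch runnable on older interpreters.
    from typing import Self


class Dataset:
    """Toy stand-in for a mutable container."""

    def __init__(self, data: dict[str, int]) -> None:
        self.data = dict(data)

    def copy(self) -> Self:
        # For a subclass, Self means "the subclass itself".
        return type(self)(self.data)


class DatasetView(Dataset):
    """Toy stand-in for an immutable view."""

    # The parent's signature implies DatasetView.copy() -> DatasetView,
    # but the view intentionally hands back a plain, mutable Dataset.
    def copy(self) -> Dataset:  # type: ignore[override]
        return Dataset(self.data)


view_copy = DatasetView({"a": 1}).copy()
```

At runtime nothing goes wrong; the ignore comment only silences the static complaint about the widened return type.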
Hmm. So I guess I have violated the Liskov substitution principle here... Does that represent an actual possible failure mode? Like are there any places where "if you can use a `Dataset` you can use a `DatasetView`" fails? I suppose passing a `DatasetView` into any function that attempts to call `__setitem__` would violate this.
Are there alternative designs that would give us immutability without `DatasetView` being a subclass of `Dataset`? One alternative might be something like
class FrozenDataset:
    _dataset: Dataset

    def __init__(self, ds):
        self._dataset = ds

    def __getattr__(self, name):
        # Block mutating methods; forward everything else to the wrapped Dataset
        if name in ['__setitem__', ...]:
            raise AttributeError("Mutation of the DatasetView is not allowed, ...")
        else:
            return getattr(self._dataset, name)
but then that would not pass an `isinstance` check, i.e. `isinstance(dt.ds, Dataset)` would return `False`.
> @TomNicholas has mentioned a future step of reworking the class inheritance, but I'm not sure if that would also cover this.
I was just talking about having both `Dataset` and `DataTree` inherit from a common `DataMapping` class that holds variables. But I don't think that would cover this, as that `DataMapping` should also be returning `Self`.
This is much more like the `Frozen` class that xarray sometimes uses.
So looking at this in Owen's absence. It seems like there are two paths.
- Use a `FrozenDataset`-type thing, which fails the `isinstance` check?
- Override the mypy errors, on the understanding that `DatasetView` is a library-internal implementation and we know not to misuse it?
The first seems like the right thing to do, but I don't know what breaks with the `isinstance` check or why that would be necessary.
Is there some ABC that both `Dataset` and `DatasetView` could implement to pass the `isinstance` check, and is that a big change? (edit: I'm thinking no. Also, this may show that I'm not afraid to ask the dumb questions in public.)
I'm going to make it a wrapper, and `isinstance(dt.ds, Dataset)` will return `False`.
Coming back to this with things that have been tried (and sadly not succeeded to date):
- First up, I tried to implement the suggested `FrozenDataset` wrapper. There was an issue with using `__getattr__`, as it can't be used to intercept magic methods on Python classes.
- Next: Trying to implement a `Metaclass` to allow interception of magic methods (initially inspired by this snippet). This proved tricky to do (I didn't quite get it fully working) and felt very much like it was an overly complicated solution for the problem we were trying to solve.
- Next: Trying a mix-in to overwrite the affected methods. I put a time-box on this attempt as we want to unblock the migration. I did not get this working in the allocated time.
- Lastly: Conceding defeat and adding the `type ignore` statements to cover this.
Not ideal, but the `DatasetView` and `Datatree.ds` property both have usage over the last couple of years without significant issues. I opened #8855 to capture that we need to work on a better fix at a later date.
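The first obstacle in the list above can be reproduced in a few lines. This is a minimal sketch with a hypothetical `Forwarder` class wrapping a plain `dict` rather than a `Dataset`: explicit attribute access is routed through `__getattr__`, but implicit special-method lookup (subscripting, `len()`, operators) is performed on the type and never reaches it.

```python
class Forwarder:
    """Hypothetical wrapper that forwards attribute access to a wrapped object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the instance fails.
        return getattr(self._wrapped, name)


wrapped = Forwarder({"a": 1})

# Explicit attribute access goes through __getattr__ and works.
keys = list(wrapped.keys())

# Subscripting looks up __getitem__ on type(wrapped), bypassing __getattr__,
# so the wrapper is not subscriptable.
try:
    wrapped["a"]
    subscript_worked = True
except TypeError:
    subscript_worked = False
```

A wrapper built this way therefore has to define each special method it wants to support (or block) explicitly on the class, which is part of why the later attempts in the list grew complicated.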
@@ -636,31 +629,31 @@ def __array__(self, dtype=None):
             "invoking the `to_array()` method."
         )

-    def __repr__(self) -> str:
-        return formatting.datatree_repr(self)
+    def __repr__(self) -> str:  # type: ignore[override]
This is another annotation I wanted to call out. I think it makes sense here. `NamedNode.__repr__` has an optional `level` kwarg, which isn't an argument in `repr_datatree`. `repr_datatree` is using `RenderTree`, which does have a `maxLevel`, but I don't think that's quite the same thing. (If it is, though, we could edit `DataTree.__repr__` to match the signature of `NamedNode.__repr__`, and then pass it on through)
I don't think properly maps to the maxLevel

         if d:
             # Populate tree with children determined from data_objects mapping
             for path, data in d.items():
                 # Create and set new node
                 node_name = NodePath(path).name
-                if isinstance(data, cls):
+                if isinstance(data, DataTree):
So this was largely to make `mypy` happy, but I think it makes sense. Just using `cls`, I don't think `mypy` was realising that objects that were `DataTree` instances couldn't get to the `else` statement.
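A small sketch of the narrowing that the concrete-class check enables, using toy stand-in classes and a hypothetical `as_node` classmethod (not the real `from_dict`). The comments describe how a type checker is expected to narrow the union in each branch; with `isinstance(data, cls)` the `else` branch could still contain a `DataTree` as far as the checker can tell.

```python
from __future__ import annotations

from typing import Union


class Dataset:
    """Toy stand-in, not xarray's Dataset."""


class DataTree:
    """Toy stand-in, not xarray's DataTree."""

    def __init__(self, data: Dataset | None = None) -> None:
        self.data = data

    @classmethod
    def as_node(cls, data: Union[DataTree, Dataset, None]) -> DataTree:
        # Checking against the concrete class lets the checker narrow
        # `data` to DataTree in this branch ...
        if isinstance(data, DataTree):
            return data
        # ... and to `Dataset | None` here, which matches __init__.
        return cls(data=data)


existing = DataTree()
same_node = DataTree.as_node(existing)
new_node = DataTree.as_node(Dataset())
```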
@@ -1064,14 +1059,18 @@ def from_dict(

         # First create the root node
         root_data = d.pop("/", None)
-        obj = cls(name=name, data=root_data, parent=None, children=None)
+        if isinstance(root_data, DataTree):
The extra code here is a duplication of what is happening for the child nodes. `mypy` was unhappy with the types for the `root_data`, in case it was a `DataTree` instance.
         assert dt.name is None

     def test_bad_names(self):
         with pytest.raises(TypeError):
-            DataTree(name=5)
+            DataTree(name=5)  # type: ignore[arg-type]
I have added a few `mypy` annotations like this. I think they make sense, because the tests are explicitly checking what happens when you use the wrong type of argument.
Yes you always have to do this within tests that check for TypeErrors
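For readers unfamiliar with the pattern, here is a self-contained sketch using a hypothetical `Node` class and plain `try`/`except` instead of `pytest.raises`, so that it runs without third-party packages. The call is wrong on purpose, so the static checker has to be told to look away on exactly that line.

```python
from __future__ import annotations


class Node:
    """Hypothetical class that validates its name at runtime."""

    def __init__(self, name: str | None = None) -> None:
        if name is not None and not isinstance(name, str):
            raise TypeError("name must be a string or None")
        self.name = name


def test_bad_names() -> None:
    # The argument type is deliberately wrong: the test exists to check the
    # runtime error, so the static error is suppressed for this line only.
    try:
        Node(name=5)  # type: ignore[arg-type]
    except TypeError:
        return
    raise AssertionError("expected a TypeError")


test_bad_names()
```

Scoping the ignore to a specific error code keeps any other mistake on the same line visible to the checker.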
Cool beans. I didn't want people to think I'd just thrown in `type: ignore` statements in just to quieten down `mypy`. (I'd thought about it a bit, and then thrown in `type: ignore` statements to quieten down `mypy` 😉)
-        results = DataTree(name="results", data=data)
-        xrt.assert_identical(results[["temp", "p"]], data[["temp", "p"]])
+        results: DataTree = DataTree(name="results", data=data)
+        xrt.assert_identical(results[["temp", "p"]], data[["temp", "p"]])  # type: ignore[index]
I think this is reasonable - this and `test_getitem_dict_like_selection_access_to_dataset` are using `Iterable` types (a list and a dictionary), which currently would raise exceptions (and the tests are marked as `xfail`). Other options I considered included writing tests that would make sure the expected assertions were raised, but it feels like the actual desired behaviour is TBD, and that the best thing to do was to leave things as they were.
@@ -446,7 +440,7 @@ def ds(self) -> DatasetView:
         return DatasetView._from_node(self)
This is really a comment for L430:
`def ds(self) -> DatasetView:`
This makes sense (because of wanting an immutable version of the `Dataset`), but it's causing the bulk of the remaining `mypy` issues in the tests, where `Dataset` objects are being directly assigned to `DataTree.ds`. That said, I think I'm a little surprised this is causing an issue, given the signature below for the `@ds.setter`. I'd love some guidance on resolving those `mypy` errors!
I'm also surprised that causes a problem... Is this pattern valid typing in general for properties?
class A:
    ...

class B(A):
    ...

class C:
    @property
    def foo(self) -> B:
        ...

    @foo.setter
    def foo(self, value: A):
        ...
that's fine until you set C.foo with A.
class A:
    ...


class B(A):
    ...


class C:
    @property
    def foo(self) -> B:
        return B()

    @foo.setter
    def foo(self, value: A):
        pass


nope = C()
nope.foo = A()
output:
> mypy demo/demo7.py
demo/demo7.py:20: error: Incompatible types in assignment (expression has type "A", variable has type "B") [assignment]
And I think this is the open issue for fixing/ignoring this issue: python/mypy#3004
But it seems like it's not high priority since it's been open for 7 years.
-    def update(self, other: Dataset | Mapping[str, DataTree | DataArray]) -> None:
+    def update(
+        self, other: Dataset | Mapping[str, DataTree | DataArray | Variable]
+    ) -> None:
         """
         Update this node's children and / or variables.
This is a comment for L937 (can't add it directly)...
This is another `mypy` issue because the type of `k` can be a `Hashable` (from iterating through a `Dataset`) or a `str`, but `DataTree.name` needs a `str`.
There seem to be a couple of options:
- Cast `k` as a string.
- Check that `k` is a string, and raise an exception if it isn't.
Happy to do either (or something else). I couldn't think of an immediate issue with casting a `Hashable` to a string, but wanted to check (in case there might be some chance of a weird collision between e.g. `1` and `"1"`).
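The two options can be compared side by side. This is a sketch with hypothetical helper names (`name_by_coercion`, `name_by_check`), not code from the PR; it shows that the collision worry is real for coercion, while the explicit check turns the same input into an error.

```python
from collections.abc import Hashable


def name_by_coercion(key: Hashable) -> str:
    """Option 1: coerce any key to a string."""
    return str(key)


def name_by_check(key: Hashable) -> str:
    """Option 2: accept only strings, raise otherwise."""
    if not isinstance(key, str):
        raise TypeError(f"node names must be strings, got {type(key).__name__}")
    return key


# Coercion maps two distinct keys onto the same name.
collided = name_by_coercion(1) == name_by_coercion("1")

# The explicit check rejects the non-string key instead.
try:
    name_by_check(1)
    rejected = False
except TypeError:
    rejected = True
```

The check also narrows the type for a static checker, so the returned value is a `str` without needing a cast or an ignore comment.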
This is symptomatic of a more general issue that technically all the keys in `DataTree` should be `Hashable` to match what xarray "mostly" supports (see #8294 (comment)).
@headtr1ck do you have any thoughts on supporting `Hashable` for names of child nodes in `DataTree`?
I think we probably have to support `Hashable` at some point, otherwise any operation that combines a `DataTree` with a `Dataset` or `DataArray` will be a nightmare to type.
Okay thanks. In that case I defer to Owen on whether it would be easier to do that in this PR or a follow-up one.
I'm helping out while Owen is at a conference. I was going to see if it were easy to add this change to handle `Hashable` and quickly ran into the `PurePosixPath`'s `pathsegments` needing to be `PathLike`.
I think if we want to use `NodePath` for traversing and altering the tree, the paths are going to need to be coerced into strings.
`DataTree`'s `_name` can be changed to `Hashable`, but when it's calling `NamedNode`'s `__init__` I think that property setter will have to coerce to a string.
But I can't think of what kind of problems that is going to cause, beyond say someone using `(7,)` as a key/path as well as `"(7,)"` and having collisions.
I think we can handle that by checking collisions, but I'm not 100% sure yet. Am I missing something obvious?
I think I was missing something obvious (just making the `DataTree`'s `_name` `Hashable` is not going to help here, it will also get converted by the `NamedNode` pieces.)
This is going to be a question for tomorrow's meeting.
In that meeting we discussed how having Hashables in datatree is possibly more trouble than it is worth, because (a) you can't form a valid path out of a list of hashables (but a list of strings is always fine), and (b) names of groups can't be hashable-like in netCDF or definitely not in Zarr anyway, so there doesn't seem like much of a use case for hashable-named groups at least.
For now it was decided to see if we could `type: ignore` our way to having an initial implementation that does not support hashables in datatree (which, as we can explicitly forbid hashables at tree creation time, hopefully isn't a ridiculous idea). #8836 was made to track the intent to revisit this.
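Point (a) from the meeting can be seen directly with the standard library. A minimal sketch using `pathlib.PurePosixPath` (which the thread mentions as the basis for path segments), with made-up group names: string segments always join into a path, while a non-string hashable such as a tuple is rejected.

```python
from pathlib import PurePosixPath

# String segments always join into a valid path.
string_path = PurePosixPath("/", "group", "subgroup")

# A non-string hashable segment is rejected with a TypeError.
try:
    PurePosixPath("/", ("group", 7))
    tuple_accepted = True
except TypeError:
    tuple_accepted = False
```

Restricting names to strings at tree creation time means every node name can later be used as a path segment without any coercion step.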
I did not really follow all the discussions about this PR.
But for me, not accepting Hashables sounds reasonable (even though just the thought of a tree-like data structure sounds like this should be possible for any Hashables names. E.g. a node object with a Hashables name and a number of node children). But agreed, it makes things like `__getitem__` unnecessarily complicated.
Probably the best for now is some type ignores or casts.
In the future I anyway have plans to make Dataset a generic class in variable names (and dimension names for that matter). Then this problem can be solved by returning, e.g. `Dataset[str]`.
> I did not really follow all the discussions about this PR.
Yeah sorry - a lot of them happened over zoom (see the March 19th meeting notes here).
> sounds like this should be possible for any Hashables names
The problem is not the tree construction, it's serialization (because I can't guarantee being able to make a single valid unix path out of .join'ing all those hashables).
> Probably the best for now is some type ignores or casts.
Cool, that's what we've done.
> In the future I anyway have plans to make Dataset a generic class in variable names (and dimension names for that matter). Then this problem can be solved by returning, e.g. `Dataset[str]`.
That would be awesome!
xarray/core/datatree.py
Outdated
@@ -1449,7 +1448,7 @@ def merge_child_nodes(self, *paths, new_path: T_Path) -> DataTree:

     # TODO some kind of .collapse() or .flatten() method to merge a subtree

-    def as_array(self) -> DataArray:
+    def as_dataarray(self) -> DataArray:
Don't forget a quick update to `whats-new.rst` for this change.
Great catch.
94b17f9 to c45c56a
doc/whats-new.rst
Outdated
@@ -32,6 +32,10 @@ New Features
 Breaking changes
 ~~~~~~~~~~~~~~~~

+- ``Datatree``'s ``as_array`` renamed ``to_dataarray`` to align with ``Dataset``. (:pull:`8789`)
@TomNicholas should this have been kept out of breaking changes? mostly because it's not actually released? Wasn't sure where I should keep it. reference
Yeah we never had a need before for "breaking changes that aren't breaking yet but will be, but only relevant for previous users of another package" 😅
These datatree breaking changes really only need to be written down somewhere, even a GH issue, so that we can point to them all at once when it comes time to do the grand reveal.
#8807 and reverted. 😬
Accurately reflects the default value now.
for "breaking changes that aren't breaking yet but will be, but only relevant for previous users of another package"
We updated the name but not the function.
So this is where we are moving forward with the assumption that DataTree nodes are always named with a string. In this section of `update`, even though we know the key is a `str`, mypy refuses. I chose an explicit recast over mypy ignores; tell me why that's wrong?
@TomNicholas - I think this PR is at a point now where it can be reviewed again in earnest. 🤞
We should get this merged. Most of the changes are small and/or typing (I'm impressed you got mypy to pass!). I only have one substantive comment (about mixins).
from xarray.datatree_.datatree.common import TreeAttrAccessMixin
from xarray.datatree_.datatree.formatting import datatree_repr
from xarray.datatree_.datatree.formatting_html import (
    datatree_repr as datatree_repr_html,
)
from xarray.datatree_.datatree.mapping import (
    TreeIsomorphismError,
    check_isomorphic,
    map_over_subtree,
)
from xarray.datatree_.datatree.ops import (
    DataTreeArithmeticMixin,
    MappedDatasetMethodsMixin,
    MappedDataWithCoords,
)
from .render import RenderTree
from xarray.core.treenode import NamedNode, NodePath, Tree
from xarray.datatree_.datatree.render import RenderTree
Might it make sense to not actually import and use any of these `from xarray.datatree_.datatree` imports in this PR? For example the `DataTree` object should still pass 99% of its tests without inheriting from the `TreeAttrAccessMixin`. That way we are still being explicit about what has and has not been "merged and approved".
I think you are losing some of the testing if you do that. Right now we're still collecting and running all of the tests in datatree_. If you pull out all of the pieces that aren't migrated you'll lose that testing. I think we're explicit about what is merged and approved by what is not in datatree_ anymore. And none of this is visible to a user yet. Let's talk in today's meeting and you can convince me otherwise.
With @shoyer's blessing (in the meeting) I hereby merge this PR
* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...
This PR migrates the `datatree.py` module to `xarray/core/datatree.py`, as part of the on-going effort to merge `xarray-datatree` into `xarray` itself.
Most of the changes are import path changes, and type-hints, but there is one minor change to the methods available on the `DataTree` class: `to_array` has been converted to `to_dataarray`, to align with the method on `Dataset`. (See conversation here)
This PR was initially published as a draft here.
- `datatree/datatree.py` Track merging datatree into xarray #8572
- `whats-new.rst`
- `api.rst`