-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug when querying unnamed dataarray #5493
base: main
Are you sure you want to change the base?
Conversation
xarray/core/dataarray.py
Outdated
name = _THIS_ARRAY if self.name is None else self.name | ||
ds = self._to_dataset_whole(name=name, shallow_copy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why we don't just use _to_temp_dataset
/ _from_temp_dataset
regardless of the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_to_dataset_whole
is just the same thing as _to_temp_dataset
except it accepts the shallow_copy
argument, which is was already being used here.
Using _to_dataset_whole
with name=None
will fail (so I can't just pass self.name
because that will be None
for an unnamed dataarray). Using _to_dataset_whole
without passing a name fails in the same way.
Then trying to use _from_temp_dataset
fails with
def _from_temp_dataset(
self, dataset: Dataset, name: Union[Hashable, None, Default] = _default
) -> "DataArray":
> variable = dataset._variables.pop(_THIS_ARRAY)
E KeyError: <this-array>
so I think it will fail for named dataarrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the shallow_copy=True
as this is a temporary dataset (and it's the usual pattern for delegating to the Dataset
implementation). No need to use the name
parameter for _from_temp_dataset
, either, as it's being restored from self
.
Edit: actually, the failure for _from_temp_dataset
fails because of the name
parameter to _to_dataset_whole
. In total, I'd suggest to use the same pattern as in sel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to use the same pattern as in
sel
That doesn't work for query
because query
needs the name of the dataarray to be a variable on the temporary dataset, otherwise it can't evaluate queries involving the dataarray name (such as x="a > 5"
where da.name = 'a'
). The tests will fail with this error if you try that:
> raise UndefinedVariableError(key, is_local) from err
E pandas.core.computation.ops.UndefinedVariableError: name 'a' is not defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically in order for query to work you have to provide a name to construct the temporary dataset with, which rules out using _to_temp_dataset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed. However, if I had to choose between "self"
for unnamed and the name for named DataArray
objects or always "self"
I'd choose the latter because it's easier to use (and the implementation would be easier)? Not sure if we should try to support both (i.e. allow referencing by name and using "self"
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. What I just wrote supports both individually: You use
the name when it's named, and self when it isn't. But if it's named you
can't use self. I suppose you could support using self even when it's named
by examining the query and replacing parts of it before giving it to
ds.query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I added support for using 'self'
even when the dataarray has a name, by using .replace()
. It works, but makes the implementation more complicated again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to slightly simplify this would be to always rename all dataarray names to 'self'
before querying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break if there's a non-dimensional coordinate variable named self
won't it? I'm not sure there's good solution though, perhaps ...
:D
""" | ||
|
||
ds = self._to_dataset_whole(shallow_copy=True) | ||
ds = ds.query( | ||
if self.name is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line causes a mypy error
xarray/core/dataarray.py:4510: error: Incompatible types in assignment (expression has type "Hashable", variable has type "str") [assignment]
Found 1 error in 1 file (checked 142 source files)
There might be a slightly neater way to do this, but this works.
pre-commit run --all-files
whats-new.rst