-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add invert option to DataArray/Dataset.stack() #8332
base: main
Are you sure you want to change the base?
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @carschandler ! I left some comments. Interested what others think too.
for dim in dims: | ||
if dim is Ellipsis: | ||
raise ValueError( | ||
"Ellipsis syntax cannot be used when invert=True" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Ellipsis syntax cannot be used when invert=True" | |
"Ellipsis cannot be used when invert=True" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
invert : bool, default: False | ||
When `True`, all dimensions of the DataArray except for the sequence of | ||
dimensions specified in the `dimensions` parameter will be stacked. | ||
`dimensions` must have a length of 1 in this case, all dimensions listed | ||
must exist in the current DataArray. Neither the `**dimensions_kwargs` nor | ||
the ellipsis syntaxes may be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I empathize with the use case here. That said:
Low-ish confidence, but I'm not a great fan of a parameter which totally changes another parameter's meaning.
I'm not sure of a great way of doing it, though. Would adding this to .groupby
possibly make sense?? That's generally how we "invert" the dimensions. Or another param except_dimensions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that producing side-effects on other parameters is icky. I'm not sure how well it fits in groupby
though... I'm open to an except_dimensions
parameter... do you think it would need to be mutually exclusive with dimensions
? Also open to something like a stack_except
method, but don't know how sensitive the project is to introducing new methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(deleted a comment I submitted too early)
I think except_dimensions
could be a reasonable option. I don't love it, since this would be the only place in the library we use it.
Edit: this doesn't really work for group_by
, because the semantics are different where the indexes aren't unique, would need something like group_over
as suggested later. Am leaving the original comment here regardless
.groupby
could maybe work:
original from the docstring:
>>> stacked = arr.stack(newdim=("x", "y"))
>>> stacked
<xarray.DataArray (z: 2, newdim: 6)>
array([[ 0, 2, 4, 6, 8, 10],
[ 1, 3, 5, 7, 9, 11]])
Coordinates:
* z (z) <U5 'alpha' 'beta'
* newdim (newdim) object MultiIndex
* x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
* y (newdim) int64 0 1 2 0 1 2
Using groupby
:
>>> stacked = arr.groupby('z').stack(newdim=...)
>>> stacked
<xarray.DataArray (z: 2, newdim: 6)>
array([[ 0, 2, 4, 6, 8, 10],
[ 1, 3, 5, 7, 9, 11]])
Coordinates:
* z (z) <U5 'alpha' 'beta'
* newdim (newdim) object MultiIndex
* x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
* y (newdim) int64 0 1 2 0 1 2
It does kinda make sense, since the z
is the groupby key, and remains as a dimension like its general usage (i.e. try with .sum
if this isn't clear).
On reflection, I quite like it. But it's quite possible I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not bad... it's definitely not the first place I would look if I were trying to figure out how to do this because I typically think of using groupby
for arranging groups of data with matching coordinate values, and this concept stands separately from coordinates, but maybe that's just my misconception of what groupby
is. If we took it that route, I think we'd want to make sure to put a note in the DataArray
/Dataset
stack
documentation about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone else @pydata/xarray have views on the API? I think two broad options (along with a third of "do nothing, folks can write it themselves):
- The
groupby
option — potentially very nice, but also maybe too "out there" - The
exclude_dims
option — this would be the only method we have this (the others use.groupby
), so it's not really orthogonal
There's another which is to make xr.ALL_DIMS
into a class, and allow xr.ALL_DIMS.exclude('foo', 'bar')
. But introduces new functionality for possibly a fairly small use-case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @carschandler for the PR!
Maybe I'm too conservative, but do we really need some specific API for this case? Couldn't we just do
arr.stack(newdim=[d for d in arr.dims if d != "z"])
A potential source of confusion for Dataset.stack()
is the order of the stacked dimensions given as argument, which may be important in some cases especially if a multi-index is built from the stacked coordinates (it is by default). The order of the dimensions of a Dataset has no particular meaning so I think it is always preferable to pass them in a more explicit way to .stack
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course you could! I think you can always make this argument though (couldn't you just use np.zeros + 1
instead of np.ones
, np.zeros(a.shape)
instead of np.zeros_like
, and couldn't you just use a MultiIndex
in a DataFrame
or some numpy
arrays in a dict
to store multidimensional data?), and I’m in favor of adding conveniences when they don't obscure the rest of the code. If everyone agrees that this does indeed obscure the project, then I will gladly rest my case, but I think that the true value of a project like xarray
is in how the data can be quickly and simply indexed, so I figured this would be a nice convenience since all the other ways of achieving it are kind of ugly.
I do agree that the lack of order on a Dataset does pose a problem for setting the MultiIndex order... that being said, I think that there are still use cases where the order may not matter. Thanks for your feedback.
MultiIndex([('a', 0), | ||
('a', 1), | ||
('a', 2), | ||
('b', 0), | ||
('b', 1), | ||
('b', 2)], | ||
name='z') | ||
name='newdim') | ||
>>> inverted_stack = arr.stack({"newdim": "z"}, invert=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a clearer example would be exactly the same as https://github.com/pydata/xarray/pull/8332/files#diff-386f35dda14ca12d315f2eb71f0ab35d7fdc73419b5dc5175ea5720f3206903bR2793, but with the addition of invert=True
(or whatever construction we end up using).
(If copying the new output is annoying, we can use pytest-accept...)
>>> np.all(inverted_stack.indexes["newdim"] == stacked.indexes["newdim"]) | ||
True | ||
>>> np.all(inverted_stack == stacked) | ||
<xarray.DataArray ()> | ||
array(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are great tests, but possibly less good examples...
Yeah perhaps |
I added a note to my comment above caveating my Currently ranked vote:
|
This is a good reference. Thanks! |
Sorry @carschandler if you feel that my #8332 (comment) was dismissive, it wasn't intended to be so. I definitely don't want to discourage anyone making suggestions / contributions to improve Xarray, and I don't think that any suggestion would obscure the project! I just wanted to say that in this specific case the addition of an argument like |
@benbovy oh no not at all! Hope you didn't think I had taken offense after reading my response. I did genuinely appreciate your feedback and fully understand your sentiment. I understand your concerns about the |
Thanks @carschandler ! Very much appreciate your attitude and approach here! What do folks think about To do this, we would make I don't love it — it's not that much cleaner than |
@max-sixty I agree with all your points. Not as concise as it could be in If it is much work, then I would say it's not worth worrying about it and that list comprehension (or even |
One other thought is that when doing chained operations, list comprehensions can become cumbersome if what you want to stack is actually some indexed version of an array. Of course you could assign to another array intermediately, but once again this is clunkier and may not always create a view depending on your indexing. |
This brings in the option to stack all dimensions except for one or more dimensions listed. I find this very useful for quickly iterating over all the combinations of dimensions except for one (i.e. you have a number of input parameters that are parameterized and one time dimension, and you want to calculate some time response for all the combinations of these input parameters and store them in the time-row corresponding to the appropriate combination of inputs).
I played around with implementing
to_stacked_array()
forDataArray
, but this made less sense in the end since that method was really designed for Datasets.stack
all dimensions but one? #8278whats-new.rst
api.rst