-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add invert option to DataArray/Dataset.stack() #8332
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2731,6 +2731,7 @@ def stack( | |
dimensions: Mapping[Any, Sequence[Hashable]] | None = None, | ||
create_index: bool | None = True, | ||
index_cls: type[Index] = PandasMultiIndex, | ||
invert: bool = False, | ||
**dimensions_kwargs: Sequence[Hashable], | ||
) -> Self: | ||
""" | ||
|
@@ -2752,9 +2753,15 @@ def stack( | |
If False, don't create any index. | ||
If None, create a multi-index only if exactly one single (1-d) coordinate | ||
index is found for every dimension to stack. | ||
index_cls: class, optional | ||
index_cls : class, optional | ||
Can be used to pass a custom multi-index type. Must be an Xarray index that | ||
implements `.stack()`. By default, a pandas multi-index wrapper is used. | ||
invert : bool, default: False | ||
When `True`, all dimensions of the DataArray except for the sequence of | ||
dimensions specified in the `dimensions` parameter will be stacked. | ||
`dimensions` must have a length of 1 in this case, all dimensions listed | ||
must exist in the current DataArray. Neither the `**dimensions_kwargs` nor | ||
the ellipsis syntaxes may be used. | ||
**dimensions_kwargs | ||
The keyword arguments form of ``dimensions``. | ||
One of dimensions or dimensions_kwargs must be provided. | ||
|
@@ -2767,25 +2774,56 @@ def stack( | |
Examples | ||
-------- | ||
>>> arr = xr.DataArray( | ||
... np.arange(6).reshape(2, 3), | ||
... coords=[("x", ["a", "b"]), ("y", [0, 1, 2])], | ||
... np.arange(12).reshape(2, 3, 2), | ||
... coords=[("x", ["a", "b"]), ("y", [0, 1, 2]), ("z", ["alpha", "beta"])], | ||
... ) | ||
>>> arr | ||
<xarray.DataArray (x: 2, y: 3)> | ||
array([[0, 1, 2], | ||
[3, 4, 5]]) | ||
<xarray.DataArray (x: 2, y: 3, z: 2)> | ||
array([[[ 0, 1], | ||
[ 2, 3], | ||
[ 4, 5]], | ||
|
||
[[ 6, 7], | ||
[ 8, 9], | ||
[10, 11]]]) | ||
Coordinates: | ||
* x (x) <U1 'a' 'b' | ||
* y (y) int64 0 1 2 | ||
>>> stacked = arr.stack(z=("x", "y")) | ||
>>> stacked.indexes["z"] | ||
* z (z) <U5 'alpha' 'beta' | ||
>>> stacked = arr.stack(newdim=("x", "y")) | ||
>>> stacked | ||
<xarray.DataArray (z: 2, newdim: 6)> | ||
array([[ 0, 2, 4, 6, 8, 10], | ||
[ 1, 3, 5, 7, 9, 11]]) | ||
Coordinates: | ||
* z (z) <U5 'alpha' 'beta' | ||
* newdim (newdim) object MultiIndex | ||
* x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b' | ||
* y (newdim) int64 0 1 2 0 1 2 | ||
>>> stacked.indexes["newdim"] | ||
MultiIndex([('a', 0), | ||
('a', 1), | ||
('a', 2), | ||
('b', 0), | ||
('b', 1), | ||
('b', 2)], | ||
name='z') | ||
name='newdim') | ||
>>> inverted_stack = arr.stack({"newdim": "z"}, invert=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a clearer example would be exactly the same as https://github.com/pydata/xarray/pull/8332/files#diff-386f35dda14ca12d315f2eb71f0ab35d7fdc73419b5dc5175ea5720f3206903bR2793, but with the addition of (If copying the new output is annoying, we can use pytest-accept...) |
||
>>> inverted_stack | ||
<xarray.DataArray (z: 2, newdim: 6)> | ||
array([[ 0, 2, 4, 6, 8, 10], | ||
[ 1, 3, 5, 7, 9, 11]]) | ||
Coordinates: | ||
* z (z) <U5 'alpha' 'beta' | ||
* newdim (newdim) object MultiIndex | ||
* x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b' | ||
* y (newdim) int64 0 1 2 0 1 2 | ||
>>> np.all(inverted_stack.indexes["newdim"] == stacked.indexes["newdim"]) | ||
True | ||
>>> np.all(inverted_stack == stacked) | ||
<xarray.DataArray ()> | ||
array(True) | ||
Comment on lines
+2821
to
+2825
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think these are great tests, but possibly less good examples... |
||
|
||
|
||
See Also | ||
-------- | ||
|
@@ -2795,6 +2833,7 @@ def stack( | |
dimensions, | ||
create_index=create_index, | ||
index_cls=index_cls, | ||
invert=invert, | ||
**dimensions_kwargs, | ||
) | ||
return self._from_temp_dataset(ds) | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -5169,6 +5169,7 @@ def stack( | |||||
dimensions: Mapping[Any, Sequence[Hashable | ellipsis]] | None = None, | ||||||
create_index: bool | None = True, | ||||||
index_cls: type[Index] = PandasMultiIndex, | ||||||
invert: bool = False, | ||||||
**dimensions_kwargs: Sequence[Hashable | ellipsis], | ||||||
) -> Self: | ||||||
""" | ||||||
|
@@ -5192,9 +5193,15 @@ def stack( | |||||
- None. create a multi-index only if exactly one single (1-d) coordinate | ||||||
index is found for every dimension to stack. | ||||||
|
||||||
index_cls: Index-class, default: PandasMultiIndex | ||||||
index_cls : Index-class, default: PandasMultiIndex | ||||||
Can be used to pass a custom multi-index type (must be an Xarray index that | ||||||
implements `.stack()`). By default, a pandas multi-index wrapper is used. | ||||||
invert : bool, default: False | ||||||
When `True`, all dimensions of the DataArray except for the sequence of | ||||||
dimensions specified in the `dimensions` parameter will be stacked. | ||||||
`dimensions` must have a length of 1 in this case, all dimensions listed | ||||||
must exist in the current DataArray. Neither the `**dimensions_kwargs` nor | ||||||
the ellipsis syntaxes may be used. | ||||||
**dimensions_kwargs | ||||||
The keyword arguments form of ``dimensions``. | ||||||
One of dimensions or dimensions_kwargs must be provided. | ||||||
|
@@ -5210,8 +5217,33 @@ def stack( | |||||
""" | ||||||
dimensions = either_dict_or_kwargs(dimensions, dimensions_kwargs, "stack") | ||||||
result = self | ||||||
|
||||||
if invert: | ||||||
if len(dimensions) > 1: | ||||||
raise ValueError( | ||||||
"The dimensions argument must have length 1 when invert=True" | ||||||
max-sixty marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
) | ||||||
|
||||||
for new_dim, dims in dimensions.items(): | ||||||
# When inverting, all dims listed should be in the current | ||||||
# DataArray's dims | ||||||
for dim in dims: | ||||||
if dim is Ellipsis: | ||||||
raise ValueError( | ||||||
"Ellipsis syntax cannot be used when invert=True" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed |
||||||
) | ||||||
if dim not in self.dims: | ||||||
raise ValueError(f'Dimension "{dim}" does not exist') | ||||||
|
||||||
# Subtract specified dimensions from the DataArray's current | ||||||
# dimensions and stack the resulting dimensions | ||||||
dimensions = { | ||||||
new_dim: tuple(dim for dim in self.dims if dim not in dims) | ||||||
} | ||||||
|
||||||
for new_dim, dims in dimensions.items(): | ||||||
result = result._stack_once(dims, new_dim, index_cls, create_index) | ||||||
|
||||||
return result | ||||||
|
||||||
def to_stacked_array( | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I empathize with the use case here. That said:
Low-ish confidence, but I'm not a great fan of a parameter which totally changes another parameter's meaning.
I'm not sure of a great way of doing it, though. Would adding this to
.groupby
possibly make sense?? That's generally how we "invert" the dimensions. Or another paramexcept_dimensions
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that producing side-effects on other parameters is icky. I'm not sure how well it fits in
groupby
though... I'm open to anexcept_dimensions
parameter... do you think it would need to be mutually exclusive withdimensions
? Also open to something like astack_except
method, but don't know how sensitive the project is to introducing new methods.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(deleted a comment I submitted too early)
I think
except_dimensions
could be a reasonable option. I don't love it, since this would be the only place in the library we use it.Edit: this doesn't really work for
group_by
, because the semantics are different where the indexes aren't unique, would need something likegroup_over
as suggested later. Am leaving the original comment here regardless.groupby
could maybe work:original from the docstring:
Using
groupby
:It does kinda make sense, since the
z
is the groupby key, and remains as a dimension like its general usage (i.e. try with.sum
if this isn't clear).On reflection, I quite like it. But it's quite possible I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not bad... it's definitely not the first place I would look if I were trying to figure out how to do this because I typically think of using
groupby
for arranging groups of data with matching coordinate values, and this concept stands separately from coordinates, but maybe that's just my misconception of whatgroupby
is. If we took it that route, I think we'd want to make sure to put a note in theDataArray
/Dataset
stack
documentation about it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone else @pydata/xarray have views on the API? I think two broad options (along with a third of "do nothing, folks can write it themselves):
groupby
option — potentially very nice, but also maybe too "out there"exclude_dims
option — this would be the only method we have this (the others use.groupby
), so it's not really orthogonalThere's another which is to make
xr.ALL_DIMS
into a class, and allowxr.ALL_DIMS.exclude('foo', 'bar')
. But introduces new functionality for possibly a fairly small use-case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @carschandler for the PR!
Maybe I'm too conservative, but do we really need some specific API for this case? Couldn't we just do
A potential source of confusion for
Dataset.stack()
is the order of the stacked dimensions given as argument, which may be important in some cases especially if a multi-index is built from the stacked coordinates (it is by default). The order of the dimensions of a Dataset has no particular meaning so I think it is always preferable to pass them in a more explicit way to.stack
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course you could! I think you can always make this argument though (couldn't you just use
np.zeros + 1
instead ofnp.ones
,np.zeros(a.shape)
instead ofnp.zeros_like
, and couldn't you just use aMultiIndex
in aDataFrame
or somenumpy
arrays in adict
to store multidimensional data?), and I’m in favor of adding conveniences when they don't obscure the rest of the code. If everyone agrees that this does indeed obscure the project, then I will gladly rest my case, but I think that the true value of a project likexarray
is in how the data can be quickly and simply indexed, so I figured this would be a nice convenience since all the other ways of achieving it are kind of ugly.I do agree that the lack of order on a Dataset does pose a problem for setting the MultiIndex order... that being said, I think that there are still use cases where the order may not matter. Thanks for your feedback.