Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add invert option to DataArray/Dataset.stack() #8332

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 48 additions & 9 deletions xarray/core/dataarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -2731,6 +2731,7 @@ def stack(
dimensions: Mapping[Any, Sequence[Hashable]] | None = None,
create_index: bool | None = True,
index_cls: type[Index] = PandasMultiIndex,
invert: bool = False,
**dimensions_kwargs: Sequence[Hashable],
) -> Self:
"""
Expand All @@ -2752,9 +2753,15 @@ def stack(
If False, don't create any index.
If None, create a multi-index only if exactly one single (1-d) coordinate
index is found for every dimension to stack.
index_cls: class, optional
index_cls : class, optional
Can be used to pass a custom multi-index type. Must be an Xarray index that
implements `.stack()`. By default, a pandas multi-index wrapper is used.
invert : bool, default: False
When `True`, all dimensions of the DataArray except for the sequence of
dimensions specified in the `dimensions` parameter will be stacked.
`dimensions` must have a length of 1 in this case, all dimensions listed
must exist in the current DataArray. Neither the `**dimensions_kwargs` nor
the ellipsis syntaxes may be used.
Comment on lines +2759 to +2764
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I empathize with the use case here. That said:

Low-ish confidence, but I'm not a great fan of a parameter which totally changes another parameter's meaning.

I'm not sure of a great way of doing it, though. Would adding this to .groupby possibly make sense?? That's generally how we "invert" the dimensions. Or another param except_dimensions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that producing side-effects on other parameters is icky. I'm not sure how well it fits in groupby though... I'm open to an except_dimensions parameter... do you think it would need to be mutually exclusive with dimensions? Also open to something like a stack_except method, but don't know how sensitive the project is to introducing new methods.

Copy link
Collaborator

@max-sixty max-sixty Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(deleted a comment I submitted too early)

I think except_dimensions could be a reasonable option. I don't love it, since this would be the only place in the library we use it.

Edit: this doesn't really work for group_by, because the semantics are different where the indexes aren't unique, would need something like group_over as suggested later. Am leaving the original comment here regardless


.groupby could maybe work:

original from the docstring:

        >>> stacked = arr.stack(newdim=("x", "y"))
        >>> stacked
        <xarray.DataArray (z: 2, newdim: 6)>
        array([[ 0,  2,  4,  6,  8, 10],
               [ 1,  3,  5,  7,  9, 11]])
        Coordinates:
          * z        (z) <U5 'alpha' 'beta'
          * newdim   (newdim) object MultiIndex
          * x        (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
          * y        (newdim) int64 0 1 2 0 1 2

Using groupby:

        >>> stacked = arr.groupby('z').stack(newdim=...)
        >>> stacked
        <xarray.DataArray (z: 2, newdim: 6)>
        array([[ 0,  2,  4,  6,  8, 10],
               [ 1,  3,  5,  7,  9, 11]])
        Coordinates:
          * z        (z) <U5 'alpha' 'beta'
          * newdim   (newdim) object MultiIndex
          * x        (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
          * y        (newdim) int64 0 1 2 0 1 2

It does kinda make sense, since the z is the groupby key, and remains as a dimension like its general usage (i.e. try with .sum if this isn't clear).

On reflection, I quite like it. But it's quite possible I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not bad... it's definitely not the first place I would look if I were trying to figure out how to do this because I typically think of using groupby for arranging groups of data with matching coordinate values, and this concept stands separately from coordinates, but maybe that's just my misconception of what groupby is. If we took it that route, I think we'd want to make sure to put a note in the DataArray/Dataset stack documentation about it.

Copy link
Collaborator

@max-sixty max-sixty Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone else @pydata/xarray have views on the API? I think two broad options (along with a third of "do nothing, folks can write it themselves):

  • The groupby option — potentially very nice, but also maybe too "out there"
  • The exclude_dims option — this would be the only method we have this (the others use .groupby), so it's not really orthogonal

There's another which is to make xr.ALL_DIMS into a class, and allow xr.ALL_DIMS.exclude('foo', 'bar'). But introduces new functionality for possibly a fairly small use-case.

Copy link
Member

@benbovy benbovy Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @carschandler for the PR!

Maybe I'm too conservative, but do we really need some specific API for this case? Couldn't we just do

arr.stack(newdim=[d for d in arr.dims if d != "z"])

A potential source of confusion for Dataset.stack() is the order of the stacked dimensions given as argument, which may be important in some cases especially if a multi-index is built from the stacked coordinates (it is by default). The order of the dimensions of a Dataset has no particular meaning so I think it is always preferable to pass them in a more explicit way to .stack.

Copy link
Contributor Author

@carschandler carschandler Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course you could! I think you can always make this argument though (couldn't you just use np.zeros + 1 instead of np.ones, np.zeros(a.shape) instead of np.zeros_like, and couldn't you just use a MultiIndex in a DataFrame or some numpy arrays in a dict to store multidimensional data?), and I’m in favor of adding conveniences when they don't obscure the rest of the code. If everyone agrees that this does indeed obscure the project, then I will gladly rest my case, but I think that the true value of a project like xarray is in how the data can be quickly and simply indexed, so I figured this would be a nice convenience since all the other ways of achieving it are kind of ugly.

I do agree that the lack of order on a Dataset does pose a problem for setting the MultiIndex order... that being said, I think that there are still use cases where the order may not matter. Thanks for your feedback.

**dimensions_kwargs
The keyword arguments form of ``dimensions``.
One of dimensions or dimensions_kwargs must be provided.
Expand All @@ -2767,25 +2774,56 @@ def stack(
Examples
--------
>>> arr = xr.DataArray(
... np.arange(6).reshape(2, 3),
... coords=[("x", ["a", "b"]), ("y", [0, 1, 2])],
... np.arange(12).reshape(2, 3, 2),
... coords=[("x", ["a", "b"]), ("y", [0, 1, 2]), ("z", ["alpha", "beta"])],
... )
>>> arr
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
[3, 4, 5]])
<xarray.DataArray (x: 2, y: 3, z: 2)>
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],

[[ 6, 7],
[ 8, 9],
[10, 11]]])
Coordinates:
* x (x) <U1 'a' 'b'
* y (y) int64 0 1 2
>>> stacked = arr.stack(z=("x", "y"))
>>> stacked.indexes["z"]
* z (z) <U5 'alpha' 'beta'
>>> stacked = arr.stack(newdim=("x", "y"))
>>> stacked
<xarray.DataArray (z: 2, newdim: 6)>
array([[ 0, 2, 4, 6, 8, 10],
[ 1, 3, 5, 7, 9, 11]])
Coordinates:
* z (z) <U5 'alpha' 'beta'
* newdim (newdim) object MultiIndex
* x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
* y (newdim) int64 0 1 2 0 1 2
>>> stacked.indexes["newdim"]
MultiIndex([('a', 0),
('a', 1),
('a', 2),
('b', 0),
('b', 1),
('b', 2)],
name='z')
name='newdim')
>>> inverted_stack = arr.stack({"newdim": "z"}, invert=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a clearer example would be exactly the same as https://github.com/pydata/xarray/pull/8332/files#diff-386f35dda14ca12d315f2eb71f0ab35d7fdc73419b5dc5175ea5720f3206903bR2793, but with the addition of invert=True (or whatever construction we end up using).

(If copying the new output is annoying, we can use pytest-accept...)

>>> inverted_stack
<xarray.DataArray (z: 2, newdim: 6)>
array([[ 0, 2, 4, 6, 8, 10],
[ 1, 3, 5, 7, 9, 11]])
Coordinates:
* z (z) <U5 'alpha' 'beta'
* newdim (newdim) object MultiIndex
* x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b'
* y (newdim) int64 0 1 2 0 1 2
>>> np.all(inverted_stack.indexes["newdim"] == stacked.indexes["newdim"])
True
>>> np.all(inverted_stack == stacked)
<xarray.DataArray ()>
array(True)
Comment on lines +2821 to +2825
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are great tests, but possibly less good examples...



See Also
--------
Expand All @@ -2795,6 +2833,7 @@ def stack(
dimensions,
create_index=create_index,
index_cls=index_cls,
invert=invert,
**dimensions_kwargs,
)
return self._from_temp_dataset(ds)
Expand Down
34 changes: 33 additions & 1 deletion xarray/core/dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -5169,6 +5169,7 @@ def stack(
dimensions: Mapping[Any, Sequence[Hashable | ellipsis]] | None = None,
create_index: bool | None = True,
index_cls: type[Index] = PandasMultiIndex,
invert: bool = False,
**dimensions_kwargs: Sequence[Hashable | ellipsis],
) -> Self:
"""
Expand All @@ -5192,9 +5193,15 @@ def stack(
- None. create a multi-index only if exactly one single (1-d) coordinate
index is found for every dimension to stack.

index_cls: Index-class, default: PandasMultiIndex
index_cls : Index-class, default: PandasMultiIndex
Can be used to pass a custom multi-index type (must be an Xarray index that
implements `.stack()`). By default, a pandas multi-index wrapper is used.
invert : bool, default: False
When `True`, all dimensions of the DataArray except for the sequence of
dimensions specified in the `dimensions` parameter will be stacked.
`dimensions` must have a length of 1 in this case, all dimensions listed
must exist in the current DataArray. Neither the `**dimensions_kwargs` nor
the ellipsis syntaxes may be used.
**dimensions_kwargs
The keyword arguments form of ``dimensions``.
One of dimensions or dimensions_kwargs must be provided.
Expand All @@ -5210,8 +5217,33 @@ def stack(
"""
dimensions = either_dict_or_kwargs(dimensions, dimensions_kwargs, "stack")
result = self

if invert:
if len(dimensions) > 1:
raise ValueError(
"The dimensions argument must have length 1 when invert=True"
max-sixty marked this conversation as resolved.
Show resolved Hide resolved
)

for new_dim, dims in dimensions.items():
# When inverting, all dims listed should be in the current
# DataArray's dims
for dim in dims:
if dim is Ellipsis:
raise ValueError(
"Ellipsis syntax cannot be used when invert=True"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Ellipsis syntax cannot be used when invert=True"
"Ellipsis cannot be used when invert=True"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

)
if dim not in self.dims:
raise ValueError(f'Dimension "{dim}" does not exist')

# Subtract specified dimensions from the DataArray's current
# dimensions and stack the resulting dimensions
dimensions = {
new_dim: tuple(dim for dim in self.dims if dim not in dims)
}

for new_dim, dims in dimensions.items():
result = result._stack_once(dims, new_dim, index_cls, create_index)

return result

def to_stacked_array(
Expand Down
7 changes: 7 additions & 0 deletions xarray/tests/test_dataarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -2547,6 +2547,13 @@ def test_stack_nonunique_consistency(self, da) -> None:
expected = DataArray(da.to_pandas().stack(), dims="z")
assert_identical(expected, actual)

def test_stack_inverted_consistency(self) -> None:
da = DataArray(
[[0, 1], [2, 3]],
dims=["x", "y"],
)
assert_identical(da.stack(z=["x"]), da.stack({"z": ["y"]}, invert=True))

def test_to_unstacked_dataset_raises_value_error(self) -> None:
data = DataArray([0, 1], dims="x", coords={"x": [0, 1]})
with pytest.raises(ValueError, match="'x' is not a stacked coordinate"):
Expand Down
Loading