Add invert option to DataArray/Dataset.stack() #8332

carschandler · 2023-10-18T13:38:38Z

This brings in the option to stack all dimensions except for one or more dimensions listed. I find this very useful for quickly iterating over all the combinations of dimensions except for one (i.e. you have a number of input parameters that are parameterized and one time dimension, and you want to calculate some time response for all the combinations of these input parameters and store them in the time-row corresponding to the appropriate combination of inputs).

I played around with implementing to_stacked_array() for DataArray, but this made less sense in the end since that method was really designed for Datasets.

Addresses A way to stack all dimensions but one? #8278
Tests added (added one for DataArray, just now realizing I should have one for Dataset)
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

welcome · 2023-10-18T13:38:41Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

max-sixty

Thanks for the PR @carschandler ! I left some comments. Interested what others think too.

xarray/core/dataset.py

max-sixty · 2023-10-18T18:35:59Z

xarray/core/dataset.py

+                for dim in dims:
+                    if dim is Ellipsis:
+                        raise ValueError(
+                            "Ellipsis syntax cannot be used when invert=True"


Suggested change

"Ellipsis syntax cannot be used when invert=True"

"Ellipsis cannot be used when invert=True"

max-sixty · 2023-10-18T18:41:50Z

xarray/core/dataarray.py

+        invert : bool, default: False
+            When `True`, all dimensions of the DataArray except for the sequence of
+            dimensions specified in the `dimensions` parameter will be stacked.
+            `dimensions` must have a length of 1 in this case, all dimensions listed
+            must exist in the current DataArray. Neither the `**dimensions_kwargs` nor
+            the ellipsis syntaxes may be used.


I empathize with the use case here. That said:

Low-ish confidence, but I'm not a great fan of a parameter which totally changes another parameter's meaning.

I'm not sure of a great way of doing it, though. Would adding this to .groupby possibly make sense?? That's generally how we "invert" the dimensions. Or another param except_dimensions?

I agree that producing side-effects on other parameters is icky. I'm not sure how well it fits in groupby though... I'm open to an except_dimensions parameter... do you think it would need to be mutually exclusive with dimensions? Also open to something like a stack_except method, but don't know how sensitive the project is to introducing new methods.

(deleted a comment I submitted too early)

I think except_dimensions could be a reasonable option. I don't love it, since this would be the only place in the library we use it.

Edit: this doesn't really work for group_by, because the semantics are different where the indexes aren't unique, would need something like group_over as suggested later. Am leaving the original comment here regardless

.groupby could maybe work:

original from the docstring:

>>> stacked = arr.stack(newdim=("x", "y")) >>> stacked <xarray.DataArray (z: 2, newdim: 6)> array([[ 0, 2, 4, 6, 8, 10], [ 1, 3, 5, 7, 9, 11]]) Coordinates: * z (z) <U5 'alpha' 'beta' * newdim (newdim) object MultiIndex * x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b' * y (newdim) int64 0 1 2 0 1 2

Using groupby:

>>> stacked = arr.groupby('z').stack(newdim=...) >>> stacked <xarray.DataArray (z: 2, newdim: 6)> array([[ 0, 2, 4, 6, 8, 10], [ 1, 3, 5, 7, 9, 11]]) Coordinates: * z (z) <U5 'alpha' 'beta' * newdim (newdim) object MultiIndex * x (newdim) <U1 'a' 'a' 'a' 'b' 'b' 'b' * y (newdim) int64 0 1 2 0 1 2

It does kinda make sense, since the z is the groupby key, and remains as a dimension like its general usage (i.e. try with .sum if this isn't clear).

On reflection, I quite like it. But it's quite possible I'm missing something.

That's not bad... it's definitely not the first place I would look if I were trying to figure out how to do this because I typically think of using groupby for arranging groups of data with matching coordinate values, and this concept stands separately from coordinates, but maybe that's just my misconception of what groupby is. If we took it that route, I think we'd want to make sure to put a note in the DataArray/Dataset stack documentation about it.

Does anyone else @pydata/xarray have views on the API? I think two broad options (along with a third of "do nothing, folks can write it themselves):

The groupby option — potentially very nice, but also maybe too "out there"

The exclude_dims option — this would be the only method we have this (the others use .groupby), so it's not really orthogonal

There's another which is to make xr.ALL_DIMS into a class, and allow xr.ALL_DIMS.exclude('foo', 'bar'). But introduces new functionality for possibly a fairly small use-case.

Thanks @carschandler for the PR!

Maybe I'm too conservative, but do we really need some specific API for this case? Couldn't we just do

arr.stack(newdim=[d for d in arr.dims if d != "z"])

A potential source of confusion for Dataset.stack() is the order of the stacked dimensions given as argument, which may be important in some cases especially if a multi-index is built from the stacked coordinates (it is by default). The order of the dimensions of a Dataset has no particular meaning so I think it is always preferable to pass them in a more explicit way to .stack.

Of course you could! I think you can always make this argument though (couldn't you just use np.zeros + 1 instead of np.ones, np.zeros(a.shape) instead of np.zeros_like, and couldn't you just use a MultiIndex in a DataFrame or some numpy arrays in a dict to store multidimensional data?), and I’m in favor of adding conveniences when they don't obscure the rest of the code. If everyone agrees that this does indeed obscure the project, then I will gladly rest my case, but I think that the true value of a project like xarray is in how the data can be quickly and simply indexed, so I figured this would be a nice convenience since all the other ways of achieving it are kind of ugly.

I do agree that the lack of order on a Dataset does pose a problem for setting the MultiIndex order... that being said, I think that there are still use cases where the order may not matter. Thanks for your feedback.

max-sixty · 2023-10-18T18:43:29Z

xarray/core/dataarray.py

        MultiIndex([('a', 0),
                    ('a', 1),
                    ('a', 2),
                    ('b', 0),
                    ('b', 1),
                    ('b', 2)],
-                   name='z')
+                   name='newdim')
+        >>> inverted_stack = arr.stack({"newdim": "z"}, invert=True)


I think a clearer example would be exactly the same as https://github.com/pydata/xarray/pull/8332/files#diff-386f35dda14ca12d315f2eb71f0ab35d7fdc73419b5dc5175ea5720f3206903bR2793, but with the addition of invert=True (or whatever construction we end up using).

(If copying the new output is annoying, we can use pytest-accept...)

max-sixty · 2023-10-18T18:44:26Z

xarray/core/dataarray.py

+        >>> np.all(inverted_stack.indexes["newdim"] == stacked.indexes["newdim"])
+        True
+        >>> np.all(inverted_stack == stacked)
+        <xarray.DataArray ()>
+        array(True)


I think these are great tests, but possibly less good examples...

dcherian · 2023-10-20T17:26:53Z

Yeah perhaps group_over is the way: #324

max-sixty · 2023-10-20T17:43:23Z

Yeah perhaps group_over is the way: #324

I added a note to my comment above caveating my .groupby suggestion. .groupby has different semantics from "excluding these dimensions" when the indexes aren't unique, and so is not so great a suggestion. .group_over could work, though would be a bigger effort.

Currently ranked vote:

Offer ~~xr.ALL_DIMS.exclude("foo")~~ edit: since it needs to be an instance, would need to be xr.ALL_DIMS(exclude='foo') — it wouldn't be too difficult, since the code could be added to the infix_dims func and it would work everywhere. I don't love it, but it doesn't intrude on existing APIs
Do nothing — a shame, but a list comp is indeed not too bad. Could wait for .group_over
exclude_dims arg to .stack
invert

carschandler · 2023-10-20T17:48:02Z

Yeah perhaps group_over is the way: #324

This is a good reference. Thanks!

benbovy · 2023-10-21T13:05:06Z

Sorry @carschandler if you feel that my #8332 (comment) was dismissive, it wasn't intended to be so. I definitely don't want to discourage anyone making suggestions / contributions to improve Xarray, and I don't think that any suggestion would obscure the project!

I just wanted to say that in this specific case the addition of an argument like invert or exclude_dims, while it would certainly make some cases a bit more convenient and succinct, is problematic in several ways. Not only regarding the order of the dimensions of a Dataset, but also on how to handle the case where multiple stacked dimensions are given as kwargs to stack. I'm sure we could come up with some basic rules to easily handle those cases (e.g., do not allow multiple stack dim kwargs) but is it worth the trouble? About the ugliness of all other ways of achieving it, I respectfully disagree. While a list comp is slightly more verbose, it is easy to understand. At a first glance the .group_over or xr.ALL_DIMS-based alternatives look also good IMHO.

carschandler · 2023-10-23T13:58:37Z

@benbovy oh no not at all! Hope you didn't think I had taken offense after reading my response. I did genuinely appreciate your feedback and fully understand your sentiment. I understand your concerns about the invert and exclude_dims options, and your point about the list comprehension is valid as well. I agree that pursuing one of the other alternatives mentioned is probably better. At the end of the day, I just think it would be nice to be able to quickly iterate over rows across one (or more) dimension(s); whether it's in stack or some other method is less important.

max-sixty · 2023-10-23T19:12:22Z

Thanks @carschandler ! Very much appreciate your attitude and approach here!

What do folks think about xr.ALL_DIMS(exclude=['foo', 'bar'])?

To do this, we would make xr.ALL_DIMS into a small class and store exclusions. And then in infix_dims, we'd check for the class and materialize them based on the current dims.

I don't love it — it's not that much cleaner than [d for d in ds.dims if d not in ['foo', 'bar']]. But it's also not intrusive, doesn't require widespread changes or maintenance — I don't see much downside (maybe another item on the Dims type definition; though maybe not since it's Hashable...)

carschandler · 2023-10-23T19:55:08Z

@max-sixty I agree with all your points. Not as concise as it could be in group_over, but I do think that the syntax is still marginally cleaner than the list comprehension, and I think its meaning is more quickly conveyed to the reader. It also provides a canonical way to get all dims except one, and I always appreciate when a project gives me a way to do something; gives me more confidence in my code. On the other hand, it's nice to have pure equality between ... and ALL_DIMS, but I don't think that that's a good enough reason to reject it.

If it is much work, then I would say it's not worth worrying about it and that list comprehension (or even da.to_dataset().to_stacked_array()) would work for now until group_over is ready. If it's a simple change, I think it's a nice convenience.

carschandler · 2023-10-24T18:03:03Z

One other thought is that when doing chained operations, list comprehensions can become cumbersome if what you want to stack is actually some indexed version of an array. Of course you could assign to another array intermediately, but once again this is clunkier and may not always create a view depending on your indexing.

Add invert option to DataArray/Dataset.stack()

ad938c1

max-sixty reviewed Oct 18, 2023

View reviewed changes

max-sixty mentioned this pull request Dec 13, 2023

Add overloads to get_axis_num #8547

Merged

	"Ellipsis syntax cannot be used when invert=True"
	"Ellipsis cannot be used when invert=True"

Uh oh!

Add invert option to DataArray/Dataset.stack() #8332

Are you sure you want to change the base?

Add invert option to DataArray/Dataset.stack() #8332

Uh oh!

Conversation

carschandler commented Oct 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

welcome bot commented Oct 18, 2023

Uh oh!

max-sixty left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

max-sixty Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

carschandler Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

max-sixty Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

carschandler Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

max-sixty Oct 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

carschandler Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

max-sixty Oct 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benbovy Oct 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

carschandler Oct 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

max-sixty Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

max-sixty Oct 18, 2023

Choose a reason for hiding this comment

Uh oh!

dcherian commented Oct 20, 2023

Uh oh!

max-sixty commented Oct 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carschandler commented Oct 20, 2023

Uh oh!

benbovy commented Oct 21, 2023

Uh oh!

carschandler commented Oct 23, 2023

Uh oh!

max-sixty commented Oct 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carschandler commented Oct 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carschandler commented Oct 24, 2023

Uh oh!

Uh oh!

carschandler commented Oct 18, 2023 •

edited

Loading

max-sixty Oct 18, 2023 •

edited

Loading

max-sixty Oct 19, 2023 •

edited

Loading

benbovy Oct 20, 2023 •

edited

Loading

carschandler Oct 20, 2023 •

edited

Loading

max-sixty commented Oct 20, 2023 •

edited

Loading

max-sixty commented Oct 23, 2023 •

edited

Loading

carschandler commented Oct 23, 2023 •

edited

Loading