Support group_over #324
@shoyer - I want to look into putting a PR together for this. I'm looking for the same functionality that you get with a pandas Series or DataFrame: `data.groupby([lambda x: x.hour, lambda x: x.timetuple().tm_yday]).mean()`. The motivation comes from making a Hovmöller diagram. What we need is this functionality: `da.groupby(['time.hour', 'time.dayofyear']).mean().plot()`. If you can point me in the right direction, I'll see if I can put something together.
@jhamman For your use case, both hour and dayofyear are along the time dimension, so arguably the result should be 1D with a MultiIndex instead of 2D. So it might make more sense to start with that, and then layer on stack/unstack or pivot functionality. I guess there are two related use cases here:

1. Multiple groupby arguments along a single dimension.
2. Multiple groupby arguments along different dimensions.
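For grouping along a single dimension, pandas already gives the 1D-with-MultiIndex behaviour described above, and `.unstack()` recovers the 2D layout a Hovmöller plot needs. A minimal sketch with hypothetical data (not from the thread):

```python
import numpy as np
import pandas as pd

# hourly series spanning five days (hypothetical data)
idx = pd.date_range("2000-01-01", periods=24 * 5, freq="h")
data = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# grouping by two keys along the single time index gives a 1D
# result whose index is a MultiIndex of (hour, dayofyear) ...
grouped = data.groupby([idx.hour, idx.dayofyear]).mean()

# ... and unstacking pivots it into the 2D hour-by-day table
table = grouped.unstack()
```

The unstacked table has one row per hour and one column per day, which is exactly the shape a Hovmöller-style plot consumes.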
Agreed, we have two use cases here. For (1), can we just use the pandas grouping infrastructure? We just need to allow […]. For (2), I'll need to think a bit more about how this would work. Do we add a groupby method to […]?
For (2) I think it makes sense to extend the existing groupby to deal with multiple dimensions, i.e., let it take an iterable of dimension names.
Then we'd have something similar to the SQL GROUP BY, which is a good thing. By the way, in #527 we were considering using this approach to make the faceted plots on both rows and columns.
In case it is of interest to anyone, the snippet below is a temporary and quite dirty solution I've used to do a multi-dimensional groupby. It runs nested groupby-apply operations over each given dimension until no further grouping needs to be done, then applies the given function `apply_fn`.
Obviously, performance can be quite poor. Passing the dimensions to group over in order of increasing length will reduce the cost a little.
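The snippet itself did not survive on this page, but the recursion it describes can be sketched as follows, assuming xarray's `GroupBy.map` and a caller-supplied `apply_fn` (all names here are illustrative, not the original author's code):

```python
import numpy as np
import xarray as xr

def groupby_multi(obj, dims, apply_fn):
    """Nested groupby-apply over each name in `dims` (illustrative sketch)."""
    if not dims:
        # no further grouping to be done: apply the given function
        return apply_fn(obj)
    first, rest = dims[0], dims[1:]
    # group along the first dimension, then recurse over the remaining ones
    return obj.groupby(first).map(lambda g: groupby_multi(g, rest, apply_fn))

# example: reduce everything except ("x", "y") by taking the mean
da = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("x", "y", "z"),
    coords={"x": [0, 1], "y": [10, 20, 30]},
)
result = groupby_multi(da, ["x", "y"], lambda g: g.mean())
```

Here the result matches `da.mean("z")`, but each level of nesting re-runs a full groupby, which is why the cost caveat above applies.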
Is use case 1 (multiple groupby arguments along a single dimension) being held back for use case 2 (multiple groupby arguments along different dimensions)? Use case 1 would be very useful by itself.
No, I think the biggest issue is that grouping variables into a […]
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the stale label.
Still relevant.
Still relevant, also for me ... I just wanted to group by half hours, for which I'd need access to […]
Still relevant, would like to be able to group by multiple variables along a single dimension.
I have this almost ready in flox (needs more tests), so we should be able to do this soon. In the meantime, note that we can view grouping over multiple variables as a "factorization" (group identification) problem for aggregations. That means you can […]
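The factorization view can be carried out by hand with numpy and pandas as an interim workaround: factorize each grouping variable into integer codes, collapse the per-variable codes into one flat code with `np.ravel_multi_index`, and then run an ordinary single-key groupby over that code. A hedged sketch with illustrative data:

```python
import numpy as np
import pandas as pd

# two grouping variables along the same dimension (illustrative data)
hour = np.array([0, 1, 0, 1, 0, 1])
day = np.array([1, 1, 1, 2, 2, 2])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# factorize each variable into integer codes ...
hour_codes, hour_uniques = pd.factorize(hour)
day_codes, day_uniques = pd.factorize(day)

# ... then collapse the pair of codes into a single flat group id
flat = np.ravel_multi_index(
    (hour_codes, day_codes), (len(hour_uniques), len(day_uniques))
)

# a single groupby over the flat code is now a multi-variable groupby
means = pd.Series(values).groupby(flat).mean()
```

Each flat code corresponds to one (hour, day) pair, so the result can be unraveled back into the per-variable groups afterwards.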
Closes pydata#924 Closes pydata#1056 Closes pydata#9332 xref pydata#324
Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repetitively copying data.
The idea with `group_over` would be to support groupby operations that act on a single element from each of the given groups, rather than the unique values. For example, `ds.group_over(['lat', 'lon'])` would let you iterate over or apply to 2D slices of `ds`, no matter how many dimensions it has.

Roughly speaking (it's a little more complex for the case of non-dimension variables), `ds.group_over(dims)` would get translated into `ds.groupby([d for d in ds.dims if d not in dims])`.

Related: #266
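The translation above is just a set complement over dimension names. A tiny sketch of that rewrite rule (the dimension list is a stand-in for `ds.dims`, since `group_over` was never implemented):

```python
# stand-in for ds.dims on a Dataset with three dimensions
ds_dims = ["lat", "lon", "time"]

def group_over_dims(all_dims, over):
    # group_over(over) groups by every dimension *not* in `over`
    return [d for d in all_dims if d not in over]

# group_over(['lat', 'lon']) would iterate over 2D (lat, lon) slices,
# i.e. group by the remaining dimension(s)
grouped_dims = group_over_dims(ds_dims, ["lat", "lon"])
```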