Support group_over #324
@shoyer - I want to look into putting a PR together for this. I'm looking for the same functionality that you get with a pandas Series or DataFrame: `data.groupby([lambda x: x.hour, lambda x: x.timetuple().tm_yday]).mean()`. The motivation comes from making a Hovmöller diagram. What we need is this functionality: `da.groupby(['time.hour', 'time.dayofyear']).mean().plot()`. If you can point me in the right direction, I'll see if I can put something together.
@jhamman For your use case, both hour and dayofyear are along the time dimension, so arguably the result should be 1D with a MultiIndex instead of 2D. So it might make more sense to start with that, and then layer on stack/unstack or pivot functionality. I guess there are two related use cases here:

1. Multiple groupby arguments along a single dimension.
2. Multiple groupby arguments along different dimensions.
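For grouping along a single dimension, pandas already gives the 1D-with-MultiIndex behaviour described above, and `.unstack()` recovers the 2D layout a Hovmöller plot needs. A minimal sketch with hypothetical data (not from the thread):

```python
import numpy as np
import pandas as pd

# hourly series spanning five days (hypothetical data)
idx = pd.date_range("2000-01-01", periods=24 * 5, freq="h")
data = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# grouping by two keys along the single time index gives a 1D
# result whose index is a MultiIndex of (hour, dayofyear) ...
grouped = data.groupby([idx.hour, idx.dayofyear]).mean()

# ... and unstacking pivots it into the 2D hour-by-day table
table = grouped.unstack()
```

The unstacked table has one row per hour and one column per day, which is exactly the shape a Hovmöller-style plot consumes.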
Agreed, we have two use cases here. For (1), can we just use the pandas grouping infrastructure? We just need to allow […]. For (2), I'll need to think a bit more about how this would work. Do we add a groupby method to […]?
For (2) I think it makes sense to extend the existing groupby to deal with multiple dimensions, i.e., let it take an iterable of dimension names.
Then we'd have something similar to the SQL GROUP BY, which is a good thing. By the way, in #527 we were considering using this approach to make the faceted plots on both rows and columns.
In case it is of interest to anyone, the snippet below is a temporary and quite dirty solution I've used to do a multi-dimensional groupby. It runs nested groupby-apply operations over each given dimension until no further grouping needs to be done, then applies the given function `apply_fn`.
Obviously, performance can be quite poor. Passing the dimensions to group over in order of increasing length will reduce the cost a little.
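The snippet itself did not survive on this page, but the recursion it describes can be sketched as follows, assuming xarray's `GroupBy.map` and a caller-supplied `apply_fn` (all names here are illustrative, not the original author's code):

```python
import numpy as np
import xarray as xr

def groupby_multi(obj, dims, apply_fn):
    """Nested groupby-apply over each name in `dims` (illustrative sketch)."""
    if not dims:
        # no further grouping to be done: apply the given function
        return apply_fn(obj)
    first, rest = dims[0], dims[1:]
    # group along the first dimension, then recurse over the remaining ones
    return obj.groupby(first).map(lambda g: groupby_multi(g, rest, apply_fn))

# example: reduce everything except ("x", "y") by taking the mean
da = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("x", "y", "z"),
    coords={"x": [0, 1], "y": [10, 20, 30]},
)
result = groupby_multi(da, ["x", "y"], lambda g: g.mean())
```

Here the result matches `da.mean("z")`, but each level of nesting re-runs a full groupby, which is why the cost caveat above applies.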
Is use case 1 (multiple groupby arguments along a single dimension) being held back for use case 2 (multiple groupby arguments along different dimensions)? Use case 1 would be very useful by itself.
No, I think the biggest issue is that grouping variables into a […]
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the stale label.
Still relevant.
Still relevant, also for me ... I just wanted to group by half hours, for which I'd need access to […]
Still relevant, would like to be able to group by multiple variables along a single dimension.
I have this almost ready in flox (needs more tests), so we should be able to do this soon. In the meantime, note that we can view grouping over multiple variables as a "factorization" (group identification) problem for aggregations. That means you can […]
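The factorization view can be carried out by hand with numpy and pandas as an interim workaround: factorize each grouping variable into integer codes, collapse the per-variable codes into one flat code with `np.ravel_multi_index`, and then run an ordinary single-key groupby over that code. A hedged sketch with illustrative data:

```python
import numpy as np
import pandas as pd

# two grouping variables along the same dimension (illustrative data)
hour = np.array([0, 1, 0, 1, 0, 1])
day = np.array([1, 1, 1, 2, 2, 2])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# factorize each variable into integer codes ...
hour_codes, hour_uniques = pd.factorize(hour)
day_codes, day_uniques = pd.factorize(day)

# ... then collapse the pair of codes into a single flat group id
flat = np.ravel_multi_index(
    (hour_codes, day_codes), (len(hour_uniques), len(day_uniques))
)

# a single groupby over the flat code is now a multi-variable groupby
means = pd.Series(values).groupby(flat).mean()
```

Each flat code corresponds to one (hour, day) pair, so the result can be unraveled back into the per-variable groups afterwards.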
Closes pydata#924 Closes pydata#1056 Closes pydata#9332 xref pydata#324
Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repetitively copying data.
The idea with `group_over` would be to support groupby operations that act on a single element from each of the given groups, rather than the unique values. For example, `ds.group_over(['lat', 'lon'])` would let you iterate over or apply to 2D slices of `ds`, no matter how many dimensions it has.

Roughly speaking (it's a little more complex for the case of non-dimension variables), `ds.group_over(dims)` would get translated into `ds.groupby([d for d in ds.dims if d not in dims])`.

Related: #266
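The translation above is just a set complement over dimension names. A tiny sketch of that rewrite rule (the dimension list is a stand-in for `ds.dims`, since `group_over` was never implemented):

```python
# stand-in for ds.dims on a Dataset with three dimensions
ds_dims = ["lat", "lon", "time"]

def group_over_dims(all_dims, over):
    # group_over(over) groups by every dimension *not* in `over`
    return [d for d in all_dims if d not in over]

# group_over(['lat', 'lon']) would iterate over 2D (lat, lon) slices,
# i.e. group by the remaining dimension(s)
grouped_dims = group_over_dims(ds_dims, ["lat", "lon"])
```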