
ENH: IntervalIndex as groups in groupby #37949

Open
twoertwein opened this issue Nov 19, 2020 · 12 comments

@twoertwein
Member

twoertwein commented Nov 19, 2020

Is your feature request related to a problem?

I often have a list of (variable-size) intervals and I want to aggregate multiple statistics for these intervals.

import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1]*100, 'b': [2]*100}, index=np.arange(0.0, 10, 0.1))
starts = [0.0, 2.5, 5, 7]
ends = [0.7, 3.8, 6.1, 9.5]

means = []
for start, end in zip(starts, ends):
    means.append(data.loc[start:end, :].mean())

Describe the solution you'd like

It would be cool (and probably faster) to do:

groups = pd.IntervalIndex.from_arrays(starts, ends)
data.groupby(groups).mean()  # *** ValueError: Grouper and axis must be same length
# or for multiple statistics
data.groupby(groups).aggregate(['mean', 'sum', lambda x: x.quantile(0.75) - x.quantile(0.25)])
@twoertwein added the Enhancement and Needs Triage labels Nov 19, 2020
@jorisvandenbossche
Member

I think you can already do this with pd.cut? You can bin your float index into the intervals defined by starts/ends using pd.cut (passing the IntervalIndex as the bins), and then group by the result.

@twoertwein
Member Author

Yes, pd.cut works if there are no overlapping intervals:

data.groupby(pd.cut(data.index, groups)).mean()

but it doesn't work with overlapping windows. pd.cut raises:
ValueError: Overlapping IntervalIndex is not accepted.

Unfortunately, pd.Categorical (the output of pd.cut) also needs a many-to-one mapping which will not allow it to accept overlapping intervals.
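
To make the failure mode concrete, here is a minimal sketch (the interval values are only illustrative, and the index is built with np.arange(100) / 10 instead of the original np.arange call to avoid float rounding) that checks IntervalIndex.is_overlapping before calling pd.cut and captures the error otherwise:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": 1, "b": 2}, index=np.arange(100) / 10)

# Disjoint intervals: pd.cut accepts these as bins.
disjoint = pd.IntervalIndex.from_arrays([0.0, 2.5], [0.7, 3.8])
# The last interval overlaps the second one: pd.cut rejects these.
overlapping = pd.IntervalIndex.from_arrays([0.0, 2.5, 3.1], [0.7, 3.8, 5.2])

print(disjoint.is_overlapping, overlapping.is_overlapping)

# Works: every row maps to at most one interval (rows outside all bins become NaN).
disjoint_means = data.groupby(pd.cut(data.index, disjoint), observed=True).mean()

# Fails: a row inside two intervals has no single category to map to.
try:
    pd.cut(data.index, overlapping)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Checking is_overlapping up front is just one way to decide whether the pd.cut route is available at all.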

@jorisvandenbossche
Member

Note that your example above doesn't use overlapping windows. The IntervalIndex you create with starts/ends doesn't contain overlapping intervals?

@jorisvandenbossche
Member

Also in general, with the current implementation of groupby, each row can only belong to a single group. So I am not fully sure what you would expect with overlapping windows.

@twoertwein
Member Author

Yes, you are right, the example doesn't cover that. Here is one with overlapping intervals:

import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1]*100, 'b': [2]*100}, index=np.arange(0.0, 10, 0.1))
starts = [0.0, 2.5, 5, 7, 3.1]  # the last interval overlaps with other intervals
ends = [0.7, 3.8, 6.1, 9.5, 5.2]

means = []
for start, end in zip(starts, ends):
    means.append(data.loc[start:end, :].mean())
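
One possible workaround with the current groupby, sketched under my own assumptions (this is not an existing pandas feature): repeat each row once per interval that contains it, so that every repeated row belongs to exactly one group. The names membership, interval_id and row_id are made up for illustration, the intervals are treated as closed on both sides to match data.loc[start:end], and the index is built with np.arange(100) / 10 to avoid float rounding at the interval edges.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": np.arange(100.0), "b": 2.0}, index=np.arange(100) / 10)
starts = [0.0, 2.5, 5, 7, 3.1]  # the last interval overlaps with other intervals
ends = [0.7, 3.8, 6.1, 9.5, 5.2]
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed="both")

positions = data.index.to_numpy()
left = intervals.left.to_numpy()[:, None]
right = intervals.right.to_numpy()[:, None]

# membership[i, j] is True when row j lies inside interval i.
membership = (positions >= left) & (positions <= right)
interval_id, row_id = np.nonzero(membership)

# Rows covered by several intervals appear once per interval here.
expanded = data.iloc[row_id]
result = expanded.groupby(interval_id).mean()

# Intervals without any rows would be missing, so restore them as NaN rows.
result = result.reindex(range(len(intervals)))
result.index = intervals
print(result)
```

The membership matrix has (number of intervals) x (number of rows) entries, so this is only reasonable for a moderate number of intervals.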

@twoertwein
Member Author

Also in general, with the current implementation of groupby, each row can only belong to a single group. So I am not fully sure what you would expect with overlapping windows.

That is good to know. I think I saw it more from a windowing perspective and then noticed that groupby might also do a very similar job.

Do you think it might be possible to (ab-)use the BaseIndexer to then use .rolling?

@jorisvandenbossche
Member

I am not familiar enough with BaseIndexer to know, but I think it is worth a try.

@jreback
Contributor

jreback commented Nov 19, 2020

Yes, the indexers provide a nice approach here.
It might be worth adding a PartitionIndexer that groupby could accept.

cc @mroeschke

@mroeschke
Member

BaseIndexer would be the more appropriate solution if you have arbitrary start and end points, but there are some implicit assumptions:

  • The BaseIndexer subclass would need to return integer start and end points (unlike in your example, where you are using floats), since it's selecting windows from an underlying numpy array.
  • The BaseIndexer subclass would need to return as many start and end points as there are rows in your DataFrame, as the rolling API assumes the returned object is of the same shape as the input.
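
A rough sketch of what satisfying both constraints could look like. PaddedIntervalIndexer, int_starts and int_ends are hypothetical names, not part of pandas, and padding the unused rows with empty windows is an (ab)use of rolling rather than a recommended pattern. The get_window_bounds signature with step assumes pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class PaddedIntervalIndexer(BaseIndexer):
    """Hypothetical indexer: row i gets the i-th interval, all later rows get empty windows."""

    def get_window_bounds(self, num_values=0, min_periods=None, center=None, closed=None, step=None):
        k = len(self.int_starts)
        if k > num_values:
            raise ValueError("more intervals than rows; bounds cannot match the data length")
        # Constraint 2: one (start, end) pair per row of the input.
        start = np.zeros(num_values, dtype=np.int64)
        end = np.zeros(num_values, dtype=np.int64)
        start[:k] = self.int_starts
        end[:k] = self.int_ends
        return start, end


data = pd.DataFrame({"a": np.arange(100.0), "b": 2.0}, index=np.arange(100) / 10)
starts = [0.0, 2.5, 5, 7, 3.1]
ends = [0.7, 3.8, 6.1, 9.5, 5.2]

# Constraint 1: translate float labels into integer positions [start, end).
int_starts = data.index.searchsorted(starts, side="left")
int_ends = data.index.searchsorted(ends, side="right")

indexer = PaddedIntervalIndexer(int_starts=int_starts, int_ends=int_ends)
rolled = data.rolling(indexer, min_periods=1).mean()

# Only the first len(starts) rows carry interval results; the rest are NaN.
result = rolled.iloc[: len(starts)].set_axis(
    pd.IntervalIndex.from_arrays(starts, ends, closed="both")
)
print(result)
```

The output of rolling still has one row per input row, so the interval results have to be sliced out afterwards, and the approach cannot handle more intervals than rows.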

@twoertwein
Member Author

Thank you for your comments!

* The BaseIndexer subclass would need to return as many start and end points as there are rows in your DataFrame, as the rolling API assumes the returned object is of the same shape as the input.

I was implementing BaseIndexer but the above constraint leads to segfaults as the assumption is not met. Is there a way to use BaseIndexer outside of rolling (and still use the optimized cython functions)?

@mroeschke
Member

Unfortunately no, rolling is the only API that accepts BaseIndexer subclasses.

In theory, other APIs could accept it, since it just returns start and end bounds, but that would need to be implemented.
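
Until something like that exists, integer start and end bounds can also be consumed directly with NumPy. The following is only a sketch for sums and means using prefix sums (the names lo, hi and prefix are illustrative); it does not use the optimized rolling code and does not generalize to arbitrary aggregations such as quantiles.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": np.arange(100.0), "b": 2.0}, index=np.arange(100) / 10)
starts = [0.0, 2.5, 5, 7, 3.1]
ends = [0.7, 3.8, 6.1, 9.5, 5.2]

# Integer bounds [lo, hi) for label intervals that are closed on both sides.
lo = data.index.searchsorted(starts, side="left")
hi = data.index.searchsorted(ends, side="right")

values = data.to_numpy(dtype=float)
# prefix[i] holds the column sums of the first i rows.
prefix = np.vstack([np.zeros((1, values.shape[1])), values.cumsum(axis=0)])

sums = prefix[hi] - prefix[lo]
counts = (hi - lo)[:, None]  # an empty interval would give a 0 count and a NaN mean

means = pd.DataFrame(
    sums / counts,
    columns=data.columns,
    index=pd.IntervalIndex.from_arrays(starts, ends, closed="both"),
)
print(means)
```

Overlapping intervals are not a problem here, because each interval is evaluated independently from the same prefix sums.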

@yohplala

yohplala commented May 26, 2021

Hi there,
I am deeply interested in applying aggregation functions on overlapping windows, where the number of windows differs from the DataFrame length (hence rolling with a BaseIndexer is not applicable). I have also found another ticket on the same topic that expresses the same need:
#27654

@mzeitlin11 added the Groupby and Interval labels and removed the Needs Triage label Jul 1, 2021