xref some of the conversation in #20024. Right now the following is possible:
In [16]: df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=['key', 'val'])
In [18]: df - df.mean()
Out[18]:
key val
0 -0.5 -1.5
1 -0.5 -0.5
2 0.5 0.5
3 0.5 1.5
In [19]: df['val'] - df['val'].mean()
Out[19]:
0 -1.5
1 -0.5
2 0.5
3 1.5
Name: val, dtype: float64
But trying to do something similar with grouped data does not work:
In [20]: df.groupby('key') - df.groupby('key').mean()
...
ValueError: Unable to coerce to Series, length must be 1: given 2
I am proposing that we update the GroupBy class to allow numerical operations with the result of aggregations or transformations against that object. Note that this is possible today through a much more verbose and hackish workaround:
In [23]: df.groupby('key').shift(0) - df.groupby('key').transform('mean')
Out[23]:
val
0 -0.5
1 0.5
2 -0.5
3 0.5
The Series / DataFrame operations are all added via add_special_arithmetic_methods, with their implementations defined in ops.py. We could leverage a similar mechanism for GroupBy.
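To make the idea concrete, here is a minimal sketch of attaching arithmetic methods to a grouped object in a loop. It is only an illustration: `ArithmeticGroupBy` and `_add_arithmetic_methods` are hypothetical names, not pandas classes or functions, and the sketch assumes a single key column and a right operand indexed by the unique group keys (as returned by an aggregation such as `mean()`).

```python
import operator

import pandas as pd


class ArithmeticGroupBy:
    """Hypothetical wrapper used only for illustration; not a pandas class."""

    def __init__(self, obj, key):
        self._obj = obj
        self._key = key

    def mean(self):
        # Delegate the aggregation to the real pandas GroupBy.
        return self._obj.groupby(self._key).mean()

    def _arith(self, aggregated, op):
        # Assumption: `aggregated` is indexed by the unique group keys.
        values = self._obj.drop(columns=self._key)
        # Look up each row's group result, then re-label with the original index.
        broadcast = aggregated.loc[self._obj[self._key].to_numpy()]
        broadcast.index = values.index
        return op(values, broadcast)


def _add_arithmetic_methods(cls):
    # Attach one special method per operator.
    ops = [
        ("__add__", operator.add),
        ("__sub__", operator.sub),
        ("__mul__", operator.mul),
        ("__truediv__", operator.truediv),
    ]
    for name, op in ops:
        def method(self, other, _op=op):
            return self._arith(other, _op)

        method.__name__ = name
        setattr(cls, name, method)


_add_arithmetic_methods(ArithmeticGroupBy)

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=["key", "val"])
grouped = ArithmeticGroupBy(df, "key")
demeaned = grouped - grouped.mean()
print(demeaned)
```

With the example frame, `demeaned` holds the same `val` column as the `shift(0)` / `transform('mean')` workaround. A real implementation would live on GroupBy itself and would need to handle multiple keys, Series inputs and reflected operators.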
Why is this worth doing?
- Consistent arithmetic ops for Series, DataFrame and GroupBy objects
- May enable deprecation of methods like mad (see Cythonized GroupBy mad #20024)
- Provides easier "demeaning" and "normalization" for grouped data
- Mirrors xarray implementation which appears well received by user base
Why may it not be worth doing?
- Will add more complexity to a GroupBy class that is already in need of refactor
- TBD
Consideration Points
With this proposal, the left operand would always be a GroupBy object and the right operand would always be the result of a function application against that same GroupBy. The result of the operation should be a Series or DataFrame like-indexed to the original object.
That said, the following operations would in theory be identical:
df.groupby('key') - df.groupby('key').mean()
# OR
df.groupby('key') - df.groupby('key').transform('mean')
I'm not sure if we care to differentiate between these and force users into choosing one or the other.
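For reference, the equivalence of the two right operands can be checked with existing pandas calls. This sketch broadcasts the aggregated result back to the original rows by key lookup and compares it with the transformed result. The variable names are illustrative, and it works on the single `val` column to keep the comparison simple.

```python
import pandas as pd

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=["key", "val"])
grouped = df.groupby("key")["val"]

aggregated = grouped.mean()              # one value per group, indexed by key
transformed = grouped.transform("mean")  # one value per original row

# Broadcast the per-group means back to the original rows via the key column.
broadcast = df["key"].map(aggregated)

same = broadcast.tolist() == transformed.tolist()
via_aggregate = df["val"] - broadcast
via_transform = df["val"] - transformed
print(same)
print(via_aggregate.tolist(), via_transform.tolist())
```

The two operands differ only in their index: the aggregation is indexed by group key, the transformation by the original rows. An implementation could in principle use that difference to accept both forms, though the index alone may be ambiguous when the group keys happen to coincide with the row labels.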
Thoughts?