API: Consolidate groupby as_index and group_keys #49543
Comments
Would it be feasible instead to add the new [...]? I remember having a similar API decision when deciding [...]
Overall plan seems reasonable. I'm not sure [...]
Love the overall idea. An [...] Stepping back, we have the question of:
I think 1. should just continue to use the [...]
@mroeschke - I definitely see the benefits of your proposal. The downside would be having to specify [...]

@bashtage - the argument would be specifying where the groupby keys go. I don't think this is conveyed by e.g. [...]
I don't follow this - can you explain in more detail?
I'm not sure I see how this question is related - currently there is no option in groupby to choose whether to align or not. Maybe this is another enhancement you're thinking of?
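For reference, a minimal sketch (ours, with a recent pandas) of the alignment being referred to: `transform` results come back aligned to the caller's index, while reductions do not.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [10, 20, 30]})

# transform: result is aligned to the original index (one value per input row)
print(df.groupby("a")["b"].transform("sum"))
# 0    40
# 1    20
# 2    40
# Name: b, dtype: int64

# reduction: one row per group, keyed by the group labels instead
print(df.groupby("a")["b"].sum())
# a
# 1    40
# 2    20
# Name: b, dtype: int64
```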
Thanks for thinking this through and the proposal, @rhshadrach! I added some thoughts below, but to be clear, while those might seem "critical" of the proposal, I certainly agree that we should try to clean up this part of the API (different keywords that do very similar things, but not exactly, and for different methods, is not ideal). To summarize the context for myself, the underlying behaviour (not the current API) we want to control somehow is the following aspect of the [...]
Some unstructured thoughts:
Rereading my comments, maybe my current thoughts could be summarized as: do we actually need to keep the [...]
Thanks @jorisvandenbossche! Any and all thoughts (critical or not) are much appreciated. Agreed on your objection to the name (and I think @WillAyd's as well). I'm focusing on the other aspects here first, and if we find something we want to move forward with, we can workshop the name. I'll keep calling it `keys_axis` for now.
Your parenthetical comment is indeed the reason I think it should be included, and so I disagree that it's "expanding the behavioural options unnecessarily".
Users would be able to get the keys in the index or column if they so desired; the documentation would be changed. The default behavior [...]
I'm not sure what this means; apply does different things based on whether the function is e.g. a reduction or transform. When you say "need to keep" here, what does removing it entail? To make this more explicit, I think it would be helpful to consider the following example:
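For instance, a minimal sketch along these lines (illustrative only, ours, assuming a recent pandas) of how `apply`'s result depends on whether the UDF reduces or transforms:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 3]})

# UDF that reduces each group to a scalar: the group keys become the index
print(df.groupby("a").apply(lambda g: g["b"].sum()))
# a
# 1    3
# 2    3
# dtype: int64

# UDF that returns an object shaped like its input: rows keep the original index
print(df.groupby("a", group_keys=False).apply(lambda g: g[["b"]].cumsum()))
#    b
# 0  1
# 1  3
# 2  3
```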
Reflecting on your comments more, I'm not sure this is what you're necessarily getting at, but it occurs to me that we have [...]

To reason about this, it would be useful to have examples of usages where the UDF is neither a reducer nor a transform. @MarcoGorelli - you mentioned in #49497 (comment) that you scraped Kaggle notebooks. I'm wondering if something similar can be done here. Do you have code for this that can be shared?
You'll need to set up an account on Kaggle and download a token (https://github.com/Kaggle/kaggle-api#api-credentials), and then [...]
should work. Then you can just remove all outputs with [...]
Does [...]? I think it would also be easier to borrow the terminology for accepted argument values from pivot_table of [...]
key_placement might also be an option.
On the transform case
Just to be very specific about the exact behaviour you are thinking of: would it be adding the original column (used as key) from the calling dataframe to the result as an index level / column, and thus keeping the original order? (which is not what groupby.apply with `group_keys=True` gives, see below)

```python
>>> df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})

# with setting key as index
>>> df.groupby("a").transform(lambda x: x.cumsum()).set_index(df['a'], append=True).reorder_levels([1, 0], 0)
     b
a
1 0  1
2 1  2
1 2  4

# with setting key as column
>>> df.groupby("a").transform(lambda x: x.cumsum()).assign(a=df['a'])[['a', 'b']]
   a  b
0  1  1
1  2  2
2  1  4
```

I assume it is the above, but to be explicit, the current `group_keys=True` behaviour of `apply` groups the rows by key rather than keeping the original order:

```python
>>> df.groupby("a", group_keys=True).apply(lambda x: x[['b']].cumsum())
     b
a
1 0  1
  2  4
2 1  2
```

That said, I am still a bit skeptical whether this functionality is really "needed" for `transform`.

I think the output shown in the "with setting key as column" example above is certainly an interesting output, repeating it here:

```python
>>> df.groupby("a").transform(lambda x: x.cumsum()).assign(a=df['a'])[['a', 'b']]
   a  b
0  1  1
1  2  2
2  1  4
```

But I think the way this is typically achieved right now with transform is with something like:

```python
>>> df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
>>> df["b"] = df.groupby("a")["b"].transform(lambda x: x.cumsum())
>>> df
   a  b
0  1  1
1  2  2
2  1  4
```

for which you want the current behaviour of having the result of `transform` aligned with the original dataframe. If we want to enable the above in a method-chaining manner, I think we should rather explore other options, such as adding an [...]

Here, it gives the same output as a potential [...]

I know this is getting a bit off-topic, but I do think we should consider both the consistency aspect (i.e. how can we make this keys-as-index/column handling consistent across the different groupby methods) as well as the actual use cases, so we are not adding API surface only for the sake of consistency but also because it is useful.
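For reference, the method-chaining style discussed above can already be approximated with the existing API; a minimal sketch (ours), using `assign` together with a groupby `transform`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 2, 3]})

# method-chaining: replace 'b' with its per-group cumulative sum
# without mutating df in place
out = df.assign(b=lambda d: d.groupby("a")["b"].transform("cumsum"))
print(out)
#    a  b
# 0  1  1
# 1  2  2
# 2  1  4
```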
@rhshadrach yes, that is certainly related. [...]

```python
>>> df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]}, index=[6, 7, 8])
>>> df.groupby(['x', 'y', 'z'], group_keys=True).apply(lambda x: x.cumsum())
     a  b
x 6  0  3
y 7  1  4
z 8  2  5
>>> df.groupby(['x', 'y', 'z'], group_keys=False).apply(lambda x: x.cumsum())
   a  b
6  0  3
7  1  4
8  2  5
```

This behaviour of `group_keys=False` can also be obtained from the `group_keys=True` result by dropping the added index level afterwards:

```python
>>> df.groupby(['x', 'y', 'z'], group_keys=True).apply(lambda x: x.cumsum()).reset_index(level=0, drop=True)
   a  b
6  0  3
7  1  4
8  2  5
```

Given that this is easy to do if you want this specific output, is it worth having a [...]?

Another reason to have the current `group_keys=False` behaviour of `apply`:

```python
>>> df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
>>> df.groupby("a", group_keys=True).apply(lambda x: x[['b']].cumsum())
     b
a
1 0  1
  2  4
2 1  2
>>> df.groupby("a", group_keys=False).apply(lambda x: x[['b']].cumsum())
   b
0  1
1  2
2  4
# mimic group_keys=False result with reset_index gives different order
>>> df.groupby("a", group_keys=True).apply(lambda x: x[['b']].cumsum()).reset_index(level=0, drop=True)
   b
0  1
2  4
1  2
```

But personally I would rather get rid of this special case in the [...]
Everything in this issue also applies to `Series.groupby` and `SeriesGroupBy`; I will just be writing it for `DataFrame`.

Currently `DataFrame.groupby` has two arguments that are essentially for the same thing:

- `as_index`: Whether to include the group keys in the index or, when the groupby is done on column labels (see #49519), in the columns.
- `group_keys`: Whether to include the group keys in the index when calling `DataFrameGroupBy.apply`.

`as_index` only applies to reductions, `group_keys` only applies to `apply`. I think this is confusing and unnecessarily restrictive.

I propose we:

- Deprecate `as_index` and `group_keys`.
- Add `keys_axis` to both `DataFrame.groupby` and `DataFrameGroupBy.apply`; these take the same arguments, the only difference is that the value in `DataFrameGroupBy.apply`, if specified, overrides the value in `DataFrame.groupby`.

`keys_axis` can accept the following values:

- [...] (`as_index=True` or `group_keys=False`)
- [...] (`as_index=False`)
- [...]: For built-in methods (`sum`, `cumsum`, `head`), reductions will return a `RangeIndex`, transforms and filters will behave as they do today, returning the input's index or a subset of it for a filter. For `apply`, this will behave the same as `group_keys=False` today.

Unlike `as_index`, this argument will be respected in all groupby functions, whether they be reductions, transforms, or filters.

Path to implementation: add `keys_axis` in 2.0, and either add a `PendingDeprecationWarning` or a `DeprecationWarning` to `as_index` / `group_keys`.

A few natural questions come to mind:

**Why not reuse `as_index` or `group_keys`?** Currently these arguments are Boolean; the new argument needs to accept more than two values, where the name reflects that it is accepting an axis. Also, adding a new argument provides a cleaner and more gradual path for deprecation.

**Why [...] `group_keys` to `DataFrameGroupBy.apply`?** In other groupby methods, we can reliably use `keys_axis="infer"` to determine the correct placement of the keys. However, in `apply` it is inferred from the output, and various cases can coincide - e.g. a reduction and a transformation on a DataFrame with a single row. We want the user to be able to use `"infer"` on other groupby methods, but be able to specify how their UDF in `apply` acts. E.g. [...]

**Why does `keys_axis` accept the value `"none"`?** This is currently how transforms and filters work - where the keys are added to neither the index nor the columns. We need to keep the ability to specify to `groupby(...).apply` that the UDF they are provided acts as a transform or filter.

**Why not `group_keys_axis`?** I find "group" here redundant, but would be fine with this name too, and happy to consider other potential names.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage
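For readers coming to this later, a minimal sketch (ours, reflecting pandas 1.5-era behaviour) of the split between the two keywords described above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [10, 20, 30]})

# as_index only affects reductions: it decides whether the keys
# end up in the index or stay as a column
print(df.groupby("a", as_index=True).sum())   # 'a' becomes the index
print(df.groupby("a", as_index=False).sum())  # 'a' stays as a column

# group_keys only affects apply: it decides whether the keys are
# prepended to the result's index
print(df.groupby("a", group_keys=True).apply(lambda g: g[["b"]].cumsum()))
print(df.groupby("a", group_keys=False).apply(lambda g: g[["b"]].cumsum()))
```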