API: Consolidate groupby as_index and group_keys #49543


Open
rhshadrach opened this issue Nov 5, 2022 · 12 comments
Labels
API Design Groupby Needs Discussion Requires discussion from core team before further action

Comments

@rhshadrach
Member

Everything in this issue also applies to Series.groupby and SeriesGroupBy; I will just be writing it for DataFrame.

Currently, DataFrame.groupby has two arguments that essentially serve the same purpose:

  • as_index: Whether to include the group keys in the index or, when the groupby is done on column labels (see #49519), in the columns.
  • group_keys: Whether to include the group keys in the index when calling DataFrameGroupBy.apply.

as_index only applies to reductions; group_keys only applies to apply. I think this is confusing and unnecessarily restrictive.
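To make the current asymmetry concrete, here is a short sketch (using only today's API) showing that as_index only affects reductions while group_keys only affects apply:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

# as_index controls where the keys go for reductions...
by_index = df.groupby("a", as_index=True).sum()   # keys in the index
by_col = df.groupby("a", as_index=False).sum()    # keys as a column

# ...but is ignored by transforms:
t1 = df.groupby("a", as_index=True)["b"].cumsum()
t2 = df.groupby("a", as_index=False)["b"].cumsum()
assert t1.equals(t2)

# group_keys, on the other hand, only matters for apply:
a1 = df.groupby("a", group_keys=True).apply(lambda g: g["b"].cumsum())
a2 = df.groupby("a", group_keys=False).apply(lambda g: g["b"].cumsum())
assert a1.index.nlevels == 2  # group keys prepended as an index level
assert a2.index.nlevels == 1  # input's index comes through unchanged
```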

I propose we

  • Deprecate both as_index and group_keys
  • Add keys_axis to both DataFrame.groupby and DataFrameGroupBy.apply; these take the same arguments. The only difference is that the value in DataFrameGroupBy.apply, if specified, overrides the value in DataFrame.groupby.

keys_axis can accept the following values:

  • "infer" (the default): One of the following behaviors, inferred from the computation depending on whether it is a reduction, transform, or filter.
  • "index" or 0: Add the keys to the index (similar to as_index=True or group_keys=False)
  • "columns" or 1: Add the keys to the columns (similar to as_index=False)
  • "none": Don't add the keys to either the index or the columns. For pandas methods (e.g. sum, cumsum, head), reductions will return a RangeIndex; transforms and filters will behave as they do today, returning the input's index (or a subset of it for a filter). For apply, this will behave the same as group_keys=False does today.

Unlike as_index, this argument will be respected in all groupby functions whether they be reductions, transforms, or filters.
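As a rough illustration, the proposed values could be emulated for a reduction with a small shim over today's API. The function name and keyword here are this sketch's assumptions, mirroring the proposal, not an implemented pandas feature:

```python
import pandas as pd

def sum_with_keys_axis(df, by, keys_axis="index"):
    """Hypothetical shim: emulate the proposed keys_axis values for a
    reduction (sum) using the current as_index machinery."""
    if keys_axis in ("index", 0):
        return df.groupby(by).sum()                  # like as_index=True
    if keys_axis in ("columns", 1):
        return df.groupby(by, as_index=False).sum()  # like as_index=False
    if keys_axis == "none":
        # keys dropped entirely; the reduction gets a RangeIndex
        return df.groupby(by).sum().reset_index(drop=True)
    raise ValueError(f"invalid keys_axis: {keys_axis!r}")

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
print(sum_with_keys_axis(df, "a", keys_axis="none"))
```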

Path to implementation:

  • Add keys_axis in 2.0, and either add a PendingDeprecationWarning or a DeprecationWarning to as_index / group_keys
  • Change warnings for as_index / group_keys to a FutureWarning in 2.1
  • Enforce the deprecations in 3.0

A few natural questions come to mind:

  1. Why introduce a new argument, why not keep either as_index or group_keys?

Currently these arguments are Boolean; the new argument needs to accept more than two values, and its name should reflect that it accepts an axis. Also, adding a new argument provides a cleaner and more gradual deprecation path.

  2. Why add keys_axis to DataFrameGroupBy.apply?

In other groupby methods, we can reliably use keys_axis="infer" to determine the correct placement of the keys. However in apply, it is inferred from the output, and various cases can coincide - e.g. a reduction and transformation on a DataFrame with a single row. We want the user to be able to use "infer" on other groupby methods, but be able to specify how their UDF in apply acts. E.g.

gb = df.groupby(["a", "b"], keys_axis="infer")
print(gb.sum())  # Act as a reduction
print(gb.head())  # Act as a filter
print(gb.cumsum())  # Act as a transform
print(gb.apply(my_udf, keys_axis="index"))  # inference from the groupby call is not reliable here; allow the user to specify how apply should act
  3. Why should keys_axis accept the value "none"?

This is currently how transforms and filters work - where the keys are added to neither the index nor the columns. We need to keep the ability to specify to groupby(...).apply that the UDF they are provided acts as a transform or filter.
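For reference, the "none" placement matches what pandas transforms and filters already do today: the keys land in neither axis, and the input's index comes through:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 2, 3]}, index=[10, 11, 12])

transformed = df.groupby("a").cumsum()  # transform: input index preserved
filtered = df.groupby("a").head(1)      # filter: subset of input index

assert list(transformed.index) == [10, 11, 12]
assert list(filtered.index) == [10, 11]
assert "a" not in transformed.columns   # keys added to neither axis
```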

  4. Why not name the argument group_keys_axis?

I find "group" here redundant, but would be fine with this name too, and happy to consider other potential names.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@rhshadrach rhshadrach added Groupby API Design Needs Discussion Requires discussion from core team before further action labels Nov 5, 2022
@mroeschke
Member

Would it be feasible to instead add the new keys_axis to all the DataFrameGroupBy methods (with a sensible default & restrictions per function)? This would include deprecating as_index/group_keys

I remember having a similar API decision when deciding df.groupby(engine="numba").mean() vs df.groupby().mean(engine="numba") (current), and some pros were

  1. Easier switching between arguments. If a user wants to switch from keys_axis="columns" to keys_axis="infer" between 2 methods, they would otherwise have to create a new DataFrameGroupBy object.
  2. Less state for the DataFrameGroupBy object. Semantically, keys_axis would only impact the result of the method and not the DataFrameGroupBy object itself.

@bashtage
Contributor

bashtage commented Nov 7, 2022

Overall plan seems reasonable. I'm not sure keys_axis is very intuitive. Would axis_keys or maybe axis_labels be clearer (AFAIK the generic word, at least in typing, for the values in an index is labels)?

@WillAyd
Member

WillAyd commented Nov 7, 2022

Love the overall idea. An axis feels a little strange to me because in theory I guess you could toss an apply against any axis and then separately have a keys_axis keyword. Not sure there is a need for those two to be separate?

Stepping back we have the question of:

  1. Which axis should the computation be performed across
  2. How should we align the output of the computation

I think 1. should just continue to use the axis keyword we have throughout the library, but maybe the keyword for 2. becomes something like align_on_index = {True / False} ?

@rhshadrach
Member Author

rhshadrach commented Nov 8, 2022

@mroeschke - I definitely see the benefits of your proposal. The downside would be having to specify keys_axis for each method. I would be happy either way.

@bashtage - the argument would be specifying where the groupby keys go. I don't think this is conveyed by e.g. axis_labels="columns". Similarly, axis_keys=1 seems odd to me - as if it's stating "the axis keys are 1". On the other hand, I think saying the "keys axis is 1" makes sense.

@WillAyd

An axis feels a little strange to me because in theory I guess you could toss an apply against any axis and then separately have a key_axis keyword.

I don't follow this - can you explain in more detail?

How should we align the output of the computation

I'm not sure I see how this question is related - currently there is no option in groupby to choose whether to align or not. Maybe this is another enhancement you're thinking of?

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 8, 2022

Thanks for thinking this through and the proposal, @rhshadrach ! I added some thoughts below, but to be clear, while those might seem "critical" of the proposal, I certainly agree that we should try to clean this part of the API (those different keywords that do very similar things but not exactly and for different methods is not ideal ..)

To summarize the context for myself, the underlying behaviour (not current API) we want to control somehow is the following aspect of the groupby behaviour:

  • What to do with the group key labels: add them to the resulting dataframe?
    • Yes, add them to the output -> Then, where to add the labels?
      • As index level(s)
      • As column(s)
    • No, don't add them to the output

Some unstructured thoughts:

  • I find the "axis" terminology in keys_axis a bit strange / potentially confusing. "axis" is used here differently than in most other places where we use that term (eg reducing over a certain axis, or set_axis and similar methods) (I think this is also the confusion of @WillAyd, this issue is not related to the existing axis keyword of groupby).
    keys_axis="column" doesn't mean to set the group keys as the "columns" axis (which would only make sense when actually grouping by that axis), but only to add one entry (per group key) to that axis. The group key labels are basically always added to "axis 0" (i.e. one value for each row); the only question is whether to add them as an index level or a column (both along the same axis).
    (but I also don't directly have a better suggestion for a keyword name ..)

  • Would we allow df.groupby(.., keys_axis="none").sum() not having any key values in the result? (basically giving the result of currently doing df.groupby(..).sum().reset_index(drop=True)) (i.e. the combo of a true reducing method with group_keys=False/keys_axis="none")
    This doesn't really feel like a needed new feature (since it's already easily possible, and I don't think this is often the desired result), and so it would be expanding the behavioural options unnecessarily? (at the same time, not allowing this combo also makes the interpretation of the new keyword more complex and dependent on the actual operation that follows. And it might actually be one of the gotchas with the current group_keys keyword that it doesn't apply to reductions)

  • What would a transform or filter function (and strict transform or filter, like transform() or head(), not inferred in apply) do with keys_axis="columns"|"index"? Or would we disallow this combo?
    Those methods are currently documented to preserve the original index and order.

  • Do we need the keys_axis="none" option at all? (i.e. the current group_keys=False) You mention:

    This is currently how transforms and filters work - where the keys are added to neither the index nor the columns. We need to keep the ability to specify to groupby(...).apply that the UDF they are provided acts as a transform or filter.

    We don't necessarily "need" to keep this ability for apply. The fact that apply tries to infer whether it got a transforming function, and in that case (and only if group_keys=False was specified) will preserve the original index order, is also something that we could deprecate and remove from apply (as has come up a few times in the #34998 group_keys discussion) -> BUG: values-dependent reindex behavior in Groupby.apply #35278
    (also note that this is only for transforming functions, not filters. For filters, the keys_axis="none" would be the same as a reset_index(drop=True) afterwards, there is no reindexing of the output)
    And also related to API: Should apply be smart? #39209 on the question whether we should actually try to infer what kind of function is being applied to start with.

  • (mostly for apply()) Currently the logic for this new (and existing) keyword is "try to infer the type of function (reducer, transform, filter) -> based on this decide on how to handle group keys in the output -> but allow the user to override how to handle the group keys". Another possible logic could be to let the user override the first part, i.e. allow it to specify what kind of function is being passed, instead of allowing to override the consequence of it. Of course, that would only work if we are fine with having a fixed way how to handle the group keys depending on the type of function.


Rereading my comments, maybe my current thoughts could be summarized as: do we actually need to keep the group_keys=False optional behavior? If it is a reset_index(drop=True) away? (the only thing that doesn't make this last sentence true is #35278, but there can be other ways to provide the "transform" behaviour)
And if we wouldn't have this option, then the only remaining question is whether to set the group keys as index or as columns, and that probably would influence the API we would design.

@rhshadrach
Member Author

rhshadrach commented Nov 12, 2022

Thanks for thinking this through and the proposal, @rhshadrach ! I added some thoughts below, but to be clear, while those might seem "critical" of the proposal, I certainly agree that we should try to clean this part of the API (those different keywords that do very similar things but not exactly and for different methods is not ideal ..)

Thanks @jorisvandenbossche! Any and all thoughts (critical or not) are much appreciated. apply is a particularly hard method to reason about.

Agreed on your objection to the name (and I think @WillAyd's as well). I'm focusing on the other aspects here first, and if we find something we want to move forward with, we can workshop the name. I'll keep calling it keys_axis for now for consistency.

This doesn't really feel like a needed new feature (since it's already easily possible, and I don't think this often the desired result), and so it would be expanding the behavioural options unnecessarily? (at the same time, not allowing this combo also makes the interpretation of the new keyword more complex and dependent on the actual operation that follows. And it might actually be one of the gotchas with the current group_keys keyword that it doesn't apply)

Your parenthetical comment is indeed the reason I think it should be included, and so I disagree that it's "expanding the behavioural options unnecessarily".

What would a transform or filter function (and strict transform or filter, like transform() or head(), not inferred in apply) do with keys_axis="columns"|"index"? Or would we disallow this combo?
Those methods are currently documented to preserve the original index and order.

Users would be able to get the keys in the index or column if they so desired; the documentation would be changed. The default behavior keys_axis="infer" would remain unchanged. This is an expansion of the current features.

Rereading my comments, maybe my current thoughts could be summarized as: do we actually need to keep the group_keys=False optional behavior? If it is a reset_index(drop=True) away?

I'm not sure what this means; apply does different things based on whether the function is e.g. a reduction or transform. When you say "need to keep" here, what does removing it entail? To make this more explicit, I think it would be helpful to consider the following example:

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]}, index=[6, 7, 8])
print(df.groupby(['x', 'y', 'z'], group_keys=False).apply(lambda x: x.sum()))
print(df.groupby(['x', 'y', 'z'], group_keys=False).apply(lambda x: x.cumsum()))
print(df.groupby(['x', 'y', 'z'], group_keys=True).apply(lambda x: x.sum()))
print(df.groupby(['x', 'y', 'z'], group_keys=True).apply(lambda x: x.cumsum()))

@rhshadrach
Member Author

rhshadrach commented Nov 12, 2022

@jorisvandenbossche

Rereading my comments, maybe my current thoughts could be summarized as: do we actually need to keep the group_keys=False optional behavior? If it is a reset_index(drop=True) away?

I'm not sure what this means;

Reflecting on your comments more, I'm not sure this is what you're necessarily getting at, but it occurs to me that we have agg for aggregations and transform for transformers, so we shouldn't restrict apply to having to support either of those. If users want those behaviors, there are functions readily available for them. apply should be for functions that don't fit these molds.

To reason about this, it would be useful to have examples of usages where the UDF is neither a reducer nor a transform. @MarcoGorelli - you mentioned in #49497 (comment) that you scraped Kaggle notebooks. I'm wondering if something similar can be done here. Do you have code for this that can be shared?

@MarcoGorelli
Member

You'll need to set up an account on Kaggle, and download a token (https://github.com/Kaggle/kaggle-api#api-credentials), and then

 for i in $(kaggle kernels list --page-size 100 --sort-by 'voteCount' -p 1 --language python | awk '{ print $1 }' | tail -n +3); do kaggle kernels pull $i; done;

should work

Then you can just remove all outputs with nbstripout and grep for patterns you want

@WillAyd
Member

WillAyd commented Nov 15, 2022

Does keys_location make things any clearer? Yes I was confused before but @jorisvandenbossche did a great write up. Not sold on that as a name either but axis is definitely confusing

I also think it would be easier to borrow the terminology for accepted argument values from pivot_table (index and values rather than index and columns), as columns to me implies axis=1. I guess index and values would still be confusing if you groupby axis=1 and the keys actually do get put in the column index, but I think that still has more of a naming precedent than the current proposal
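For comparison, this is the pivot_table vocabulary being referenced: index names where the grouping labels land, and values names what gets aggregated:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# "index"/"values" terminology as used by pivot_table today:
# the key column "a" becomes the index; "b" is aggregated per group
pt = df.pivot_table(index="a", values="b", aggfunc="sum")
assert list(pt.index) == [1, 2]
assert pt.loc[1, "b"] == 7  # 3 + 4 for the a == 1 group
```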

@WillAyd
Member

WillAyd commented Nov 15, 2022

key_placement might also be an option

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 15, 2022

On the transform case

What would a transform or filter function (and strict transform or filter, like transform() or head(), not inferred in apply) do with keys_axis="columns"|"index"? Or would we disallow this combo?
Those methods are currently documented to preserve the original index and order.

Users would be able to get the keys in the index or column if they so desired; the documentation would be changed. The default behavior keys_axis="infer" would remain unchanged. This is an expansion of the current features.

Just to be very specific about the exact behaviour you are thinking of: would it be adding the original column (used as key) from the calling dataframe to the result as an index level / column, thus keeping the original order? (which is not what groupby(...).apply with group_keys=True does)
So using the following example dataframe (and using current API to mimic the result):

>>> df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
# with setting key as index
>>> df.groupby("a").transform(lambda x: x.cumsum()).set_index(df['a'], append=True).reorder_levels([1, 0], 0)
     b
a     
1 0  1
2 1  2
1 2  4
# with setting key as column
>>> df.groupby("a").transform(lambda x: x.cumsum()).assign(a=df['a'])[['a', 'b']]
   a  b
0  1  1
1  2  2
2  1  4

I assume it is the above, but to be explicit, the current group_keys=True in apply is slightly different (and I would also assume that apply with this new keyword would keep the below behaviour?)

>>> df.groupby("a", group_keys=True).apply(lambda x: x[['b']].cumsum())
     b
a     
1 0  1
  2  4
2 1  2

That said, I am still a bit skeptical whether this functionality is really "needed" for transform. Of course it would only be optional (you keep the existing behaviour by default), and so a similar argument as for the reductions can be given ("not allowing the option makes the new keyword dependent on the type of method", which can also be confusing, i.e. "why doesn't this keyword have any effect?"). But we are still expanding functionality, which needs tests, documentation, etc

I think the output shown in the "with setting key as column" example above is certainly an interesting output, repeating it here:

>>> df.groupby("a").transform(lambda x: x.cumsum()).assign(a=df['a'])[['a', 'b']]
   a  b
0  1  1
1  2  2
2  1  4

But I think the way this is typically achieved right now with transform is with something like:

>>> df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
>>> df["b"] = df.groupby("a")["b"].transform(lambda x: x.cumsum())
>>> df
   a  b
0  1  1
1  2  2
2  1  4

for which you want the current behaviour of having the result of transform() exactly indexed as the input (so not adding the key as an index level, or not as a column), otherwise the assignment won't work.

If we want to enable the above in a method-chaining manner, I think we should rather explore other options, such as adding an assign method that mimics DataFrame.assign (this is mentioned in #43902 as well). Hypothetical code example:

>>> df.groupby("a").assign(b=lambda x: x["b"].cumsum())
   a  b
0  1  1
1  2  2
2  1  4

Here, it gives the same output as a potential df.groupby("a", keys_axis="column").transform(lambda x: x.cumsum()). But that is only because there are no other columns in this tiny example.

I know this is getting a bit off-topic, but I do think we should consider both the consistency aspect (i.e. how can we make this keys-as-index/column handling consistent across the different groupby methods) as well as the actual use cases, so we are not adding API surface only for the sake of consistency but also because it is useful.
(but, I have to say that I don't have a very good idea of how transform() is typically being used in practice)

@jorisvandenbossche
Member

Rereading my comments, maybe my current thoughts could be summarized as: do we actually need to keep the group_keys=False optional behavior? If it is a reset_index(drop=True) away?

I'm not sure what this means;

Reflecting on your comments more, I'm not sure this is what you're necessarily getting at, but it occurs to me that we have agg for aggregations, transform for transformers, so we shouldn't restrict apply to having to support either one of those. If users want those behaviors, there are functions readily for them. apply should be for functions that don't fit these molds.

@rhshadrach yes, that is certainly related.
To illustrate what I meant with "group_keys=False behaviour is only a reset_index(drop=True) away", using the example dataframe you used for the non-reduction case:

>>> df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]}, index=[6, 7, 8])
>>> df.groupby(['x', 'y', 'z'], group_keys=True).apply(lambda x: x.cumsum())
     a  b
x 6  0  3
y 7  1  4
z 8  2  5
>>> df.groupby(['x', 'y', 'z'], group_keys=False).apply(lambda x: x.cumsum())
   a  b
6  0  3
7  1  4
8  2  5

This behaviour of group_keys=False (or keys_axis="none") can in this case be achieved with a reset_index as well:

>>> df.groupby(['x', 'y', 'z'], group_keys=True).apply(lambda x: x.cumsum()).reset_index(level=0, drop=True)
   a  b
6  0  3
7  1  4
8  2  5

Given that this is easy to do if you want this specific output, is it worth having a group_keys keyword (or keys_axis="none" option) to start with?
(one reason is that it will be a bit more efficient to not first set it as the index and then drop it, but not sure that will be a bottleneck / is worth it)

Another reason to have current group_keys for apply is because the group_keys=False is actually different than a simple reset_index (and so you won't be able to mimic that behaviour with reset_index as I did in the example above) for certain cases, when doing a transform. This is tracked in this issue: #35278
The reason that doesn't show up in the example above is because this reindexing behaviour doesn't matter in that case (the original dataframe is already sorted). But using the example I used in the previous comment:

>>> df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
>>> df.groupby("a", group_keys=True).apply(lambda x: x[['b']].cumsum())
     b
a     
1 0  1
  2  4
2 1  2
>>> df.groupby("a", group_keys=False).apply(lambda x: x[['b']].cumsum())
   b
0  1
1  2
2  4
# mimic group_keys=False result with reset_index gives different order
>>> df.groupby("a", group_keys=True).apply(lambda x: x[['b']].cumsum()).reset_index(level=0, drop=True)
   b
0  1
2  4
1  2

But personally I would rather get rid of this special case in the group_keys=False behaviour of apply, and so would prefer to not replicate this special case in a keys_axis="none". And at the same time if we get rid of that special case, it makes the use case of group_keys=False less useful to start with. And that's indeed related to "apply should be for functions that don't fit these molds." -> i.e. if you want the original-order-preserving behaviour, use transform and not apply.
