BUG: values-dependent reindex behavior in Groupby.apply #35278

Open
TomAugspurger opened this issue Jul 14, 2020 · 0 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby

Comments

Contributor

TomAugspurger commented Jul 14, 2020

xref #34998 (comment).

Currently (on master and in my PR at #34998), .groupby(...).apply has some value-dependent behavior. Specifically:

In [1]: import pandas as pd

In [2]: df1 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 3]})
   ...: df2 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 1]})  # duplicates in group "2"

In [3]: df1.groupby("A", group_keys=False).apply(lambda x: x.drop_duplicates())
Out[3]:
   A  B
0  2  1
1  1  2
2  2  3

In [4]: df2.groupby("A", group_keys=False).apply(lambda x: x.drop_duplicates())
Out[4]:
   A  B
1  1  2
0  2  1

Internally, groupby constructs a list of DataFrames, one per group, that are the results of the UDF applied to each group. Those are concatenated together, and are at this point in "group" order. If we detect that the .apply was actually a transform, we reindex the concatenated result back to the original index.

    if not not_indexed_same:
        result = concat(values, axis=self.axis)
        ax = self._selected_obj._get_axis(self.axis)

        # this is a very unfortunate situation
        # we can't use reindex to restore the original order
        # when the ax has duplicates
        # so we resort to this
        # GH 14776, 30667
        if ax.has_duplicates:
            indexer, _ = result.index.get_indexer_non_unique(ax.values)
            indexer = algorithms.unique1d(indexer)
            result = result.take(indexer, axis=self.axis)
        else:
            result = result.reindex(ax, axis=self.axis)

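Out[3] was treated as a transform, and so was reindexed back to the original row order. Out[4] was not, because drop_duplicates removed a row, so it stayed in group order. Whether the UDF counts as a transform therefore depends on the values, and we generally discourage this kind of value-dependent behavior.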
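To make the mechanism concrete, here is a minimal sketch that reproduces both outputs without calling .apply. The helper name `apply_like_groupby` is hypothetical, and the "every piece kept its group's index" check is an assumed stand-in for the internal `not_indexed_same` logic, not the actual pandas implementation. It also assumes a unique index, so it skips the `has_duplicates` branch.

```python
import pandas as pd


def apply_like_groupby(df, key, func):
    """Hypothetical helper: concat per-group results, reindex only for transforms."""
    # Groups come back in sorted key order, not original row order.
    groups = [group for _, group in df.groupby(key)]
    pieces = [func(group) for group in groups]

    # Assumed stand-in for the internal check: the UDF "looks like a
    # transform" only if every result kept its group's index unchanged.
    looks_like_transform = all(
        piece.index.equals(group.index) for piece, group in zip(pieces, groups)
    )

    result = pd.concat(pieces)  # still in group order at this point
    if looks_like_transform:
        # Restore the original row order (assumes a unique index).
        result = result.reindex(df.index)
    return result


df1 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 1]})  # duplicates in group "2"

out1 = apply_like_groupby(df1, "A", lambda x: x.drop_duplicates())
out2 = apply_like_groupby(df2, "A", lambda x: x.drop_duplicates())
print(out1)  # original row order, because nothing was dropped
print(out2)  # group order, because one row was dropped
```

The same UDF yields two different orderings purely because of the data in column "B", which is the behavior this issue is about.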

To solve this, we have a few options:

  1. Implement a "table-wise" transform. This solves the use case where people are using .apply rather than transform just because it operates on dataframes rather than columns. We could do this through .groupby(..., axis=None).transform() or through .groupby(...).transform_table() / transform_frame(). This doesn't help with the drop_duplicates example, which is more of a filter (that sometimes doesn't filter anything).
  2. Implement a "table-wise" filter. Currently .groupby().filter() expects the UDF to return a scalar, and filters groups based on that. It could be expanded to also allow the UDF to return an array. In that case it would keep the rows where the returned value evaluates to True. This would solve the drop_duplicates use case, but not all use cases.
  3. Regardless of whether 1 or 2 are implemented, add a reindex_output keyword to groupby to control this very narrow case. This would only be relevant when group_keys=False and we've detected an apply. It gives users control over whether or not the result is reindexed.
>>> df1.groupby("A", group_keys=False, reindex_output=True).apply(lambda x: x.drop_duplicates())
   A  B
0  2  1
1  1  2
2  2  3

>>> df1.groupby("A", group_keys=False, reindex_output=False).apply(lambda x: x.drop_duplicates())
   A  B
1  1  2
0  2  1
2  2  3

It has no effect in any other case, including group_keys=True. It could default to None to preserve the value-dependent behavior on master, though we can explore deprecating that behavior if there is interest.
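For comparison, the row-level filter idea from option 2 can be approximated today with a boolean mask. This is only a sketch using existing API, not the proposed extension to .filter(): the mask is built with a per-group transform and then applied to the original frame, so the result stays in original row order whether or not any rows are dropped.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 1]})  # duplicates in group "2"


def drop_group_duplicates(df):
    """Hypothetical helper: keep the first occurrence of each "B" within each "A" group."""
    # transform returns one value per original row, aligned to df.index.
    is_dup = df.groupby("A")["B"].transform(lambda s: s.duplicated()).astype(bool)
    # Boolean indexing never reorders rows, so the output order is value-independent.
    return df[~is_dup]


kept1 = drop_group_duplicates(df1)
kept2 = drop_group_duplicates(df2)
print(kept1)
print(kept2)
```

Because the mask is applied to the original frame, there is no concat-then-reindex step, and so no decision that depends on whether the UDF happened to drop anything.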

@TomAugspurger added the Bug, Needs Triage, API Design, and Groupby labels and removed the Bug and Needs Triage labels on Jul 14, 2020
@mroeschke added the Apply and Bug labels and removed the API Design label on Aug 8, 2021