BUG: values-dependent reindex behavior in Groupby.apply #35278

TomAugspurger · 2020-07-14T21:01:29Z

Currently (on master and in my PR at #34998), .groupby(...).apply has some values-dependent behavior. Specifically:

In [1]: import pandas as pd

In [2]: df1 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 3]})
   ...: df2 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 1]})  # duplicates in group "2"

In [3]: df1.groupby("A", group_keys=False).apply(lambda x: x.drop_duplicates())
Out[3]:
   A  B
0  2  1
1  1  2
2  2  3

In [4]: df2.groupby("A", group_keys=False).apply(lambda x: x.drop_duplicates())
Out[4]:
   A  B
1  1  2
0  2  1

Internally, groupby constructs a list of DataFrames, one per group, that are the results of the UDF applied to each group. Those are concatenated together, and are at this point in "group" order. If we detect that the .apply was actually a transform, we reindex the concatenated result back to the original index.

pandas/pandas/core/groupby/groupby.py

Lines 1114 to 1128 in b6222ec

    
    if not not_indexed_same:
        result = concat(values, axis=self.axis)
        ax = self._selected_obj._get_axis(self.axis)

        # this is a very unfortunate situation
        # we can't use reindex to restore the original order
        # when the ax has duplicates
        # so we resort to this
        # GH 14776, 30667
        if ax.has_duplicates:
            indexer, _ = result.index.get_indexer_non_unique(ax.values)
            indexer = algorithms.unique1d(indexer)
            result = result.take(indexer, axis=self.axis)
        else:
            result = result.reindex(ax, axis=self.axis)

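Out[3] was treated as a transform and so was reindexed back to the original order. Out[4] was not, because drop_duplicates removed a row from group 2, so the result stays in group order. Whether or not the UDF is treated as a transform depends on the values, and we generally discourage this type of values-dependent behavior.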
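The steps described above can be mimicked outside of pandas internals. The following is a minimal sketch, not the actual implementation: `apply_like` is a hypothetical helper, it assumes a unique index, and it approximates the "was this a transform?" check by comparing each result's index with its group's index.

```python
import pandas as pd


def apply_like(df, by, func):
    """Simplified model of concat-then-maybe-reindex (hypothetical helper)."""
    pieces = []
    indexed_same = True
    for _, group in df.groupby(by):
        piece = func(group)
        # A transform-like result keeps exactly the index of its input group.
        indexed_same = indexed_same and piece.index.equals(group.index)
        pieces.append(piece)
    result = pd.concat(pieces)  # rows are in group order at this point
    if indexed_same:
        # Only reached when no group changed its index, which depends on the data.
        result = result.reindex(df.index)
    return result


df1 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 1]})  # duplicates in group "2"

out1 = apply_like(df1, "A", lambda x: x.drop_duplicates())
out2 = apply_like(df2, "A", lambda x: x.drop_duplicates())
print(list(out1.index))  # original order
print(list(out2.index))  # group order
```

With the same UDF, `out1` comes back in the original row order while `out2` stays in group order, which is the values-dependent switch this issue is about.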

To solve this, we have a few options:

1. Implement a "table-wise" transform. This solves the use case where people are using .apply rather than transform just because it operates on dataframes rather than columns. We could do this through .groupby(..., axis=None).transform() or through .groupby(...).transform_table() / transform_frame(). This doesn't help with the drop_duplicates example, which is more of a filter (that sometimes doesn't filter anything).
2. Implement a "table-wise" filter. Currently .groupby().filter() expects the UDF to return a scalar, and filters groups based on that. It could be expanded to also allow the UDF to return an array. In this case it would keep the rows where the returned value evaluates to True. This would solve the drop_duplicates use case, but not all use cases.
3. Regardless of whether 1 or 2 are implemented, add a reindex_output keyword to groupby to control this very narrow case. This would only be relevant when group_keys=False and we've detected an apply. It gives users control over whether or not the result is reindexed.

>>> df1.groupby("A", group_keys=False, reindex_output=True).apply(lambda x: x.drop_duplicates())
   A  B
0  2  1
1  1  2
2  2  3

>>> df1.groupby("A", group_keys=False, reindex_output=False).apply(lambda x: x.drop_duplicates())
   A  B
1  1  2
0  2  1
2  2  3

It has no effect in any other case, including group_keys=True. By default it can be None, which preserves the values-dependent behavior on master, though we can explore deprecating it if there's any desire.
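For comparison, the row-wise filter from option 2 could look roughly like the sketch below. `filter_rows` is a hypothetical helper written for illustration, not an existing or proposed pandas API; it assumes a unique index and a UDF that returns a boolean Series indexed like its group.

```python
import pandas as pd


def filter_rows(df, by, func):
    """Hypothetical row-wise group filter: keep rows where func(group) is True."""
    masks = [func(group) for _, group in df.groupby(by)]
    # Align the per-group masks back to the original row order.
    mask = pd.concat(masks).reindex(df.index)
    return df[mask]


def keep_first(group):
    # True for the first occurrence of each row within the group.
    return ~group.duplicated()


df1 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": [2, 1, 2], "B": [1, 2, 1]})  # duplicates in group "2"

res1 = filter_rows(df1, "A", keep_first)
res2 = filter_rows(df2, "A", keep_first)
print(list(res1.index))
print(list(res2.index))
```

Because the result is always selected from the original frame with a boolean mask, the row order does not depend on whether any rows happened to be dropped.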


TomAugspurger added the Bug, Needs Triage, API Design, and Groupby labels and removed the Bug and Needs Triage labels on Jul 14, 2020

mroeschke added the Apply and Bug labels and removed the API Design label on Aug 8, 2021

jorisvandenbossche mentioned this issue Nov 8, 2022

API: Consolidate groupby as_index and group_keys #49543

Open
