Internally, groupby constructs a list of DataFrames, one per group, that are the results of the UDF applied to each group. Those are concatenated together, and are at this point in "group" order. If we detect that the .apply was actually a transform, we reindex the concatenated result back to the original index.
Out[3] has been viewed as a transform and so was reindexed. Whether or not the UDF was a transform depends on the values, and we generally discourage this type of values-dependent behavior.
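The session that Out[3] refers to is not reproduced here. As a stand-in, here is a minimal sketch of the same kind of values-dependent result. The data, the external key Series, and the dedupe helper are all made up for illustration, and the exact ordering may differ across pandas versions.

```python
import pandas as pd

# Hypothetical data: group labels are interleaved so that "group" order
# (all of "a", then all of "b") differs from the original row order.
key = pd.Series(["b", "a", "b", "a"], name="key")
no_dups = pd.DataFrame({"x": [1, 2, 3, 4]})
with_dups = pd.DataFrame({"x": [1, 2, 1, 4]})


def dedupe(group):
    # Illustrative UDF: behaves like a transform only when nothing is dropped.
    return group.drop_duplicates()


# Nothing is dropped, so every per-group result keeps its group's index and
# the apply is treated as a transform.
out_no_dups = no_dups.groupby(key, group_keys=False).apply(dedupe)

# One row of group "b" is dropped, so the apply is not treated as a transform.
out_with_dups = with_dups.groupby(key, group_keys=False).apply(dedupe)

print(list(out_no_dups.index))
print(list(out_with_dups.index))
```

The same UDF and the same grouping give a result in original order in one case and in group order in the other, purely because of the values in x.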
To solve this, we have a few options:
1. Implement a "table-wise" transform. This solves the use case where people are using .apply rather than transform just because it operates on DataFrames rather than columns. We could do this through .groupby(..., axis=None).transform() or through .groupby(...).transform_table() / transform_frame(). This doesn't help with the drop_duplicates example, which is more of a filter (that sometimes doesn't filter anything).
2. Implement a "table-wise" filter. Currently .groupby().filter() expects the UDF to return a scalar, and filters groups based on that. It could be expanded to also allow the UDF to return an array, in which case it would keep the rows where the returned value evaluates to True. This would solve the drop_duplicates use case, but not all use cases.
3. Regardless of whether 1 or 2 are implemented, add a reindex_output keyword to groupby to control this very narrow case. This would only be relevant when group_keys=False and we've detected a transform. It gives users control over whether or not the result is reindexed. It has no effect in any other case, including group_keys=True. By default, it can be None to preserve the values-dependent behavior on master, though we can explore deprecating that default if there's any desire.
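To make the filter option concrete, here is a rough sketch contrasting what .groupby().filter() does today (one boolean per group) with a row-wise result emulated through a per-group boolean mask. The row-wise part is only an emulation built from existing calls, not a proposed implementation, and the frame and column names are invented for the example.

```python
import pandas as pd

# Hypothetical frame: group "a" contains a duplicated x value, group "b" does not.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": [1, 1, 2, 3]})

# Today: the UDF returns a single boolean per group, so whole groups are
# kept or dropped.
kept_groups = df.groupby("key").filter(lambda g: g["x"].nunique() > 1)

# Row-wise emulation: build one boolean per row within each group, then mask
# the original frame. Boolean Series indexing aligns on the index, so the
# order of the mask does not matter here.
keep_row = df.groupby("key", group_keys=False)["x"].apply(lambda s: ~s.duplicated())
kept_rows = df[keep_row]

print(kept_groups)
print(kept_rows)
```

Masking the original frame sidesteps the reindexing question entirely, since the rows come back in their original order regardless of how the per-group pieces were concatenated.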
xref #34998 (comment).

Currently (on master and in my PR at #34998), .groupby(...).apply has some value-dependent behavior, specifically the reindexing described above. The relevant code is pandas/pandas/core/groupby/groupby.py, lines 1114 to 1128 in b6222ec.