REF: Compute complete result_index upfront in groupby #55738
Conversation
```diff
@@ -1204,7 +1204,7 @@ def test_value_counts_sort_categorical(sort, vc_sort, normalize):
     elif not sort and vc_sort:
         taker = [0, 2, 1, 3]
     else:
-        taker = [2, 3, 0, 1]
+        taker = [2, 1, 0, 3]
```
Ref: #56016 (comment)
@jbrockmendel @mroeschke - this is now ready

Does this happen to fix #35202?

No, see the bottom half of #55738 (comment).
LGTM
Thanks @rhshadrach
@rhshadrach is the following change in behaviour intentional?

```python
df = pd.DataFrame(
    {
        "key_cat": pd.Categorical(["a", "a", "b", "b"]),
        "key_noncat": [1, 1, 1, 2],
        "values": [1.0, 2.0, 3.0, 4.0],
    }
)
df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
```

Using released pandas, this injected NaNs for the unobserved values:

```python
# using pandas 2.2.0
>>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
key_cat  key_noncat
a        1             3.0
         2             NaN
b        1             3.0
         2             4.0
Name: values, dtype: float64
```

versus after (I assume) this PR on pandas main:

```python
# using pandas main
>>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
key_cat  key_noncat
a        1             3.0
         2             0.0
b        1             3.0
         2             4.0
Name: values, dtype: float64
```

The reason this now gives a 0 is that this actually calls the UDF on the empty group, and the sum of an empty Series is 0 (while before, the UDF was never called for the unobserved group). I don't think I have a strong opinion on what the behaviour should be, but if it is intentional, this probably warrants a whatsnew note (it was confusing to debug my failing tests). Apart from the different result, this also means that your UDF needs to be able to handle empty input (before, the UDF essentially never got empty input for those cases, I think?)
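The 0.0 described above follows directly from pandas' empty-reduction semantics: the sum of an empty Series is 0, so once the UDF actually runs on an unobserved (empty) group it returns 0.0 rather than the NaN that post-hoc reindexing used to inject. A minimal check of that behavior:

```python
import pandas as pd

# The sum of an empty Series is 0.0, so a UDF like `lambda x: x.sum()`
# returns 0.0 for an unobserved (empty) group once it is actually
# called on that group.
empty = pd.Series([], dtype="float64")
print(empty.sum())   # 0.0
print(empty.empty)   # True: UDFs must now tolerate empty input
```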
I also see that your top post has a long list of changes / fixes. I think those are mostly added to the whatsnew, but it might be clearer to have a separate section explaining the general change (unobserved groups are now properly included through all APIs), and then list some of the consequences of it (instead of those bullet points that will become part of a large list of bullet points in the groupby section).
Thanks @jorisvandenbossche - indeed this is intentional; xref #36698. I'll |
Commits:

* REF: Compute correct result_index upfront in groupby
* Refinements
* Refinements
* Refinements
* Restore inferring index dtype
* Test fixups
* Refinements
* Refinements
* fixup
* fixup
* fixup
* Fix sorting and non-sorting
* Cleanup
* Call ensure_plantform_int last
* fixup
* fixup
* REF: Compute correct result_index upfront in groupby
* Add test
* Remove test
* Move unobserved to the end
* cleanup
* cleanup
* cleanup
* Merge fixup
* fixup
* fixup
* Fixup and test
* whatsnew
* type ignore
* Refactor & type annotations
* Better bikeshed
groupby method #55919

Just a PoC at this point, need to try to move behavior changes out of here as much as possible.

The main change this makes is that it moves the computation of unobserved groups upfront. Currently, we only include unobserved groups upfront if there is a single grouping (e.g. `df.groupby('a')`) but not two or more (e.g. `df.groupby(['a', 'b'])`). When there are two or more, we go through the groupby computations with only observed groups, and then tack on the unobserved groups at the end. By always including unobserved groups upfront, we can simplify the logic in the groupby code. Having unobserved groups sometimes included and sometimes not included upfront is also a footgun.

In order to make this change, I found I needed to rework a bit of how NA values are handled in `Grouping._codes_and_uniques`. This in turn fixed some NA bugs.

This PR fixes a number of issues that stem from this (shown in the test changes):
- `groupby.apply` if there is more than one grouping
- `groupby.groups`
- `len(df.groupby(..., dropna=True))` counts groups that have NA values
- `df.groupby(..., dropna=True).groups` has groups that have NA values
- `.all` for empty groups was `np.nan` but should be `True`
- `.any` for empty groups was `np.nan` but should be `False`
- `.prod` for empty groups was `np.nan` but should be `1`
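The empty-group values in the last three bullets follow from the standard reduction identities (all of an empty set is vacuously true, any is false, and an empty product is 1). A quick check on plain empty Series, relying only on ordinary pandas reduction behavior:

```python
import pandas as pd

# Reduction identities on empty input; these are the values an
# unobserved (empty) group should produce instead of np.nan.
empty = pd.Series([], dtype="float64")
print(empty.all())    # True  (vacuous truth)
print(empty.any())    # False
print(empty.prod())   # 1.0   (empty product)
```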
In addition, it allows us to remove various similar objects throughout groupby:

- `BaseGrouper.group_keys_seq`
- `BaseGrouper.reconstructed_codes`
- `Grouping.group_index`
- `Grouping.result_index`
- `Grouping.group_arraylike`
- `BinGrouper.reconstructed_codes`
and a few methods:

- `GroupBy._reindex_output`
- `BaseGrouper._sort_idx`
- `BaseGrouper._get_compressed_codes`
But it does add `BaseGrouper.result_index_and_codes`. This computes the (aggregated) result index that takes into account `dropna` and `observed`, along with the codes for the groups themselves.

ASVs: no performance regressions (with the standard 10% cutoff) other than groupby transform operations with multiple categorical groupings.
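As a rough illustration of what computing the complete result index upfront involves (this is a toy sketch, not pandas' actual implementation): with `observed=False`, the result index over multiple categorical groupings is essentially the cartesian product of each grouping's categories, built before any aggregation runs rather than tacked on afterwards. Here `levels` stands in for the per-grouping categories:

```python
from itertools import product

def full_result_index(levels):
    """Toy sketch: with observed=False, the complete result index is the
    cartesian product of every grouping's categories, including
    combinations that never appear in the data."""
    return list(product(*levels))

# Two groupings: a categorical with categories ['a', 'b'] and an
# integer key with values [1, 2] -> 4 groups upfront, observed or not.
index = full_result_index([["a", "b"], [1, 2]])
print(index)  # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```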
I haven't verified this, but I believe this would also make #55261 trivial to implement just by changing a few lines of `result_index_and_ids`.