REF: Compute complete result_index upfront in groupby #55738

rhshadrach · 2023-10-27T21:24:37Z

~~Just a PoC at this point, need to try to move behavior changes out of here as much as possible.~~

The main change this makes is it moves the computation of unobserved groups upfront. Currently, we only include unobserved groups upfront if there is a single grouping (e.g. df.groupby('a')) but not two or more (e.g. df.groupby(['a', 'b'])). When there are two or more, we go through the groupby computations with only observed groups, and then tack on the unobserved groups at the end. By always including unobserved groups upfront, we can simplify the logic in the groupby code. Having unobserved groups sometimes included and sometimes not included upfront is also a footgun.
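As a small illustration of what "unobserved groups" means here, the sketch below groups on categorical keys with observed=False. The frame and its column names are made up for the example, and it only shows the public result, not the internal code path that this PR reworks.

```python
import pandas as pd

# Illustrative data: category "z" never occurs in column "a".
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"]),
        "b": pd.Categorical(["p", "q", "p"], categories=["p", "q"]),
        "v": [1, 2, 3],
    }
)

# Single grouping: the unobserved category "z" appears in the result.
single = df.groupby("a", observed=False)["v"].sum()

# Two groupings: the result covers every combination of categories,
# including combinations that never occur in the data.
multi = df.groupby(["a", "b"], observed=False)["v"].sum()

print(single)
print(multi)
```

For a built-in reduction such as sum, the unobserved entries should come out as 0 whether they are included upfront or added at the end; the difference discussed in this PR is where in the pipeline they get added.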

In order to make this change, I found I needed to rework a bit of how NA values are handled in Grouping._codes_and_uniques. This in turn fixed some NA bugs.
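For readers unfamiliar with the "codes and uniques" representation, here is a minimal sketch using the public pd.factorize function. It is only an analogy for how NA values can either get a sentinel code or a code of their own; it is not the implementation of Grouping._codes_and_uniques.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, np.nan, 1.0, 2.0])

# NA gets the sentinel code -1 and is left out of the uniques
# (analogous to dropna=True).
codes_drop, uniques_drop = pd.factorize(values, use_na_sentinel=True)

# NA gets its own non-negative code and appears in the uniques
# (analogous to dropna=False).
codes_keep, uniques_keep = pd.factorize(values, use_na_sentinel=False)

print(codes_drop, uniques_drop)
print(codes_keep, uniques_keep)
```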

This PR fixes a number of issues that stem from this (shown in the test changes):

- Unobserved groups are not included in groupby.apply if there is more than one grouping
- Unobserved groups do not appear in groupby.groups
- When there are multiple groupings, len(df.groupby(..., dropna=True)) counts groups that have NA values
- When there are multiple groupings, df.groupby(..., dropna=True).groups has groups that have NA values
- The value of .all for empty groups was np.nan but should be True
- The value of .any for empty groups was np.nan but should be False
- The value of .prod for empty groups was np.nan but should be 1
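The expected values for empty groups in the last three items match what the same reductions return on an empty Series, which is the identity element of each operation. A minimal check, independent of groupby:

```python
import pandas as pd

# An empty float Series stands in for an unobserved (empty) group.
empty = pd.Series([], dtype="float64")

empty_all = empty.all()    # identity for logical "and"
empty_any = empty.any()    # identity for logical "or"
empty_prod = empty.prod()  # identity for multiplication

print(empty_all, empty_any, empty_prod)
```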

In addition, it allows us to remove various similar objects throughout groupby:

- BaseGrouper.group_keys_seq
- BaseGrouper.reconstructed_codes
- Grouping.group_index
- Grouping.result_index
- Grouping.group_arraylike
- BinGrouper.reconstructed_codes

and a few methods:

- GroupBy._reindex_output
- BaseGrouper._sort_idx
- BaseGrouper._get_compressed_codes

But it does add BaseGrouper.result_index_and_ids. This computes the (aggregated) result index, taking dropna and observed into account, along with the codes for the groups themselves.
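To make the idea of "a complete result index plus codes" concrete, here is a rough sketch for the special case where every key is categorical. The helper name complete_index_and_ids is hypothetical and this is not the code added by the PR; it only shows how a full product index and one integer id per row can be computed upfront, with NA rows marked as -1.

```python
import numpy as np
import pandas as pd


def complete_index_and_ids(keys):
    """Hypothetical helper: full product index and per-row group ids."""
    levels = [k.categories for k in keys]
    codes = [np.asarray(k.codes, dtype=np.intp) for k in keys]
    shape = tuple(len(level) for level in levels)

    # Every combination of categories, observed or not.
    result_index = pd.MultiIndex.from_product(levels)

    # A row with NA in any key (code -1) belongs to no group.
    na_mask = np.any([c < 0 for c in codes], axis=0)
    safe_codes = tuple(np.where(na_mask, 0, c) for c in codes)

    # Flatten the per-key codes into a position in result_index.
    ids = np.ravel_multi_index(safe_codes, shape)
    ids = np.where(na_mask, -1, ids)
    return result_index, ids


a = pd.Categorical(["x", "x", "y", None], categories=["x", "y", "z"])
b = pd.Categorical(["p", "q", "p", "p"], categories=["p", "q"])
values = np.array([1.0, 2.0, 3.0, 4.0])

result_index, ids = complete_index_and_ids([a, b])

# Aggregate directly into the complete index; unobserved groups get 0.
valid = ids >= 0
sums = np.bincount(ids[valid], weights=values[valid], minlength=len(result_index))
print(pd.Series(sums, index=result_index))
```

With the ids known upfront, an aggregation can write straight into an array of the final length, so no reindexing step is needed afterwards.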

ASVs: no performance regressions (with the standard 10% cutoff) other than groupby transform operations with multiple categorical groupings.

| Change   | Before [2b67593b] <test>   | After [e285742a] <gb_observed_pre>   |   Ratio | Benchmark (Parameter)                                                                                              |
|----------|----------------------------|--------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------|
| +        | 308±4μs                    | 1.25±0ms                             |    4.08 | groupby.MultipleCategories.time_groupby_transform                                                                  |
| -        | 5.10±0.04ms                | 4.61±0.06ms                          |    0.91 | groupby.Apply.time_scalar_function_multi_col(5)                                                                    |
| -        | 239±4μs                    | 216±0.2μs                            |    0.9  | groupby.Categories.time_groupby_sort(False)                                                                        |
| -        | 60.1±2μs                   | 53.8±0.4μs                           |    0.9  | groupby.GroupByMethods.time_dtype_as_field('float', 'size', 'direct', 1, 'cython')                                 |
| -        | 241±5μs                    | 215±1μs                              |    0.89 | groupby.Categories.time_groupby_ordered_sort(False)                                                                |
| -        | 556±7μs                    | 488±60μs                             |    0.88 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mul>)       |
| -        | 235±5μs                    | 205±2μs                              |    0.87 | groupby.Categories.time_groupby_extra_cat_sort(False)                                                              |
| -        | 419±8μs                    | 356±10μs                             |    0.85 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function le>)        |
| -        | 549±10μs                   | 461±50μs                             |    0.84 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function truediv>) |
| -        | 21.7±0.7μs                 | 18.1±0.06μs                          |    0.83 | groupby.GroupByMethods.time_dtype_as_field('uint', 'count', 'direct', 1, 'cython')                                 |
| -        | 22.5±0.3μs                 | 18.6±0.2μs                           |    0.83 | groupby.GroupByMethods.time_dtype_as_group('float', 'count', 'direct', 1, 'cython')                                |
| -        | 22.2±0.2μs                 | 18.4±0.8μs                           |    0.83 | groupby.GroupByMethods.time_dtype_as_group('uint', 'count', 'direct', 1, 'cython')                                 |
| -        | 22.3±0.8μs                 | 18.3±0.1μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'direct', 1, 'cython')                                |
| -        | 21.6±0.9μs                 | 17.8±0.3μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_field('int16', 'count', 'direct', 1, 'cython')                                |
| -        | 22.0±0.6μs                 | 18.1±0.3μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_group('int', 'count', 'direct', 1, 'cython')                                  |
| -        | 21.9±0.7μs                 | 17.8±0.3μs                           |    0.81 | groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct', 1, 'cython')                                  |
| -        | 22.0±0.2μs                 | 17.9±0.07μs                          |    0.81 | groupby.GroupByMethods.time_dtype_as_group('int16', 'count', 'direct', 1, 'cython')                                |
| -        | 21.6±0.1μs                 | 17.0±0.06μs                          |    0.79 | groupby.GroupByMethods.time_dtype_as_group('object', 'count', 'direct', 1, 'cython')                               |
| -        | 520±50μs                   | 370±30μs                             |    0.71 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function sub>)         |
| -        | 13.7±1ms                   | 1.80±0ms                             |    0.13 | groupby.MultipleCategories.time_groupby_nosort                                                                     |
| -        | 15.8±2ms                   | 1.73±0ms                             |    0.11 | groupby.MultipleCategories.time_groupby_extra_cat_nosort                                                           |
| -        | 13.7±0.2ms                 | 1.57±0.01ms                          |    0.11 | groupby.MultipleCategories.time_groupby_ordered_nosort                                                             |
| -        | 33.6±0.4ms                 | 1.40±0ms                             |    0.04 | groupby.MultipleCategories.time_groupby_sort                                                                       |
| -        | 35.5±1ms                   | 1.23±0ms                             |    0.03 | groupby.MultipleCategories.time_groupby_extra_cat_sort                                                             |
| -        | 33.3±0.8ms                 | 1.15±0ms                             |    0.03 | groupby.MultipleCategories.time_groupby_ordered_sort                                                               |

I haven't verified this, but I believe this would also make #55261 trivial to implement just by changing a few lines of result_index_and_ids.

rhshadrach · 2024-02-02T23:44:07Z

pandas/tests/groupby/methods/test_value_counts.py

@@ -1204,7 +1204,7 @@ def test_value_counts_sort_categorical(sort, vc_sort, normalize):
    elif not sort and vc_sort:
        taker = [0, 2, 1, 3]
    else:
-        taker = [2, 3, 0, 1]
+        taker = [2, 1, 0, 3]


Ref: #56016 (comment)

rhshadrach · 2024-02-03T12:54:48Z

@jbrockmendel @mroeschke - this is now ready

mroeschke · 2024-02-03T20:10:44Z

Does this happen to fix #35202?

rhshadrach · 2024-02-04T13:33:57Z

> Does this happen to fix #35202?

No, see the bottom half of #55738 (comment).

mroeschke

LGTM

mroeschke · 2024-02-07T05:01:45Z

Thanks @rhshadrach

jorisvandenbossche · 2024-02-10T18:00:19Z

@rhshadrach is the following change in behaviour intentional?

import pandas as pd

df = pd.DataFrame(
    {
        "key_cat": pd.Categorical(["a", "a", "b", "b"]),
        "key_noncat": [1, 1, 1, 2],
        "values": [1.0, 2.0, 3.0, 4.0],
    }
)
df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())

Using released pandas, this injected NaNs for the unobserved values:

# using pandas 2.2.0
>>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
key_cat  key_noncat
a        1             3.0
         2             NaN
b        1             3.0
         2             4.0
Name: values, dtype: float64

versus after (I assume) this PR on pandas main:

# using pandas main
>>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
key_cat  key_noncat
a        1             3.0
         2             0.0
b        1             3.0
         2             4.0
Name: values, dtype: float64

The reason this now gives a 0 is because this actually calls the UDF on the empty group, and the sum of an empty Series is 0 (while before, the UDF was never called for the unobserved group).
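The "sum of an empty Series is 0" part can be checked on its own. The sketch below also shows min_count=1, an existing parameter of Series.sum that returns NaN when there are no valid values; it is shown as one possible option, not something proposed in this thread.

```python
import numpy as np
import pandas as pd

# An empty float Series stands in for the unobserved group.
empty = pd.Series([], dtype="float64")

total = empty.sum()                  # default min_count=0
total_min1 = empty.sum(min_count=1)  # requires at least one valid value

print(total, total_min1)
```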

I don't think I have a strong opinion on what the behaviour should be, but if it is intentional, this probably warrants a whatsnew note (it was confusing to debug my failing tests).

Apart from the different result, this also means that your UDF needs to be able to handle empty input (before, the UDF essentially never got empty input in those cases, I think?).
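One way a UDF could guard against empty input is sketched below. The function name sum_or_nan is made up for illustration; it returns NaN for an empty group, so the unobserved combination should come out as NaN whether or not the UDF is called on it.

```python
import numpy as np
import pandas as pd


def sum_or_nan(x):
    """Illustrative UDF: NaN for an empty group, otherwise the sum."""
    return x.sum() if len(x) else np.nan


df = pd.DataFrame(
    {
        "key_cat": pd.Categorical(["a", "a", "b", "b"]),
        "key_noncat": [1, 1, 1, 2],
        "values": [1.0, 2.0, 3.0, 4.0],
    }
)

result = df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(sum_or_nan)
print(result)
```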

jorisvandenbossche · 2024-02-10T18:07:05Z

I also see that your top post has a long list of changes / fixes:

> This PR fixes a number of issues that stem from this (shown in the test changes)
>
> ...

I think those are mostly added to the whatsnew, but it might be clearer to have a separate section explaining the general change (unobserved groups are now properly included through all APIs), and then list some of its consequences (instead of those bullet points, which will become part of a large list of bullet points in the groupby section).

rhshadrach · 2024-02-10T18:07:16Z

Thanks @jorisvandenbossche - indeed this is intentional; xref #36698. I'll ~~add a whatsnew note~~ try to make this a notable bug fix section in a followup.

* REF: Compute correct result_index upfront in groupby
* Refinements
* Refinements
* Refinements
* Restore inferring index dtype
* Test fixups
* Refinements
* Refinements
* fixup
* fixup
* fixup
* Fix sorting and non-sorting
* Cleanup
* Call ensure_plantform_int last
* fixup
* fixup
* REF: Compute correct result_index upfront in groupby
* Add test
* Remove test
* Move unobserved to the end
* cleanup
* cleanup
* cleanup
* Merge fixup
* fixup
* fixup
* Fixup and test
* whatsnew
* type ignore
* Refactor & type annotations
* Better bikeshed

REF: Compute correct result_index upfront in groupby

e32b789

rhshadrach added the Bug, Refactor (Internal refactoring of code), Groupby, and Categorical (Categorical Data Type) labels Oct 27, 2023

rhshadrach added 13 commits October 30, 2023 19:58

Refinements

31a7c92

Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_observed_pre

5ecfbeb

Refinements

8ce08d1

Refinements

6296f4a

Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_observed_pre

68f2aeb

Restore inferring index dtype

7141425

Merge branch 'gb_observed_pre' of https://github.com/rhshadrach/pandas into gb_observed_pre (Conflicts: pandas/core/groupby/ops.py)

7f74812

Test fixups

e39cbc8

Refinements

c82bd65

Refinements

3a9892d

fixup

25770be

fixup

a338efc

fixup

dbdec9f

rhshadrach mentioned this pull request Nov 10, 2023

BUG: Group keys contain NA values despite dropping them in groupby method #55919

Closed

3 tasks

rhshadrach added 6 commits November 12, 2023 09:50

Fix sorting and non-sorting

0ae70b7

Cleanup

99d2beb

Call ensure_plantform_int last

a477dc0

fixup

7fb7ca6

fixup

b79cc85

REF: Compute correct result_index upfront in groupby

da9169d

rhshadrach force-pushed the gb_observed_pre branch from 6ee8fcb to da9169d Compare November 14, 2023 22:10

This was referenced Nov 15, 2023

DOC: Grouper.ngroups vs Grouper.group_info[2] #49980

Closed

BUG: GroupBy.value_counts sorting order #56016

Merged

rhshadrach added 2 commits November 17, 2023 16:36

Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_observed_pre

d2eee13

Merge branch 'gb_observed_pre' of https://github.com/rhshadrach/pandas into gb_observed_pre (Conflicts: pandas/core/groupby/ops.py)

b247544

fixup

dce05da

rhshadrach added 2 commits February 3, 2024 06:06

Fixup and test

72209a8

whatsnew

b58b69d

rhshadrach marked this pull request as ready for review February 3, 2024 11:19

type ignore

fe99dc5

mroeschke reviewed Feb 3, 2024

pandas/core/groupby/grouper.py (outdated, resolved)

pandas/core/groupby/ops.py (resolved)

pandas/core/groupby/ops.py (outdated, resolved)

pandas/core/groupby/ops.py (resolved)

Refactor & type annotations

766c229

rhshadrach added 2 commits February 4, 2024 08:34

Merge remote-tracking branch 'upstream/main' into gb_observed_pre

8f592ad

Conflicts: doc/source/whatsnew/v3.0.0.rst

Better bikeshed

a05ff18

mroeschke added this to the 3.0 milestone Feb 4, 2024

mroeschke approved these changes Feb 4, 2024

mroeschke merged commit 05f75c6 into pandas-dev:main Feb 7, 2024

rhshadrach deleted the gb_observed_pre branch February 7, 2024 15:12

m-richards mentioned this pull request Feb 18, 2024

COMPAT: pandas 3 related test updates geopandas/geopandas#3191

Merged

This was referenced Feb 23, 2024

PERF: groupby(...).__len__ #57595

Merged

DOC: Whatsnew notable bugfix on groupby behavior with unobserved groups #57600

Merged

rhshadrach mentioned this pull request Apr 25, 2024

BUG/PERF: groupby.transform with unobserved categories #58084

Merged

5 tasks

rhshadrach mentioned this pull request May 8, 2024

BUG: DataFrame.groupby returns invalid value when dropna=False #58644

Closed

3 tasks
