Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Compute complete result_index upfront in groupby #55738

Merged
merged 45 commits into from
Feb 7, 2024

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Oct 27, 2023

Just a PoC at this point, need to try to move behavior changes out of here as much as possible.

The main change this makes is it moves the computation of unobserved groups upfront. Currently, we only include unobserved groups upfront if there is a single grouping (e.g. df.groupby('a')) but not two or more (e.g. df.groupby(['a', 'b'])). When there are two or more, we go through the groupby computations with only observed groups, and then tack on the unobserved groups at the end. By always including unobserved groups upfront, we can simplify the logic in the groupby code. Having unobserved groups sometimes included and sometimes not included upfront is also a footgun.

In order to make this change, I found I needed to rework a bit of how NA values are handled in Grouping._codes_and_uniques. This in turn fixed some NA bugs.

This PR fixes a number of issues that stem from this (shown in the test changes)

  • Unobserved groups are not included in groupby.apply if there is more than one grouping
  • Unobserved groups do not appear in groubpy.groups
  • When there are multiple groupings, len(df.groupby(..., dropna=True)) counts groups that have NA values
  • When there are multiple groupings, df.groupby(..., dropna=True).groups has groups that have NA values
  • The value of .all for empty groups was np.nan but should be True
  • The value of .any for empty groups was np.nan but should be False
  • The value of .prod for empty groups was np.nan but should be 1

In addition, it allows us to remove various similar objects throughout groupby:

  • BaseGrouper.group_keys_seq
  • BaseGrouper.reconstructed_codes
  • Grouping.group_index
  • Grouping.result_index
  • Grouping.group_arraylike
  • BinGrouper.reconstructed_codes

and a few methods

  • GroupBy._reindex_output
  • BaseGrouper._sort_idx
  • BaseGrouper._get_compressed_codes

But it does add BaseGrouper.result_index_and_codes. This computes the (aggregated) result index that takes into account dropna and observed, along with the codes for the groups themselves.

ASVs; no performance regressions (with the standard 10% cutoff) other than groupby transform operations with multiple categorical groupings.

| Change   | Before [2b67593b] <test>   | After [e285742a] <gb_observed_pre>   |   Ratio | Benchmark (Parameter)                                                                                              |
|----------|----------------------------|--------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------|
| +        | 308±4μs                    | 1.25±0ms                             |    4.08 | groupby.MultipleCategories.time_groupby_transform                                                                  |
| -        | 5.10±0.04ms                | 4.61±0.06ms                          |    0.91 | groupby.Apply.time_scalar_function_multi_col(5)                                                                    |
| -        | 239±4μs                    | 216±0.2μs                            |    0.9  | groupby.Categories.time_groupby_sort(False)                                                                        |
| -        | 60.1±2μs                   | 53.8±0.4μs                           |    0.9  | groupby.GroupByMethods.time_dtype_as_field('float', 'size', 'direct', 1, 'cython')                                 |
| -        | 241±5μs                    | 215±1μs                              |    0.89 | groupby.Categories.time_groupby_ordered_sort(False)                                                                |
| -        | 556±7μs                    | 488±60μs                             |    0.88 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mul>)       |
| -        | 235±5μs                    | 205±2μs                              |    0.87 | groupby.Categories.time_groupby_extra_cat_sort(False)                                                              |
| -        | 419±8μs                    | 356±10μs                             |    0.85 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function le>)        |
| -        | 549±10μs                   | 461±50μs                             |    0.84 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function truediv>) |
| -        | 21.7±0.7μs                 | 18.1±0.06μs                          |    0.83 | groupby.GroupByMethods.time_dtype_as_field('uint', 'count', 'direct', 1, 'cython')                                 |
| -        | 22.5±0.3μs                 | 18.6±0.2μs                           |    0.83 | groupby.GroupByMethods.time_dtype_as_group('float', 'count', 'direct', 1, 'cython')                                |
| -        | 22.2±0.2μs                 | 18.4±0.8μs                           |    0.83 | groupby.GroupByMethods.time_dtype_as_group('uint', 'count', 'direct', 1, 'cython')                                 |
| -        | 22.3±0.8μs                 | 18.3±0.1μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'direct', 1, 'cython')                                |
| -        | 21.6±0.9μs                 | 17.8±0.3μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_field('int16', 'count', 'direct', 1, 'cython')                                |
| -        | 22.0±0.6μs                 | 18.1±0.3μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_group('int', 'count', 'direct', 1, 'cython')                                  |
| -        | 21.9±0.7μs                 | 17.8±0.3μs                           |    0.81 | groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct', 1, 'cython')                                  |
| -        | 22.0±0.2μs                 | 17.9±0.07μs                          |    0.81 | groupby.GroupByMethods.time_dtype_as_group('int16', 'count', 'direct', 1, 'cython')                                |
| -        | 21.6±0.1μs                 | 17.0±0.06μs                          |    0.79 | groupby.GroupByMethods.time_dtype_as_group('object', 'count', 'direct', 1, 'cython')                               |
| -        | 520±50μs                   | 370±30μs                             |    0.71 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function sub>)         |
| -        | 13.7±1ms                   | 1.80±0ms                             |    0.13 | groupby.MultipleCategories.time_groupby_nosort                                                                     |
| -        | 15.8±2ms                   | 1.73±0ms                             |    0.11 | groupby.MultipleCategories.time_groupby_extra_cat_nosort                                                           |
| -        | 13.7±0.2ms                 | 1.57±0.01ms                          |    0.11 | groupby.MultipleCategories.time_groupby_ordered_nosort                                                             |
| -        | 33.6±0.4ms                 | 1.40±0ms                             |    0.04 | groupby.MultipleCategories.time_groupby_sort                                                                       |
| -        | 35.5±1ms                   | 1.23±0ms                             |    0.03 | groupby.MultipleCategories.time_groupby_extra_cat_sort                                                             |
| -        | 33.3±0.8ms                 | 1.15±0ms                             |    0.03 | groupby.MultipleCategories.time_groupby_ordered_sort                                                               |

I haven't verified this, but I believe this would also make #55261 trivial to implement just by changing a few lines of result_index_and_ids

@rhshadrach rhshadrach added Bug Refactor Internal refactoring of code Groupby Categorical Categorical Data Type labels Oct 27, 2023
@@ -1204,7 +1204,7 @@ def test_value_counts_sort_categorical(sort, vc_sort, normalize):
elif not sort and vc_sort:
taker = [0, 2, 1, 3]
else:
taker = [2, 3, 0, 1]
taker = [2, 1, 0, 3]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach rhshadrach marked this pull request as ready for review February 3, 2024 11:19
@rhshadrach
Copy link
Member Author

@jbrockmendel @mroeschke - this is now ready

@mroeschke
Copy link
Member

Does this happen to fix #35202

@rhshadrach
Copy link
Member Author

Does this happen to fix #35202

No, see the bottom half of #55738 (comment).

@mroeschke mroeschke added this to the 3.0 milestone Feb 4, 2024
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mroeschke mroeschke merged commit 05f75c6 into pandas-dev:main Feb 7, 2024
46 of 47 checks passed
@mroeschke
Copy link
Member

Thanks @rhshadrach

@rhshadrach rhshadrach deleted the gb_observed_pre branch February 7, 2024 15:12
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 10, 2024

@rhshadrach is the following change in behaviour intentional?

df = pd.DataFrame(
    {
        "key_cat": pd.Categorical(["a", "a", "b", "b"]),
        "key_noncat": [1, 1, 1, 2],
        "values": [1.0, 2.0, 3.0, 4.0],
    }
)
df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())

Using released pandas, this injected NaNs for the unobserved values:

# using pandas 2.2.0
>>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
key_cat  key_noncat
a        1             3.0
         2             NaN
b        1             3.0
         2             4.0
Name: values, dtype: float64

versus after (I assume) this PR on pandas main:

# using pandas main
>>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())
key_cat  key_noncat
a        1             3.0
         2             0.0
b        1             3.0
         2             4.0
Name: values, dtype: float64

The reason this now gives a 0 is because this actually calls the UDF on the empty group, and the sum of an empty Series is 0 (while before, the UDF was never called for the unobserved group).

I don't think I have a strong opinion on what the behaviour should be, but if it is intentional, this probably warrants a whatsnew note (it was confusing to debug my failing tests).

Apart of the different result, this also means that your UDF needs to be able to handle empty input (before, the UDF essentially never got empty input for those cases, I think?)

@jorisvandenbossche
Copy link
Member

I also see that your top post has a long list of changes / fixes:

This PR fixes a number of issues that stem from this (shown in the test changes)

  • ...

I think those are mostly added to the whatsnew, but it might be clearer to have a separate section explaining the general change (unobserved groups are now properly included through all APIs), and then list some of the consequence of it (instead of those bullet points that will become part of a large list of bullet points in the groupby section)

@rhshadrach
Copy link
Member Author

rhshadrach commented Feb 10, 2024

Thanks @jorisvandenbossche - indeed this is intentional; xref #36698. I'll add a whatsnew note try to make this a notable bug fix section in a followup.

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* REF: Compute correct result_index upfront in groupby

* Refinements

* Refinements

* Refinements

* Restore inferring index dtype

* Test fixups

* Refinements

* Refinements

* fixup

* fixup

* fixup

* Fix sorting and non-sorting

* Cleanup

* Call ensure_plantform_int last

* fixup

* fixup

* REF: Compute correct result_index upfront in groupby

* Add test

* Remove test

* Move unobserved to the end

* cleanup

* cleanup

* cleanup

* Merge fixup

* fixup

* fixup

* Fixup and test

* whatsnew

* type ignore

* Refactor & type annotations

* Better bikeshed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment