BUG: Groupby min/max with nullable dtypes #42567

jbrockmendel · 2021-07-16T16:15:27Z

closes BUG: groupby().agg( ) with min/max on Int64 leads to incorrect results #41743
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

asvs show no change for non-nullable dtypes

mzeitlin11

Thanks for looking into this @jbrockmendel! LGTM ex a few small comments and I guess needs whatsnew

pandas/_libs/groupby.pyx

mzeitlin11 · 2021-07-16T16:31:49Z

pandas/core/groupby/ops.py

@@ -408,6 +408,7 @@ def _masked_ea_wrap_cython_operation(

        # Copy to ensure input and result masks don't end up shared
        mask = values._mask.copy()


For an aggregation op like this, I guess we can avoid the mask copy since the mask is not being modified inplace. (Not something that needs to be done in this PR, just a note.)

Though I suppose there's the tradeoff that then we'd lose mask contiguity guarantee in the algo?

haven't looked at the contiguity but that makes sense
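To make the aliasing concern behind the copy concrete, here is a minimal pure-Python sketch. `mark_all_missing` is a hypothetical stand-in for a kernel that writes to the mask it receives (not a pandas function), and plain lists stand in for boolean ndarrays. An aggregation that never writes to its input mask would not hit this, which is the point of the note above.

```python
# Hypothetical stand-in for a kernel that writes to the mask it is given.
# Not pandas code; plain lists stand in for boolean ndarrays.
def mark_all_missing(mask):
    for i in range(len(mask)):
        mask[i] = True
    return mask


input_mask = [False, True, False]

# With a copy, the caller's mask is left untouched.
result_mask = mark_all_missing(input_mask.copy())
copied_input_unchanged = input_mask == [False, True, False]

# Without a copy, input and result are the same object,
# so the in-place write leaks back into the input.
shared = mark_all_missing(input_mask)
shared_is_input = shared is input_mask
```

After the second call, `input_mask` itself reads all-True, which is the sharing the comment in the diff guards against.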

pandas/tests/groupby/test_min_max.py

jreback · 2021-07-30T23:10:56Z

pandas/core/groupby/ops.py

@@ -497,6 +509,8 @@ def _call_cython_op(
        values = values.T
        if mask is not None:
            mask = mask.T
+            if mask_out is not None:


should be out-dented

no, mask_out is not None in a strict subset of cases when mask is not None

sure but it has the guard on it so shouldn't make any difference and it IS independent of mask no?

so shouldn't make any difference

tiny difference in an unnecessary check being avoided. mostly clarity

and it IS independent of mask no?

no
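A small sketch of the two layouts being debated, assuming the invariant stated above (mask_out is only ever set when mask is set). `transpose`, `prepare_nested`, and `prepare_outdented` are hypothetical helpers written for illustration, not code from `_call_cython_op`, and lists of lists stand in for 2-D ndarrays.

```python
def transpose(rows):
    """Transpose a list of lists, standing in for ndarray.T."""
    return [list(col) for col in zip(*rows)]


def prepare_nested(mask, mask_out):
    # Guard on mask_out sits inside the guard on mask, as in the diff.
    if mask is not None:
        mask = transpose(mask)
        if mask_out is not None:
            mask_out = transpose(mask_out)
    return mask, mask_out


def prepare_outdented(mask, mask_out):
    # Alternative raised in review: two independent guards.
    if mask is not None:
        mask = transpose(mask)
    if mask_out is not None:
        mask_out = transpose(mask_out)
    return mask, mask_out


m = [[False, True], [True, False]]
mo = [[False], [False]]

# Cases allowed by the invariant "mask_out is set only when mask is set".
allowed = [(None, None), (m, None), (m, mo)]
same_when_invariant_holds = all(
    prepare_nested(a, b) == prepare_outdented(a, b) for a, b in allowed
)

# The excluded case is the only one where the two layouts differ.
differs_when_violated = prepare_nested(None, mo) != prepare_outdented(None, mo)
```

Under the invariant the two forms give the same result; the nested form only skips one check and signals the dependency to the reader.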

pandas/_libs/groupby.pyx

jreback · 2021-07-31T00:56:04Z

pandas/core/groupby/ops.py

@@ -497,6 +509,8 @@ def _call_cython_op(
        values = values.T
        if mask is not None:
            mask = mask.T
+            if mask_out is not None:


sure but it has the guard on it so shouldn't make any difference and it IS independent of mask no?

jreback

naming comments, not really thrilled with this expansion, but i guess it's needed.

pandas/core/groupby/ops.py

jbrockmendel · 2021-08-09T03:46:28Z

not really thrilled with this expansion, but i guess it's needed.

I'd be OK with not special-casing MaskedArray all throughout the codebase.

jreback · 2021-08-09T10:55:22Z

can u refresh with some places where we special case now?

jbrockmendel · 2021-08-09T16:10:30Z

can u refresh with some places where we special case now?

array_algos.quantile, GroupBy._bool_agg, GroupBy._get_cythonized_result, WrappedCythonOp._masked_ea_wrap_cython_operation, and a bunch prospectively in #42838, #39133, #37493

jreback · 2021-08-13T22:55:33Z

small naming comments & if you can rebase

jbrockmendel · 2021-08-13T23:02:26Z

just rebased; what did you think of the compromise naming of result_mask?

jreback · 2021-08-13T23:25:25Z

just rebased; what did you think of the compromise naming of result_mask?

cool with that

one place where need to update i think

jbrockmendel · 2021-08-14T02:18:23Z

one place where need to update i think

my thought with "result_mask" was to make a change in "mask" unnecessary. im not wild about mask_in bc it is awkward in cases without mask_out

jreback · 2021-08-18T15:21:45Z

one place where need to update i think

my thought with "result_mask" was to make a change in "mask" unnecessary. im not wild about mask_in bc it is awkward in cases without mask_out

right, result_mask is fine, it's the plain mask (when we also have mask_out) that bothers me. just need consistent naming.

jbrockmendel · 2021-08-18T16:39:01Z

right, result_mask is fine, it's the plain mask (when we also have mask_out) that bothers me. just need consistent naming.

"result_mask" is the alternative to "mask_out", so there is no more "mask_out". I don't like "mask_in" bc we'll end up with some functions that call it "mask_in" and others that call the same thing "mask" (bc those functions dont have result_mask/mask_out)

jorisvandenbossche

Looks good!

I agree with Brock on having consistent use of mask being better than some mask and some mask_in. With a function here that has both mask and result_mask, I think it's clear that mask is about the input.

@jbrockmendel here or for a follow-up, but it might be good to add a benchmark for this. Based on a quick check, it seems that GroupByMethods and GroupByCythonAgg target those algos (both in benchmarks/groupby.py), and are not yet parametrized with a nullable dtype.

jorisvandenbossche · 2021-08-18T19:06:51Z

pandas/_libs/groupby.pyx

+                        if uses_mask:
+                            result_mask[i, j] = True
+                        else:
+                            out[i, j] = nan_val


How is out typically initialized? (np.empty?) I am wondering if it would be good practice to still set out anyway (also if uses_mask=True, so as not to have "uninitialized" values)

hmm i could go either way on this
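For reference, a rough 1-D pure-Python analogue of the branch under discussion. This is a sketch, not the Cython kernel: the signature is simplified, lists replace memoryviews, and it takes the "always write out" option raised above so that no slot is left uninitialized. That choice is an assumption made for illustration, not necessarily what was merged.

```python
def group_min_max(values, labels, ngroups, mask=None,
                  compute_max=True, nan_val=float("nan")):
    """Illustrative 1-D analogue of a masked group min/max (not pandas code)."""
    uses_mask = mask is not None
    best = [None] * ngroups  # running extreme per group; None = nothing valid seen

    for i, (val, lab) in enumerate(zip(values, labels)):
        if lab < 0:
            continue  # negative label: row belongs to no group
        # Missingness comes from the mask if given, else from NaN (val != val).
        is_missing = mask[i] if uses_mask else (val != val)
        if is_missing:
            continue
        cur = best[lab]
        if cur is None or (val > cur if compute_max else val < cur):
            best[lab] = val

    out = [nan_val] * ngroups
    result_mask = [False] * ngroups if uses_mask else None
    for g in range(ngroups):
        if best[g] is None:
            out[g] = nan_val  # written in both paths, so never uninitialized
            if uses_mask:
                result_mask[g] = True
        else:
            out[g] = best[g]
    return out, result_mask


# Masked path: the 99 is masked out, and group 2 has no valid values at all.
out, result_mask = group_min_max(
    [1, 99, 3, 7], [0, 0, 1, 2], 3,
    mask=[False, True, False, True], nan_val=0,
)
# out == [1, 3, 0], result_mask == [False, False, True]
```

In the masked path the value stored at a masked position is only a placeholder; consumers are expected to consult `result_mask` rather than the placeholder.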

jbrockmendel · 2021-08-19T21:02:35Z

but it might be good to add a benchmark for this

good idea. will do as follow-up, as im hoping we can get this in quick-ish

jreback · 2021-08-19T21:33:10Z

I agree with Brock on having consistent use of mask being better than some mask and some mask_in. With a function here that has both mask and result_mask, I think it's clear that mask is about the input.

I am cool with the names, just want to be consistent across the code base on these names / meaning

jbrockmendel · 2021-08-28T23:49:43Z

I am cool with the names, just want to be consistent across the code base on these names / meaning

What Joris and I are saying is that this is the best way to be consistent w/r/t naming

jorisvandenbossche · 2021-08-31T14:46:28Z

pandas/_libs/groupby.pyx

@@ -1197,6 +1199,12 @@ cdef group_min_max(groupby_t[:, ::1] out,
        True if `values` contains datetime-like entries.
    compute_max : bint, default True
        True to compute group-wise max, False to compute min
+    mask_in : ndarray[bool, ndim=2], optional


Since the group_max that wraps this uses mask, use mask here as well?

good catch, updated

jreback

looks good. in a follow-on, there's a place to make sure we are hitting with tests (the error condition).

jreback · 2021-09-05T01:27:58Z

pandas/core/arrays/masked.py

@@ -123,6 +123,8 @@ def __init__(self, values: np.ndarray, mask: np.ndarray, copy: bool = False):
            raise ValueError("values must be a 1D array")
        if mask.ndim != 1:
            raise ValueError("mask must be a 1D array")
+        if values.shape != mask.shape:


this hit in tests?
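A sketch of what a test hitting this branch could look like, assuming a stripped-down stand-in class. `MaskedSketch` and `raises_value_error` are hypothetical, the error message wording is invented for the sketch, and a length check replaces the shape comparison since plain lists are used instead of ndarrays.

```python
class MaskedSketch:
    """Hypothetical stand-in for a masked array; not pandas' BaseMaskedArray."""

    def __init__(self, values, mask):
        if len(values) != len(mask):
            # Message wording is an assumption made for this sketch.
            raise ValueError("values and mask must have the same length")
        self.values = list(values)
        self.mask = list(mask)


def raises_value_error(values, mask):
    """Return True if constructing MaskedSketch raises ValueError."""
    try:
        MaskedSketch(values, mask)
    except ValueError:
        return True
    return False


mismatch_rejected = raises_value_error([1, 2, 3], [False, True])
match_accepted = not raises_value_error([1, 2], [False, True])
```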

jreback · 2021-09-05T01:30:27Z

also in followon should validate first/last (and maybe others) are correctly handling these dtypes, can you create an issue.

jbrockmendel added 2 commits July 16, 2021 09:13

BUG: Groupby.min/max with Int64

6f7c961

flesh out test cases

8f98501

mzeitlin11 reviewed Jul 16, 2021

mzeitlin11 added the Bug, Groupby, and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Jul 16, 2021

jbrockmendel added 7 commits July 16, 2021 14:11

Merge branch 'master' into nullable-41743

936df47

const, rename test

b325cc0

Merge branch 'master' into nullable-41743

a021e58

Merge branch 'master' into nullable-41743

2807b25

Merge branch 'master' into nullable-41743

1988294

Merge branch 'master' into nullable-41743

25307f6

whatsnew

921ad33

jreback added this to the 1.4 milestone Jul 28, 2021

jbrockmendel added 4 commits July 29, 2021 12:44

Merge branch 'master' into nullable-41743

98f8782

Merge branch 'master' into nullable-41743

13bd9f3

Merge branch 'master' into nullable-41743

f44e77b

Merge branch 'master' into nullable-41743

cc95817

jreback requested changes Jul 30, 2021

jreback requested changes Jul 31, 2021

jbrockmendel added 3 commits July 30, 2021 19:39

Merge branch 'master' into nullable-41743

362eed5

mask -> mask_in

5f4ea99

Merge branch 'master' into nullable-41743

3eb06f5

jreback requested changes Aug 8, 2021

jbrockmendel added 2 commits August 8, 2021 20:40

Merge branch 'master' into nullable-41743

359c171

mask_out -> result_mask

35def86

Merge branch 'master' into nullable-41743

edb1beb

Merge branch 'master' into nullable-41743

e1a447b

jorisvandenbossche reviewed Aug 18, 2021

Merge branch 'master' into nullable-41743

f556ab8

Merge branch 'master' into nullable-41743

92b4617

jorisvandenbossche reviewed Aug 31, 2021

jbrockmendel added 3 commits August 31, 2021 14:37

Merge branch 'master' into nullable-41743

11d7f1d

mask_in->mask

6043105

Merge branch 'master' into nullable-41743

961073d

jreback approved these changes Sep 5, 2021

jreback merged commit 6f4c382 into pandas-dev:master Sep 5, 2021

jbrockmendel deleted the nullable-41743 branch September 5, 2021 18:04

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

BUG: Groupby min/max with nullable dtypes (pandas-dev#42567)

86fb205

jorisvandenbossche mentioned this pull request Sep 22, 2021

ENH: support masked arrays in groupby cython algos #37493

Closed

10 tasks
