Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Groupby min/max with nullable dtypes #42567

Merged
merged 25 commits into from
Sep 5, 2021

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Jul 16, 2021

asvs show no change for non-nullable dtypes

Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this @jbrockmendel! LGTM ex a few small comments and I guess needs whatsnew

pandas/_libs/groupby.pyx Outdated Show resolved Hide resolved
pandas/_libs/groupby.pyx Outdated Show resolved Hide resolved
@@ -408,6 +408,7 @@ def _masked_ea_wrap_cython_operation(

# Copy to ensure input and result masks don't end up shared
mask = values._mask.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an aggregation op like this, I guess we can avoid the mask copy since the mask is not being modified inplace. (but not something needs to be done in this pr, just a note)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I suppose there's the tradeoff that then we'd lose mask contiguity guarantee in the algo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't looked at the contiguity but that makes sense

pandas/tests/groupby/test_min_max.py Outdated Show resolved Hide resolved
@mzeitlin11 mzeitlin11 added Bug Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Jul 16, 2021
@jreback jreback added this to the 1.4 milestone Jul 28, 2021
@@ -497,6 +509,8 @@ def _call_cython_op(
values = values.T
if mask is not None:
mask = mask.T
if mask_out is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be out-dented

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, mask_out is not None in a strict subset of cases when mask is not None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure but it has the guard on it so shouldn't make any difference and it IS indepenent of mask no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so shouldn't make any difference

tiny difference in an unnecessary check being avoided. mostly clarity

and it IS indepenent of mask no?

no

pandas/_libs/groupby.pyx Show resolved Hide resolved
pandas/_libs/groupby.pyx Show resolved Hide resolved
@@ -497,6 +509,8 @@ def _call_cython_op(
values = values.T
if mask is not None:
mask = mask.T
if mask_out is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure but it has the guard on it so shouldn't make any difference and it IS indepenent of mask no?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming comments, not really thrilled with this expansion, but i guess its needed.

pandas/core/groupby/ops.py Show resolved Hide resolved
pandas/core/groupby/ops.py Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

not really thrilled with this expansion, but i guess its needed.

I'd be OK with not special-casing MaskedArray all throughout the codebase.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2021

can u refresh with some places where we special case now?

@jbrockmendel
Copy link
Member Author

can u refresh with some places where we special case now?

array_algos.quantile, GroupBy._bool_agg, GroupBy._get_cythonized_result, WrappedCythonOp._masked_ea_wrap_cython_operation, and a bunch prospectively in #42838, #39133, #37493

@jreback
Copy link
Contributor

jreback commented Aug 13, 2021

small naming comments & if you can rebase

@jbrockmendel
Copy link
Member Author

just rebased; what did you think of the compromise naming of result_mask?

@jreback
Copy link
Contributor

jreback commented Aug 13, 2021

just rebased; what did you think of the compromise naming of result_mask?

cool with that

one place where need to update i think

@jbrockmendel
Copy link
Member Author

one place where need to update i think

my thought with "result_mask" was to make a change in "mask" unnecessary. im not wild about mask_in bc it is awkward in cases without mask_out

@jreback
Copy link
Contributor

jreback commented Aug 18, 2021

one place where need to update i think

my thought with "result_mask" was to make a change in "mask" unnecessary. im not wild about mask_in bc it is awkward in cases without mask_out

right result_mask is fine, its the plain mask (when we also have mask_out) that bothers me. just need consistent naming.

@jbrockmendel
Copy link
Member Author

right result_mask is fine, its the plain mask (when we also have mask_out) that bothers me. just need consistent naming.

"result_mask" is the alternative to "mask_out", so there is no more "mask_out". I don't like "mask_in" bc we'll end up with some functions that call it "mask_in" and others that call the same thing "mask" (bc those functions dont have result_mask/mask_out)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I agree with Brock on having consistent use of mask being better than some mask and some mask_in. I think having a function here with both mask and result_mask, I think it's clear that mask is about the input.

@jbrockmendel here or for a follow-up, but it might be good to add a benchmark for this. Based on a quick check, it seems that GroupByMethods and GroupByCythonAgg target those algos (both in benchmarks/groupby.py), and are not yet parametrized with a nullable dtype.

if uses_mask:
result_mask[i, j] = True
else:
out[i, j] = nan_val
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is out typically initialized? (np.empty?) I am wondering it it would be good practice to sill set out anyway (also if uses_mask=True to not have "unitialized" values)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i could go either way on this

@jbrockmendel
Copy link
Member Author

but it might be good to add a benchmark for this

good idea. will do as follow-up, as im hoping we can get this in quick-ish

@jreback
Copy link
Contributor

jreback commented Aug 19, 2021

I agree with Brock on having consistent use of mask being better than some mask and some mask_in. I think having a function here with both mask and result_mask, I think it's clear that mask is about the input.

I am cool with the names, just want to be consistent across the code base on these names / meaning

@jbrockmendel
Copy link
Member Author

I am cool with the names, just want to be consistent across the code base on these names / meaning

What Joris and I are saying is that this is the best way to be consistent w/r/t naming

@@ -1197,6 +1199,12 @@ cdef group_min_max(groupby_t[:, ::1] out,
True if `values` contains datetime-like entries.
compute_max : bint, default True
True to compute group-wise max, False to compute min
mask_in : ndarray[bool, ndim=2], optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the group_max that wraps this uses mask, use mask here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, updated

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. in a followon a place to make sure we are hitting with tests (the error condition).

@@ -123,6 +123,8 @@ def __init__(self, values: np.ndarray, mask: np.ndarray, copy: bool = False):
raise ValueError("values must be a 1D array")
if mask.ndim != 1:
raise ValueError("mask must be a 1D array")
if values.shape != mask.shape:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this hit in tests?

@jreback jreback merged commit 6f4c382 into pandas-dev:master Sep 5, 2021
@jreback
Copy link
Contributor

jreback commented Sep 5, 2021

also in followon should validate first/last (and maybe others) are correctly handling these dtypes, can you create an issue.

@jbrockmendel jbrockmendel deleted the nullable-41743 branch September 5, 2021 18:04
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby().agg( ) with min/max on Int64 leads to incorrect results
4 participants