-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Groupby min/max with nullable dtypes #42567
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this @jbrockmendel! LGTM ex a few small comments and I guess needs whatsnew
@@ -408,6 +408,7 @@ def _masked_ea_wrap_cython_operation( | |||
|
|||
# Copy to ensure input and result masks don't end up shared | |||
mask = values._mask.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For an aggregation op like this, I guess we can avoid the mask copy since the mask is not being modified inplace. (but not something needs to be done in this pr, just a note)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I suppose there's the tradeoff that then we'd lose mask contiguity guarantee in the algo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haven't looked at the contiguity but that makes sense
pandas/core/groupby/ops.py
Outdated
@@ -497,6 +509,8 @@ def _call_cython_op( | |||
values = values.T | |||
if mask is not None: | |||
mask = mask.T | |||
if mask_out is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be out-dented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, mask_out is not None in a strict subset of cases when mask is not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure but it has the guard on it so shouldn't make any difference and it IS indepenent of mask no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so shouldn't make any difference
tiny difference in an unnecessary check being avoided. mostly clarity
and it IS indepenent of mask no?
no
pandas/core/groupby/ops.py
Outdated
@@ -497,6 +509,8 @@ def _call_cython_op( | |||
values = values.T | |||
if mask is not None: | |||
mask = mask.T | |||
if mask_out is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure but it has the guard on it so shouldn't make any difference and it IS indepenent of mask no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming comments, not really thrilled with this expansion, but i guess its needed.
I'd be OK with not special-casing MaskedArray all throughout the codebase. |
can u refresh with some places where we special case now? |
small naming comments & if you can rebase |
just rebased; what did you think of the compromise naming of |
cool with that one place where need to update i think |
my thought with "result_mask" was to make a change in "mask" unnecessary. im not wild about mask_in bc it is awkward in cases without mask_out |
right result_mask is fine, its the plain mask (when we also have mask_out) that bothers me. just need consistent naming. |
"result_mask" is the alternative to "mask_out", so there is no more "mask_out". I don't like "mask_in" bc we'll end up with some functions that call it "mask_in" and others that call the same thing "mask" (bc those functions dont have result_mask/mask_out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I agree with Brock on having consistent use of mask
being better than some mask
and some mask_in
. I think having a function here with both mask
and result_mask
, I think it's clear that mask
is about the input.
@jbrockmendel here or for a follow-up, but it might be good to add a benchmark for this. Based on a quick check, it seems that GroupByMethods
and GroupByCythonAgg
target those algos (both in benchmarks/groupby.py
), and are not yet parametrized with a nullable dtype.
if uses_mask: | ||
result_mask[i, j] = True | ||
else: | ||
out[i, j] = nan_val |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is out
typically initialized? (np.empty
?) I am wondering it it would be good practice to sill set out
anyway (also if uses_mask=True
to not have "unitialized" values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i could go either way on this
good idea. will do as follow-up, as im hoping we can get this in quick-ish |
I am cool with the names, just want to be consistent across the code base on these names / meaning |
What Joris and I are saying is that this is the best way to be consistent w/r/t naming |
pandas/_libs/groupby.pyx
Outdated
@@ -1197,6 +1199,12 @@ cdef group_min_max(groupby_t[:, ::1] out, | |||
True if `values` contains datetime-like entries. | |||
compute_max : bint, default True | |||
True to compute group-wise max, False to compute min | |||
mask_in : ndarray[bool, ndim=2], optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the group_max
that wraps this uses mask
, use mask
here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. in a followon a place to make sure we are hitting with tests (the error condition).
@@ -123,6 +123,8 @@ def __init__(self, values: np.ndarray, mask: np.ndarray, copy: bool = False): | |||
raise ValueError("values must be a 1D array") | |||
if mask.ndim != 1: | |||
raise ValueError("mask must be a 1D array") | |||
if values.shape != mask.shape: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this hit in tests?
also in followon should validate first/last (and maybe others) are correctly handling these dtypes, can you create an issue. |
asvs show no change for non-nullable dtypes