BUG: Fix min_count issue for groupby.sum #32914
result = grouped.sum(min_count=2)
expected = pd.DataFrame(
    {"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Not ideal but it seems we get NA in the DataFrame case but NaN for Series. Should I add a FIXME comment with a follow-up issue?
huh? that seems odd, can you track this down
can you create an issue about this. I am not sure this is correct.
Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?
it seems your comment above is not correct, e.g. we have NA in all cases. is that not what you meant here?
I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814
ok this is fine then
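For readers following the thread, the behaviour the test asserts can be modelled in plain Python. This is only a sketch, not pandas code: `grouped_sum` is a hypothetical helper, and `None` stands in for pd.NA. With min_count=2, a group with fewer than two non-missing values should sum to a missing marker rather than a number.

```python
from collections import defaultdict

NA = None  # stand-in for pd.NA in this sketch (assumption)


def grouped_sum(keys, values, min_count=0):
    """Sum values per key; return NA for groups with too few valid values."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)

    result = {}
    for key, group in groups.items():
        valid = [v for v in group if v is not NA]
        # Fewer valid observations than min_count -> the sum is missing.
        result[key] = sum(valid) if len(valid) >= min_count else NA
    return result


# Each key appears once, so min_count=2 can never be satisfied.
print(grouped_sum(["x", "y", "z"], [1, 2, NA], min_count=2))
# With the default min_count=0 an all-missing group sums to 0.
print(grouped_sum(["x", "y", "z"], [1, 2, NA]))
```

The point of the PR is that the missing marker should survive as NA in a nullable integer result, rather than being pushed through an integer cast.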
pandas/core/groupby/ops.py
Outdated
@@ -553,7 +553,8 @@ def _cython_operation(
# Two options for avoiding this special case
# 1. mask-aware ops and avoid casting to float with NaN above
# 2. specify the result dtype when calling this method
result = result.astype("int64")
if not isna(result).any():
move this condition to the above elif & add an appropriate comment as 3.
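The guard in this hunk can be modelled in plain Python as follows. It is a sketch of the idea only (skip the integer cast whenever the float result contains NaN); `cast_if_no_missing` is a hypothetical name, and the real code works on NumPy arrays through isna and astype.

```python
import math


def cast_if_no_missing(values):
    """Return ints when every value is present; otherwise leave floats as-is."""
    if any(math.isnan(v) for v in values):
        # A NaN has no meaningful fixed-width integer value, so skip the cast.
        return list(values)
    return [int(v) for v in values]


print(cast_if_no_missing([3.0, 5.0]))           # all present: cast to int
print(cast_if_no_missing([3.0, float("nan")]))  # NaN present: left as floats
```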
pandas/core/groupby/ops.py
Outdated
@@ -577,6 +583,14 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
    result = result.astype(orig_values.dtype)

if (
This isn't pretty but don't really know of a better way to make SeriesGroupBy return NA instead of NaN
see #32894 after we merge that you can just make a small mod to use the same
merged #32894 if you want to refactor
you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer
A little confused, do you mean maybe_cast_result (maybe_cast_result_dtype returns a numpy int dtype itself)? When I try using maybe_cast_result though I just get a numpy array back
look at the function which takes the name of the operation 'add' and the dtype and gives you back the casting dtype, which is exactly what we need to do here; you just need to modify maybe_cast_result, rather than have an else at all.
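The kind of lookup described here, from (operation name, input dtype) to the dtype the result should be cast to, might look roughly like this. It is a hypothetical model and not the body of maybe_cast_result_dtype: the function name `result_dtype_for`, the use of dtype names as strings, and the single bool rule are all assumptions made for illustration.

```python
def result_dtype_for(how, dtype):
    """Hypothetical lookup: map (operation, input dtype name) to a result dtype name."""
    # Assumed rule for this sketch: summing booleans produces integer counts.
    if how in ("add", "sum", "cumsum") and dtype == "bool":
        return "int64"
    # Otherwise the result keeps the input dtype, so a nullable "Int64"
    # input stays "Int64" and can hold missing values.
    return dtype


print(result_dtype_for("add", "bool"))
print(result_dtype_for("add", "Int64"))
```

Keeping the dtype decision in one function is what would let the caller cast unconditionally instead of carrying its own if/else.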
This seems to be causing lots of test failures (e.g., also sending numpy arrays down this path raises when maybe_cast_result tries to access a _values attribute). Also seems we'd need to remove the check for result[0] being an instance of the original dtype because then maybe_cast_result doesn't try to convert NaN to a nullable integer, but then categorical tests end up failing. Probably worth refactoring but I'm honestly not sure how.
i would like to find a way of using maybe_cast_result, i don't mind having a very simple elif here. but this is not good as written.
Made a couple of changes (some from another PR) that simplify it a little
pandas/core/groupby/ops.py
Outdated
@@ -575,6 +565,11 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
    result = result.astype(orig_values.dtype)

if is_integer_dtype(orig_values.dtype) and is_extension_array_dtype(
the whole entire point of maybe_cast_result is that you don't need the if here
to be honest we should remove lines 558-566, but can do that in another PR
Without the if about 550 groupby tests fail, though I think it works when only checking for extension arrays
pandas/core/dtypes/cast.py
Outdated
and dtype.kind != "M"
and not is_categorical_dtype(dtype)
):
cls = dtype.construct_array_type()
If you need extra logic, then you can add it here.
@@ -575,6 +565,9 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
    result = result.astype(orig_values.dtype)

if is_extension_array_dtype(orig_values.dtype):
if you move line 568 (the is_extension_array_dtype) before line 558, can we then completely remove lines 558-556? (these are already extension dtypes).
No, it looks like we still get a lot of failures. I could add back the integer check to avoid trying to cast things twice?
@dsaxton what's the status here?
The 'regression' for Int64 could be considered a bug since Int64 is considered experimental. I've not looked in detail yet, but the 'regression' was caused by #31359, see #32861 (comment), so we would need to check that the other EAs which aren't experimental aren't affected by this issue.
if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.
It doesn't affect others that I'm aware of (there's a similar but as far as I can tell unrelated problem with boolean data type which I created a separate issue for)
@jreback @simonjayhawkins Is it possible we can merge this as is? I have a fix for a separate bug that depends on this.
sure, is this rebased on master?
I think Simon merged this morning
* Add test
* Check for null
* Release note
* Update and comment
* Update test
* Hack
* Try a different casting
* No pd
* Only for add
* Undo
* Release note
* Fix
* Space
* maybe_cast_result
* float -> Int
* Less if

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
there is a problem also with timedelta, see #34051 (comment), but this also occurs on 0.25.3 so won't backport for now.
sum with min_count on SeriesGroupBy with dtype Int64 gives large negative value rather than pd.NA #32861
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff