BUG: Fix min_count issue for groupby.sum #32914
@@ -1636,3 +1636,20 @@ def test_apply_to_nullable_integer_returns_float(values, function):
    result = groups.agg([function])
    expected.columns = MultiIndex.from_tuples([("b", function)])
    tm.assert_frame_equal(result, expected)


def test_groupby_sum_below_mincount_nullable_integer():
    # https://github.com/pandas-dev/pandas/issues/32861
    df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}, dtype="Int64")
    grouped = df.groupby("a")
    idx = pd.Index([0, 1, 2], dtype=object, name="a")

    result = grouped["b"].sum(min_count=2)
    expected = pd.Series([pd.NA] * 3, dtype="Int64", index=idx, name="b")
    tm.assert_series_equal(result, expected)

    result = grouped.sum(min_count=2)
    expected = pd.DataFrame(
        {"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Not ideal but it seems we get
huh? that seems odd, can you track this down
can you create an issue about this. I am not sure this is correct.
Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?
it seems your comment above is not correct, e.g. we have NA in all cases. is that not what you meant here?
I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814
ok this is fine then
    )
    tm.assert_frame_equal(result, expected)
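For anyone wanting to see the behaviour this test pins down, here is a minimal sketch using only the public groupby API. It assumes a pandas version that already includes this fix; the variable names (`below`, `frame_below`, `at_least_one`) are illustrative and not part of the PR.

```python
import pandas as pd

# Same frame as the test: every group holds exactly one row.
df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}, dtype="Int64")
grouped = df.groupby("a")

# min_count=2 requires at least two valid values per group,
# so every group's sum is reported as missing.
below = grouped["b"].sum(min_count=2)
frame_below = grouped.sum(min_count=2)

# min_count=1 is satisfied by each single-row group,
# so the sums are simply the values themselves.
at_least_one = grouped["b"].sum(min_count=1)

print(below.isna().all())
print(frame_below.isna().all().all())
print(at_least_one.tolist())
```

The Series and DataFrame paths are checked separately because, per the review thread, they originally disagreed on NaN versus NA.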
This isn't pretty but don't really know of a better way to make SeriesGroupBy return NA instead of NaN
see #32894; after we merge that you can just make a small mod to use the same
merged #32894 if you want to refactor
you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer
A little confused, do you mean maybe_cast_result (maybe_cast_result_dtype returns a numpy int dtype itself)? When I try using maybe_cast_result though I just get a numpy array back
look at the function which takes the name of the operation 'add' and the dtype and gives you back the casting dtype, which is exactly what we need to do here; you just need to modify maybe_cast_result rather than have an else at all.
This seems to be causing lots of test failures (e.g., also sending numpy arrays down this path raises when maybe_cast_result tries to access a _values attribute). Also seems we'd need to remove the check for result[0] being an instance of the original dtype because then maybe_cast_result doesn't try to convert NaN to a nullable integer, but then categorical tests end up failing. Probably worth refactoring but I'm honestly not sure how.
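To make the NaN-versus-NA point concrete, here is a rough sketch of casting a float result back to a nullable dtype using only public pandas API. `cast_result_like` is a hypothetical helper written for illustration; it is not pandas' internal maybe_cast_result and does not reproduce its checks. It works on plain numpy arrays, so it never needs a `_values` attribute.

```python
import numpy as np
import pandas as pd


def cast_result_like(result, original_dtype):
    """Hypothetical helper: cast an aggregation result back to the input dtype.

    Only extension dtypes (e.g. Int64) are cast; numpy dtypes pass through.
    """
    if isinstance(original_dtype, pd.api.extensions.ExtensionDtype):
        # pd.array turns NaN into the dtype's missing value (pd.NA for Int64).
        return pd.array(result, dtype=original_dtype)
    return result


# A float result as a numpy array, with NaN where min_count was not met.
raw = np.array([np.nan, 3.0, np.nan])

nullable = cast_result_like(raw, pd.Int64Dtype())
untouched = cast_result_like(raw, np.dtype("float64"))

print(nullable)
print(untouched is raw)
```

This sidesteps the categorical cases mentioned above, which would need their own handling.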
i would like to find a way of using maybe_cast_result; i don't mind having a very simple elif here, but this is not good as written.
Made a couple of changes (some from another PR) that simplify it a little