Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix min_count issue for groupby.sum #32914

Merged
merged 30 commits into from
May 9, 2020
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
2a3f814
Add test
dsaxton Mar 22, 2020
3ce0384
Check for null
dsaxton Mar 22, 2020
e25c823
Release note
dsaxton Mar 22, 2020
2656bc3
Update and comment
dsaxton Mar 22, 2020
300b1e5
Update test
dsaxton Mar 23, 2020
e2e08e2
Hack
dsaxton Mar 23, 2020
ea59d54
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 24, 2020
ab1050f
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 25, 2020
98f0922
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 26, 2020
ea1ef36
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 26, 2020
cdf7dfb
Try a different casting
dsaxton Mar 27, 2020
b4edea3
No pd
dsaxton Mar 27, 2020
02fc55c
Only for add
dsaxton Mar 27, 2020
3044843
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 27, 2020
fe0ab93
Undo
dsaxton Mar 27, 2020
e6f5b4d
Release note
dsaxton Mar 22, 2020
c29dfad
Fix
dsaxton Mar 27, 2020
fb6b1d5
Space
dsaxton Mar 27, 2020
46eb601
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 27, 2020
2e65c14
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 28, 2020
a36131a
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 1, 2020
7c815b5
maybe_cast_result
dsaxton Apr 1, 2020
aeec4b0
float -> Int
dsaxton Apr 1, 2020
01c1d56
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 4, 2020
fc0f406
Less if
dsaxton Apr 4, 2020
47c19e2
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 5, 2020
8da2977
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 10, 2020
d74a905
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 14, 2020
ca3659e
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton May 4, 2020
0443d73
Merge remote-tracking branch 'upstream/master' into mincount-sum
simonjayhawkins May 7, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -364,6 +364,7 @@ Groupby/resample/rolling

- Bug in :meth:`GroupBy.apply` raises ``ValueError`` when the ``by`` axis is not sorted and has duplicates and the applied ``func`` does not mutate passed in objects (:issue:`30667`)
- Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
- Bug in :meth:`DataFrameGroupBy.sum` and :meth:`SeriesGroupBy.sum` where a large negative number would be returned when the number of non-null values was below ``min_count`` for nullable integer dtypes (:issue:`32861`)

Reshaping
^^^^^^^^^
Expand Down
14 changes: 14 additions & 0 deletions pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@
)
from pandas.core.dtypes.missing import _maybe_fill, isna

import pandas as pd
import pandas.core.algorithms as algorithms
from pandas.core.base import SelectionMixin
import pandas.core.common as com
Expand Down Expand Up @@ -547,12 +548,17 @@ def _cython_operation(
how == "add"
and is_integer_dtype(orig_values.dtype)
and is_extension_array_dtype(orig_values.dtype)
and not isna(result).any()
):
# We need this to ensure that Series[Int64Dtype].resample().sum()
# remains int64 dtype.
# Two options for avoiding this special case
# 1. mask-aware ops and avoid casting to float with NaN above
# 2. specify the result dtype when calling this method
#
# Sometimes result can contain null values (e.g. see
# https://github.com/pandas-dev/pandas/issues/32861)
# and so we must check for that before casting to int
result = result.astype("int64")

if kind == "aggregate" and self._filter_empty_groups and not counts.all():
Expand All @@ -577,6 +583,14 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
result = result.astype(orig_values.dtype)

if (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't pretty but don't really know of a better way to make SeriesGroupBy return NA instead of NaN

Copy link
Contributor

@jreback jreback Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #32894 after we merge that you can just make a small mod to use the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged #32894 if you want to refactor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little confused, do you mean maybe_cast_result (maybe_cast_result_dtype returns a numpy int dtype itself)? When I try using maybe_cast_result though I just get a numpy array back

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at the function which takes the name of the operation 'add' and the dtype and gives you back the casting dtype. which is eactly what we need to do here; you just need to modfy maybe_cast_result.

rather than have an else at all.

Copy link
Member Author

@dsaxton dsaxton Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be causing lots of test failures (e.g., also sending numpy arrays down this path raises when maybe_cast_result tries to access a _values attribute). Also seems we'd need to remove the check for result[0] being an instance of the original dtype because then maybe_cast_result doesn't try to convert NaN to a nullable integer, but then categorical tests end up failing. Probably worth refactoring but I'm honestly not sure how.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would like to fine a way of using maybe_cast_result, i don't mind have a very simpl elif here. but this is not good as written.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a couple changes (some from another PR) that simplifies it a little

how == "add"
and is_integer_dtype(orig_values.dtype)
and is_extension_array_dtype(orig_values.dtype)
and isna(result).any()
):
result = pd.array(result.ravel(), dtype="Int64")

return result, names

def aggregate(
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -1636,3 +1636,20 @@ def test_apply_to_nullable_integer_returns_float(values, function):
result = groups.agg([function])
expected.columns = MultiIndex.from_tuples([("b", function)])
tm.assert_frame_equal(result, expected)


def test_groupby_sum_below_mincount_nullable_integer():
# https://github.com/pandas-dev/pandas/issues/32861
df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}, dtype="Int64")
grouped = df.groupby("a")
idx = pd.Index([0, 1, 2], dtype=object, name="a")

result = grouped["b"].sum(min_count=2)
expected = pd.Series([pd.NA] * 3, dtype="Int64", index=idx, name="b")
tm.assert_series_equal(result, expected)

result = grouped.sum(min_count=2)
expected = pd.DataFrame(
{"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ideal but it seems we get NA in the DataFrame case but NaN for Series. Should I add a FIXME comment with a follow-up issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? that seems odd, can you track this down

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue about this. I am not sure this is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems your comment about is not correct, e.g. we have NA in all cases. is that not what you meant here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine then

)
tm.assert_frame_equal(result, expected)