BUG: regression when applying groupby aggregation on categorical columns #31359

Merged Jan 29, 2020 (33 commits)
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
24c3ede
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jan 14, 2020
dea38f2
fix issue 17038
charlesdong1991 Jan 14, 2020
cd9e7ac
revert change
charlesdong1991 Jan 14, 2020
e5e912b
revert change
charlesdong1991 Jan 14, 2020
d0cddb3
Merge remote-tracking branch 'upstream/master' into fix_issue_31256
charlesdong1991 Jan 26, 2020
4e1abde
try fix 31256
charlesdong1991 Jan 27, 2020
0b917c6
pep8
charlesdong1991 Jan 27, 2020
e9cac5d
fix test
charlesdong1991 Jan 27, 2020
e03357e
fix
charlesdong1991 Jan 28, 2020
a1df393
fix up tests
charlesdong1991 Jan 28, 2020
8743c47
preserve order
charlesdong1991 Jan 28, 2020
1e10d71
add comment
charlesdong1991 Jan 28, 2020
905b3a5
better test
charlesdong1991 Jan 28, 2020
c5d670b
fixup
charlesdong1991 Jan 28, 2020
916d9b2
remove blank line
charlesdong1991 Jan 28, 2020
2208fc2
fix test
charlesdong1991 Jan 28, 2020
86a254c
style
charlesdong1991 Jan 28, 2020
c36d97b
linting
charlesdong1991 Jan 28, 2020
3f8ea8f
wip
TomAugspurger Jan 29, 2020
a6ad1a2
alternative
TomAugspurger Jan 29, 2020
c7daa46
Merge remote-tracking branch 'upstream/master' into fix_issue_31256
TomAugspurger Jan 29, 2020
c4ebfa9
Fixups
TomAugspurger Jan 29, 2020
bbad886
revert extranesou
TomAugspurger Jan 29, 2020
a6a498e
non-numeric
TomAugspurger Jan 29, 2020
2a3f5a2
xfailing test
TomAugspurger Jan 29, 2020
ceef95e
release note
TomAugspurger Jan 29, 2020
ed91cc1
fixup
TomAugspurger Jan 29, 2020
9c7af0f
fixup
TomAugspurger Jan 29, 2020
ca35648
fixup
TomAugspurger Jan 29, 2020
1b826bb
fixup
TomAugspurger Jan 29, 2020
48 changes: 48 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -626,6 +626,54 @@ consistent with the behaviour of :class:`DataFrame` and :class:`Index`.
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)

Result dtype inference changes for resample operations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The rules for the result dtype in :meth:`DataFrame.resample` aggregations have changed for extension types (:issue:`31359`).
Previously, pandas would attempt to convert the result back to the original dtype, falling back to the usual
inference rules if that was not possible. Now, pandas will only return a result of the original dtype if the
scalar values in the result are instances of the extension dtype's scalar type.

.. ipython:: python

   df = pd.DataFrame({"A": ['a', 'b']}, dtype='category',
                     index=pd.date_range('2000', periods=2))
   df


*pandas 0.25.x*

.. code-block:: python

   >>> df.resample("2D").agg(lambda x: 'a').A.dtype
   CategoricalDtype(categories=['a', 'b'], ordered=False)

*pandas 1.0.0*

.. ipython:: python

   df.resample("2D").agg(lambda x: 'a').A.dtype

Member

What will this return? Not a categorical?
I am not fully sure that is correct either. E.g., if you do a first-like aggregation lambda x: x[0], you would expect this to work...

(but of course any heuristic will never be correct in all cases; for full control in cases like this we will need an additional keyword)

Contributor

Right, it will be object dtype. That's the API breaking change. But, I'm OK with it because

  1. It matches the behavior of groupby on 0.25.3. df.groupby([1, 1]).agg(lambda x: 'a').A.dtype is object.
  2. It's incorrect (or at least surprising?) when the dtype can't hold the values (the next example where you get NaN values).

This fixes an inconsistency between ``resample`` and ``groupby``.
This also fixes a potential bug, where the **values** of the result might change
depending on how the results are cast back to the original dtype.

*pandas 0.25.x*

.. code-block:: python

   >>> df.resample("2D").agg(lambda x: 'c')

        A
   0  NaN

*pandas 1.0.0*

.. ipython:: python

   df.resample("2D").agg(lambda x: 'c')


.. _whatsnew_100.api_breaking.python:

Increased minimum version for Python
7 changes: 4 additions & 3 deletions pandas/core/groupby/groupby.py
@@ -813,9 +813,10 @@ def _try_cast(self, result, obj, numeric_only: bool = False):
                 # datetime64tz is handled correctly in agg_series,
                 # so is excluded here.

-                # return the same type (Series) as our caller
-                cls = dtype.construct_array_type()
-                result = try_cast_to_ea(cls, result, dtype=dtype)
+                if len(result) and isinstance(result[0], dtype.type):
Member

@charlesdong1991 it looks like this change caused the regression in #32194

+                    cls = dtype.construct_array_type()
+                    result = try_cast_to_ea(cls, result, dtype=dtype)
+
             elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
                 result = maybe_downcast_to_dtype(result, dtype)
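
As a rough illustration of the guard in this hunk (a sketch, not the pandas implementation), the `isinstance(result[0], dtype.type)` check can be tried against a few scalar values. The helper name `passes_scalar_check` is hypothetical and exists only for this example.

```python
import numpy as np
import pandas as pd


def passes_scalar_check(values, dtype):
    """Hypothetical helper mirroring the guard shown in the hunk above."""
    return bool(len(values)) and isinstance(values[0], dtype.type)


int_dtype = pd.Int64Dtype()
cat_dtype = pd.CategoricalDtype(["a", "b"])

# NumPy int64 scalars are instances of Int64Dtype.type (numpy.int64).
numpy_int_ok = passes_scalar_check([np.int64(1)], int_dtype)
# Plain Python ints are not, so no cast back would be attempted.
python_int_ok = passes_scalar_check([1], int_dtype)
# CategoricalDtype.type is a placeholder class, so strings never match it.
category_ok = passes_scalar_check(["a"], cat_dtype)
```

The last case is why the check effectively disables casting back to categorical, which is the behaviour discussed in the review comments.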

11 changes: 11 additions & 0 deletions pandas/core/groupby/ops.py
@@ -543,6 +543,17 @@ def _cython_operation(
            if mask.any():
                result = result.astype("float64")
                result[mask] = np.nan
            elif (
Contributor

Ugly! But this ensures that Series[IntDtype].resample().sum() has int dtype. Assuming we want the restriction "only try casting back to EA when the scalar results are instances of the EA dtype's type", then I think this is the least worst option.

I briefly tried avoiding the cast from Int64 -> float with NaN, but that wasn't feasible today. Our Cython reductions like group_add only handle floats IIUC.

Member

TODO comment suggesting this get cleaned up, pointing back to this thread?

Contributor

Going to open an issue and will point here.

                how == "add"
                and is_integer_dtype(orig_values.dtype)
                and is_extension_array_dtype(orig_values.dtype)
            ):
                # We need this to ensure that Series[Int64Dtype].resample().sum()
                # remains int64 dtype.
                # Two options for avoiding this special case
                # 1. mask-aware ops and avoid casting to float with NaN above
                # 2. specify the result dtype when calling this method
                result = result.astype("int64")

        if kind == "aggregate" and self._filter_empty_groups and not counts.all():
            assert result.ndim != 2
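
The branch above can be sketched in isolation with plain NumPy. This is a simplified stand-in, not the pandas code: `finish_integer_sum` is a hypothetical name, and it assumes the caller already knows the operation is a sum over an integer extension dtype.

```python
import numpy as np


def finish_integer_sum(result, mask):
    """Hypothetical stand-in for the branch above; not pandas code."""
    if mask.any():
        # Missing groups need NaN, which only a float array can hold.
        result = result.astype("float64")
        result[mask] = np.nan
    else:
        # Nothing is missing, so the sums can go back to int64.
        result = result.astype("int64")
    return result


sums = np.array([3.0, 7.0])
no_missing = finish_integer_sum(sums, np.array([False, False]))
with_missing = finish_integer_sum(sums, np.array([False, True]))
```

With no masked entries the result is int64 again; with a masked entry it stays float64 and carries NaN, which is the trade-off the review comment calls out.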
21 changes: 21 additions & 0 deletions pandas/tests/groupby/aggregate/test_aggregate.py
@@ -663,6 +663,27 @@ def test_aggregate_mixed_types():
    tm.assert_frame_equal(result, expected)


@pytest.mark.xfail(reason="Not implemented.")
def test_aggregate_udf_na_extension_type():
    # https://github.com/pandas-dev/pandas/pull/31359
    # This is currently failing to cast back to Int64Dtype.
    # The presence of the NA causes two problems
    # 1. NA is not an instance of Int64Dtype.type (numpy.int64)
    # 2. The presence of an NA forces object type, so the non-NA value is
    #    a Python int rather than a NumPy int64. Python ints aren't
    #    instances of numpy.int64.
    def aggfunc(x):
        if all(x > 2):
            return 1
        else:
            return pd.NA

    df = pd.DataFrame({"A": pd.array([1, 2, 3])})
    result = df.groupby([1, 1, 2]).agg(aggfunc)
    expected = pd.DataFrame({"A": pd.array([1, pd.NA], dtype="Int64")}, index=[1, 2])
    tm.assert_frame_equal(result, expected)
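
The two problems listed in the test's comment can be reproduced without groupby at all. The following is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A UDF that sometimes returns pd.NA ends up in an object-dtype array.
results = np.array([1, pd.NA], dtype=object)
scalar_type = pd.Int64Dtype().type  # numpy.int64

# The non-NA value is a plain Python int, not a numpy.int64.
first_is_python_int = type(results[0]) is int
# Neither the Python int nor pd.NA is an instance of numpy.int64.
matches = [isinstance(value, scalar_type) for value in results]
```

Because no element passes the scalar check, the result is left as object dtype, which is why the test is marked as an expected failure.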


class TestLambdaMangling:
    def test_basic(self):
        df = pd.DataFrame({"A": [0, 0, 1, 1], "B": [1, 2, 3, 4]})
34 changes: 34 additions & 0 deletions pandas/tests/groupby/test_categorical.py
@@ -1342,3 +1342,37 @@ def test_series_groupby_categorical_aggregation_getitem():
    result = groups["foo"].agg("mean")
    expected = groups.agg("mean")["foo"]
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
    "func, expected_values",
    [(pd.Series.nunique, [1, 1, 2]), (pd.Series.count, [1, 2, 2])],
)
def test_groupby_agg_categorical_columns(func, expected_values):
    # 31256
    df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3, 4],
            "groups": [0, 1, 1, 2, 2],
            "value": pd.Categorical([0, 0, 0, 0, 1]),
        }
    ).set_index("id")
    result = df.groupby("groups").agg(func)

    expected = pd.DataFrame(
        {"value": expected_values}, index=pd.Index([0, 1, 2], name="groups"),
    )
    tm.assert_frame_equal(result, expected)


def test_groupby_agg_non_numeric():
    df = pd.DataFrame(
        {"A": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])}
    )
    expected = pd.DataFrame({"A": [2, 1]}, index=[1, 2])

    result = df.groupby([1, 2, 1]).agg(pd.Series.nunique)
    tm.assert_frame_equal(result, expected)

    result = df.groupby([1, 2, 1]).nunique()
    tm.assert_frame_equal(result, expected)
4 changes: 3 additions & 1 deletion pandas/tests/resample/test_datetime_index.py
@@ -122,7 +122,9 @@ def test_resample_integerarray():

     result = ts.resample("3T").mean()
     expected = Series(
-        [1, 4, 7], index=pd.date_range("1/1/2000", periods=3, freq="3T"), dtype="Int64"
+        [1, 4, 7],
+        index=pd.date_range("1/1/2000", periods=3, freq="3T"),
+        dtype="float64",
Contributor

Changed the expected dtype. I think the old test was incorrect, as mean should be float.

Contributor

Hmm, no, it tried to cast back originally (which I suppose is dubious), but yeah, we should also return float on nullable mean ops (I thought we did generally).

Member

> I think the old test was incorrect, as mean should be float.

Yes, I agree that should be float.

     )
     tm.assert_series_equal(result, expected)
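
A small sketch of why an integer result dtype is a poor fit for `mean`: with values chosen so that the bin means are not whole numbers, the result cannot be represented as integers at all. The data here is made up for illustration, and the `"min"` frequency alias is used instead of the test's `"T"`.

```python
import pandas as pd

index = pd.date_range("1/1/2000", periods=4, freq="min")
ts = pd.Series(pd.array([1, 2, 4, 5], dtype="Int64"), index=index)

# Two-minute bins hold (1, 2) and (4, 5).
means = ts.resample("2min").mean()
values = means.astype("float64").tolist()
```

The exact result dtype (NumPy `float64` or a nullable float) may depend on the pandas version, so the sketch converts explicitly before inspecting the values.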

2 changes: 1 addition & 1 deletion pandas/tests/resample/test_timedelta.py
@@ -105,7 +105,7 @@ def test_resample_categorical_data_with_timedeltaindex():
         index=pd.to_timedelta([0, 10], unit="s"),
     )
     expected = expected.reindex(["Group_obj", "Group"], axis=1)
-    expected["Group"] = expected["Group_obj"].astype("category")
+    expected["Group"] = expected["Group_obj"]
Contributor

Another test change. The expected result here is less clear I think. You could argue for either categorical or not...

Contributor

This is an API change from 0.25.x. Dunno what to do here :/

Contributor

On 0.25.x, we have the following

In [42]: df = pd.DataFrame({"A": ['a', 'b']}, dtype='category', index=pd.date_range('2000', periods=2))

In [43]: df.resample("2D").agg(lambda x: 'a')
Out[43]:
   A
0  a

In [44]: df.resample("2D").agg(lambda x: 'c')
Out[44]:
     A
0  NaN

We're potentially changing the values of the result by returning a CategoricalDtype there. I don't think that behavior is correct, so I'm in favor of making this breaking change... I think...
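
The value change described here comes from how a categorical treats values outside its categories. A minimal sketch, independent of resample:

```python
import pandas as pd

dtype = pd.CategoricalDtype(["a", "b"])

kept = pd.Categorical(["a"], dtype=dtype)
lost = pd.Categorical(["c"], dtype=dtype)  # "c" is not a category

kept_values = kept.tolist()
lost_is_missing = bool(lost.isna().all())
```

Casting an aggregation result of `'c'` back to this dtype therefore replaces the value with a missing value instead of raising.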

Member

> We're potentially changing the values of the result by returning a CategoricalDtype there. I don't think that behavior is correct, so I'm in favor of making this breaking change... I think...

That seems exactly the bug we were fixing initially. For groupby it was a regression, so fine to do it as a fix for resample I think.

Member

I see now that I commented on this in two different places (also in the whatsnew), with contradicting messages ...

I don't know what is best for now in this PR, but in general: if we are going to take the rule of "only trying to cast back if the scalars are of the dtype.type", then I think we should try to cast to categorical (meaning, we should see the string "a" as a valid scalar of a categorical dtype with string categories). The specific case of strings is a bit difficult of course, as also the categories' dtype is object and can hold anything.

Now, long term, I am thinking we should maybe not use this "scalar of dtype.type" rule (we can say that to explain it, but not as actual check in the code). But we could rather dictate that _from_sequence should be strict. And then it's _from_sequence that can hold the logic to know if it's a proper scalar (eg to know that pd.NA is also fine, or to know that "a" is a proper scalar for a categorical with categories that include "a", or knowing that a python int is also fine in addition to np.int64 scalar for an IntDtype, etc).
For the specific case of categorical, it could then fail if it's getting scalars (eg "c" in the example above) which cannot be a scalar of its dtype (since the categories are part of the dtype, it can know "c" is not a valid scalar in that case).

You will still get variable behaviour depending on what the agg function returns, eg .agg(lambda x: 'a') you get back categorical and .agg(lambda x: 'c') not, while both are simple functions that return a string. But with any rule such small differences will be unavoidable for the default behaviour (hence having a keyword to have full control might be interesting to investigate)
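
One way to picture the "strict" construction idea for categoricals is sketched below. `strict_categorical_cast` is a hypothetical helper, not an existing pandas API, and it folds the "fail, then fall back" steps into a single function for brevity.

```python
import numpy as np
import pandas as pd


def strict_categorical_cast(values, dtype):
    """Hypothetical strict constructor; not an existing pandas API."""
    valid = all(pd.isna(v) or v in dtype.categories for v in values)
    if valid:
        return pd.Categorical(values, dtype=dtype)
    # Fall back to object dtype instead of silently producing NaN.
    return np.asarray(values, dtype=object)


dtype = pd.CategoricalDtype(["a", "b"])
cast_ok = strict_categorical_cast(["a"], dtype)
cast_fallback = strict_categorical_cast(["c"], dtype)
```

Under this sketch `'a'` comes back as a categorical while `'c'` is kept as an object value, which matches the variable behaviour the comment anticipates.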

Member

Sorry for the long comment here :) That should probably be moved to the follow-up issue.
About what is best for this PR, I am not fully sure. I suppose what you have now is fine. The isinstance(result[0], dtype.type) check just doesn't really work for categoricals ..

Contributor

Sorry, missed this earlier. I'm writing up the followup issue now, but will address this there.

     tm.assert_frame_equal(result, expected)

