-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: regression when applying groupby aggregation on categorical columns #31359
Changes from all commits
7e461a1
1314059
8bcb313
24c3ede
dea38f2
cd9e7ac
e5e912b
d0cddb3
4e1abde
0b917c6
e9cac5d
e03357e
a1df393
8743c47
1e10d71
905b3a5
c5d670b
916d9b2
2208fc2
86a254c
c36d97b
3f8ea8f
a6ad1a2
c7daa46
c4ebfa9
bbad886
a6a498e
2a3f5a2
ceef95e
ed91cc1
9c7af0f
ca35648
1b826bb
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -813,9 +813,10 @@ def _try_cast(self, result, obj, numeric_only: bool = False): | |
# datetime64tz is handled correctly in agg_series, | ||
# so is excluded here. | ||
|
||
# return the same type (Series) as our caller | ||
cls = dtype.construct_array_type() | ||
result = try_cast_to_ea(cls, result, dtype=dtype) | ||
if len(result) and isinstance(result[0], dtype.type): | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @charlesdong1991 it looks like this change caused the regression in #32194 |
||
cls = dtype.construct_array_type() | ||
result = try_cast_to_ea(cls, result, dtype=dtype) | ||
|
||
elif numeric_only and is_numeric_dtype(dtype) or not numeric_only: | ||
result = maybe_downcast_to_dtype(result, dtype) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -543,6 +543,17 @@ def _cython_operation( | |
if mask.any(): | ||
result = result.astype("float64") | ||
result[mask] = np.nan | ||
elif ( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ugly! But this ensures that I briefly tried avoiding the cast from Int64 -> float with NaN, but that wasn't feasible today. Our Cython reductions like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO comment suggesting this get cleaned up, pointing back to this thread? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Going to open an issue and will point here. |
||
how == "add" | ||
and is_integer_dtype(orig_values.dtype) | ||
and is_extension_array_dtype(orig_values.dtype) | ||
): | ||
# We need this to ensure that Series[Int64Dtype].resample().sum() | ||
# remains int64 dtype. | ||
# Two options for avoiding this special case | ||
# 1. mask-aware ops and avoid casting to float with NaN above | ||
# 2. specify the result dtype when calling this method | ||
result = result.astype("int64") | ||
|
||
if kind == "aggregate" and self._filter_empty_groups and not counts.all(): | ||
assert result.ndim != 2 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -122,7 +122,9 @@ def test_resample_integerarray(): | |
|
||
result = ts.resample("3T").mean() | ||
expected = Series( | ||
[1, 4, 7], index=pd.date_range("1/1/2000", periods=3, freq="3T"), dtype="Int64" | ||
[1, 4, 7], | ||
index=pd.date_range("1/1/2000", periods=3, freq="3T"), | ||
dtype="float64", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed the expected dtype. I think the old test was incorrect, as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm, no it tries to cast back originally (which i suppose is dubious), but yeah we should also return float on nullable mean ops (thought we did genreally) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that should be float. |
||
) | ||
tm.assert_series_equal(result, expected) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -105,7 +105,7 @@ def test_resample_categorical_data_with_timedeltaindex(): | |
index=pd.to_timedelta([0, 10], unit="s"), | ||
) | ||
expected = expected.reindex(["Group_obj", "Group"], axis=1) | ||
expected["Group"] = expected["Group_obj"].astype("category") | ||
expected["Group"] = expected["Group_obj"] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another test change. The expected result here is less clear I think. You could argue for either categorical or not... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is an API change from 0.25.x. Dunno what to do here :/ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On 0.25.x, we have the following In [42]: df = pd.DataFrame({"A": ['a', 'b']}, dtype='category', index=pd.date_range('2000', periods=2))
In [43]: df.resample("2D").agg(lambda x: 'a')
Out[43]:
A
0 a
In [44]: df.resample("2D").agg(lambda x: 'c')
Out[44]:
A
0 NaN We're potentially changing the values of the result by returning a CategoricalDtype there. I don't think that behavior is correct, so I'm in favor of making this breaking change... I think... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That seems exactly the bug we were fixing initially. For groupby it was a regression, so fine to do it as a fix for resample I think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see now that I commented about in two different places (also in the whatsnew), with a contradicting message ... I don't know what is best for now in this PR, but in general: if we are going to take the rule of "only trying to cast back if the scalars are of the dtype.type", then I think we should try to cast to categorical (meaning, we should see the string "a" as a valid scalar of a categorical dtype with string categories). The specific case of strings is a bit difficult of course, as also the categories' dtype is object and can hold anything. Now, long term, I am thinking we should maybe not use this "scalar of dtype.type" rule (we can say that to explain it, but not as actual check in the code). But we could rather dictate that You will still get variable behaviour depending on what the agg function returns, eg There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry for the long comment here :) That should probably be moved to the follow-up issue. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, missed this earlier. I'm writing up the followup issue now, but will address this there. |
||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will this return? Not a categorical?
I am not fully sure that is correct as well. Eg if you do a first-like aggregation
lambda x: x[0]
, you would expect this to work...(but of course any heuristic will never be correct in all cases .. for full control in cases like this we will need an additional keyword)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it will be object dtype. That's the API breaking change. But, I'm OK with it because
df.groupby([1, 1]).agg(lambda x: 'a').A.dtype
is object.