BUG: regression when applying groupby aggregation on categorical columns #31359
Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-01-29 19:30:04 UTC
@jbrockmendel did you have thoughts on this approach?
pandas/core/arrays/base.py
Outdated
@@ -47,7 +47,7 @@ def try_cast_to_ea(cls_or_instance, obj, dtype=None):
ExtensionArray or obj
"""
try:
- result = cls_or_instance._from_sequence(obj, dtype=dtype)
+ result = cls_or_instance._from_sequence(obj, dtype=dtype.name)
why is this change necessary?
pandas/core/groupby/groupby.py
Outdated
if (
is_extension_array_dtype(dtype)
and dtype.kind != "M"
):
Up until a moment ago this had a length check; without that check this looks like it isn't actually changing anything.
In the previous commit I did add a length check, but I do not feel it was the right solution. The issue is that dtype preserves its categories @jbrockmendel
So dtype.categories preserves the original value, in this case [0, 1], as its categories. However, after aggregation we should NOT use the original categories, but the new ones based on the result.
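A minimal sketch of the problem described above, using only the public `pd.Categorical` constructor rather than the internal `try_cast_to_ea` path. The `[0, 1]` categories come from the comment; the aggregated values `"a"` are an assumption for illustration.

```python
import pandas as pd

# Original column: a categorical whose categories are [0, 1].
original = pd.Categorical([0, 1], categories=[0, 1])
stale_dtype = original.dtype

# Assumed aggregation result: values that are not among the original categories.
aggregated = ["a", "a"]

# Casting back with the stale dtype silently turns every value into NaN.
recast = pd.Categorical(aggregated, dtype=stale_dtype)
n_missing = int(recast.isna().sum())

# Letting pandas infer the categories from the result keeps the values.
inferred = pd.Categorical(aggregated)
inferred_categories = list(inferred.categories)
```

Here `n_missing` counts the values lost by reusing the old categories, while `inferred_categories` shows what the result would carry if the categories were rebuilt from the aggregated data.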
ATM just confusion. Pending the CI, I'll take @charlesdong1991's word for it that just changing
Mmm, this fix doesn't look quite right. The expected dtype of Will think about this a bit.
@TomAugspurger for the whatsnew note you are now focussing on resample. Can you explain a bit why there is such a change for resample and not for groupby? I don't see much resample-specific change in the diff.
Will double check, but the behavior change from 0.25.x is just for resample. The change from master affects both. Make sense?
.. ipython:: python

   df.resample("2D").agg(lambda x: 'a').A.dtype
What will this return? Not a categorical?
I am not fully sure that is correct either. E.g. if you do a first-like aggregation lambda x: x[0], you would expect this to work...
(but of course any heuristic will never be correct in all cases... for full control in cases like this we will need an additional keyword)
Right, it will be object dtype. That's the API breaking change. But, I'm OK with it because
- It matches the behavior of groupby on 0.25.3. df.groupby([1, 1]).agg(lambda x: 'a').A.dtype is object.
- It's incorrect (or at least surprising?) when the dtype can't hold the values (the next example where you get NaN values).
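One way to picture the trade-off in this thread: restore the original dtype only when it can hold every aggregated value, and fall back to object otherwise. The helper below, `restore_dtype_if_lossless`, is hypothetical and is not what pandas does internally; it is a sketch built on public API only.

```python
import numpy as np
import pandas as pd


def restore_dtype_if_lossless(values, dtype):
    """Hypothetical helper: cast back to ``dtype`` only if no value would be lost."""
    fits = all(pd.isna(v) or v in dtype.categories for v in values)
    if fits:
        return pd.Categorical(values, dtype=dtype)
    # The dtype cannot hold the values, so keep them as plain objects.
    return np.asarray(values, dtype=object)


dtype = pd.CategoricalDtype(categories=[0, 1])

# First-like aggregation: results come from the original categories.
first_like = restore_dtype_if_lossless([0, 1], dtype)

# Constant-string aggregation: results fall outside the categories.
constant = restore_dtype_if_lossless(["a", "a"], dtype)
```

Under a heuristic like this, an aggregation such as `lambda x: x[0]` would keep its categorical dtype while `lambda x: 'a'` would come back as object, though as noted earlier in the thread no heuristic is right in every case.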
Yes, that's how I understood it. I just find it strange from looking at the diff ;)
Thanks for fixing and pushing the release! @TomAugspurger 👍 Maybe I should not disrupt you, I just saw the changes you've made and maybe it's nicer to also add
Fixed, thanks!
Good to merge when CI passes? I think this is the final thing before 1.0.
Will take another look now
No objections to merging, agreed that some follow-up discussion is necessary on _from_sequence
All green. Merging. Thanks for kicking this off @charlesdong1991!
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement.
Handling the backport.
Thanks for your fix and help to release 1.0 for community! ALL credit to you!! 👍
# return the same type (Series) as our caller
cls = dtype.construct_array_type()
result = try_cast_to_ea(cls, result, dtype=dtype)
if len(result) and isinstance(result[0], dtype.type):
@charlesdong1991 it looks like this change caused the regression in #32194
08087d6 is the first bad commit
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff