Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: regression when applying groupby aggregation on categorical columns #31359

Merged
merged 33 commits into from
Jan 29, 2020

Conversation

charlesdong1991
Copy link
Member

@pep8speaks
Copy link

pep8speaks commented Jan 27, 2020

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-29 19:30:04 UTC

@TomAugspurger
Copy link
Contributor

@jbrockmendel did you have thoughts on this approach?

@@ -47,7 +47,7 @@ def try_cast_to_ea(cls_or_instance, obj, dtype=None):
ExtensionArray or obj
"""
try:
result = cls_or_instance._from_sequence(obj, dtype=dtype)
result = cls_or_instance._from_sequence(obj, dtype=dtype.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change necessary?

if (
is_extension_array_dtype(dtype)
and dtype.kind != "M"
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up until a moment ago this had a length check; without that check this looks like it isnt actually changing anything

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the previous commit, i did add a length check, but i do not feel it was right solution. the issue is dtype preserves its categories @jbrockmendel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so dtype.categories preserves the original value, in this case, [0, 1] as its categories, however, after aggregation, we should NOT use the original categories, but the new one based on the result

@jbrockmendel
Copy link
Member

did you have thoughts on this approach?

ATM just confusion. Pending the CI, I'll take @charlesdong1991 's word for it that just changing dtype=dtype to dtype=dtype.name fixes the new test, but it isn't at all clear to me why that would work.

@charlesdong1991 charlesdong1991 marked this pull request as ready for review January 28, 2020 21:08
@TomAugspurger
Copy link
Contributor

Mmm, this fix doesn't look quite right.

The expected dtype of SeriesGroupBy[categorical].count() should be int64, not categorical. IIUC, this change returning a Categorical whose categories are the counts, which I don't think we want. It seems to me like we shouldn't be trying to cast back to an EA here. We know that count shouldn't cast back, but for arbitrary user-defined functions we don't know that.

Will think about this a bit.

@jorisvandenbossche
Copy link
Member

@TomAugspurger for the whatsnew note you are now focussing on resample. Can you explain a bit how it comes that there is such a change for resample and not for groupby? I don't see much resample specific change in the diff?

@TomAugspurger
Copy link
Contributor

Will double check, but the behavior change from 0.25.x is just for resample. The change from master affects both. Make sense?

.. ipython:: python

df.resample("2D").agg(lambda x: 'a').A.dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will this return? Not a categorical?
I am not fully sure that is correct as well. Eg if you do a first-like aggregation lambda x: x[0], you would expect this to work...

(but of course any heuristic will never be correct in all cases .. for full control in cases like this we will need an additional keyword)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it will be object dtype. That's the API breaking change. But, I'm OK with it because

  1. It matches the behavior of groupby on 0.25.3. df.groupby([1, 1]).agg(lambda x: 'a').A.dtype is object.
  2. It's incorrect (or at least surprising?) when the dtype can't hold the values (the next example where you get NaN values).

@jorisvandenbossche
Copy link
Member

he behavior change from 0.25.x is just for resample. The change from master affects both. Make sense?

Yes, that's how I understood it. I just find it strange from looking at the diff ;)

@charlesdong1991
Copy link
Member Author

Thanks for fixing and pushing the release! @TomAugspurger 👍

Maybe I should not disrupt you, I just saw the changes you've made and maybe it's nicer to also add *pandas 1.0.0* above line 670 like you did on line 651 ^^

@TomAugspurger
Copy link
Contributor

nicer to also add pandas 1.0.0 above line 670 like you did on line 651 ^^

Fixed, thanks!

@TomAugspurger
Copy link
Contributor

Good to merge when CI passes? I think this is the final thing before 1.0.

@jbrockmendel
Copy link
Member

Good to merge when CI passes?

Will take another look now

@jbrockmendel
Copy link
Member

No objections to merging, agreed that some follow-up discussion is necessary on _from_sequence

@TomAugspurger
Copy link
Contributor

All green. Merging.

Thanks for kicking this off @charlesdong1991!

@TomAugspurger TomAugspurger merged commit 06ef193 into pandas-dev:master Jan 29, 2020
@lumberbot-app
Copy link

lumberbot-app bot commented Jan 29, 2020

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.0.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 06ef193a5c1957c0a76e3e88bc7b834b38972c39
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #31359: BUG: regression when applying groupby aggregation on categorical columns'
  1. Push to a named branch :
git push YOURFORK 1.0.x:auto-backport-of-pr-31359-on-1.0.x
  1. Create a PR against branch 1.0.x, I would have named this PR:

"Backport PR #31359 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

@TomAugspurger
Copy link
Contributor

Handling the backport.

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jan 29, 2020
TomAugspurger added a commit that referenced this pull request Jan 29, 2020
…mns (#31359) (#31428)

Co-authored-by: Kaiqi Dong <kaiqidong1991@gmail.com>
@charlesdong1991
Copy link
Member Author

Thanks for your fix and help to release 1.0 for community! ALL credit to you!! 👍

# return the same type (Series) as our caller
cls = dtype.construct_array_type()
result = try_cast_to_ea(cls, result, dtype=dtype)
if len(result) and isinstance(result[0], dtype.type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesdong1991 it looks like this change caused the regression in #32194

@simonjayhawkins
Copy link
Member

I’m not a huge fan of the current fix - do we know where the regression was introduced?

I'm not sure, but I think it's some combination of https://github.com/pandas-dev/pandas/pull/24488/files#diff-6d2a4be351f3d7ec35b9c6696a615242L771, and calling _try_cast_result in more places.

08087d6 is the first bad commit
commit 08087d6
Author: jbrockmendel jbrockmendel@gmail.com
Date: Wed Nov 6 13:25:06 2019 -0800

BUG: fix TypeErrors raised within _python_agg_general (#29425)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression bugs when applying GroupBy Aggregations to Categorical columns
8 participants