BUG: regression when applying groupby aggregation on categorical columns #31359

charlesdong1991 · 2020-01-27T19:32:45Z

closes Regression bugs when applying GroupBy Aggregations to Categorical columns #31256
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-01-27T19:32:50Z

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-29 19:30:04 UTC

TomAugspurger · 2020-01-28T20:57:58Z

@jbrockmendel did you have thoughts on this approach?

jbrockmendel · 2020-01-28T21:00:11Z

pandas/core/arrays/base.py

@@ -47,7 +47,7 @@ def try_cast_to_ea(cls_or_instance, obj, dtype=None):
    ExtensionArray or obj
    """
    try:
-        result = cls_or_instance._from_sequence(obj, dtype=dtype)
+        result = cls_or_instance._from_sequence(obj, dtype=dtype.name)


why is this change necessary?

jbrockmendel · 2020-01-28T21:00:49Z

pandas/core/groupby/groupby.py

+            if (
+                is_extension_array_dtype(dtype)
+                and dtype.kind != "M"
+            ):


Up until a moment ago this had a length check; without that check, this looks like it isn't actually changing anything.

In the previous commit I did add a length check, but I do not feel it was the right solution. The issue is that dtype preserves its categories. @jbrockmendel

So dtype.categories keeps the original values as its categories, in this case [0, 1]. After aggregation, however, we should NOT use the original categories, but new ones based on the result.
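To illustrate the point about retained categories, here is a minimal sketch. It uses the public `Series.astype` instead of the private `_from_sequence` call from the diff (that substitution is an assumption made for illustration), and the values `[2, 2]` stand in for a hypothetical aggregation result.

```python
import pandas as pd

# The dtype of the original column, with its categories fixed at [0, 1].
dtype = pd.CategoricalDtype(categories=[0, 1])

# Hypothetical aggregation result whose values are not among the
# original categories.
agg_result = pd.Series([2, 2])

# Casting with the full dtype keeps the old categories, so the new values
# cannot be represented and become missing.
with_dtype = agg_result.astype(dtype)

# Casting with only the dtype's name ("category") infers fresh categories
# from the values themselves.
with_name = agg_result.astype(dtype.name)

print(dtype.name)                      # category
print(with_dtype.isna().all())         # True
print(list(with_name.cat.categories))  # [2]
```

This would explain why passing `dtype.name` made the new test pass, although it does not settle whether the result should be categorical at all.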

jbrockmendel · 2020-01-28T21:03:17Z

did you have thoughts on this approach?

ATM just confusion. Pending the CI, I'll take @charlesdong1991 's word for it that just changing dtype=dtype to dtype=dtype.name fixes the new test, but it isn't at all clear to me why that would work.

TomAugspurger · 2020-01-28T23:00:16Z

Mmm, this fix doesn't look quite right.

The expected dtype of SeriesGroupBy[categorical].count() should be int64, not categorical. IIUC, this change returns a Categorical whose categories are the counts, which I don't think we want. It seems to me like we shouldn't be trying to cast back to an EA here. We know that count shouldn't cast back, but for arbitrary user-defined functions we don't know that.

Will think about this a bit.
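The expectation for `count` can be checked with a small sketch like the one below. The frame and its column names (`key`, `cat`) are made up for illustration, and the behavior shown assumes a pandas release that includes this fix.

```python
import pandas as pd

# Hypothetical data: a plain grouping key and a categorical column.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "cat": pd.Categorical(["x", "y", "x"]),
    }
)

# count() reports how many non-missing values each group has, so the
# result is an integer Series regardless of the column's dtype.
counts = df.groupby("key")["cat"].count()

print(counts.dtype)     # int64
print(counts.tolist())  # [2, 1]
```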

jorisvandenbossche · 2020-01-29T18:49:37Z

@TomAugspurger for the whatsnew note you are now focusing on resample. Can you explain a bit why there is such a change for resample and not for groupby? I don't see much resample-specific change in the diff.

TomAugspurger · 2020-01-29T18:51:26Z

Will double check, but the behavior change from 0.25.x is just for resample. The change from master affects both. Make sense?

jorisvandenbossche · 2020-01-29T18:53:05Z

doc/source/whatsnew/v1.0.0.rst

+.. ipython:: python
+
+   df.resample("2D").agg(lambda x: 'a').A.dtype
+


What will this return? Not a categorical?
I am not fully sure that is correct either. E.g. if you do a first-like aggregation lambda x: x[0], you would expect this to work...

(But of course no heuristic will ever be correct in all cases; for full control in cases like this we will need an additional keyword.)

Right, it will be object dtype. That's the API-breaking change. But I'm OK with it, for two reasons:

It matches the behavior of groupby on 0.25.3: df.groupby([1, 1]).agg(lambda x: 'a').A.dtype is object.

Casting back is incorrect (or at least surprising?) when the dtype can't hold the values (see the next example, where you get NaN values).
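A minimal sketch of the groupby case mentioned above. The frame is made up for illustration, and the exact non-categorical dtype of the result may depend on the pandas version, so only the absence of a categorical dtype is checked here.

```python
import pandas as pd

# Hypothetical frame with a single categorical column "A".
df = pd.DataFrame({"A": pd.Categorical(["x", "y"], categories=["x", "y"])})

# The aggregation returns "a" for every group, which is not one of the
# column's categories.
result = df.groupby([1, 1]).agg(lambda x: "a")

result_dtype = result.A.dtype

# With this change the result is not cast back to the categorical dtype.
print(isinstance(result_dtype, pd.CategoricalDtype))
print(result.A.tolist())
```

Had the result been forced into the original categorical dtype, the value "a" could not be represented and would show up as NaN.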

jorisvandenbossche · 2020-01-29T18:54:18Z

The behavior change from 0.25.x is just for resample. The change from master affects both. Make sense?

Yes, that's how I understood it. I just find it strange from looking at the diff ;)

pandas/tests/groupby/test_categorical.py

charlesdong1991 · 2020-01-29T18:57:39Z

Thanks for fixing and pushing the release! @TomAugspurger 👍

Maybe I should not disrupt you, I just saw the changes you've made and maybe it's nicer to also add *pandas 1.0.0* above line 670 like you did on line 651 ^^

TomAugspurger · 2020-01-29T19:05:39Z

nicer to also add pandas 1.0.0 above line 670 like you did on line 651 ^^

Fixed, thanks!

TomAugspurger · 2020-01-29T19:27:06Z

Good to merge when CI passes? I think this is the final thing before 1.0.

jbrockmendel · 2020-01-29T19:27:37Z

Good to merge when CI passes?

Will take another look now

jbrockmendel · 2020-01-29T19:32:08Z

No objections to merging, agreed that some follow-up discussion is necessary on _from_sequence

TomAugspurger · 2020-01-29T20:20:43Z

All green. Merging.

Thanks for kicking this off @charlesdong1991!

lumberbot-app · 2020-01-29T20:21:05Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.0.x
$ git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

$ git cherry-pick -m1 06ef193a5c1957c0a76e3e88bc7b834b38972c39

You will likely have some merge/cherry-pick conflict here, fix them and commit:

$ git commit -am 'Backport PR #31359: BUG: regression when applying groupby aggregation on categorical columns'

Push to a named branch:

$ git push YOURFORK 1.0.x:auto-backport-of-pr-31359-on-1.0.x

Create a PR against branch 1.0.x, I would have named this PR:

"Backport PR #31359 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

TomAugspurger · 2020-01-29T20:23:22Z

Handling the backport.

charlesdong1991 · 2020-01-29T21:10:47Z

Thanks for your fix and for helping to release 1.0 for the community! ALL credit to you!! 👍

simonjayhawkins · 2020-03-27T10:49:45Z

pandas/core/groupby/groupby.py

-                # return the same type (Series) as our caller
-                cls = dtype.construct_array_type()
-                result = try_cast_to_ea(cls, result, dtype=dtype)
+                if len(result) and isinstance(result[0], dtype.type):


@charlesdong1991 it looks like this change caused the regression in #32194

simonjayhawkins · 2020-03-27T13:32:46Z

I’m not a huge fan of the current fix - do we know where the regression was introduced?

I'm not sure, but I think it's some combination of https://github.com/pandas-dev/pandas/pull/24488/files#diff-6d2a4be351f3d7ec35b9c6696a615242L771, and calling _try_cast_result in more places.

08087d6 is the first bad commit
commit 08087d6
Author: jbrockmendel jbrockmendel@gmail.com
Date: Wed Nov 6 13:25:06 2019 -0800

BUG: fix TypeErrors raised within _python_agg_general (#29425)

charlesdong1991 added 9 commits December 3, 2018 17:43

remove \n from docstring

7e461a1

fix conflicts

1314059

Merge remote-tracking branch 'upstream/master'

8bcb313

Merge remote-tracking branch 'upstream/master'

24c3ede

fix issue 17038

dea38f2

revert change

cd9e7ac

revert change

e5e912b

Merge remote-tracking branch 'upstream/master' into fix_issue_31256

d0cddb3

try fix 31256

4e1abde

charlesdong1991 added 2 commits January 27, 2020 20:33

pep8

0b917c6

fix test

e9cac5d

fix

e03357e

jbrockmendel reviewed Jan 28, 2020

fix up tests

a1df393

charlesdong1991 marked this pull request as ready for review January 28, 2020 21:08

charlesdong1991 added 8 commits January 28, 2020 22:23

preserve order

8743c47

add comment

1e10d71

better test

905b3a5

fixup

c5d670b

remove blank line

916d9b2

fix test

2208fc2

style

86a254c

linting

c36d97b

TomAugspurger closed this Jan 28, 2020

TomAugspurger reopened this Jan 28, 2020

jorisvandenbossche reviewed Jan 29, 2020

pandas/tests/groupby/test_categorical.py (outdated)

fixup

9c7af0f

TomAugspurger added 2 commits January 29, 2020 13:06

fixup

ca35648

fixup

1b826bb

TomAugspurger merged commit 06ef193 into pandas-dev:master Jan 29, 2020

lumberbot-app bot added the Still Needs Manual Backport label Jan 29, 2020

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jan 29, 2020

BUG: regression when applying groupby aggregation on categorical colu…

3c719f2

…mns (pandas-dev#31359)

TomAugspurger mentioned this pull request Jan 29, 2020

BUG: regression when applying groupby aggregation on categorical colu… #31428

Merged

TomAugspurger added a commit that referenced this pull request Jan 29, 2020

BUG: regression when applying groupby aggregation on categorical colu…

bfa2b9e

…mns (#31359) (#31428) Co-authored-by: Kaiqi Dong <kaiqidong1991@gmail.com>

simonjayhawkins removed the Still Needs Manual Backport label Jan 30, 2020

charlesdong1991 mentioned this pull request Feb 1, 2020

BUG: groupby().agg fails on categorical column #31470

Closed

5 tasks

simonjayhawkins reviewed Mar 27, 2020

simonjayhawkins mentioned this pull request Mar 28, 2020

BUG: Don't cast nullable Boolean to float in groupby #33089

Merged

6 tasks

This was referenced May 6, 2020

Extension dtypes are modified when performing groupby aggregate with agg dict #32194

Closed

BUG: Fix min_count issue for groupby.sum #32914

Merged

TomAugspurger mentioned this pull request Oct 29, 2020

ENH: improve the resulting dtype for groupby operations on nullable dtypes #37494

Closed
