Group by a categorical Series of unequal length #44180

stanwest · 2021-10-25T14:17:49Z

closes BUG: Should be able to group by a categorical Series of unequal length #44179
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

While this fix enables grouping by a Series that has a categorical data type and length unequal to the axis of grouping, it maintains the requirement (developed in #3017, #9741, and #18525) that the length of a Categorical object used for grouping must match that of the axis of grouping. Additionally, this fix checks that requirement consistently with other groupers whose lengths must match—namely list, tuple, Index, and numpy.ndarray—and produces the same exception message when the lengths differ.
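As a minimal sketch of the behavior described above (assuming pandas 1.4 or later, where this fix is available; the variable names are illustrative and not taken from the PR), a categorical Series grouper is aligned to the index of the grouped object, whereas a bare Categorical of a different length is still expected to be rejected:

```python
import pandas as pd

ser = pd.Series(range(6))

# Categorical Series grouper with one more element (7) than `ser` (6).
grouper = pd.Series([*"ABBA"] + 3 * [pd.NA], dtype="category")

# A Series grouper is aligned to ser.index before grouping, so the extra
# trailing label is simply not used. Missing keys are dropped by default.
totals = ser.groupby(grouper, observed=True).sum()
totals_dict = totals.to_dict()  # A: 0 + 3, B: 1 + 2

# A bare Categorical carries no index to align on, so its length must
# still match the axis of grouping.
try:
    ser.groupby(grouper.array, observed=True).sum()
except ValueError as exc:
    categorical_error = exc
else:
    categorical_error = None

print(totals_dict)
print(repr(categorical_error))
```

Passing `observed=True` is only a choice made for this sketch to keep the result independent of the default for that argument; both categories are observed here either way.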

stanwest · 2021-10-26T13:47:55Z

I gather that the two cancelled checks are also cancelled for other recent commits. What might have caused the failures of "CI / Checks / Testing docstring validation script" and "Database / Linux_py38_IO / Upload coverage to Codecov"?

jreback · 2021-11-01T23:29:57Z

cc @rhshadrach if any comments

rhshadrach

Thanks for the PR! Generally looks good - some minor requests below. Also, I think we should add a test where .groupby(..., dropna=False).

stanwest · 2021-11-10T14:23:08Z

> Thanks for the PR! Generally looks good - some minor requests below.

Thanks for the appreciation! I believe that I've addressed the minor requests.

> Also, I think we should add a test where .groupby(..., dropna=False).

Although I'm open to hearing more about what you have in mind, to me the dropna functionality seems beyond the scope of the bug and fix. That is, the bug and fix concern whether the groupby method either rejects a Series grouper solely because it has a categorical data type or accepts and reindexes it, and that precedes forming groups and deciding whether to drop any.

rhshadrach · 2021-11-11T22:32:18Z

@stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases.

jreback · 2021-11-12T03:12:03Z

> @stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases.

agree

jreback · 2021-11-28T20:59:46Z

can you merge master and address the open comment

stanwest · 2021-12-09T01:01:39Z

> @stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases.

I might otherwise agree, but I've found the behavior of pandas in related cases sometimes to be inconsistent with the documentation or otherwise surprising. Consider the following demonstration:

import pandas as pd

pd.options.display.max_colwidth = None

def outcome(dtype, uneq_len, uneq_index, dropna, operation):
    """Return the result or exception from a groupby operation."""
    ser = pd.Series(range(6))
    grouper = pd.Series([*"ABBA"] + (3 if uneq_len else 2) * [pd.NA], dtype=dtype)
    if uneq_index:
        grouper = grouper.reindex(grouper.index - 1)
    try:
        val = operation(ser.groupby(grouper, dropna=dropna))
    except ValueError as exc:
        val = exc
    return val

def summary(obj):
    """Return a summary representation of the given object."""
    if isinstance(obj, pd.Series):
        return (
            f"{type(obj).__name__}"
            f"({obj.to_dict()}, index={type(obj.index).__name__}(...))"
        )
    else:
        return repr(obj)

pd.Series(
    {
        (oper_name, dtype, dropna, uneq_len, uneq_index): summary(
            outcome(dtype, uneq_len, uneq_index, dropna, operation)
        )
        for oper_name, operation in {
            "groups": lambda gby: gby.groups,
            "agg(list)": lambda gby: gby.aggregate(list),
        }.items()
        for dtype in ("object", "category")
        for dropna in (False, True)
        for uneq_len in (False, True)
        for uneq_index in (False, True)
    },
).rename_axis(index=["operation", "dtype", "dropna", "uneq_len", "uneq_index"])

It groups the same series by another series of object or categorical data type, equal or unequal length, and equal or unequal index, either keeping or dropping missing keys, and then either produces the groups attribute or aggregates each group with list and represents the result or exception. It constructs the grouper such that group "A" should obtain 0 and 3; group "B", 1 and 2; and any group for missing values in the grouper, 4 and 5. (Although groups obtains the indices and aggregate obtains the values, the two are equal for the series being grouped to aid inspection.) With the present PR applied, it produces the following output, the rows of which I've manually numbered:

    operation  dtype     dropna  uneq_len  uneq_index
 1  groups     object    False   False     False                       ValueError('Categorical categories cannot be null')
 2                                         True                        ValueError('Categorical categories cannot be null')
 3                               True      False                       ValueError('Categorical categories cannot be null')
 4                                         True                        ValueError('Categorical categories cannot be null')
 5                       True    False     False                                                {'A': [0, 3], 'B': [1, 2]}
 6                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
 7                               True      False                                                {'A': [0, 3], 'B': [1, 2]}
 8                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
 9             category  False   False     False                                                {'A': [0, 3], 'B': [1, 2]}
10                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
11                               True      False                                                {'A': [0, 3], 'B': [1, 2]}
12                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
13                       True    False     False                                                {'A': [0, 3], 'B': [1, 2]}
14                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
15                               True      False                                                {'A': [0, 3], 'B': [1, 2]}
16                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
17  agg(list)  object    False   False     False         Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
18                                         True          Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
19                               True      False         Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
20                                         True          Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
21                       True    False     False                      Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
22                                         True                       Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
23                               True      False                      Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
24                                         True                       Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
25             category  False   False     False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
26                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
27                               True      False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
28                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
29                       True    False     False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
30                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
31                               True      False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
32                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))

The present PR affects the cases with categorical data type and unequal length (namely rows 11, 12, 15, 16, 27, 28, 31, and 32): Instead of raising ValueError('Length of grouper (7) and axis (6) must be same length'), those cases return the same output as when the lengths are equal.
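A narrower sketch of just the categorical agg(list) cases (assuming pandas 1.4 or later; `make_grouper` and `results` are hypothetical names introduced here, and `observed=True` is passed only to keep the result independent of that argument's default) checks that all four combinations of length and index yield the same aggregation:

```python
import pandas as pd

ser = pd.Series(range(6))


def make_grouper(uneq_len, uneq_index):
    """Build a categorical grouper like the one in the demonstration."""
    n_missing = 3 if uneq_len else 2
    grouper = pd.Series([*"ABBA"] + n_missing * [pd.NA], dtype="category")
    if uneq_index:
        # Shift the index by one position so that it no longer equals ser.index.
        grouper = grouper.reindex(grouper.index - 1)
    return grouper


# Aggregate each group into a list for every length/index combination.
results = {
    (uneq_len, uneq_index): ser.groupby(
        make_grouper(uneq_len, uneq_index), observed=True
    )
    .agg(list)
    .to_dict()
    for uneq_len in (False, True)
    for uneq_index in (False, True)
}

for case, result in results.items():
    print(case, result)
```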

The cases with dropna=False seem not entirely correct. For the groups operation, the cases with object data type (rows 1–4) suffer from issue #35202, and those with "category" data type (rows 9–12) lack an item for missing values. I expected a result of {'A': [0, 3], 'B': [1, 2], nan: [4, 5]}. A comparable case using two identical groupings includes an item for missing values:

In [2]: pd.Series(range(6)).groupby(2 * [pd.Series([*"ABBA"] + 2 * [pd.NA], dtype="category")], dropna=False).groups
Out[2]: {('A', 'A'): [0, 3], ('B', 'B'): [1, 2], (nan, nan): [4, 5]}

Interestingly, the result is the same with dropna=True, as though the groups attribute is to express the groups prior to any filtering due to the dropna argument. If that is the case, then rows 1–16 should produce {'A': [0, 3], 'B': [1, 2], nan: [4, 5]}.

For the agg(list) operation, the cases with object data type (rows 17–20) produce what I expect, but I expected those with "category" data type (rows 25–28) to include an item for missing values (Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=CategoricalIndex(...))).

Additionally, and perhaps related to the above mix of behaviors, I've found no groupby tests that cover both grouping by categorical data types and dropna=False.
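A starting point for such coverage might look like the probe below. It is only a sketch: the helper name `null_group_retained` is hypothetical, and it deliberately reports rather than asserts the outcome for the categorical grouper, because that behavior is the one in question here and may differ between pandas versions.

```python
import pandas as pd


def null_group_retained(dtype):
    """Return whether grouping with dropna=False keeps a group for missing keys."""
    ser = pd.Series(range(6))
    grouper = pd.Series([*"ABBA"] + 2 * [pd.NA], dtype=dtype)
    sizes = ser.groupby(grouper, dropna=False, observed=True).size()
    # A retained null group appears as a missing label in the result index.
    return bool(sizes.index.isna().any())


retained = {dtype: null_group_retained(dtype) for dtype in ("object", "category")}
print(retained)
```

If the two entries differ, the categorical grouper is dropping the null group despite dropna=False.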

To summarize my view, while some behaviors were questionable before this PR and remain so, this PR removes an unnecessary ValueError and makes the behaviors more consistent.

rhshadrach · 2021-12-16T03:25:33Z

Thanks for the detailed response. I agree this PR moves in the right direction, and from your post the following two examples

import numpy as np
import pandas as pd

gb = pd.Series(range(2)).groupby([pd.Series(["A", pd.NA], dtype="category")], dropna=False)
print(gb.size())

gb = pd.Series(range(2)).groupby([pd.Series(["A", np.nan], dtype="category")], dropna=False)
print(gb.size())

already produce the incorrect result (drop the null group). Is there an issue for this? I'm +1 on the behavior here.

jreback · 2021-12-20T01:28:17Z

ok i think this is fine, @rhshadrach your last comment i think we should open a new issue but not block here right?

stanwest · 2021-12-20T19:01:46Z

> Is there an issue for this?

Yes: In issue #36327, grouping with dropna=False by a Categorical (as in the original comment of that issue) or by a column with categorical data type (as in a follow-up comment) incorrectly drops the null groups. I believe that that issue corresponds to rows 9–12 and 25–28 in the demonstration above.

Also, the discussion below the demonstration noted that grouping by two levels that include missing values, even when dropna=True, retains the null groups, which seems incorrect and differs from the behavior of .aggregate(). That inconsistency is not specific to categorical groupers and was mentioned in a comment to issue #10484. For example, the following groupby seems confused about its length:

In [2]: gby = pd.Series(range(3)).groupby(2 * [pd.Series(["A", "B", pd.NA])])

In [3]: list(gby)
Out[3]: 
[(('A', 'A'),
  0    0
  dtype: int64),
 (('B', 'B'),
  1    1
  dtype: int64)]

In [4]: len(gby)
Out[4]: 3

The remaining related issue I see is the aforementioned issue #35202 that affects rows 1–4.

jreback · 2021-12-22T03:07:26Z

@stanwest i was asking if there is an issue for this comment: #44180 (comment)

jreback · 2021-12-22T03:07:36Z

thanks @stanwest

rhshadrach · 2021-12-23T02:13:23Z

@jreback - The issue @stanwest pointed to is the same as the one in my comment above. I think we're good here.

stanwest · 2021-12-24T15:15:01Z

Thank you for the discussion and for merging this.

stanwest added 2 commits October 25, 2021 09:20

CLN: Remove check for instance of Series performed by prior clause

8d10ab4

BUG: Group by categorical Series with unequal length GH44179

08399e1

jreback added the Categorical and Groupby labels Oct 26, 2021

jreback requested changes Oct 26, 2021

pandas/tests/groupby/test_categorical.py (outdated, resolved)

TST: Expand test of grouping by categorical Series

e8a0c9b

Add tests of grouping Series with length equal to that of the grouper and index both equal and unequal to that of the grouper. Operate on the GroupBy and check the result.

stanwest requested a review from jreback November 1, 2021 18:07

jreback added this to the 1.4 milestone Nov 1, 2021

jreback approved these changes Nov 1, 2021

rhshadrach requested changes Nov 8, 2021

doc/source/whatsnew/v1.4.0.rst (outdated, resolved)

pandas/tests/groupby/test_categorical.py (outdated, resolved)

pandas/tests/groupby/test_categorical.py (outdated, resolved)

stanwest added 2 commits November 8, 2021 10:57

DOC: Edit whatsnew entry for GH44179

74580bf

CLN: Revise tests affected by fix for GH44179

65fcf42

Merge upstream

b62a6f0

stanwest requested a review from rhshadrach December 9, 2021 11:49

jreback requested changes Dec 14, 2021

pandas/core/groupby/grouper.py (resolved)

stanwest requested a review from jreback December 15, 2021 14:11

jreback approved these changes Dec 20, 2021

jreback merged commit 3a4821e into pandas-dev:master Dec 22, 2021

stanwest deleted the groupby-cat-with-uneq-len branch January 3, 2022 17:14
