Group by a categorical Series of unequal length #44180

Merged
jreback merged 6 commits into pandas-dev:master from the groupby-cat-with-uneq-len branch on Dec 22, 2021


stanwest
Contributor

While this fix enables grouping by a Series that has a categorical data type and length unequal to the axis of grouping, it maintains the requirement (developed in #3017, #9741, and #18525) that the length of a Categorical object used for grouping must match that of the axis of grouping. Additionally, this fix checks that requirement consistently with other groupers whose lengths must match—namely list, tuple, Index, and numpy.ndarray—and produces the same exception message when the lengths differ.
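A minimal sketch of the distinction described above, assuming pandas 1.4 or later (where this fix is available). The data and variable names are illustrative, and `observed=True` is passed only so that the sketch does not depend on the version-specific default for categorical groupers. A categorical Series grouper is aligned to the index of the grouped object, so a length mismatch is tolerated, whereas a bare Categorical has no index and must match the axis length:

```python
import pandas as pd

ser = pd.Series(range(6))

# Seven keys for six rows: a Series grouper is aligned on its index,
# so the key labelled 6 has no matching row and is ignored.
keys = pd.Series([*"ABBAAB", "A"], dtype="category")
sums = ser.groupby(keys, observed=True).sum()

# The underlying Categorical carries no index, so its length must
# equal that of the axis being grouped.
try:
    ser.groupby(keys.array, observed=True)
    raised = False
except ValueError:
    raised = True
```

Here `sums` holds the per-key totals for rows 0 through 5 only, and `raised` records whether the length check rejected the Categorical.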

@stanwest
Contributor Author

I gather that the two cancelled checks are also cancelled for other recent commits. What might have caused the failures of "CI / Checks / Testing docstring validation script" and "Database / Linux_py38_IO / Upload coverage to Codecov"?

@jreback jreback added Categorical Categorical Data Type Groupby labels Oct 26, 2021
Add tests of grouping Series with length equal to that of the grouper
and index both equal and unequal to that of the grouper.  Operate on
the GroupBy and check the result.
@stanwest stanwest requested a review from jreback November 1, 2021 18:07
@jreback jreback added this to the 1.4 milestone Nov 1, 2021
@jreback
Contributor

jreback commented Nov 1, 2021

cc @rhshadrach if any comments

Member

@rhshadrach rhshadrach left a comment


Thanks for the PR! Generally looks good - some minor requests below. Also, I think we should add a test where .groupby(..., dropna=False).

@stanwest
Contributor Author

> Thanks for the PR! Generally looks good - some minor requests below.

Thanks for the appreciation! I believe that I've addressed the minor requests.

> Also, I think we should add a test where .groupby(..., dropna=False).

Although I'm open to hearing more about what you have in mind, to me the dropna functionality seems beyond the scope of the bug and fix. That is, the bug and fix concern whether the groupby method either rejects a Series grouper solely because it has a categorical data type or accepts and reindexes it, and that precedes forming groups and deciding whether to drop any.

@rhshadrach
Member

rhshadrach commented Nov 11, 2021

@stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases.

@jreback
Contributor

jreback commented Nov 12, 2021

> @stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases.

agree

@jreback
Contributor

jreback commented Nov 28, 2021

Can you merge master and address the open comment?

@stanwest
Contributor Author

stanwest commented Dec 9, 2021

> @stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases.

I might otherwise agree, but I've found the behavior of pandas in related cases sometimes to be inconsistent with the documentation or otherwise surprising. Consider the following demonstration:

import pandas as pd

pd.options.display.max_colwidth = None

def outcome(dtype, uneq_len, uneq_index, dropna, operation):
    """Return the result or exception from a groupby operation."""
    ser = pd.Series(range(6))
    grouper = pd.Series([*"ABBA"] + (3 if uneq_len else 2) * [pd.NA], dtype=dtype)
    if uneq_index:
        grouper = grouper.reindex(grouper.index - 1)
    try:
        val = operation(ser.groupby(grouper, dropna=dropna))
    except ValueError as exc:
        val = exc
    return val

def summary(obj):
    """Return a summary representation of the given object."""
    if isinstance(obj, pd.Series):
        return (
            f"{type(obj).__name__}"
            f"({obj.to_dict()}, index={type(obj.index).__name__}(...))"
        )
    else:
        return repr(obj)

pd.Series(
    {
        (oper_name, dtype, dropna, uneq_len, uneq_index): summary(
            outcome(dtype, uneq_len, uneq_index, dropna, operation)
        )
        for oper_name, operation in {
            "groups": lambda gby: gby.groups,
            "agg(list)": lambda gby: gby.aggregate(list),
        }.items()
        for dtype in ("object", "category")
        for dropna in (False, True)
        for uneq_len in (False, True)
        for uneq_index in (False, True)
    },
).rename_axis(index=["operation", "dtype", "dropna", "uneq_len", "uneq_index"])

It groups the same series by another series of object or categorical data type, equal or unequal length, and equal or unequal index, either keeping or dropping missing keys, and then either produces the groups attribute or aggregates each group with list and represents the result or exception. It constructs the grouper such that group "A" should obtain 0 and 3; group "B", 1 and 2; and any group for missing values in the grouper, 4 and 5. (Although groups obtains the indices and aggregate obtains the values, the two are equal for the series being grouped to aid inspection.) With the present PR applied, it produces the following output, the rows of which I've manually numbered:

    operation  dtype     dropna  uneq_len  uneq_index
 1  groups     object    False   False     False                       ValueError('Categorical categories cannot be null')
 2                                         True                        ValueError('Categorical categories cannot be null')
 3                               True      False                       ValueError('Categorical categories cannot be null')
 4                                         True                        ValueError('Categorical categories cannot be null')
 5                       True    False     False                                                {'A': [0, 3], 'B': [1, 2]}
 6                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
 7                               True      False                                                {'A': [0, 3], 'B': [1, 2]}
 8                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
 9             category  False   False     False                                                {'A': [0, 3], 'B': [1, 2]}
10                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
11                               True      False                                                {'A': [0, 3], 'B': [1, 2]}
12                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
13                       True    False     False                                                {'A': [0, 3], 'B': [1, 2]}
14                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
15                               True      False                                                {'A': [0, 3], 'B': [1, 2]}
16                                         True                                                 {'A': [0, 3], 'B': [1, 2]}
17  agg(list)  object    False   False     False         Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
18                                         True          Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
19                               True      False         Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
20                                         True          Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=Index(...))
21                       True    False     False                      Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
22                                         True                       Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
23                               True      False                      Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
24                                         True                       Series({'A': [0, 3], 'B': [1, 2]}, index=Index(...))
25             category  False   False     False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
26                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
27                               True      False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
28                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
29                       True    False     False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
30                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
31                               True      False           Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))
32                                         True            Series({'A': [0, 3], 'B': [1, 2]}, index=CategoricalIndex(...))

The present PR affects the cases with categorical data type and unequal length (namely rows 11, 12, 15, 16, 27, 28, 31, and 32): Instead of raising ValueError('Length of grouper (7) and axis (6) must be same length'), those cases return the same output as when the lengths are equal.
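The index alignment behind those cases can be illustrated in isolation. The following is a sketch only: it reuses the grouper construction from the demonstration (object data type, unequal length, shifted index), and the names `shifted` and `aligned` are introduced here for illustration. It shows which keys remain once the grouper is reindexed to the index of the series being grouped:

```python
import pandas as pd

ser = pd.Series(range(6))

# Seven keys labelled 0..6, as in the uneq_len=True cases.
grouper = pd.Series([*"ABBA"] + 3 * [pd.NA], dtype="object")

# Shift the labels to -1..5, as in the uneq_index=True cases.
# Label -1 did not exist and becomes missing; label 6 is dropped.
shifted = grouper.reindex(grouper.index - 1)

# Aligning to the grouped series keeps only labels 0..5, so the
# surplus key never reaches the grouping step.
aligned = shifted.reindex(ser.index)
keys = aligned.tolist()
```

After alignment, rows 0 through 3 carry the keys A, B, B, A and rows 4 and 5 carry missing keys, which matches the intended grouping regardless of the grouper's original length.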

The cases with dropna=False seem not entirely correct. For the groups operation, the cases with object data type (rows 1–4) suffer from issue #35202, and those with "category" data type (rows 9–12) lack an item for missing values. I expected a result of {'A': [0, 3], 'B': [1, 2], nan: [4, 5]}. A comparable case using two identical groupings includes an item for missing values:

In [2]: pd.Series(range(6)).groupby(2 * [pd.Series([*"ABBA"] + 2 * [pd.NA], dtype="category")], dropna=False).groups
Out[2]: {('A', 'A'): [0, 3], ('B', 'B'): [1, 2], (nan, nan): [4, 5]}

Interestingly, the result is the same with dropna=True, as though the groups attribute were meant to express the groups prior to any filtering by the dropna argument. If that is the case, then rows 1–16 should produce {'A': [0, 3], 'B': [1, 2], nan: [4, 5]}.

For the agg(list) operation, the cases with object data type (rows 17–20) produce what I expect, but I expected those with "category" data type (rows 25–28) to include an item for missing values (Series({'A': [0, 3], 'B': [1, 2], nan: [4, 5]}, index=CategoricalIndex(...))).

Additionally, and perhaps related to the above mix of behaviors, I've found no groupby tests that cover both grouping by categorical data types and dropna=False.
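A test covering that combination might start from something like the sketch below. It is hypothetical rather than taken from the pandas test suite: the helper name `grouped_lists` is invented here, and `observed=True` is passed only to sidestep the version-specific default. Because the presence of a null group under dropna=False depends on the pandas version (see the issues referenced in this discussion), the sketch records that outcome instead of asserting it:

```python
import pandas as pd


def grouped_lists(dropna):
    """Aggregate each group to a list, grouping by a categorical Series."""
    ser = pd.Series(range(6))
    keys = pd.Series([*"ABBA", None, None], dtype="category")
    return ser.groupby(keys, dropna=dropna, observed=True).aggregate(list)


with_nulls = grouped_lists(dropna=False)
without_nulls = grouped_lists(dropna=True)

# True only if the installed pandas keeps a group for the missing keys.
has_null_group = bool(with_nulls.index.isna().any())
```

A real test would pin the expected value of `has_null_group` once the intended behavior is settled.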

To summarize my view, while some behaviors were questionable before this PR and remain so, this PR removes an unnecessary ValueError and makes the behaviors more consistent.

@stanwest stanwest requested a review from rhshadrach December 9, 2021 11:49
@stanwest stanwest requested a review from jreback December 15, 2021 14:11
@rhshadrach
Member

Thanks for the detailed response. I agree this PR moves in the right direction, and from your post the following two examples

import numpy as np
import pandas as pd

gb = pd.Series(range(2)).groupby([pd.Series(["A", pd.NA], dtype="category")], dropna=False)
print(gb.size())

gb = pd.Series(range(2)).groupby([pd.Series(["A", np.nan], dtype="category")], dropna=False)
print(gb.size())

already produce the incorrect result (drop the null group). Is there an issue for this? I'm +1 on the behavior here.

@jreback
Contributor

jreback commented Dec 20, 2021

OK, I think this is fine. @rhshadrach, regarding your last comment: I think we should open a new issue but not block here, right?

@stanwest
Contributor Author

stanwest commented Dec 20, 2021

> Is there an issue for this?

Yes: In issue #36327, grouping with dropna=False by a Categorical (as in the original comment of that issue) or by a column with categorical data type (as in a follow-up comment) incorrectly drops the null groups. I believe that that issue corresponds to rows 9–12 and 25–28 in the demonstration above.

Also, the discussion below the demonstration noted that grouping by two levels that include missing values, even when dropna=True, retains the null groups, which seems incorrect and differs from the behavior of .aggregate(). That inconsistency is not specific to categorical groupers and was mentioned in a comment to issue #10484. For example, the following groupby seems confused about its length:

In [2]: gby = pd.Series(range(3)).groupby(2 * [pd.Series(["A", "B", pd.NA])])

In [3]: list(gby)
Out[3]: 
[(('A', 'A'),
  0    0
  dtype: int64),
 (('B', 'B'),
  1    1
  dtype: int64)]

In [4]: len(gby)
Out[4]: 3
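The same comparison can be written as a small script. This is a sketch with illustrative variable names. Whether the two counts agree depends on the pandas version, so the script computes the comparison without assuming its result:

```python
import pandas as pd

keys = pd.Series(["A", "B", pd.NA])
gby = pd.Series(range(3)).groupby([keys, keys])

# Number of groups actually yielded by iteration (dropna=True by default).
n_iterated = sum(1 for _ in gby)

# Number of groups reported by len().
n_reported = len(gby)

consistent = n_iterated == n_reported
```

In the session above, `n_iterated` is 2 while `n_reported` is 3, so `consistent` would be False there.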

The remaining related issue I see is the aforementioned issue #35202 that affects rows 1–4.

@jreback
Contributor

jreback commented Dec 22, 2021

@stanwest I was asking whether there is an issue for this comment: #44180 (comment)

@jreback jreback merged commit 3a4821e into pandas-dev:master Dec 22, 2021
@jreback
Contributor

jreback commented Dec 22, 2021

thanks @stanwest

@rhshadrach
Member

@jreback - The issue @stanwest pointed to is the same as the one in my comment above. I think we're good here.

@stanwest
Contributor Author

Thank you for the discussion and for merging this.

@stanwest stanwest deleted the groupby-cat-with-uneq-len branch January 3, 2022 17:14
Labels
Categorical Categorical Data Type Groupby
Development

Successfully merging this pull request may close these issues.

BUG: Should be able to group by a categorical Series of unequal length
3 participants