-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Group by a categorical Series of unequal length #44180
Group by a categorical Series of unequal length #44180
Conversation
I gather that the two cancelled checks are also cancelled for other recent commits. What might have caused the failures of "CI / Checks / Testing docstring validation script" and "Database / Linux_py38_IO / Upload coverage to Codecov"? |
Add tests of grouping Series with length equal to that of the grouper and index both equal and unequal to that of the grouper. Operate on the GroupBy and check the result.
cc @rhshadrach if any comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Generally looks good - some minor requests below. Also, I think we should add a test where .groupby(..., dropna=False)
.
Thanks for the appreciation! I believe that I've addressed the minor requests.
Although I'm open to hearing more about what you have in mind, to me the |
@stanwest - I believe the bugfix you've done here results in new cases (since they would raise before) where the grouper has null values (e.g. #44179 (comment)). It seems prudent to me to test these cases. |
agree |
can you merge master and address the open comment |
I might otherwise agree, but I've found the behavior of pandas in related cases sometimes to be inconsistent with the documentation or otherwise surprising. Consider the following demonstration: import pandas as pd
pd.options.display.max_colwidth = None
def outcome(dtype, uneq_len, uneq_index, dropna, operation):
"""Return the result or exception from a groupby operation."""
ser = pd.Series(range(6))
grouper = pd.Series([*"ABBA"] + (3 if uneq_len else 2) * [pd.NA], dtype=dtype)
if uneq_index:
grouper = grouper.reindex(grouper.index - 1)
try:
val = operation(ser.groupby(grouper, dropna=dropna))
except ValueError as exc:
val = exc
return val
def summary(obj):
"""Return a summary representation of the given object."""
if isinstance(obj, pd.Series):
return (
f"{type(obj).__name__}"
f"({obj.to_dict()}, index={type(obj.index).__name__}(...))"
)
else:
return repr(obj)
pd.Series(
{
(oper_name, dtype, dropna, uneq_len, uneq_index): summary(
outcome(dtype, uneq_len, uneq_index, dropna, operation)
)
for oper_name, operation in {
"groups": lambda gby: gby.groups,
"agg(list)": lambda gby: gby.aggregate(list),
}.items()
for dtype in ("object", "category")
for dropna in (False, True)
for uneq_len in (False, True)
for uneq_index in (False, True)
},
).rename_axis(index=["operation", "dtype", "dropna", "uneq_len", "uneq_index"]) It groups the same series by another series of
The present PR affects the cases with categorical data type and unequal length (namely rows 11, 12, 15, 16, 27, 28, 31, and 32): Instead of raising The cases with In [2]: pd.Series(range(6)).groupby(2 * [pd.Series([*"ABBA"] + 2 * [pd.NA], dtype="category")], dropna=False).groups
Out[2]: {('A', 'A'): [0, 3], ('B', 'B'): [1, 2], (nan, nan): [4, 5]} Interestingly, the result is the same with For the Additionally, and perhaps related to the above mix of behaviors, I've found no To summarize my view, while some behaviors were questionable before this PR and remain so, this PR removes an unnecessary |
Thanks for the detailed response. I agree this PR moves in the right direction, and from your post the following two examples
already produce the incorrect result (drop the null group). Is there an issue for this? I'm +1 on the behavior here. |
ok i think this is fine, @rhshadrach your last comment i think we should open a new issue but not block here right? |
Yes: In issue #36327, grouping with Also, the discussion below the demonstration noted that grouping by two levels that include missing values, even when In [2]: gby = pd.Series(range(3)).groupby(2 * [pd.Series(["A", "B", pd.NA])])
In [3]: list(gby)
Out[3]:
[(('A', 'A'),
0 0
dtype: int64),
(('B', 'B'),
1 1
dtype: int64)]
In [4]: len(gby)
Out[4]: 3 The remaining related issue I see is the aforementioned issue #35202 that affects rows 1–4. |
@stanwest i was asking if there is an issue for this comment: #44180 (comment) |
thanks @stanwest |
Thank you for the discussion and for merging this. |
While this fix enables grouping by a
Series
that has a categorical data type and length unequal to the axis of grouping, it maintains the requirement (developed in #3017, #9741, and #18525) that the length of aCategorical
object used for grouping must match that of the axis of grouping. Additionally, this fix checks that requirement consistently with other groupers whose lengths must match—namelylist
,tuple
,Index
, andnumpy.ndarray
—and produces the same exception message when the lengths differ.