BUG: groupby with categorical result order differs for index type #49223

rhshadrach · 2022-10-21T03:38:32Z

When a categorical grouper is used on a DataFrame with a range index, the sort order of the values depends only on whether the categorical is ordered. When the grouper is the index, the sort order of the values depends only on the sort argument to groupby.

Code

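import numpy as np
from pandas import Categorical, DataFrame

# Categories are specified in descending order: [4, 3, 2, 1, 0]
categories = np.arange(4, -1, -1)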
for range_index in [True, False]:
    df_ordered = DataFrame(
        {
            "a": Categorical([2, 1, 2, 3], categories=categories, ordered=True),
            "b": range(4)
        }
    )
    df_unordered = DataFrame(
        {
            "a": Categorical([2, 1, 2, 3], categories=categories, ordered=False),
            "b": range(4)
        }
    )

    if not range_index:
        df_ordered = df_ordered.set_index('a')
        df_unordered = df_unordered.set_index('a')

    gb_ordered_sort = df_ordered.groupby("a", sort=True, observed=True)
    gb_ordered_nosort = df_ordered.groupby("a", sort=False, observed=True)
    gb_unordered_sort = df_unordered.groupby("a", sort=True, observed=True)
    gb_unordered_nosort = df_unordered.groupby("a", sort=False, observed=True)

    print('Range Index:', range_index)
    print('  ordered=True, sort=True:', gb_ordered_sort.sum()["b"].tolist())
    print('  ordered=True, sort=False:', gb_ordered_nosort.sum()["b"].tolist())
    print('  ordered=False, sort=True:', gb_unordered_sort.sum()["b"].tolist())
    print('  ordered=False, sort=False:', gb_unordered_nosort.sum()["b"].tolist())
    print('---')

Range Index: True
  ordered=True, sort=True: [3, 2, 1]
  ordered=True, sort=False: [3, 2, 1]
  ordered=False, sort=True: [2, 1, 3]
  ordered=False, sort=False: [2, 1, 3]
---
Range Index: False
  ordered=True, sort=True: [3, 2, 1]
  ordered=True, sort=False: [2, 1, 3]
  ordered=False, sort=True: [3, 2, 1]
  ordered=False, sort=False: [2, 1, 3]
---


gt-on-1234 · 2022-10-26T17:55:29Z

@rhshadrach Is this the intended behaviour?

Range Index: True
  ordered=True, sort=True: [3, 2, 1]
  ordered=True, sort=False: [3, 2, 1]
  ordered=False, sort=True: [2, 1, 3]
  ordered=False, sort=False: [2, 1, 3]
---
Range Index: False
  ordered=True, sort=True: [3, 2, 1]
  ordered=True, sort=False: [3, 2, 1]
  ordered=False, sort=True: [2, 1, 3]
  ordered=False, sort=False: [2, 1, 3]
---

I.e. The sort order of the values only depends on the sort argument in both cases.

rhshadrach · 2022-10-26T22:44:18Z

> The sort order of the values only depends on the sort argument in both cases.

I think this is correct - but the example in your post depends on ordered and not sort. I'm guessing it might just be a typo. Also, when a category is ordered and sort=True, the result should be sorted by the specified category order. When a category is unordered and sort=True, however, I think the result should be sorted by the data that underlies the category - e.g. strings lexicographically and numbers numerically. Here is what I would expect:

Range Index: True
  ordered=True, sort=True: [3, 2, 1]  # Categories sorted by specified order
  ordered=True, sort=False: [2, 1, 3]
  ordered=False, sort=True: [1, 2, 3]  # Categories sorted numerically
  ordered=False, sort=False: [2, 1, 3]
---
Range Index: False
  ordered=True, sort=True: [3, 2, 1]  # Categories sorted by specified order
  ordered=True, sort=False: [2, 1, 3]
  ordered=False, sort=True: [1, 2, 3]  # Categories sorted numerically
  ordered=False, sort=False: [2, 1, 3]
---

rhshadrach · 2022-10-26T22:52:32Z

Perhaps the above expectation of how ordered influences the sort order is wrong. For example:

import numpy as np
import pandas as pd

categories = np.arange(4, -1, -1)
df_ordered = pd.DataFrame(
    {
        "a": pd.Categorical([2, 1, 2, 3], categories=categories, ordered=True),
        "b": range(4)
    }
)
print(df_ordered.sort_values("a"))
#    a  b
# 3  3  3
# 0  2  0
# 2  2  2
# 1  1  1

df_unordered = pd.DataFrame(
    {
        "a": pd.Categorical([2, 1, 2, 3], categories=categories, ordered=False),
        "b": range(4)
    }
)
print(df_unordered.sort_values("a"))
#    a  b
# 3  3  3
# 0  2  0
# 2  2  2
# 1  1  1
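A possible explanation for why both sort_values calls above agree, sketched under the assumption that sorting a categorical works on its integer codes (each value's position in categories) rather than on the underlying values. The variable names below are only for illustration.

```python
import numpy as np
import pandas as pd

categories = np.arange(4, -1, -1)  # [4, 3, 2, 1, 0]
cat_ordered = pd.Categorical([2, 1, 2, 3], categories=categories, ordered=True)
cat_unordered = pd.Categorical([2, 1, 2, 3], categories=categories, ordered=False)

# Each value is stored as its position within `categories`,
# so the codes are identical whether or not the categorical is ordered.
codes_ordered = cat_ordered.codes.tolist()
codes_unordered = cat_unordered.codes.tolist()
print(codes_ordered)    # [2, 3, 2, 1]
print(codes_unordered)  # [2, 3, 2, 1]

# Sorting follows the category order in both cases.
sorted_ordered = pd.Series(cat_ordered).sort_values().tolist()
sorted_unordered = pd.Series(cat_unordered).sort_values().tolist()
print(sorted_ordered)    # [3, 2, 2, 1]
print(sorted_unordered)  # [3, 2, 2, 1]
```

If that assumption holds, the ordered flag affects comparisons such as < and min/max, but not the order produced by sorting.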

Somewhat related discussion: #12785

gt-on-1234 · 2022-10-27T18:34:23Z

In that case, does it make sense for sort=True to sort categoricals by their specified order if they are ordered or not?

E.g.

Range Index: True
  ordered=True, sort=True: [3, 2, 1]  # Categories sorted by the specified order
  ordered=True, sort=False: [2, 3, 1]  
  ordered=False, sort=True: [3, 2, 1]  # Categories sorted by the specified order
  ordered=False, sort=False: [2, 1, 3]
---
Range Index: False
  ordered=True, sort=True: [3, 2, 1]  # Categories sorted by the specified order
  ordered=True, sort=False: [2, 3, 1]  
  ordered=False, sort=True: [3, 2, 1]  # Categories sorted by the specified order
  ordered=False, sort=False: [2, 1, 3]
---

rhshadrach · 2022-10-28T16:53:14Z

@gt-on-1234 - yes, I think so. That we should sort by that data's natural order when ordered=False is entirely separate from this issue, and I'm not sure it's a good idea.
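For anyone who does want the natural (numeric) order of the underlying data, a minimal sketch of one workaround that does not rely on how groupby treats categoricals: group by the plain integer values instead. This is only an illustration, not something proposed in this issue, and the names values and sums_numeric are made up for the example.

```python
import numpy as np
import pandas as pd

categories = np.arange(4, -1, -1)
df = pd.DataFrame(
    {
        "a": pd.Categorical([2, 1, 2, 3], categories=categories, ordered=False),
        "b": range(4),
    }
)

# Convert the categorical to plain integers so that sort=True
# orders the groups numerically (1, 2, 3), ignoring category order.
values = df["a"].astype("int64")
sums_numeric = df.groupby(values, sort=True)["b"].sum().tolist()
print(sums_numeric)  # [1, 2, 3]
```

The cost is that unobserved categories and the categorical dtype are lost in the result index.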

rhshadrach added Bug, Groupby, Categorical labels Oct 21, 2022

drammock mentioned this issue Oct 25, 2022

fix spectrum dataframe test mne-tools/mne-python#11280

Merged

gt-on-1234 added a commit to gt-on-1234/pandas that referenced this issue Oct 25, 2022

Closes pandas-dev#49223

120392d

gt-on-1234 mentioned this issue Oct 25, 2022

BUG: Consistent result ordering for groupbys with categorical groupings #49318

Closed

4 tasks

rhshadrach mentioned this issue Oct 28, 2022

BUG: groupby with CategoricalIndex doesn't include unobserved categories #49373

Merged

6 tasks

mroeschke closed this as completed in #49373 Nov 7, 2022
