Skip to content

BUG: GroupBy.value_counts doesn't preserve original order for non-grouping rows #59307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
sfc-gh-joshi opened this issue Jul 24, 2024 · 13 comments · Fixed by #59745
Closed
3 tasks done

BUG: GroupBy.value_counts doesn't preserve original order for non-grouping rows #59307

sfc-gh-joshi opened this issue Jul 24, 2024 · 13 comments · Fixed by #59745

Comments

@sfc-gh-joshi
Copy link

sfc-gh-joshi commented Jul 24, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
df.groupby("X", sort=False).value_counts(sort=False)  # works as expected
df.groupby("X", sort=True).value_counts(sort=False)  # column "Y" does not match original order

Issue Description

(possibly related to #55951)

When a GroupBy.value_counts operation is performed, the order of rows within the non-grouping columns does not respect the order of elements in the original frame.

In this example, with value_counts(sort=False) I would expect the row B 4 1 to appear above B -1 1, as that would correspond go their order in the original frame.

>>> df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
>>> df.groupby("X", sort=False).value_counts(sort=False)
X  Y 
A   1    1
    3    1
B  -1    1
    4    1
Name: count, dtype: int64

When the groupby has sort=False set, this order is respected.

>>> df.groupby("X", sort=False).value_counts(sort=False)
X  Y 
B   4    1
A   1    1
    3    1
B  -1    1
Name: count, dtype: int64

Expected Behavior

Calling groupby(sort=True).value_counts(sort=False) should preserve the order of members within groups, or documentation should be changed to reflect this.

Installed Versions

INSTALLED VERSIONS

commit : 67a58cd
python : 3.11.7
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1239.g67a58cddc2
numpy : 2.1.0.dev0+git20240720.d489c83
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 23.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

@sfc-gh-joshi sfc-gh-joshi added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 24, 2024
@sgysherry
Copy link

take

@sfc-gh-joshi sfc-gh-joshi changed the title BUG: GroupBy.value_counts doesn't preserve original order for groupby(sort=True) BUG: GroupBy.value_counts doesn't preserve original order for non-grouping rows Jul 25, 2024
@sfc-gh-joshi
Copy link
Author

I found an additional example with groupby(sort=False):

>>> data = {
...     "by": ["male", "male", "female", "male", "female", "male"],
...     "value1": ["low", "medium", "high", "low", "high", "low"],
...     "value2": ["US", "FR", "US", "FR", "FR", "FR"],
... }
>>> pd.DataFrame(data).groupby(by=["by"], sort=False).value_counts()
by      value1  value2
male    low     FR        2
                US        1
        medium  FR        1
female  high    US        1
                FR        1
Name: count, dtype: int64

The female high US 1 row has moved below male medium FR 1 despite appearing earlier in the input frame.

@rhshadrach
Copy link
Member

Thanks for the report. DataFrameGroupBy.value_counts was designed to mirror DataFrame.value_counts. In particular, for the sort argument as documented here:

Sort by frequencies when True. Sort by DataFrame column values when False.

The documentation for the groupby case should make this clear. I've reworked this issue as a documentation issue.

@rhshadrach rhshadrach added Docs Groupby and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2024
@KevsterAmp
Copy link
Contributor

take

@sfc-gh-joshi
Copy link
Author

Sort by frequencies when True. Sort by DataFrame column values when False.

What's the expected behavior for when both the groupby's sort argument and the value_counts sort argument are False? Right now it seems to not sort by the column values (as in the example from my original post):

>>> df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
>>> df.groupby("X", sort=False).value_counts(sort=False)
X  Y 
B   4    1
A   1    1
    3    1
B  -1    1
Name: count, dtype: int64

@rhshadrach
Copy link
Member

rhshadrach commented Aug 9, 2024

There seems to be an undesirable disagreement between Series.value_counts and DataFrame.value_counts. For Series.value_counts:

Sort by frequencies when True. Preserve the order of the data when False.

Whereas DataFrame is:

Sort by frequencies when True. Sort by DataFrame column values when False.

It seems to me these should behave the same. I propose to change DataFrame behavior with sort=False should preserve the order of the input data. This is because:

  • The current DataFrame sort=False behavior would be readily achievable by following the call with .sort_index().
  • This is the behavior of sort=False for other methods (e.g. df.groupby(..., sort=False)).
  • I would anticipate Series.value_counts sees more activity than DataFrame.value_counts.

Currently, it appears to me the Series behavior is not readily achievable in groupby. This is because we may need to sort by the groupings but not sort by the other columns in the DataFrame. However, I expect it should not be difficult to accomplish (but am not certain).

If we are to agree that sort=False should preserve the order of the input in all cases, then it appears to me .groupby(..., sort=False).value_counts(sort=False) behavior is currently correct, but .groupby(..., sort=True).value_counts(sort=False) is not.

@rhshadrach
Copy link
Member

rhshadrach commented Aug 21, 2024

@pandas-dev/pandas-core - I'd like thoughts on the proposal in the comment immediately above.

Assuming the proposal is acceptable, I think:

  • the inconsistency of the API;
  • the expected infrequent use of DataFrame.value_counts (especially since sort=False is not the default);
  • the impact on users code if not changed (would have to be sensitive to row-order); and
  • the ease of changing back to previous behavior (adding .sort_index())

warrants making this change as a bugfix and without deprecation.

@MarcoGorelli
Copy link
Member

My initial reaction is that I agree

warrants making this change as a bugfix and without deprecation.

sure, so long as it's in 3.0 (I wouldnt' be in favour of backporting to 2.3)

@sfc-gh-joshi
Copy link
Author

I found a separate, possibly-related issue: for groupby(sort=False).value_counts(sort=True, normalize=True), normalization is applied after the frequency column is sorted. This isn't a concern for plain DataFrame/Series.value_counts since those normalize against the size of the whole dataset, but the distinction is actually meaningful for GroupBy since it normalizes against the size of each group.

>>> import pandas as pd
>>> data = {"by": ["a", "a", "a", "b", "b", "b", "b", "b", "b"], "values": [1, 1, 0, 1, 1, 1, 0, 0, 0]}
>>> df = pd.DataFrame(data)
>>> df.groupby(by="by", sort=False).value_counts(sort=True, normalize=True)
by  values
b   1         0.500000
    0         0.500000
a   1         0.666667
    0         0.333333
Name: proportion, dtype: float64
>>> df.groupby(by="by", sort=False).value_counts(sort=True, normalize=False)
by  values
b   1         3
    0         3
a   1         2
    0         1
Name: count, dtype: int64

Is this a documentation issue, or a behavioral one? Either way, should I file this as a separate bug, or is this within scope for your proposal @rhshadrach?

sfc-gh-joshi added a commit to snowflakedb/snowpark-python that referenced this issue Aug 30, 2024
<!---
Please answer these questions before creating your pull request. Thanks!
--->

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.

   <!---
   In this section, please add a Snowflake Jira issue number.
   
Note that if a corresponding GitHub issue exists, you should still
include
   the Snowflake Jira issue number. For example, for GitHub issue
#1400, you should
   add "SNOW-1335071" here.
    --->

   Fixes SNOW-1489371

2. Fill out the following pre-review checklist:

- [x] I am adding a new automated test(s) to verify correctness of my
new code
- [ ] If this test skips Local Testing mode, I'm requesting review from
@snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
- [ ] If this is a new feature/behavior, I'm adding the Local Testing
parity changes.

3. Please describe how your code solves the related issue.

This PR adds support for GroupBy.value_counts, accepting all parameters
except `bin`, which we do not support for DataFrame/Series.value_counts.

Upstream modin defaults to pandas for both
DataFrameGroupBy/SeriesGroupBy.value_counts, so some of these changes
should be eventually upstreamed.

pandas has different behavior than what might be expected from
documentation; this PR tries to align with existing behavior as much as
possible. This is documented in this pandas issue:
pandas-dev/pandas#59307

1. When `normalize=True`, pandas sorts by the pre-normalization counts,
leading to counterintuitive results. This only matters when `groupby` is
called with `sort=False` and `value_counts` with `sort=True`. See test
cases for an example.
2. pandas does not always respect the original order of data, depending
on the configuration of sort flags in `groupby` and the `value_counts`
call itself. The behaviors are as follows (copied from query compiler
comments):
```
# pandas currently provides the following behaviors based on the different sort flags.
# These behaviors are not entirely consistent with documentation; see this issue for discussion:
# pandas-dev/pandas#59307
#
# Example data (using pandas 2.2.1 behavior):
# >>> df = pd.DataFrame({"X": ["B", "A", "A", "B", "B", "B"], "Y": [4, 1, 3, -2, -1, -1]})
#
# 1. groupby(sort=True).value_counts(sort=True)
#   Sort on non-grouping columns, then sort on frequencies, then sort on grouping columns.
# >>> df.groupby("X", sort=True).value_counts(sort=True)
# X  Y
# A   1    1
#     3    1
# B  -1    2
#    -2    1
#     4    1
# Name: count, dtype: int64
#
# 2. groupby(sort=True).value_counts(sort=False)
#   Sort on non-grouping columns, then sort on grouping columns.
# >>> df.groupby("X", sort=True).value_counts(sort=True)
# X  Y
# X  Y
# A   1    1
#     3    1
# B  -2    1
#    -1    2
#     4    1
# Name: count, dtype: int64
#
# 3. groupby(sort=False).value_counts(sort=True)
#   Sort on frequencies.
# >>> df.groupby("X", sort=False).value_counts(sort=True)
# X  Y
# B  -1    2
#     4    1
# A   1    1
#     3    1
# B  -2    1
# Name: count, dtype: int64
#
# 4. groupby(sort=False).value_counts(sort=False)
#   Sort on nothing (entries match the order of the original frame).
# X  Y
# B   4    1
# A   1    1
#     3    1
# B  -2    1
#    -1    2
# Name: count, dtype: int64
#
# Lastly, when `normalize` is set with groupby(sort=False).value_counts(sort=True, normalize=True),
# pandas will sort by the pre-normalization counts rather than the resulting proportions. As this
# is an uncommon edge case, we cannot handle this using existing QC methods efficiently, so we just
# update our testing code to account for this.
# See comment 
```

---------

Co-authored-by: Andong Zhan <andong.zhan@snowflake.com>
@rhshadrach
Copy link
Member

@sfc-gh-joshi - the documentation for sort=True states Sort by frequencies. Isn't that the current behavior? In any case, this appears separate to me, can you open a new issue.

@sfc-gh-joshi
Copy link
Author

the documentation for sort=True states Sort by frequencies.

That's true, it just feels counterintuitive that there's a scenario where sort=True where the column in question is not sorted in the expected fashion. I've filed #59725 for further discussion.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 9, 2024

@pandas-dev/pandas-core - I'd like thoughts on the proposal in the comment immediately above.

Assuming the proposal is acceptable, I think:

  • the inconsistency of the API;
  • the expected infrequent use of DataFrame.value_counts (especially since sort=False is not the default);
  • the impact on users code if not changed (would have to be sensitive to row-order); and
  • the ease of changing back to previous behavior (adding .sort_index())

warrants making this change as a bugfix and without deprecation.

+1

@jorisvandenbossche
Copy link
Member

Agreed that for the above mentioned reasons (#59307 (comment)), it sounds good to do this as a breaking change without deprecation, and then best done in 3.0.

@KevsterAmp KevsterAmp removed their assignment Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants