Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.groupby with axis=1 is deprecated. <-- please do not deprecate it #56226

Closed
3 tasks done
johentsch opened this issue Nov 28, 2023 · 7 comments
Closed
3 tasks done
Labels
Closing Candidate May be closeable, needs more eyeballs Deprecate Functionality to remove in pandas Groupby
Milestone

Comments

@johentsch
Copy link

johentsch commented Nov 28, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        ('column_1', 'a'): {
            ('group1', 'group2', 0): '01a',
            ('group1', 'group2', 1): '11a',
            ('group1', 'group2', 2): '21a'
        },
        ('column_1', 'b'): {
            ('group1', 'group2', 0): '01b',
            ('group1', 'group2', 1): '11b',
            ('group1', 'group2', 2): '21b'
        },
        ('column_2', 'a'): {
            ('group1', 'group2', 0): '02a',
            ('group1', 'group2', 1): '12a',
            ('group1', 'group2', 2): '22a'
        },
        ('column_2', 'b'): {
            ('group1', 'group2', 0): '02b',
            ('group1', 'group2', 1): '12b',
            ('group1', 'group2', 2): '22b'
        }}
       
)
desired_result = df.groupby(level=0, axis=1).apply(lambda df: df.apply(tuple, axis=1)) # semantically chrystal-clear, why deprecate?
should_be_equivalent = df.T.groupby(level=0).apply(lambda df: df.apply(tuple)).T # has one column level more, not less
this_works = df.T.groupby(level=0).apply(lambda df: df.apply(tuple, result_type="reduce")).T
desired_result.equals(should_be_equivalent) # False
desired_result.equals(this_works)           # True

Issue Description

Related to #51203, #51778 where people fail to see the use of .groupby(axis=1). These issues fail to name good reasons for deprecating this that could counter this: It should be retained for API consistency and for backward compatibility. It does not matter if transforms are happening under the hood, as a user I don't care. I care about consistency and being able to use my code in the future.

So here comes the problem: Here's a dataframe with two-by-two multicolumns, four columns in total. The aim is to eliminate the second level and to end up with only two columns in total (according to the first level) that should contain tuples being pairs which correspond to a combination of the two squashed columns from the second level (desired_result). The problem is, when, instead of grouping columns and applying tuple to rows, I transform, group, and apply tuple to all colums, the output of tuple, by default, will be interpreted as a transformation of the column (corresponding to result_type="expand" behaviour) and not as a "squashed" tuple that would get rid of the second, now-index level (corresponding to result_type="reduce" behaviour). That is, I still end up with four columns. Even worse, the fact that group_keys=True by default results in an additional column level duplicating the original first one.

As shown above, the fix is to include .apply(tuple, result_type="reduce"). In this case, no need to include group_keys=False to avoid the duplicated column level.

So my plea, again: Please do not deprecate .groupby(axis=1). Rather, seek to make the behaviours consistent.

Installed Versions

INSTALLED VERSIONS

commit : 2a953cf
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.62-1-MANJARO
Version : #1 SMP PREEMPT_DYNAMIC Thu Nov 9 03:01:44 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.3
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.23
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

...lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning:

Setuptools is replacing distutils.

@johentsch johentsch added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 28, 2023
@MarcoGorelli
Copy link
Member

this_works = df.T.groupby(level=0).apply(lambda df: df.apply(tuple, result_type="reduce")).T

this looks fine to be honest - transposing is expensive and shouldn't be done under-the-hood in my opinion

@MarcoGorelli MarcoGorelli added Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Nov 28, 2023
@johentsch
Copy link
Author

johentsch commented Nov 28, 2023

transposing is expensive and shouldn't be done under-the-hood in my opinion

Then the best solution for this issue is to implement a performant .groupby(axis=1) without transform, rather than pretending this is some esoteric use-case in which users just go figure with their transforms, opaque default behaviours of .apply() and what not. In my opinion, #51203 invokes good arguments to keep this.

@rhshadrach
Copy link
Member

rhshadrach commented Nov 28, 2023

Thanks for the feedback here! Though we may disagree, it's important to hear from users on issues such as this.

Then the best solution for this issue is to implement a performant .groupby(axis=1) without transform

I don't see how that's possible short of adding a whole new implementation for groupby just for axis=1. The current implementation is already a large maintenance burden, and this would exacerbate the issue.

The problem is, when, instead of grouping columns and applying tuple to rows, I transform, group, and apply tuple to all colums, the output of tuple, by default, will be interpreted as a transformation of the column

You should be able to use .groupby(...).agg here to accomplish what you want, but currently it works the same as .apply. We've laid the ground work to fix this in pandas 3.0 as part of the 2.x series, I hope to tackle it in the future.

rather than pretending this is some esoteric use-case

I don't think this helps the conversation. My view is that it sees much less usage compared to axis=0 in groupby and that the maintenance burden is not worth supporting this feature. But I do acknowledge (as I did in #51203) that there are benefits to keeping axis=1.

@rhshadrach rhshadrach added Groupby Deprecate Functionality to remove in pandas labels Nov 28, 2023
@johentsch
Copy link
Author

Thank you @rhshadrach! I'll go with the two-fold transposing solution for now, even though it cramps my Zen. Pandas is at the core of a data analysis library I'm developing and it uses MultiIndex for every index and for columns a fair bit, too. In terms of usability it makes a significant difference to be able to group by column levels, so I'm glad as long as y'all be putting up with the maintenance burden of it. 🙏

@chubukov
Copy link

chubukov commented Dec 7, 2023

Thanks for the feedback here! Though we may disagree, it's important to hear from users on issues such as this.

I will just add two cents that groupby(axis=1) is a common use case for me. The grouping variable is typically either a level of a MultiIndex or an external Series. To be fair, it is mostly syntactic sugar, and I suspect I would be able to do everything with a combination of explicit for loops and concat. My number of columns is generally small, so I doubt I am taking advantage of performant implementations.

Is there a recommended compact equivalent to for i,g in df.groupby(level=level_name,axis=1):? I have come up with many alternatives, but they're all very ugly.

@jbrockmendel jbrockmendel added this to the 3.0 milestone Dec 21, 2023
@phofl
Copy link
Member

phofl commented Dec 28, 2023

@chubukov

Why can't you just transpose your DataFrame?

@lithomas1
Copy link
Member

We talked a little about this at the dev meeting today.

The consensus here was that the added maintenance burden isn't worth it, just to save doing some transpose calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

No branches or pull requests

7 participants