BUG: Incorrect index with groupby and groupby.agg when observed=False using categorical columns with/without as_index=False #46492

Closed
3 tasks done
RogerThomas opened this issue Mar 24, 2022 · 3 comments
Labels
Bug, Categorical, Groupby

Comments

@RogerThomas
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

#!/usr/bin/env python
import argparse
import pandas as pd


def main(args):
    df_data = {
        "A": ["A1", "A2", "A3", "A1"],
        "B": ["B1", "B2", "B3", "B1"],
        "C": [1.1, 1.2, 1.3, 1.4]
    }
    df = pd.DataFrame(df_data)

    as_index = args.as_index

    print("df:")
    print(df)
    print('---------')

    print(f"Non-Category: Groupby non-agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index)["C"].sum()
    print(gdf)
    print('---------')

    print(f"Non-Category: Groupby agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index).agg({"C": "sum"})
    print(gdf)
    print('---------')

    df["A"] = df["A"].astype("category")
    df["B"] = df["B"].astype("category")

    print(f"Category: Groupby non-agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index)["C"].sum()
    print(gdf)
    print('---------')

    print(f"Category: Groupby agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index).agg({"C": "sum"})
    print(gdf)
    print('---------')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Main script",
        epilog="Example usage: python test_gb_with_cats.py --as-index",
    )
    parser.add_argument("--as-index", default=False, action="store_true")
    args = parser.parse_args()

    main(args)

Issue Description

When using category types as groupby columns with as_index=True

Both df.groupby and df.groupby.agg work, but the result contains every combination of the underlying categories, not just the ones present in the data.

See output of reproducible example below

df:
    A   B    C
0  A1  B1  1.1
1  A2  B2  1.2
2  A3  B3  1.3
3  A1  B1  1.4
---------
Non-Category: Groupby non-agg, as_index=True
A   B
A1  B1    2.5
A2  B2    1.2
A3  B3    1.3
Name: C, dtype: float64
---------
Non-Category: Groupby agg, as_index=True
         C
A  B
A1 B1  2.5
A2 B2  1.2
A3 B3  1.3
---------
Category: Groupby non-agg, as_index=True
A   B
A1  B1    2.5
    B2    0.0
    B3    0.0
A2  B1    0.0
    B2    1.2
    B3    0.0
A3  B1    0.0
    B2    0.0
    B3    1.3
Name: C, dtype: float64
---------
Category: Groupby agg, as_index=True
         C
A  B
A1 B1  2.5
   B2  0.0
   B3  0.0
A2 B1  0.0
   B2  1.2
   B3  0.0
A3 B1  0.0
   B2  0.0
   B3  1.3
---------
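For readers following along: the nine rows in the categorical output above are the Cartesian product of the three categories of A and the three of B. The sketch below is a minimal check of that, assuming the same toy data as the reproducible example. The variable names (`full`, `seen`, `product_index`) are illustrative and not part of the original script, and `observed` is passed explicitly because its default differs between pandas versions.

```python
import pandas as pd

# Same toy data as the reproducible example, with A and B as categoricals.
df = pd.DataFrame({
    "A": pd.Categorical(["A1", "A2", "A3", "A1"]),
    "B": pd.Categorical(["B1", "B2", "B3", "B1"]),
    "C": [1.1, 1.2, 1.3, 1.4],
})

# observed=False keeps every combination of categories, present or not.
full = df.groupby(["A", "B"], observed=False)["C"].sum()

# observed=True keeps only the combinations that occur in the data.
seen = df.groupby(["A", "B"], observed=True)["C"].sum()

# Build the full cross product of the category levels for comparison.
product_index = pd.MultiIndex.from_product(
    [df["A"].cat.categories, df["B"].cat.categories], names=["A", "B"]
)
covers_product = set(full.index) == set(product_index)

print(len(full), len(seen), covers_product)
```

With this data, `full` has one row per (A, B) pair in the cross product, while `seen` has one row per pair that actually appears in `df`.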

When using category types as groupby columns with as_index=False

df.groupby works, but again the result contains every combination of the underlying categories.
However, df.groupby.agg fails with a cryptic error message like

ValueError: Length of values (3) does not match length of index (9)
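Two possible ways around this error are sketched below, again assuming the toy data from the reproducible example. The first passes observed=True so that only combinations present in the data are kept. The second keeps the full cross product by aggregating with the group keys in the index and flattening afterwards with reset_index. The names `observed_only` and `full_flat` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(["A1", "A2", "A3", "A1"]),
    "B": pd.Categorical(["B1", "B2", "B3", "B1"]),
    "C": [1.1, 1.2, 1.3, 1.4],
})

# Option 1: only observed combinations, group keys returned as columns.
observed_only = (
    df.groupby(["A", "B"], as_index=False, observed=True).agg({"C": "sum"})
)

# Option 2: keep all category combinations. Aggregate with the keys in the
# index (the as_index=True path), then move them back into columns.
full_flat = (
    df.groupby(["A", "B"], observed=False).agg({"C": "sum"}).reset_index()
)

print(observed_only)
print(full_flat)
```

Which option fits depends on whether rows for unobserved category combinations are wanted in the result.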

Expected Behavior

Both groupby and groupby.agg should work with/without as_index and shouldn't do a cross join on the underlying categories

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.13.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.5
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.5.0
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.32
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 1.1.0
xlwt : None
numba : 0.53.1

@RogerThomas added the Bug and Needs Triage labels Mar 24, 2022
@samukweku
Contributor

@RogerThomas if you pass observed=True to the categorical groupby, is there any change?

@RogerThomas
Contributor Author

Yes @samukweku, passing observed=True to the categorical groupbys does fix the issue, i.e.:

#!/usr/bin/env python
import argparse
import pandas as pd


def main(args):
    df_data = {
        "A": ["A1", "A2", "A3", "A1"],
        "B": ["B1", "B2", "B3", "B1"],
        "C": [1.1, 1.2, 1.3, 1.4]
    }
    df = pd.DataFrame(df_data)

    as_index = args.as_index

    print("df:")
    print(df)
    print('---------')

    print(f"Non-Category: Groupby non-agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index)["C"].sum()
    print(gdf)
    print('---------')

    print(f"Non-Category: Groupby agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index).agg({"C": "sum"})
    print(gdf)
    print('---------')

    df["A"] = df["A"].astype("category")
    df["B"] = df["B"].astype("category")

    print(f"Category: Groupby non-agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index, observed=True)["C"].sum()
    print(gdf)
    print('---------')

    print(f"Category: Groupby agg, as_index={as_index}")
    gdf = df.groupby(["A", "B"], as_index=as_index, observed=True).agg({"C": "sum"})
    print(gdf)
    print('---------')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Main script",
        epilog="Example usage: python test_gb_with_cats.py --as-index",
    )
    parser.add_argument("--as-index", default=False, action="store_true")
    args = parser.parse_args()

    main(args)

@mroeschke changed the title from "BUG: strange behaviour of groupby and groupby.agg when using categorical columns with/without as_index=False" to "BUG: Incorrect index with groupby and groupby.agg when observed=False using categorical columns with/without as_index=False" Jul 6, 2022
@mroeschke added the Groupby and Categorical labels and removed the Needs Triage label Jul 6, 2022
@rhshadrach
Member

Thanks for the report! For observed=False (the default), this is expected behavior. For agg raising on categories with as_index=False, this is a duplicate of #36698. Closing.
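One way to see why observed=False reports unobserved groups is a categorical with a declared level that never appears in the data. The sketch below uses a made-up column (`level`, with categories low/mid/high) that is not from this issue. It is only meant to illustrate the behavior, not to describe how pandas implements it.

```python
import pandas as pd

# "mid" is a declared category but no row uses it.
levels = pd.Categorical(
    ["low", "high", "low"], categories=["low", "mid", "high"]
)
df = pd.DataFrame({"level": levels, "n": [1, 2, 3]})

# observed=False: every declared category gets a row; "mid" sums to 0.
with_unobserved = df.groupby("level", observed=False)["n"].sum()

# observed=True: only categories present in the data get a row.
only_observed = df.groupby("level", observed=True)["n"].sum()

print(with_unobserved)
print(only_observed)
```

With several categorical keys, the same rule applies to every combination of declared categories, which is where the cross product in the report comes from.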
