BUG: group by with categorical columns causes an exception #45128

Closed · 3 tasks done
ItayGabbay opened this issue Dec 30, 2021 · 4 comments · Fixed by #45143
Labels: Bug, Categorical, Groupby
Milestone: 1.4

Comments

@ItayGabbay

ItayGabbay commented Dec 30, 2021

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

data = {
    'a': np.random.randint(0, 1000, 300000),
    'b': np.random.randint(0, 1000, 300000),
    'c': np.random.randint(0, 1000, 300000),
    'd': np.random.randint(0, 1000, 300000),
    'e': np.random.randint(0, 1000, 300000),
    'target': np.random.randint(0, 2, 300000)
}
df = pd.DataFrame(data)
df = df.astype(dtype={"a": "category",
                      "b": "category",
                      "c": "category",
                      "d": "category",
                      "e": "category",
                      "target": "int"})
group_unique_data = df.groupby(['a', 'b', 'c', 'd', 'e'], dropna=False)
group_unique_labels = group_unique_data.nunique()['target']

Issue Description

When trying to group by a DataFrame with many categories, the following exception is raised:

    group_unique_labels = group_unique_data.nunique()['target']
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1804, in nunique
    results = self._apply_to_column_groupbys(
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1710, in _apply_to_column_groupbys
    results = [
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1711, in <listcomp>
    func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1805, in <lambda>
    lambda sgb: sgb.nunique(dropna), obj=obj
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 673, in nunique
    return self._reindex_output(result, fill_value=0)
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 3164, in _reindex_output
    index, _ = MultiIndex.from_product(
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/indexes/multi.py", line 620, in from_product
    codes = cartesian_product(codes)
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/reshape/util.py", line 54, in cartesian_product
    return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/pandas/core/reshape/util.py", line 54, in <listcomp>
    return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
  File "<__array_function__ internals>", line 5, in repeat
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/Users/itaygabbay/.virtualenvs/mlchecks/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
numpy.core._exceptions.MemoryError: Unable to allocate 1.78 PiB for an array with shape (1000000000000000,) and data type int16

However, when I cast the category columns to object dtype everything works smoothly.
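A minimal sketch of that workaround, reusing df and the same grouping keys from the reproducible example above:

# Workaround: cast the categorical grouping keys back to object before grouping,
# so only the observed key combinations are produced.
df_obj = df.astype({col: "object" for col in ["a", "b", "c", "d", "e"]})
group_unique_labels = df_obj.groupby(["a", "b", "c", "d", "e"], dropna=False).nunique()["target"]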

Expected Behavior

I would expect to get the output of the groupby operation instead of a MemoryError.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.0.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : None
pytest : 6.2.5
hypothesis : 6.31.6
sphinx : 4.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@phofl
Member

phofl commented Dec 30, 2021

Categoricals are not memory efficient compared to int or object; they have to store the categories and codes, among other things. The operation you are performing really blows up the result, so this is not that surprising, I think. But investigations are welcome.

@phofl added the Categorical and Groupby labels and removed the Needs Triage label Dec 30, 2021
@Liam3851
Contributor

Liam3851 commented Dec 30, 2021

I believe this is just the confusing observed=False default already referenced in #43999, #35967?

@ItayGabbay you need to use observed=True in this groupby to get the same output as you would with integer categories. With categoricals, by default, all groups are generated as output whether observed or not. In the example case that generates 10^15 groups (1000 a's x 1000 b's x 1000 c's x 1000 d's x 1000 e's) even though only 300,000 of those groups have observations.
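A minimal sketch of that suggestion, reusing df from the reproducible example above:

# Suggested call: with observed=True only the observed key combinations should be
# produced, rather than the full 1000**5 cartesian product of category levels.
group_unique_data = df.groupby(['a', 'b', 'c', 'd', 'e'], dropna=False, observed=True)
group_unique_labels = group_unique_data.nunique()['target']

(As the next comment notes, this still raised the same exception, which is the bug tracked here.)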

@ItayGabbay
Author

@Liam3851 thanks for the suggestion.
I've also tried with observed=True, but it still raises the same exception.

@Liam3851
Contributor

Liam3851 commented Dec 31, 2021

Ok, yes, I think you've hit a bug in observed=True. groupby.nunique acts on the groupbys of each column:

obj = self._obj_with_exclusions
results = self._apply_to_column_groupbys(
    lambda sgb: sgb.nunique(dropna), obj=obj
)

but it's not passing through the observed keyword when creating the groupbys on the individual columns:

for i, colname in enumerate(obj.columns):
    yield colname, SeriesGroupBy(
        obj.iloc[:, i],
        selection=colname,
        grouper=self.grouper,
        exclusions=self.exclusions,
    )

And so it appears it's just getting the same result as if observed were unspecified.

Edit: I've confirmed that a one-line fix passing observed=self.observed at line 1450 fixes the original issue (if the original example were to use observed=True). However, this code path is hit by other means that still require testing.
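A sketch of what that suggested one-line change would look like, based on the _iterate_column_groupbys snippet quoted above (the observed keyword is the proposed addition; see #45143 for the actual fix):

for i, colname in enumerate(obj.columns):
    yield colname, SeriesGroupBy(
        obj.iloc[:, i],
        selection=colname,
        grouper=self.grouper,
        exclusions=self.exclusions,
        observed=self.observed,  # proposed addition: propagate the observed flag
    )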

@jreback added this to the 1.4 milestone Dec 31, 2021