Categorical column GroupBy agg with as_index=False produces NaN rows and is 7.5x slower, with unexpected extra cardinality #15217

Closed
dragoljub opened this issue Jan 25, 2017 · 6 comments · Fixed by #20583
Labels: Bug, Categorical, Groupby, Performance
Milestone: 0.23.0


@dragoljub

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10,100, size=(200,6)), columns=['C'+str(i) for i in range(6)])
df['C0'] = ['A','B','C','D']*50
df['C1'] = ['E','F']*100
df['C2'] = ['H','I','J','K', 'L']*40

for col in df.columns[:3]:
    df[col] = df[col].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
C0    200 non-null category
C1    200 non-null category
C2    200 non-null category
C3    200 non-null int32
C4    200 non-null int32
C5    200 non-null int32
dtypes: category(3), int32(3)
memory usage: 3.1 KB

%time ix_true = df.groupby(df.columns.tolist()[:3], as_index=True)['C5'].max()
Wall time: 2 ms

ix_true.shape
(20,)

%time ix_false = df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max()
Wall time: 15 ms

ix_false.shape
(40, 4)
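The extra 20 rows are the unobserved combinations of the category levels. A quick check against the df defined above (a sketch):

# The three groupers have 4, 2, and 5 category levels, so the full
# Cartesian product has 4 * 2 * 5 = 40 keys...
n_possible = (len(df['C0'].cat.categories)
              * len(df['C1'].cat.categories)
              * len(df['C2'].cat.categories))
print(n_possible)  # 40

# ...but only 20 combinations actually occur in the data.
print(len(df[['C0', 'C1', 'C2']].drop_duplicates()))  # 20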

ix_true
C0  C1  C2
A   E   H     93
        I     99
        J     88
        K     91
        L     94
B   F   H     98
        I     89
        J     94
        K     92
        L     96
C   E   H     96
        I     96
        J     85
        K     88
        L     98
D   F   H     96
        I     84
        J     71
        K     96
        L     94
Name: C5, dtype: int32

ix_false
	C0	C1	C2	C5
0	A	E	H	93.0
1	A	E	I	99.0
2	A	E	J	88.0
3	A	E	K	91.0
4	A	E	L	94.0
5	A	F	H	NaN
6	A	F	I	NaN
7	A	F	J	NaN
8	A	F	K	NaN
9	A	F	L	NaN
10	B	E	H	NaN
11	B	E	I	NaN
12	B	E	J	NaN
13	B	E	K	NaN
14	B	E	L	NaN
15	B	F	H	98.0
16	B	F	I	89.0
17	B	F	J	94.0
18	B	F	K	92.0
19	B	F	L	96.0
20	C	E	H	96.0
21	C	E	I	96.0
22	C	E	J	85.0
23	C	E	K	88.0
24	C	E	L	98.0
25	C	F	H	NaN
26	C	F	I	NaN
27	C	F	J	NaN
28	C	F	K	NaN
29	C	F	L	NaN
30	D	E	H	NaN
31	D	E	I	NaN
32	D	E	J	NaN
33	D	E	K	NaN
34	D	E	L	NaN
35	D	F	H	96.0
36	D	F	I	84.0
37	D	F	J	71.0
38	D	F	K	96.0
39	D	F	L	94.0

Problem description

Using as_index=False in df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max() with categorical columns produces NaN output rows: the result has 40 rows (the 4 x 2 x 5 Cartesian product of the category levels) instead of the 20 combinations that actually occur in the data.

Expected Output

I expect the output not to contain any extra cardinality explosion, and to have the same number of rows as the as_index=True case.
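A workaround sketch until this is fixed (on 0.19.2, as shown above, as_index=True returns only the observed groups): aggregate with as_index=True, then flatten the MultiIndex back into columns.

# Group with as_index=True (20 observed groups on this version), then
# reset the index to get the flat frame as_index=False should produce.
ix_workaround = df.groupby(['C0', 'C1', 'C2'], as_index=True)['C5'].max().reset_index()
print(ix_workaround.shape)  # (20, 4)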

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@dragoljub (Author)

Probably related to the above issue.

If I perform a combined .max() aggregation on a three-column selection from the categorical groupby, I get lots of NaN results that do not appear when not using categorical types.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10,100, size=(200,6)), columns=['C'+str(i) for i in range(6)])
df['C0'] = ['A','B','C','D']*50
df['C1'] = ['E','F']*100
df['C2'] = ['H','I','J','K', 'L']*40

for col in df.columns[:3]:
    df[col] = df[col].astype('category')

df.groupby(['C0', 'C1', 'C2'])[['C3', 'C4', 'C5']].max()

            C3    C4    C5
C0 C1 C2                  
A  E  H   90.0  94.0  99.0
      I   97.0  87.0  94.0
      J   88.0  98.0  85.0
      K   80.0  91.0  97.0
      L   96.0  95.0  98.0
   F  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
B  E  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
   F  H   99.0  97.0  89.0
      I   80.0  97.0  94.0
      J   97.0  90.0  98.0
      K   89.0  85.0  98.0
      L   87.0  94.0  99.0
C  E  H   99.0  98.0  93.0
      I   94.0  88.0  92.0
      J   89.0  86.0  79.0
      K   99.0  98.0  98.0
      L   98.0  94.0  96.0
   F  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
D  E  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
   F  H   95.0  92.0  88.0
      I   98.0  90.0  87.0
      J   91.0  93.0  82.0
      K   93.0  99.0  99.0
      L   93.0  84.0  80.0

@jreback (Contributor) commented Jan 25, 2017

I think this is the same code path as this one: #14942

Essentially we should be reindexing each level of the output rather than the Cartesian product of them (which is why the NaNs happen).

Pull requests to fix are welcome! (It's a straightforward fix.)
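To make the Cartesian-product point concrete, a minimal sketch (the level lists are copied from the example above; this is not the actual pandas internals):

import pandas as pd

# Reindexing against the product of the category levels yields
# 4 * 2 * 5 = 40 keys, even though only 20 combinations occur in the
# data -- the missing 20 become the all-NaN rows shown above.
levels = [['A', 'B', 'C', 'D'], ['E', 'F'], ['H', 'I', 'J', 'K', 'L']]
full_index = pd.MultiIndex.from_product(levels, names=['C0', 'C1', 'C2'])
print(len(full_index))  # 40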

@jreback jreback added this to the 0.20.0 milestone Jan 25, 2017
@dragoljub (Author)

@jreback looks like #13204 attempted to fix this.

On this line of groupby.py:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L3934

We are using MultiIndex.from_product(). Should we instead be using MultiIndex.from_arrays()?

Perhaps it's just a typo, or I'm not understanding why we would want a Cartesian product.
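For comparison, a sketch of what MultiIndex.from_arrays would give on this data (illustrative only, not the actual groupby code path):

# from_arrays builds one index entry per input row, so feeding it the
# observed group keys keeps the 20-row cardinality instead of expanding
# to the 40-key product.
keys = df[['C0', 'C1', 'C2']].drop_duplicates()
observed_index = pd.MultiIndex.from_arrays(
    [keys['C0'], keys['C1'], keys['C2']], names=['C0', 'C1', 'C2'])
print(len(observed_index))  # 20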

@jreback (Contributor) commented Jan 25, 2017

It's not a typo.

The idea is that you should be getting back all categories that were there originally for a categorical grouper - that is the contract.
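That contract is easiest to see with a single categorical grouper that has an unobserved category (a minimal sketch):

import pandas as pd

# 'c' is a declared category that never occurs in the data, yet it still
# comes back in the groupby result (as NaN) -- all categories are returned.
cat = pd.Series(['a', 'a', 'b'], dtype='category').cat.set_categories(['a', 'b', 'c'])
vals = pd.Series([10, 20, 30])
print(vals.groupby(cat).max())
# a    20.0
# b    30.0
# c     NaN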

@dragoljub (Author)

I thought the goal of re-indexing was to return the reduced-cardinality CategoricalIndex, but with all of the pre-groupby category levels still present?

From what I see, the CategoricalIndex levels do have all category levels populated, but we still have the extra NaN rows.

Not sure which is causing which. Do we have the extra NaN rows and all category levels due to MultiIndex.from_product(), or from some other line of code?

df.groupby(['C0', 'C1', 'C2']).sum().index

MultiIndex(levels=[['A', 'B', 'C', 'D'], ['E', 'F'], ['H', 'I', 'J', 'K', 'L']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
           names=['C0', 'C1', 'C2'])

@jreback (Contributor) commented Jan 25, 2017

Yes, the MultiIndex.from_product is not what we want here. I think we can maybe skip the reindexing entirely, as the categories are already preserved.
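For readers arriving later: this was closed by #20583 (milestone 0.23.0), which, as far as I can tell, added an observed keyword to groupby, making the expansion opt-in. A sketch of the post-fix usage (requires pandas >= 0.23):

# observed=True returns only the combinations present in the data (20 rows
# here); observed=False keeps the full product of the categories (40 rows).
df.groupby(['C0', 'C1', 'C2'], as_index=False, observed=True)['C5'].max()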

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 29, 2017
@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@jreback jreback added the Performance Memory or execution speed performance label Feb 10, 2018
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 9, 2018