Categorical Column GroupBy agg with as_index=False produces NaN rows 7.5X Slower with unexpected extra Cardinality #15217
Comments
Probably related to the above issue. If I perform a combined .max() aggregation on a three-column selection from the categorical groupby, I get many NaN rows that do not appear when the grouping columns are not categorical.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10,100, size=(200,6)), columns=['C'+str(i) for i in range(6)])
df['C0'] = ['A','B','C','D']*50
df['C1'] = ['E','F']*100
df['C2'] = ['H','I','J','K', 'L']*40
for col in df.columns[:3]:
    df[col] = df[col].astype('category')
df.groupby(['C0', 'C1', 'C2'])[['C3', 'C4', 'C5']].max()
            C3    C4    C5
C0 C1 C2
A  E  H   90.0  94.0  99.0
      I   97.0  87.0  94.0
      J   88.0  98.0  85.0
      K   80.0  91.0  97.0
      L   96.0  95.0  98.0
   F  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
B  E  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
   F  H   99.0  97.0  89.0
      I   80.0  97.0  94.0
      J   97.0  90.0  98.0
      K   89.0  85.0  98.0
      L   87.0  94.0  99.0
C  E  H   99.0  98.0  93.0
      I   94.0  88.0  92.0
      J   89.0  86.0  79.0
      K   99.0  98.0  98.0
      L   98.0  94.0  96.0
   F  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
D  E  H    NaN   NaN   NaN
      I    NaN   NaN   NaN
      J    NaN   NaN   NaN
      K    NaN   NaN   NaN
      L    NaN   NaN   NaN
   F  H   95.0  92.0  88.0
      I   98.0  90.0  87.0
      J   91.0  93.0  82.0
      K   93.0  99.0  99.0
      L   93.0  84.0  80.0
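For comparison, here is a minimal sketch of the same aggregation with the grouping columns left as plain strings instead of being converted to 'category'. The fixed seed and the names `rng`, `plain`, `n_groups` and `n_nan_rows` are assumptions added for illustration; they are not part of the original report.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is repeatable (an assumption, the report used unseeded random data).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(10, 100, size=(200, 6)),
                  columns=['C' + str(i) for i in range(6)])
df['C0'] = ['A', 'B', 'C', 'D'] * 50
df['C1'] = ['E', 'F'] * 100
df['C2'] = ['H', 'I', 'J', 'K', 'L'] * 40

# The key columns stay as plain strings here; no astype('category') call.
plain = df.groupby(['C0', 'C1', 'C2'])[['C3', 'C4', 'C5']].max()

n_groups = len(plain)                               # one row per key combination in the data
n_nan_rows = int(plain.isna().all(axis=1).sum())    # rows where every aggregate is NaN
print(n_groups, n_nan_rows)
```

With non-categorical keys, only the key combinations that actually occur in the data become groups, so no all-NaN rows are expected.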
I think this is the same code path as #14942. Essentially, we should be reindexing each level of the output rather than the Cartesian product of them (which is why the NaNs appear). Pull requests to fix this are welcome! (It's a straightforward fix.)
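The gap between the key combinations present in the data and the Cartesian product of the levels can be illustrated without any groupby at all. This is only a sketch of the idea, not the code in groupby.py; the names `keys`, `observed`, `product` and the counters are hypothetical.

```python
import pandas as pd

# Same grouping keys as in the example above.
keys = pd.DataFrame({
    'C0': ['A', 'B', 'C', 'D'] * 50,
    'C1': ['E', 'F'] * 100,
    'C2': ['H', 'I', 'J', 'K', 'L'] * 40,
})

# Key combinations that actually occur in the data.
observed = pd.MultiIndex.from_frame(keys.drop_duplicates())

# Cartesian product of the unique values of each level.
product = pd.MultiIndex.from_product(
    [sorted(keys[c].unique()) for c in keys.columns],
    names=list(keys.columns),
)

n_observed = len(observed)
n_product = len(product)
n_missing = len(product.difference(observed))  # combinations with no data behind them
print(n_observed, n_product, n_missing)
```

In this data C1 is fully determined by C0 (A and C always pair with E, B and D with F), so half of the 4 x 2 x 5 = 40 product entries have no rows behind them. Reindexing the result to the product is what fills those entries with NaN.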
@jreback It looks like #13204 attempted to fix this. On this line of groupby.py: We are using Perhaps it's just a typo, or I do not understand why we would want a Cartesian product. |
It's not a typo. The idea is that you should get back all the categories that were originally there for a categorical grouper; that is the contract.
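As a side note, later pandas releases (0.23 and newer, to my knowledge) expose this choice through an `observed` keyword on `groupby`; it is not available in the 0.19.2 version from the report. A minimal sketch, assuming such a newer pandas, with a fixed seed and illustrative variable names:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, an assumption for repeatability
df = pd.DataFrame(rng.randint(10, 100, size=(200, 6)),
                  columns=['C' + str(i) for i in range(6)])
df['C0'] = ['A', 'B', 'C', 'D'] * 50
df['C1'] = ['E', 'F'] * 100
df['C2'] = ['H', 'I', 'J', 'K', 'L'] * 40
for col in df.columns[:3]:
    df[col] = df[col].astype('category')

keys = ['C0', 'C1', 'C2']
vals = ['C3', 'C4', 'C5']

# observed=False: every combination of categories is returned (the "contract" above).
all_combos = df.groupby(keys, observed=False)[vals].max()

# observed=True: only combinations that occur in the data are returned.
seen_only = df.groupby(keys, observed=True)[vals].max()

n_all = len(all_combos)
n_seen = len(seen_only)
n_empty = int(all_combos.isna().all(axis=1).sum())  # groups with no rows behind them
print(n_all, n_seen, n_empty)
```

Passing the keyword explicitly avoids depending on its default, which has changed between pandas versions.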
I thought the goal of reindexing was to return the reduced-cardinality CategoricalIndex, but with all of the pre-groupby category levels still present? From what I see, the CategoricalIndex levels do have all category levels populated, but we still have the extra NaN rows. I am not sure which is causing which: do we get the extra NaN rows and all category levels because of MultiIndex.from_product(), or from some other line of code?

df.groupby(['C0', 'C1', 'C2']).sum().index
MultiIndex(levels=[['A', 'B', 'C', 'D'], ['E', 'F'], ['H', 'I', 'J', 'K', 'L']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=['C0', 'C1', 'C2'])
Yes, MultiIndex.from_product is not what we want here. I think we can maybe skip the reindexing entirely, as the categories are already preserved.
Code Sample, a copy-pastable example if possible
Problem description
Using as_index=False in df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max() with categorical columns produces NaN output rows.
Expected Output
I expect the output not to contain any extra cardinality explosion and to have the same number of rows as the as_index=True case.
Output of pd.show_versions()
pandas: 0.19.2
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None