Skip to content

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
suokunlong opened this issue Jan 24, 2020 · 3 comments

Comments

@suokunlong
Copy link

This issue was initially reported as #19622, which was then marked as a duplicate of #15217 and closed since then. However, this issue is still reproducible with the most recent version 0.25.3

To reproduce, copy-paste the following code and run:

import pandas as pd
import numpy as np
import time

n_rows = 1000
n_categories = 250

df = pd.DataFrame({'cat1': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})

# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time Case 1:", time.time() - now)

# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time Case 2:", time.time() - now)

# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})

# Case 3: Pivot table without multi-index columns, but with at least one categorical column (i.e., cat2)
# - still OK, but already show performance issue here.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time Case 3:", time.time() - now)

# Case 4: Pivot table with multi-index columns
# mixture of categorical (i.e., cat2) and non-categorical column (i.e., cat3)
# - strong increase in elapsed time/ memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time Case 4:", time.time() - now)

Problem description

Elapsed time Case 1: 0.03320956230163574
Elapsed time Case 2: 0.017639636993408203
Elapsed time Case 3: 0.35034942626953125
Elapsed time Case 4: 119.43837356567383

This issue will eat all my memory if the dataframe is large.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-26-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : zh_CN.UTF-8
LOCALE : zh_CN.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : 2.2.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@TomAugspurger
Copy link
Contributor

By default, Categorical dtypes preserve unobserved categories in the output. You'll need to convert to remove the unused categories before pivoting, or use the observed=True keyword.

@suokunlong
Copy link
Author

I tested and confirm that the "observed = True" keyword solves the problem. So now the problem is, why this option is not set as the default in pandas pivot_table function?

This option should be set to True by default, and Pandas should detect whether any of the groupers are Categoricals, if yes, then do as what observed = True should do; if no, then ignore the True option, provided that the False approach is faster.

@TomAugspurger
Copy link
Contributor

The intent of Categorical is to represent data from a fixed set of categories, and preserving the unobserved categories through the pivot (with counts of 0) matches that intent.

We have another issue for a dictionary encoded dtype, which would give you the memory-saving aspect of categorical without the unobserved categories behavior.

@TomAugspurger TomAugspurger added this to the No action milestone Jan 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants