Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

suokunlong · 2020-01-24T08:21:46Z

This issue was initially reported as #19622, which was then marked as a duplicate of #15217 and closed since then. However, this issue is still reproducible with the most recent version 0.25.3

To reproduce, copy-paste the following code and run:

import pandas as pd
import numpy as np
import time

n_rows = 1000
n_categories = 250

df = pd.DataFrame({'cat1': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})

# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time Case 1:", time.time() - now)

# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time Case 2:", time.time() - now)

# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})

# Case 3: Pivot table without multi-index columns, 
＃ but with at least one categorical column (i.e., cat2)
# - still OK, but already show performance issue here.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time Case 3:", time.time() - now)

# Case 4: Pivot table with multi-index columns
# mixture of categorical (i.e., cat2) and non-categorical column (i.e., cat3)
# - strong increase in elapsed time/ memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time Case 4:", time.time() - now)

Problem description

Elapsed time Case 1: 0.03320956230163574
Elapsed time Case 2: 0.017639636993408203
Elapsed time Case 3: 0.35034942626953125
Elapsed time Case 4: 119.43837356567383

This issue will eat all my memory if the dataframe is large.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-26-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : zh_CN.UTF-8
LOCALE : zh_CN.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : 2.2.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2020-01-24T13:51:25Z

By default, Categorical dtypes preserve unobserved categories in the output. You'll need to convert to remove the unused categories before pivoting, or use the observed=True keyword.

suokunlong · 2020-01-24T14:55:43Z

I tested and confirm that the "observed = True" keyword solves the problem. So now the problem is, why this option is not set as the default in pandas pivot_table function?

This option should be set to True by default, and Pandas should detect whether any of the groupers are Categoricals, if yes, then do as what observed = True should do; if no, then ignore the True option, provided that the False approach is faster.

TomAugspurger · 2020-01-24T15:03:06Z

The intent of Categorical is to represent data from a fixed set of categories, and preserving the unobserved categories through the pivot (with counts of 0) matches that intent.

We have another issue for a dictionary encoded dtype, which would give you the memory-saving aspect of categorical without the unobserved categories behavior.

TomAugspurger closed this as completed Jan 24, 2020

TomAugspurger added this to the No action milestone Jan 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

suokunlong commented Jan 24, 2020

INSTALLED VERSIONS

TomAugspurger commented Jan 24, 2020

suokunlong commented Jan 24, 2020

TomAugspurger commented Jan 24, 2020

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

Comments

suokunlong commented Jan 24, 2020

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Jan 24, 2020

suokunlong commented Jan 24, 2020

TomAugspurger commented Jan 24, 2020

Output of `pd.show_versions()`