Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #19622

MishaVeldhoen · 2018-02-09T17:22:56Z

Code Sample

import pandas as pd
import numpy as np
import time

n_rows = 1000
n_categories = 250

df = pd.DataFrame({'cat1': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})

# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time:", time.time() - now)

# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)

# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})

# Case 3: Pivot table without multi-index columns - still OK
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time:", time.time() - now)

# Case 4: Pivot table with multi-index columns - strong increase in elapsed time/ memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)

Problem description

Creating a pivot table with multi-index columns from a relatively small data frame with integer and float columns (case 2) is much faster and uses far fewer resources than creating it from the same data frame with one of the columns converted to a category (case 4).

On my (fairly old) system I find roughly a 200x increase in elapsed time between cases 2 and 4 from the code sample. Furthermore, memory consumption in case 2 is negligible, while in case 4 it runs well over 2 GB. Increasing the number of categories n_categories in the original data frame increases the resource usage and execution time further.

On the other hand, when no multi-index is used, there only seems to be a very small drop in performance (e.g. from case 1 to case 3).

There are already various issues submitted that involve memory usage of the pivot_table function, but as far as I can see they concern large data frames. This issue seems different, as it happens with relatively small data frames and only in the specific case of a categorical variable in a multi-index.

Expected Output

I am unaware of the implementation details, but I would assume that a categorical variable has an integer representation under the hood and I would therefore expect no performance penalty at all, regardless of the number of columns chosen.
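For illustration, the integer representation assumed above can be inspected through the `.cat` accessor. This is only a minimal sketch on a made-up series (`values` is not part of the code sample), and it says nothing about how pivot_table uses those codes internally:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, unrelated to the data frame in the code sample.
values = pd.Series(np.array([10, 30, 10, 20]))
as_cat = values.astype("category")

# The distinct values become the categories (sorted for orderable data).
print(as_cat.cat.categories.tolist())  # [10, 20, 30]

# The column itself is stored as small integer codes pointing into the categories.
print(as_cat.cat.codes.tolist())       # [0, 2, 0, 1]
print(as_cat.cat.codes.dtype)          # int8
```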

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None

pandas: 0.23.0.dev0+259.g7dcc86443
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-02-09T17:37:50Z

Do you mind doing some profiling of the two to find the difference? If you run a line profiler https://github.com/rkern/line_profiler on pd.pivot_table for the two cases, the difference should stick out.
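A rough sketch of such a comparison follows. Note that it swaps in the standard-library `cProfile` (function-level, not line-level) instead of line_profiler so that it runs without extra packages. The sizes and the helper `profile_pivot` are assumptions made for the example, chosen small so that the categorical case stays cheap:

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_rows, n_categories = 200, 20  # deliberately small, hypothetical sizes

df = pd.DataFrame({
    "cat1": rng.randint(0, n_categories, size=n_rows),
    "cat2": rng.randint(0, n_categories, size=n_rows),
    "cat3": rng.randint(0, n_categories, size=n_rows),
    "real_numbers": rng.normal(size=n_rows),
})
df_cast = df.astype({"cat2": "category"})


def profile_pivot(frame, top=10):
    """Profile one pivot_table call and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    pd.pivot_table(frame, values="real_numbers", index="cat1",
                   columns=["cat2", "cat3"])
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report_plain = profile_pivot(df)      # corresponds to case 2
report_cat = profile_pivot(df_cast)   # corresponds to case 4
print(report_plain)
print(report_cat)
```

Comparing the two reports side by side should show which internal calls account for the extra time in the categorical case.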

MishaVeldhoen · 2018-02-09T23:54:52Z

Yes, I’ll look into it!

jreback · 2018-02-10T18:14:44Z

duplicate of this issue: #15217

pivot is just calling groupby.
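To make the groupby connection concrete, here is a minimal sketch on made-up data (the frame and its column names are hypothetical). It assumes a pandas version in which `groupby` accepts the `observed` keyword, and it only illustrates how a categorical key can enlarge the grouped result; it is not a confirmed diagnosis of this issue:

```python
import pandas as pd

# Hypothetical frame: "key1" declares four categories but only two occur.
df = pd.DataFrame({
    "key1": pd.Categorical(["a", "b"], categories=["a", "b", "c", "d"]),
    "key2": [1, 2],
    "value": [1.0, 2.0],
})

# observed=False keeps groups for categories that never appear in the data.
all_combos = df.groupby(["key1", "key2"], observed=False)["value"].mean()

# observed=True keeps only the key combinations actually present.
seen_only = df.groupby(["key1", "key2"], observed=True)["value"].mean()

print(len(all_combos))  # larger: includes the unused categories "c" and "d"
print(len(seen_only))   # 2
```

With hundreds of categories and several grouping keys, an expansion of this kind grows multiplicatively, which would be consistent with the time and memory figures reported above.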

jreback · 2018-02-10T18:15:25Z

You are welcome to take a look at that one (I suggest a solution there) :>

suokunlong · 2020-01-24T08:06:22Z

I can still reproduce this issue with pandas version 0.25.3. It seems it was never fixed, and it may not be a duplicate of #15217.
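A possible workaround to try, not confirmed anywhere in this thread, is passing `observed=True` to pivot_table. The sketch below assumes pandas 0.25 or later, where pivot_table accepts that keyword, and uses small made-up sizes. It only checks that the result matches the non-categorical pivot, not that the slowdown disappears:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_rows, n_categories = 200, 20  # hypothetical sizes, smaller than in the report

df = pd.DataFrame({
    "cat1": rng.randint(0, n_categories, size=n_rows),
    "cat2": rng.randint(0, n_categories, size=n_rows),
    "cat3": rng.randint(0, n_categories, size=n_rows),
    "real_numbers": rng.normal(size=n_rows),
})
df_cast = df.astype({"cat2": "category"})

# Reference result: all grouping columns are plain integers (case 2).
plain = pd.pivot_table(df, values="real_numbers", index="cat1",
                       columns=["cat2", "cat3"])

# Categorical column, but restricted to the combinations that occur in the data.
observed = pd.pivot_table(df_cast, values="real_numbers", index="cat1",
                          columns=["cat2", "cat3"], observed=True)

print(plain.shape, observed.shape)
print(np.allclose(plain.to_numpy(), observed.to_numpy(), equal_nan=True))
```

Timing both calls at the original sizes (n_rows = 1000, n_categories = 250) would show whether this helps on a given pandas version.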

MishaVeldhoen changed the title ~~Very large resource usage when using pivot_table with multiple columns out of which at least one is of type 'category'~~ Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' Feb 9, 2018

jreback closed this as completed Feb 10, 2018

jreback added the Groupby, Performance (Memory or execution speed performance), Categorical (Categorical Data Type), and Duplicate Report (Duplicate issue or pull request) labels Feb 10, 2018

jreback added this to the No action milestone Feb 10, 2018

suokunlong mentioned this issue Jan 24, 2020

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #31274

Closed
