Skip to content

Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #19622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MishaVeldhoen opened this issue Feb 9, 2018 · 5 comments
Labels
Categorical Categorical Data Type Duplicate Report Duplicate issue or pull request Groupby Performance Memory or execution speed performance

Comments

@MishaVeldhoen
Copy link
Contributor

MishaVeldhoen commented Feb 9, 2018

Code Sample

import pandas as pd
import numpy as np
import time

n_rows = 1000
n_categories = 250

df = pd.DataFrame({'cat1': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0,n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})

# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time:", time.time() - now)

# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)

# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})

# Case 3: Pivot table without multi-index columns - still OK
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time:", time.time() - now)

# Case 4: Pivot table with multi-index columns - strong increase in elapsed time/ memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)

Problem description

Creating a pivot table (with a multi-index) of a relatively small data frame with integer and float columns (case 2) goes much faster and uses much less resources compared to when the pivot table is created of the same data frame, but with one of the columns converted to a category (case 4).

On my (fairly old) system I find roughly a 200x increase in elapsed time between cases 2 and 4 from the code sample. Furthermore, memory consumption in case 2 is negligible, while in case 4 it runs well over 2GB. Increasing the number of categories n_categories in the original data frame the increases the resource usage/ execution time further.

On the other hand, when no multi-index is used, there only seems to be a very small drop in performance (e.g. from case 1 to case 3).

There are already various issues submitted that involve memory usage of the pivot_table function, however as far as I see for large data frames. This issue seems different as it happens with relatively small data frames, and only in the specific case of a categorical variable in a multi-index.

Expected Output

I am unaware of the implementation details, but I would assume that a categorical variable has an integer representation under the hood and I would therefore expect no performance penalty at all, regardless of the number of columns chosen.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None

pandas: 0.23.0.dev0+259.g7dcc86443
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@MishaVeldhoen MishaVeldhoen changed the title Very large resource usage when using pivot_table with multiple columns out of which at least one is of type 'category' Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' Feb 9, 2018
@TomAugspurger
Copy link
Contributor

Do you mind doing some profiling of the two to find the difference? If you run a line profiler https://github.com/rkern/line_profiler on pd.pivot_table for the two cases, the difference should stick out.

@MishaVeldhoen
Copy link
Contributor Author

Yes, I’ll look into it!

@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

duplicate of this issue: #15217

pivot is just calling groupby.

@jreback jreback closed this as completed Feb 10, 2018
@jreback jreback added Groupby Performance Memory or execution speed performance Categorical Categorical Data Type Duplicate Report Duplicate issue or pull request labels Feb 10, 2018
@jreback jreback added this to the No action milestone Feb 10, 2018
@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

welcome to have you look at that one (and I suggest the soln there) :>

@suokunlong
Copy link

I still reproduce this issue with pandas version 0.25.3. It seems never fix at all, may not be a duplicate of #15217.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Duplicate Report Duplicate issue or pull request Groupby Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants