Performance issue when using pivot_table with multiple columns out of which at least one is of type 'category' #19622
Labels
Categorical
Categorical Data Type
Duplicate Report
Duplicate issue or pull request
Groupby
Performance
Memory or execution speed performance
Code Sample
Problem description
Creating a pivot table (with a multi-index) of a relatively small data frame with integer and float columns (case 2) goes much faster and uses much less resources compared to when the pivot table is created of the same data frame, but with one of the columns converted to a category (case 4).
On my (fairly old) system I find roughly a 200x increase in elapsed time between cases 2 and 4 from the code sample. Furthermore, memory consumption in case 2 is negligible, while in case 4 it runs well over 2GB. Increasing the number of categories
n_categories
in the original data frame the increases the resource usage/ execution time further.On the other hand, when no multi-index is used, there only seems to be a very small drop in performance (e.g. from case 1 to case 3).
There are already various issues submitted that involve memory usage of the pivot_table function, however as far as I see for large data frames. This issue seems different as it happens with relatively small data frames, and only in the specific case of a categorical variable in a multi-index.
Expected Output
I am unaware of the implementation details, but I would assume that a categorical variable has an integer representation under the hood and I would therefore expect no performance penalty at all, regardless of the number of columns chosen.
Output of
pd.show_versions()
pandas: 0.23.0.dev0+259.g7dcc86443
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: