This issue was initially reported as #19622, which was then marked as a duplicate of #15217 and closed. However, it is still reproducible with the most recent version, 0.25.3.
To reproduce, copy-paste the following code and run:
import pandas as pd
import numpy as np
import time

n_rows = 1000
n_categories = 250
df = pd.DataFrame({'cat1': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})

# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time Case 1:", time.time() - now)

# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time Case 2:", time.time() - now)

# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})

# Case 3: Pivot table without multi-index columns,
# but with at least one categorical column (i.e., cat2)
# - still OK, but already shows a performance issue here.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time Case 3:", time.time() - now)

# Case 4: Pivot table with multi-index columns
# mixture of categorical (i.e., cat2) and non-categorical column (i.e., cat3)
# - strong increase in elapsed time / memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time Case 4:", time.time() - now)
Problem description
Elapsed time Case 1: 0.03320956230163574
Elapsed time Case 2: 0.017639636993408203
Elapsed time Case 3: 0.35034942626953125
Elapsed time Case 4: 119.43837356567383
This issue will eat all my memory if the dataframe is large.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-26-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : zh_CN.UTF-8
LOCALE : zh_CN.UTF-8
By default, Categorical dtypes preserve unobserved categories in the output. You'll need to remove the unused categories before pivoting, or use the observed=True keyword.
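For anyone landing here, a minimal sketch of the observed=True workaround on a scaled-down version of the reproducer. The smaller sizes, the fixed seed, and the names `observed` and `n_pairs` are assumptions made for illustration, not part of the original report. It only checks that the resulting columns are limited to the (cat2, cat3) pairs that actually occur in the data.

```python
import numpy as np
import pandas as pd

# Scaled-down variant of the reproducer (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
n_rows, n_categories = 200, 50
df = pd.DataFrame({
    "cat1": rng.integers(0, n_categories, size=n_rows),
    "cat2": rng.integers(0, n_categories, size=n_rows),
    "cat3": rng.integers(0, n_categories, size=n_rows),
    "real_numbers": rng.normal(size=n_rows),
}).astype({"cat2": "category"})

# observed=True asks the underlying groupby to use only the category
# combinations that actually appear in the data.
observed = pd.pivot_table(
    df,
    values="real_numbers",
    index="cat1",
    columns=["cat2", "cat3"],
    observed=True,
)

# Number of distinct (cat2, cat3) pairs present in the data.
n_pairs = len(df[["cat2", "cat3"]].drop_duplicates())
print(observed.shape[1], n_pairs)
```

Passing observed explicitly also avoids depending on whatever the default happens to be in the installed pandas version.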
I tested and can confirm that the observed=True keyword solves the problem. So the question now is: why is this option not the default in the pandas pivot_table function?
This option should default to True, and pandas should detect whether any of the groupers are Categoricals: if so, behave as observed=True does; if not, ignore the option, provided that the observed=False path is faster.
The intent of Categorical is to represent data from a fixed set of categories, and preserving the unobserved categories through the pivot (with counts of 0) matches that intent.
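To illustrate what preserving unobserved categories means, here is a small sketch using a plain groupby rather than pivot_table. The column names and category labels are made up for the example. The category "medium" is declared but never appears in the data, so it shows up with a count of 0 unless observed=True is passed or the unused category is removed first.

```python
import pandas as pd

# "medium" is a declared category that never occurs in the data.
df = pd.DataFrame({
    "size": pd.Categorical(
        ["small", "large", "small"],
        categories=["small", "medium", "large"],
    ),
    "qty": [1, 2, 3],
})

# observed=False keeps every declared category, including the empty one.
with_unobserved = df.groupby("size", observed=False)["qty"].count()

# observed=True keeps only the categories that actually occur.
only_observed = df.groupby("size", observed=True)["qty"].count()

# Alternatively, drop the unused categories from the column itself.
trimmed = df["size"].cat.remove_unused_categories()

print(with_unobserved)
print(only_observed)
print(list(trimmed.cat.categories))
```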
We have another issue for a dictionary encoded dtype, which would give you the memory-saving aspect of categorical without the unobserved categories behavior.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-26-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : zh_CN.UTF-8
LOCALE : zh_CN.UTF-8
pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : 2.2.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None