
PERF: DataFrame.resample is very slow in 1.4 and 1.4.1 #46066


Closed
dovsay opened this issue Feb 19, 2022 · 7 comments
Labels
Bug · Dtype Conversions (Unexpected or buggy dtype conversions) · Performance (Memory or execution speed performance) · Regression (Functionality that used to work in a prior pandas version) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@dovsay

dovsay commented Feb 19, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

df_con = df_con.reindex(columns=['datetime', 'open', 'high', 'low', 'close', 'volume', 'amount'])
# resample: 5min bars to 15min bars

df_con.drop(columns=['amount'], inplace=True)
df_con.set_index('datetime', inplace=True)
ohlc_dict = {
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
}
o = time.time()
df_con = df_con.resample('15min', closed='right', label='right').apply(ohlc_dict)
print(time.time() - o)

#----------------------

df_con is a DataFrame read from a CSV file and contains 14,690 rows.

The code prints 8.608731031417847 (about 9 seconds on average).

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.9.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : zh_CN
LOCALE : Chinese (Simplified)_China.936

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.1
Cython : None
pytest : None
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.31.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

With the same code as above, when I change the version to 1.3.5 it prints 0.015625476837158203 (about 0.015 seconds on average).

@dovsay added the Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) labels on Feb 19, 2022
@lukemanley
Member

Thanks for the report! Can you update the example such that df_con is populated with mock data (e.g. np.random.rand) so that it is fully reproducible?

@dovsay
Author

dovsay commented Feb 20, 2022

import pandas as pd
import numpy as np
import time

df_txt = pd.DataFrame(np.random.rand(2000000, 5), columns=['open', 'high', 'low', 'close', 'volume'])
df_txt['datetime'] = pd.date_range(start='1/1/2022', periods=2000000, freq='5min')

df_web = pd.DataFrame(np.random.rand(5, 5), columns=['open', 'high', 'low', 'close', 'volume'])
df_web['datetime'] = pd.date_range(start='3/3/2022', periods=5, freq='5min')

# swap in this empty frame instead to trigger the slowdown:
# df_web = pd.DataFrame(columns=['datetime', 'open', 'high', 'low', 'close', 'volume', 'amount'])

df_con = pd.concat([df_txt, df_web], ignore_index=True)

df_con.set_index('datetime', inplace=True)
ohlc_dict = {
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
}
o = time.time()
df_con = df_con.resample('15min', closed='right', label='right').apply(ohlc_dict)
print(time.time() - o)

#------------------------------------------
After testing with random data many times and comparing against my original code, I found that the problem lies in the arguments passed to concat. When one of the DataFrames passed to concat is empty, concat itself runs at normal speed, but the subsequent resample call becomes much slower. When none of the concatenated DataFrames is empty, resample takes about 0.45 s; when one of them is empty, resample takes 154 s. These numbers are from version 1.4.1. On version 1.3.5 the same code runs fine and takes about 0.2 s.

@lukemanley
Member

This looks like a regression. Concatenating an empty frame results in all columns being cast to object.

pd.concat([
    pd.DataFrame({'A': [1.0, 2.0]}),
    pd.DataFrame(columns=['A']),
]).dtypes

main:

A    object
dtype: object

1.3.5:

A    float64
dtype: object

@lukemanley added the Regression (Functionality that used to work in a prior pandas version), Dtype Conversions (Unexpected or buggy dtype conversions), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Bug labels, and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Feb 20, 2022
@phofl
Member

phofl commented Feb 21, 2022

@lukemanley So the reason for the performance regression is a concat call? Because this is known, see #45637

@lukemanley
Member

@lukemanley So the reason for the performance regression is a concat call? Because this is known, see #45637

Yes, I think so. If you add this line after the concat, the example runs very quickly (less than a second):

df_con = df_con.astype(df_txt.dtypes)
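For anyone hitting this before the fix lands, the workaround can be sketched end to end. This is a minimal sketch, not the full repro: the column names and sizes are shortened from the report, and the cast to object only actually occurs on the affected 1.4.x versions (on other versions the astype is a harmless no-op):

```python
import numpy as np
import pandas as pd

# Mirror the report: a frame of real data concatenated with an empty frame.
df_txt = pd.DataFrame(np.random.rand(10, 2), columns=['open', 'close'])
df_txt['datetime'] = pd.date_range('1/1/2022', periods=10, freq='5min')
df_empty = pd.DataFrame(columns=['datetime', 'open', 'close'])

df_con = pd.concat([df_txt, df_empty], ignore_index=True)

# On the affected 1.4.x versions the empty frame upcasts every column to
# object, which is what makes the later resample slow. Casting back to the
# original dtypes restores the fast path:
df_con = df_con.astype(df_txt.dtypes)
```

DataFrame.astype accepts a column-to-dtype mapping, so passing the original frame's .dtypes restores all columns in one call.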

@phofl
Member

phofl commented Feb 21, 2022

Thx, so I think we can close this

@lukemanley
Member

Sure, thanks for pointing that out.

3 participants