BUG: numerical inconsistency in calculating rolling std when starting from different points in the same data #60053
Comments
The values from 2016-07 to 2016-11 are enormous (on the order of 10^7 to 10^10). If most of the values in a rolling window are very close to each other and much smaller than the outlier/extremum, the variance can become extremely small, leading to a near-zero standard deviation. The small values have almost no impact on the mean, because the outlier/extremum skews the mean towards itself. This reduces the variance, and the standard deviation becomes close to zero. We need to understand how this function handles such an outlier/extremum.
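The loss of precision described above can be reproduced in isolation. This sketch (my own illustration, not pandas' actual kernel) contrasts an online running-sums update, similar in spirit to what a rolling window implementation does, with a fresh per-window computation. After a huge value (1e9) leaves the window, the error left behind in the running sum of squares swamps the small true squares, the naive variance goes negative, and clamping it yields a std of exactly 0:

```python
import math

def rolling_std_running(data, w):
    """Rolling population std via running sum / sum-of-squares
    (an illustrative online update, prone to catastrophic cancellation)."""
    s = ss = 0.0
    out = []
    for i, x in enumerate(data):
        s += x
        ss += x * x
        if i >= w:
            old = data[i - w]
            s -= old
            ss -= old * old
        if i >= w - 1:
            mean = s / w
            # cancellation can drive this negative; clamp to 0
            var = max(ss / w - mean * mean, 0.0)
            out.append(math.sqrt(var))
    return out

def rolling_std_exact(data, w):
    """Recompute each window from scratch (no accumulated error)."""
    out = []
    for i in range(w - 1, len(data)):
        window = data[i - w + 1 : i + 1]
        mean = sum(window) / w
        var = sum((x - mean) ** 2 for x in window) / w
        out.append(math.sqrt(var))
    return out

data = [1e9, 1.0, 2.0, 3.0, 4.0, 5.0]
print(rolling_std_running(data, 4)[-1])  # 0.0 -- the 1e18 square wiped out the small terms
print(rolling_std_exact(data, 4)[-1])    # ~1.1180, the true std of [2, 3, 4, 5]
```

The last window [2, 3, 4, 5] never contains the outlier, yet the online version still reports 0 because squaring 1e9 produced 1e18, at which magnitude float64 cannot resolve increments of 1, 4, or 9.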
Note that there is a warning in the documentation:
This is a limitation for now: see #58712
Thank you very much; I sincerely look forward to your solution.
Summary: floating-point arithmetic has errors; numba can be used as an alternative. The limitation of only being able to pass a single-argument function to the numba engine will be lifted in pandas 3.0. I believe there is nothing more to do here, closing.
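The alternative mentioned in the summary can be sketched as follows. The key idea is recomputing each window from scratch via `rolling(...).apply(...)` rather than relying on the online update; passing `engine="numba"` to `apply` (when numba is installed) makes the recomputation fast. The series here is illustrative, not the reporter's attached CSV:

```python
import numpy as np
import pandas as pd

# A series where a huge transient precedes small, nearly equal values
# (illustrative stand-in for the reported data).
s = pd.Series([1e9, 1.0, 2.0, 3.0, 4.0, 5.0])

# Built-in rolling std uses an online update and can lose precision
# after the large value has left the window.
online = s.rolling(4).std()

# Recomputing each window from scratch sidesteps the accumulated error;
# add engine="numba" to .apply(...) for speed when numba is available.
exact = s.rolling(4).apply(lambda x: np.std(x, ddof=1), raw=True)

print(online.iloc[-1], exact.iloc[-1])
```

The trade-off is speed: the built-in kernel is O(n) over the whole series, while recomputing every window is O(n * window), which numba compilation only partially offsets.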
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I have a data Series with a length of 93230. I want to calculate the rolling std, but I got 0 for the last value, which is almost impossible. So I checked by computing the std of only the last 1000 values, and that matches numpy.std. It seems that, starting from a particular beginning point, the rolling std gives different results!
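The check described above can be sketched like this, with synthetic data standing in for the attached std_problem.csv (the reporter's length of 93230 and window of 1000 are kept; the values and seed are placeholders). On well-behaved data the last rolling std and a fresh numpy computation over the final 1000 values agree, which is exactly the consistency the reporter expected:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the attached series; seed is arbitrary.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=93230))

# Last rolling std over a 1000-wide window...
rolling_last = s.rolling(1000).std().iloc[-1]
# ...vs. a direct numpy computation over the same final 1000 values.
direct_last = np.std(s.iloc[-1000:].to_numpy(), ddof=1)

print(rolling_last, direct_last)
```

The discrepancy in the bug report only appears when the series contains values of wildly different magnitudes, as in the attached CSV; this benign synthetic series will not trigger it.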
Expected Behavior
I expect them to give the same result, 0.701844, no matter where the series begins, because rolling(1000) should only use the latest 1000 numbers for the last std.
std_problem.csv
Installed Versions
I tested with pandas = 2.1.2 and pandas = 2.2.3; both have the same problem.
pd.show_versions()
INSTALLED VERSIONS
commit : a60ad39
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.5.1
pip : 24.0
lxml.etree : 5.2.2
jinja2 : 3.1.4
IPython : 8.20.0
pandas_datareader : 0.10.0
bs4 : 4.12.3
bottleneck : 1.3.7
fsspec : 2024.6.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.1
pyarrow : 16.1.0
s3fs : 2024.6.1
scipy : 1.12.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
INSTALLED VERSIONS
commit : 0691c5c
python : 3.11.10
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
IPython : 8.28.0
tzdata : 2024.2