data frame rolling std return wrong result with large elements #28244

Closed
KaneFu opened this issue Aug 31, 2019 · 3 comments · Fixed by #36433
Labels
Bug Window rolling, ewma, expanding

@KaneFu
KaneFu commented Aug 31, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

arr = [9.54e+08, 6.225e-01, np.nan, 0, 1.14, 0]
arr = pd.DataFrame(arr)
# window=5, min_periods=3
print(arr.rolling(5, 3).std())
###output
              0
0           NaN
1           NaN
2           NaN
3  5.507922e+08
4  4.770000e+08
5  0.000000e+00
#### Problem description

Expected output.iloc[5, 0] to be 0.551: the standard deviation of the last five elements is clearly not zero.

If I change the first element from 9.54e+08 to 9.45e+07, the last standard deviation is 0.816, which is still wrong. If I change the first element to 9.45e+5, everything is okay.

The first element should not affect the last result, since window=5 and the last window no longer contains it. There seems to be a bug with large elements, even though a magnitude of e+08 is far too small to overflow.


#### Output of ``pd.show_versions()``

<details>

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 3.0.7
pip: 19.2.1
setuptools: 36.4.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
pyarrow: None
xarray: 0.12.3
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml.etree: 3.7.3
bs4: 4.6.0
</details>
@dsaxton
Member

dsaxton commented Sep 2, 2019

Seems like the issue is in roll_var from pandas._libs.window: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/window.pyx#L701

Not sure how you'd debug this, but it looks like the function behaves as though all the remaining elements are zero:

import numpy as np
import pandas as pd
from pandas._libs.window import roll_var

lst = [10**10, 2, 3, 4, 5, 6, 7, 8, 9]
df = pd.DataFrame(lst)
df.rolling(5, 3).var()

#               0
# 0           NaN
# 1           NaN
# 2  3.333333e+19
# 3  2.500000e+19
# 4  2.000000e+19
# 5  0.000000e+00
# 6  0.000000e+00
# 7  0.000000e+00
# 8  0.000000e+00

roll_var(np.asarray(lst, dtype=np.float64), 5, 3, None, None, 1)

# array([           nan,            nan, 3.33333333e+19, 2.50000000e+19,
#        2.00000000e+19, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
#        0.00000000e+00])

lst = [10**10, 0, 0, 0, 0, 0, 0, 0, 0]
df = pd.DataFrame(lst)
df.rolling(5, 3).var()

#               0
# 0           NaN
# 1           NaN
# 2  3.333333e+19
# 3  2.500000e+19
# 4  2.000000e+19
# 5  0.000000e+00
# 6  0.000000e+00
# 7  0.000000e+00
# 8  0.000000e+00

@KaneFu
Author

KaneFu commented Sep 5, 2019

The issue does indeed seem to be in roll_var. After switching to rolling().apply(lambda x: np.nanstd(x, ddof=1)), everything is fine. I found that whenever the elements differ by more than about 1e8, roll_var gives unexpected results. I ended up replacing every rolling().std() call with rolling().apply().
I'm still confused by this bug.
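For anyone who wants to try the same workaround, here is a minimal sketch applied to the data from the original report. It assumes `ddof=1` (the default of `rolling().std()`) and passes `raw=True` so that each window reaches `np.nanstd` as a plain NumPy array. Both choices are mine for illustration, not something confirmed elsewhere in this thread.

```python
import numpy as np
import pandas as pd

# Data from the original report.
arr = pd.DataFrame([9.54e+08, 6.225e-01, np.nan, 0, 1.14, 0])

# Recompute the standard deviation from scratch for every window
# instead of relying on the incremental rolling std.
# ddof=1 is an assumption chosen to match the rolling().std() default.
result = arr.rolling(5, 3).apply(lambda x: np.nanstd(x, ddof=1), raw=True)

print(result)
```

The last row comes out at about 0.551, the value expected in the report. The trade-off is speed: the function is called once per window, so this is considerably slower than the built-in rolling std on large frames.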
Thanks a lot!

@phofl
Member

phofl commented Sep 16, 2020

I looked into this a bit.
The numerical issues occur at

ssqdm_x[0] = ssqdm_x[0] + ((nobs[0] - 1) * delta ** 2) / nobs[0]

During the first iteration, delta equals the first value, and delta stays equally large over the following iterations. The square of delta is so large that we run into floating-point precision issues: the contributions of the small values that come later are lost.
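The effect can be reproduced without pandas. The sketch below is a simplified pure-Python model of a sliding-window variance that updates a running mean and a running sum of squared deviations as values enter and leave the window. It is loosely modeled on the update line quoted above, but the function names, the NaN handling, and the clamping of negative results to zero are assumptions for illustration, not the actual Cython code. A two-pass version that recomputes each window from scratch is included for comparison.

```python
import math
import statistics


def rolling_var_online(values, window, min_periods, ddof=1):
    """Sliding-window variance via incremental add/remove updates (illustrative)."""
    nobs, mean, ssqdm = 0, 0.0, 0.0
    out = []
    for i, val in enumerate(values):
        # Add the value entering the window.
        if not math.isnan(val):
            nobs += 1
            delta = val - mean
            mean += delta / nobs
            ssqdm += (nobs - 1) * delta ** 2 / nobs
        # Remove the value leaving the window.
        if i >= window:
            old = values[i - window]
            if not math.isnan(old):
                nobs -= 1
                if nobs:
                    delta = old - mean
                    mean -= delta / nobs
                    ssqdm -= (nobs + 1) * delta ** 2 / nobs
                else:
                    mean, ssqdm = 0.0, 0.0
        if nobs >= min_periods and nobs > ddof:
            # Clamp tiny negative results caused by rounding (an assumption).
            out.append(max(ssqdm / (nobs - ddof), 0.0))
        else:
            out.append(math.nan)
    return out


def rolling_var_two_pass(values, window, min_periods):
    """Sample variance (ddof=1) recomputed from scratch for every window."""
    out = []
    for i in range(len(values)):
        win = [v for v in values[max(0, i - window + 1): i + 1]
               if not math.isnan(v)]
        if len(win) >= max(min_periods, 2):
            out.append(statistics.variance(win))
        else:
            out.append(math.nan)
    return out


data = [1e10, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
online = rolling_var_online(data, 5, 3)
exact = rolling_var_two_pass(data, 5, 3)
print(online)
print(exact)
```

Once 1e10 has left the window, the two-pass variance of [2, 3, 4, 5, 6] is 2.5. The incremental version cannot recover that value: the running sum sits near 8e19, where adjacent doubles are 16384 apart, so subtracting the large term back out leaves either zero or a multiple of that spacing rather than the true remainder.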

I can't see a way to fix this. Should we add something to the docs?
