
Rolling mean on a time-based period with one data point does not return the exact mean #32761


Closed
bcbrock opened this issue Mar 16, 2020 · 1 comment · Fixed by #36348
Labels
Window rolling, ewma, expanding

Comments

bcbrock commented Mar 16, 2020

First, thank you all for creating and maintaining this package, and for the obvious care that has gone into the numerical analysis of the rolling methods.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
print('Pandas Version', pd.__version__)
pd.options.display.float_format = '{:,.15f}'.format
np.random.seed(42)

index = pd.date_range(start='2020-01-01', end='2020-01-02', freq='1s').append(pd.DatetimeIndex(['2020-01-03']))
data = np.random.rand(len(index))

df = pd.DataFrame({'data':data}, index=index)
df['mean'] = df.rolling('60s').mean()
df

Result

>>> print('Pandas Version', pd.__version__)
Pandas Version 1.0.2
...
>>> df
                                 data              mean
2020-01-01 00:00:00 0.374540118847362 0.374540118847362
2020-01-01 00:00:01 0.950714306409916 0.662627212628639
2020-01-01 00:00:02 0.731993941811405 0.685749455689561
2020-01-01 00:00:03 0.598658484197037 0.663976712816430
2020-01-01 00:00:04 0.156018640442437 0.562385098341631
...                               ...               ...
2020-01-01 23:59:57 0.888246435024149 0.461337636515083
2020-01-01 23:59:58 0.825973496654809 0.468473569401559
2020-01-01 23:59:59 0.487629482790733 0.473417509587717
2020-01-02 00:00:00 0.705341871943242 0.480022530163534
2020-01-03 00:00:00 0.105076984427931 0.105076984428225

[86402 rows x 2 columns]

Problem description

In the example above I show a rolling mean over a 1-minute window, but the final data point jumps ahead by one day. Since this is the only point in the final window, I expect the mean there to be exactly the single value in the window. You can see, however, that there is some error in the last 4 digits displayed.

I believe this is due to an omission in the function roll_mean_variable, which I am reading in pandas/_libs/window/aggregations.pyx. I believe the function should zero the running sum whenever the number of observations drops to 0 after a "remove" operation; otherwise, rounding noise from previous periods is left behind in the running sum.
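
To see the mechanism without the Cython machinery, here is a minimal pure-Python sketch of the same running-sum scheme (my own illustration, not the pandas source):

import numpy as np

# A running-sum rolling mean: incoming points are added to sum_x,
# outgoing points are subtracted from it.
np.random.seed(42)
vals = np.random.rand(10)

sum_x, nobs = 0.0, 0
for v in vals:              # a window's worth of points enters
    sum_x += v
    nobs += 1
for v in vals:              # ...and later leaves again
    sum_x -= v
    nobs -= 1

# sum_x is now typically a tiny nonzero residue rather than exactly 0.0,
# because subtraction cannot undo the rounding performed by the additions.
print(sum_x)

# With the proposed reset, the next single-point window is exact:
if nobs == 0:
    sum_x = 0.0
sum_x += vals[0]
nobs += 1
print(sum_x / nobs == vals[0])  # True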

If I apply this patch to my local copy, I get the exact mean:

diff --git a/pandas/_libs/window/aggregations.pyx b/pandas/_libs/window/aggregations.pyx
index 495b436..76ba1dd 100644
--- a/pandas/_libs/window/aggregations.pyx
+++ b/pandas/_libs/window/aggregations.pyx
@@ -344,20 +344,24 @@ def roll_mean_variable(ndarray[float64_t] values, ndarray[int64_t] start,
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             else:
 
                 # calculate deletes
                 for j in range(start[i - 1], s):
                     val = values[j]
                     remove_mean(val, &nobs, &sum_x, &neg_ct)
 
+                # Reset sum if NOBS goes to 0
+                if nobs == 0:
+                    sum_x = 0
+
                 # calculate adds
                 for j in range(end[i - 1], e):
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             output[i] = calc_mean(minp, nobs, neg_ct, sum_x)
 
             if not is_monotonic_bounds:
                 for j in range(s, e):
                     val = values[j]

I understand that there is a philosophical question here: given that most rolling means are likely not bit-accurate anyway, is it worth checking for this special case? One argument in favor is that for time series like mine, with gaps here and there, each gap would clear the accumulated error, and the overall accuracy of the procedure would improve.

I notice that the rolling variance calculations do reset their accumulators to 0 when a window is completely depopulated. I am curious, though: is there a numerical-analysis reason for doing the adds before the removes there? I believe this ordering nullifies the resetting effect of clearing the accumulators, because with it my windows are never seen as empty. This is the code I'm referring to, paraphrased as a sketch (not the verbatim source):
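
# Loop ordering in roll_var_variable (paraphrase):
# adds are processed before removes, so nobs never reaches 0 in between
# and the empty-window reset inside remove_var never fires.

# calculate adds
for j in range(end[i - 1], e):
    add_var(values[j], &nobs, &mean_x, &ssqdm_x)

# calculate deletes
for j in range(start[i - 1], s):
    remove_var(values[j], &nobs, &mean_x, &ssqdm_x)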

Thank you,

Bishop Brock

Expected Output

I expect the rolling mean of any time-based window containing a single data point to be exactly that data point.
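
Concretely, for the example above, I would expect the following check to pass (on 1.0.2 it fails):

# The final window holds only the final data point, so its mean should equal it.
assert df['mean'].iloc[-1] == df['data'].iloc[-1]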

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.2+0.g7485dbe6f.dirty
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1

@matteosantama (Contributor) commented:

New here, but I think your solution makes a lot of sense. It seems like a zero-cost opportunity to improve the accuracy of the calculation.
