
Rolling mean on a time-based period with one data point does not return the exact mean #32761


Closed
bcbrock opened this issue Mar 16, 2020 · 1 comment · Fixed by #36348
Labels
Window rolling, ewma, expanding

Comments

bcbrock commented Mar 16, 2020

First, thank you all for creating and maintaining this package, and for the obvious care that has gone into the numerical analysis of the rolling methods.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
print('Pandas Version', pd.__version__)
pd.options.display.float_format = '{:,.15f}'.format
np.random.seed(42)

index = pd.date_range(start='2020-01-01', end='2020-01-02', freq='1s').append(pd.DatetimeIndex(['2020-01-03']))
data = np.random.rand(len(index))

df = pd.DataFrame({'data':data}, index=index)
df['mean'] = df.rolling('60s').mean()
df

Result

>>> print('Pandas Version', pd.__version__)
Pandas Version 1.0.2
...
>>> df
                                 data              mean
2020-01-01 00:00:00 0.374540118847362 0.374540118847362
2020-01-01 00:00:01 0.950714306409916 0.662627212628639
2020-01-01 00:00:02 0.731993941811405 0.685749455689561
2020-01-01 00:00:03 0.598658484197037 0.663976712816430
2020-01-01 00:00:04 0.156018640442437 0.562385098341631
...                               ...               ...
2020-01-01 23:59:57 0.888246435024149 0.461337636515083
2020-01-01 23:59:58 0.825973496654809 0.468473569401559
2020-01-01 23:59:59 0.487629482790733 0.473417509587717
2020-01-02 00:00:00 0.705341871943242 0.480022530163534
2020-01-03 00:00:00 0.105076984427931 0.105076984428225

[86402 rows x 2 columns]

Problem description

In the example above I show a rolling mean over a 1-minute window, but the final data point jumps ahead by one day. Since this is the only point in the final window, I expect the mean there to be exactly the single value in the window. You can see, however, that there is some error in the last 4 digits displayed.

I believe this is due to an omission in the function roll_mean_variable, which I am reading in pandas/_libs/window/aggregations.pyx. I believe the function should zero the running sum whenever the number of observations drops to 0 after a "remove" operation; otherwise, rounding noise from previous periods is left behind in the running sum.
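
To see the mechanism without the Cython machinery, here is a minimal pure-Python sketch of the same running-sum scheme (my own illustration, not the pandas source):

import numpy as np

# A running-sum rolling mean: incoming points are added to sum_x,
# outgoing points are subtracted from it.
np.random.seed(42)
vals = np.random.rand(10)

sum_x, nobs = 0.0, 0
for v in vals:              # a window's worth of points enters
    sum_x += v
    nobs += 1
for v in vals:              # ...and later leaves again
    sum_x -= v
    nobs -= 1

# sum_x is now typically a tiny nonzero residue rather than exactly 0.0,
# because subtraction cannot undo the rounding performed by the additions.
print(sum_x)

# With the proposed reset, the next single-point window is exact:
if nobs == 0:
    sum_x = 0.0
sum_x += vals[0]
nobs += 1
print(sum_x / nobs == vals[0])  # True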

If I apply this patch to my local copy, I get the exact mean:

diff --git a/pandas/_libs/window/aggregations.pyx b/pandas/_libs/window/aggregations.pyx
index 495b436..76ba1dd 100644
--- a/pandas/_libs/window/aggregations.pyx
+++ b/pandas/_libs/window/aggregations.pyx
@@ -344,20 +344,24 @@ def roll_mean_variable(ndarray[float64_t] values, ndarray[int64_t] start,
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             else:
 
                 # calculate deletes
                 for j in range(start[i - 1], s):
                     val = values[j]
                     remove_mean(val, &nobs, &sum_x, &neg_ct)
 
+                # Reset sum if NOBS goes to 0
+                if nobs == 0:
+                    sum_x = 0
+
                 # calculate adds
                 for j in range(end[i - 1], e):
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             output[i] = calc_mean(minp, nobs, neg_ct, sum_x)
 
             if not is_monotonic_bounds:
                 for j in range(s, e):
                     val = values[j]

I understand that there is a philosophical question here: given that most rolling means are likely not bit-accurate anyway, is it worth checking for this special case? One argument in favor is that for time series like mine, with gaps here and there, each gap would clear the accumulated error, and the overall accuracy of the procedure would improve.

I notice that the rolling variance calculations do reset their accumulators to 0 when a window is completely depopulated. I am curious, though: is there a numerical-analysis reason for doing the adds before the removes there? I believe this ordering nullifies the resetting effect of clearing the accumulators, because with it my windows are never seen as empty. This is the code I'm referring to, paraphrased as a sketch (not the verbatim source):
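
# Loop ordering in roll_var_variable (paraphrase):
# adds are processed before removes, so nobs never reaches 0 in between
# and the empty-window reset inside remove_var never fires.

# calculate adds
for j in range(end[i - 1], e):
    add_var(values[j], &nobs, &mean_x, &ssqdm_x)

# calculate deletes
for j in range(start[i - 1], s):
    remove_var(values[j], &nobs, &mean_x, &ssqdm_x)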

Thank you,

Bishop Brock

Expected Output

I expect the rolling mean of any time-based window containing a single data point to be exactly that data point.
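
Concretely, for the example above, I would expect the following check to pass (on 1.0.2 it fails):

# The final window holds only the final data point, so its mean should equal it.
assert df['mean'].iloc[-1] == df['data'].iloc[-1]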

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.2+0.g7485dbe6f.dirty
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1

@matteosantama (Contributor) commented:

New here, but I think your solution makes a lot of sense. It seems like a zero-cost opportunity to improve the accuracy of the calculation.
