BUG: Mean of zeros turn out to be non-zero #36031

Closed
2 of 3 tasks
harahu opened this issue Sep 1, 2020 · 6 comments · Fixed by #36348
Labels
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Window (rolling, ewma, expanding)
Milestone

Comments

@harahu
Contributor

harahu commented Sep 1, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

'''Two examples of a time-weighted rolling average, both of which should result in a final value of 0.'''
import pandas as pd


def succeeding_test():
    df = pd.DataFrame(
        {"A": [300239975158033.0, -0.0, -0.0]},
        index=[
            pd.Timestamp("19700101 09:00:00"),
            pd.Timestamp("19700101 09:00:03"),
            pd.Timestamp("19700101 09:00:06"),
        ],
    )
    v = df.resample("1s").ffill().rolling("3s", closed="left", min_periods=3).mean().values[-1]
    assert v == 0


def failing_test():
    df = pd.DataFrame(
        {"A": [3002399751580331.0, -0.0, -0.0]},  # First value is a single digit longer.
        index=[
            pd.Timestamp("19700101 09:00:00"),
            pd.Timestamp("19700101 09:00:03"),
            pd.Timestamp("19700101 09:00:06"),
        ],
    )
    v = df.resample("1s").ffill().rolling("3s", closed="left", min_periods=3).mean().values[-1]
    assert v == 0  # v turns out to be == -0.3333333

Problem description

In both these cases, the final rolling mean value is the mean of three zeros, but in the failing_test, this does not result in a zero value. I am expecting the failing_test to have the same behaviour as the succeeding_test. The difference lies in the magnitude of the first value in the df, but that shouldn't affect the last mean, as it falls outside the rolling window for that value.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.16.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 46.1.1
Cython : None
pytest : 4.6.11
hypothesis : 4.57.1
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : 0.16.0
xlrd : 1.2.0
xlwt : None
numba : 0.47.0

@harahu added the Bug and Needs Triage labels Sep 1, 2020
@jbrockmendel
Member

cc @mroeschke I can reproduce this on master.

@jbrockmendel added the Window (rolling, ewma, expanding) label and removed the Needs Triage label Sep 1, 2020
@aelafifi

aelafifi commented Sep 4, 2020

I found that this issue is related to the 64-bit float (C double) value 2**53 + 1, which cannot be represented exactly:

print(9007199254740993.0)
# Output: 9007199254740992.0

In your failing test, the float 3002399751580331.0 multiplied by 3 is exactly 2**53 + 1,
which is rounded to 2**53. That difference of 1 is the reason the result of the last window is -0.3333333,
which is -1 (the difference between 2**53 and 2**53 + 1) divided by 3 (the window size).
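
The arithmetic in this comment can be checked in plain Python. This is only a minimal sketch with illustrative variable names; it is not pandas code and makes no claim about how pandas orders its operations internally.

```python
# Illustrative check of the rounding described above (not pandas internals).
x = 3002399751580331.0

# As exact integers, three copies of x sum to 2**53 + 1.
assert 3 * int(x) == 2**53 + 1

# 2**53 + 1 is not representable as a 64-bit float, so it rounds to 2**53.
assert float(2**53 + 1) == float(2**53)

# Accumulating x three times in floating point therefore loses 1.
running_sum = (x + x) + x
assert running_sum == 2.0**53

# Subtracting x three times again leaves -1.0 instead of 0.0.
for _ in range(3):
    running_sum -= x

print(running_sum)      # -1.0
print(running_sum / 3)  # -0.3333333333333333
```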

@harahu
Contributor Author

harahu commented Sep 5, 2020

...and that difference of 1 is the reason for the result of the last window to be -0.3333333
which is -1 (the difference between 2**53 and 2**53+1) divided by 3 (window size)

But the weird thing here is that this number does not fall within the last window. I can see why this would affect the correctness of the first window, but why the last? How is this remainder of -1 transferred across window bounds? Any hypothesis, @aelafifi?

@mroeschke
Member

I suspect this is a numerical precision issue caused by how the mean is calculated under the hood: a single running sum is maintained across the entire dataset, with values added as they enter the window and subtracted as they leave it, instead of the mean being calculated in isolation from each window slice.
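
The suspicion above can be illustrated with a small pure-Python sketch. Both functions below are hypothetical stand-ins written for this illustration, not the pandas implementation, and they use a fixed-size window of 3 values in place of the time-based "3s" window from the report.

```python
def rolling_mean_running_sum(values, window):
    """Rolling mean that keeps one running sum for the whole series."""
    total = 0.0
    means = []
    for i, value in enumerate(values):
        total += value                   # value enters the window
        if i >= window:
            total -= values[i - window]  # oldest value leaves the window
        if i >= window - 1:
            means.append(total / window)
    return means


def rolling_mean_isolated(values, window):
    """Rolling mean that sums each window slice from scratch."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Three forward-filled copies of the large value, then three zeros.
values = [3002399751580331.0] * 3 + [-0.0] * 3

running = rolling_mean_running_sum(values, 3)
isolated = rolling_mean_isolated(values, 3)

print(running[-1])   # -0.3333333333333333
print(isolated[-1])  # zero, as expected for a window of three zeros
```

The rounding error made while the large values were being added stays in the running sum after those values have left the window, which is how it reaches a window that contains only zeros.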

@harahu
Contributor Author

harahu commented Sep 7, 2020

Sounds plausible. Is there any way I can contribute to getting this fixed as someone who only knows the public API of pandas?

@jreback
Contributor

jreback commented Sep 7, 2020

Using Kahan summation under the hood would fix this (we already do this for var).

PR welcome
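
As a rough idea of what Kahan (compensated) summation would look like for a rolling mean, here is a minimal pure-Python sketch. The `KahanSum` class and `rolling_mean_kahan` function are hypothetical names made up for this example; this is not the code used in pandas, and it again assumes a fixed-size window of 3 values rather than a time-based one.

```python
class KahanSum:
    """Running sum that tracks the rounding error of each addition."""

    def __init__(self):
        self.total = 0.0
        self.compensation = 0.0  # rounding error carried from earlier additions

    def add(self, value):
        y = value - self.compensation      # apply the carried correction
        t = self.total + y                 # this addition may round
        self.compensation = (t - self.total) - y  # recover what was rounded off
        self.total = t


def rolling_mean_kahan(values, window):
    """Rolling mean over a running sum with Kahan compensation."""
    acc = KahanSum()
    means = []
    for i, value in enumerate(values):
        acc.add(value)                     # value enters the window
        if i >= window:
            acc.add(-values[i - window])   # oldest value leaves the window
        if i >= window - 1:
            means.append(acc.total / window)
    return means


values = [3002399751580331.0] * 3 + [-0.0] * 3
means = rolling_mean_kahan(values, 3)
print(means[-1])  # 0.0
```

For this input the compensation term captures the lost 1 and returns it when the large values are removed, so the last window comes out as zero. Compensated summation reduces accumulated rounding error but does not guarantee exact results for every input.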

@jreback added this to the 1.2 milestone Sep 14, 2020
@jreback added the Algos label and removed the Bug label Sep 14, 2020