Numerical errors in rolling std (0.23.4) #27593

Closed
ghost opened this issue Jul 25, 2019 · 6 comments

@ghost

ghost commented Jul 25, 2019

@DieterDePaepe I saw your numpy/numpy#7753 (comment); can you provide more information and perhaps a reproducer? Which pandas version? Your comment is from Jan 2019 and #18481 was merged in late Nov 2018; does the issue still exist for you?

xref #9420, #18430, #27202

@DieterDePaepe

DieterDePaepe commented Jul 26, 2019

I encountered this on pandas 0.23.4; I have not yet checked newer versions.

Here's a case:

import numpy as np
import pandas as pd

def sliding_window_view(x, shape, step=None, subok=False, writeable=False):
    # MIT License
    # Copyright (c) 2018 Fanjin Zeng
    # This work is licensed under the terms of the MIT license, see <https://opensource.org/licenses/MIT>.
    # https://gist.github.com/Fnjn/b061b28c05b5b0e768c60964d2cafa8d

    # first convert input to array, possibly keeping subclass
    x = np.array(x, copy=False, subok=subok)

    try:
        shape = np.array(shape, np.intp)
    except (TypeError, ValueError):
        raise TypeError('`shape` must be a sequence of integers')
    else:
        if shape.ndim > 1:
            raise ValueError('`shape` must be a one-dimensional sequence of integers')
        if len(x.shape) != len(shape):
            raise ValueError("`shape` length doesn't match input array dimensions")
        if np.any(shape <= 0):
            raise ValueError('`shape` cannot contain non-positive values')

    if step is None:
        step = np.ones(len(x.shape), np.intp)
    else:
        try:
            step = np.array(step, np.intp)
        except (TypeError, ValueError):
            raise TypeError('`step` must be a sequence of integers')
        else:
            if step.ndim > 1:
                raise ValueError('`step` must be a one-dimensional sequence of integers')
            if len(x.shape) != len(step):
                raise ValueError("`step` length doesn't match input array dimensions")
            if np.any(step <= 0):
                raise ValueError('`step` cannot contain non-positive values')

    o = (np.array(x.shape) - shape) // step + 1  # output shape
    if np.any(o <= 0):
        raise ValueError('window shape cannot be larger than input array shape')

    strides = x.strides
    view_strides = strides * step

    view_shape = np.concatenate((o, shape), axis=0)
    view_strides = np.concatenate((view_strides, strides), axis=0)
    view = np.lib.stride_tricks.as_strided(x, view_shape, view_strides, subok=subok, writeable=writeable)

    if writeable:
        return view.copy()
    else:
        return view

STD_VAR_STABILITY_DATA = np.array([
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 43.33, 69.39, 76.01, 76.03, 75.19, 82.21, 91.37, 86.44, 88.09,
    88.56, 98.88, 91.62, 93.97, 90.81, 88.25, 95.3, 100., 95.96, 98.13, 97.57, 94.02, 95.24, 92.59,
    98.98, 100., 100., 100., 97.88, 96.33, 98.07, 95.18, 93.52, 79.99, 37.08, 13.9, 17.43, 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 58.58, 70.16, 83.06, 82.79, 85.38, 100., 100., 100.,
    100., 100., 85.97, 56.18, 12., 12., 18.69, 12., 12., 13.9, 13.94, 25.69, 34.33, 65.06,
    80.1, 85.65, 84.57, 83.74, 94.75, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    90.1, 79.01, 65.47, 54.24, 25.05, 15.01, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.94, 42.61, 71.12, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 120., 120., 120., 120., 120., 120., 120., 14.69, 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.19, 14.81, 22.67, 31.61, 32.21, 39.68, 47.36, 52.63, 61.79, 62.49, 67.66, 120.,
    120., 120., 120., 109.44, 87.13, 51.72, 55.24, 57.78, 62.97, 66.43, 120., 120., 120., 120.,
    110.46, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 31.04, 52.73, 49.78, 57.56, 66.5, 66.92, 75.89, 88.17,
    97.98, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 49.6, 45.2, 13.15, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12.])


df = pd.DataFrame(STD_VAR_STABILITY_DATA)
pd_stds = df.rolling(24).std()[23:].values[:, 0]
np_stds = np.std(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1)

for i, (np_std, pd_std) in enumerate(zip(np_stds, pd_stds)):
    if not np.all(np_std == pd_std):
        print("Difference for:", i, np_std, pd_std, np.std(STD_VAR_STABILITY_DATA[i:i+24]))

@ghost ghost changed the title from "User reports numerical errors in rolling std" to "Numerical errors in rolling std (0.23.4)" on Jul 26, 2019
@DieterDePaepe

Updated

@ghost
Author

ghost commented Jul 27, 2019

@DieterDePaepe, the default for np.std is ddof=0; the default for pd.std is ddof=1. The discrepancy is unfortunate, but it's not a numerical issue.
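
For illustration (a minimal example, not from the report), the two defaults on a tiny series:

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0])

np.std(x)                    # 1.1180...  population std (ddof=0)
pd.Series(x).std()           # 1.2910...  sample std (ddof=1)

# passing ddof explicitly makes the two agree
np.std(x, ddof=1)            # 1.2910...
pd.Series(x).std(ddof=0)     # 1.1180...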

If you pass ddof=1 to both of them in your example, the output (on my computer, YMMV) is:

                i    np   pd        
Difference for: 229 0.0 5.66823507992799e-07
Difference for: 230 0.0 5.66823507992799e-07
Difference for: 231 0.0 5.66823507992799e-07

Instead of every window showing a difference, only these three do. The windows here have exactly zero variance, so pandas does fumble that a little, but the window mean in these cases is still about 7 orders of magnitude larger than pandas' error.

Those differences are easy to explain: pandas calculates a true rolling variance, whereas in your code numpy calculates the variance for each window separately. As a result, pandas is consistently 5-8x faster. Whether that's a worthwhile trade-off depends on your application.

%timeit np.sqrt(pd._libs.window.roll_var(STD_VAR_STABILITY_DATA,24,24, None, 'both', ddof=1)[23:])
%timeit np.std(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1, ddof=1)

a=np.sqrt(pd._libs.window.roll_var(STD_VAR_STABILITY_DATA,24,24, None, 'both', ddof=1)[23:])
b=np.std(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1, ddof=1)
np.allclose(a,b,atol=5e-6)
True
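
For intuition only, here is a minimal sketch of a rolling variance maintained with a Welford-style add/remove update (the general technique behind a streaming rolling variance, not pandas' actual roll_var code): each window reuses the previous window's mean and sum of squared deviations instead of recomputing them from scratch, which is where both the speedup and the tiny leftover residue come from.

import numpy as np

def rolling_var_welford(x, window, ddof=1):
    # Sketch: keep a running mean and sum of squared deviations (ssqdm),
    # updating them as one value enters and one value leaves the window.
    x = np.asarray(x, dtype=np.float64)
    out = np.full(len(x), np.nan)
    mean = 0.0
    ssqdm = 0.0  # sum of squared deviations from the running mean
    nobs = 0
    for i, val in enumerate(x):
        # add the incoming observation
        nobs += 1
        delta = val - mean
        mean += delta / nobs
        ssqdm += delta * (val - mean)
        # remove the observation that drops out of the window
        if i >= window:
            old = x[i - window]
            nobs -= 1
            delta = old - mean
            mean -= delta / nobs
            ssqdm -= delta * (old - mean)
        if i >= window - 1:
            # for a constant window, ssqdm should be exactly 0 but can end up
            # as a tiny floating-point residue -- the effect discussed above
            out[i] = ssqdm / (nobs - ddof)
    return out

# sanity check against the per-window computation, reusing names from the reproducer above
np.allclose(rolling_var_welford(STD_VAR_STABILITY_DATA, 24)[23:],
            np.var(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1, ddof=1),
            atol=5e-6)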

Update
An error of ~5.7e-07 in the std means that the variance it derives from carried a residue of only ~3e-13, which is close to the maximum precision float64 offers for these sums. This is just how floating-point calculations behave.
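
To put numbers on that (float64's machine epsilon is a documented constant; the residue below is just the square of the difference shown in the output above):

import numpy as np

np.finfo(np.float64).eps        # 2.220446049250313e-16, i.e. ~15-16 significant decimal digits
5.66823507992799e-07 ** 2       # ~3.21e-13: the residue at the variance level
np.sqrt(3.21e-13)               # ~5.67e-07: the square root turns it back into the std error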

@ghost ghost closed this as completed Jul 27, 2019
@ghost
Author

ghost commented Jul 27, 2019

Please consider updating your numpy/numpy#7753 (comment), since it's generally misleading.

@DieterDePaepe

Apologies for the mix-up with the ddof parameter. When this thread started I no longer had my original demonstrator and only had the test data in a unit test; I made a mistake when re-implementing the demonstrator I originally had.

Running on pandas 0.23.4, I do see more differences between pandas and numpy, though all absolute errors are indeed below 5e-6. I was able to get an absolute error of ~3.5e-6 (see below).

In my original use case, which is what led me to investigate these rounding issues, I was mainly concerned with the relative error, (result_std - actual_std) / actual_std. Using pandas, I could get a relative error as high as 1 for low std values.

Example:

# Same as above, with one value appended.
STD_VAR_STABILITY_DATA = np.array([
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 43.33, 69.39, 76.01, 76.03, 75.19, 82.21, 91.37, 86.44, 88.09,
    88.56, 98.88, 91.62, 93.97, 90.81, 88.25, 95.3, 100., 95.96, 98.13, 97.57, 94.02, 95.24, 92.59,
    98.98, 100., 100., 100., 97.88, 96.33, 98.07, 95.18, 93.52, 79.99, 37.08, 13.9, 17.43, 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 58.58, 70.16, 83.06, 82.79, 85.38, 100., 100., 100.,
    100., 100., 85.97, 56.18, 12., 12., 18.69, 12., 12., 13.9, 13.94, 25.69, 34.33, 65.06,
    80.1, 85.65, 84.57, 83.74, 94.75, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    90.1, 79.01, 65.47, 54.24, 25.05, 15.01, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.94, 42.61, 71.12, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 120., 120., 120., 120., 120., 120., 120., 14.69, 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.19, 14.81, 22.67, 31.61, 32.21, 39.68, 47.36, 52.63, 61.79, 62.49, 67.66, 120.,
    120., 120., 120., 109.44, 87.13, 51.72, 55.24, 57.78, 62.97, 66.43, 120., 120., 120., 120.,
    110.46, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 31.04, 52.73, 49.78, 57.56, 66.5, 66.92, 75.89, 88.17,
    97.98, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 49.6, 45.2, 13.15, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.0000175])

For the last window:

  • Numpy std: 3.4969604857939526e-06
  • Pandas std: 0.0
  • Absolute diff: 3.4969604857939526e-06
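
Using the author's relative-error definition, (result_std - actual_std) / actual_std, with the per-window numpy value as the reference, that last window is the worst case mentioned above:

numpy_std = 3.4969604857939526e-06    # per-window value reported above (the reference)
pandas_std = 0.0                      # rolling value reported above
(pandas_std - numpy_std) / numpy_std  # -1.0, i.e. a relative error of magnitude 1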

But I do agree that pandas' faster approach will be useful in many other cases. I'll update my comment in the numpy issue as requested.

@jreback
Contributor

jreback commented Jul 29, 2019

see the original change here: #6817

some amount of error is expected

This issue was closed.