Numerical errors in rolling std (0.23.4) #27593

Closed
ghost opened this issue Jul 25, 2019 · 6 comments

@ghost

ghost commented Jul 25, 2019

@DieterDePaepe I saw your numpy/numpy#7753 (comment); can you provide more information and perhaps a reproducer? Which pandas version? Your comment is from Jan 2019 and #18481 was merged in late Nov 2018; does the issue still exist for you?

xref #9420, #18430, #27202

@DieterDePaepe

DieterDePaepe commented Jul 26, 2019

I encountered this on pandas 0.23.4; I have not yet checked newer versions.

Here's a case:

import numpy as np
import pandas as pd

def sliding_window_view(x, shape, step=None, subok=False, writeable=False):
    # MIT License
    # Copyright (c) 2018 Fanjin Zeng
    # This work is licensed under the terms of the MIT license, see <https://opensource.org/licenses/MIT>.
    # https://gist.github.com/Fnjn/b061b28c05b5b0e768c60964d2cafa8d

    # first convert input to array, possibly keeping subclass
    x = np.array(x, copy=False, subok=subok)

    try:
        shape = np.array(shape, np.intp)
    except (TypeError, ValueError):
        raise TypeError('`shape` must be a sequence of integers')
    else:
        if shape.ndim > 1:
            raise ValueError('`shape` must be a one-dimensional sequence of integers')
        if len(x.shape) != len(shape):
            raise ValueError("`shape` length doesn't match input array dimensions")
        if np.any(shape <= 0):
            raise ValueError('`shape` cannot contain non-positive values')

    if step is None:
        step = np.ones(len(x.shape), np.intp)
    else:
        try:
            step = np.array(step, np.intp)
        except (TypeError, ValueError):
            raise TypeError('`step` must be a sequence of integers')
        else:
            if step.ndim > 1:
                raise ValueError('`step` must be a one-dimensional sequence of integers')
            if len(x.shape) != len(step):
                raise ValueError("`step` length doesn't match input array dimensions")
            if np.any(step <= 0):
                raise ValueError('`step` cannot contain non-positive values')

    o = (np.array(x.shape) - shape) // step + 1  # output shape
    if np.any(o <= 0):
        raise ValueError('window shape cannot be larger than input array shape')

    strides = x.strides
    view_strides = strides * step

    view_shape = np.concatenate((o, shape), axis=0)
    view_strides = np.concatenate((view_strides, strides), axis=0)
    view = np.lib.stride_tricks.as_strided(x, view_shape, view_strides, subok=subok, writeable=writeable)

    if writeable:
        return view.copy()
    else:
        return view

STD_VAR_STABILITY_DATA = np.array([
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 43.33, 69.39, 76.01, 76.03, 75.19, 82.21, 91.37, 86.44, 88.09,
    88.56, 98.88, 91.62, 93.97, 90.81, 88.25, 95.3, 100., 95.96, 98.13, 97.57, 94.02, 95.24, 92.59,
    98.98, 100., 100., 100., 97.88, 96.33, 98.07, 95.18, 93.52, 79.99, 37.08, 13.9, 17.43, 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 58.58, 70.16, 83.06, 82.79, 85.38, 100., 100., 100.,
    100., 100., 85.97, 56.18, 12., 12., 18.69, 12., 12., 13.9, 13.94, 25.69, 34.33, 65.06,
    80.1, 85.65, 84.57, 83.74, 94.75, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    90.1, 79.01, 65.47, 54.24, 25.05, 15.01, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.94, 42.61, 71.12, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 120., 120., 120., 120., 120., 120., 120., 14.69, 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.19, 14.81, 22.67, 31.61, 32.21, 39.68, 47.36, 52.63, 61.79, 62.49, 67.66, 120.,
    120., 120., 120., 109.44, 87.13, 51.72, 55.24, 57.78, 62.97, 66.43, 120., 120., 120., 120.,
    110.46, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 31.04, 52.73, 49.78, 57.56, 66.5, 66.92, 75.89, 88.17,
    97.98, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 49.6, 45.2, 13.15, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12.])


df = pd.DataFrame(STD_VAR_STABILITY_DATA)
pd_stds = df.rolling(24).std()[23:].values[:, 0]
np_stds = np.std(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1)

for i, (np_std, pd_std) in enumerate(zip(np_stds, pd_stds)):
    if not np.all(np_std == pd_std):
        print("Difference for:", i, np_std, pd_std, np.std(STD_VAR_STABILITY_DATA[i:i+24]))

@ghost ghost changed the title from "User reports numerical errors in rolling std" to "Numerical errors in rolling std (0.23.4)" on Jul 26, 2019
@DieterDePaepe

Updated

@ghost
Author

ghost commented Jul 27, 2019

@DieterDePaepe, the default for np.std is ddof=0; the default for pd.std is ddof=1. The discrepancy is unfortunate, but it's not a numerical issue.
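
For illustration (a minimal example, not from the report), the two defaults on a tiny series:

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0])

np.std(x)                    # 1.1180...  population std (ddof=0)
pd.Series(x).std()           # 1.2910...  sample std (ddof=1)

# passing ddof explicitly makes the two agree
np.std(x, ddof=1)            # 1.2910...
pd.Series(x).std(ddof=0)     # 1.1180...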

If you pass ddof=1 to both of them in your example, the output (on my computer, YMMV) is:

                i    np   pd        
Difference for: 229 0.0 5.66823507992799e-07
Difference for: 230 0.0 5.66823507992799e-07
Difference for: 231 0.0 5.66823507992799e-07

Instead of every window showing a difference, only these three do. The windows here have exactly zero variance, so pandas does fumble that a little, but the window mean in these cases is still about 7 orders of magnitude larger than pandas' error.

Those differences are easy to explain: pandas calculates a true rolling variance, whereas in your code numpy calculates the variance for each window separately. As a result, pandas is consistently 5-8x faster. Whether that's a worthwhile trade-off depends on your application.

%timeit np.sqrt(pd._libs.window.roll_var(STD_VAR_STABILITY_DATA,24,24, None, 'both', ddof=1)[23:])
%timeit np.std(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1, ddof=1)

a=np.sqrt(pd._libs.window.roll_var(STD_VAR_STABILITY_DATA,24,24, None, 'both', ddof=1)[23:])
b=np.std(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1, ddof=1)
np.allclose(a,b,atol=5e-6)
True
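
For intuition only, here is a minimal sketch of a rolling variance maintained with a Welford-style add/remove update (the general technique behind a streaming rolling variance, not pandas' actual roll_var code): each window reuses the previous window's mean and sum of squared deviations instead of recomputing them from scratch, which is where both the speedup and the tiny leftover residue come from.

import numpy as np

def rolling_var_welford(x, window, ddof=1):
    # Sketch: keep a running mean and sum of squared deviations (ssqdm),
    # updating them as one value enters and one value leaves the window.
    x = np.asarray(x, dtype=np.float64)
    out = np.full(len(x), np.nan)
    mean = 0.0
    ssqdm = 0.0  # sum of squared deviations from the running mean
    nobs = 0
    for i, val in enumerate(x):
        # add the incoming observation
        nobs += 1
        delta = val - mean
        mean += delta / nobs
        ssqdm += delta * (val - mean)
        # remove the observation that drops out of the window
        if i >= window:
            old = x[i - window]
            nobs -= 1
            delta = old - mean
            mean -= delta / nobs
            ssqdm -= delta * (old - mean)
        if i >= window - 1:
            # for a constant window, ssqdm should be exactly 0 but can end up
            # as a tiny floating-point residue -- the effect discussed above
            out[i] = ssqdm / (nobs - ddof)
    return out

# sanity check against the per-window computation, reusing names from the reproducer above
np.allclose(rolling_var_welford(STD_VAR_STABILITY_DATA, 24)[23:],
            np.var(sliding_window_view(STD_VAR_STABILITY_DATA, [24]), axis=1, ddof=1),
            atol=5e-6)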

Update
An error of ~5.7e-07 in the std means that the variance it derives from carried a residue of only ~3e-13, which is close to the maximum precision float64 offers for these sums. This is just how floating-point calculations behave.
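
To put numbers on that (float64's machine epsilon is a documented constant; the residue below is just the square of the difference shown in the output above):

import numpy as np

np.finfo(np.float64).eps        # 2.220446049250313e-16, i.e. ~15-16 significant decimal digits
5.66823507992799e-07 ** 2       # ~3.21e-13: the residue at the variance level
np.sqrt(3.21e-13)               # ~5.67e-07: the square root turns it back into the std error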

@ghost ghost closed this as completed Jul 27, 2019
@ghost
Author

ghost commented Jul 27, 2019

Please consider updating your numpy/numpy#7753 (comment), since it's generally misleading.

@DieterDePaepe

Apologies for the mix-up with the ddof parameter. When this thread started I no longer had my original demonstrator and only had the test data in a unit test; I made a mistake when re-implementing the demonstrator I originally had.

Running on pandas 0.23.4, I do see more differences between pandas and numpy, though all absolute errors are indeed below 5e-6. I was able to get an absolute error of ~3.5e-6 (see below).

In my original use case, which is what led me to investigate these rounding issues, I was mainly concerned with the relative error, (result_std - actual_std) / actual_std. Using pandas, I could get a relative error as high as 1 for low std values.

Example:

# Same as above, with one value appended.
STD_VAR_STABILITY_DATA = np.array([
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 43.33, 69.39, 76.01, 76.03, 75.19, 82.21, 91.37, 86.44, 88.09,
    88.56, 98.88, 91.62, 93.97, 90.81, 88.25, 95.3, 100., 95.96, 98.13, 97.57, 94.02, 95.24, 92.59,
    98.98, 100., 100., 100., 97.88, 96.33, 98.07, 95.18, 93.52, 79.99, 37.08, 13.9, 17.43, 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 58.58, 70.16, 83.06, 82.79, 85.38, 100., 100., 100.,
    100., 100., 85.97, 56.18, 12., 12., 18.69, 12., 12., 13.9, 13.94, 25.69, 34.33, 65.06,
    80.1, 85.65, 84.57, 83.74, 94.75, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    90.1, 79.01, 65.47, 54.24, 25.05, 15.01, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.94, 42.61, 71.12, 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 120., 120., 120., 120., 120., 120., 120., 14.69, 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 15.19, 14.81, 22.67, 31.61, 32.21, 39.68, 47.36, 52.63, 61.79, 62.49, 67.66, 120.,
    120., 120., 120., 109.44, 87.13, 51.72, 55.24, 57.78, 62.97, 66.43, 120., 120., 120., 120.,
    110.46, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 31.04, 52.73, 49.78, 57.56, 66.5, 66.92, 75.89, 88.17,
    97.98, 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
    100., 100., 100., 49.6, 45.2, 13.15, 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.,
    12., 12., 12., 12., 12., 12., 12., 12., 12., 12., 12.0000175])

For the last window:

  • Numpy std: 3.4969604857939526e-06
  • Pandas std: 0.0
  • Absolute diff: 3.4969604857939526e-06
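
Using the author's relative-error definition, (result_std - actual_std) / actual_std, with the per-window numpy value as the reference, that last window is the worst case mentioned above:

numpy_std = 3.4969604857939526e-06    # per-window value reported above (the reference)
pandas_std = 0.0                      # rolling value reported above
(pandas_std - numpy_std) / numpy_std  # -1.0, i.e. a relative error of magnitude 1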

But I do agree that pandas' faster approach will be useful in many other cases. I'll update my comment in the numpy issue as requested.

@jreback
Contributor

jreback commented Jul 29, 2019

see the original change here: #6817

some amount of error is expected

This issue was closed.