Skip to content

[BUG]: Implement Kahan summation for rolling().mean() to avoid numerical issues #36348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Sep 15, 2020

Conversation

phofl
Copy link
Member

@phofl phofl commented Sep 13, 2020

I implemented the Kahan summation as suggested by @jreback. I used the variable names from https://en.wikipedia.org/wiki/Kahan_summation_algorithm. If there exists a name convention am not aware of, I am happy to rename the variables.

@phofl phofl added Window rolling, ewma, expanding Bug labels Sep 14, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this is shared code with mean; so pls test that as well

@phofl
Copy link
Member Author

phofl commented Sep 14, 2020

Done

@mroeschke
Copy link
Member

I dont think this hits resample.mean. I think @jreback might have meant just rolling.mean.

Could you do the same for rolling.sum? #13254

@phofl
Copy link
Member Author

phofl commented Sep 14, 2020

@mroeschke
Yeah you are right. I don't know what I have seen that made me believe resample.mean() would use the same function. Thanks.

rolling.sum() uses another Cython function. That was not fixed with my implementation. I would be happy to implement Kahan summation there too. Should I open another PR or mix it in here?

@jreback
Copy link
Contributor

jreback commented Sep 14, 2020

i believe this is shared code with mean; so pls test that as well

sorry I meant rolling.sum

yeah ideally we could share code, though these are simple enough and we don't want to introduce functions that it may be just easier to duplicate what is needed.

ok on this PR or followon to do rolling.sum

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this PR or more likely kahan sum, I think there are a couple of issues on the tracker. if you'd have a look can link them all (and use examples from them).


# Not NaN
if notnan(val):
nobs[0] = nobs[0] + 1
sum_x[0] = sum_x[0] + val
y = val - c[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment that using kahan summation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already added a comment in the function header. Is this sufficient?

@jreback jreback added this to the 1.2 milestone Sep 14, 2020
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Bug labels Sep 14, 2020
@phofl
Copy link
Member Author

phofl commented Sep 14, 2020

@jreback Implemented Kahan summation for rolling().sum(). I also added tests for the issues I've found.

Functions for add_mean, remove_mean, add_sum and remove_sum are quite similar now. We could share this code part for these functions.

@phofl
Copy link
Member Author

phofl commented Sep 14, 2020

Will look into Test failures

@phofl
Copy link
Member Author

phofl commented Sep 14, 2020

Tests should be fixed now. Somehow mixed plus and minus up

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more comment otherwise LGTM

result = (
df.resample("1s").ffill().rolling("3s", closed="left", min_periods=3).mean()
)
assert result.values[-1] == 0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we compare the entire DataFrame result for this test as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok looks good, ex @mroeschke comments.

can you run all of the rolling asv's and report the results. I would expect a small slowdown.

@phofl
Copy link
Member Author

phofl commented Sep 15, 2020

@jreback

I ran asv for rolling. Is the output within the range you expected?

      before           after         ratio
     [1b2f1f47]       [c70c91cd]
     <master>         <36031>   
+     1.48±0.02ms       2.64±0.6ms     1.78  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'sum')
+     1.77±0.03ms       2.73±0.6ms     1.54  rolling.VariableWindowMethods.time_rolling('Series', '1d', 'float', 'sum')
+        890±10μs      1.28±0.03ms     1.43  rolling.ExpandingMethods.time_expanding('Series', 'float', 'sum')
+     1.69±0.02ms       2.37±0.7ms     1.41  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'mean')
+     1.85±0.02ms       2.50±0.2ms     1.35  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'int', 'sum')
+     1.00±0.03ms      1.32±0.02ms     1.32  rolling.ExpandingMethods.time_expanding('Series', 'float', 'mean')
+     1.26±0.02ms      1.65±0.04ms     1.31  rolling.ExpandingMethods.time_expanding('DataFrame', 'float', 'sum')
+     3.52±0.06ms         4.54±1ms     1.29  rolling.ExpandingMethods.time_expanding('Series', 'float', 'count')
+     1.42±0.03ms       1.81±0.5ms     1.27  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'float', 'sum')
+     1.35±0.02ms      1.68±0.02ms     1.24  rolling.ExpandingMethods.time_expanding('DataFrame', 'float', 'mean')
+     1.59±0.05ms      1.92±0.06ms     1.21  rolling.Methods.time_rolling('DataFrame', 1000, 'float', 'sum')
+     1.66±0.05ms      1.93±0.05ms     1.16  rolling.Methods.time_rolling('DataFrame', 1000, 'float', 'mean')
+     1.71±0.02ms       1.98±0.2ms     1.16  rolling.VariableWindowMethods.time_rolling('Series', '1h', 'float', 'sum')
+     4.33±0.04ms       4.78±0.3ms     1.10  rolling.Pairwise.time_pairwise(10, 'cov', False)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jreback jreback merged commit 078f88e into pandas-dev:master Sep 15, 2020
@jreback
Copy link
Contributor

jreback commented Sep 15, 2020

thanks @phofl very nice!

yeah this is a sligth perf hit, but worth the tradeoff

@phofl phofl deleted the 36031 branch September 16, 2020 11:59
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request Sep 17, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Window rolling, ewma, expanding
Projects
None yet
3 participants