Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: use same value counts for roll_var to remove floating point artifacts #46431

Merged

Conversation

auderson
Copy link
Contributor

@auderson auderson commented Mar 19, 2022

ENH: use same value counts for roll_var to remove floating point artifacts
Please refer to #42064 (comment) for the explaination of this new method.

@auderson
Copy link
Contributor Author

Looks like there's a network issue in the test.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a battery of tests that exercise this code.

you can use something like the doc examples as tests. they should fail before and pass after.

need to exercise np.inf and np.nan as well (as you are including this in the code)'

pls parameterize over both std and var

pandas/_libs/window/aggregations.pyx Outdated Show resolved Hide resolved
pandas/_libs/window/aggregations.pyx Outdated Show resolved Hide resolved
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Window rolling, ewma, expanding labels Mar 19, 2022
@mroeschke
Copy link
Member

Isn't there logic that needs to be added to remove_var too? Essentially if we need to track if each window has all unique, non-nan values, then this sliding window algorithm needs to account for values moving out of the window?

@auderson
Copy link
Contributor Author

Isn't there logic that needs to be added to remove_var too? Essentially if we need to track if each window has all unique, non-nan values, then this sliding window algorithm needs to account for values moving out of the window?

Here's my thoughts:

Suppose we have x number of same values, then the window enters into some nans. Now same_value_count is still x. Then we enter a window with valid numbers.

If the new number is still the same, same_value_count will continue to increment by 1. Though it's actutally larger than num of same values in current window, we're safe to set result as zero. If the new number differs, same_value_count will be reset to 1.

So in summary, I think it's OK that same_value_count is not updated in remove_var, because same_value_count basically records the num of same value in the entire series.

@auderson
Copy link
Contributor Author

auderson commented Mar 20, 2022

I will write some tests in the upcoming days, but I'm not sure where I should put them. Do I need to add some in pandas/tests/window/test_rolling.py?

If only doc test is needed, do I just have to write some examples in rolling.var, std 's docs (in fact I already update the old examples in the docs)?

@mroeschke
Copy link
Member

Gotcha. Not sure if there's a condition where remove_var is called in an iteration while add_var is not which could also maybe mess up the counting(?), but if not then the resetting should work.

pandas/tests/window/test_rolling.py is fine for the tests. I don't think we specifically need more doc examples for this.

It would be good to have an example like

data = pd.Series([0, 0, 0, 1, 2, 3, 3, 3, 2, 1]).rolling(3).var()

But the 0, 1, 2, 3 are numbers where the floating point artifacts used to show up

@auderson
Copy link
Contributor Author

auderson commented Mar 21, 2022

Gotcha. Not sure if there's a condition where remove_var is called in an iteration while add_var is not which could also maybe mess up the counting(?), but if not then the resetting should work.

The case where only remove_var is called happens when it enters a series of NaN or it goes to the end. For that first case, I've added corresponding tests.

Basically, the counting needn't care about removed values, as long as we know how many x have appeared when we see it, and the count is correct, then it's OK even if we count x from the beginning of the array.

And we only need to make sure every time we see a new number in add_var, the count updates correctly.

@auderson
Copy link
Contributor Author

The test failed in py38_32bit. Do I need to do anything about it? Seems like a bizarre error

@auderson auderson requested a review from jreback March 22, 2022 02:48
@auderson
Copy link
Contributor Author

Not so familiar with git. After I clicked "sync fork" in pycharm, my branch was added commits from others which may be the cause of these test failures(?) 😓

My last version failed in py38_32bit and I can't figure out why. @mroeschke Would you tell me what I should do next?

@mroeschke
Copy link
Member

I would recommend following the contributing guide to set up git correctly (not sure how Pycharm decided to do this): https://pandas.pydata.org/docs/development/contributing.html#working-with-the-code

Namely you probably need to merge the main branch into your branch

git remote add upstream https://github.com/pandas-dev/pandas.git
git checkout roll_var_remove_floating_point_artifacts
git fetch upstream
git merge upstream/main

@auderson auderson force-pushed the roll_var_remove_floating_point_artifacts branch from 4a976e5 to ac1cec4 Compare March 25, 2022 06:00
@auderson
Copy link
Contributor Author

@mroeschke I reset my branch to the earliest state and merge with main and then force-pushed it. It really works.
The tests all passed expect for the doc build. But that seems to be a bug in main?

@auderson
Copy link
Contributor Author

@mroeschke looks like Py_ssize_t can't pass 32bit test

@auderson auderson force-pushed the roll_var_remove_floating_point_artifacts branch from d1b6f9b to 7909ffe Compare March 28, 2022 14:32
@jreback
Copy link
Contributor

jreback commented Apr 9, 2022

do not merge to anything bigger

wait for review and merge here

@auderson
Copy link
Contributor Author

auderson commented Apr 9, 2022

The former tests here require "all rolling kurt for all equal values should return Nan" (which referenced a wrong issue id?#18804)
With the new impl, the default result of same values is 0 for skew and -3 for kurt.
Possiblly related: #30993

def test_rolling_skew_eq_value_fperr(step):
# #18804 all rolling skew for all equal values should return Nan
a = Series([1.1] * 15).rolling(window=10, step=step).skew()
assert np.isnan(a).all()
def test_rolling_kurt_eq_value_fperr(step):
# #18804 all rolling kurt for all equal values should return Nan
a = Series([1.1] * 15).rolling(window=10, step=step).kurt()
assert np.isnan(a).all()

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looked pretty good, just needs to resolve the merge conflict

…ating_point_artifacts

# Conflicts:
#	doc/source/whatsnew/v1.5.0.rst
@auderson
Copy link
Contributor Author

This looked pretty good, just needs to resolve the merge conflict

I put these changes in Other enhancements section, not sure if it's right.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make an additional test that tests something like [5, 7, 5, 7, 5, 7] e.g. nothing is repeated yet its likey degenerate.

doc/source/whatsnew/v1.5.0.rst Outdated Show resolved Hide resolved
pandas/tests/window/test_rolling.py Show resolved Hide resolved
pandas/tests/window/test_rolling.py Show resolved Hide resolved
@jreback jreback added this to the 1.5 milestone Apr 10, 2022
@auderson auderson force-pushed the roll_var_remove_floating_point_artifacts branch from 2258681 to 738c45b Compare April 11, 2022 04:01
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on green

@mroeschke
Copy link
Member

@auderson could you merge main one more time?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @mroeschke merge on green

…ating_point_artifacts

# Conflicts:
#	doc/source/whatsnew/v1.5.0.rst
@auderson auderson force-pushed the roll_var_remove_floating_point_artifacts branch from 12f123d to 4110084 Compare April 19, 2022 07:05
@mroeschke mroeschke merged commit a8968bf into pandas-dev:main Apr 19, 2022
@mroeschke
Copy link
Member

Awesome thanks @auderson (failure unrelated)!

@auderson
Copy link
Contributor Author

Nice👍

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
…facts (pandas-dev#46431)

* fix merge

* bug fix: nobs == 0 == min_periods

* use Py_ssize_t in place of int64_t

* use tm.assert_series_equal

* add more checks in roll numba methods

* add more checks in roll numba methods

* cancel check exact

* test 32bit: undo Py_ssize_t

* add explanation for test_rolling_var_same_value_count_logic

* comments updated

* add equal-zero test for test_rolling_var_numerical_issues & test_rolling_var_floating_artifact_precision

* update docs

* add more tests

* update whats new

* update whats new

* update doc

Co-authored-by: auderson <liao.renjie@techfin.ai>
Co-authored-by: Jeff Reback <jeff@reback.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: rolling std can't handle mixture of (relatively) big and small numbers
3 participants