
Feature/rolling #668


Merged: 1 commit merged into pydata:master on Feb 20, 2016

Conversation

@jhamman (Member) commented Dec 2, 2015

This is an initial take on the rolling aggregation object and methods in xray. This PR implements the following (a combined sketch follows the list below):

  • A new Rolling class available in DataArray objects:

    rolling_obj = da.rolling(time=7)
  • bottleneck.move_* functions are wrapped and are available in the following manner:

    rolling_obj.mean()
  • generic reduce method

    from numpy import nanpercentile
    rolling_obj.reduce(nanpercentile, q=5)
  • iterating through the rolling object:

     for label, da_window in rolling_obj:
         # da_window is a view of da
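
Putting the pieces above together, a minimal self-contained sketch (variable names are illustrative; written with the current xarray import name, though the package was still called xray at the time):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.random.rand(30), dims='time')

    # build the rolling object over a 7-element window along 'time'
    rolling_obj = da.rolling(time=7)

    # wrapped bottleneck aggregation
    rolling_obj.mean()

    # generic reduce with any function that accepts an `axis` keyword
    rolling_obj.reduce(np.nanpercentile, q=5)

    # iterate over the windows; each da_window is a view of da
    for label, da_window in rolling_obj:
        pass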

TODO:

  • Documentation
  • Cleanup _setup_windows
  • Inject bottleneck methods in some generic way
  • Possibly create a Window object to hold windowed DataArrays

closes #641 #130
xref pandas-dev/pandas#11603, pandas-dev/pandas#11704
cc @shoyer @bartnijssen

center : boolean, default False
    Set the labels at the center of the window.
kwarg : dim=window
dim : str
Member:

I like this docstring format -- does it render in a reasonable way in the API docs?

Member Author:

just checked, not perfect.

[attached screenshot of the rendered API docs]

Member:

looks pretty decent to me! I'll take it :)

@jhamman (Member Author) commented Dec 2, 2015

@shoyer - thanks for the first look. I'll give it another hack.

    yield (label, self.obj.isel(**{self.dim: indices}))
else:
    # return nans for this window
    temp = self.obj.isel(**{self.dim: indices}).copy()

@jhamman (Member Author) commented Dec 3, 2015

@shoyer - I made some more progress here tonight. How do you suggest we handle the bottleneck dependency? That is the reason for the failing tests at the moment.

values = bn.move_sum(self.obj.values, window=self.window,
                     min_count=self.min_periods, axis=self._axis_num)
coords = self.obj.coords
coords[self.dim] = self.window_labels
Member:

you don't need this now because window_labels is the same as the original coordinate

Member Author:

right - nice catch.

@shoyer (Member) commented Dec 3, 2015

> How do you suggest we handle the bottleneck dependency? That is the reason for the failing tests at the moment.

You can either add a try/except around a top-level import of bottleneck, or only import bottleneck locally inside the functions that need it. I think I would prefer the latter approach because it results in more intelligible error messages (ImportError: No module named bottleneck rather than NameError: name 'bottleneck' is not defined).
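
A minimal sketch of the locally scoped import (the function name and surrounding structure are illustrative, not the PR's actual code):

    def _rolling_mean(values, window, min_count=None, axis=-1):
        # importing inside the function keeps bottleneck an optional
        # dependency and surfaces a clear ImportError only when a rolling
        # aggregation is actually requested
        import bottleneck as bn
        return bn.move_mean(values, window=window, min_count=min_count,
                            axis=axis)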

@jhamman (Member Author) commented Dec 3, 2015

@shoyer - would you mind taking a look at what I've just tried (and failed) in ops.py and common.py? I think I'm missing a big piece of the injection puzzle.

@shoyer (Member) commented Dec 3, 2015

Internet at work today is only working 20% of the time. I'm happy to take a look once things get back online :).


@jhamman (Member Author) commented Dec 3, 2015

sounds good. Thanks. That's got to slow down work at a tech company 💤

@shoyer (Member) commented Dec 3, 2015

For iteration, what about only iterating over full windows? Thinking about how I might use iteration, I think this might be more useful than returning some shrunk windows.

Concretely, this means that if you iterate over xray.DataArray(range(6), dims='x').rolling(x=3), results would have labels from 1 through 5.

I think you've done a pretty reasonable job of interpreting min_periods for iteration, but I would still vote for defining it only as an argument to the aggregation methods and not worrying about it for iteration. It keeps things a bit simpler and easier to understand. OTOH, if you can think of use cases for min_periods with iteration, I could be convinced :).

@shoyer (Member) commented Dec 3, 2015

> @shoyer - would you mind taking a look at what I've just tried (and failed) in ops.py and common.py? I think I'm missing a big piece of the injection puzzle.

I'll give this a test, but it looks like you have all the pieces to me....

@jhamman (Member Author) commented Dec 4, 2015

> For iteration, what about only iterating over full windows? Thinking about how I might use iteration, I think this might be more useful than returning some shrunk windows.

I did consider this at first and it wouldn't be all that hard to implement but I chose not to go this route because I wanted consistency between reduce, _bottleneck_reduce and __iter__. In theory, all three of these should provide the same answer:

rolling_obj = da.rolling(time=4)

rolling_obj.mean()  # bottleneck move_mean
rolling_obj.reduce(np.nanmean)  # numpy nanmean over each window
concat([da.mean(dim='time') for _, da in rolling_obj], dim=rolling_obj.window_labels)  # manual mean via iterable - same as reduce 

> I think you've done a pretty reasonable job of interpreting min_periods for iteration, but I would still vote for defining it only as an argument to the aggregation methods and not worrying about it for iteration.

How did pandas land on this? To me it makes more sense as an argument to __init__ but I'll go with whatever pandas decided for consistency.

@shoyer (Member) commented Dec 4, 2015

> I wanted consistency between reduce, _bottleneck_reduce and __iter__.

Agreed, this would be nice. But if min_count=0, this won't be the case, because you will average over partial windows at the start of the rolling iteration. For example, with the window=3 example above, you apply the aggregation function to windows of size [1, 2, 3, 3, 3, 3]. And the labels are also not consistent.

@shoyer (Member) commented Dec 4, 2015

> How did pandas land on this? To me it makes more sense as an argument to __init__ but I'll go with whatever pandas decided for consistency.

Still unresolved, though Jeff Reback agrees with you. It's being discussed in the rolling PR currently.

Also: what about changing the default min_count to 0? I think that would be more consistent with pandas, which skips over missing values by default.

@jhamman (Member Author) commented Dec 7, 2015

> I'll give this a test, but it looks like you have all the pieces to me....

I'm getting this TypeError:

E   TypeError: move_sum() takes at least 2 positional arguments (0 given)

which makes me think the injection is failing to pass the arguments in.

func = cls._bottleneck_reduce(f)
func.__name__ = method_name
func.__doc__ = 'todo'
setattr(cls, method_name, f)
Member:

This should be func not f -- this should fix your failing tests
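
For reference, a sketch of the injection loop with that fix applied (the method list and loop structure are illustrative; _bottleneck_reduce is the wrapper from the snippet above):

    def _inject_bottleneck_rolling_methods(cls):
        # kept as a local import in this sketch so bottleneck stays optional
        import bottleneck as bn
        for method_name in ('sum', 'mean', 'std', 'min', 'max'):
            f = getattr(bn, 'move_' + method_name)
            func = cls._bottleneck_reduce(f)
            func.__name__ = method_name
            func.__doc__ = 'todo'
            setattr(cls, method_name, func)  # bind the wrapper, not the raw f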

@shoyer (Member) commented Dec 28, 2015

@jhamman how are we doing here? Are you waiting on a review from me?

# pandas does some fancy stuff in the last position
np.testing.assert_allclose(s_rolling.values[:-1],
                           da_rolling.values[:-1])
np.testing.assert_allclose(s_rolling.index,
Member Author:

@shoyer - The only inconsistency I can find between my approach here and how pandas does this is when center == True and window >= 3. The difference only occurs in the last position. We are using the shift method to adjust for center, but it seems pandas treats the last position differently in this case. I'm not quite sure how to proceed here.

This is still the only failing test in this PR. I'd like to consider accepting the work done here with the caveat that xarray rolling differs from pandas in this one way. A possible followup PR may be able to reconcile this difference.

Member:

Can you give a specific example for which there is a discrepancy?

I'm getting a different error when I run your tests:


self = <xarray.test.test_dataarray.TestDataArray testMethod=test_rolling_pandas_compat>

    def test_rolling_pandas_compat(self):
        s = pd.Series(range(10))
        da = DataArray.from_series(s)

        for center in (False, True):
            for window in [1, 2, 3, 4]:
                for min_periods in [None, 0, 1, 2]:
                    if min_periods is not None and window < min_periods:
                        min_periods = window
                    s_rolling = pd.rolling_mean(s, window, center=center,
                                                min_periods=min_periods)
                    da_rolling = da.rolling(index=window, center=center,
                                            min_periods=min_periods).mean()
                    # pandas does some fancy stuff in the last position
                    np.testing.assert_allclose(s_rolling.values[:-1],
>                                              da_rolling.values[:-1])

xarray/test/test_dataarray.py:1301:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x_id = array([ True, False, False, False, False, False, False, False, False], dtype=bool)
y_id = array([False, False, False, False, False, False, False, False, False], dtype=bool), hasval = 'nan'

    def chk_same_position(x_id, y_id, hasval='nan'):
        """Handling nan/inf: check that x and y have the nan/inf at the same
            locations."""
        try:
            assert_array_equal(x_id, y_id)
        except AssertionError:
            msg = build_err_msg([x, y],
                                err_msg + '\nx and y %s location mismatch:'
                                % (hasval), verbose=verbose, header=header,
                                names=('x', 'y'), precision=precision)
>           raise AssertionError(msg)
E           AssertionError:
E           Not equal to tolerance rtol=1e-07, atol=0
E
E           x and y nan location mismatch:
E            x: array([ nan,  0.5,  1.5,  2.5,  3.5,  4.5,  5.5,  6.5,  7.5])
E            y: array([ -9.223372e+18,   5.000000e-01,   1.500000e+00,   2.500000e+00,
E                    3.500000e+00,   4.500000e+00,   5.500000e+00,   6.500000e+00,
E                    7.500000e+00])

/usr/local/anaconda/envs/xarray-dev/lib/python2.7/site-packages/numpy/testing/utils.py:656: AssertionError

It looks like there might be a casting issue where int is not getting converted to float properly.
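
That diagnosis fits the values in the traceback: casting NaN to int64 produces the large negative sentinel shown. A small illustration (not from the PR):

    import numpy as np

    np.array(np.nan).astype(np.int64)  # -> -9223372036854775808 (~ -9.2e+18) on most platforms

    # promoting to float first keeps the NaN intact
    np.arange(9).astype(float)         # safe to fill with np.nan afterwards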

@jhamman (Member Author) commented Feb 17, 2016

@shoyer -

This could use another review from you. Failing tests have been fixed. There's a bunch more functionality that can be built out in future pull requests. This provides the basic rolling functionality we were going for.

One thing to note, with center=True, we differ from pandas in the last position in some instances. I made a note in my unit test that we're okay with that for now.

Lastly, this will need a squash and I'll do that once we're settled on the features.

@@ -16,9 +16,42 @@ What's New
v0.7.2 (unreleased)
-------------------

This is a bug fix release that includes two small, backwards compatible enhancements.
Member:

probably better to leave this out for now :)

@shoyer (Member) commented Feb 18, 2016

What is the full set of functions like empty_like that we would want for the public API? zeros_like, ones_like, full_like, maybe missing_like?

empty_like (without a fill value) doesn't make a lot of sense for dask.array, but all the rest of these do, and it would be nice to have the xarray functions work on dask arrays by making new dask arrays.

One possibility, instead of making a separate missing_like, is to make xr.empty_like always fill with NaN. Users can always drop into numpy directly if they really want to make an array and not set the values -- this very rarely makes an actual difference for performance since memory allocation is so slow.
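
A sketch of the sort of helper being discussed, with the always-fill-with-NaN behavior (hypothetical signature, not the API that eventually landed in xarray):

    import numpy as np
    import xarray as xr

    def full_like(other, fill_value=np.nan):
        """Return a new DataArray with the same shape, dims and coords as
        `other`, filled with `fill_value` (NaN by default, which promotes
        integer inputs to float)."""
        data = np.full(other.shape, fill_value)
        return xr.DataArray(data, coords=other.coords, dims=other.dims,
                            name=other.name, attrs=other.attrs)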

@shoyer (Member) commented Feb 18, 2016

I'd also still love to see an explicit example where our behavior differs from pandas (in the last position if center=True) so we can try to figure out what's going on. This might actually be a bug on the pandas side :).

Generally this PR is looking very close. We could defer some of the API design work by keeping empty_like private for now, but I'm also happy to hash this out here.

@jhamman (Member Author) commented Feb 19, 2016

Okay, _full_like (formerly empty_like) has been made private. I think it will be a useful feature for a bunch of applications, but it's not the point here. I'd like to build out functionality akin to what @shoyer indicated (zeros_like, ones_like, and missing_like). Actually, that can all be done by _full_like as it stands, but it sounds like we need to put some more thought into the API before making it public.

Here is an example of how we differ from pandas in the last position with center=True. This stems from how we "center" the data using shift (a short reproduction of the shift behavior follows the example below).

In [7]: arr
Out[7]: 
<xarray.DataArray (y: 5)>
array([ 2.5,  3. ,  3.5,  4. ,  4.5])
Coordinates:
    x        int64 1
  * y        (y) int64 0 1 2 3 4

In [8]: arr.rolling(y=3, center=True, min_periods=1).mean()
Out[8]: 
<xarray.DataArray (y: 5)>
array([ 2.75,  3.  ,  3.5 ,  4.  ,   nan])
Coordinates:
    x        int64 1
  * y        (y) int64 0 1 2 3 4

In [9]: pd.rolling_mean(arr.to_series(), 3, center=True, min_periods=1)
Out[9]: 
y
0    2.75
1    3.00
2    3.50
3    4.00
4    4.25
dtype: float64
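
A short reproduction of the shift-based centering mentioned above (written with the modern pandas .rolling accessor rather than pd.rolling_mean):

    import pandas as pd

    s = pd.Series([2.5, 3.0, 3.5, 4.0, 4.5])

    # right-aligned rolling mean, as bottleneck computes it
    trailing = s.rolling(3, min_periods=1).mean()

    # "centering" by shifting the result back window // 2 positions pushes
    # NaN into the last slot, matching Out[8] above
    trailing.shift(-(3 // 2))
    # 0    2.75
    # 1    3.00
    # 2    3.50
    # 3    4.00
    # 4     NaN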

min_periods=min_periods)

actual = rolling_obj.reduce(np.mean)
expected = rolling_obj.mean()
Member:

is it worth adding the other aggregation methods to the loop? e.g., for method in ['mean', 'sum', 'min', 'max', ....]
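
A sketch of what that expanded loop could look like (the method list and min_periods choice are illustrative; written with the modern pandas .rolling accessor):

    import numpy as np
    import pandas as pd
    import xarray as xr

    s = pd.Series(np.arange(10, dtype=float))
    da = xr.DataArray.from_series(s)

    for method in ('mean', 'sum', 'min', 'max'):
        s_rolling = getattr(s.rolling(3, min_periods=1), method)()
        da_rolling = getattr(da.rolling(index=3, min_periods=1), method)()
        np.testing.assert_allclose(s_rolling.values, da_rolling.values)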

The rolling method performs rolling aggregation and summaries along a single named dimension. Example usage:

    In [1]: import xarray as xr; import numpy as np

    In [2]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                               dims=('x', 'y'))

    In [3]: arr
    Out[3]:
    <xarray.DataArray (x: 3, y: 5)>
    array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
           [ 2.5,  3. ,  3.5,  4. ,  4.5],
           [ 5. ,  5.5,  6. ,  6.5,  7. ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4

    In [4]: arr.rolling(y=3, min_periods=2).mean()
    Out[4]:
    <xarray.DataArray (x: 3, y: 5)>
    array([[  nan,  0.25,  0.5 ,  1.  ,  1.5 ],
           [  nan,  2.75,  3.  ,  3.5 ,  4.  ],
           [  nan,  5.25,  5.5 ,  6.  ,  6.5 ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4
@shoyer (Member) commented Feb 20, 2016

OK, this looks good to me. Merge when you're ready!

jhamman pushed a commit that referenced this pull request Feb 20, 2016
@jhamman jhamman merged commit a59e6dd into pydata:master Feb 20, 2016
@jhamman jhamman deleted the feature/Rolling branch February 20, 2016 02:32
@shoyer (Member) commented Feb 20, 2016

Woot!

Successfully merging this pull request may close these issues.

add rolling_apply method or function