
Feature/rolling #668


Merged: 1 commit merged into pydata:master on Feb 20, 2016

Conversation

@jhamman (Member) commented Dec 2, 2015

This is an initial take on the rolling aggregation object and methods in xray. This PR implements the following (a combined sketch follows the list below):

  • A new Rolling class available in DataArray objects:

    rolling_obj = da.rolling(time=7)
  • bottleneck.move_* functions are wrapped and are available in the following manner:

    rolling_obj.mean()
  • generic reduce method

    from numpy import nanpercentile
    rolling_obj.reduce(nanpercentile, q=5)
  • iterating through the rolling object:

     for label, da_window in rolling_obj:
         # da_window is a view of da
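
Putting the pieces above together, a minimal self-contained sketch (variable names are illustrative; written with the current xarray import name, though the package was still called xray at the time):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.random.rand(30), dims='time')

    # build the rolling object over a 7-element window along 'time'
    rolling_obj = da.rolling(time=7)

    # wrapped bottleneck aggregation
    rolling_obj.mean()

    # generic reduce with any function that accepts an `axis` keyword
    rolling_obj.reduce(np.nanpercentile, q=5)

    # iterate over the windows; each da_window is a view of da
    for label, da_window in rolling_obj:
        pass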

TODO:

  • Documentation
  • Cleanup _setup_windows
  • Inject bottleneck methods in some generic way
  • Possibly create a Window object to hold windowed DataArrays

closes #641 #130
xref pandas-dev/pandas#11603, pandas-dev/pandas#11704
cc @shoyer @bartnijssen

center : boolean, default False
    Set the labels at the center of the window.
kwarg : dim=window
dim : str
Member:

I like this docstring format -- does it render in a reasonable way in the API docs?

Member Author:

just checked, not perfect.

[attached screenshot of the rendered API docs]

Member:

looks pretty decent to me! I'll take it :)

@jhamman (Member Author) commented Dec 2, 2015

@shoyer - thanks for the first look. I'll give it another hack.

    yield (label, self.obj.isel(**{self.dim: indices}))
else:
    # return nans for this window
    temp = self.obj.isel(**{self.dim: indices}).copy()

@jhamman (Member Author) commented Dec 3, 2015

@shoyer - I made some more progress here tonight. How do you suggest we handle the bottleneck dependency? That is the reason for the failing tests at the moment.

values = bn.move_sum(self.obj.values, window=self.window,
                     min_count=self.min_periods, axis=self._axis_num)
coords = self.obj.coords
coords[self.dim] = self.window_labels
Member:

you don't need this now because window_labels is the same as the original coordinate

Member Author:

right - nice catch.

@shoyer (Member) commented Dec 3, 2015

> How do you suggest we handle the bottleneck dependency? That is the reason for the failing tests at the moment.

You can either add a try/except around a top-level import of bottleneck, or only import bottleneck locally inside the functions that need it. I think I would prefer the latter approach because it results in more intelligible error messages (ImportError: No module named bottleneck rather than NameError: name 'bottleneck' is not defined).
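
A minimal sketch of the locally scoped import (the function name and surrounding structure are illustrative, not the PR's actual code):

    def _rolling_mean(values, window, min_count=None, axis=-1):
        # importing inside the function keeps bottleneck an optional
        # dependency and surfaces a clear ImportError only when a rolling
        # aggregation is actually requested
        import bottleneck as bn
        return bn.move_mean(values, window=window, min_count=min_count,
                            axis=axis)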

@jhamman (Member Author) commented Dec 3, 2015

@shoyer - would you mind taking a look at what I've just tried (and failed) in ops.py and common.py? I think I'm missing a big piece of the injection puzzle.

@shoyer (Member) commented Dec 3, 2015

Internet at work today is only working 20% of the time. I'm happy to take a look once things get back online :).


@jhamman (Member Author) commented Dec 3, 2015

sounds good. Thanks. That's got to slow down work at a tech company 💤

@shoyer (Member) commented Dec 3, 2015

For iteration, what about only iterating over full windows? Thinking about how I might use iteration, I think this might be more useful than returning some shrunk windows.

Concretely, this means that if you iterate over xray.DataArray(range(6), dims='x').rolling(x=3), results would have labels from 1 through 5.

I think you've done a pretty reasonable job of interpreting min_periods for iteration, but I would still vote for defining it only as an argument to the aggregation methods and not worrying about it for iteration. It keeps things a bit simpler and easier to understand. OTOH, if you can think of use cases for min_periods with iteration, I could be convinced :).

@shoyer (Member) commented Dec 3, 2015

> @shoyer - would you mind taking a look at what I've just tried (and failed) in ops.py and common.py? I think I'm missing a big piece of the injection puzzle.

I'll give this a test, but it looks like you have all the pieces to me....

@jhamman (Member Author) commented Dec 4, 2015

> For iteration, what about only iterating over full windows? Thinking about how I might use iteration, I think this might be more useful than returning some shrunk windows.

I did consider this at first and it wouldn't be all that hard to implement but I chose not to go this route because I wanted consistency between reduce, _bottleneck_reduce and __iter__. In theory, all three of these should provide the same answer:

rolling_obj = da.rolling(time=4)

rolling_obj.mean()  # bottleneck move_mean
rolling_obj.reduce(np.nanmean)  # numpy nanmean over each window
concat([da.mean(dim='time') for _, da in rolling_obj], dim=rolling_obj.window_labels)  # manual mean via iterable - same as reduce 

> I think you've done a pretty reasonable job of interpreting min_periods for iteration, but I would still vote for defining it only as an argument to the aggregation methods and not worrying about it for iteration.

How did pandas land on this? To me it makes more sense as an argument to __init__ but I'll go with whatever pandas decided for consistency.

@shoyer (Member) commented Dec 4, 2015

> I wanted consistency between reduce, _bottleneck_reduce and __iter__.

Agreed, this would be nice. But if min_count=0, this won't be the case, because you will average over partial windows at the start of the rolling iteration. For example, with the window=3 example above, you apply the aggregation function to windows of size [1, 2, 3, 3, 3, 3]. And the labels are also not consistent.

@shoyer (Member) commented Dec 4, 2015

> How did pandas land on this? To me it makes more sense as an argument to __init__ but I'll go with whatever pandas decided for consistency.

Still unresolved, though Jeff Reback agrees with you. It's being discussed in the rolling PR currently.

Also: what about changing the default min_count to 0? I think that would be more consistent with pandas, which skips over missing values by default.

@jhamman (Member Author) commented Dec 7, 2015

> I'll give this a test, but it looks like you have all the pieces to me....

I'm getting this TypeError:

E   TypeError: move_sum() takes at least 2 positional arguments (0 given)

which makes me think the injection is failing to pass the arguments in.

func = cls._bottleneck_reduce(f)
func.__name__ = method_name
func.__doc__ = 'todo'
setattr(cls, method_name, f)
Member:

This should be func not f -- this should fix your failing tests
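
For reference, a sketch of the injection loop with that fix applied (the method list and loop structure are illustrative; _bottleneck_reduce is the wrapper from the snippet above):

    def _inject_bottleneck_rolling_methods(cls):
        # kept as a local import in this sketch so bottleneck stays optional
        import bottleneck as bn
        for method_name in ('sum', 'mean', 'std', 'min', 'max'):
            f = getattr(bn, 'move_' + method_name)
            func = cls._bottleneck_reduce(f)
            func.__name__ = method_name
            func.__doc__ = 'todo'
            setattr(cls, method_name, func)  # bind the wrapper, not the raw f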

@shoyer (Member) commented Dec 28, 2015

@jhamman how are we doing here? Are you waiting on a review from me?

# pandas does some fancy stuff in the last position
np.testing.assert_allclose(s_rolling.values[:-1],
                           da_rolling.values[:-1])
np.testing.assert_allclose(s_rolling.index,
Member Author:

@shoyer - The only inconsistency I can find between my approach here and how pandas does this is when center == True and window >= 3. The difference only occurs in the last position. We are using the shift method to adjust for center, but it seems pandas treats the last position differently in this case. I'm not quite sure how to proceed here.

This is still the only failing test in this PR. I'd like to consider accepting the work done here with the caveat that xarray rolling differs from pandas in this one way. A possible followup PR may be able to reconcile this difference.

Member:

Can you give a specific example for which there is a discrepancy?

I'm getting a different error when I run your tests:


self = <xarray.test.test_dataarray.TestDataArray testMethod=test_rolling_pandas_compat>

    def test_rolling_pandas_compat(self):
        s = pd.Series(range(10))
        da = DataArray.from_series(s)

        for center in (False, True):
            for window in [1, 2, 3, 4]:
                for min_periods in [None, 0, 1, 2]:
                    if min_periods is not None and window < min_periods:
                        min_periods = window
                    s_rolling = pd.rolling_mean(s, window, center=center,
                                                min_periods=min_periods)
                    da_rolling = da.rolling(index=window, center=center,
                                            min_periods=min_periods).mean()
                    # pandas does some fancy stuff in the last position
                    np.testing.assert_allclose(s_rolling.values[:-1],
>                                              da_rolling.values[:-1])

xarray/test/test_dataarray.py:1301:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x_id = array([ True, False, False, False, False, False, False, False, False], dtype=bool)
y_id = array([False, False, False, False, False, False, False, False, False], dtype=bool), hasval = 'nan'

    def chk_same_position(x_id, y_id, hasval='nan'):
        """Handling nan/inf: check that x and y have the nan/inf at the same
            locations."""
        try:
            assert_array_equal(x_id, y_id)
        except AssertionError:
            msg = build_err_msg([x, y],
                                err_msg + '\nx and y %s location mismatch:'
                                % (hasval), verbose=verbose, header=header,
                                names=('x', 'y'), precision=precision)
>           raise AssertionError(msg)
E           AssertionError:
E           Not equal to tolerance rtol=1e-07, atol=0
E
E           x and y nan location mismatch:
E            x: array([ nan,  0.5,  1.5,  2.5,  3.5,  4.5,  5.5,  6.5,  7.5])
E            y: array([ -9.223372e+18,   5.000000e-01,   1.500000e+00,   2.500000e+00,
E                    3.500000e+00,   4.500000e+00,   5.500000e+00,   6.500000e+00,
E                    7.500000e+00])

/usr/local/anaconda/envs/xarray-dev/lib/python2.7/site-packages/numpy/testing/utils.py:656: AssertionError

It looks like there might be a casting issue where int is not getting converted to float properly.
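
That diagnosis fits the values in the traceback: casting NaN to int64 produces the large negative sentinel shown. A small illustration (not from the PR):

    import numpy as np

    np.array(np.nan).astype(np.int64)  # -> -9223372036854775808 (~ -9.2e+18) on most platforms

    # promoting to float first keeps the NaN intact
    np.arange(9).astype(float)         # safe to fill with np.nan afterwards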

@jhamman (Member Author) commented Feb 17, 2016

@shoyer -

This could use another review from you. Failing tests have been fixed. There's a bunch more functionality that can be built out in future pull requests. This provides the basic rolling functionality we were going for.

One thing to note, with center=True, we differ from pandas in the last position in some instances. I made a note in my unit test that we're okay with that for now.

Lastly, this will need a squash and I'll do that once we're settled on the features.

@@ -16,9 +16,42 @@ What's New
v0.7.2 (unreleased)
-------------------

This is a bug fix release that includes two small, backwards compatible enhancements.
Member:

probably better to leave this out for now :)

@shoyer (Member) commented Feb 18, 2016

What is the full set of functions like empty_like that we would want for the public API? zeros_like, ones_like, full_like, maybe missing_like?

empty_like (without a fill value) doesn't make a lot of sense for dask.array, but all the rest of these do, and it would be nice to have the xarray functions work on dask arrays by making new dask arrays.

One possibility, instead of making a separate missing_like, is to make xr.empty_like always fill with NaN. Users can always drop into numpy directly if they really want to make an array and not set the values -- this very rarely makes an actual difference for performance since memory allocation is so slow.
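
A sketch of the sort of helper being discussed, with the always-fill-with-NaN behavior (hypothetical signature, not the API that eventually landed in xarray):

    import numpy as np
    import xarray as xr

    def full_like(other, fill_value=np.nan):
        """Return a new DataArray with the same shape, dims and coords as
        `other`, filled with `fill_value` (NaN by default, which promotes
        integer inputs to float)."""
        data = np.full(other.shape, fill_value)
        return xr.DataArray(data, coords=other.coords, dims=other.dims,
                            name=other.name, attrs=other.attrs)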

@shoyer (Member) commented Feb 18, 2016

I'd also still love to see an explicit example where our behavior differs from pandas (in the last position if center=True) so we can try to figure out what's going on. This might actually be a bug on the pandas side :).

Generally this PR is looking very close. We could defer some of the API design work by keeping empty_like private for now, but I'm also happy to hash this out here.

@jhamman (Member Author) commented Feb 19, 2016

Okay, _full_like (formerly empty_like) has been made private. I think it will be a useful feature for a bunch of applications, but it's not the point here. I'd like to build out functionality akin to what @shoyer indicated (zeros_like, ones_like, and missing_like). Actually, that can all be done by _full_like as it stands, but it sounds like we need to put some more thought into the API before making it public.

Here is an example of how we differ from pandas in the last position with center=True. This stems from how we "center" the data using shift (a short reproduction of the shift behavior follows the example below).

In [7]: arr
Out[7]: 
<xarray.DataArray (y: 5)>
array([ 2.5,  3. ,  3.5,  4. ,  4.5])
Coordinates:
    x        int64 1
  * y        (y) int64 0 1 2 3 4

In [8]: arr.rolling(y=3, center=True, min_periods=1).mean()
Out[8]: 
<xarray.DataArray (y: 5)>
array([ 2.75,  3.  ,  3.5 ,  4.  ,   nan])
Coordinates:
    x        int64 1
  * y        (y) int64 0 1 2 3 4

In [9]: pd.rolling_mean(arr.to_series(), 3, center=True, min_periods=1)
Out[9]: 
y
0    2.75
1    3.00
2    3.50
3    4.00
4    4.25
dtype: float64
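
A short reproduction of the shift-based centering mentioned above (written with the modern pandas .rolling accessor rather than pd.rolling_mean):

    import pandas as pd

    s = pd.Series([2.5, 3.0, 3.5, 4.0, 4.5])

    # right-aligned rolling mean, as bottleneck computes it
    trailing = s.rolling(3, min_periods=1).mean()

    # "centering" by shifting the result back window // 2 positions pushes
    # NaN into the last slot, matching Out[8] above
    trailing.shift(-(3 // 2))
    # 0    2.75
    # 1    3.00
    # 2    3.50
    # 3    4.00
    # 4     NaN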

min_periods=min_periods)

actual = rolling_obj.reduce(np.mean)
expected = rolling_obj.mean()
Member:

is it worth adding the other aggregation methods to the loop? e.g., for method in ['mean', 'sum', 'min', 'max', ....]
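
A sketch of what that expanded loop could look like (the method list and min_periods choice are illustrative; written with the modern pandas .rolling accessor):

    import numpy as np
    import pandas as pd
    import xarray as xr

    s = pd.Series(np.arange(10, dtype=float))
    da = xr.DataArray.from_series(s)

    for method in ('mean', 'sum', 'min', 'max'):
        s_rolling = getattr(s.rolling(3, min_periods=1), method)()
        da_rolling = getattr(da.rolling(index=3, min_periods=1), method)()
        np.testing.assert_allclose(s_rolling.values, da_rolling.values)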

The rolling method performs rolling aggregation and summaries along a single named dimension. Example usage:

    In [1]: import xarray as xr; import numpy as np

    In [2]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                               dims=('x', 'y'))

    In [3]: arr
    Out[3]:
    <xarray.DataArray (x: 3, y: 5)>
    array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
           [ 2.5,  3. ,  3.5,  4. ,  4.5],
           [ 5. ,  5.5,  6. ,  6.5,  7. ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4

    In [4]: arr.rolling(y=3, min_periods=2).mean()
    Out[4]:
    <xarray.DataArray (x: 3, y: 5)>
    array([[  nan,  0.25,  0.5 ,  1.  ,  1.5 ],
           [  nan,  2.75,  3.  ,  3.5 ,  4.  ],
           [  nan,  5.25,  5.5 ,  6.  ,  6.5 ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4
@shoyer (Member) commented Feb 20, 2016

OK, this looks good to me. Merge when you're ready!

jhamman pushed a commit that referenced this pull request Feb 20, 2016
@jhamman jhamman merged commit a59e6dd into pydata:master Feb 20, 2016
@jhamman jhamman deleted the feature/Rolling branch February 20, 2016 02:32
@shoyer (Member) commented Feb 20, 2016

Woot!

Successfully merging this pull request may close these issues.

add rolling_apply method or function