Feature/rolling #668
center : boolean, default False | ||
Set the labels at the center of the window. | ||
kwarg : dim=window | ||
dim : str |
I like this docstring format -- does it render in a reasonable way in the API docs?
looks pretty decent to me! I'll take it :)
@shoyer - thanks for the first look. I'll give it another hack. |
yield (label, self.obj.isel(**{self.dim: indices})) | ||
else: | ||
# return nans for this window | ||
temp = self.obj.isel(**{self.dim: indices}).copy() |
Look at _dummy_copy from core.groupby: https://github.com/xray/xray/blob/78df2ed528b2247313b28ae5b61faa801d426a50/xray/core/groupby.py#L48
@shoyer - I made some more progress here tonight. How do you suggest we handle the bottleneck dependency? That is the reason for the failing tests at the moment. |
values = bn.move_sum(self.obj.values, window=self.window, | ||
min_count=self.min_periods, axis=self._axis_num) | ||
coords = self.obj.coords | ||
coords[self.dim] = self.window_labels |
you don't need this now because window_labels is the same as the original coordinate
right - nice catch.
You can either add a try/except around a top-level import of bottleneck, or only import bottleneck locally inside the functions which need it. I think I would prefer the latter approach because it results in more intelligible error messages ( |
@shoyer - would you mind taking a look at what I've just tried (and failed) in |
Internet at work today is only working 20% of the time. I'm happy to take a look once things get back online :). On Thu, Dec 3, 2015 at 1:05 PM, Joe Hamman notifications@github.com
|
sounds good. Thanks. That's got to slow down work at a tech company 💤 |
For iteration, what about only iterating over full windows? Thinking about how I might use iteration, I think this might be more useful than returning some shrunk windows. Concretely, this means that if you iterate over I think you've done a pretty reasonable job of interpreting |
I'll give this a test, but it looks like you have all the pieces to me.... |
I did consider this at first, and it wouldn't be all that hard to implement, but I chose not to go this route because I wanted consistency between:
rolling_obj = da.rolling(time=4)
rolling_obj.mean() # bottleneck move_mean
rolling_obj.reduce(np.nanmean) # numpy nanmean over each window
concat([da.mean(dim='time') for _, da in rolling_obj], dim=rolling_obj.window_labels) # manual mean via iterable - same as reduce
How did pandas land on this? To me it makes more sense as an argument to |
Agreed, this would be nice. But if min_count=0, this won't be the case, because you will average over partial windows at the start of the rolling iteration. For example, you apply the aggregation function to windows of size [1, 2, 3, 3, 3, 3]. And the labels are also not consistent. |
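To make the window sizes in this comment concrete, here is a small pure-Python sketch of iterating over trailing windows, with and without the shrunken windows at the start. The function `iter_windows` and its `full_only` flag are hypothetical names for illustration. It works on plain lists and is not the PR's implementation.

```python
def iter_windows(values, window, full_only=False):
    """Yield (label, window_values) for trailing windows over a list.

    The label is the position of the last element in each window.
    With full_only=True, the shrunken windows at the start are skipped.
    """
    for stop in range(1, len(values) + 1):
        start = max(0, stop - window)
        chunk = values[start:stop]
        if full_only and len(chunk) < window:
            continue
        yield stop - 1, chunk


data = [10, 20, 30, 40, 50, 60]
partial_sizes = [len(w) for _, w in iter_windows(data, 3)]
full_labels = [label for label, _ in iter_windows(data, 3, full_only=True)]
print(partial_sizes)  # [1, 2, 3, 3, 3, 3]
print(full_labels)    # [2, 3, 4, 5]
```

Iterating only over full windows yields fewer items than there are labels, which is the consistency trade-off discussed in this thread.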
Still unresolved, though Jeff Reback agrees with you. It's being discussed in the rolling PR currently. Also: what about changing the default min_count to 0? I think that would be more consistent with pandas, which skips over missing values by default. |
I'm getting this E TypeError: move_sum() takes at least 2 positional arguments (0 given) which makes me think the injection is failing to pass the arguments in |
func = cls._bottleneck_reduce(f) | ||
func.__name__ = method_name | ||
func.__doc__ = 'todo' | ||
setattr(cls, method_name, f) |
This should be func, not f -- this should fix your failing tests
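A toy version of the injection pattern shows why attaching `f` instead of `func` fails. The `Rolling` class and the helpers `_make_reducer` and `inject_methods` are simplified stand-ins written for this illustration, operating on plain lists. They are not the PR's code.

```python
def _make_reducer(aggregate):
    """Wrap a plain aggregation function as a method (hypothetical helper)."""
    def method(self):
        return self.reduce(aggregate)
    return method


class Rolling(object):
    """Toy stand-in for a rolling object, operating on a plain list."""

    def __init__(self, values, window):
        self.values = list(values)
        self.window = window

    def reduce(self, aggregate):
        # Apply `aggregate` to each trailing window (partial at the start).
        out = []
        for stop in range(1, len(self.values) + 1):
            start = max(0, stop - self.window)
            out.append(aggregate(self.values[start:stop]))
        return out


def inject_methods(cls, aggregations):
    for method_name, f in aggregations.items():
        func = _make_reducer(f)
        func.__name__ = method_name
        # Attach the wrapper `func`, not the raw function `f`. A raw builtin
        # such as `sum` is not bound to the instance, so calling it as a
        # method passes no arguments at all.
        setattr(cls, method_name, func)


inject_methods(Rolling, {"sum": sum, "max": max, "min": min})
print(Rolling([1, 2, 3, 4], 2).sum())  # [1, 3, 5, 7]
```

If the last line of `inject_methods` used `f` instead, `Rolling(...).sum()` would call the builtin with no arguments. That is likely the same kind of failure as the `move_sum()` TypeError reported earlier in the thread.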
@jhamman how are we doing here? Are you waiting on a review from me? |
# pandas does some fancy stuff in the last position | ||
np.testing.assert_allclose(s_rolling.values[:-1], | ||
da_rolling.values[:-1]) | ||
np.testing.assert_allclose(s_rolling.index, |
@shoyer - The only inconsistency I can find between my approach here and how pandas does this is when center == True and window >= 3. The difference only occurs in the last position. We are using the shift method to adjust for center, but it seems pandas treats the last position differently in this case. I'm not quite sure how to proceed here.
This is still the only failing test in this PR. I'd like to consider accepting the work done here with the caveat that xarray rolling differs from pandas in this one way. A possible followup PR may be able to reconcile this difference.
Can you give a specific example for which there is a discrepancy?
I'm getting a different error when I run your tests:
self = <xarray.test.test_dataarray.TestDataArray testMethod=test_rolling_pandas_compat>
    def test_rolling_pandas_compat(self):
        s = pd.Series(range(10))
        da = DataArray.from_series(s)
        for center in (False, True):
            for window in [1, 2, 3, 4]:
                for min_periods in [None, 0, 1, 2]:
                    if min_periods is not None and window < min_periods:
                        min_periods = window
                    s_rolling = pd.rolling_mean(s, window, center=center,
                                                min_periods=min_periods)
                    da_rolling = da.rolling(index=window, center=center,
                                            min_periods=min_periods).mean()
                    # pandas does some fancy stuff in the last position
                    np.testing.assert_allclose(s_rolling.values[:-1],
>                                              da_rolling.values[:-1])
xarray/test/test_dataarray.py:1301:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
x_id = array([ True, False, False, False, False, False, False, False, False], dtype=bool)
y_id = array([False, False, False, False, False, False, False, False, False], dtype=bool), hasval = 'nan'
    def chk_same_position(x_id, y_id, hasval='nan'):
        """Handling nan/inf: check that x and y have the nan/inf at the same
        locations."""
        try:
            assert_array_equal(x_id, y_id)
        except AssertionError:
            msg = build_err_msg([x, y],
                                err_msg + '\nx and y %s location mismatch:'
                                % (hasval), verbose=verbose, header=header,
                                names=('x', 'y'), precision=precision)
>           raise AssertionError(msg)
E AssertionError:
E Not equal to tolerance rtol=1e-07, atol=0
E
E x and y nan location mismatch:
E x: array([ nan, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
E y: array([ -9.223372e+18, 5.000000e-01, 1.500000e+00, 2.500000e+00,
E 3.500000e+00, 4.500000e+00, 5.500000e+00, 6.500000e+00,
E 7.500000e+00])
/usr/local/anaconda/envs/xarray-dev/lib/python2.7/site-packages/numpy/testing/utils.py:656: AssertionError
It looks like there might be a casting issue where int is not getting converted to float properly.
@shoyer - This could use another review from you. Failing tests have been fixed. There's a bunch more functionality that can be built out in future pull requests. This provides the basic rolling functionality we were going for. One thing to note, with
|
@@ -16,9 +16,42 @@ What's New | |||
v0.7.2 (unreleased) | |||
------------------- | |||
|
|||
This is a bug fix release that includes two small, backwards compatible enhancements. |
probably better to leave this out for now :)
What is the full set of functions like
One possibility, instead of making a separate |
I'd also still love to see an explicit example where our behavior differs from pandas (in the last position if Generally this PR is looking very close. We could defer some of the API design work by keeping |
Okay, an example of how we differ from pandas in the last position with
In [7]: arr
Out[7]:
<xarray.DataArray (y: 5)>
array([ 2.5, 3. , 3.5, 4. , 4.5])
Coordinates:
x int64 1
* y (y) int64 0 1 2 3 4
In [8]: arr.rolling(y=3, center=True, min_periods=1).mean()
Out[8]:
<xarray.DataArray (y: 5)>
array([ 2.75, 3. , 3.5 , 4. , nan])
Coordinates:
x int64 1
* y (y) int64 0 1 2 3 4
In [9]: pd.rolling_mean(arr.to_series(), 3, center=True, min_periods=1)
Out[9]:
y
0 2.75
1 3.00
2 3.50
3 4.00
4 4.25
dtype: float64 |
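Both outputs above can be reproduced with a short pure-Python sketch. It assumes, following the earlier comment about using the shift method, that the xarray result comes from a trailing window shifted left by `window // 2`. It further assumes that the pandas result matches windows centered on each label and clipped at the array edges. The second assumption only fits the numbers shown here and is not a statement about how pandas is implemented. Both function names are hypothetical, and the centered version assumes an odd window.

```python
import math


def _mean_or_nan(chunk, min_periods):
    # NaN when the window holds fewer than min_periods values.
    return sum(chunk) / len(chunk) if len(chunk) >= min_periods else float("nan")


def trailing_then_shift(values, window, min_periods):
    """Trailing rolling mean, then shift left by window // 2 to center."""
    trailing = [
        _mean_or_nan(values[max(0, i - window + 1):i + 1], min_periods)
        for i in range(len(values))
    ]
    offset = window // 2
    # Shifting drops the first `offset` results and pads the end with NaN.
    return trailing[offset:] + [float("nan")] * offset


def centered_windows(values, window, min_periods):
    """Rolling mean over windows centered on each label, clipped at the edges."""
    half = window // 2
    return [
        _mean_or_nan(values[max(0, i - half):i + half + 1], min_periods)
        for i in range(len(values))
    ]


row = [2.5, 3.0, 3.5, 4.0, 4.5]
shifted = trailing_then_shift(row, 3, 1)
centered = centered_windows(row, 3, 1)
print(shifted[:4], math.isnan(shifted[-1]))  # [2.75, 3.0, 3.5, 4.0] True
print(centered)                              # [2.75, 3.0, 3.5, 4.0, 4.25]
```

Under these assumptions, the two approaches agree everywhere except the trailing `window // 2` positions. There the shift has no data to move in, while the clipped window still holds `min_periods` values.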
min_periods=min_periods) | ||
|
||
actual = rolling_obj.reduce(np.mean) | ||
expected = rolling_obj.mean() |
is it worth adding the other aggregation methods to the loop? e.g., for method in ['mean', 'sum', 'min', 'max', ....]
The rolling method performs rolling aggregation and summaries along a single named dimension. Example usage:
In [1]: import xarray as xr; import numpy as np
In [2]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5), dims=('x', 'y'))
In [3]: arr
Out[3]:
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
       [ 2.5,  3. ,  3.5,  4. ,  4.5],
       [ 5. ,  5.5,  6. ,  6.5,  7. ]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
In [4]: arr.rolling(y=3, min_periods=2).mean()
Out[4]:
<xarray.DataArray (x: 3, y: 5)>
array([[  nan,  0.25,  0.5 ,  1.  ,  1.5 ],
       [  nan,  2.75,  3.  ,  3.5 ,  4.  ],
       [  nan,  5.25,  5.5 ,  6.  ,  6.5 ]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
OK, this looks good to me. Merge when you're ready! |
Woot! |
This is an initial take at the rolling aggregation object and methods in xray. This PR implements:
A new Rolling class available in DataArray objects:
bottleneck.move_* functions are wrapped and are available in the following manner:
generic reduce method
iterating through the rolling object:
TODO:
_setup_windows
Possibly create a Window object to hold windowed DataArrays
closes #641 #130
xref pandas-dev/pandas#11603, pandas-dev/pandas#11704
cc @shoyer @bartnijssen