Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow no padding for rolling windows #5603

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

kmsquire
Copy link
Contributor

@kmsquire kmsquire commented Jul 14, 2021

Related to #2007.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 14, 2021

Unit Test Results

         6 files           6 suites   1h 9m 48s ⏱️
17 081 tests 15 238 ✔️ 1 735 💤 108 ❌
95 682 runs  86 859 ✔️ 8 175 💤 648 ❌

For more details on these failures, see this check.

Results for commit 5a2eadc.

♻️ This comment has been updated with latest results.

@kmsquire
Copy link
Contributor Author

FWIW, the test failures are all the same error, which should be fixed, but should have nothing to do with this PR.

@dcherian dcherian requested a review from mathause July 16, 2021 16:58
@dcherian
Copy link
Contributor

Thanks @kmsquire I'll try and review this soon.

@pep8speaks
Copy link

pep8speaks commented Jul 16, 2021

Hello @kmsquire! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-11 16:38:29 UTC

@kmsquire kmsquire changed the title Allow no padding for rolling windows WIP: Allow no padding for rolling windows Jul 16, 2021
@kmsquire
Copy link
Contributor Author

@dcherian Thanks. I thought I was done, but I'm finding things that aren't working as expected. I'm working on adding more tests for expected behavior, so marking as WIP. Will ping back here when I'm finished (maybe today, maybe early next week).

* Adds a `pad: bool` parameter to DataArray.rolling / Dataset.rolling
Also changes the object returned during iteration, so that it is the same
(modulo transposition) as the view created during rolling_window.construct()
along the rolling axis.
@kmsquire kmsquire changed the title WIP: Allow no padding for rolling windows Allow no padding for rolling windows Jul 16, 2021
@kmsquire kmsquire force-pushed the feature/rolling-pad branch 2 times, most recently from 9862d69 to 7d24702 Compare July 16, 2021 23:28
@kmsquire
Copy link
Contributor Author

@dcherian Okay, I think this is in good shape. I added some more tests, and fixed a few more bugs. Most of the fixes have been squashed back down to the original commit.

I left the second commit separate for now because it's breaking. Previously, iterating over a rolling window returned only returned blocks present in the original array, and ignored the chunk of nans before the start or after the end of the array. This meant that the view of the DataArray or Dataset that was returned for each iteration was potentially a different size.

Here, instead, the iterator was changed so that each returned view matches the corresponding slice of the output of the construct() function. To me, this seems more intuitive, and makes it easier to develop algorithms without having to call construct.

Other than the fact that it's breaking, the main drawback (and difference with construct()) is that the windows themselves are labeled with coordinates along the rolling axis, except for any nan values before the start or after the end of the actual data, which do not have labels. For comparison, in construct, the window dimension does not have any coordinates associated with it by default.

After writing this, I'm wondering if it might be useful to simply drop the coordinates along the rolling axis, so that the behavior matches the behavior of construct() even more closely.

I'm open to thoughts/comments/criticisms/suggestions/questions/whatever.

@kmsquire
Copy link
Contributor Author

Also, FWIW, the test failure was caused by a problem in zarr/fsspec (fsspec/filesystem_spec#707), which is fixed in master on fsspec (fsspec/filesystem_spec#710).

So it should be fixed here whenever fsspec makes a release.

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did an initial review and have a few questions. I haven't reviewed __iter__ yet.

Thanks for doing this @kmsquire it turned out to be harder than expected!

xarray/core/rolling.py Outdated Show resolved Hide resolved
xarray/core/rolling.py Outdated Show resolved Hide resolved
xarray/core/rolling.py Outdated Show resolved Hide resolved
xarray/core/rolling.py Outdated Show resolved Hide resolved
xarray/core/rolling.py Outdated Show resolved Hide resolved
xarray/core/rolling.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
kmsquire and others added 2 commits July 19, 2021 13:49
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@kmsquire
Copy link
Contributor Author

@dcherian Thank you for reviewing. I've started working through your comments.

* Generalize type of dim (str -> Hashable)
* Rename to_array -> to_list
* Make error message clearer (only print unexpected inputs)
* Allow lists of length one as inputs
* Add tests
@kmsquire
Copy link
Contributor Author

The code also introduces a problem, the following no longer works:

xr.DataArray([1, 2, 3], coords=dict(x=[1, 1, 1])).rolling(x=3).mean()

thus you may have to use isel instead of sel if possible. Therefore I was also unable to test the following (from #2007 (comment)):

monthly.pad(month=n_months, mode="wrap").rolling(center=True, month=n_months, pad=False).mean(skipna=False)

Fixed and added a test for this. The example at the bottom now works.

@kmsquire
Copy link
Contributor Author

Mentioned in one of the comments above, but I think I've reached about the amount of time that I can spend on this right now. If there are other minor changes, please do let me know. I can also back out the breaking change if desired (although that will probably take some commit surgery).

@dcherian
Copy link
Contributor

Thanks @kmsquire we can take it from here.

There was some discussion about syntax at the dev meeting today. There were multiple votes in favour of allowing full control of padding in the rolling object itself. So we could have

  1. .rolling(time=5, x=3, pad={"x": {"mode": "wrap"}, "time": False}), OR
  2. .rolling(time=5, x=3).pad({"x": {"mode": "wrap"}, "time": False})

@pydata/xarray thoughts?

@kmsquire
Copy link
Contributor Author

There was some discussion about syntax at the dev meeting today. There were multiple votes in favour of allowing full control of padding in the rolling object itself. So we could have

  1. .rolling(time=5, x=3, pad={"x": {"mode": "wrap"}, "time": False}), OR
  2. .rolling(time=5, x=3).pad({"x": {"mode": "wrap"}, "time": False})

FWIW, I think 1 would be more efficient (but perhaps harder to implement). With this one, the DataArrayRolling object is only created once, with all of the required information.

If 2 were implemented, what should the return value of the .pad() call be? Would it call construct on the rolling object and return a DataArray? Or would it return an updated DataArrayRolling object? Or would it be a new type entirely (e.g., DataArrayPaddedRolling)?

@dcherian
Copy link
Contributor

dcherian commented Jul 22, 2021

Your points about (2) being hard to interpret are a good reason to go with (1)! I was thinking it would return a DatasetRolling object but that's not the only interpretation as you point out.

* upstream/main: (31 commits)
  Refactor index vs. coordinate variable(s) (pydata#5636)
  pre-commit: autoupdate hook versions (pydata#5685)
  Flexible Indexes: Avoid len(index) in map_blocks (pydata#5670)
  Speed up _mapping_repr (pydata#5661)
  update the link to `scipy`'s intersphinx file (pydata#5665)
  Bump styfle/cancel-workflow-action from 0.9.0 to 0.9.1 (pydata#5663)
  pre-commit: autoupdate hook versions (pydata#5660)
  fix the binder environment (pydata#5650)
  Update api.rst (pydata#5639)
  Kwargs to rasterio open (pydata#5609)
  Bump codecov/codecov-action from 1 to 2.0.2 (pydata#5633)
  new blank whats-new for v0.19.1
  v0.19.0 release notes (pydata#5632)
  remove deprecations scheduled for 0.19 (pydata#5630)
  Make typing-extensions optional (pydata#5624)
  Plots get labels from pint arrays (pydata#5561)
  Add to_numpy() and as_numpy() methods (pydata#5568)
  pin fsspec (pydata#5627)
  pre-commit: autoupdate hook versions (pydata#5617)
  Add dataarray scatter with 3d support (pydata#4909)
  ...
Failing for center=True+pad=True; and maybe all pad=False?
@pytest.mark.parametrize("min_periods", (1, None))
@pytest.mark.parametrize("backend", ["numpy"], indirect=True)
def test_rolling_wrapped_bottleneck(da, name, center, min_periods):
bn = pytest.importorskip("bottleneck", minversion="1.1")
@pytest.mark.parametrize("center", (True, False, None))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failing for center=True or pad=False (I think).

I think there's an issue with constructing expected: it doesn't seem like get_expected_rolling_indices is correct for all cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmdupuis3 I fixed the merge conflicts but there's still failing tests. We'd appreciate any help you can give here.

* upstream/main: (716 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
* upstream/main:
  Save groupby codes after factorizing, pass to flox (pydata#7206)
  [skip-ci] Add compute to groupby benchmarks (pydata#7690)
  Delete built-in cfgrib backend (pydata#7670)
  Added a pronunciation guide to the word Xarray in the README.MD fil… (pydata#7677)
  boundarynorm fix (pydata#7553)
  Fix lazy negative slice rewriting. (pydata#7586)
  [pre-commit.ci] pre-commit autoupdate (pydata#7687)
  Adjust sidebar font colors (pydata#7674)
  Bump pypa/gh-action-pypi-publish from 1.8.1 to 1.8.3 (pydata#7682)
  Raise PermissionError when insufficient permissions (pydata#7629)
@cmdupuis3
Copy link

Hey all, has there been any movement on this PR, or maybe elsewhere? It seems like a nice feature to me.

* upstream/main: (765 commits)
  increase typing annotations coverage in `xarray/core/indexing.py` (pydata#8857)
  pandas 3 MultiIndex fixes (pydata#8847)
  FIX: adapt handling of copy keyword argument in scipy backend for numpy >= 2.0dev (pydata#8851)
  FIX: do not cast _FillValue/missing_value in CFMaskCoder if _Unsigned is provided (pydata#8852)
  Implement setitem syntax for `.oindex` and `.vindex` properties (pydata#8845)
  Support pandas copy-on-write behaviour (pydata#8846)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strided rolling.construct results in unexpected behavior
6 participants