Allow indexing unindexed dimensions using dask arrays #5873
Conversation
This is a naive attempt to make `isel` work with Dask. Known limitation: it triggers the computation.
`xarray/tests/test_indexing.py` (outdated):

```python
idx = da.argmax("time")
actual = da.isel(time=idx)
assert np.all(actual == 42)
```
How about

```python
idx = da.argmax("time")
actual = da.isel(time=idx)
assert np.all(actual == 42)
with raise_if_dask_computes():
    actual = da.isel(time=dask.array.from_array([9], chunks=(1,)))
    expected = da.isel(time=[9])
    assert_identical(expected, actual)
```
with `from . import raise_if_dask_computes` at the top.
We could test `9`, `[9]`, and `[9, 9]`.
When the indexer is an independent dask array, the computation is triggered. Modifying the unit test as follows raises `RuntimeError: Too many computes. Total: 1 > max: 0`:
```python
def test_indexing_dask_array():
    da = DataArray(
        np.ones(10 * 3 * 3).reshape((10, 3, 3)),
        dims=("time", "x", "y"),
    ).chunk(dict(time=-1, x=1, y=1))
    with raise_if_dask_computes():
        actual = da.isel(time=dask.array.from_array([9], chunks=(1,)))
```
I thought it would only be triggered when the indexer is constructed from `da`. That's why the unit test was using `idx = da.argmax("time")`.
> I thought it would only be triggered when the indexer is constructed from `da`.

Yeah, that `raise_if_dask_computes` decorator is great for checking these ideas.
I've got it fixed but we should wait for @benbovy to finish with #5692 (though it doesn't look like that PR conflicts with these changes). It would be useful to add more test cases here.
Some other tests could be the ones in #4276, #4663, and this one from #2511:
```python
import dask.array as da
import numpy as np
import xarray as xr
from xarray.tests import raise_if_dask_computes

darr = xr.DataArray(data=[0.2, 0.4, 0.6], coords={"z": range(3)}, dims=("z",))
indexer = xr.DataArray(
    data=np.random.randint(0, 3, 8).reshape(4, 2).astype(int),
    coords={"y": range(4), "x": range(2)},
    dims=("y", "x"),
)

with raise_if_dask_computes():
    actual = darr[indexer.chunk({"y": 2})]
xr.testing.assert_identical(actual, darr[indexer])
```
This actually does work now, but it ends up computing the indexer.
TBH, I was not expecting these changes to be merged; they were only meant to support the discussion on the issue, but it's nice if they can be integrated.
I'll try to add the unit tests.
```diff
 if not np.issubdtype(k.dtype, np.integer):
     raise TypeError(
         f"invalid indexer array, does not have integer dtype: {k!r}"
     )
-if k.ndim != 1:
+if k.ndim > 1:
```
`k` might be a 0-d dask array (see below and #4276).
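To illustrate the point (a hedged standalone sketch, not from the PR): a 0-d dask array is a legitimate scalar indexer, and the stricter `k.ndim != 1` check would wrongly reject it.

```python
import numpy as np
import dask.array

# A 0-d (scalar) dask array, e.g. the result of a lazy argmax/argmin.
k = dask.array.from_array(np.array(2))
print(k.ndim)  # 0
# `k.ndim != 1` is True here and would reject this valid indexer;
# `k.ndim > 1` only rejects genuinely multidimensional indexers.
```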
```diff
@@ -109,7 +111,11 @@ def _advanced_indexer_subspaces(key):
     return (), ()

+non_slices = [k for k in key if not isinstance(k, slice)]
+ndim = len(np.broadcast(*non_slices).shape)
```
I think this is just an optimization. Happy to add it in a different PR.
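What those two lines compute, in isolation (a small hedged illustration with a hypothetical helper name, assuming pure-numpy indexers): the number of dimensions of the broadcast advanced-index subspace, skipping the basic `slice` entries.

```python
import numpy as np


def advanced_ndim(key):
    """Dimensions contributed by the advanced (non-slice) indexers after broadcasting."""
    non_slices = [k for k in key if not isinstance(k, slice)]
    if not non_slices:
        return 0
    # np.broadcast computes the broadcast shape without allocating a result.
    return len(np.broadcast(*non_slices).shape)


# One basic slice plus two 1-d integer indexers that broadcast to shape (3,):
print(advanced_ndim((slice(None), np.arange(3), np.arange(3))))  # 1
# A (2, 2) indexer broadcast against a (2,) indexer -> 2-d subspace:
print(advanced_ndim((np.arange(4).reshape(2, 2), np.arange(2))))  # 2
```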
```diff
 key = tuple(
-    k.data.item() if isinstance(k, Variable) and k.ndim == 0 else k for k in key
+    k.data if isinstance(k, Variable) and k.ndim == 0 else k for k in key
```
Necessary for #4276
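The reason this matters, sketched with plain dask (a hedged illustration, not the PR's code): extracting a Python scalar from a 0-d indexer forces evaluation, while passing the underlying array through keeps a dask-backed indexer lazy.

```python
import numpy as np
import dask.array

# A 0-d, lazy dask array, standing in for the .data of a 0-d Variable.
k = dask.array.from_array(np.array(3))
print(k.ndim)  # 0
# An .item()-style scalar extraction would have to materialize the value here;
# keeping the array itself defers evaluation until something computes it.
print(int(k.compute()))  # 3 -- explicit compute, for illustration only
```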
```diff
-(da.array([1, 99, 99, 99]), np.arange(4) > 0),
-(da.array([99, 99, 99, 99]), Variable(("x"), da.array([1, 2, 3, 4])) > 0),
+(da.array([1, 99, 99, 99]), np.array([False, True, True, True])),
+(da.array([99, 99, 99, 99]), Variable(("x"), np.array([True] * 4))),
```
The last test was silently computing earlier.
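A plausible reason why a dask-backed boolean indexer ends up computing (a hedged standalone sketch in plain dask, not taken from the PR): the result shape of a boolean mask is unknown until the mask is evaluated.

```python
import numpy as np
import dask.array

x = dask.array.from_array(np.array([99, 99, 99, 99]), chunks=2)
mask = dask.array.from_array(np.array([True, True, True, True]), chunks=2)

# Boolean indexing with a lazy mask: the output chunk sizes are unknown (NaN)
# until the mask is computed, which is why such indexers can silently
# trigger computation in code that needs concrete shapes.
y = x[mask]
print(y.chunks)
```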
Hey, thanks @dcherian for taking over and merging this PR!
- Passes `pre-commit run --all-files`
- Changes documented in `whats-new.rst`

This is a naive attempt to make `isel` work with Dask. Known limitation: it triggers the computation.

WIP: the code presented here is mainly to support the discussion on #2511. It has not been unit tested and should probably not be merged as is.
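The end result of this PR, assuming a dask-enabled xarray recent enough to include the merged change, looks roughly like this:

```python
import numpy as np
import dask.array
import xarray as xr

# A chunked DataArray indexed along an unindexed dimension
# with a lazy dask integer indexer.
da = xr.DataArray(np.arange(10), dims="time").chunk({"time": 5})
idx = dask.array.from_array([9], chunks=(1,))

out = da.isel(time=idx)  # no eager compute of the indexer needed
print(out.values)  # [9]
```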