decode_cf destroys chunks #1779

Closed
rabernat opened this issue Dec 14, 2017 · 2 comments

@rabernat
Contributor

Code Sample, a copy-pastable example if possible

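import numpy as np
import xarray as xr

# Build a dataset backed by a dask array with ten chunks of 100 elements
ds = xr.DataArray(np.random.rand(1000)).to_dataset(name='random').chunk(100)
ds_cf = xr.decode_cf(ds)
# Passes on the affected version: the decoded dataset reports no chunks
assert not ds_cf.chunks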

Problem description

Calling decode_cf causes variables whose data are dask arrays to be wrapped in two layers of abstraction: DaskIndexingAdapter and LazilyIndexedArray. In the example above:

>>> ds.random.variable._data
dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>
>>> ds_cf.random.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>), key=BasicIndexer((slice(None, None, None),))) 

At least part of the problem comes from this line:
https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L1045

This is especially problematic if we want to concatenate several such datasets together with dask. Chunking the decoded dataset creates a nested dask-within-dask array, which is sure to cause undesirable behavior down the line:

>>> dict(ds_cf.chunk().random.data.dask)
{('xarray-random-bf5298b8790e93c1564b5dca9e04399e',
  0): (<function dask.array.core.getter>, 'xarray-random-bf5298b8790e93c1564b5dca9e04399e', (slice(0, 1000, None),)),
 'xarray-random-bf5298b8790e93c1564b5dca9e04399e': ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>), key=BasicIndexer((slice(None, None, None),))))}
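
A lighter-weight way to spot the same collapse, without reading the raw graph, might be to compare block counts before and after re-chunking the decoded dataset. This is only a sketch: it assumes dask is installed, the variable names `original_blocks` and `rechunked_blocks` are illustrative, and the result depends on the xarray version in use.

```python
import numpy as np
import xarray as xr

# Same setup as the code sample: ten chunks of 100 elements each
ds = xr.DataArray(np.random.rand(1000)).to_dataset(name="random").chunk(100)
ds_cf = xr.decode_cf(ds)

# npartitions is the number of blocks in a dask array
original_blocks = ds["random"].data.npartitions
# Re-chunk the decoded dataset, as in the graph dump above, and count again
rechunked_blocks = ds_cf.chunk()["random"].data.npartitions

print("blocks before decode_cf:", original_blocks)
print("blocks after decode_cf + chunk():", rechunked_blocks)
```

The graph shown above contains a single task covering slice(0, 1000), so on the affected version the second count would be expected to drop to one block; matching counts suggest the chunks survived.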

Expected Output

If we call decode_cf on a dataset made of dask arrays, it should preserve the chunks of the original dask arrays. Hopefully this can be addressed by #1752.
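
The expected behavior can be phrased as a small check. This is a sketch rather than a test from the xarray suite: the helper name `chunks_before_and_after_decode` is hypothetical, it assumes dask is installed, and whether the final comparison holds depends on the xarray version.

```python
import numpy as np
import xarray as xr


def chunks_before_and_after_decode(ds):
    """Return the chunk mappings of `ds` and of `xr.decode_cf(ds)` as dicts."""
    decoded = xr.decode_cf(ds)
    return dict(ds.chunks), dict(decoded.chunks)


ds = xr.DataArray(np.random.rand(1000)).to_dataset(name="random").chunk(100)
before, after = chunks_before_and_after_decode(ds)

print("before:", before)
print("after: ", after)
# Expected behavior per this issue: the two mappings are equal
print("chunks preserved:", before == after)
```

On the version reported here, the decoded mapping would be empty, which is what the assertion in the code sample demonstrates.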

Output of xr.show_versions()

commit: 85174cd
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0-52-gd8842a6
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.2.9
h5netcdf: 0.4.1
Nio: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.5

@shoyer
Member

shoyer commented Dec 14, 2017

This is basically the same problem as #1372.

@rabernat
Contributor Author

Oops! Closing as duplicate.
