
[Bug]: cannot chunk a DataArray that originated as a coordinate #6204

Open · spencerkclark opened this issue Jan 28, 2022 · 2 comments

@spencerkclark (Member)

What happened?

If I construct the following DataArray, and try to chunk its "x" coordinate, I get back a NumPy-backed DataArray:

In [2]: a = xr.DataArray([1, 2, 3], dims=["x"], coords=[[4, 5, 6]])

In [3]: a.x.chunk()
Out[3]:
<xarray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 4 5 6

If I construct a copy of the "x" coordinate, things work as I would expect:

In [4]: x = xr.DataArray(a.x, dims=a.x.dims, coords=a.x.coords, name="x")

In [5]: x.chunk()
Out[5]:
<xarray.DataArray 'x' (x: 3)>
dask.array<xarray-<this-array>, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * x        (x) int64 4 5 6

What did you expect to happen?

I would expect the following to happen:

In [2]: a = xr.DataArray([1, 2, 3], dims=["x"], coords=[[4, 5, 6]])

In [3]: a.x.chunk()
Out[3]:
<xarray.DataArray 'x' (x: 3)>
dask.array<xarray-<this-array>, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * x        (x) int64 4 5 6
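
A quick way to verify this programmatically (a minimal check on the reported versions) is to inspect `.chunks`, which is `None` for a NumPy-backed DataArray and a tuple of chunk sizes for a dask-backed one:

import xarray as xr

a = xr.DataArray([1, 2, 3], dims=["x"], coords=[[4, 5, 6]])

# On affected versions this prints None, because chunk() silently returns a
# NumPy-backed copy; the expected, dask-backed result would give ((3,),).
print(a.x.chunk().chunks)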

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:59:12)
[Clang 11.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3

xarray: 0.20.1
pandas: 1.3.5
numpy: 1.19.4
scipy: 1.5.4
netCDF4: 1.5.5
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.7.0
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.22.0
distributed: None
matplotlib: 3.2.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: 2021.06.0
cupy: None
pint: 0.15
sparse: None
setuptools: 49.6.0.post20210108
pip: 20.2.4
conda: 4.10.1
pytest: 6.0.1
IPython: 7.27.0
sphinx: 3.2.1

@spencerkclark added the bug and needs triage labels on Jan 28, 2022
@dcherian removed the needs triage label on Mar 16, 2022
@dcherian (Contributor)

I've run into this before. The underlying variable object is an IndexVariable, which has a dummy chunk method:

xarray/xarray/core/variable.py, lines 2707-2709 (at 95bb9ae):

def chunk(self, chunks={}, name=None, lock=False):
    # Dummy - do not chunk. This method is invoked e.g. by Dataset.chunk()
    return self.copy(deep=False)
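
To illustrate the no-op (a minimal sketch, not part of the original comment): the coordinate's `.variable` is an IndexVariable, and one possible workaround, assuming `to_base_variable()` is available, is to rebuild the DataArray on a plain Variable before chunking:

import xarray as xr

a = xr.DataArray([1, 2, 3], dims=["x"], coords=[[4, 5, 6]])

# The "x" coordinate is backed by an IndexVariable, whose chunk() is the
# dummy method quoted above, so chunking it has no effect.
print(type(a.x.variable))  # xarray.core.variable.IndexVariable

# Sketch of a workaround: drop down to a plain Variable, then chunk.
x_plain = xr.DataArray(a.x.variable.to_base_variable(), coords=a.x.coords, name="x")
print(x_plain.chunk())  # dask-backed, matching the reporter's workaround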

aulemahal added a commit to Ouranosinc/xclim that referenced this issue Nov 30, 2023
…#1542)

### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #1536
- [x] Tests for the changes have been added (for bug fixes / features)
- [ ] (If applicable) Documentation has been added / updated (for bug
fixes / features)
- [x] CHANGES.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added

### What kind of change does this PR introduce?

* New function `xc.core.utils._chunk_like` to chunk a list of inputs
according to one chunk dictionary. It also circumvents
pydata/xarray#6204 by recreating DataArrays
that were obtained from dimension coordinates.
* Generalization of `uses_dask` so it can accept a list of objects.
* Usage of `_chunk_like` to ensure the inputs of
`cosine_of_solar_zenith_angle` are chunked when needed, in
`mean_radiant_temperature` and `potential_evapotranspiration`.

The effect of this is simply that `cosine_of_solar_zenith_angle`
will be computed on blocks of the same size as in the original data,
even though its inputs (the dimension coordinates) did not carry that
information. Before this PR, the calculation was done as a single block
of the same size as the full array.

### Does this PR introduce a breaking change?
No.

### Other information:
Dask might warn something like `PerformanceWarning: Increasing number of
chunks by factor of NN`, where NN should be the number of chunks along
the `lat` dimension, if present. That is exactly what we want, so it's OK.
@sanghyukmoon

I encountered a similar issue when I tried to create a new array from a set of index coordinates, e.g.,

# ds.z, ds.y, ds.x are fully loaded into memory, and so is ds.coords.r.
ds.coords['r'] = np.sqrt((ds.z - z0)**2 + (ds.y - y0)**2 + (ds.x - x0)**2)

I'm currently circumventing the problem by using _chunk_like introduced in Ouranosinc/xclim#1542.

x, y, z = _chunk_like(ds.x, ds.y, ds.z, chunks=ds.chunksizes)
ds.coords['r'] = np.sqrt((z - z0)**2 + (y - y0)**2 + (x - x0)**2)
# Now, ds.coords.r carries a dask array!

As @dcherian noted, the underlying cause of the issue seems to be that an IndexVariable always has to be fully loaded into memory.
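
For reference, a minimal sketch of a helper in the same spirit (hypothetical name `chunk_like`; the actual `_chunk_like` in xclim may differ) could look like this:

import xarray as xr

def chunk_like(*inputs, chunks=None):
    """Chunk every input with the same chunk mapping, rebuilding dimension
    coordinates first so that .chunk() actually takes effect on them."""
    chunks = chunks or {}
    out = []
    for da in inputs:
        if da.name in da.dims:
            # Dimension coordinate: backed by an IndexVariable whose chunk()
            # is a no-op, so wrap its values in a fresh DataArray first.
            da = xr.DataArray(da.values, dims=da.dims, coords=da.coords, name=da.name)
        out.append(da.chunk({d: c for d, c in chunks.items() if d in da.dims}))
    return tuple(out)

With such a helper, the snippet above would read x, y, z = chunk_like(ds.x, ds.y, ds.z, chunks=ds.chunksizes).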
