
Multidimensional dask coordinates unexpectedly computed #3068

Closed
djhoese opened this issue Jul 1, 2019 · 8 comments · Fixed by #3453
djhoese commented Jul 1, 2019

MCVE Code Sample

from dask.diagnostics import ProgressBar
import xarray as xr
import numpy as np
import dask.array as da

a = xr.DataArray(
    da.zeros((10, 10), chunks=2),
    dims=('y', 'x'),
    coords={'y': np.arange(10), 'x': np.arange(10),
            'lons': (('y', 'x'), da.zeros((10, 10), chunks=2))},
)
b = xr.DataArray(
    da.zeros((10, 10), chunks=2),
    dims=('y', 'x'),
    coords={'y': np.arange(10), 'x': np.arange(10),
            'lons': (('y', 'x'), da.zeros((10, 10), chunks=2))},
)

with ProgressBar():
    c = a + b

Output:

[########################################] | 100% Completed |  0.1s

Problem Description

Using arrays with 2D dask-array coordinates results in the coordinates being computed for any binary operation (anything combining two or more DataArrays). I use ProgressBar in the example above to show when coordinates are being computed.

In my own work, when I learned that 2D dask coordinates were possible, I started adding longitude and latitude coordinates. These are rather large and can take a while to load/compute, so I was surprised that simple operations (e.g. a.fillna(b)) were causing things to be computed and taking a long time.

Is this computation by design or a possible bug?
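Besides watching ProgressBar timing, a scheduler callback can make any unexpected computation visible by counting graph executions. A minimal sketch using dask's public Callback hooks (CountComputes is a hypothetical helper name, not a dask or xarray API):

```python
import dask.array as da
from dask.callbacks import Callback

class CountComputes(Callback):
    """Counts how many times a dask graph is actually executed."""
    def __init__(self):
        self.count = 0

    def _start(self, dsk):
        # Fires once per scheduler invocation, i.e. once per compute.
        self.count += 1

counter = CountComputes()
with counter:
    x = da.zeros((10, 10), chunks=2)
    y = x + 1                  # lazy: builds a graph, runs nothing
    total = y.sum().compute()  # explicitly triggers one execution

print(counter.count)  # 1 -- only the explicit .compute() ran
```

In the scenario above, wrapping c = a + b in such a counter would show a nonzero count even though no explicit .compute() was ever called.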

Expected Output

No output from the ProgressBar: no coordinates should be computed/loaded.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 02:16:08)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.14.3
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.0.0
distributed: 2.0.0
matplotlib: 3.1.0
cartopy: 0.17.1.dev147+HEAD.detached.at.5e624fe
seaborn: None
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: 4.6.3
IPython: 7.5.0
sphinx: 2.1.2

djhoese commented Jul 1, 2019

OK, I'm getting a little more of an understanding of this. The main issue is that the dask array is not considered the same object because I create it twice. If I instead create a single dask array and pass it to both:

lons = da.zeros((10, 10), chunks=2)
a = xr.DataArray(
    da.zeros((10, 10), chunks=2),
    dims=('y', 'x'),
    coords={'y': np.arange(10), 'x': np.arange(10),
            'lons': (('y', 'x'), lons)},
)
b = xr.DataArray(
    da.zeros((10, 10), chunks=2),
    dims=('y', 'x'),
    coords={'y': np.arange(10), 'x': np.arange(10),
            'lons': (('y', 'x'), lons)},
)

I still get the progress bar because xarray creates two new DataArray objects for this lons coordinate. So lons_data_arr.variable._data is not lons_data_arr2.variable._data, which causes xarray's equivalency check to fail.

If I make a single DataArray that becomes the coordinate variable then it seems to work:

lons2 = xr.DataArray(lons, dims=('y', 'x'))
a = xr.DataArray(
    da.zeros((10, 10), chunks=2),
    dims=('y', 'x'),
    coords={'y': np.arange(10), 'x': np.arange(10),
            'lons': (('y', 'x'), lons2)},
)
b = xr.DataArray(
    da.zeros((10, 10), chunks=2),
    dims=('y', 'x'),
    coords={'y': np.arange(10), 'x': np.arange(10),
            'lons': (('y', 'x'), lons2)},
)

I get no progress bar.

djhoese commented Jul 1, 2019

OK, another update. In the previous example I accidentally added the lons coordinate DataArray with its dimensions redefined, (('y', 'x'), lons2), which is technically redundant, but it worked (no progress bar).

However, if I fix this redundancy and do:

a = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'), coords={'lons': lons2})
b = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'), coords={'lons': lons2})
with ProgressBar():
    c = a + b

I do get a progress bar again (lons2 is being computed). I've tracked it down to this transpose, which runs even when it isn't needed, and that changes the dask array:

https://github.com/pydata/xarray/blob/master/xarray/core/variable.py#L1223

I'm not sure whether this would be considered a bug in dask or in xarray. I'm also not sure why the redundant version of the example worked.
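The mechanism can be seen at the dask level: even an identity-permutation transpose builds a new graph layer, producing a distinct array (typically with a new .name), which defeats identity-based comparisons. A small sketch (behavior observed around dask 2.x; newer releases may short-circuit differently):

```python
import dask.array as da

x = da.zeros((10, 10), chunks=2)
y = x.transpose(0, 1)  # identity permutation: same values, new graph layer

# The values are identical, but `y` may be a different object with a
# different name, so `x is y` and name-based checks can both fail.
same_values = bool((x.compute() == y.compute()).all())
print(same_values)
```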

djhoese commented Jul 1, 2019

Modifying this line to be:

if dims == expanded_var.dims:
    # already in the requested order; skip the no-op transpose
    return expanded_var
return expanded_var.transpose(*dims)

This avoids the issue, at least for the + case.

dhirschfeld commented:

FYI: @djhoese, you can inline code snippets using the permanent link to the source:

return expanded_var.transpose(*dims)


shoyer commented Jul 2, 2019

The source of the problem here is that when combining objects, xarray needs to decide which coordinates should remain. Our current heuristic, which pre-dates dask support, was really designed for arrays in memory: we keep coordinates if they are equal on both arguments, and remove them otherwise. In some cases we can avoid the computation, if we know that the coordinates are the same object.

I am open to ideas on how to make this work better.

djhoese commented Jul 2, 2019

@shoyer Understood. That explains why something like this wasn't caught before, but what would be the best solution for a short term fix?

For the long term, I also understand that there isn't really a good way to check equality of two dask arrays. I wonder if dask's graph optimization could be used to "simplify" two dask arrays' graph separately and check the graph equality. For example, two dask arrays created by doing da.zeros((10, 10), chunks=2) + 5 should be theoretically equal because their dask graphs are made up of the same tasks.

Edit: regarding a "short term fix": what is the best way to avoid the unnecessary transpose? Or is this not even the right way to approach it? Change dask to avoid the unnecessary transpose, change xarray to not do the transpose, or something else?

shoyer commented Jul 5, 2019

For the long term, I also understand that there isn't really a good way to check equality of two dask arrays. I wonder if dask's graph optimization could be used to "simplify" two dask arrays' graph separately and check the graph equality. For example, two dask arrays created by doing da.zeros((10, 10), chunks=2) + 5 should be theoretically equal because their dask graphs are made up of the same tasks.

Dask actually already does this canonicalization. If two arrays have the same name, they use the same dask graph, e.g.,

In [5]: x = da.zeros((10, 10), chunks=2) + 5

In [6]: y = da.zeros((10, 10), chunks=2) + 5

In [7]: x.name
Out[7]: 'add-f7441a0f46f5cf40458391cd08406c23'

In [8]: y.name
Out[8]: 'add-f7441a0f46f5cf40458391cd08406c23'

So xarray could safely look at .name on dask arrays (e.g., inside Variable.equals or duck_array_ops.array_equiv) for determining that two dask arrays are the same, rather than merely using is to check if they are the same objects.
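That name-based shortcut could look roughly like this; lazy_equiv is a hypothetical sketch of the idea, not xarray's actual implementation (the real fix landed later as lazy_array_equiv via #3453):

```python
import dask.array as da

def lazy_equiv(a, b):
    """Return True if arrays are provably equal without computing, else None.

    Hypothetical sketch: dask arrays built from the same graph share the
    same deterministic name, so equal names imply equal values.
    """
    if a is b:
        return True
    if isinstance(a, da.Array) and isinstance(b, da.Array) and a.name == b.name:
        return True
    return None  # undecidable lazily; caller falls back to a real comparison

x = da.zeros((10, 10), chunks=2) + 5
y = da.zeros((10, 10), chunks=2) + 5
print(lazy_equiv(x, y))  # True: identical graphs tokenize to the same name
```

Returning None (rather than False) for the undecidable case matters: two arrays with different names may still hold equal values, so only a full comparison can decide then.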

djhoese commented Jul 5, 2019

Ah, good call. The transpose currently in xarray would still be a problem though.

dcherian added a commit to dcherian/xarray that referenced this issue Oct 27, 2019
dcherian added a commit to dcherian/xarray that referenced this issue Oct 28, 2019
Dask arrays with the same graph have the same name. We can use this to quickly
compare dask-backed variables without computing.

Fixes pydata#3068 and pydata#3311
dcherian added a commit to dcherian/xarray that referenced this issue Oct 30, 2019
commit 08f7f74
Merge: 53c0f4e 278d2e6
Author: dcherian <deepak@cherian.net>
Date:   Tue Oct 29 09:36:58 2019 -0600

    Merge remote-tracking branch 'upstream/master' into fix/dask-computes

    * upstream/master:
      upgrade black verison to 19.10b0 (pydata#3456)
      Remove outdated code related to compatibility with netcdftime (pydata#3450)

commit 53c0f4e
Author: dcherian <deepak@cherian.net>
Date:   Tue Oct 29 09:25:27 2019 -0600

    Add identity check to lazy_array_equiv

commit 5e742e4
Author: dcherian <deepak@cherian.net>
Date:   Tue Oct 29 09:22:15 2019 -0600

    update whats new

commit ee0d422
Merge: e99148e 74ca69a
Author: dcherian <deepak@cherian.net>
Date:   Tue Oct 29 09:18:38 2019 -0600

    Merge remote-tracking branch 'upstream/master' into fix/dask-computes

    * upstream/master:
      Remove deprecated behavior from dataset.drop docstring (pydata#3451)
      jupyterlab dark theme (pydata#3443)
      Drop groups associated with nans in group variable (pydata#3406)
      Allow ellipsis (...) in transpose (pydata#3421)
      Another groupby.reduce bugfix. (pydata#3403)
      add icomoon license (pydata#3448)

commit e99148e
Author: dcherian <deepak@cherian.net>
Date:   Tue Oct 29 09:17:58 2019 -0600

    add concat test

commit 4a66e7c
Author: dcherian <deepak@cherian.net>
Date:   Mon Oct 28 10:19:32 2019 -0600

    review suggestions.

commit 8739ddd
Author: dcherian <deepak@cherian.net>
Date:   Mon Oct 28 08:32:15 2019 -0600

    better docstring

commit e84cc97
Author: dcherian <deepak@cherian.net>
Date:   Sun Oct 27 20:22:13 2019 -0600

    Optimize dask array equality checks.

    Dask arrays with the same graph have the same name. We can use this to quickly
    compare dask-backed variables without computing.

    Fixes pydata#3068 and pydata#3311
dcherian added a commit to dcherian/xarray that referenced this issue Nov 2, 2019
commit 0711eb0
Author: dcherian <deepak@cherian.net>
Date:   Thu Oct 31 21:18:58 2019 -0600

    bugfix.

commit 4ee2963
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Thu Oct 31 11:27:05 2019 -0600

    pep8

commit 6e4c11f
Merge: 08f7f74 53c5199
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Thu Oct 31 11:25:12 2019 -0600

    Merge branch 'master' into fix/dask-computes

dcherian added a commit that referenced this issue Nov 5, 2019
* Optimize dask array equality checks.

Dask arrays with the same graph have the same name. We can use this to quickly
compare dask-backed variables without computing.

Fixes #3068 and #3311

* better docstring

* review suggestions.

* add concat test

* update whats new

* Add identity check to lazy_array_equiv

* pep8

* bugfix.
joleenf added a commit to joleenf/satpy that referenced this issue Feb 26, 2021
This feature of xarray is not working well with dask
pydata/xarray#3068