Multidimensional dask coordinates unexpectedly computed #3068
Comments
Ok, I'm getting a little more of an understanding of this. The main issue is that the dask array is not literally considered the same object because I'm creating it twice. If I create a single dask array and pass it:

```python
import numpy as np
import dask.array as da
import xarray as xr

lons = da.zeros((10, 10), chunks=2)
a = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'y': np.arange(10), 'x': np.arange(10), 'lons': (('y', 'x'), lons)})
b = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'y': np.arange(10), 'x': np.arange(10), 'lons': (('y', 'x'), lons)})
```

I still get the progress bar, because xarray is creating two new `Variable` objects internally. If I make a single DataArray that becomes the coordinate variable, then it seems to work:

```python
lons2 = xr.DataArray(lons, dims=('y', 'x'))
a = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'y': np.arange(10), 'x': np.arange(10), 'lons': (('y', 'x'), lons2)})
b = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'y': np.arange(10), 'x': np.arange(10), 'lons': (('y', 'x'), lons2)})
```

I get no progress bar.
Ok, another update. In the previous example I accidentally specified the dims redundantly when passing `lons2` (which already has dims) as a coordinate. However, if I fix this redundancy and do:

```python
from dask.diagnostics import ProgressBar

a = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'), coords={'lons': lons2})
b = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'), coords={'lons': lons2})

with ProgressBar():
    c = a + b
```

I do get a progress bar again (`lons2` is being computed). I've tracked it down to this line: https://github.com/pydata/xarray/blob/master/xarray/core/variable.py#L1223. I'm not sure if this would be considered a bug in dask or xarray. Also, I'm not sure why the redundant version of the example worked.
Modifying this line to be:

```python
if dims == expanded_var.sizes:
    return expanded_var
return expanded_var.transpose(*dims)
```

avoids the issue, at least for the example in this issue.
FYI @djhoese, you can inline code snippets using the permanent link to the source: `xarray/core/variable.py`, line 1223 at e5bb647.
The source of the problem here is that when combining objects, xarray needs to decide which coordinates should remain. Our current heuristic, which pre-dates dask support, was really designed for arrays in memory: we keep coordinates around if they are equal on both arguments, and remove them otherwise. In some cases we can avoid the computation, if we know that the coordinates are the same object. I am open to ideas on how to make this work better.
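For illustration, a small made-up in-memory example of that heuristic's observable behavior (values are not from the original thread):

```python
import numpy as np
import xarray as xr

x = xr.DataArray(np.ones(3), dims='t', coords={'label': ('t', ['a', 'b', 'c'])})
y = xr.DataArray(np.ones(3), dims='t', coords={'label': ('t', ['a', 'b', 'c'])})
z = xr.DataArray(np.ones(3), dims='t', coords={'label': ('t', ['d', 'e', 'f'])})

print((x + y).coords)  # 'label' survives: the coordinate values compare equal
print((x + z).coords)  # 'label' is dropped: the coordinate values conflict
```

With dask-backed coordinates, performing exactly this equality check is what forces the unwanted compute.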
@shoyer Understood. That explains why something like this wasn't caught before, but what would be the best solution for a short-term fix? For the long term, I also understand that there isn't really a good way to check equality of two dask arrays. I wonder if dask's graph optimization could be used to "simplify" two dask arrays' graphs separately and check the graph equality. For example, two dask arrays created by doing `da.zeros((10, 10), chunks=2)` twice should reduce to equivalent graphs.

Edit: "short-term fix": What is the best way to avoid the unnecessary transpose? Or is this not even the right way to approach this? Change dask to avoid the unnecessary transpose, change xarray to not do the transpose, or something else?
Dask actually already does this canonicalization. If two arrays have the same `name`, they are assumed to refer to the same computation and therefore the same values.

So xarray could safely look at the `name` attribute of the underlying dask arrays: if the names match, the arrays can be treated as equal without computing them.
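A minimal sketch of what such a check could look like (illustrative only; `same_dask_name` is a hypothetical helper, not xarray API):

```python
import dask.array as da
import xarray as xr

def same_dask_name(a: xr.DataArray, b: xr.DataArray, coord: str) -> bool:
    """Return True if a coordinate is backed by dask arrays with identical names."""
    x = a.coords[coord].data  # .data returns the underlying array without computing
    y = b.coords[coord].data
    return isinstance(x, da.Array) and isinstance(y, da.Array) and x.name == y.name

lons = da.zeros((10, 10), chunks=2)
a = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'lons': (('y', 'x'), lons)})
b = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'lons': (('y', 'x'), lons)})

print(same_dask_name(a, b, 'lons'))  # True: both wrap the same dask array, so the names match
```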
Ah, good call. The `name` attribute looks like it could be used for this check.
Dask arrays with the same graph have the same name. We can use this to quickly compare dask-backed variables without computing. Fixes pydata#3068 and pydata#3311
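A rough sketch of the idea behind the fix (not the actual xarray code; the commit summary below mentions a helper called `lazy_array_equiv` along these lines):

```python
import dask.array as da

def lazy_array_equiv(arr1, arr2):
    """Lazily decide whether two arrays hold equal values.

    Returns True or False when that can be determined cheaply,
    and None when only a real (computed) comparison could tell.
    """
    if arr1 is arr2:
        # same object, trivially equal
        return True
    if getattr(arr1, "shape", None) != getattr(arr2, "shape", None):
        return False
    if isinstance(arr1, da.Array) and isinstance(arr2, da.Array):
        if arr1.name == arr2.name:
            # same name implies same graph, so same values; no compute needed
            return True
    # can't decide lazily; the caller falls back to a computed comparison
    return None
```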
* Optimize dask array equality checks. Dask arrays with the same graph have the same name. We can use this to quickly compare dask-backed variables without computing. Fixes #3068 and #3311
* better docstring
* review suggestions.
* add concat test
* update whats new
* Add identity check to lazy_array_equiv
* pep8
* bugfix.
This feature of xarray is not working well with dask pydata/xarray#3068
MCVE Code Sample
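The original snippet and its output are not preserved in this capture; below is a minimal reproduction consistent with the code in the comments above (exact shapes and chunk sizes are assumptions):

```python
import dask.array as da
import xarray as xr
from dask.diagnostics import ProgressBar

# Two DataArrays that each carry a 2D dask-backed coordinate
a = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'lons': (('y', 'x'), da.zeros((10, 10), chunks=2))})
b = xr.DataArray(da.zeros((10, 10), chunks=2), dims=('y', 'x'),
                 coords={'lons': (('y', 'x'), da.zeros((10, 10), chunks=2))})

with ProgressBar():
    # the ProgressBar shows that the 'lons' coordinate gets computed here
    c = a + b
```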
Output:
Problem Description
Using arrays with 2D dask array coordinates results in the coordinates being computed for any binary operations (anything combining two or more DataArrays). I use `ProgressBar` in the above example to show when coordinates are being computed.

In my own work, when I learned that 2D dask coordinates were possible, I started adding `longitude` and `latitude` coordinates. These are rather large and can take a while to load/compute, so I was surprised that simple operations (e.g. `a.fillna(b)`) were causing things to be computed and taking a long time.

Is this computation by design or a possible bug?
Expected Output
No output from the `ProgressBar`, hoping that no coordinates would be computed/loaded.

Output of `xr.show_versions()`:
INSTALLED VERSIONS
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 02:16:08)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.12.1
pandas: 0.24.2
numpy: 1.14.3
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.0.0
distributed: 2.0.0
matplotlib: 3.1.0
cartopy: 0.17.1.dev147+HEAD.detached.at.5e624fe
seaborn: None
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: 4.6.3
IPython: 7.5.0
sphinx: 2.1.2