Reading data from zarr or netcdf files loads all data into memory with distributed scheduler #8969
Comments
Thanks for opening this issue. I can reproduce this behavior. Looking at the dashboard, my first suspicion was that this was an ordering issue, but inspecting this I cannot see any faults. Will have to look a little deeper.
Ah, I missed one thing. The `from_zarr` tasks are not detected as rootish. #8950 will fix this.
Using four zarr paths works fine, but adding the fifth changes the topology in a way that the scheduler won't detect the tasks as root-ish. @schlunma a mitigation until a proper fix is merged is to play with a configuration value of our scheduler:

```python
import dask

dask.config.set({"distributed.scheduler.rootish-taskgroup-dependencies": len(zarr_paths) + 1})
```

This parameter controls how our "root-ish" detection works. That's basically a heuristic that tries to detect data-generating tasks so that those can be scheduled a little differently. Be aware that this config has to be set before the cluster is started.
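For illustration, a minimal sketch of the ordering this implies, assuming the cluster is started in the same script; `LocalCluster`, the worker count, and the paths are placeholders for whatever setup is actually used:

```python
import dask
from distributed import Client, LocalCluster

if __name__ == "__main__":
    zarr_paths = [f"data_{i}.zarr" for i in range(5)]  # placeholder paths

    # Set the config value first ...
    dask.config.set(
        {"distributed.scheduler.rootish-taskgroup-dependencies": len(zarr_paths) + 1}
    )

    # ... and only then start the cluster, so the scheduler picks it up.
    cluster = LocalCluster(n_workers=4)
    client = Client(cluster)
```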
Thanks for the very quick answer @fjetter! Just tested this for the small MWE, and I can confirm that this fixes the problem! Amazing! I will run our more complicated pipeline tomorrow and check whether that is fixed as well. Cheers!
I just tested dask/dask#11558 and #8950 with zarr, xarray, and iris. Everything works smoothly with this! Thanks so much, looking forward to the new release!
The PRs were merged, so closing this one.
Describe the issue:
I am working with climate model output (mostly in zarr/netcdf format) and came across some issues when reading those data and calculating averages with a distributed scheduler. Looking at the Dask dashboard, I find that basically all the data is read into memory at some point.
Here is a screenshot of an example dashboard (input data size: ~45 GiB):
See also https://dask.discourse.group/t/reading-data-from-netcdf-or-zarr-files-loads-all-data-into-memory/3733 for more details and discussion about this.
Minimal Complete Verifiable Example:
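A minimal sketch of a reproducer of this kind, assuming a handful of existing on-disk zarr stores; the paths, cluster sizing, and the mean over the first axis are illustrative assumptions, not the original code:

```python
# Sketch of a reproducer (not the original MWE): read several zarr stores,
# concatenate them, and compute a simple reduction on a distributed cluster.
import dask.array as da
from distributed import Client, LocalCluster

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4))  # illustrative local cluster

    # Assumed to be existing zarr stores on disk, e.g. one per dataset.
    zarr_paths = [f"data_{i}.zarr" for i in range(5)]

    arrays = [da.from_zarr(path) for path in zarr_paths]
    data = da.concatenate(arrays, axis=0)

    # Watching the dashboard during this compute shows nearly all input
    # chunks being held in memory at once.
    result = data.mean(axis=0).compute()
```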
Corresponding Dask graph
Anything else we need to know?:
This behavior is more or less independent of the scheduler parameters and the actual calculation that is performed. I also tested this with a `dask_jobqueue.SLURMCluster` and found the same, and reading netcdf files using xarray shows the same behavior (see MWE below).

The issue does not appear if
- no concatenation (`da.concatenate`) is used, or
- arrays created directly with `dask.array` instead of arrays read from file are used (see MWE below).

Example with xarray (same problem as with zarr)
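A sketch of the xarray variant, assuming the same data written to netcdf files; the file names, the `time` concat dimension, and the variable name `tas` are illustrative assumptions:

```python
# Sketch (not the original MWE): open several netcdf files lazily with xarray
# and compute a reduction on a distributed cluster.
import xarray as xr
from distributed import Client, LocalCluster

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4))  # illustrative local cluster

    # Assumed to be existing netcdf files, one per dataset.
    netcdf_paths = [f"data_{i}.nc" for i in range(5)]

    # open_mfdataset concatenates the files into dask-backed arrays.
    ds = xr.open_mfdataset(netcdf_paths, combine="nested", concat_dim="time")

    # "tas" is a placeholder variable name; the same memory blow-up is
    # observed regardless of the variable or reduction used.
    result = ds["tas"].mean("time").compute()
```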
Corresponding Dask graph using xarray
Example using dask.array directly (no problem: memory usage per worker is basically always < 2 GiB, dashboard looks clean)
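A sketch of the variant that behaves well, with randomly generated dask arrays standing in for the data read from disk; shapes and chunk sizes are illustrative assumptions:

```python
# Sketch (not the original MWE): same computation, but on generated dask
# arrays instead of arrays read from zarr/netcdf files.
import dask.array as da
from distributed import Client, LocalCluster

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4))  # illustrative local cluster

    # Five arrays with the same nominal shape/chunking as the on-disk data.
    arrays = [
        da.random.random((1000, 500, 500), chunks=(100, 500, 500))
        for _ in range(5)
    ]
    data = da.concatenate(arrays, axis=0)

    # Here the workers stream through the chunks and per-worker memory
    # stays low, as described above.
    result = data.mean(axis=0).compute()
```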
Corresponding Dask graph using dask.array
Environment: