-
I have a Python script, running from my personal laptop (machine A). This script starts a Coiled cluster with dask.distributed. Then NetCDF files are opened together with `xr.open_mfdataset`. Things go a bit funny with `open_mfdataset`. (A bit off topic for this conversation, but if I call `open_mfdataset` within a class, my preprocess function needs to be outside of the class as a standalone function, otherwise I get serialization issues with `partial`... Anyway, back to the main problem.)

`open_mfdataset` uses a preprocess function. This function simply creates missing variables to ensure the dataset is consistent. However, when the preprocess function is run on the cluster, machine A's network bandwidth goes crazy. It looks like all the NetCDF files being imported via dask on the remote cluster are being downloaded to my local machine (my laptop is suddenly downloading 10 GB+ of data). I also noticed a lot of local memory usage. To me it doesn't make any sense, as any computation should run on the remote cluster, yet this is the behaviour I'm experiencing.

I've managed to reproduce the problem by simplifying my preprocess function to the extreme. Even with something as simple as the identity function below, I encounter the same behaviour:

```python
def preprocess(ds):
    return ds
```

I have by no means an amazing understanding of dask/zarr/xarray, so it's very likely I'm not following best practices (even though I've been looking). I often run into cases where my local machine crashes because all of its memory is used up when I process too many files at once. This defeats the purpose of a remote cluster.

I didn't write an issue, as it's a bit of a weird one and I'm not sure if it's a dask/xarray/Coiled issue or a me issue, but I'm seeking some help/advice, and wondering whether this has ever been noticed before.

cheers
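A minimal sketch of the setup described above, for reference; the bucket, prefix, engine, and worker count are hypothetical placeholders, not values from this thread:

```python
import coiled
import s3fs
import xarray as xr

def preprocess(ds):
    # Must be a module-level function: a method would capture `self` and
    # fail to serialize when dask ships the callable to the workers.
    return ds

cluster = coiled.Cluster(n_workers=10)  # hypothetical size
client = cluster.get_client()

fs = s3fs.S3FileSystem()
files = [fs.open(p) for p in fs.glob("mybucket/netcdf/*.nc")]  # hypothetical prefix

# parallel=True wraps each file open (and preprocess call) in a dask task,
# which should run on the cluster rather than on machine A.
ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    parallel=True,
    preprocess=preprocess,
)
```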
-
As a reproducible example, see …, where the … . With this slightly different gist, the local memory is used, which is "normal" since the … .
-
You are not using the cluster client to do anything... so the code is being run locally?
-
https://github.com/heliocloud-data/science-tutorials/blob/main/S3-Dask-Demo.ipynb
-
So you're suggesting using …? All the documentation I read suggests that when a client is set up, as per my example, all `open_mfdataset` operations will use the cluster/client, especially with the `parallel=True` argument. See

https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html

or other Coiled docs such as:

```python
import xarray as xr
import coiled

cluster = coiled.Cluster(
    n_workers=500,
    region="us-west-2",
    worker_memory="64 GiB",
    spot_policy="spot",
)
client = cluster.get_client()

ds = xr.open_mfdataset(
    "s3://mybucket/all-data.zarr",
    engine="zarr", parallel=True,
)
```

which is the most basic example, and not much different from my example.
-
I have never used Coiled, but yes: if things are happening locally and not on the cluster as expected, try explicitly sending those functions/commands to the cluster and see if the behaviour changes?
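A sketch of what "explicitly send to the cluster" could look like with `Client.submit`; the paths, variable name, and concat dimension are hypothetical, and the point is simply that both the file opens and the reduction happen worker-side:

```python
import fsspec
import xarray as xr
from distributed import Client

def open_and_reduce(urls):
    # Runs entirely on a worker: the NetCDF bytes never touch the laptop.
    files = [fsspec.open(u).open() for u in urls]
    dsets = [xr.open_dataset(f, engine="h5netcdf") for f in files]  # eager, no dask
    ds = xr.concat(dsets, dim="time")  # hypothetical concat dimension
    return float(ds["temperature"].mean())  # return a small result, not the dataset

client = Client()  # or cluster.get_client() with Coiled
future = client.submit(open_and_reduce, ["s3://mybucket/a.nc", "s3://mybucket/b.nc"])
print(future.result())
```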
Beta Was this translation helpful? Give feedback.
I got some help from Coiled, thanks @phofl!
For reference, it turns out that the issue is related to s3fs and was already lodged here:
fsspec/filesystem_spec#1747
The solution is to use this obscure option in s3fs:

`default_file_cache=None`
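A sketch of wiring a cache-disabling option into the open, for anyone landing here; the bucket/prefix are hypothetical, and note that current s3fs documents the related constructor parameters as `default_fill_cache` and `default_cache_type` rather than the spelling quoted above, so check the exact name against your s3fs version:

```python
import s3fs
import xarray as xr

# Assumed parameter names from s3fs.S3FileSystem; the goal is to stop
# fsspec from buffering file blocks on the machine that starts the code.
fs = s3fs.S3FileSystem(
    default_fill_cache=False,   # don't pre-fill the read cache on open
    default_cache_type="none",  # disable fsspec's block cache entirely
)

files = [fs.open(p) for p in fs.glob("mybucket/netcdf/*.nc")]  # hypothetical prefix
ds = xr.open_mfdataset(files, engine="h5netcdf", parallel=True)
```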
It sounds a bit insane that not many people are experiencing this issue, as it means that without this option a dask cluster over remote NetCDF files is useless: the bottleneck becomes the machine that starts the code.