Description
We've run into this when using Xarray together with Dask. At the moment, the typical way of opening a set of remote files looks like this:
import s3fs
import xarray as xr

# Anonymous access works for this public bucket; adjust credentials as needed.
fs = s3fs.S3FileSystem(anon=True)

file_list = []
model = "ACCESS-CM2"
variable = "hurs"
data_dir = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
file_list += [f"s3://{path}" for path in fs.glob(data_dir)]

# Opening the files here
files = [fs.open(f) for f in file_list]

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)
The files are accessed up front to read their metadata before the open file objects are serialised, and this initial access populates the file objects' caches with a lot of data. In our case this leaves a very large cache on 4 of the 130 files, which makes serialising them quite expensive.
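To make the effect concrete, here is a minimal sketch that reproduces it on a single file. It assumes anonymous access to the public bucket, uses cloudpickle (which ships with Dask) as a stand-in for whatever serialiser ends up shipping the objects to workers, and uses a plain read(1024) as a stand-in for the metadata reads xarray performs:

import cloudpickle
import s3fs

fs = s3fs.S3FileSystem(anon=True)
path = fs.glob("nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/hurs/*.nc")[0]

f = fs.open(path)
print(len(cloudpickle.dumps(f)))  # small: nothing has been read yet

f.read(1024)  # stand-in for the initial metadata reads; fills the read-ahead cache
print(len(cloudpickle.dumps(f)))  # much larger: the cached block travels with the pickle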
Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this an intentional decision, or simply something that hasn't been a problem so far?
Ideally, fsspec / the inheriting libraries would purge the cache before the file objects are serialised. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.
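To make the proposal concrete, here is a rough user-side sketch of what purging could look like today. It reaches into fsspec internals (cache, blocksize, _fetch_range, size), which is exactly why it would be nicer to have this behaviour in fsspec itself:

def purge_cache(f):
    # Replace the file's cache with a fresh, empty instance of the same cache
    # class, so that serialising `f` no longer drags the cached bytes along.
    # This relies on fsspec internals and is only a sketch of the behaviour
    # we would like fsspec to provide natively.
    f.cache = type(f.cache)(f.blocksize, f._fetch_range, f.size)
    return f

# e.g. applied to the file objects from the example above before they are
# handed to xr.open_mfdataset / Dask:
files = [purge_cache(f) for f in files]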