
Serialising a File will also serialise the cache which can grow very large #1747

@phofl

We've run into this when using Xarray together with Dask. At the moment, the typical way of setting this up looks like this:

import s3fs
import xarray as xr

# Anonymous access works here because the NEX-GDDP-CMIP6 bucket is public.
fs = s3fs.S3FileSystem(anon=True)

model = "ACCESS-CM2"
variable = "hurs"
data_dir = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
file_list = [f"s3://{path}" for path in fs.glob(data_dir)]

# Opening the files here; each open file carries its own read cache
files = [fs.open(f) for f in file_list]

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)

The files are accessed up front to read their metadata before they are serialised, and this initial access populates the cache with a lot of data.

For 4 of the 130 files this leaves a very large cache behind, which makes serialising the file objects expensive.
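
To make the effect visible, one can compare the pickled size of a single file object before and after data has been read through it. This is only a rough sketch: it reuses fs and file_list from the snippet above, assumes the file objects pickle in your environment (which is what happens when Dask ships them to workers), and the amount of data that ends up in the cache depends on the configured cache type and block size.

import pickle

f = fs.open(file_list[0])
print(len(pickle.dumps(f)))   # freshly opened: small, no cached bytes yet

f.read(16 * 2**20)            # read ~16 MiB; the read cache now holds data
print(len(pickle.dumps(f)))   # now roughly the size of the cached bytes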

Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision made intentionally, or is it just something that hasn't been a problem so far?

Ideally, fsspec / the inheriting libraries would purge the cache before the file objects are serialised. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.
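
To illustrate the idea (not the exact implementation, and not fsspec's actual API), the pattern would be a __getstate__ that drops the cached bytes from the pickled state; in fsspec the natural place for this would be AbstractBufferedFile or the backend-specific file classes, which keep the read cache on the instance. The class below is a self-contained toy stand-in, and all names in it are made up for illustration.

import pickle


class FileWithCache:
    """Toy stand-in for an fsspec file object, illustrative only."""

    def __init__(self, path):
        self.path = path
        self.cache = b""              # filled with remote bytes as reads happen

    def __getstate__(self):
        state = self.__dict__.copy()
        state["cache"] = b""          # purge cached bytes before pickling
        return state


f = FileWithCache("s3://bucket/file.nc")
f.cache = b"x" * 10_000_000           # pretend a read filled the cache with ~10 MB
assert len(pickle.dumps(f)) < 1_000   # but the pickled payload stays tiny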
