
Serialising a File will also serialise the cache which can grow very large #1747

@phofl

We've run into this when using Xarray together with Dask. At the moment, the typical way of setting this up looks like this:

import s3fs
import xarray as xr

# Anonymous access works here because the NEX-GDDP-CMIP6 bucket is public.
fs = s3fs.S3FileSystem(anon=True)

model = "ACCESS-CM2"
variable = "hurs"
data_dir = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
file_list = [f"s3://{path}" for path in fs.glob(data_dir)]

# Opening the files here; each open file carries its own read cache
files = [fs.open(f) for f in file_list]

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)

The files are accessed up front to read their metadata before they are serialised, and this initial access populates the cache with a lot of data.

For 4 of the 130 files this leaves a very large cache behind, which makes serialising the file objects expensive.
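
To make the effect visible, one can compare the pickled size of a single file object before and after data has been read through it. This is only a rough sketch: it reuses fs and file_list from the snippet above, assumes the file objects pickle in your environment (which is what happens when Dask ships them to workers), and the amount of data that ends up in the cache depends on the configured cache type and block size.

import pickle

f = fs.open(file_list[0])
print(len(pickle.dumps(f)))   # freshly opened: small, no cached bytes yet

f.read(16 * 2**20)            # read ~16 MiB; the read cache now holds data
print(len(pickle.dumps(f)))   # now roughly the size of the cached bytes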

Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision made intentionally, or is it just something that hasn't been a problem so far?

Ideally, fsspec / the inheriting libraries would purge the cache before the file objects are serialised. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.
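
To illustrate the idea (not the exact implementation, and not fsspec's actual API), the pattern would be a __getstate__ that drops the cached bytes from the pickled state; in fsspec the natural place for this would be AbstractBufferedFile or the backend-specific file classes, which keep the read cache on the instance. The class below is a self-contained toy stand-in, and all names in it are made up for illustration.

import pickle


class FileWithCache:
    """Toy stand-in for an fsspec file object, illustrative only."""

    def __init__(self, path):
        self.path = path
        self.cache = b""              # filled with remote bytes as reads happen

    def __getstate__(self):
        state = self.__dict__.copy()
        state["cache"] = b""          # purge cached bytes before pickling
        return state


f = FileWithCache("s3://bucket/file.nc")
f.cache = b"x" * 10_000_000           # pretend a read filled the cache with ~10 MB
assert len(pickle.dumps(f)) < 1_000   # but the pickled payload stays tiny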
