
Opening lots of files can be slow #816

@jrbourbeau

Description


When I open a file on S3 like this:

```python
import fsspec

fs = fsspec.filesystem("s3", anon=True)
path = "coiled-datasets/uber-lyft-tlc/part.93.parquet"
fs.open(path, mode="rb")
```

The `fs.open` call often takes ~0.5–1.5 seconds to run. Here's a snakeviz profile (again, just of the `fs.open` call) where it looks like most of the time is spent in a `details` call that hits S3:

[Screenshot: snakeviz profile of the `fs.open` call, dominated by a `details` call]

I think this is mostly to get the file size (though I'm not sure why the size is needed at file-object creation time), because if I pass the file size to `fs.open`, things are much faster:

[Screenshot: snakeviz profile of the same `fs.open` call with the file size passed in, showing much less time spent]
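For reference, a minimal sketch of the workaround described above, using fsspec's in-memory filesystem as a local stand-in for S3 (the path and contents are made up; the `size=` keyword is the one fsspec's buffered file classes accept and s3fs forwards through `open`):

```python
import fsspec

# In-memory filesystem standing in for S3, with a hypothetical file.
fs = fsspec.filesystem("memory")
fs.pipe("/part.93.parquet", b"x" * 1024)

# Fetch the size once up front -- on S3 this is the metadata request
# that shows up inside the `details` call in the profile.
size = fs.info("/part.93.parquet")["size"]

# Passing size= means open() doesn't have to look it up itself.
f = fs.open("/part.93.parquet", mode="rb", size=size)
data = f.read()
```

The memory filesystem ignores `size=`, but against s3fs this is exactly the shortcut that made the second profile faster.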

@martindurant do you have a sense of what's possible here to speed up opening files?

The actual use case I'm interested in is passing a bunch (~100k) of netCDF files to Xarray, whose h5netcdf engine requires open file objects.
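For the many-files case, one possible approach (a sketch, again using the in-memory filesystem with made-up file names and sizes) is to list the containing directory once with `detail=True` and feed each entry's size through to `open`, so the per-file metadata request is replaced by a single listing call:

```python
import fsspec

# In-memory filesystem standing in for S3, with a few hypothetical files.
fs = fsspec.filesystem("memory")
for i in range(3):
    fs.pipe(f"/data/part.{i}.nc", b"\0" * (100 + i))

# One listing call returns every file's name and size...
entries = fs.ls("/data", detail=True)

# ...so each open() can skip its own size lookup.
files = [fs.open(e["name"], mode="rb", size=e["size"]) for e in entries]
```

The resulting open file objects could then be handed to Xarray's h5netcdf engine, though whether one listing call scales cleanly to 100k objects would need checking.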
