Support cache exclusion based on file name pattern #2704

siddharthab · 2024-11-25T00:32:51Z

Feature Request
The current logic for caching a file on disk assumes that if the user starts reading
a file at offset 0, the entire file will be read and should therefore be cached
preemptively. However, this assumes that clients never read just a file header
or perform a magic number check at the beginning of the file.

For example, in bioinformatics, BAM files can be 100+ GB in size.
They are typically stored remotely and accessed only through random
reads, usually enabled by a separate index file.
However, these files also store metadata in the file header, which all clients
want to read first. An attempt to read that metadata makes gcsfuse assume that
the entire file will be read, so it begins caching the entire file in the
on-disk cache. This can deplete the available cache capacity very quickly.
Users may therefore want to exclude only these special files while still caching all other files.

Proposed solution
The configuration can include options to exclude files from on-disk cache if their names
follow certain patterns. An example implementation is provided at #2043.
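
To illustrate the idea, here is a minimal sketch in Python. It is not gcsfuse code and is not taken from #2043; the function name `should_cache`, its parameters, and the `\.bam$` pattern are all hypothetical. It assumes the pattern is an unanchored regular expression matched against the object name, and it models the existing heuristic simply as "a read starting at offset 0 triggers caching".

```python
import re
from typing import Optional


def should_cache(object_name: str, read_offset: int,
                 exclude_regex: Optional[str] = None) -> bool:
    """Hypothetical helper: decide whether a read should trigger on-disk caching."""
    # Assumption: an exclusion pattern, when set, takes priority over the
    # offset heuristic, and is matched anywhere in the object name.
    if exclude_regex and re.search(exclude_regex, object_name):
        return False
    # Simplified model of the current heuristic: a read that starts at
    # offset 0 is treated as the start of a full sequential read.
    return read_offset == 0


if __name__ == "__main__":
    pattern = r"\.bam$"  # hypothetical pattern: exclude BAM files only
    for name in ("samples/reads.bam", "samples/reads.bam.bai", "docs/notes.txt"):
        print(name, should_cache(name, 0, pattern))
```

With a pattern like this, a header read on a BAM file would no longer start a full download into the cache, while the small index file and unrelated files would still be cached as before.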

ashmeenkaur · 2024-11-25T04:46:45Z

Using a regex pattern to exclude certain files from the file cache seems like a good addition. It could also be useful in other cases where customers want to keep certain files out of the cache.

What are your thoughts on this, @marcoa6? Can we proceed with the implementation behind an experimental flag?

cc: @vadlakondaswetha @charith87

marcoa6 · 2024-11-25T15:33:16Z

LGTM

ashmeenkaur · 2024-11-26T05:00:51Z

Thanks @marcoa6!
@siddharthab, we can proceed with the implementation behind the hidden flag experimental-file-cache-exclude-regex.

siddharthab · 2024-11-26T05:46:44Z

Thanks. I will redo the PR as per the testing instructions in the contributions guide.

siddharthab added the p2 and question labels Nov 25, 2024

siddharthab mentioned this issue Nov 25, 2024

Exclude files from cache based on name #2043

Open

ashmeenkaur added the feature request label Nov 25, 2024

ashmeenkaur removed the question label Nov 26, 2024
