Support cache exclusion based on file name pattern #2704

Open
siddharthab opened this issue Nov 25, 2024 · 4 comments
Labels
feature request, p2

Comments

@siddharthab

Feature Request
The current on-disk caching logic assumes that a read starting at offset 0 means
the whole file will be read, so it begins caching the entire file preemptively.
This assumption breaks for formats where clients read a header or check a magic
number at the start of the file and then stop or seek elsewhere.

For example, BAM files in bioinformatics can exceed 100 GB. They are typically
stored remotely and accessed only through random reads, usually with the help of
a separate index file. However, these files also store metadata in the file header,
which every client reads first. Reading that header makes gcsfuse assume the entire
file will be read, so it starts caching the whole file in the on-disk cache. This can
quickly exhaust the available cache capacity. Users may therefore want to exclude
only these files while still caching all other files.
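
To make the failure mode concrete, here is a minimal sketch of the heuristic as described above. It is not gcsfuse code: `readRequest`, `triggersFullDownload`, and `bytesCached` are hypothetical names, and the model is deliberately simplified (it assumes reads at non-zero offsets are not cached at all).

```go
package main

// readRequest is a hypothetical stand-in for one read issued by a client.
type readRequest struct {
	Offset int64 // byte offset where the read starts
	Length int64 // number of bytes requested
}

// triggersFullDownload mimics the heuristic from the issue text: any read
// that starts at offset 0 is treated as the start of a full sequential scan.
func triggersFullDownload(r readRequest) bool {
	return r.Offset == 0
}

// bytesCached returns how many bytes land in the on-disk cache for a single
// read under this simplified model. Assumption: random reads cache nothing.
func bytesCached(r readRequest, objectSize int64) int64 {
	if triggersFullDownload(r) {
		return objectSize // the whole object is scheduled for caching
	}
	return 0
}
```

Under this model, a 4 KiB header read on a 100 GiB BAM file costs 100 GiB of cache space, while the random reads that follow cost nothing.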

Proposed solution
The configuration can include an option to exclude files from the on-disk cache if
their names match certain patterns. An example implementation is provided at #2043.
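
As an illustration of the idea only, and not the implementation in #2043, name-based exclusion could be sketched with Go's standard `regexp` package. The `cacheExcluder` type and its methods are hypothetical, and whether the pattern should be matched against the object name alone or against a bucket-qualified path is left open here.

```go
package main

import "regexp"

// cacheExcluder decides whether an object should bypass the on-disk cache.
// Hypothetical type for illustration; not a gcsfuse API.
type cacheExcluder struct {
	re *regexp.Regexp // nil means "exclude nothing"
}

// newCacheExcluder compiles the pattern once, so an invalid regex is
// reported at startup instead of on the first read. An empty pattern
// disables exclusion.
func newCacheExcluder(pattern string) (*cacheExcluder, error) {
	if pattern == "" {
		return &cacheExcluder{}, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &cacheExcluder{re: re}, nil
}

// excludes reports whether objectName matches the exclusion pattern.
// The match is unanchored unless the pattern itself uses ^ or $.
func (c *cacheExcluder) excludes(objectName string) bool {
	return c.re != nil && c.re.MatchString(objectName)
}
```

With the pattern `\.bam$`, an object such as `samples/NA12878.bam` would skip the cache, while its index file `samples/NA12878.bam.bai` would still be cached. The cache handler would consult `excludes` before scheduling a download.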

@siddharthab added the p2 and question labels Nov 25, 2024
@ashmeenkaur added the feature request label Nov 25, 2024
@ashmeenkaur
Collaborator

Using a regex pattern to exclude certain files from the file cache seems like a good addition. I can see that this might be useful for other use cases as well where customers want to exclude certain files from the cache.

What are your thoughts on this, @marcoa6? Can we proceed with the implementation behind an experimental flag?

cc: @vadlakondaswetha @charith87

@marcoa6
Collaborator

marcoa6 commented Nov 25, 2024

LGTM

@ashmeenkaur
Collaborator

Thanks @marcoa6!
@siddharthab we can proceed with the implementation behind the hidden flag experimental-file-cache-exclude-regex.

@siddharthab
Author

Thanks. I will redo the PR as per the testing instructions in the contributions guide.

@ashmeenkaur removed the question label Nov 26, 2024
3 participants