Feature Request
The current on-disk caching logic assumes that if the user starts reading a file
from offset 0, the entire file will be read, so it should be preemptively cached.
However, this assumption breaks down when a file has a header at the beginning
and/or when clients perform magic number checks.
For example, in bioinformatics, BAM files can be 100+ GB in size.
They are typically meant to be stored remotely and accessed only through random
reads. Random access is usually enabled through a separate index file.
However, these files also store metadata in the file header, which all clients will
want to read first. An attempt to read the metadata from the file header makes
gcsfuse assume that the entire file will be read, so gcsfuse begins caching the
entire file in the on-disk cache. This can very quickly deplete the available cache capacity.
So the user might want to exclude only these special files while still caching all other files.
Proposed solution
The configuration can include options to exclude files from the on-disk cache if their names
match certain patterns. An example implementation is provided at #2043.
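To illustrate the idea, here is a minimal Python sketch of how a name-based exclusion check could sit in front of the offset-0 heuristic described above. It is not gcsfuse code and does not reflect the implementation in #2043; the function name `should_cache`, its parameters, and the example pattern are all hypothetical.

```python
import re
from typing import Optional


def should_cache(object_name: str, read_offset: int,
                 exclude_regex: Optional[str] = None) -> bool:
    """Hypothetical cache decision, for illustration only.

    Mirrors the heuristic described in this issue (a read starting at
    offset 0 triggers caching of the whole file), but first skips any
    object whose name matches the user-supplied exclusion pattern.
    """
    # Excluded names are never cached, regardless of the read offset.
    if exclude_regex is not None and re.search(exclude_regex, object_name):
        return False
    # Assumed heuristic: only a read from the start of the file triggers caching.
    return read_offset == 0


# Hypothetical pattern: exclude BAM files but still cache their index files.
pattern = r"\.bam$"
print(should_cache("samples/a.bam", 0, pattern))      # header read, excluded
print(should_cache("samples/a.bam.bai", 0, pattern))  # index file, still cached
```

With a pattern like this, reading the header of a large BAM file would no longer trigger a full download into the cache, while the much smaller index files and all other objects would be cached as before.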
Using a regex pattern to exclude certain files from the file cache seems like a good addition. I can see that this might be useful for other use cases as well where customers want to exclude certain files from the cache.
What are your thoughts on this, @marcoa6? Can we proceed with the implementation behind an experimental flag?