Labels: enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
- Part of [Epic] Enable parquet metadata cache by default #17000
@nuno-faria implemented the core Parquet metadata caching logic in the following PR:
- feat: Cache Parquet metadata in built in parquet reader #16971
However, as implemented there is no bound on the amount of memory held by the cache, which results in a "leak" over time (i.e., memory usage always goes up and never comes back down).
Describe the solution you'd like
I would like the cache to have an upper memory limit so that people can turn it on / off knowing its resource use is capped.
Describe alternatives you've considered
I personally recommend:
- Adding another runtime configuration setting, `datafusion.runtime.file_metadata_cache_limit`, with the same interface as `datafusion.runtime.memory_limit`
- Implementing a basic LRU strategy for the cache (when the limit is exceeded, evict the least recently used elements until there is space); a rough sketch follows this list
- Tests for the above
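To illustrate the LRU idea, here is a minimal sketch of a memory-bounded cache. The names (`MemoryBoundedCache`, `HeapSize`) are hypothetical rather than DataFusion's actual API, and the bookkeeping is deliberately simplistic:

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Values that can report their approximate memory footprint. For Parquet
/// entries, `ParquetMetaData::memory_size()` could supply this number.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

/// Cache that evicts least-recently-used entries once `limit` bytes would be exceeded.
struct MemoryBoundedCache<K, V> {
    limit: usize,
    used: usize,
    map: HashMap<K, V>,
    /// Front = least recently used, back = most recently used.
    order: VecDeque<K>,
}

impl<K: Eq + Hash + Clone, V: HeapSize> MemoryBoundedCache<K, V> {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        if self.map.contains_key(key) {
            // Move the key to the most-recently-used position.
            self.order.retain(|k| k != key);
            self.order.push_back(key.clone());
        }
        self.map.get(key)
    }

    fn put(&mut self, key: K, value: V) {
        let size = value.heap_size();
        // Replace an existing entry, if any, before accounting for the new one.
        if let Some(old) = self.map.remove(&key) {
            self.used -= old.heap_size();
            self.order.retain(|k| *k != key);
        }
        // Evict the least recently used entries until the new value fits.
        // Note: in this sketch an entry larger than the limit is still inserted.
        while self.used + size > self.limit {
            let Some(lru_key) = self.order.pop_front() else { break };
            if let Some(evicted) = self.map.remove(&lru_key) {
                self.used -= evicted.heap_size();
            }
        }
        self.used += size;
        self.order.push_back(key.clone());
        self.map.insert(key, value);
    }
}
```

In a real implementation the key would likely identify the file (path plus size / modification time), the value would wrap an `Arc<ParquetMetaData>` whose footprint comes from `ParquetMetaData::memory_size`, and the O(n) `retain` calls would be replaced by a proper LRU list or ordered map.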
You can get the memory usage for `ParquetMetaData` using the following API: https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaData.html#method.memory_size. Some care will be needed to make this work with the traits (e.g. you may have to change `FileMetadata` into a trait).
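As a hedged sketch of what such a trait could look like (the trait name and method are assumptions based on the suggestion above, not DataFusion's current API, and the real refactor may differ):

```rust
use std::any::Any;
use std::sync::Arc;

// Requires a dependency on the `parquet` crate.
use parquet::file::metadata::ParquetMetaData;

/// Cached per-file metadata that can report its approximate memory use,
/// so the cache can enforce a byte limit regardless of file format.
trait FileMetadata: Any + Send + Sync {
    /// Approximate number of bytes this entry keeps alive in the cache.
    fn memory_size(&self) -> usize;

    /// Allow downcasting back to the concrete metadata type.
    fn as_any(&self) -> &dyn Any;
}

/// Parquet-specific implementation backed by the parquet crate's metadata type.
struct CachedParquetMetadata(Arc<ParquetMetaData>);

impl FileMetadata for CachedParquetMetadata {
    fn memory_size(&self) -> usize {
        // `ParquetMetaData::memory_size` estimates the heap space used by the metadata.
        self.0.memory_size()
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}
```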
Additional context
No response