Conversation

nuno-faria
Contributor

Which issue does this PR close?

Rationale for this change

Controlling the memory used by the metadata cache prevents unbounded growth, which could otherwise lead to out-of-memory errors in long-running applications. Entries are evicted using an LRU (least recently used) policy.

The limit can be set with the runtime config datafusion.runtime.file_metadata_cache_limit (e.g., set datafusion.runtime.file_metadata_cache_limit = '1G').
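For example, from application code the same statement can be issued through a SessionContext. A minimal sketch (assuming the tokio runtime and the standard SessionContext SQL API; only the config key itself comes from this PR):

use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Cap the file metadata cache at 100 MB instead of the default.
    ctx.sql("SET datafusion.runtime.file_metadata_cache_limit = '100M'")
        .await?
        .collect()
        .await?;
    Ok(())
}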

What changes are included in this PR?

  • Added the datafusion.runtime.file_metadata_cache_limit config (defaults to '1G').
  • Added the lru crate to implement the LRU semantics.
  • Added memory-related methods to the FileMetadata and FileMetadataCache traits (sketched below).
  • Updated the DefaultFilesMetadataCache struct to limit the memory used.
  • Updated the CachedParquetMetaData struct to provide the Parquet metadata size.
  • Added unit tests.
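A rough sketch of the shape of those memory-related hooks (the method names here are assumptions and the traits are simplified; the actual traits in the PR carry additional methods):

pub trait FileMetadata: Send + Sync {
    /// Approximate memory consumed by this cached entry, in bytes, so the
    /// cache can account for it against the configured limit.
    fn memory_size(&self) -> usize;
}

pub trait FileMetadataCache: Send + Sync {
    /// Current memory limit of the cache, in bytes.
    fn cache_limit(&self) -> usize;

    /// Update the limit (e.g. when the runtime config changes), evicting
    /// entries if the cache is now over budget.
    fn update_cache_limit(&self, limit: usize);
}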

Are these changes tested?

Yes

Are there any user-facing changes?

Added a new runtime config to control the metadata cache size; since metadata caching itself is disabled by default, existing applications are not impacted.

@github-actions bot added labels on Aug 4, 2025: documentation (Improvements or additions to documentation), core (Core DataFusion crate), execution (Related to the execution crate), datasource (Changes to the datasource crate)
Contributor Author

@nuno-faria nuno-faria left a comment

cc: @alamb

Cargo.toml Outdated
indexmap = "2.10.0"
itertools = "0.14"
log = "^0.4"
lru = "0.16.0"
Contributor Author

The lru crate appears to be well maintained and used by a number of popular crates (like tracing-log, redis, and aws-sdk-s3). Let me know if it is ok to include it.

Contributor

I agree it seems to be a reasonable crate. However, I think in general if we can avoid new dependencies in DataFusion that would be good -- our dependency trail is already quite large, and I realize one new dependency doesn't seem like much (but that is what we said when introducing all the existing ones too 😢 )

Note lru is also a net new dependency (no existing DataFusion dependency uses it)

It also has a bunch of unsafe code, which isn't necessarily a deal breaker by itself, but unless it is performance critical I think we should avoid a potential source of crashes / non-deterministic bugs

Contributor

However, I did some research and I think implementing an LRU cache in Rust that actually has O(1) properties will be non-trivial (there is a good writeup here: https://seanchen1991.github.io/posts/lru-cache/)

My personal preference would be to implement something custom but I am really torn about this, especially given it would be nice to implement other LRU caches (like listing and statistics, for example) 🤔

The best I could come up with was using a HashMap<Path, usize> that maps to an index in a VecDeque, with the VecDeque implementing the linked list described in the blog. I don't think it would be simple though

Contributor Author

I'll look into it.

Contributor

O(1) only tells you half of the story. A traditional LRU still requires expensive bookkeeping for a GET+MISS case since you need to preserve the LRU order for evictions -- or you perform a potentially expensive (partial) sort whenever you need to evict data. So I think the question should be whether this cache is read-optimized and whether your GET operations are concurrent. If both are answered with YES, then I suggest we just implement SIEVE. It's simple and quite performant.
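For reference, a minimal single-threaded sketch of the SIEVE policy (illustrative only and not this PR's code; lookups scan the queue linearly for brevity, whereas a real implementation would pair the queue with a hash map):

use std::collections::VecDeque;

struct Entry<K, V> {
    key: K,
    value: V,
    visited: bool,
}

pub struct SieveCache<K: PartialEq, V> {
    queue: VecDeque<Entry<K, V>>, // front = newest insertion, back = oldest
    hand: Option<usize>,          // None = start scanning from the oldest entry
    capacity: usize,
}

impl<K: PartialEq, V> SieveCache<K, V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self { queue: VecDeque::new(), hand: None, capacity }
    }

    /// A hit only sets the visited bit: no reordering, which is what makes
    /// SIEVE cheap for read-heavy workloads compared to classic LRU.
    pub fn get(&mut self, key: &K) -> Option<&V> {
        for entry in self.queue.iter_mut() {
            if &entry.key == key {
                entry.visited = true;
                return Some(&entry.value);
            }
        }
        None
    }

    pub fn put(&mut self, key: K, value: V) {
        if self.get(&key).is_some() {
            return; // already cached; a real cache would replace the value here
        }
        if self.queue.len() >= self.capacity {
            self.evict();
        }
        self.queue.push_front(Entry { key, value, visited: false });
        if let Some(h) = self.hand.as_mut() {
            *h += 1; // pushing at the front shifts the existing indices by one
        }
    }

    /// Move the "hand" from the oldest entry toward the newest, clearing
    /// visited bits, and evict the first unvisited entry found.
    fn evict(&mut self) {
        let mut i = self.hand.unwrap_or(self.queue.len() - 1);
        loop {
            if self.queue[i].visited {
                self.queue[i].visited = false;
                i = if i == 0 { self.queue.len() - 1 } else { i - 1 };
            } else {
                self.queue.remove(i);
                self.hand = if i == 0 { None } else { Some(i - 1) };
                return;
            }
        }
    }
}

The visited bit gives recently accessed entries one extra pass before eviction, approximating LRU without mutating the queue order on every read.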

Contributor

I spoke with @XiangpengHao this afternoon and we have a proposal:

  1. Use a FIFO queue -- aka evict the oldest elements first (e.g. store Paths in a VecDeque or something and remove them from the HashSet in order when the cache is full); see the sketch at the end of this comment

Benefits of this approach

  1. Simple to implement
  2. Predictable, easy to explain behavior

Downsides:

  1. Can evict items that are frequently used
  2. Doesn't account for how costly an item is to keep in the cache

However, I think this is an appropriate tradeoff for the default implementation because

  1. a properly sized cache will have no evictions (so the eviction strategy for the simple case doesn't matter)
  2. users can supply their own implementations for more advanced usecases

I argue against including anything more complicated in the initial implementation because of the many tradeoffs. A cache eviction strategy needs to weigh multiple different factors, for example

  1. The cost (e.g. size in bytes) of keeping elements in the cache
  2. The benefit (e.g. how much will the cached metadata improve future queries)
  3. The cost of next miss (e.g. the cost to reload metadata from a local disk is likely much lower than from S3)

The relative tradeoffs between these factors likely vary substantially from system to system

Users with more advanced needs can implement whatever strategy is most appropriate to their system via the traits
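To make the FIFO proposal above concrete, here is a minimal sketch with a byte budget (hypothetical types and names, not DataFusion code):

use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

pub struct FifoCache<K: Eq + Hash + Clone, V> {
    entries: HashMap<K, (V, usize)>, // value plus its approximate size in bytes
    order: VecDeque<K>,              // insertion order; front = oldest
    memory_used: usize,
    memory_limit: usize,
}

impl<K: Eq + Hash + Clone, V> FifoCache<K, V> {
    pub fn new(memory_limit: usize) -> Self {
        Self {
            entries: HashMap::new(),
            order: VecDeque::new(),
            memory_used: 0,
            memory_limit,
        }
    }

    /// Reads never touch the eviction order, so hits need no bookkeeping.
    pub fn get(&self, key: &K) -> Option<&V> {
        self.entries.get(key).map(|(v, _)| v)
    }

    pub fn put(&mut self, key: K, value: V, size: usize) {
        if size > self.memory_limit {
            return; // an entry larger than the whole budget is never cached
        }
        if let Some((_, old_size)) = self.entries.remove(&key) {
            self.memory_used -= old_size;
            self.order.retain(|k| k != &key);
        }
        // Evict the oldest entries until the new one fits.
        while self.memory_used + size > self.memory_limit {
            match self.order.pop_front() {
                Some(oldest) => {
                    if let Some((_, s)) = self.entries.remove(&oldest) {
                        self.memory_used -= s;
                    }
                }
                None => break,
            }
        }
        self.memory_used += size;
        self.order.push_back(key.clone());
        self.entries.insert(key, (value, size));
    }
}

Because reads are pure lookups, a structure like this can sit behind an RwLock without the promotion bookkeeping that an LRU needs on every hit.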

Contributor Author

I had a custom LRU queue implementation almost finished when I got this notification 😅.
I left the details below.

Contributor

dropping https://crates.io/crates/s3-fifo as an interesting related thing

Contributor

I looked at the (very nicely) coded LRU implementation in this PR and I think it is quite good. Unless there are any objections, I think we can go with what is here

self.put(key, value)
}

fn remove(&mut self, k: &ObjectMeta) -> Option<Arc<dyn FileMetadata>> {
Contributor Author

I noticed that the remove method is the only one in the CacheAccessor trait that expects a &mut. This appears to be inconsistent with the other update methods, but I did not change it since the trait is public.
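For context, an abbreviated sketch of the signatures being discussed (written from memory and simplified; the real CacheAccessor trait is generic over key/value types and has more methods, and the omitted details may differ):

pub trait CacheAccessor<K, V> {
    /// Reads and inserts take &self, relying on interior mutability ...
    fn get(&self, k: &K) -> Option<V>;
    fn put(&self, key: K, value: V) -> Option<V>;

    /// ... while remove alone takes &mut self, which is the inconsistency
    /// noted above.
    fn remove(&mut self, k: &K) -> Option<V>;
}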

Contributor

Yeah, I suggest we propose this cleanup in a follow on PR / ticket so we can discuss it separately if desired

@alamb
Contributor

alamb commented Aug 4, 2025

Thank you -- I will review this later today

Contributor

@alamb alamb left a comment

Thank you @nuno-faria -- I found this PR well commented, and well tested as always 🏆

My only hesitation is adding the lru dependency -- however, I spent a while this afternoon trying to come up with a plausible alternative and I was not able to do so

I would be curious what you think / if you have some other way to implement an LRU cache without a new dependency


update_limit(&ctx, "2G").await;
assert_eq!(get_limit(&ctx), Some(2 * 1024 * 1024 * 1024));

update_limit(&ctx, "123K").await;
Contributor

nice

let mut source = ParquetSource::new(self.options.clone());

// Use the CachedParquetFileReaderFactory when metadata caching is enabled
if self.options.global.cache_metadata {
Contributor

What are your thoughts about (in a follow on PR) removing the options.cache_metadata and always trying to save the metadata (which will be a noop if there is no room)?

Contributor Author

I think caching by default would be good. The only situation where it wouldn't help would be one-time scans of parquet files that do not require the page index, but for large files the scan should largely outweigh the page index retrieval anyway.


file_statistic_cache: Option<FileStatisticsCache>,
list_files_cache: Option<ListFilesCache>,
file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
file_metadata_cache: Arc<dyn FileMetadataCache>,
Contributor

Seeing the idea of having a default file_metadata_cache installed got me thinking about @BlakeOrth's comment here: #16971 (comment)

After this work to cache file metadata, it seems like we may want to consider adding default caches for ListFiles and FileStatistics as well 🤔 (as a follow on PR of course)



/// assert_eq!(lru_queue.pop(), Some((2, 20)));
/// assert_eq!(lru_queue.pop(), None);
/// ```
pub struct LruQueue<K: Eq + Hash + Clone, V> {
Contributor

I didn't see that you had implemented this -- I will review shortly

Contributor Author

This implementation uses no unsafe blocks. The "unsafest" part is when we upgrade the Weak pointers in the doubly-linked list and then unwrap them; however, we guarantee that the strong reference is always kept in the data map.
While I believe this implementation could be more efficient, I tried to keep it as simple as possible (e.g., get does a remove and a put, instead of something more complex). A quick bench shows that it reaches >1M puts/gets/pops per second, which should be enough.
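To illustrate the "get does a remove and a put" semantics, here is a deliberately simplified sketch (it promotes in O(n) by scanning a VecDeque; the LruQueue in this PR gets O(1) promotion from its doubly-linked list of Weak-pointer nodes, so this is not the PR's code):

use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

pub struct SimpleLruQueue<K: Eq + Hash + Clone, V> {
    data: HashMap<K, V>,
    order: VecDeque<K>, // front = most recently used, back = least recently used
}

impl<K: Eq + Hash + Clone, V> SimpleLruQueue<K, V> {
    pub fn new() -> Self {
        Self { data: HashMap::new(), order: VecDeque::new() }
    }

    pub fn put(&mut self, key: K, value: V) -> Option<V> {
        let previous = self.remove(&key);
        self.order.push_front(key.clone());
        self.data.insert(key, value);
        previous
    }

    /// Promote the entry to most recently used: conceptually a remove
    /// followed by a put of its position in the queue.
    pub fn get(&mut self, key: &K) -> Option<&V> {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            let k = self.order.remove(pos).expect("position is valid");
            self.order.push_front(k);
        }
        self.data.get(key)
    }

    pub fn remove(&mut self, key: &K) -> Option<V> {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.data.remove(key)
    }

    /// Evict and return the least recently used entry.
    pub fn pop(&mut self) -> Option<(K, V)> {
        let key = self.order.pop_back()?;
        let value = self.data.remove(&key)?;
        Some((key, value))
    }
}

Usage follows the doc example above: put entries, get to promote them to most recently used, pop to evict the least recently used.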

Contributor

I think we can start with this and simplify / improve the performance as follow on PRs if people hit issues with it

Contributor

@alamb alamb left a comment

Thank you @nuno-faria -- I think this PR looks great to me. There are several items I think we should improve, but they can also be done as follow-on PRs and are not strictly required here

  1. Change default to something more like 50MB
  2. Use usize rather than Option for cache
  3. Avoid unreachable / panic
  4. Update the config description:

Nice to consider:

  1. using parking_lot: https://github.com/nuno-faria/datafusion/pull/1/files

I also reviewed the test coverage using

cargo llvm-cov test --html -p datafusion-execution -- lru_queue

And all the code is covered:
[screenshot: lru_queue coverage report, 2025-08-05]

(BTW the only thing not covered is is_empty which I don't think is a problem):
[screenshot: coverage report showing only is_empty uncovered, 2025-08-05]


@nuno-faria
Contributor Author

Thanks @alamb for the review. I think the remaining TODOs are now finished.

> I also reviewed the test coverage using
>
> cargo llvm-cov test --html -p datafusion-execution -- lru_queue

Thanks for the tip. I added the missing test to get to 100% coverage.

@alamb
Contributor

alamb commented Aug 6, 2025

> Thanks @alamb for the review. I think the remaining TODOs are now finished.
>
> I also reviewed the test coverage using
>
> cargo llvm-cov test --html -p datafusion-execution -- lru_queue
>
> Thanks for the tip. I added the missing test to get to 100% coverage.

100% coverage -- now that is attention to detail!

@alamb
Contributor

alamb commented Aug 6, 2025

I just looked at the latest changes and they look great to me. I'll plan to merge this PR tomorrow unless anyone else would like time to review

@alamb alamb merged commit a9e6d4b into apache:main Aug 7, 2025
28 checks passed
@alamb
Contributor

alamb commented Aug 7, 2025

Let's go! Thanks again @nuno-faria

@nuno-faria nuno-faria deleted the cache_metadata_limit branch August 7, 2025 10:35
Comment on lines -177 to +320
-        self.metadata
-            .get(&k.location)
-            .map(|s| {
-                let (extra, metadata) = s.value();
-                if extra.size != k.size || extra.last_modified != k.last_modified {
-                    None
-                } else {
-                    Some(Arc::clone(metadata))
-                }
-            })
-            .unwrap_or(None)
+        let mut state = self.state.lock().unwrap();
+        state.get(k)
@abhita abhita Sep 17, 2025

Hi @nuno-faria @alamb
Trying to comprehend the requirement of locks over reads.
From what I can understand, for this operation the state of the MetadataCache is altered, since cache_hits is incremented for each read operation. Wouldn't this affect cache read throughput in the case of concurrent reads?

Could you help shed some light on why the lock is required here?

Contributor Author

In addition to cache_hits, the get operation also changes the queue itself, since it promotes elements to the top to maintain the least-recently-used ordering. Because the queue mutates, the lock is required.

As for the performance, I tested it at the time and it could comfortably handle millions of reads/writes per second, so it should not be a problem. In any case, the user can always supply a custom FileMetadataCache implementation if necessary.
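To make that concrete, a tiny sketch (hypothetical types, not the actual DefaultFilesMetadataCache): the public get takes &self, so the mutable LRU state has to sit behind a lock.

use std::sync::Mutex;

struct LruState {
    cache_hits: usize,
    // ... the LRU queue and the entry map also live here, and both are
    // mutated on every hit when the entry is promoted ...
}

pub struct SharedMetadataCache {
    state: Mutex<LruState>,
}

impl SharedMetadataCache {
    pub fn new() -> Self {
        Self { state: Mutex::new(LruState { cache_hits: 0 }) }
    }

    /// Looks like a read, but bumping the hit counter and promoting the
    /// entry both require exclusive access to the state.
    pub fn get(&self /* , k: &ObjectMeta */) {
        let mut state = self.state.lock().unwrap();
        state.cache_hits += 1;
        // look up the entry and move it to the front of the LRU queue here
    }
}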

Contributor

If you are seeing any issues or have suggestions for improvements @abhita we would love to hear them as well

@abhita abhita Sep 17, 2025

If maintaining the data structure for eviction is the key factor here, could we use a standard/common concurrent structure like DashMap for the cache entries and a separate tracking structure for the keys (i.e., a cache structure agnostic of eviction policies)?
This way, the code becomes much more extensible for plugging in custom eviction policies in the future.

This would also give us the opportunity to explore locking per key rather than the whole cache.
Any thoughts?
@nuno-faria @alamb

I'm referring to the liquid-cache implementation: https://github.com/XiangpengHao/liquid-cache/blob/d232270cfaf495ba257d748a51123673409f7c72/src/storage/src/cache/policies.rs#L85
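A rough sketch of that separation (this assumes the dashmap crate, which DataFusion does not currently depend on in this code path, and the EvictionPolicy trait here is hypothetical rather than an existing API):

use std::hash::Hash;

use dashmap::DashMap;

/// Eviction policy that only ever sees keys, so the entry storage stays
/// agnostic of how victims are chosen.
pub trait EvictionPolicy<K>: Send + Sync {
    fn on_insert(&self, key: &K);
    fn on_access(&self, key: &K);
    fn pick_victim(&self) -> Option<K>;
}

pub struct PolicyCache<K: Eq + Hash + Clone, V: Clone, P: EvictionPolicy<K>> {
    entries: DashMap<K, V>, // sharded locks, so concurrent reads rarely contend
    policy: P,
    max_entries: usize,
}

impl<K: Eq + Hash + Clone, V: Clone, P: EvictionPolicy<K>> PolicyCache<K, V, P> {
    pub fn new(policy: P, max_entries: usize) -> Self {
        Self { entries: DashMap::new(), policy, max_entries }
    }

    pub fn get(&self, key: &K) -> Option<V> {
        let value = self.entries.get(key).map(|r| r.value().clone());
        if value.is_some() {
            // Only the policy's own tracking structure is updated on a hit,
            // not the shared entry map.
            self.policy.on_access(key);
        }
        value
    }

    pub fn put(&self, key: K, value: V) {
        while self.entries.len() >= self.max_entries {
            match self.policy.pick_victim() {
                Some(victim) => {
                    self.entries.remove(&victim);
                }
                None => break,
            }
        }
        self.policy.on_insert(&key);
        self.entries.insert(key, value);
    }
}

Whether this actually beats a single Mutex for this workload would need to be shown with a benchmark.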

Contributor

@alamb alamb Sep 18, 2025

Thanks @abhita! I think if you are interested in alternative approaches, the best thing to do would be:

  1. Create some sort of benchmark (maybe make 10k little parquet files with small metadata) that shows cache management consuming significant time

Then you could explore various strategies to improve the performance

@abhita

Got it.
Curious why the cache data structure is tightly coupled to the eviction policy.
@alamb @nuno-faria

Contributor

I am not sure how to answer that question. It sounds like a more general software structure question rather than something specific about this code

Development

Successfully merging this pull request may close these issues.

[Parquet Metadata Cache]: Limit memory used