
20047: Add a memory bound FileStatisticsCache for the Listing Table #236

Open
martin-augment wants to merge 6 commits into main from
pr-20047-2026-02-10-06-31-53

Conversation

@martin-augment
Owner

20047: To review by AI

@coderabbitai

coderabbitai bot commented Feb 10, 2026

Walkthrough

This pull request refactors the file statistics caching architecture across DataFusion. The changes include: (1) introducing a new DFHeapSize trait in the common crate for computing heap-allocated memory across various data structures; (2) making the file statistics cache optional in ListingTable rather than mandatory; (3) replacing the DashMap-based DefaultFileStatisticsCache with a Mutex-protected LRU queue implementation that enforces configurable memory limits; (4) adding configuration options for file statistics cache limits throughout SessionContext, RuntimeEnv, and CacheManagerConfig; (5) renaming cache-related builder methods from with_files_statistics_cache to with_file_statistics_cache for consistency; and (6) updating test expectations and test files to reflect these structural changes.
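As a rough illustration of the trait described in (1), a minimal heap-size trait might look like the sketch below. This is a simplified stand-in based on the walkthrough, not the actual DataFusion trait; the method name and covered types are assumptions.

```rust
/// Sketch of a heap-size trait like the one described above. `heap_size`
/// reports bytes owned on the heap, excluding the inline size of the value.
trait DFHeapSize {
    fn heap_size(&self) -> usize;
}

impl DFHeapSize for String {
    fn heap_size(&self) -> usize {
        // A String owns one heap buffer of `capacity` bytes.
        self.capacity()
    }
}

impl<T: DFHeapSize> DFHeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().map_or(0, DFHeapSize::heap_size)
    }
}

impl<T: DFHeapSize> DFHeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // The backing buffer plus whatever each element owns on the heap.
        self.capacity() * std::mem::size_of::<T>()
            + self.iter().map(DFHeapSize::heap_size).sum::<usize>()
    }
}

fn main() {
    let names = vec![String::from("a.parquet"), String::from("b.parquet")];
    // Buffer for two Strings plus the bytes owned by each name.
    println!("{}", names.heap_size());
}
```

A cache can then sum `heap_size()` over its entries to enforce a byte-based limit rather than an entry-count limit.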

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-20047-2026-02-10-06-31-53

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to DataFusion's caching mechanism by implementing a memory-bound FileStatisticsCache for ListingTable. This change addresses potential memory issues by ensuring that cached file statistics adhere to a configurable memory limit, automatically evicting older entries when the limit is reached. It provides better resource management and improved stability, especially when dealing with a large number of files or long-running sessions, without requiring manual cache invalidation.

Highlights

  • Memory-Bound File Statistics Cache: Introduced a new memory-bound FileStatisticsCache for the ListingTable to efficiently manage and limit the memory used for caching file statistics, preventing unbounded memory growth.
  • DFHeapSize Trait: Added a DFHeapSize trait and implemented it for various DataFusion and Arrow types, enabling accurate calculation of heap memory usage for cached objects.
  • Configurable Cache Limit: Made the FileStatisticsCache memory limit configurable via a new runtime setting datafusion.runtime.file_statistics_cache_limit, allowing users to control memory consumption.
  • ListingTable Integration: Modified the ListingTable to optionally use the new memory-bound FileStatisticsCache, ensuring that file statistics are cached and evicted according to the configured memory limits.
  • Cache Manager Updates: Updated the CacheManager to properly initialize and manage the new memory-bound FileStatisticsCache, including handling its configuration and lifecycle.
Changelog
  • datafusion-cli/src/main.rs
    • Removed unused import of DefaultFileStatisticsCache.
    • Updated snapshot test values to reflect changes in cache hit counts.
  • datafusion/catalog-listing/src/table.rs
    • Changed collected_statistics field in ListingTable to Option<Arc<dyn FileStatisticsCache>> to allow for optional caching.
    • Removed direct instantiation of DefaultFileStatisticsCache in ListingTable::new.
    • Updated with_cache method to directly set the optional cache.
    • Modified cache access logic to safely handle the Option type for collected_statistics.
  • datafusion/common/src/heap_size.rs
    • Added new file defining the DFHeapSize trait for calculating heap memory usage.
    • Implemented DFHeapSize for various DataFusion types (e.g., Statistics, ColumnStatistics, ScalarValue) and Arrow types (e.g., DataType, Vec, HashMap, Arc, String, Field) to support memory-bound caching.
  • datafusion/common/src/lib.rs
    • Exported the new heap_size module.
  • datafusion/core/src/datasource/listing_table_factory.rs
    • Renamed with_files_statistics_cache to with_file_statistics_cache in test configurations for consistency.
  • datafusion/core/src/execution/context/mod.rs
    • Added file_statistics_cache_limit as a new runtime configuration option.
    • Implemented parsing logic for file_statistics_cache_limit values (e.g., '1M', '2G').
    • Added logic to reset file_statistics_cache_limit to its default value.
  • datafusion/core/tests/parquet/file_statistics.rs
    • Renamed with_files_statistics_cache to with_file_statistics_cache in test configurations.
  • datafusion/core/tests/sql/runtime_config.rs
    • Added a new test case to verify the functionality of file_statistics_cache_limit runtime configuration, including setting and retrieving values with different units.
  • datafusion/execution/src/cache/cache_manager.rs
    • Renamed table_files_statistics_cache to file_statistics_cache in CacheManagerConfig for clarity.
    • Added file_statistics_cache_limit to CacheManagerConfig with a default value.
    • Updated FileStatisticsCache trait with cache_limit and update_cache_limit methods.
    • Modified CacheManager::try_new to initialize DefaultFileStatisticsCache with the configured memory limit.
    • Implemented DFHeapSize for CachedFileMetadata to allow memory usage tracking.
    • Added get_file_statistic_cache_limit method to CacheManager.
  • datafusion/execution/src/cache/cache_unit.rs
    • Refactored DefaultFileStatisticsCache to use an LruQueue and Mutex for memory-bound caching instead of DashMap.
    • Introduced DefaultFileStatisticsCacheState to manage the LRU queue, memory limit, and current memory usage.
    • Implemented cache_limit and update_cache_limit methods for DefaultFileStatisticsCache.
    • Updated list_entries to report statistics_size_bytes based on the actual heap size of cached statistics.
    • Added new test cases to verify the memory-bound behavior of the cache, including eviction and rejection of oversized entries.
  • datafusion/execution/src/runtime_env.rs
    • Added file_statistics_cache_limit to the list of runtime configuration entries.
    • Included file_statistics_cache_limit in RuntimeEnvBuilder for setting the cache limit during runtime environment creation.
    • Updated from_runtime_env to correctly retrieve and set the file_statistics_cache_limit.
  • datafusion/sqllogictest/test_files/encrypted_parquet.slt
    • Added a SET datafusion.runtime.file_statistics_cache_limit = '0K' command to disable the file statistics cache for specific tests.
  • datafusion/sqllogictest/test_files/information_schema.slt
    • Added datafusion.runtime.file_statistics_cache_limit 1M to the expected output for information_schema.df_settings.
  • datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt
    • Added a SET datafusion.runtime.file_statistics_cache_limit = '0K' command to disable the file statistics cache for specific tests.
  • datafusion/sqllogictest/test_files/set_variable.slt
    • Added test cases for setting, resetting, and showing the datafusion.runtime.file_statistics_cache_limit variable.
  • docs/source/user-guide/configs.md
    • Updated the user guide documentation to include the new datafusion.runtime.file_statistics_cache_limit configuration option and its description.
Activity
  • The pull request introduces a new memory-bound file statistics cache and integrates it across various DataFusion components.
  • New configuration options for the cache limit have been added and tested.
  • Documentation has been updated to reflect these changes.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude

claude bot commented Feb 10, 2026

Pull Request Review: Add Memory-Bound FileStatisticsCache for ListingTable

Summary

This PR adds a memory-bounded implementation of FileStatisticsCache with LRU eviction for the ListingTable. The implementation includes comprehensive heap size tracking, runtime configuration support, and good test coverage.


Code Quality & Architecture ✅

Strengths:

  1. Well-structured implementation: Clean separation between the cache interface and implementation
  2. Good use of LRU eviction: Leverages the existing LruQueue data structure effectively
  3. Thread-safe design: Proper use of Mutex for concurrent access
  4. Configuration flexibility: Runtime-configurable memory limits via SQL SET commands
  5. Comprehensive test coverage: Multiple test cases covering edge cases and memory management

Areas for Improvement:

1. Heap Size Calculation Accuracy ⚠️

Location: datafusion/common/src/heap_size.rs:402-406

The Arc heap size calculation may not be accurate - it counts the allocation for every Arc clone, but multiple Arc instances share the same allocation. This could lead to significant overcounting in the cache memory usage.

Recommendation: Consider adding documentation that this measures "logical ownership size" rather than actual memory, or track Arc reference counts to avoid double-counting.

2. HashMap Heap Size Approximation ⚠️

Location: datafusion/common/src/heap_size.rs:359-398

The HashMap heap size calculation is based on hashbrown v0.16.0 implementation details, which is fragile and may break with future updates.

Recommendation: Add a test that validates the approximation, consider upstreaming to hashbrown/std, and document the version dependency clearly.

3. Memory Accounting in put Method 🐛

Location: datafusion/execution/src/cache/cache_unit.rs:91-112

When replacing an existing key, the code temporarily inflates memory_used (adds new before subtracting old). While evict_entries() fixes this, it could cause unnecessary evictions of other entries.

Recommendation: Calculate the net change first to avoid temporary inflation.
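As an illustration of the net-change approach, the accounting could be applied in a single step. The field name mirrors the one discussed above, but the surrounding type here is assumed, not DataFusion code:

```rust
/// Illustrative only: apply the net memory delta of a replacement in one
/// step, so `memory_used` never transiently exceeds the configured limit.
struct Accounting {
    memory_used: usize,
}

impl Accounting {
    /// `old_size` is the heap size of the entry being replaced (if any);
    /// `new_size` is the heap size of the incoming entry.
    fn apply_replacement(&mut self, old_size: Option<usize>, new_size: usize) {
        let old = old_size.unwrap_or(0);
        // Instead of `memory_used += new; memory_used -= old;` (which
        // inflates the counter in between and can trigger spurious
        // evictions), compute the signed net change first.
        if new_size >= old {
            self.memory_used += new_size - old;
        } else {
            self.memory_used -= old - new_size;
        }
    }
}

fn main() {
    let mut acc = Accounting { memory_used: 100 };
    acc.apply_replacement(Some(30), 50); // net +20
    println!("{}", acc.memory_used); // 120
}
```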

4. debug_assert in Production Code ⚠️

Location: datafusion/execution/src/cache/cache_unit.rs:142-145

Using debug_assert!(false, ...) is unconventional. This should either be a panic in release mode if truly impossible, or handled gracefully if it could occur due to heap size estimation errors.


Performance Considerations ✅

Strengths:

  1. O(1) average case for cache operations thanks to HashMap + doubly-linked list
  2. Bounded memory usage prevents unbounded growth
  3. LRU eviction keeps frequently-used statistics in cache

Potential Concerns:

5. Lock Contention ⚠️

Location: datafusion/execution/src/cache/cache_unit.rs:43-44

Every cache operation requires acquiring a mutex lock, which could be a bottleneck under high concurrency. Consider using DashMap or sharding the cache by path hash.
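The sharding idea can be sketched as follows; this is a generic illustration rather than DataFusion code, and the shard count and key type are arbitrary:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Sketch of lock sharding: the key's hash selects one of N independently
/// locked maps, so operations on different paths rarely contend.
struct ShardedCache<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V: Clone> ShardedCache<V> {
    fn new(num_shards: usize) -> Self {
        let shards = (0..num_shards)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    fn shard(&self, key: &str) -> &Mutex<HashMap<String, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[hasher.finish() as usize % self.shards.len()]
    }

    fn put(&self, key: &str, value: V) {
        self.shard(key).lock().unwrap().insert(key.to_owned(), value);
    }

    fn get(&self, key: &str) -> Option<V> {
        self.shard(key).lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let cache: ShardedCache<u64> = ShardedCache::new(16);
    cache.put("part-0.parquet", 1024);
    println!("{:?}", cache.get("part-0.parquet")); // Some(1024)
}
```

Note that sharding complicates LRU eviction, since recency ordering and the memory budget become per-shard rather than global — one reason a single-Mutex design is simpler for a memory-bounded cache.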

6. Path Cloning ⚠️

The code clones Path objects frequently. Profile to determine if this is a bottleneck.


Security Concerns ✅

No significant security issues found. The implementation properly bounds memory usage, uses safe Rust with appropriate synchronization, and validates entry sizes.


Test Coverage ✅

Strengths:

Comprehensive unit tests, integration tests, and edge case coverage.

Suggestions:

  • Add concurrency tests for multi-threaded access patterns
  • Add heap size validation tests to verify calculations are reasonably accurate

API & Naming ✅

Breaking Change ⚠️

Location: datafusion/execution/src/cache/cache_manager.rs:920

The rename from with_files_statistics_cache to with_file_statistics_cache is a breaking API change. Consider deprecating the old method for one release cycle.


Summary

This is a well-implemented feature that adds important memory bounds to file statistics caching.

Required Changes (High Priority):

  1. Fix the memory accounting logic in put() to avoid temporary inflation
  2. Replace debug_assert!(false) with proper error handling
  3. Add deprecation wrapper for renamed API method

Recommended Changes (Medium Priority):

  1. Document the Arc heap size overcounting issue
  2. Add concurrency tests
  3. Consider lock contention mitigation strategies

Overall Assessment: ✅ Approve with minor changes

The implementation is solid and addresses a real need for memory-bounded caching. The suggested fixes are relatively minor.


Generated with Claude Code - AI-assisted code review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a memory-bound FileStatisticsCache to improve resource management in DataFusion, leveraging a new DFHeapSize trait and an LRU eviction policy. While this is a positive step towards preventing unbounded memory usage and potential OOM issues, a critical vulnerability was identified: the parse_memory_limit function in the configuration handling logic does not correctly handle empty strings, which could lead to a panic and a Denial of Service when processing malformed SQL commands. Additionally, there are a couple of suggestions for improvement, including a correctness issue with the cache size calculation.

Comment on lines +108 to +117
impl DFHeapSize for CachedFileMetadata {
    fn heap_size(&self) -> usize {
        self.meta.size.heap_size()
            + self.meta.last_modified.heap_size()
            + self.meta.version.heap_size()
            + self.meta.e_tag.heap_size()
            + self.meta.location.as_ref().heap_size()
            + self.statistics.heap_size()
    }
}


high

The heap_size implementation for CachedFileMetadata doesn't account for the memory used by self.ordering. This can lead to the cache size being underestimated, potentially causing it to grow beyond its memory limit. The ordering field is an Option<LexOrdering>, which can hold a significant amount of data on the heap.

To fix this, DFHeapSize would need to be implemented for PhysicalExpr and related types, and then self.ordering.heap_size() should be added to the sum. Given the complexity, this might be better addressed in a follow-up PR, but it's an important correctness issue for the memory-bound cache.

Owner Author


value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The ordering field might occupy a big amount of heap space, so it should be tracked too. A TODO/issue should be added to be improved in a follow-up pull request.

    builder.with_object_list_cache_ttl(Some(duration))
}
"file_statistics_cache_limit" => {
    let limit = Self::parse_memory_limit(value)?;

security-medium medium

The parse_memory_limit function is vulnerable to a panic when an empty string is provided. This is because it calls split_at(limit.len() - 1) without checking if the string is empty, leading to a usize underflow. An attacker with SQL access could trigger this by running SET datafusion.runtime.file_statistics_cache_limit = '', causing a Denial of Service (DoS) by crashing the process.

Recommendation: Add a check for empty strings in parse_memory_limit and return an error instead of panicking.
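A defensive version of the parser could look like the sketch below. The function name comes from the review, but the body is an assumption about, not a copy of, the DataFusion implementation:

```rust
/// Parse a human-readable limit such as "1M" or "2G" into bytes, rejecting
/// empty input instead of panicking on the `len() - 1` underflow.
fn parse_memory_limit(limit: &str) -> Result<usize, String> {
    if limit.is_empty() {
        return Err("memory limit cannot be an empty string".to_string());
    }
    let (number, unit) = limit.split_at(limit.len() - 1);
    let number: usize = number
        .parse()
        .map_err(|_| format!("invalid memory limit '{limit}'"))?;
    match unit {
        "K" => Ok(number * 1024),
        "M" => Ok(number * 1024 * 1024),
        "G" => Ok(number * 1024 * 1024 * 1024),
        other => Err(format!("unsupported unit '{other}', expected K, M, or G")),
    }
}

fn main() {
    println!("{:?}", parse_memory_limit("1M")); // Ok(1048576)
    println!("{:?}", parse_memory_limit("")); // Err(...)
}
```

Since `split_at` is byte-indexed, a trailing multi-byte character would also need guarding in a production version.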

Owner Author


value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The parse_memory_limit() function does not validate its input and this may lead to a panic when the input is an empty string. This is not caused by the changes in this Pull Request but it would be good to fix it while modifying this code.

Comment on lines +331 to +336
let list_files_cache = Arc::new(DefaultFileStatisticsCache::default());

let rt = RuntimeEnvBuilder::new()
.with_cache_manager(
CacheManagerConfig::default()
.with_file_statistics_cache(Some(list_files_cache)),


medium

The variable list_files_cache is of type DefaultFileStatisticsCache. For clarity and to avoid confusion, it would be better to rename it to file_statistics_cache.

Suggested change
-let list_files_cache = Arc::new(DefaultFileStatisticsCache::default());
-let rt = RuntimeEnvBuilder::new()
-    .with_cache_manager(
-        CacheManagerConfig::default()
-            .with_file_statistics_cache(Some(list_files_cache)),
+let file_statistics_cache = Arc::new(DefaultFileStatisticsCache::default());
+let rt = RuntimeEnvBuilder::new()
+    .with_cache_manager(
+        CacheManagerConfig::default()
+            .with_file_statistics_cache(Some(file_statistics_cache)),

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The list_files_cache variable name is copied from another test and is misleading for this test case. It would be good to rename it to file_statistics_cache to avoid confusion.

@augmentcode

augmentcode bot commented Feb 10, 2026

🤖 Augment PR Summary

Summary: This PR introduces a memory-bounded, LRU-based FileStatisticsCache for ListingTable / Parquet file statistics to prevent unbounded cache growth and make cache sizing configurable at runtime.

Changes:

  • Adds DFHeapSize in datafusion-common to estimate heap usage of statistics-related types
  • Reworks DefaultFileStatisticsCache to track memory usage and evict entries via an LRU queue
  • Extends FileStatisticsCache with limit introspection and dynamic limit updates
  • Updates CacheManagerConfig/CacheManager to create/configure the statistics cache (default limit: 1MiB)
  • Wires ListingTable to use the session’s cache-manager-provided statistics cache (optional)
  • Adds runtime setting datafusion.runtime.file_statistics_cache_limit with SET/RESET/SHOW support
  • Updates CLI/tests/sqllogictest fixtures and user docs to reflect the new setting and behavior

Technical Notes: Cache entries store ObjectMeta + inferred Statistics + optional ordering, and eviction is based on estimated heap usage.

🤖 Was this summary useful? React with 👍 or 👎


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/execution/src/cache/cache_unit.rs (1)

30-41: ⚠️ Potential issue | 🟡 Minor

Stale documentation: still references DashMap.

Line 39 says "Uses DashMap for lock-free concurrent access" but the implementation now uses Mutex<DefaultFileStatisticsCacheState> with an LRU queue. Update the doc comment to reflect the new design.

📝 Proposed fix
-/// Uses DashMap for lock-free concurrent access.
+/// Uses a Mutex-protected LRU queue with a configurable memory limit.
🤖 Fix all issues with AI agents
In `@datafusion/execution/src/cache/cache_manager.rs`:
- Around line 108-117: The heap_size implementation for CachedFileMetadata
(DFHeapSize::heap_size) omits the ordering: Option<LexOrdering> field,
undercounting memory; update the function to include the ordering's heap size
(e.g., add self.ordering.heap_size() or self.ordering.as_ref().heap_size()
depending on available trait impls) alongside the other meta fields so Optional
LexOrdering memory is accounted for in CachedFileMetadata::heap_size.

In `@datafusion/execution/src/cache/cache_unit.rs`:
- Around line 91-112: In put(), before early-returning when the new entry_size
exceeds self.memory_limit, check for and remove any existing stale entry for the
same key: use self.lru_queue.remove(key) (or the appropriate removal method on
lru_queue) to retrieve and subtract its heap_size() from self.memory_used so the
old entry is cleared; then return None. Ensure you reference the existing
symbols put, lru_queue, memory_used, memory_limit, evict_entries (if still
needed), CachedFileMetadata, and Path when locating and updating the code.

In `@datafusion/sqllogictest/test_files/set_variable.slt`:
- Line 416: The comment string contains a typo:
"rruntime.file_statistics_cache_limit" should be
"runtime.file_statistics_cache_limit"; update the comment text in the test
header (the line that reads "Test SET and SHOW
rruntime.file_statistics_cache_limit") to read "Test SET and SHOW
runtime.file_statistics_cache_limit" so the configuration key name is spelled
correctly.
🧹 Nitpick comments (6)
datafusion/core/tests/sql/runtime_config.rs (1)

330-337: Misleading variable name: list_files_cache holds a DefaultFileStatisticsCache.

This appears to be a copy-paste artifact from test_list_files_cache_limit. The variable should be renamed to something like file_statistics_cache for clarity.

Suggested fix
 async fn test_file_statistics_cache_limit() {
-    let list_files_cache = Arc::new(DefaultFileStatisticsCache::default());
+    let file_statistics_cache = Arc::new(DefaultFileStatisticsCache::default());
 
     let rt = RuntimeEnvBuilder::new()
         .with_cache_manager(
             CacheManagerConfig::default()
-                .with_file_statistics_cache(Some(list_files_cache)),
+                .with_file_statistics_cache(Some(file_statistics_cache)),
         )
datafusion/execution/src/runtime_env.rs (1)

456-459: Missing doc comment on with_file_statistics_cache_limit.

Other similar builder methods (with_metadata_cache_limit, with_object_list_cache_limit) have doc comments describing the parameter. This one should follow the same pattern.

Suggested fix
+    /// Specifies the memory limit for the file statistics cache, in bytes.
     pub fn with_file_statistics_cache_limit(mut self, limit: usize) -> Self {
datafusion/common/src/heap_size.rs (2)

172-175: Variable names u8 and i8 shadow built-in type names.

While this compiles correctly (pattern bindings are just identifiers), using Rust primitive type names as variable names is confusing to read. Consider precision/scale or p/s instead.

Suggested fix
-            Decimal32(u8, i8) => u8.heap_size() + i8.heap_size(),
-            Decimal64(u8, i8) => u8.heap_size() + i8.heap_size(),
-            Decimal128(u8, i8) => u8.heap_size() + i8.heap_size(),
-            Decimal256(u8, i8) => u8.heap_size() + i8.heap_size(),
+            Decimal32(p, s) => p.heap_size() + s.heap_size(),
+            Decimal64(p, s) => p.heap_size() + s.heap_size(),
+            Decimal128(p, s) => p.heap_size() + s.heap_size(),
+            Decimal256(p, s) => p.heap_size() + s.heap_size(),

247-251: Fields heap size slightly undercounts — the outer Arc<[FieldRef]> allocation is not included.

Fields wraps an Arc<[Arc<Field>]>. This implementation sums per-element Arc<Field>::heap_size() but doesn't account for the outer Arc's allocation (ref counts + the slice buffer). The undercount is small and constant per Fields instance, so this is acceptable as an approximation, but worth noting.

datafusion/execution/src/cache/cache_unit.rs (2)

96-103: Memory accounting only tracks value.heap_size(), not the key or LRU node overhead.

Each entry in the LRU queue also stores a cloned Path key (heap-allocated string) and a linked-list node. These are not included in memory_used, so actual heap consumption can exceed memory_limit — proportionally more so with many small entries. Consider adding the key's heap size to the accounting for more accurate bounds.

Sketch
-        let entry_size = value.heap_size();
+        let entry_size = key.as_ref().len() + value.heap_size();

Apply the same adjustment in remove and evict_entries where heap_size is subtracted.


598-626: Nit: step_by(1) is redundant.

Line 601: (0..=10).step_by(1) — the step is already 1 by default for range iterators.

Proposed fix
-        let series: Vec<i32> = (0..=10).step_by(1).collect();
+        let series: Vec<i32> = (0..=10).collect();

Comment on lines +108 to +117
impl DFHeapSize for CachedFileMetadata {
    fn heap_size(&self) -> usize {
        self.meta.size.heap_size()
            + self.meta.last_modified.heap_size()
            + self.meta.version.heap_size()
            + self.meta.e_tag.heap_size()
            + self.meta.location.as_ref().heap_size()
            + self.statistics.heap_size()
    }
}


⚠️ Potential issue | 🟡 Minor

ordering field is not included in heap_size(), leading to under-counted memory usage.

CachedFileMetadata has an ordering: Option<LexOrdering> field (Line 55) that is missing from the heap size calculation. For files with sort ordering metadata, this omission means the cache's memory accounting will be lower than actual usage, potentially allowing the cache to exceed its configured memory limit.

Suggested fix
 impl DFHeapSize for CachedFileMetadata {
     fn heap_size(&self) -> usize {
         self.meta.size.heap_size()
             + self.meta.last_modified.heap_size()
             + self.meta.version.heap_size()
             + self.meta.e_tag.heap_size()
             + self.meta.location.as_ref().heap_size()
             + self.statistics.heap_size()
+            // TODO: add self.ordering.heap_size() once DFHeapSize is implemented for LexOrdering
     }
 }

Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The ordering field might occupy a big amount of heap space, so it should be tracked too. A TODO/issue should be added to be improved in a follow-up pull request.

Comment on lines +91 to +112
fn put(
    &mut self,
    key: &Path,
    value: CachedFileMetadata,
) -> Option<CachedFileMetadata> {
    let entry_size = value.heap_size();

    if entry_size > self.memory_limit {
        return None;
    }

    let old_value = self.lru_queue.put(key.clone(), value);
    self.memory_used += entry_size;

    if let Some(old_entry) = &old_value {
        self.memory_used -= old_entry.heap_size();
    }

    self.evict_entries();

    old_value
}


⚠️ Potential issue | 🟠 Major

put() does not remove the stale entry when the new value exceeds the memory limit.

When entry_size > self.memory_limit (line 98), the method returns None immediately. If the key already exists in the cache (e.g., an older, now-invalid entry for the same path), that stale entry remains, wasting memory and causing get() to keep returning an invalidated value. Consider removing the old entry before the early return.

Proposed fix
         let entry_size = value.heap_size();
 
         if entry_size > self.memory_limit {
+            // Still remove the old entry for this key to avoid retaining stale data
+            self.remove(key);
             return None;
         }

Owner Author

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! If the new value is too big, it is not tracked at all, but any previous entry in the cache for the same key is not removed. Fixing this prevents the size accounting from using a stale/obsolete value for such keys.

----
datafusion.runtime.max_temp_directory_size 10G

# Test SET and SHOW rruntime.file_statistics_cache_limit

⚠️ Potential issue | 🟡 Minor

Typo in comment: "rruntime" → "runtime".

✏️ Fix typo
-# Test SET and SHOW rruntime.file_statistics_cache_limit
+# Test SET and SHOW runtime.file_statistics_cache_limit
🤖 Prompt for AI Agents
In `@datafusion/sqllogictest/test_files/set_variable.slt` at line 416, the comment
string contains a typo: "rruntime.file_statistics_cache_limit" should be
"runtime.file_statistics_cache_limit"; update the comment text in the test
header (the line that reads "Test SET and SHOW
rruntime.file_statistics_cache_limit") to read "Test SET and SHOW
runtime.file_statistics_cache_limit" so the configuration key name is spelled
correctly.

Owner Author

value:good-to-have; category:typo; feedback: The CodeRabbit AI reviewer is correct! There is a small typo in a comment in one of the tests that would be good to fix. It does not cause any harm though!

@augmentcode augmentcode bot left a comment

Review completed. 5 suggestions posted.


fn heap_size(&self) -> usize {
    self.num_rows.heap_size()
        + self.total_byte_size.heap_size()
        + self

Statistics::heap_size sums per-column heap usage but doesn’t include the heap allocation of the column_statistics Vec buffer itself, which can significantly underestimate entry sizes. This can cause the memory-limited file statistics cache to retain more data than its configured limit.

Severity: medium


Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! To measure the heap size of a collection like Vec, it should multiply the item size by the collection's capacity. The implementation here iterates over the length of the collection and ignores any extra capacity beyond the length.

+ self.meta.last_modified.heap_size()
+ self.meta.version.heap_size()
+ self.meta.e_tag.heap_size()
+ self.meta.location.as_ref().heap_size()

CachedFileMetadata::heap_size doesn’t account for ordering, so entries with cached orderings may exceed the intended memory bound. That can make eviction behavior inconsistent with the configured cache limit.

Severity: medium


Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The ordering field might occupy a significant amount of heap space, so it should be tracked too. A TODO/issue should be added so this can be improved in a follow-up pull request.

/// Enabling the cache avoids repeatedly reading file statistics in a DataFusion session.
/// Default is disabled. Currently only Parquet files are supported.
pub table_files_statistics_cache: Option<Arc<dyn FileStatisticsCache>>,
pub file_statistics_cache: Option<Arc<dyn FileStatisticsCache>>,

The surrounding docstrings say the file statistics cache is “disabled by default”, but CacheManager::try_new will create a DefaultFileStatisticsCache whenever file_statistics_cache_limit > 0 (and the default limit is 1MiB). The docs around file_statistics_cache/with_file_statistics_cache may no longer match actual default behavior.

Severity: low


Owner Author

value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The docstring saying the cache is disabled by default is wrong and should be updated.

#[derive(Default)]
pub struct DefaultFileStatisticsCache {
    cache: DashMap<Path, CachedFileMetadata>,
    state: Mutex<DefaultFileStatisticsCacheState>,

The doc comment above DefaultFileStatisticsCache says it “Uses DashMap for lock-free concurrent access”, but the implementation is now Mutex + LruQueue. This looks like a stale comment that could mislead readers about concurrency characteristics.

Severity: low


Owner Author

value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The implementation has changed but the documentation has not been updated. Prevents releasing a new version with wrong documentation.

) -> Option<CachedFileMetadata> {
    let entry_size = value.heap_size();

    if entry_size > self.memory_limit {

When entry_size > memory_limit, DefaultFileStatisticsCacheState::put returns None without removing or returning any existing value for the same key. That violates CacheAccessor::put’s contract (“Returns the previous value if one existed”) and can leave an invalid/obsolete entry resident if a too-large replacement is attempted.

Severity: medium


Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! If the new value is too big, it is not tracked at all, but any previous entry in the cache for the same key is not removed. Fixing this prevents the size accounting from using a stale/obsolete value for such keys.

@martin-augment
Owner Author

1. Heap Size Calculation Accuracy ⚠️

Location: datafusion/common/src/heap_size.rs:402-406

The Arc heap size calculation may not be accurate - it counts the allocation for every Arc clone, but multiple Arc instances share the same allocation. This could lead to significant overcounting in the cache memory usage.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The implementation counts each Arc clone separately, so it will count N×size instead of 1×size, where N is the number of clones. Prevents over-estimating the size of the same object.

@martin-augment
Owner Author

3. Memory Accounting in put Method 🐛

Location: datafusion/execution/src/cache/cache_unit.rs:91-112

When replacing an existing key, the code temporarily inflates memory_used (adds new before subtracting old). While evict_entries() fixes this, it could cause unnecessary evictions of other entries.

Recommendation: Calculate the net change first to avoid temporary inflation.

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The proposed implementation first adds the size of the new value and then subtracts the size of the old value, so in the window in between the total size is not accurate. But it would be the same if the old value's size were subtracted first and the new value's size added afterwards; in fact that would be slightly worse, because during that window the system would think there is more free heap available than there actually is.

@martin-augment
Owner Author

4. debug_assert in Production Code ⚠️

Location: datafusion/execution/src/cache/cache_unit.rs:142-145

Using debug_assert!(false, ...) is unconventional. This should either be a panic in release mode if truly impossible, or handled gracefully if it could occur due to heap size estimation errors.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The idea of using debug_assert!() is to report the issue only during development. If the issue happens in production, the fact that the cache is empty would cause slowdowns but should not break the application.

@martin-augment
Owner Author

30-41: ⚠️ Potential issue | 🟡 Minor

Stale documentation: still references DashMap.

Line 39 says "Uses DashMap for lock-free concurrent access" but the implementation now uses Mutex<DefaultFileStatisticsCacheState> with an LRU queue. Update the doc comment to reflect the new design.

value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! The implementation has changed but the documentation has not been updated. Prevents releasing a new version with wrong documentation.

@martin-augment
Owner Author

172-175: Variable names u8 and i8 shadow built-in type names.

While this compiles correctly (pattern bindings are just identifiers), using Rust primitive type names as variable names is confusing to read. Consider precision/scale or p/s instead.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! The variable names shadow built-in type names, which is very confusing. It would be better to use variable names that describe the variable's purpose, not its type.

@martin-augment
Owner Author

martin-augment commented Feb 10, 2026

96-103: Memory accounting only tracks value.heap_size(), not the key or LRU node overhead.

Each entry in the LRU queue also stores a cloned Path key (heap-allocated string) and a linked-list node. These are not included in memory_used, so actual heap consumption can exceed memory_limit — proportionally more so with many small entries. Consider adding the key's heap size to the accounting for more accurate bounds.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The proposed implementation tracks only the sizes of the values in the cache, but the keys are omitted. The keys' type is Path so they should be tracked too.
