20047: Add a memory bound FileStatisticsCache for the Listing Table #241
martin-augment wants to merge 22 commits into main
Conversation
Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances DataFusion's caching infrastructure by implementing a memory-aware file statistics cache. The new cache prevents unbounded memory growth by employing an LRU eviction strategy and measures object sizes using the newly introduced `DFHeapSize` trait.

Highlights
Walkthrough

This pull request refactors the file statistics caching mechanism to be memory-limited and optional rather than always-present by default.
Pull Request Review: Add Memory-Bound FileStatisticsCache

Summary

This PR adds a memory-bounded file statistics cache with LRU eviction for the ListingTable. The implementation includes a comprehensive heap size tracking system and configuration support. Overall, this is a well-structured implementation with good test coverage.

Code Quality & Best Practices ✅

Strengths:
Areas for Improvement:

1. Heap Size Calculation for HashMap (datafusion/common/src/heap_size.rs:199-239)

The HashMap heap size calculation is an approximation based on hashbrown 0.16.0. This is acknowledged in comments, but consider:
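To make the nature of this approximation concrete, here is a minimal sketch (the function name is hypothetical, not DataFusion's actual API): hashbrown-style tables store one `(K, V)` slot plus roughly one control byte per bucket, so a capacity-based estimate looks like this.

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Hypothetical sketch: approximate a HashMap's heap allocation from its
// capacity. One (K, V) slot plus ~1 control byte per bucket; this is only
// an estimate and can drift across hashbrown versions.
fn approx_hashmap_heap_size<K, V>(map: &HashMap<K, V>) -> usize {
    map.capacity() * (size_of::<(K, V)>() + 1)
}

fn main() {
    // A freshly created map has no allocation, so the estimate is 0.
    let empty: HashMap<u64, u64> = HashMap::new();
    assert_eq!(approx_hashmap_heap_size(&empty), 0);

    // After an insert the estimate becomes nonzero (capacity-dependent).
    let mut map = HashMap::new();
    map.insert(1u64, 2u64);
    assert!(approx_hashmap_heap_size(&map) > 0);
}
```

The key point the review makes is that such an estimate is tied to one hashbrown version's layout and silently drifts when that layout changes.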
2. Arc Heap Size Implementation (datafusion/common/src/heap_size.rs:241-244)

Issue: This counts the full shared allocation for every `Arc` reference, so shared `Arc`s are counted once per reference.

Recommendation: Remove the double-counting; the per-reference estimate is currently:

```rust
2 * size_of::<usize>() + self.as_ref().heap_size()
```

3. Potential Memory Accounting Bug (datafusion/execution/src/cache/cache_unit.rs:97-123)

In the insert path, `memory_used` is updated in a way that temporarily inflates the total when replacing an existing entry.

Recommendation: Reorder to avoid temporary inflation:

```rust
if let Some(old_entry) = &old_value {
    self.memory_used -= old_entry.heap_size();
    self.memory_used += entry_size;
} else {
    self.memory_used += key.heap_size() + entry_size;
}
```

Performance Considerations ⚡

Good:
Concerns:
Potential Bugs 🐛

1. Cache Disabled Behavior

The cache is now optional rather than always present, so it is disabled unless explicitly configured.

2. Missing Validation in update_cache_limit

Setting the limit to 0 would clear the entire cache. Consider whether a limit of 0 should be rejected, or documented as an explicit way to disable the cache.

3. Entry Rejection Logic (datafusion/execution/src/cache/cache_unit.rs:105-109)

If a single entry is larger than the cache limit, it is silently rejected. This could lead to confusing behavior.

Recommendation: Log a warning or add metrics when entries are rejected.

Security Concerns 🔒

No major security issues identified, but consider:
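A minimal sketch of the suggested warning on rejection (the struct and field names below are illustrative assumptions, not the PR's actual code; the real cache would use `log::warn!` or a metrics counter rather than `eprintln!`):

```rust
// Hypothetical sketch: reject an oversized entry with a diagnostic
// instead of dropping it silently.
struct BoundedCache {
    memory_limit: usize,
    memory_used: usize,
}

impl BoundedCache {
    /// Returns true if the entry was accepted.
    fn try_put(&mut self, entry_size: usize) -> bool {
        if entry_size > self.memory_limit {
            // Surface the rejection so operators can see why hit rates drop.
            eprintln!(
                "rejecting cache entry of {entry_size} bytes: larger than limit {}",
                self.memory_limit
            );
            return false;
        }
        self.memory_used += entry_size;
        true
    }
}

fn main() {
    let mut cache = BoundedCache { memory_limit: 1024, memory_used: 0 };
    assert!(cache.try_put(512));
    assert!(!cache.try_put(4096)); // oversized entry is rejected, with a warning
    assert_eq!(cache.memory_used, 512);
}
```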
Test Coverage ✅

Excellent test coverage:
Suggestions:
Overall Assessment

This is a solid implementation that adds important functionality to DataFusion. The code is well-structured and thoroughly tested.

Recommendation: Approve with minor revisions to address the memory accounting issues.

Priority Issues to Address:

Nice-to-Haves:
🤖 Augment PR Summary

Summary: Introduces a memory-bounded file statistics cache for Listing tables to avoid repeatedly inferring Parquet statistics.

Changes:

Technical Notes: Default limit is 1MiB; setting the limit to 0 disables the cache. Entry sizing is currently approximate and does not yet account for ordering (`LexOrdering`).
```rust
};
use chrono::{DateTime, Utc};
use half::f16;
use object_store::path::Path;
```
datafusion-common declares object_store as an optional dependency, but this module unconditionally uses object_store::path::Path, which can break builds when the feature is disabled. Consider feature-gating this (or just the Path impl) to preserve the crate’s feature matrix.
Severity: medium
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! If the new dependency is optional then Cargo will create an implicit feature for it, so the import and the impl could be feature-gated with: #[cfg(feature="object_store")]. This way the impl will be available only for clients which provide this dependency.
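A sketch of the feature-gating the feedback describes (the trait here is a local stand-in for DataFusion's `DFHeapSize`; the `Path` impl body is an illustrative assumption). With `object_store` declared as an optional dependency, Cargo creates an implicit `object_store` feature, and gating both the import and the impl keeps the crate compiling when the feature is off:

```rust
// Local stand-in for DataFusion's DFHeapSize trait.
trait DFHeapSize {
    fn heap_size(&self) -> usize;
}

// Both the import and the impl are compiled only when the implicit
// `object_store` feature (from the optional dependency) is enabled.
#[cfg(feature = "object_store")]
use object_store::path::Path;

#[cfg(feature = "object_store")]
impl DFHeapSize for Path {
    fn heap_size(&self) -> usize {
        // Illustrative body: a Path wraps a string.
        self.as_ref().len()
    }
}

// Ungated impls still compile without the feature.
impl DFHeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

fn main() {
    let s = String::with_capacity(32);
    assert!(s.heap_size() >= 32);
}
```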
| self.num_rows.heap_size() | ||
| + self.total_byte_size.heap_size() | ||
| + self | ||
| .column_statistics |
DFHeapSize for Statistics sums per-column heap allocations but doesn’t include the Vec<ColumnStatistics> backing buffer (capacity * size_of), so cache sizing can significantly undercount. This undercount will also affect statistics_size_bytes and LRU eviction decisions.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! There is no need to iterate over the column_statistics' elements to sum their heap size. There is an implementation for Vec<HeapSize> already that does the right thing: it calculates the memory occupied by the Vec itself and its elements, and it could be reused here.
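The `Vec` impl this feedback refers to typically looks like the following sketch (the trait shown is a local stand-in for DataFusion's `DFHeapSize`, and the exact impl in the PR may differ):

```rust
use std::mem::size_of;

// Local stand-in for DataFusion's DFHeapSize trait.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for i64 {
    fn heap_size(&self) -> usize {
        0 // a plain integer owns no heap allocation
    }
}

// Counts the Vec's own backing buffer plus each element's heap allocations;
// the backing-buffer term is exactly what the manual iteration in
// Statistics::heap_size missed.
impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        self.capacity() * size_of::<T>()
            + self.iter().map(|e| e.heap_size()).sum::<usize>()
    }
}

fn main() {
    // An empty Vec with reserved capacity still owns a heap buffer.
    let v: Vec<i64> = Vec::with_capacity(4);
    assert!(v.heap_size() >= 4 * size_of::<i64>());
}
```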
```diff
-/// Set the cache for files statistics.
+/// Set the cache for file statistics.
 ///
 /// Default is `None` (disabled).
```
The docs here say the file statistics cache is disabled by default, but CacheManagerConfig::default() sets a non-zero limit and CacheManager::try_new creates a default cache when file_statistics_cache is None. Consider updating the docs to reflect the new default-enabled behavior (and that setting the limit to 0 disables it).
Severity: low
value:annoying; category:documentation; feedback: The Augment AI reviewer is not correct! The documentation is for CacheManagerConfig, where file_statistics_cache is initialized with Default::default(), i.e. None. CacheManager::try_new() creates a default non-null value if the config does not provide one.
```rust
) -> (ObjectMeta, CachedFileMetadata) {
    let series: Vec<i32> = (0..=10).collect();
    let values = Int32Array::from(series);
    let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));
```
This ListArray is built with offsets = [0] (0 lists) but a non-empty values array, which may violate Arrow invariants / panic in ListArray::new or yield unexpected scalar semantics. If the intent is a single list scalar value, the offsets should represent one list (N+1 entries) to avoid relying on potentially undefined behavior.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The offset array having just one item means that it will use an empty slice of the values, which does not make much sense for this unit test. The suggested fix prevents using an empty slice instead of a slice with all elements from the values array.
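The offset semantics behind this comment can be shown without Arrow itself: in a list array, `offsets[i]..offsets[i+1]` delimits list `i`, so an offsets buffer of length N+1 encodes N lists. A sketch under that assumption:

```rust
// Derive per-list slices from an Arrow-style offsets buffer.
fn list_slices<'a>(offsets: &[usize], values: &'a [i32]) -> Vec<&'a [i32]> {
    offsets
        .windows(2)
        .map(|w| &values[w[0]..w[1]])
        .collect()
}

fn main() {
    let values: Vec<i32> = (0..=10).collect(); // 11 values, as in the test

    // offsets = [0] encodes zero lists: the values are unreachable.
    assert!(list_slices(&[0], &values).is_empty());

    // offsets = [0, 11] encodes one list covering all 11 values,
    // matching the suggested fix of vec![0, 11].
    let lists = list_slices(&[0, 11], &values);
    assert_eq!(lists.len(), 1);
    assert_eq!(lists[0].len(), 11);
}
```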
```diff
     options,
     definition: None,
-    collected_statistics: Arc::new(DefaultFileStatisticsCache::default()),
+    collected_statistics: None,
```
With collected_statistics: None by default, constructing a ListingTable directly via ListingTable::try_new will now silently disable file statistics caching unless the caller explicitly calls with_cache(...). If this is intended, it may be worth calling out in the docs since other call sites outside ListingTableFactory may have relied on the previous default cache behavior.
Severity: low
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! Explaining what is the initial state of the cache (enabled or disabled) would be useful for the application developers to know what to expect and how to enable it when needed.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@datafusion/common/src/heap_size.rs`:
- Around line 46-56: The Statistics::heap_size implementation omits the Vec's
own heap allocation for column_statistics and only sums each ColumnStatistics'
inner heap size; update DFHeapSize::heap_size for Statistics to include the
Vec's heap allocation by calling the Vec<T>::heap_size implementation on
self.column_statistics (instead of manually iterating and summing), so the
returned size accounts for both the vector's buffer and each element's heap
usage.
🧹 Nitpick comments (7)
datafusion/common/src/heap_size.rs (3)
254-258: `Fields::heap_size()` underestimates — only accounts for inner `Arc<Field>` heap sizes, not the container allocation.

`Fields` is internally an `Arc<[FieldRef]>`. The iteration sums the heap_size of each `FieldRef` (which accounts for Arc overhead + Field), but misses the outer `Arc<[FieldRef]>` allocation itself (the Arc overhead plus the slice of `FieldRef` pointers). For a rough approximation this may be acceptable, but it consistently underestimates by `2 * size_of::<usize>() + len * size_of::<FieldRef>()`.
241-252: Consider potential double-counting with shared `Arc` references.

Both the `Arc<T>` and `Arc<dyn DFHeapSize>` impls report the full heap allocation (counts + data + inner heap). If the same `Arc` is referenced from multiple locations, each reference will independently report the full allocation, leading to over-counting in aggregate calculations. This is inherent to the design (documented as "approximate") but worth noting, especially for `Statistics` objects that may share `Arc`s.
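The over-counting can be demonstrated with a toy `heap_size` (the trait and the `2 * size_of::<usize>()` reference-count overhead mirror the impl quoted in this review, but the code is an illustrative sketch, not DataFusion's):

```rust
use std::mem::size_of;
use std::sync::Arc;

trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

// Each Arc reference reports the full shared allocation: the strong/weak
// counts (2 * usize) plus the pointee and its heap data.
impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}

fn main() {
    let shared = Arc::new(String::from("column stats"));
    let clone = Arc::clone(&shared);
    let single = shared.heap_size();
    assert!(single > 0);
    // Summing over both references reports twice the one real allocation.
    assert_eq!(shared.heap_size() + clone.heap_size(), 2 * single);
}
```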
353-361: Minor: `is_nullable()` and `dict_is_ordered()` contribute 0 — unnecessary overhead.

These are `bool` fields with 0 heap size. Calling `.heap_size()` on them is harmless but adds noise. Consider omitting them for clarity.

datafusion/execution/src/cache/cache_manager.rs (1)
48-56: Consider accounting for ordering in cache size.

`DFHeapSize` currently excludes `ordering`; if orderings can be sizable, cache eviction may under-trigger. A lightweight estimate or a `DFHeapSize` impl for `LexOrdering` would keep limits accurate.

Also applies to: 108-118
datafusion/execution/src/cache/cache_unit.rs (3)

67-71: Consider restricting `DefaultFileStatisticsCacheState` visibility to `pub(crate)`.

This struct is `pub` but all its fields are private and all methods lack `pub` visibility. External crate users can name the type but cannot construct or interact with it. If it's only used internally, `pub(crate)` would better express intent and reduce public API surface.
148-162: `debug_assert!(false, ...)` silently swallows accounting bugs in release builds.

If an accounting invariant violation occurs (empty cache but `memory_used > memory_limit`), this is silently ignored in release builds. Consider adding a `log::error!` / `tracing::error!` before returning so the bug is surfaced in production rather than silently leaving `memory_used` in an inconsistent state.

Suggested improvement:

```diff
 } else {
-    // cache is empty while memory_used > memory_limit, cannot happen
-    debug_assert!(
-        false,
-        "This is a bug! Please report it to the Apache DataFusion developers"
-    );
+    // cache is empty while memory_used > memory_limit: invariant violation
+    log::error!(
+        "File statistics cache memory accounting bug: memory_used={} but cache is empty. \
+         Please report this to the Apache DataFusion developers.",
+        self.memory_used
+    );
+    debug_assert!(
+        false,
+        "memory_used={} but cache is empty",
+        self.memory_used
+    );
+    self.memory_used = 0;
     return;
 }
```

Resetting `memory_used = 0` in the error branch prevents the cache from being permanently stuck (every subsequent `put` would re-enter the broken eviction loop).
212-231: Redundant `path.clone()` on line 218.

`path` is already a clone created on line 215 and is not used after the `insert` call. You can move it directly into `entries.insert(path, ...)` to avoid an unnecessary allocation.

Suggested fix:

```diff
 for entry in self.state.lock().unwrap().lru_queue.list_entries() {
     let path = entry.0.clone();
     let cached = entry.1.clone();
     entries.insert(
-        path.clone(),
+        path,
         FileStatisticsCacheEntry {
```
```rust
impl DFHeapSize for Statistics {
    fn heap_size(&self) -> usize {
        self.num_rows.heap_size()
            + self.total_byte_size.heap_size()
            + self
                .column_statistics
                .iter()
                .map(|s| s.heap_size())
                .sum::<usize>()
    }
}
```
Bug: Missing Vec heap allocation in Statistics::heap_size().
self.column_statistics is a Vec<ColumnStatistics>, but the code manually iterates and sums only the inner heap sizes of each element. The Vec's own heap buffer (capacity * size_of::<ColumnStatistics>()) is not accounted for. Use the Vec<T>::heap_size() impl instead.
🐛 Proposed fix:

```diff
 impl DFHeapSize for Statistics {
     fn heap_size(&self) -> usize {
         self.num_rows.heap_size()
             + self.total_byte_size.heap_size()
-            + self
-                .column_statistics
-                .iter()
-                .map(|s| s.heap_size())
-                .sum::<usize>()
+            + self.column_statistics.heap_size()
     }
 }
```

🤖 Prompt for AI Agents
In `@datafusion/common/src/heap_size.rs` around lines 46 - 56, The
Statistics::heap_size implementation omits the Vec's own heap allocation for
column_statistics and only sums each ColumnStatistics' inner heap size; update
DFHeapSize::heap_size for Statistics to include the Vec's heap allocation by
calling the Vec<T>::heap_size implementation on self.column_statistics (instead
of manually iterating and summing), so the returned size accounts for both the
vector's buffer and each element's heap usage.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is no need to iterate over the column_statistics' elements to sum their heap size. There is an implementation for Vec<HeapSize> already that does the right thing: it calculates the memory occupied by the Vec itself and its elements, and it could be reused here.
Code Review
This pull request introduces a memory-bound FileStatisticsCache for the ListingTable, incorporating a new DFHeapSize trait and an LRU-based cache implementation. A security review found no specific vulnerabilities or critical issues. However, one issue was identified in a new test case that could cause a panic.
```rust
) -> (ObjectMeta, CachedFileMetadata) {
    let series: Vec<i32> = (0..=10).collect();
    let values = Int32Array::from(series);
    let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));
```
The OffsetBuffer for the ListArray appears to be constructed incorrectly. ListArray::new requires that the last element of the offsets buffer equals the length of the values array. In this code, offsets is [0], but the values array has a length of 11, which will cause a panic.
To create a ListArray with a single element that contains all the values from series, the offsets should be vec![0, 11].
Suggested change:

```diff
-let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));
+let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, 11]));
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The offset array having just one item means that it will use an empty slice of the values, which does not make much sense for this unit test. The suggested fix prevents using an empty slice instead of a slice with all elements from the values array.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is no need to create a second clone of the `path`; the existing clone can be moved into `entries.insert` directly.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct!
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Since neither the struct's fields nor its methods are public, there is no need for the struct itself to be public. Its visibility could be reduced to private.
20047: To review by AI