20047: Add a memory bound FileStatisticsCache for the Listing Table by martin-augment · Pull Request #241 · martin-augment/datafusion

martin-augment · 2026-02-12T09:41:57Z

20047: To review by AI

…ngTable

gemini-code-assist · 2026-02-12T09:42:24Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances DataFusion's caching infrastructure by implementing a memory-aware file statistics cache. This new cache prevents unbounded memory growth by employing an LRU eviction strategy and accurately measures object sizes using a newly introduced DFHeapSize trait. The integration of this configurable cache into the Listing Table and the addition of a runtime setting for its memory limit provide greater control and efficiency in handling file metadata, particularly for large datasets.

Highlights

Memory-Bound File Statistics Cache: Introduced a new memory-bound FileStatisticsCache for the Listing Table, utilizing a Least Recently Used (LRU) eviction policy to manage memory consumption effectively.
Heap Size Calculation Trait: Added a DFHeapSize trait and its implementations for various DataFusion and Arrow types, enabling accurate calculation of heap memory usage for cached objects.
Configurable Cache Limit: The FileStatisticsCache memory limit is now configurable via a new runtime setting, datafusion.runtime.file_statistics_cache_limit, allowing users to control resource allocation.
Listing Table Integration: The ListingTable now optionally uses this memory-bound file statistics cache, improving performance by avoiding redundant file statistic inference while respecting memory constraints.

Changelog

datafusion-cli/src/main.rs
- Removed DefaultFileStatisticsCache import.
- Updated expected hit counts in snapshot tests due to cache behavior changes.
datafusion/catalog-listing/src/table.rs
- Modified ListingTable to hold an Option<Arc<dyn FileStatisticsCache>> instead of a direct Arc.
- Removed direct instantiation of DefaultFileStatisticsCache within ListingTable.
datafusion/common/Cargo.toml
- Updated object_store dependency to include default-features = true.
datafusion/common/src/heap_size.rs
- Added a new module defining the DFHeapSize trait.
- Implemented DFHeapSize for various DataFusion and Arrow data structures to calculate their heap memory usage.
datafusion/common/src/lib.rs
- Exported the new heap_size module.
datafusion/core/src/datasource/listing_table_factory.rs
- Updated method calls from with_files_statistics_cache to with_file_statistics_cache in tests.
datafusion/core/src/execution/context/mod.rs
- Added support for the new file_statistics_cache_limit runtime configuration setting.
- Included DEFAULT_FILE_STATISTICS_MEMORY_LIMIT constant.
datafusion/core/tests/parquet/file_statistics.rs
- Updated method calls from with_files_statistics_cache to with_file_statistics_cache.
datafusion/core/tests/sql/runtime_config.rs
- Added new tests for setting and retrieving the file_statistics_cache_limit runtime variable.
datafusion/execution/src/cache/cache_manager.rs
- Introduced DFHeapSize implementation for CachedFileMetadata.
- Updated FileStatisticsCache trait with cache_limit and update_cache_limit methods.
- Modified CacheManager to initialize FileStatisticsCache with a memory limit and to update the limit if an existing cache is provided.
- Renamed table_files_statistics_cache to file_statistics_cache in CacheManagerConfig and added file_statistics_cache_limit.
datafusion/execution/src/cache/cache_unit.rs
- Replaced DashMap with a Mutex-protected LruQueue for DefaultFileStatisticsCache to implement memory-bound LRU eviction.
- Implemented cache_limit and update_cache_limit methods for DefaultFileStatisticsCache.
- Updated list_entries to report actual statistics_size_bytes using heap_size.
- Added new tests for cache entry eviction and rejection of oversized entries.
datafusion/execution/src/runtime_env.rs
- Added file_statistics_cache_limit to the runtime configuration entries.
- Included with_file_statistics_cache_limit method in RuntimeEnvBuilder.
datafusion/sqllogictest/test_files/encrypted_parquet.slt
- Added set datafusion.runtime.file_statistics_cache_limit = "0K" to disable the cache for specific tests.
datafusion/sqllogictest/test_files/information_schema.slt
- Added datafusion.runtime.file_statistics_cache_limit 1M to the information schema output and documentation.
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt
- Added set datafusion.runtime.file_statistics_cache_limit = "0K" to disable the cache for specific tests.
datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt
- Added set datafusion.runtime.file_statistics_cache_limit = "0K" to disable the cache for specific tests.
datafusion/sqllogictest/test_files/set_variable.slt
- Added tests for setting, resetting, and showing the datafusion.runtime.file_statistics_cache_limit variable.
docs/source/user-guide/configs.md
- Documented the new datafusion.runtime.file_statistics_cache_limit configuration option in the user guide.

Activity

The pull request was created to introduce a memory-bound FileStatisticsCache for the Listing Table.
The author, martin-augment, explicitly requested review by AI.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-02-12T09:42:34Z

Walkthrough

This pull request refactors the file statistics caching mechanism to be memory-limited and optional rather than always-present by default. A new DFHeapSize trait enables heap memory tracking across DataFusion data structures. The DefaultFileStatisticsCache is reimplemented using an LRU queue with configurable memory limits and eviction logic. The ListingTable now accepts optional file statistics caches, and a new runtime configuration option datafusion.runtime.file_statistics_cache_limit allows operators to control cache memory usage. Method names are standardized from plural (with_files_statistics_cache) to singular form. Multiple test files are updated to disable caching by setting the limit to zero for deterministic behavior.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr-20047-2026-02-12-09-41-53

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-02-12T09:46:25Z

Pull Request Review: Add Memory-Bound FileStatisticsCache

Summary

This PR adds a memory-bounded file statistics cache with LRU eviction for the ListingTable. The implementation includes a comprehensive heap size tracking system and configuration support. Overall, this is a well-structured implementation with good test coverage.

Code Quality & Best Practices ✅

Strengths:

Well-organized code with clear separation of concerns
Comprehensive test coverage for the new functionality
Good documentation and inline comments
Proper use of Rust idioms (Arc, Mutex, Option)
Progressive enhancement approach (cache is optional)

Areas for Improvement:

1. Heap Size Calculation for HashMap (datafusion/common/src/heap_size.rs:199-239)

The HashMap heap size calculation is an approximation based on hashbrown 0.16.0. This is acknowledged in comments, but consider:

Adding a test to verify accuracy when hashbrown implementation changes
The calculation includes both stack size and heap allocations, which could double-count if keys/values have inline data

2. Arc Heap Size Implementation (datafusion/common/src/heap_size.rs:241-244)

Issue: This counts size_of::<T>() which is stack size, not heap size. The trait contract states "size of the type itself is not included". This could lead to over-counting memory usage.

Recommendation: Remove size_of::<T>() to only count reference counts and heap allocations:

2 * size_of::<usize>() + self.as_ref().heap_size()

3. Potential Memory Accounting Bug (datafusion/execution/src/cache/cache_unit.rs:97-123)

In the put method, when replacing an existing entry, the new entry_size is added before subtracting the old. This briefly inflates memory_used and could trigger unnecessary evictions.

Recommendation: Reorder to avoid temporary inflation:

if let Some(old_entry) = &old_value {
    self.memory_used -= old_entry.heap_size();
    self.memory_used += entry_size;
} else {
    self.memory_used += key.heap_size() + entry_size;
}

Performance Considerations ⚡

Good:

LRU eviction is efficient
Mutex contention should be minimal for read-heavy workloads
Default 1MiB limit is conservative and appropriate

Concerns:

Mutex per operation: Every get/put requires a mutex lock. For high-concurrency scenarios, this could be a bottleneck.
Clone on get: The current get implementation clones the CachedFileMetadata. For large statistics, this could be expensive.
Eviction overhead: evict_entries is called on every put, even when not needed.

Potential Bugs 🐛

1. Cache Disabled Behavior

The cache is now Option<Arc<dyn FileStatisticsCache>> and defaults to None. Ensure ListingTable::with_cache() is consistently called with the session cache manager.

2. Missing Validation in update_cache_limit

Setting limit to 0 would clear the entire cache. Consider if limit == 0 should disable the cache versus throwing an error.

3. Entry Rejection Logic (datafusion/execution/src/cache/cache_unit.rs:105-109)

If a single entry is larger than the cache limit, it is silently rejected. This could lead to confusing behavior.

Recommendation: Log a warning or add metrics when entries are rejected.

Security Concerns 🔒

No major security issues identified, but consider:

Memory DoS: A malicious user could create many tables with complex statistics to exhaust memory. The 1MiB default limit mitigates this.
Path validation: Ensure paths are normalized to prevent cache poisoning.

Test Coverage ✅

Excellent test coverage:

Unit tests for cache operations
Integration tests for runtime configuration
SQL logic tests for cache behavior
Tests for edge cases (too-large entries, eviction)

Suggestions:

Add a test for concurrent access patterns
Test cache behavior when limit is dynamically updated while queries are running

Overall Assessment

This is a solid implementation that adds important functionality to DataFusion. The code is well-structured and thoroughly tested.

Recommendation: Approve with minor revisions to address the memory accounting issues.

Priority Issues to Address:

Fix Arc heap size to not count size_of::<T>()
Reorder memory accounting in put to avoid temporary inflation
Consider logging when entries are rejected for being too large

Nice-to-Haves:

More informative debug assertions
Performance optimization: avoid cloning in get()
Concurrent access benchmarks

augmentcode · 2026-02-12T09:48:45Z

🤖 Augment PR Summary

Summary: Introduces a memory-bounded file statistics cache for Listing tables to avoid repeatedly inferring Parquet statistics.

Changes:

Replace the prior `DefaultFileStatisticsCache` with a mutex-protected, LRU-evicting cache sized by estimated heap usage
Add `DFHeapSize` utilities in `datafusion-common` to approximate heap usage of statistics/metadata structures
Extend the `FileStatisticsCache` trait with `cache_limit` / `update_cache_limit` and expose the limit via `CacheManager`
Add runtime setting `datafusion.runtime.file_statistics_cache_limit` (SET/RESET/SHOW, info_schema, docs) to configure/disable the cache
Wire `ListingTableFactory` / proto planning to pass the session cache into `ListingTable` and update affected tests/sqllogictests

Technical Notes: Default limit is 1MiB; setting the limit to 0 disables the cache. Entry sizing is currently approximate and does not yet account for ordering (LexOrdering) heap usage.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 5 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-02-12T09:48:46Z

datafusion/common/src/heap_size.rs

+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use object_store::path::Path;


datafusion-common declares object_store as an optional dependency, but this module unconditionally uses object_store::path::Path, which can break builds when the feature is disabled. Consider feature-gating this (or just the Path impl) to preserve the crate’s feature matrix.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! If the new dependency is optional then Cargo will create an implicit feature for it, so the import and the impl could be feature-gated with: #[cfg(feature="object_store")]. This way the impl will be available only for clients which provide this dependency.

augmentcode · 2026-02-12T09:48:46Z

datafusion/common/src/heap_size.rs

+        self.num_rows.heap_size()
+            + self.total_byte_size.heap_size()
+            + self
+                .column_statistics


DFHeapSize for Statistics sums per-column heap allocations but doesn’t include the Vec<ColumnStatistics> backing buffer (capacity * size_of), so cache sizing can significantly undercount. This undercount will also affect statistics_size_bytes and LRU eviction decisions.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:useful; category:bug; feedback: The Augment AI reviewer is correct! There is no need to iterate over the column_statistics' elements to sum their heap size. There is a implementation for Vec<HeapSize> already that does the right thing - calculates the memory occupied by the Vec itself and its elements and it could be reused here.

augmentcode · 2026-02-12T09:48:46Z

datafusion/execution/src/cache/cache_manager.rs

-    /// Set the cache for files statistics.
+    /// Set the cache for file statistics.
    ///
    /// Default is `None` (disabled).


The docs here say the file statistics cache is disabled by default, but CacheManagerConfig::default() sets a non-zero limit and CacheManager::try_new creates a default cache when file_statistics_cache is None. Consider updating the docs to reflect the new default-enabled behavior (and that setting the limit to 0 disables it).

Severity: low

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:annoying; category:documentation; feedback: The Augment AI reviewer is not correct! The documentation is for CacheManagerConfig, where file_statistics_cache is initialized with Default::default(), i.e. None. CacheManager::try_new() creates a default non-null value if the config does not provide one.

augmentcode · 2026-02-12T09:48:46Z

datafusion/execution/src/cache/cache_unit.rs

+    ) -> (ObjectMeta, CachedFileMetadata) {
+        let series: Vec<i32> = (0..=10).collect();
+        let values = Int32Array::from(series);
+        let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));


This ListArray is built with offsets = [0] (0 lists) but a non-empty values array, which may violate Arrow invariants / panic in ListArray::new or yield unexpected scalar semantics. If the intent is a single list scalar value, the offsets should represent one list (N+1 entries) to avoid relying on potentially undefined behavior.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The offset array having just one item means that it will use an empty slice of the values which does not make much sense for this unit test. Prevents using an empty slice instead of a slice with all elements from the values

augmentcode · 2026-02-12T09:48:46Z

datafusion/catalog-listing/src/table.rs

            options,
            definition: None,
-            collected_statistics: Arc::new(DefaultFileStatisticsCache::default()),
+            collected_statistics: None,


With collected_statistics: None by default, constructing a ListingTable directly via ListingTable::try_new will now silently disable file statistics caching unless the caller explicitly calls with_cache(...). If this is intended, it may be worth calling out in the docs since other call sites outside ListingTableFactory may have relied on the previous default cache behavior.

Severity: low

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! Explaining what is the initial state of the cache (enabled or disabled) would be useful for the application developers to know what to expect and how to enable it when needed.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@datafusion/common/src/heap_size.rs`:
- Around line 46-56: The Statistics::heap_size implementation omits the Vec's
own heap allocation for column_statistics and only sums each ColumnStatistics'
inner heap size; update DFHeapSize::heap_size for Statistics to include the
Vec's heap allocation by calling the Vec<T>::heap_size implementation on
self.column_statistics (instead of manually iterating and summing), so the
returned size accounts for both the vector's buffer and each element's heap
usage.

🧹 Nitpick comments (7)

datafusion/common/src/heap_size.rs (3)

254-258: Fields::heap_size() underestimates — only accounts for inner Arc<Field> heap sizes, not the container allocation.

Fields is internally an Arc<[FieldRef]>. The iteration sums the heap_size of each FieldRef (which accounts for Arc overhead + Field), but misses the outer Arc<[FieldRef]> allocation itself (the Arc overhead + the slice of FieldRef pointers). For a rough approximation this may be acceptable, but it consistently underestimates by 2 * size_of::<usize>() + len * size_of::<FieldRef>().

241-252: Consider potential double-counting with shared Arc references.

Both Arc<T> and Arc<dyn DFHeapSize> impls report the full heap allocation (counts + data + inner heap). If the same Arc is referenced from multiple locations, each reference will independently report the full allocation, leading to over-counting in aggregate calculations. This is inherent to the design (documented as "approximate") but worth noting — especially for Statistics objects that may share Arcs.

353-361: Minor: is_nullable() and dict_is_ordered() contribute 0 — unnecessary overhead.

These are bool fields with 0 heap size. Calling .heap_size() on them is harmless but adds noise. Consider omitting them for clarity.

datafusion/execution/src/cache/cache_manager.rs (1)

48-56: Consider accounting for ordering in cache size.
DFHeapSize currently excludes ordering; if orderings can be sizable, cache eviction may under‑trigger. A lightweight estimate or DFHeapSize for LexOrdering would keep limits accurate.

Also applies to: 108-118
datafusion/execution/src/cache/cache_unit.rs (3)
67-71: Consider restricting DefaultFileStatisticsCacheState visibility to pub(crate).

This struct is pub but all its fields are private and all methods lack pub visibility. External crate users can name the type but cannot construct or interact with it. If it's only used internally, pub(crate) would better express intent and reduce public API surface.

148-162: debug_assert!(false, ...) silently swallows accounting bugs in release builds.

If an accounting invariant violation occurs (empty cache but memory_used > memory_limit), this is silently ignored in release builds. Consider adding a log::error! / tracing::error! before returning so the bug is surfaced in production rather than silently leaving memory_used in an inconsistent state.
Suggested improvement
             } else {
-                // cache is empty while memory_used > memory_limit, cannot happen
-                debug_assert!(
-                    false,
-                    "This is a bug! Please report it to the Apache DataFusion developers"
-                );
+                // cache is empty while memory_used > memory_limit — invariant violation
+                log::error!(
+                    "File statistics cache memory accounting bug: memory_used={} but cache is empty. \
+                     Please report this to the Apache DataFusion developers.",
+                    self.memory_used
+                );
+                debug_assert!(
+                    false,
+                    "memory_used={} but cache is empty",
+                    self.memory_used
+                );
+                self.memory_used = 0;
                 return;
             }
Resetting memory_used = 0 in the error branch prevents the cache from being permanently stuck (every subsequent put would re-enter the broken eviction loop).

212-231: Redundant path.clone() on line 218.

path is already a clone created on line 215 and is not used after the insert call. You can move it directly into entries.insert(path, ...) to avoid an unnecessary allocation.
Suggested fix
         for entry in self.state.lock().unwrap().lru_queue.list_entries() {
             let path = entry.0.clone();
             let cached = entry.1.clone();
             entries.insert(
-                path.clone(),
+                path,
                 FileStatisticsCacheEntry {

coderabbitai · 2026-02-12T09:50:14Z

datafusion/common/src/heap_size.rs

+impl DFHeapSize for Statistics {
+    fn heap_size(&self) -> usize {
+        self.num_rows.heap_size()
+            + self.total_byte_size.heap_size()
+            + self
+                .column_statistics
+                .iter()
+                .map(|s| s.heap_size())
+                .sum::<usize>()
+    }
+}


⚠️ Potential issue | 🟠 Major

Bug: Missing Vec heap allocation in Statistics::heap_size().

self.column_statistics is a Vec<ColumnStatistics>, but the code manually iterates and sums only the inner heap sizes of each element. The Vec's own heap buffer (capacity * size_of::<ColumnStatistics>()) is not accounted for. Use the Vec<T>::heap_size() impl instead.

🐛 Proposed fix

impl DFHeapSize for Statistics { fn heap_size(&self) -> usize { self.num_rows.heap_size() + self.total_byte_size.heap_size() - + self - .column_statistics - .iter() - .map(|s| s.heap_size()) - .sum::<usize>() + + self.column_statistics.heap_size() } }

🤖 Prompt for AI Agents

In `@datafusion/common/src/heap_size.rs` around lines 46 - 56, The Statistics::heap_size implementation omits the Vec's own heap allocation for column_statistics and only sums each ColumnStatistics' inner heap size; update DFHeapSize::heap_size for Statistics to include the Vec's heap allocation by calling the Vec<T>::heap_size implementation on self.column_statistics (instead of manually iterating and summing), so the returned size accounts for both the vector's buffer and each element's heap usage.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is no need to iterate over the column_statistics' elements to sum their heap size. There is a implementation for Vec<HeapSize> already that does the right thing - calculates the memory occupied by the Vec itself and its elements and it could be reused here.

gemini-code-assist

Code Review

This pull request introduces a memory-bound FileStatisticsCache for the ListingTable, incorporating a new DFHeapSize trait and an LRU-based cache implementation. A security review found no specific vulnerabilities or critical issues. However, one issue was identified in a new test case that could cause a panic.

gemini-code-assist · 2026-02-12T09:51:28Z

datafusion/execution/src/cache/cache_unit.rs

+    ) -> (ObjectMeta, CachedFileMetadata) {
+        let series: Vec<i32> = (0..=10).collect();
+        let values = Int32Array::from(series);
+        let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));


The OffsetBuffer for the ListArray appears to be constructed incorrectly. ListArray::new requires that the last element of the offsets buffer equals the length of the values array. In this code, offsets is [0], but the values array has a length of 11, which will cause a panic.

To create a ListArray with a single element that contains all the values from series, the offsets should be vec![0, 11].

Suggested change

let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));

let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, 11]));

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The offset array having just one item means that it will use an empty slice of the values which does not make much sense for this unit test. Prevents using an empty slice instead of a slice with all elements from the values

martin-augment · 2026-02-12T13:43:03Z

212-231: Redundant path.clone() on line 218.

path is already a clone created on line 215 and is not used after the insert call. You can move it directly into entries.insert(path, ...) to avoid an unnecessary allocation.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is no need to create a second clone of the pathsince there is a clone that is created few lines earlier in the same loop.

martin-augment · 2026-02-12T13:53:09Z

og::error!(
+                    "File statistics cache memory accounting bug: memory_used={} but cache is empty. \
+                     Please report this to the Apache DataFusion developers.",
+                    self.memory_used
+                );
+                debug_assert!(
+                    false,
+                    "memory_used={} but cache is empty",
+                    self.memory_used
+                );
+                self.memory_used = 0;

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! debug_assert!() would report the problem only in development mode. It would be good to report something in production too, so that the app developers are notified. Reseting the memory_used variable will help to reuse the cache again without the need to restart the application.

martin-augment · 2026-02-12T13:54:12Z

67-71: Consider restricting DefaultFileStatisticsCacheState visibility to pub(crate).

This struct is pub but all its fields are private and all methods lack pub visibility. External crate users can name the type but cannot construct or interact with it. If it's only used internally, pub(crate) would better express intent and reduce public API surface.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Since neither the struct's fields nor its methods are public there is no need the struct itself to be public. Its visibility could be reduced to private.

mkleen added 22 commits February 12, 2026 09:22

Add a default FileStatisticsCache implementation for the ListingTable

bf70891

fixup! Add a default FileStatisticsCache implementation for the Listi…

25b09be

…ngTable

Adapt memory usage when removing entries

dbc0cc3

Adapt heapsize for &str

604f4a8

Fix formatting

9cdfcd9

Adapt heapsize for &str and add another scalarvalue

c36f2cd

Add better error message

1d16b70

Add todo to add heapsize for ordering in CachedFileMetadata

38b5e43

Fix comment/docs on DefaultFileStatisticsCache

ab69196

Simplify test data generation

0622062

Remove potential stale entry, if entry is too large

4ca3ad1

Fix typo in sql logic test comment

5a7f613

Fix comment about default behaviour in cache manager

cba97ea

Fix variable name in test

7d17f44

Fix variable name in test

05efb88

Disable cache for sql logic test

2ffd691

Include key into memory estimation

ba9b762

Fix fmt

d123484

Fix clippy

1460392

minor

9f6e3f3

Add more key memory accounting

4858cd7

Fix Formatting

1ba6e44

augmentcode bot reviewed Feb 12, 2026

View reviewed changes

coderabbitai bot reviewed Feb 12, 2026

View reviewed changes

gemini-code-assist bot reviewed Feb 12, 2026

View reviewed changes

	let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));
	let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, 11]));

Conversation

martin-augment commented Feb 12, 2026

Uh oh!

gemini-code-assist bot commented Feb 12, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai bot commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

claude bot commented Feb 12, 2026

Pull Request Review: Add Memory-Bound FileStatisticsCache

Summary

Code Quality & Best Practices ✅

1. Heap Size Calculation for HashMap (datafusion/common/src/heap_size.rs:199-239)

2. Arc Heap Size Implementation (datafusion/common/src/heap_size.rs:241-244)

3. Potential Memory Accounting Bug (datafusion/execution/src/cache/cache_unit.rs:97-123)

Performance Considerations ⚡

Potential Bugs 🐛

1. Cache Disabled Behavior

2. Missing Validation in update_cache_limit

3. Entry Rejection Logic (datafusion/execution/src/cache/cache_unit.rs:105-109)

Security Concerns 🔒

Test Coverage ✅

Overall Assessment

Uh oh!

augmentcode bot commented Feb 12, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 12, 2026

coderabbitai bot commented Feb 12, 2026 •

edited

Loading