
20047: Add a memory bound FileStatisticsCache for the Listing Table #241

Open
martin-augment wants to merge 22 commits into main from pr-20047-2026-02-12-09-41-53

Conversation

@martin-augment
Owner

20047: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances DataFusion's caching infrastructure by implementing a memory-aware file statistics cache. This new cache prevents unbounded memory growth by employing an LRU eviction strategy and accurately measures object sizes using a newly introduced DFHeapSize trait. The integration of this configurable cache into the Listing Table and the addition of a runtime setting for its memory limit provide greater control and efficiency in handling file metadata, particularly for large datasets.

Highlights

  • Memory-Bound File Statistics Cache: Introduced a new memory-bound FileStatisticsCache for the Listing Table, utilizing a Least Recently Used (LRU) eviction policy to manage memory consumption effectively.
  • Heap Size Calculation Trait: Added a DFHeapSize trait and its implementations for various DataFusion and Arrow types, enabling accurate calculation of heap memory usage for cached objects.
  • Configurable Cache Limit: The FileStatisticsCache memory limit is now configurable via a new runtime setting, datafusion.runtime.file_statistics_cache_limit, allowing users to control resource allocation.
  • Listing Table Integration: The ListingTable now optionally uses this memory-bound file statistics cache, improving performance by avoiding redundant file statistic inference while respecting memory constraints.
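The heap-size trait described above can be illustrated with a minimal sketch. The trait and method names (`DFHeapSize`, `heap_size`) follow the PR description, but the concrete impls below are simplified illustrations, not the PR's actual code:

```rust
// Sketch of the DFHeapSize idea: report bytes of owned heap memory,
// excluding the value's own inline (stack) footprint.
trait DFHeapSize {
    fn heap_size(&self) -> usize;
}

impl DFHeapSize for i64 {
    fn heap_size(&self) -> usize {
        0 // plain integers own no heap memory
    }
}

impl DFHeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity() // the heap buffer backing the string
    }
}

impl<T: DFHeapSize> DFHeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // the Vec's own backing buffer plus each element's heap allocations
        self.capacity() * std::mem::size_of::<T>()
            + self.iter().map(|e| e.heap_size()).sum::<usize>()
    }
}

fn main() {
    let v: Vec<String> = vec![String::from("abc")];
    assert_eq!(v[0].heap_size(), 3);
    println!("vec heap size: {}", v.heap_size());
}
```

Note how the `Vec<T>` impl counts both the backing buffer and the elements' heap usage; reusing it is exactly what the reviewers later suggest for `Statistics::heap_size`.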
Changelog
  • datafusion-cli/src/main.rs
    • Removed DefaultFileStatisticsCache import.
    • Updated expected hit counts in snapshot tests due to cache behavior changes.
  • datafusion/catalog-listing/src/table.rs
    • Modified ListingTable to hold an Option<Arc<dyn FileStatisticsCache>> instead of a direct Arc.
    • Removed direct instantiation of DefaultFileStatisticsCache within ListingTable.
  • datafusion/common/Cargo.toml
    • Updated object_store dependency to include default-features = true.
  • datafusion/common/src/heap_size.rs
    • Added a new module defining the DFHeapSize trait.
    • Implemented DFHeapSize for various DataFusion and Arrow data structures to calculate their heap memory usage.
  • datafusion/common/src/lib.rs
    • Exported the new heap_size module.
  • datafusion/core/src/datasource/listing_table_factory.rs
    • Updated method calls from with_files_statistics_cache to with_file_statistics_cache in tests.
  • datafusion/core/src/execution/context/mod.rs
    • Added support for the new file_statistics_cache_limit runtime configuration setting.
    • Included DEFAULT_FILE_STATISTICS_MEMORY_LIMIT constant.
  • datafusion/core/tests/parquet/file_statistics.rs
    • Updated method calls from with_files_statistics_cache to with_file_statistics_cache.
  • datafusion/core/tests/sql/runtime_config.rs
    • Added new tests for setting and retrieving the file_statistics_cache_limit runtime variable.
  • datafusion/execution/src/cache/cache_manager.rs
    • Introduced DFHeapSize implementation for CachedFileMetadata.
    • Updated FileStatisticsCache trait with cache_limit and update_cache_limit methods.
    • Modified CacheManager to initialize FileStatisticsCache with a memory limit and to update the limit if an existing cache is provided.
    • Renamed table_files_statistics_cache to file_statistics_cache in CacheManagerConfig and added file_statistics_cache_limit.
  • datafusion/execution/src/cache/cache_unit.rs
    • Replaced DashMap with a Mutex-protected LruQueue for DefaultFileStatisticsCache to implement memory-bound LRU eviction.
    • Implemented cache_limit and update_cache_limit methods for DefaultFileStatisticsCache.
    • Updated list_entries to report actual statistics_size_bytes using heap_size.
    • Added new tests for cache entry eviction and rejection of oversized entries.
  • datafusion/execution/src/runtime_env.rs
    • Added file_statistics_cache_limit to the runtime configuration entries.
    • Included with_file_statistics_cache_limit method in RuntimeEnvBuilder.
  • datafusion/sqllogictest/test_files/encrypted_parquet.slt
    • Added set datafusion.runtime.file_statistics_cache_limit = "0K" to disable the cache for specific tests.
  • datafusion/sqllogictest/test_files/information_schema.slt
    • Added datafusion.runtime.file_statistics_cache_limit 1M to the information schema output and documentation.
  • datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt
    • Added set datafusion.runtime.file_statistics_cache_limit = "0K" to disable the cache for specific tests.
  • datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt
    • Added set datafusion.runtime.file_statistics_cache_limit = "0K" to disable the cache for specific tests.
  • datafusion/sqllogictest/test_files/set_variable.slt
    • Added tests for setting, resetting, and showing the datafusion.runtime.file_statistics_cache_limit variable.
  • docs/source/user-guide/configs.md
    • Documented the new datafusion.runtime.file_statistics_cache_limit configuration option in the user guide.
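Based on the sqllogictest changes listed above, the new setting can be exercised from SQL roughly as follows (the `"0K"`/`"1M"` string forms mirror the `.slt` files; exact accepted syntax may vary):

```sql
-- Cap the file statistics cache at 1 MiB (the documented default)
set datafusion.runtime.file_statistics_cache_limit = "1M";

-- A zero limit disables the cache, as done in several .slt tests
set datafusion.runtime.file_statistics_cache_limit = "0K";

-- Restore the default
reset datafusion.runtime.file_statistics_cache_limit;
```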
Activity
  • The pull request was created to introduce a memory-bound FileStatisticsCache for the Listing Table.
  • The author, martin-augment, explicitly requested review by AI.

@coderabbitai

coderabbitai bot commented Feb 12, 2026

Walkthrough

This pull request refactors the file statistics caching mechanism to be memory-limited and optional rather than always-present by default. A new DFHeapSize trait enables heap memory tracking across DataFusion data structures. The DefaultFileStatisticsCache is reimplemented using an LRU queue with configurable memory limits and eviction logic. The ListingTable now accepts optional file statistics caches, and a new runtime configuration option datafusion.runtime.file_statistics_cache_limit allows operators to control cache memory usage. Method names are standardized from plural (with_files_statistics_cache) to singular form. Multiple test files are updated to disable caching by setting the limit to zero for deterministic behavior.
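The memory-limited LRU mechanism described in the walkthrough can be sketched with standard-library types. This is an illustration only: the PR's `DefaultFileStatisticsCache` uses an `LruQueue` behind a `Mutex` and sizes entries via `DFHeapSize`, while `MiniLru`, its fields, and its methods below are hypothetical names:

```rust
use std::collections::HashMap;

// Minimal memory-bounded LRU cache: each entry carries a last-use tick;
// put() rejects oversized entries and evicts least-recently-used ones.
struct MiniLru {
    entries: HashMap<String, (Vec<u8>, u64)>, // value + last-use tick
    tick: u64,
    memory_used: usize,
    memory_limit: usize,
}

impl MiniLru {
    fn new(memory_limit: usize) -> Self {
        Self { entries: HashMap::new(), tick: 0, memory_used: 0, memory_limit }
    }

    fn get(&mut self, key: &str) -> Option<&[u8]> {
        self.tick += 1;
        let tick = self.tick;
        match self.entries.get_mut(key) {
            Some((v, t)) => {
                *t = tick; // refresh recency on access
                Some(v.as_slice())
            }
            None => None,
        }
    }

    fn put(&mut self, key: String, value: Vec<u8>) {
        let size = key.len() + value.len();
        if size > self.memory_limit {
            return; // an entry larger than the whole cache is rejected outright
        }
        if let Some((old, _)) = self.entries.remove(&key) {
            self.memory_used -= key.len() + old.len(); // release old entry first
        }
        // evict least-recently-used entries until the new one fits
        while self.memory_used + size > self.memory_limit {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, (_, t))| *t)
                .map(|(k, _)| k.clone())
                .expect("memory_used > 0 implies a resident entry");
            let (v, _) = self.entries.remove(&lru).unwrap();
            self.memory_used -= lru.len() + v.len();
        }
        self.tick += 1;
        self.memory_used += size;
        self.entries.insert(key, (value, self.tick));
    }
}

fn main() {
    let mut cache = MiniLru::new(32);
    cache.put("a".into(), vec![0u8; 15]); // 16 bytes
    cache.put("b".into(), vec![0u8; 15]); // 16 bytes, cache now full
    let _ = cache.get("a");               // refresh "a"
    cache.put("c".into(), vec![0u8; 15]); // evicts "b", the LRU entry
    assert!(cache.get("b").is_none());
    assert!(cache.get("a").is_some());
}
```

The eviction loop here is linear in the number of entries; a real implementation keeps an ordered queue so eviction is O(1), which is what the `LruQueue` in the PR provides.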


@claude

claude bot commented Feb 12, 2026

Pull Request Review: Add Memory-Bound FileStatisticsCache

Summary

This PR adds a memory-bounded file statistics cache with LRU eviction for the ListingTable. The implementation includes a comprehensive heap size tracking system and configuration support. Overall, this is a well-structured implementation with good test coverage.

Code Quality & Best Practices ✅

Strengths:

  • Well-organized code with clear separation of concerns
  • Comprehensive test coverage for the new functionality
  • Good documentation and inline comments
  • Proper use of Rust idioms (Arc, Mutex, Option)
  • Progressive enhancement approach (cache is optional)

Areas for Improvement:

1. Heap Size Calculation for HashMap (datafusion/common/src/heap_size.rs:199-239)

The HashMap heap size calculation is an approximation based on hashbrown 0.16.0. This is acknowledged in comments, but consider:

  • Adding a test to verify accuracy when hashbrown implementation changes
  • The calculation includes both stack size and heap allocations, which could double-count if keys/values have inline data

2. Arc Heap Size Implementation (datafusion/common/src/heap_size.rs:241-244)

Issue: This counts size_of::<T>() which is stack size, not heap size. The trait contract states "size of the type itself is not included". This could lead to over-counting memory usage.

Recommendation: Remove size_of::<T>() to only count reference counts and heap allocations:

2 * size_of::<usize>() + self.as_ref().heap_size()

3. Potential Memory Accounting Bug (datafusion/execution/src/cache/cache_unit.rs:97-123)

In the put method, when replacing an existing entry, the new entry_size is added before subtracting the old. This briefly inflates memory_used and could trigger unnecessary evictions.

Recommendation: Reorder to avoid temporary inflation:

if let Some(old_entry) = &old_value {
    self.memory_used -= old_entry.heap_size();
    self.memory_used += entry_size;
} else {
    self.memory_used += key.heap_size() + entry_size;
}

Performance Considerations ⚡

Good:

  • LRU eviction is efficient
  • Mutex contention should be minimal for read-heavy workloads
  • Default 1MiB limit is conservative and appropriate

Concerns:

  1. Mutex per operation: Every get/put requires a mutex lock. For high-concurrency scenarios, this could be a bottleneck.
  2. Clone on get: The current get implementation clones the CachedFileMetadata. For large statistics, this could be expensive.
  3. Eviction overhead: evict_entries is called on every put, even when not needed.

Potential Bugs 🐛

1. Cache Disabled Behavior

The cache is now Option<Arc<dyn FileStatisticsCache>> and defaults to None. Ensure ListingTable::with_cache() is consistently called with the session cache manager.

2. Missing Validation in update_cache_limit

Setting limit to 0 would clear the entire cache. Consider if limit == 0 should disable the cache versus throwing an error.

3. Entry Rejection Logic (datafusion/execution/src/cache/cache_unit.rs:105-109)

If a single entry is larger than the cache limit, it is silently rejected. This could lead to confusing behavior.

Recommendation: Log a warning or add metrics when entries are rejected.

Security Concerns 🔒

No major security issues identified, but consider:

  1. Memory DoS: A malicious user could create many tables with complex statistics to exhaust memory. The 1MiB default limit mitigates this.
  2. Path validation: Ensure paths are normalized to prevent cache poisoning.

Test Coverage ✅

Excellent test coverage:

  • Unit tests for cache operations
  • Integration tests for runtime configuration
  • SQL logic tests for cache behavior
  • Tests for edge cases (too-large entries, eviction)

Suggestions:

  1. Add a test for concurrent access patterns
  2. Test cache behavior when limit is dynamically updated while queries are running

Overall Assessment

This is a solid implementation that adds important functionality to DataFusion. The code is well-structured and thoroughly tested.

Recommendation: Approve with minor revisions to address the memory accounting issues.

Priority Issues to Address:

  1. Fix Arc heap size to not count size_of::<T>()
  2. Reorder memory accounting in put to avoid temporary inflation
  3. Consider logging when entries are rejected for being too large

Nice-to-Haves:

  1. More informative debug assertions
  2. Performance optimization: avoid cloning in get()
  3. Concurrent access benchmarks

@augmentcode

augmentcode bot commented Feb 12, 2026

🤖 Augment PR Summary

Summary: Introduces a memory-bounded file statistics cache for Listing tables to avoid repeatedly inferring Parquet statistics.

Changes:

  • Replace the prior `DefaultFileStatisticsCache` with a mutex-protected, LRU-evicting cache sized by estimated heap usage
  • Add `DFHeapSize` utilities in `datafusion-common` to approximate heap usage of statistics/metadata structures
  • Extend the `FileStatisticsCache` trait with `cache_limit` / `update_cache_limit` and expose the limit via `CacheManager`
  • Add runtime setting `datafusion.runtime.file_statistics_cache_limit` (SET/RESET/SHOW, info_schema, docs) to configure/disable the cache
  • Wire `ListingTableFactory` / proto planning to pass the session cache into `ListingTable` and update affected tests/sqllogictests

Technical Notes: Default limit is 1MiB; setting the limit to 0 disables the cache. Entry sizing is currently approximate and does not yet account for ordering (LexOrdering) heap usage.



@augmentcode augmentcode bot left a comment


Review completed. 5 suggestions posted.


};
use chrono::{DateTime, Utc};
use half::f16;
use object_store::path::Path;


datafusion-common declares object_store as an optional dependency, but this module unconditionally uses object_store::path::Path, which can break builds when the feature is disabled. Consider feature-gating this (or just the Path impl) to preserve the crate’s feature matrix.

Severity: medium


Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! If the new dependency is optional then Cargo will create an implicit feature for it, so the import and the impl could be feature-gated with: #[cfg(feature="object_store")]. This way the impl will be available only for clients which provide this dependency.

self.num_rows.heap_size()
+ self.total_byte_size.heap_size()
+ self
.column_statistics


DFHeapSize for Statistics sums per-column heap allocations but doesn’t include the Vec<ColumnStatistics> backing buffer (capacity * size_of), so cache sizing can significantly undercount. This undercount will also affect statistics_size_bytes and LRU eviction decisions.

Severity: medium


Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! There is no need to iterate over the column_statistics' elements to sum their heap size. There is already an implementation for Vec<HeapSize> that does the right thing: it calculates the memory occupied by the Vec itself and its elements, and it could be reused here.

/// Set the cache for files statistics.
/// Set the cache for file statistics.
///
/// Default is `None` (disabled).


The docs here say the file statistics cache is disabled by default, but CacheManagerConfig::default() sets a non-zero limit and CacheManager::try_new creates a default cache when file_statistics_cache is None. Consider updating the docs to reflect the new default-enabled behavior (and that setting the limit to 0 disables it).

Severity: low


Owner Author


value:annoying; category:documentation; feedback: The Augment AI reviewer is not correct! The documentation is for CacheManagerConfig, where file_statistics_cache is initialized with Default::default(), i.e. None. CacheManager::try_new() creates a default non-null value if the config does not provide one.

) -> (ObjectMeta, CachedFileMetadata) {
let series: Vec<i32> = (0..=10).collect();
let values = Int32Array::from(series);
let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));


This ListArray is built with offsets = [0] (0 lists) but a non-empty values array, which may violate Arrow invariants / panic in ListArray::new or yield unexpected scalar semantics. If the intent is a single list scalar value, the offsets should represent one list (N+1 entries) to avoid relying on potentially undefined behavior.

Severity: medium


Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! The offset array having just one item means that it will use an empty slice of the values, which does not make much sense for this unit test. The fix prevents using an empty slice instead of a slice with all elements from the values array.

options,
definition: None,
collected_statistics: Arc::new(DefaultFileStatisticsCache::default()),
collected_statistics: None,


With collected_statistics: None by default, constructing a ListingTable directly via ListingTable::try_new will now silently disable file statistics caching unless the caller explicitly calls with_cache(...). If this is intended, it may be worth calling out in the docs since other call sites outside ListingTableFactory may have relied on the previous default cache behavior.

Severity: low


Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! Explaining what is the initial state of the cache (enabled or disabled) would be useful for the application developers to know what to expect and how to enable it when needed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@datafusion/common/src/heap_size.rs`:
- Around line 46-56: The Statistics::heap_size implementation omits the Vec's
own heap allocation for column_statistics and only sums each ColumnStatistics'
inner heap size; update DFHeapSize::heap_size for Statistics to include the
Vec's heap allocation by calling the Vec<T>::heap_size implementation on
self.column_statistics (instead of manually iterating and summing), so the
returned size accounts for both the vector's buffer and each element's heap
usage.
🧹 Nitpick comments (7)
datafusion/common/src/heap_size.rs (3)

254-258: Fields::heap_size() underestimates — only accounts for inner Arc<Field> heap sizes, not the container allocation.

Fields is internally an Arc<[FieldRef]>. The iteration sums the heap_size of each FieldRef (which accounts for Arc overhead + Field), but misses the outer Arc<[FieldRef]> allocation itself (the Arc overhead + the slice of FieldRef pointers). For a rough approximation this may be acceptable, but it consistently underestimates by 2 * size_of::<usize>() + len * size_of::<FieldRef>().


241-252: Consider potential double-counting with shared Arc references.

Both Arc<T> and Arc<dyn DFHeapSize> impls report the full heap allocation (counts + data + inner heap). If the same Arc is referenced from multiple locations, each reference will independently report the full allocation, leading to over-counting in aggregate calculations. This is inherent to the design (documented as "approximate") but worth noting — especially for Statistics objects that may share Arcs.


353-361: Minor: is_nullable() and dict_is_ordered() contribute 0 — unnecessary overhead.

These are bool fields with 0 heap size. Calling .heap_size() on them is harmless but adds noise. Consider omitting them for clarity.

datafusion/execution/src/cache/cache_manager.rs (1)

48-56: Consider accounting for ordering in cache size.
DFHeapSize currently excludes ordering; if orderings can be sizable, cache eviction may under‑trigger. A lightweight estimate or DFHeapSize for LexOrdering would keep limits accurate.

Also applies to: 108-118

datafusion/execution/src/cache/cache_unit.rs (3)

67-71: Consider restricting DefaultFileStatisticsCacheState visibility to pub(crate).

This struct is pub but all its fields are private and all methods lack pub visibility. External crate users can name the type but cannot construct or interact with it. If it's only used internally, pub(crate) would better express intent and reduce public API surface.


148-162: debug_assert!(false, ...) silently swallows accounting bugs in release builds.

If an accounting invariant violation occurs (empty cache but memory_used > memory_limit), this is silently ignored in release builds. Consider adding a log::error! / tracing::error! before returning so the bug is surfaced in production rather than silently leaving memory_used in an inconsistent state.

Suggested improvement
             } else {
-                // cache is empty while memory_used > memory_limit, cannot happen
-                debug_assert!(
-                    false,
-                    "This is a bug! Please report it to the Apache DataFusion developers"
-                );
+                // cache is empty while memory_used > memory_limit — invariant violation
+                log::error!(
+                    "File statistics cache memory accounting bug: memory_used={} but cache is empty. \
+                     Please report this to the Apache DataFusion developers.",
+                    self.memory_used
+                );
+                debug_assert!(
+                    false,
+                    "memory_used={} but cache is empty",
+                    self.memory_used
+                );
+                self.memory_used = 0;
                 return;
             }

Resetting memory_used = 0 in the error branch prevents the cache from being permanently stuck (every subsequent put would re-enter the broken eviction loop).


212-231: Redundant path.clone() on line 218.

path is already a clone created on line 215 and is not used after the insert call. You can move it directly into entries.insert(path, ...) to avoid an unnecessary allocation.

Suggested fix
         for entry in self.state.lock().unwrap().lru_queue.list_entries() {
             let path = entry.0.clone();
             let cached = entry.1.clone();
             entries.insert(
-                path.clone(),
+                path,
                 FileStatisticsCacheEntry {

Comment on lines +46 to +56
impl DFHeapSize for Statistics {
fn heap_size(&self) -> usize {
self.num_rows.heap_size()
+ self.total_byte_size.heap_size()
+ self
.column_statistics
.iter()
.map(|s| s.heap_size())
.sum::<usize>()
}
}


⚠️ Potential issue | 🟠 Major

Bug: Missing Vec heap allocation in Statistics::heap_size().

self.column_statistics is a Vec<ColumnStatistics>, but the code manually iterates and sums only the inner heap sizes of each element. The Vec's own heap buffer (capacity * size_of::<ColumnStatistics>()) is not accounted for. Use the Vec<T>::heap_size() impl instead.

🐛 Proposed fix
 impl DFHeapSize for Statistics {
     fn heap_size(&self) -> usize {
         self.num_rows.heap_size()
             + self.total_byte_size.heap_size()
-            + self
-                .column_statistics
-                .iter()
-                .map(|s| s.heap_size())
-                .sum::<usize>()
+            + self.column_statistics.heap_size()
     }
 }

Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is no need to iterate over the column_statistics' elements to sum their heap size. There is already an implementation for Vec<HeapSize> that does the right thing: it calculates the memory occupied by the Vec itself and its elements, and it could be reused here.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a memory-bound FileStatisticsCache for the ListingTable, incorporating a new DFHeapSize trait and an LRU-based cache implementation. A security review found no specific vulnerabilities or critical issues. However, one issue was identified in a new test case that could cause a panic.

) -> (ObjectMeta, CachedFileMetadata) {
let series: Vec<i32> = (0..=10).collect();
let values = Int32Array::from(series);
let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));


critical

The OffsetBuffer for the ListArray appears to be constructed incorrectly. ListArray::new requires that the last element of the offsets buffer equals the length of the values array. In this code, offsets is [0], but the values array has a length of 11, which will cause a panic.

To create a ListArray with a single element that contains all the values from series, the offsets should be vec![0, 11].

Suggested change
let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0]));
let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, 11]));

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The offset array having just one item means that it will use an empty slice of the values, which does not make much sense for this unit test. The fix prevents using an empty slice instead of a slice with all elements from the values array.

@martin-augment
Owner Author

212-231: Redundant path.clone() on line 218.

path is already a clone created on line 215 and is not used after the insert call. You can move it directly into entries.insert(path, ...) to avoid an unnecessary allocation.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is no need to create a second clone of the path since a clone is already created a few lines earlier in the same loop.

@martin-augment
Owner Author

log::error!(
    "File statistics cache memory accounting bug: memory_used={} but cache is empty. \
     Please report this to the Apache DataFusion developers.",
    self.memory_used
);
debug_assert!(
    false,
    "memory_used={} but cache is empty",
    self.memory_used
);
self.memory_used = 0;

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! debug_assert!() would report the problem only in development builds. It would be good to report something in production too, so that the app developers are notified. Resetting the memory_used variable will help to reuse the cache again without the need to restart the application.

@martin-augment
Owner Author

67-71: Consider restricting DefaultFileStatisticsCacheState visibility to pub(crate).

This struct is pub but all its fields are private and all methods lack pub visibility. External crate users can name the type but cannot construct or interact with it. If it's only used internally, pub(crate) would better express intent and reduce public API surface.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Since neither the struct's fields nor its methods are public, there is no need for the struct itself to be public. Its visibility could be reduced to private.
