20047: Add a memory bound FileStatisticsCache for the Listing Table #236
martin-augment wants to merge 6 commits into main from …
Conversation
Walkthrough: This pull request refactors the file statistics caching architecture across DataFusion. The changes include: (1) introducing a new …
Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to DataFusion's caching mechanism by implementing a memory-bound FileStatisticsCache.

Highlights
Changelog
Activity
Pull Request Review: Add Memory-Bound FileStatisticsCache for ListingTable

Summary

This PR adds a memory-bounded implementation of FileStatisticsCache with LRU eviction for the ListingTable. The implementation includes comprehensive heap size tracking, runtime configuration support, and good test coverage.

Code Quality & Architecture ✅

Strengths:

Areas for Improvement:

1. Heap Size Calculation Accuracy
Code Review
This pull request introduces a memory-bound FileStatisticsCache to improve resource management in DataFusion, leveraging a new DFHeapSize trait and an LRU eviction policy. While this is a positive step towards preventing unbounded memory usage and potential OOM issues, a critical vulnerability was identified: the parse_memory_limit function in the configuration handling logic does not correctly handle empty strings, which could lead to a panic and a Denial of Service when processing malformed SQL commands. Additionally, there are a couple of suggestions for improvement, including a correctness issue with the cache size calculation.
```rust
impl DFHeapSize for CachedFileMetadata {
    fn heap_size(&self) -> usize {
        self.meta.size.heap_size()
            + self.meta.last_modified.heap_size()
            + self.meta.version.heap_size()
            + self.meta.e_tag.heap_size()
            + self.meta.location.as_ref().heap_size()
            + self.statistics.heap_size()
    }
}
```
The heap_size implementation for CachedFileMetadata doesn't account for the memory used by self.ordering. This can lead to the cache size being underestimated, potentially causing it to grow beyond its memory limit. The ordering field is an Option<LexOrdering>, which can hold a significant amount of data on the heap.
To fix this, DFHeapSize would need to be implemented for PhysicalExpr and related types, and then self.ordering.heap_size() should be added to the sum. Given the complexity, this might be better addressed in a follow-up PR, but it's an important correctness issue for the memory-bound cache.
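As a starting point, here is the kind of blanket impl that `heap_size.rs` could provide (a minimal sketch, assuming it does not already exist there) so that optional fields like `ordering` participate in the accounting once their inner type implements the trait:

```rust
// Inside datafusion/common/src/heap_size.rs (hypothetical addition):
// a None variant owns no heap, a Some variant delegates to its inner value.
impl<T: DFHeapSize> DFHeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().map_or(0, |inner| inner.heap_size())
    }
}
```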
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The ordering field might occupy a large amount of heap space, so it should be tracked too. A TODO/issue should be added so this can be improved in a follow-up pull request.
```rust
    builder.with_object_list_cache_ttl(Some(duration))
}
"file_statistics_cache_limit" => {
    let limit = Self::parse_memory_limit(value)?;
```
The parse_memory_limit function is vulnerable to a panic when an empty string is provided. This is because it calls split_at(limit.len() - 1) without checking if the string is empty, leading to a usize underflow. An attacker with SQL access could trigger this by running SET datafusion.runtime.file_statistics_cache_limit = '', causing a Denial of Service (DoS) by crashing the process.
Recommendation: Add a check for empty strings in parse_memory_limit and return an error instead of panicking.
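A minimal sketch of the guard, assuming the parser splits off a one-character unit suffix (K/M/G) as described above; the function body and error type here are illustrative, not the actual DataFusion signature:

```rust
// Hypothetical, simplified version of parse_memory_limit with the empty-string guard.
fn parse_memory_limit(limit: &str) -> Result<usize, String> {
    // Guard first: split_at(limit.len() - 1) underflows on "".
    if limit.is_empty() {
        return Err("memory limit cannot be empty".to_string());
    }
    let (number, unit) = limit.split_at(limit.len() - 1);
    let number: f64 = number
        .parse()
        .map_err(|_| format!("invalid numeric value in memory limit '{limit}'"))?;
    match unit {
        "K" => Ok((number * 1024.0) as usize),
        "M" => Ok((number * 1024.0 * 1024.0) as usize),
        "G" => Ok((number * 1024.0 * 1024.0 * 1024.0) as usize),
        _ => Err(format!("unsupported unit '{unit}' in memory limit '{limit}'")),
    }
}
```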
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The parse_memory_limit() function does not validate its input and this may lead to a panic when the input is an empty string. This is not caused by the changes in this Pull Request but it would be good to fix it while modifying this code.
```rust
let list_files_cache = Arc::new(DefaultFileStatisticsCache::default());

let rt = RuntimeEnvBuilder::new()
    .with_cache_manager(
        CacheManagerConfig::default()
            .with_file_statistics_cache(Some(list_files_cache)),
```
The variable list_files_cache is of type DefaultFileStatisticsCache. For clarity and to avoid confusion, it would be better to rename it to file_statistics_cache.
```diff
-let list_files_cache = Arc::new(DefaultFileStatisticsCache::default());
+let file_statistics_cache = Arc::new(DefaultFileStatisticsCache::default());
 let rt = RuntimeEnvBuilder::new()
     .with_cache_manager(
         CacheManagerConfig::default()
-            .with_file_statistics_cache(Some(list_files_cache)),
+            .with_file_statistics_cache(Some(file_statistics_cache)),
```
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The list_files_cache variable name is copied from another test and is misleading for this test case. It would be good to rename it to file_statistics_cache to avoid any confusion.
🤖 Augment PR Summary

Summary: This PR introduces a memory-bounded, LRU-based FileStatisticsCache for the ListingTable.

Changes:
Technical Notes: Cache entries store CachedFileMetadata values keyed by object-store Path, with heap usage estimated via the new DFHeapSize trait.
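For orientation, here is a hedged usage sketch assembled from the test code quoted later in this thread. The builder calls (`with_cache_manager`, `with_file_statistics_cache`, `with_file_statistics_cache_limit`) appear in the PR; the import paths, the `build_arc()` call, and the 64 MiB figure are assumptions for illustration:

```rust
use std::sync::Arc;

// Paths assumed from the PR's file layout; they may differ in the merged crate.
use datafusion_execution::cache::cache_manager::CacheManagerConfig;
use datafusion_execution::cache::cache_unit::DefaultFileStatisticsCache;
use datafusion_execution::runtime_env::RuntimeEnvBuilder;

fn build_runtime() -> datafusion_common::Result<()> {
    let file_statistics_cache = Arc::new(DefaultFileStatisticsCache::default());
    let _rt = RuntimeEnvBuilder::new()
        .with_cache_manager(
            CacheManagerConfig::default()
                .with_file_statistics_cache(Some(file_statistics_cache)),
        )
        // Hypothetical 64 MiB bound; the PR also exposes a runtime setting:
        // SET datafusion.runtime.file_statistics_cache_limit = '64M'
        .with_file_statistics_cache_limit(64 * 1024 * 1024)
        .build_arc()?;
    Ok(())
}
```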
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/execution/src/cache/cache_unit.rs (1)
30-41: ⚠️ Potential issue | 🟡 Minor

Stale documentation: still references DashMap.

Line 39 says "Uses DashMap for lock-free concurrent access" but the implementation now uses `Mutex<DefaultFileStatisticsCacheState>` with an LRU queue. Update the doc comment to reflect the new design.

📝 Proposed fix

```diff
-/// Uses DashMap for lock-free concurrent access.
+/// Uses a Mutex-protected LRU queue with a configurable memory limit.
```
🤖 Fix all issues with AI agents
In `@datafusion/execution/src/cache/cache_manager.rs`:
- Around line 108-117: The heap_size implementation for CachedFileMetadata
(DFHeapSize::heap_size) omits the ordering: Option<LexOrdering> field,
undercounting memory; update the function to include the ordering's heap size
(e.g., add self.ordering.heap_size() or self.ordering.as_ref().heap_size()
depending on available trait impls) alongside the other meta fields so Optional
LexOrdering memory is accounted for in CachedFileMetadata::heap_size.
In `@datafusion/execution/src/cache/cache_unit.rs`:
- Around line 91-112: In put(), before early-returning when the new entry_size
exceeds self.memory_limit, check for and remove any existing stale entry for the
same key: use self.lru_queue.remove(key) (or the appropriate removal method on
lru_queue) to retrieve and subtract its heap_size() from self.memory_used so the
old entry is cleared; then return None. Ensure you reference the existing
symbols put, lru_queue, memory_used, memory_limit, evict_entries (if still
needed), CachedFileMetadata, and Path when locating and updating the code.
In `@datafusion/sqllogictest/test_files/set_variable.slt`:
- Line 416: The comment string contains a typo:
"rruntime.file_statistics_cache_limit" should be
"runtime.file_statistics_cache_limit"; update the comment text in the test
header (the line that reads "Test SET and SHOW
rruntime.file_statistics_cache_limit") to read "Test SET and SHOW
runtime.file_statistics_cache_limit" so the configuration key name is spelled
correctly.
🧹 Nitpick comments (6)
datafusion/core/tests/sql/runtime_config.rs (1)
330-337: Misleading variable name: `list_files_cache` holds a `DefaultFileStatisticsCache`.

This appears to be a copy-paste artifact from `test_list_files_cache_limit`. The variable should be renamed to something like `file_statistics_cache` for clarity.

Suggested fix

```diff
 async fn test_file_statistics_cache_limit() {
-    let list_files_cache = Arc::new(DefaultFileStatisticsCache::default());
+    let file_statistics_cache = Arc::new(DefaultFileStatisticsCache::default());
     let rt = RuntimeEnvBuilder::new()
         .with_cache_manager(
             CacheManagerConfig::default()
-                .with_file_statistics_cache(Some(list_files_cache)),
+                .with_file_statistics_cache(Some(file_statistics_cache)),
         )
```

datafusion/execution/src/runtime_env.rs (1)
456-459: Missing doc comment on `with_file_statistics_cache_limit`.

Other similar builder methods (`with_metadata_cache_limit`, `with_object_list_cache_limit`) have doc comments describing the parameter. This one should follow the same pattern.

Suggested fix

```diff
+    /// Specifies the memory limit for the file statistics cache, in bytes.
     pub fn with_file_statistics_cache_limit(mut self, limit: usize) -> Self {
```

datafusion/common/src/heap_size.rs (2)
172-175: Variable names `u8` and `i8` shadow built-in type names.

While this compiles correctly (pattern bindings are just identifiers), using Rust primitive type names as variable names is confusing to read. Consider `precision`/`scale` or `p`/`s` instead.

Suggested fix

```diff
-            Decimal32(u8, i8) => u8.heap_size() + i8.heap_size(),
-            Decimal64(u8, i8) => u8.heap_size() + i8.heap_size(),
-            Decimal128(u8, i8) => u8.heap_size() + i8.heap_size(),
-            Decimal256(u8, i8) => u8.heap_size() + i8.heap_size(),
+            Decimal32(p, s) => p.heap_size() + s.heap_size(),
+            Decimal64(p, s) => p.heap_size() + s.heap_size(),
+            Decimal128(p, s) => p.heap_size() + s.heap_size(),
+            Decimal256(p, s) => p.heap_size() + s.heap_size(),
```
247-251: `Fields` heap size slightly undercounts: the outer `Arc<[FieldRef]>` allocation is not included.

`Fields` wraps an `Arc<[Arc<Field>]>`. This implementation sums per-element `Arc<Field>::heap_size()` but doesn't account for the outer Arc's allocation (ref counts + the slice buffer). The undercount is small and constant per `Fields` instance, so this is acceptable as an approximation, but worth noting.
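If one did want to include the outer allocation, a hedged helper along these lines could account for it (illustrative only; assumes `Fields` derefs to a slice of `FieldRef` as in arrow-rs, and approximates the Arc ref-count header with two `usize`s):

```rust
use std::mem::size_of;

use arrow::datatypes::{FieldRef, Fields};

// `per_field` stands in for the per-field DFHeapSize logic (hypothetical).
fn fields_heap_size(fields: &Fields, per_field: impl Fn(&FieldRef) -> usize) -> usize {
    2 * size_of::<usize>()                      // Arc strong + weak ref counts
        + fields.len() * size_of::<FieldRef>()  // slice buffer held by the Arc
        + fields.iter().map(per_field).sum::<usize>()
}
```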
datafusion/execution/src/cache/cache_unit.rs (2)

96-103: Memory accounting only tracks `value.heap_size()`, not the key or LRU node overhead.

Each entry in the LRU queue also stores a cloned `Path` key (heap-allocated string) and a linked-list node. These are not included in `memory_used`, so actual heap consumption can exceed `memory_limit`, proportionally more so with many small entries. Consider adding the key's heap size to the accounting for more accurate bounds.

Sketch

```diff
-        let entry_size = value.heap_size();
+        let entry_size = key.as_ref().len() + value.heap_size();
```

Apply the same adjustment in `remove` and `evict_entries` where `heap_size` is subtracted.
598-626: Nit: `step_by(1)` is redundant.

Line 601: `(0..=10).step_by(1)`. The step is already 1 by default for range iterators.

Proposed fix

```diff
-    let series: Vec<i32> = (0..=10).step_by(1).collect();
+    let series: Vec<i32> = (0..=10).collect();
```
```rust
impl DFHeapSize for CachedFileMetadata {
    fn heap_size(&self) -> usize {
        self.meta.size.heap_size()
            + self.meta.last_modified.heap_size()
            + self.meta.version.heap_size()
            + self.meta.e_tag.heap_size()
            + self.meta.location.as_ref().heap_size()
            + self.statistics.heap_size()
    }
}
```
ordering field is not included in heap_size(), leading to under-counted memory usage.
CachedFileMetadata has an ordering: Option<LexOrdering> field (Line 55) that is missing from the heap size calculation. For files with sort ordering metadata, this omission means the cache's memory accounting will be lower than actual usage, potentially allowing the cache to exceed its configured memory limit.
Suggested fix

```rust
impl DFHeapSize for CachedFileMetadata {
    fn heap_size(&self) -> usize {
        // TODO: add self.ordering.heap_size() once DFHeapSize is implemented for LexOrdering
        self.meta.size.heap_size()
            + self.meta.last_modified.heap_size()
            + self.meta.version.heap_size()
            + self.meta.e_tag.heap_size()
            + self.meta.location.as_ref().heap_size()
            + self.statistics.heap_size()
    }
}
```

📝 Committable suggestion
```diff
 impl DFHeapSize for CachedFileMetadata {
     fn heap_size(&self) -> usize {
+        // TODO: add self.ordering.heap_size() once DFHeapSize is implemented for LexOrdering
         self.meta.size.heap_size()
             + self.meta.last_modified.heap_size()
             + self.meta.version.heap_size()
             + self.meta.e_tag.heap_size()
             + self.meta.location.as_ref().heap_size()
             + self.statistics.heap_size()
     }
 }
```
🤖 Prompt for AI Agents
In `@datafusion/execution/src/cache/cache_manager.rs` around lines 108 - 117, The
heap_size implementation for CachedFileMetadata (DFHeapSize::heap_size) omits
the ordering: Option<LexOrdering> field, undercounting memory; update the
function to include the ordering's heap size (e.g., add
self.ordering.heap_size() or self.ordering.as_ref().heap_size() depending on
available trait impls) alongside the other meta fields so Optional LexOrdering
memory is accounted for in CachedFileMetadata::heap_size.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The ordering field might occupy a large amount of heap space, so it should be tracked too. A TODO/issue should be added so this can be improved in a follow-up pull request.
```rust
fn put(
    &mut self,
    key: &Path,
    value: CachedFileMetadata,
) -> Option<CachedFileMetadata> {
    let entry_size = value.heap_size();

    if entry_size > self.memory_limit {
        return None;
    }

    let old_value = self.lru_queue.put(key.clone(), value);
    self.memory_used += entry_size;

    if let Some(old_entry) = &old_value {
        self.memory_used -= old_entry.heap_size();
    }

    self.evict_entries();

    old_value
}
```
put() does not remove the stale entry when the new value exceeds the memory limit.
When entry_size > self.memory_limit (line 98), the method returns None immediately. If the key already exists in the cache (e.g., an older, now-invalid entry for the same path), that stale entry remains, wasting memory and causing get() to keep returning an invalidated value. Consider removing the old entry before the early return.
Proposed fix

```diff
     let entry_size = value.heap_size();
     if entry_size > self.memory_limit {
+        // Still remove the old entry for this key to avoid retaining stale data
+        self.remove(key);
         return None;
     }
```

📝 Committable suggestion
```diff
 fn put(
     &mut self,
     key: &Path,
     value: CachedFileMetadata,
 ) -> Option<CachedFileMetadata> {
     let entry_size = value.heap_size();
     if entry_size > self.memory_limit {
+        // Still remove the old entry for this key to avoid retaining stale data
+        self.remove(key);
         return None;
     }
     let old_value = self.lru_queue.put(key.clone(), value);
     self.memory_used += entry_size;
     if let Some(old_entry) = &old_value {
         self.memory_used -= old_entry.heap_size();
     }
     self.evict_entries();
     old_value
 }
```
🤖 Prompt for AI Agents
In `@datafusion/execution/src/cache/cache_unit.rs` around lines 91 - 112, In
put(), before early-returning when the new entry_size exceeds self.memory_limit,
check for and remove any existing stale entry for the same key: use
self.lru_queue.remove(key) (or the appropriate removal method on lru_queue) to
retrieve and subtract its heap_size() from self.memory_used so the old entry is
cleared; then return None. Ensure you reference the existing symbols put,
lru_queue, memory_used, memory_limit, evict_entries (if still needed),
CachedFileMetadata, and Path when locating and updating the code.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! If the new value is too big then it is not tracked at all, but any previous entry in the cache for the same key is not removed. Prevents calculating wrong sizes by using wrong/obsolete value for such keys.
```
----
datafusion.runtime.max_temp_directory_size 10G

# Test SET and SHOW rruntime.file_statistics_cache_limit
```
Typo in comment: "rruntime" → "runtime".
✏️ Fix typo

```diff
-# Test SET and SHOW rruntime.file_statistics_cache_limit
+# Test SET and SHOW runtime.file_statistics_cache_limit
```
🤖 Prompt for AI Agents
In `@datafusion/sqllogictest/test_files/set_variable.slt` at line 416, The comment
string contains a typo: "rruntime.file_statistics_cache_limit" should be
"runtime.file_statistics_cache_limit"; update the comment text in the test
header (the line that reads "Test SET and SHOW
rruntime.file_statistics_cache_limit") to read "Test SET and SHOW
runtime.file_statistics_cache_limit" so the configuration key name is spelled
correctly.
value:good-to-have; category:typo; feedback: The CodeRabbit AI reviewer is correct! There is a small typo in a comment in one of the tests that would be good to fix. It does not cause any harm though!
```rust
fn heap_size(&self) -> usize {
    self.num_rows.heap_size()
        + self.total_byte_size.heap_size()
        + self
```
Statistics::heap_size sums per-column heap usage but doesn’t include the heap allocation of the column_statistics Vec buffer itself, which can significantly underestimate entry sizes. This can cause the memory-limited file statistics cache to retain more data than its configured limit.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! To measure the heap size of a collection like Vec it should measure the size of an item multiplied by the capacity of the collection. The implementation here iterates over the length of the collection and ignores any extra items in the capacity.
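A capacity-aware version could look like the following sketch. This is an illustration rather than the PR's actual code; the `DFHeapSize` import path is assumed from the PR's file layout:

```rust
use std::mem::size_of;

use datafusion_common::heap_size::DFHeapSize; // assumed module path

// Charge the Vec's buffer by capacity (what is actually allocated),
// then add each element's own heap allocations.
fn vec_heap_size<T: DFHeapSize>(v: &Vec<T>) -> usize {
    v.capacity() * size_of::<T>()
        + v.iter().map(|item| item.heap_size()).sum::<usize>()
}
```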
```rust
    + self.meta.last_modified.heap_size()
    + self.meta.version.heap_size()
    + self.meta.e_tag.heap_size()
    + self.meta.location.as_ref().heap_size()
```
CachedFileMetadata::heap_size doesn’t account for ordering, so entries with cached orderings may exceed the intended memory bound. That can make eviction behavior inconsistent with the configured cache limit.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The ordering field might occupy a large amount of heap space, so it should be tracked too. A TODO/issue should be added so this can be improved in a follow-up pull request.
```diff
 /// Enabling the cache avoids repeatedly reading file statistics in a DataFusion session.
 /// Default is disabled. Currently only Parquet files are supported.
-pub table_files_statistics_cache: Option<Arc<dyn FileStatisticsCache>>,
+pub file_statistics_cache: Option<Arc<dyn FileStatisticsCache>>,
```
The surrounding docstrings say the file statistics cache is “disabled by default”, but CacheManager::try_new will create a DefaultFileStatisticsCache whenever file_statistics_cache_limit > 0 (and the default limit is 1MiB). The docs around file_statistics_cache/with_file_statistics_cache may no longer match actual default behavior.
Severity: low
value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The docstring saying that the cache is disabled by default is wrong and should be updated.
```diff
 #[derive(Default)]
 pub struct DefaultFileStatisticsCache {
-    cache: DashMap<Path, CachedFileMetadata>,
+    state: Mutex<DefaultFileStatisticsCacheState>,
```
The doc comment above DefaultFileStatisticsCache says it “Uses DashMap for lock-free concurrent access”, but the implementation is now Mutex + LruQueue. This looks like a stale comment that could mislead readers about concurrency characteristics.
Severity: low
value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The implementation has changed but the documentation has not been updated. Prevents releasing a new version with wrong documentation.
```rust
) -> Option<CachedFileMetadata> {
    let entry_size = value.heap_size();

    if entry_size > self.memory_limit {
```
When entry_size > memory_limit, DefaultFileStatisticsCacheState::put returns None without removing or returning any existing value for the same key. That violates CacheAccessor::put’s contract (“Returns the previous value if one existed”) and can leave an invalid/obsolete entry resident if a too-large replacement is attempted.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! If the new value is too big then it is not tracked at all, but any previous entry in the cache for the same key is not removed. Prevents calculating wrong sizes by using wrong/obsolete value for such keys.
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The implementation counts each Arc clone separately, so it will count N×size instead of 1×size, where N is the number of clones. Prevents over-estimating the size of the same object.
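One hedged way to avoid the double count, shown as an illustration rather than the PR's chosen fix, is to amortize the pointee across all strong references; the `DFHeapSize` import path is assumed:

```rust
use std::mem::size_of;
use std::sync::Arc;

use datafusion_common::heap_size::DFHeapSize; // assumed module path

// N clones together account for roughly one copy of the pointee,
// because each clone reports only its 1/N share (integer division).
fn arc_heap_size<T: DFHeapSize>(value: &Arc<T>) -> usize {
    let pointee = size_of::<T>() + value.as_ref().heap_size();
    pointee / Arc::strong_count(value)
}
```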
value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The proposed implementation first adds the size of the new value and then subtracts the size of the old value. In the time in between, the total size is not accurate! But it would be the same if the size of the old value were subtracted first and the size of the new value added afterwards. That would be slightly worse, because during that period the system would think there is more free heap to use.
value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The idea of using debug_assert!() is to report the issue only during development. If the issue happens in production then the fact that the cache is empty would cause slowdowns, but it should not break the application.
value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! The implementation has changed but the documentation has not been updated. Prevents releasing a new version with wrong documentation.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! The variable names shadow built-in types and this would be very confusing. It would be good to use variable names which tell more about the purpose of the variable, not its type.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The proposed implementation tracks only the sizes of the values in the cache, but the keys are omitted. The keys' type is Path so they should be tracked too.
20047: To review by AI