Describe the bug
#7850 introduced the cached array reader, which causes multiple data pages to be fetched when their size is less than the `batch_size`. Depending on the file, workload, and batch size, this can cause performance regressions (e.g., apache/datafusion#17575). While setting `max_predicate_cache_size` to 0 essentially disables the predicate cache, multiple pages are still fetched unnecessarily.
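For reference, a minimal sketch of how a user would opt out of the cache today, assuming `with_max_predicate_cache_size` is the builder option that exposes this setting (the exact entry point may differ across arrow-rs versions):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// A minimal sketch, assuming `with_max_predicate_cache_size` is the
// builder option exposing this setting; adjust to the entry point in
// your arrow-rs version.
fn read_with_cache_disabled(file: File) -> parquet::errors::Result<()> {
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        // Intended to disable the predicate cache, but the expanded
        // page fetches described above still happen (the bug).
        .with_max_predicate_cache_size(0)
        .build()?;
    for batch in reader {
        let _ = batch?;
    }
    Ok(())
}
```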
The sources of this issue are (see the sketch after this list):
- `InMemoryRowGroup::fetch` uses the expanded selection based on the provided cache `ProjectionMask`.
- `ArrayReaderBuilder::build_reader` uses a `CachedArrayReader` instead of the regular reader, also based on the cache `ProjectionMask`.
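To make the two sources concrete, here is a self-contained sketch of how the cache `ProjectionMask` gates both paths. The types and helpers are stand-ins for the arrow-rs internals, not the real API:

```rust
// Stand-in for the arrow-rs ProjectionMask; names and shapes are
// illustrative only.
struct ProjectionMask {
    leaves: Vec<bool>, // true = leaf column included
}

impl ProjectionMask {
    fn leaf_included(&self, idx: usize) -> bool {
        self.leaves.get(idx).copied().unwrap_or(false)
    }
}

// Source 1: InMemoryRowGroup::fetch widens the set of columns to
// fetch with the cache projection, pulling in extra data pages.
fn columns_to_fetch(projection: &ProjectionMask, cache: &ProjectionMask) -> Vec<usize> {
    (0..projection.leaves.len())
        .filter(|&i| projection.leaf_included(i) || cache.leaf_included(i))
        .collect()
}

// Source 2: build_reader wraps any column in the cache projection in
// a CachedArrayReader rather than the plain reader.
#[derive(Debug, PartialEq)]
enum Reader {
    Plain(usize),
    Cached(usize),
}

fn build_column_reader(col: usize, cache: &ProjectionMask) -> Reader {
    if cache.leaf_included(col) {
        Reader::Cached(col)
    } else {
        Reader::Plain(col)
    }
}

fn main() {
    let projection = ProjectionMask { leaves: vec![true, false, false] };
    let cache = ProjectionMask { leaves: vec![false, true, false] };

    // Column 1 is fetched only because of the cache projection...
    assert_eq!(columns_to_fetch(&projection, &cache), vec![0, 1]);
    // ...and it is routed through the cached reader.
    assert_eq!(build_column_reader(1, &cache), Reader::Cached(1));
}
```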
Thus the most straightforward solution I've found is to return `None` from `ReaderFactory::compute_cache_projection` when `max_predicate_cache_size` is 0, causing the reader to fetch only the necessary pages:
```diff
 fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
+    if self.max_predicate_cache_size == 0 {
+        return None;
+    }
     ...
 }
```
`ReaderFactory::read_row_group` remains the same, since it already handles the possibility of `compute_cache_projection` returning `None`:

```rust
let cache_projection = match self.compute_cache_projection(&projection) {
    Some(projection) => projection,
    None => ProjectionMask::none(meta.columns().len()),
};
```
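With the all-false mask produced by `ProjectionMask::none`, no column is covered by the cache projection, so `InMemoryRowGroup::fetch` keeps the unexpanded selection and `build_reader` never wraps a column in a `CachedArrayReader`, which is exactly the behavior wanted when the cache is disabled.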
@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so, I can open a PR.
To Reproduce
Expected behavior
Retrieve only the minimum required data pages when `max_predicate_cache_size` is set to 0.
Additional context