Describe the bug
#7850 introduced the cached array reader, which causes multiple data pages to be fetched when their size is less than the `batch_size`. Depending on the file, workload, and batch size, this can cause performance regressions (e.g., apache/datafusion#17575). While setting `max_predicate_cache_size` to 0 essentially disables the predicate cache, multiple pages are still fetched unnecessarily.
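For reference, a minimal sketch of how a user would opt out of the cache today, assuming `with_max_predicate_cache_size` is the builder option that exposes this setting (the exact entry point may differ across arrow-rs versions):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// A minimal sketch, assuming `with_max_predicate_cache_size` is the
// builder option exposing this setting; adjust to the entry point in
// your arrow-rs version.
fn read_with_cache_disabled(file: File) -> parquet::errors::Result<()> {
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        // Intended to disable the predicate cache, but the expanded
        // page fetches described above still happen (the bug).
        .with_max_predicate_cache_size(0)
        .build()?;
    for batch in reader {
        let _ = batch?;
    }
    Ok(())
}
```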
The sources of this issue are (see the sketch after this list):
- `InMemoryRowGroup::fetch` uses the expanded selection based on the provided cache `ProjectionMask`.
- `ArrayReaderBuilder::build_reader` uses a `CachedArrayReader` instead of the regular reader, also based on the cache `ProjectionMask`.
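To make the two sources concrete, here is a self-contained sketch of how the cache `ProjectionMask` gates both paths. The types and helpers are stand-ins for the arrow-rs internals, not the real API:

```rust
// Stand-in for the arrow-rs ProjectionMask; names and shapes are
// illustrative only.
struct ProjectionMask {
    leaves: Vec<bool>, // true = leaf column included
}

impl ProjectionMask {
    fn leaf_included(&self, idx: usize) -> bool {
        self.leaves.get(idx).copied().unwrap_or(false)
    }
}

// Source 1: InMemoryRowGroup::fetch widens the set of columns to
// fetch with the cache projection, pulling in extra data pages.
fn columns_to_fetch(projection: &ProjectionMask, cache: &ProjectionMask) -> Vec<usize> {
    (0..projection.leaves.len())
        .filter(|&i| projection.leaf_included(i) || cache.leaf_included(i))
        .collect()
}

// Source 2: build_reader wraps any column in the cache projection in
// a CachedArrayReader rather than the plain reader.
#[derive(Debug, PartialEq)]
enum Reader {
    Plain(usize),
    Cached(usize),
}

fn build_column_reader(col: usize, cache: &ProjectionMask) -> Reader {
    if cache.leaf_included(col) {
        Reader::Cached(col)
    } else {
        Reader::Plain(col)
    }
}

fn main() {
    let projection = ProjectionMask { leaves: vec![true, false, false] };
    let cache = ProjectionMask { leaves: vec![false, true, false] };

    // Column 1 is fetched only because of the cache projection...
    assert_eq!(columns_to_fetch(&projection, &cache), vec![0, 1]);
    // ...and it is routed through the cached reader.
    assert_eq!(build_column_reader(1, &cache), Reader::Cached(1));
}
```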
Thus the most straightforward solution I've found is to return `None` from `ReaderFactory::compute_cache_projection` when `max_predicate_cache_size` is 0, causing the reader to fetch only the necessary pages:
```diff
 fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
+    if self.max_predicate_cache_size == 0 {
+        return None;
+    }
     ...
 }
```
`ReaderFactory::read_row_group` remains the same, since it already handles the possibility of `compute_cache_projection` returning `None`:

```rust
let cache_projection = match self.compute_cache_projection(&projection) {
    Some(projection) => projection,
    None => ProjectionMask::none(meta.columns().len()),
};
```
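With the all-false mask produced by `ProjectionMask::none`, no column is covered by the cache projection, so `InMemoryRowGroup::fetch` keeps the unexpanded selection and `build_reader` never wraps a column in a `CachedArrayReader`, which is exactly the behavior wanted when the cache is disabled.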
@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so, I can open a PR.
To Reproduce
Expected behavior
Retrieve only the minimum required data pages when `max_predicate_cache_size` is set to 0.
Additional context