[Parquet] Avoid fetching multiple pages when max_predicate_cache_size is 0 #8542

@nuno-faria

Description

Describe the bug

#7850 introduced the cached array reader, which causes multiple data pages to be fetched if their size is less than the batch_size. Depending on the file, workload, and batch size, this might cause regressions in performance (e.g., apache/datafusion#17575).

While setting max_predicate_cache_size to 0 essentially disables the predicate cache, multiple pages are still unnecessarily retrieved.

The sources of this issue are:

  • The InMemoryRowGroup::fetch will use the expanded selection based on the provided cache ProjectionMask.
  • The ArrayReaderBuilder::build_reader will use a CachedArrayReader instead of the regular reader, also based on the cache ProjectionMask.
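
To illustrate why an expanded selection can pull in extra pages, here is a toy model rather than the actual arrow-rs code. It assumes the expansion rounds the selected rows outward to batch_size boundaries; `PageRows`, `expand_to_batch`, and `pages_to_fetch` are hypothetical names made up for this sketch.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a data page: the half-open range of rows it covers.
type PageRows = Range<usize>;

/// Rounds a selected row range outward to multiples of `batch_size`,
/// clamped to `total_rows`. ASSUMPTION: this is only a simplified model of
/// what "expanded selection" means, not the library's implementation.
fn expand_to_batch(sel: &Range<usize>, batch_size: usize, total_rows: usize) -> Range<usize> {
    let start = (sel.start / batch_size) * batch_size;
    let end = (((sel.end + batch_size - 1) / batch_size) * batch_size).min(total_rows);
    start..end
}

/// Returns the indices of the pages that overlap the selected rows.
fn pages_to_fetch(pages: &[PageRows], sel: &Range<usize>) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| p.start < sel.end && sel.start < p.end)
        .map(|(i, _)| i)
        .collect()
}
```

With ten pages of 100 rows each, a selection of rows 250..260, and a batch size of 400, the exact selection touches only page 2 in this model, while the expanded selection 0..400 touches pages 0 through 3.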

Thus the most straightforward solution I've found is to return None from ReaderFactory::compute_cache_projection when max_predicate_cache_size is 0, so that the reader fetches only the necessary pages:

    fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
+       if self.max_predicate_cache_size == 0 {
+           return None;
+       }
       ...
    }

ReaderFactory::read_row_group remains the same, since it already handles the case where compute_cache_projection returns None:

        let cache_projection = match self.compute_cache_projection(&projection) {
            Some(projection) => projection,
            None => ProjectionMask::none(meta.columns().len()),
        };
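
For reference, the two pieces can be put together in a self-contained sketch. It does not use the parquet crate: `Mask` and `Factory` are hypothetical, heavily reduced stand-ins for ProjectionMask and ReaderFactory, and the body after the early return is a placeholder assumption, since the real logic is elided above.

```rust
/// Hypothetical stand-in for `ProjectionMask`: one flag per leaf column.
#[derive(Debug, Clone, PartialEq)]
struct Mask(Vec<bool>);

impl Mask {
    /// Plays the role of `ProjectionMask::none(n)`: selects no columns.
    fn none(num_columns: usize) -> Self {
        Mask(vec![false; num_columns])
    }

    /// True when no column is selected.
    fn is_empty(&self) -> bool {
        !self.0.iter().any(|&selected| selected)
    }
}

/// Hypothetical, reduced stand-in for `ReaderFactory`.
struct Factory {
    max_predicate_cache_size: usize,
    /// Columns read by the filter predicates (assumed input for this sketch).
    predicate_columns: Mask,
}

impl Factory {
    fn compute_cache_projection(&self, projection: &Mask) -> Option<Mask> {
        // Proposed change: a zero-sized cache means there is nothing to cache.
        if self.max_predicate_cache_size == 0 {
            return None;
        }
        // PLACEHOLDER ASSUMPTION: cache the columns that appear in both the
        // predicates and the output projection. The real logic is elided.
        let both = self
            .predicate_columns
            .0
            .iter()
            .zip(&projection.0)
            .map(|(&in_predicate, &in_output)| in_predicate && in_output)
            .collect();
        let mask = Mask(both);
        if mask.is_empty() { None } else { Some(mask) }
    }

    /// Same fallback as in `read_row_group`: `None` becomes an empty mask.
    fn cache_projection_or_none(&self, projection: &Mask, num_columns: usize) -> Mask {
        self.compute_cache_projection(projection)
            .unwrap_or_else(|| Mask::none(num_columns))
    }
}
```

In this sketch, a factory with max_predicate_cache_size set to 0 always ends up with an empty mask, which is the state in which neither the expanded selection nor the CachedArrayReader should apply to any column.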

@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so I can open a PR.

To Reproduce

See apache/datafusion#17575.

Expected behavior

Retrieve only the minimum required data pages if max_predicate_cache_size is set to 0.

Additional context
