Improve parquet row group pruning metrics display #18355

@2010YOUY01

Is your feature request related to a problem or challenge?

Follow-up to #18321
Original discussion: #18321 (comment)

Background

For each row group, the parquet scanner tries to prune it in the following order:

  1. Check whether the row group can be pruned by statistics (e.g. column a has statistics min=1, max=10 and the query predicate is a > 15, so the whole row group can be skipped).
  2. Check whether the row group can be pruned using the bloom filter, in a similar way.

Metrics can be used to check the pruning result.
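
The pruning order above can be sketched as follows. This is a minimal illustration, not DataFusion's actual implementation: `RowGroup`, `Outcome`, and `prune_eq` are hypothetical names, the predicate is assumed to be an equality (`col = target`), and the "bloom filter" is modeled as an exact list of values, whereas a real bloom filter can report false positives.

```rust
/// Hypothetical, simplified stand-in for one row group's metadata.
struct RowGroup {
    min: i64,
    max: i64,
    /// `None` models a row group that was written without a bloom filter.
    /// The toy "filter" is an exact list of the values in the row group.
    bloom_filter: Option<Vec<i64>>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    PrunedByStatistics,
    PrunedByBloomFilter,
    Kept,
}

/// Decide whether a row group can be skipped for the predicate `col = target`.
fn prune_eq(rg: &RowGroup, target: i64) -> Outcome {
    // Step 1: statistics. A value outside [min, max] cannot be present.
    if target < rg.min || target > rg.max {
        return Outcome::PrunedByStatistics;
    }
    // Step 2: bloom filter, only if one is available.
    if let Some(filter) = &rg.bloom_filter {
        if !filter.contains(&target) {
            return Outcome::PrunedByBloomFilter;
        }
    }
    // Neither check could rule the row group out, so it is scanned further.
    Outcome::Kept
}
```

Note that `Kept` is returned both when the bloom filter was consulted and found a possible match, and when no bloom filter exists at all. That is the same ambiguity this issue is about.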

Checking Metrics

In datafusion-cli, run:

CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';

set datafusion.explain.analyze_level = summary;

explain analyze select *
from lineitem
where l_orderkey = 3000000;

This produces the following parquet metrics:

DataSourceExec: ...metrics=[... row_groups_pruned_statistics=1 total → 1 matched,row_groups_pruned_bloom_filter=1 total → 1 matched,...]

row_groups_pruned_statistics=1 total → 1 matched means we start with 1 row group; its statistics were checked, and it could not be pruned.
row_groups_pruned_bloom_filter=1 total → 1 matched means no bloom filter is available, so the row group can't be skipped at this step either; the 1 matched row group proceeds to further checks.
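
To make the reading of these values concrete, here is a small sketch that parses a metric value of the form shown above. `parse_pruning_metric` and `pruned_count` are hypothetical helpers written for this issue, not part of DataFusion, and treating "pruned" as `total - matched` is an assumption drawn from the example output.

```rust
/// Parse a pruning metric value such as "1 total → 1 matched"
/// into `(total, matched)`. Returns `None` if the text has another shape.
fn parse_pruning_metric(value: &str) -> Option<(u64, u64)> {
    let (left, right) = value.split_once('→')?;
    let total = left.trim().strip_suffix("total")?.trim().parse::<u64>().ok()?;
    let matched = right.trim().strip_suffix("matched")?.trim().parse::<u64>().ok()?;
    Some((total, matched))
}

/// Assumed relationship: row groups that did not match were pruned.
fn pruned_count(total: u64, matched: u64) -> u64 {
    total.saturating_sub(matched)
}
```

For the output above, both metrics parse to `(1, 1)`, so neither step pruned anything.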

Note: the parquet table is generated using the setup in benchmarks/, and https://parquet-viewer.xiangpeng.systems/ can be used to check whether bloom filters are available.

Issue

row_groups_pruned_bloom_filter=1 total → 1 matched is ambiguous: we can't tell whether the bloom filter was checked and the row group could not be pruned, or the bloom filter was simply not available.
A better way to display this: if the bloom filter is unavailable, don't display this metric.
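
One possible shape for the suggested display, as a rough sketch only: the bloom filter counters are carried as an `Option`, and the metric is omitted when it is `None`. `PruningCounts` and `render_metrics` are hypothetical names for illustration and do not reflect how DataFusion's metrics are actually structured.

```rust
/// Hypothetical counters for one pruning step.
struct PruningCounts {
    total: usize,
    matched: usize,
}

/// Render the row group pruning metrics. `bloom` is `None` when no bloom
/// filter was available, in which case that metric is not displayed.
fn render_metrics(stats: &PruningCounts, bloom: Option<&PruningCounts>) -> String {
    let mut parts = vec![format!(
        "row_groups_pruned_statistics={} total → {} matched",
        stats.total, stats.matched
    )];
    if let Some(b) = bloom {
        parts.push(format!(
            "row_groups_pruned_bloom_filter={} total → {} matched",
            b.total, b.matched
        ));
    }
    parts.join(",")
}
```

With this shape, seeing row_groups_pruned_bloom_filter in the output would always mean that a bloom filter was actually checked.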

Labels: enhancement (New feature or request)
