Description
Is your feature request related to a problem or challenge?
Follow up to #18321
Original discussion #18321 (comment)
Background
For each row group, the Parquet scanner tries to prune it in the following order:
- Check whether the row group can be pruned by statistics (e.g. column a has statistics min=1, max=10, and the predicate in the query asks for rows where a > 15, so the whole row group can be skipped).
- Check whether the row group can be pruned using the bloom filter, similarly.
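The pruning order above can be illustrated with a minimal sketch. This is not DataFusion's implementation: `RowGroup`, `can_prune_by_statistics`, and `prune_for_equality` are hypothetical names, the row group is reduced to one integer column, and a plain Python set stands in for a real (probabilistic) bloom filter.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class RowGroup:
    # Hypothetical, simplified metadata for one integer column of a row group.
    min_value: int
    max_value: int
    # Stand-in for a bloom filter; None models a row group written without one.
    bloom_filter: Optional[Set[int]] = None


def can_prune_by_statistics(rg: RowGroup, lower_exclusive: int) -> bool:
    """For a predicate `a > lower_exclusive`: prune if no value can exceed the bound."""
    return rg.max_value <= lower_exclusive


def prune_for_equality(rg: RowGroup, value: int) -> str:
    """For a predicate `a = value`: check statistics first, then the bloom filter."""
    if value < rg.min_value or value > rg.max_value:
        return "pruned_by_statistics"
    if rg.bloom_filter is not None and value not in rg.bloom_filter:
        return "pruned_by_bloom_filter"
    # Either the filter says the value may be present, or there is no filter at all.
    return "kept"


# The example from the text: min=1, max=10, predicate a > 15.
print(can_prune_by_statistics(RowGroup(1, 10), 15))
```

Note that `prune_for_equality` returns `"kept"` both when the bloom filter was consulted and could not rule the value out, and when no bloom filter exists. Those two cases are indistinguishable to the caller unless they are reported separately.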
Metrics can be used to check the pruning result.
Checking Metrics
In datafusion-cli, run:
CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
set datafusion.explain.analyze_level = summary;
explain analyze select *
from lineitem
where l_orderkey = 3000000;
You will get the Parquet metrics:
DataSourceExec: ...metrics=[... row_groups_pruned_statistics=1 total → 1 matched,row_groups_pruned_bloom_filter=1 total → 1 matched,...]
row_groups_pruned_statistics=1 total → 1 matched means we start with 1 row group; its statistics were checked and it could not be pruned.
row_groups_pruned_bloom_filter=1 total → 1 matched means, in this case, that no bloom filter is available, so we can't skip the row group either; the 1 matched row group proceeds to further checks.
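To read these counts programmatically, the text form of the metrics shown above can be parsed with a small helper. This is only a sketch: `PruningMetric` and `parse_pruning_metrics` are hypothetical names, the regular expression assumes exactly the `name=N total → M matched` layout from the output above, and treating `total - matched` as the number of pruned row groups is an assumption based on the explanation in this issue.

```python
import re
from typing import List, NamedTuple


class PruningMetric(NamedTuple):
    name: str
    total: int
    matched: int

    @property
    def pruned(self) -> int:
        # Assumption: row groups that did not "match" were pruned at this step.
        return self.total - self.matched


# Matches the text form shown above, e.g. "name=1 total → 1 matched".
_METRIC_RE = re.compile(r"(\w+)=(\d+) total → (\d+) matched")


def parse_pruning_metrics(text: str) -> List[PruningMetric]:
    """Extract every `name=N total → M matched` entry from a metrics string."""
    return [
        PruningMetric(name, int(total), int(matched))
        for name, total, matched in _METRIC_RE.findall(text)
    ]


line = (
    "row_groups_pruned_statistics=1 total → 1 matched,"
    "row_groups_pruned_bloom_filter=1 total → 1 matched"
)
for metric in parse_pruning_metrics(line):
    print(metric.name, metric.total, metric.matched, metric.pruned)
```

Both parsed entries report zero pruned row groups, and nothing in the string says whether a bloom filter was actually consulted.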
Note: the Parquet table is generated using the setup in benchmarks/, and we can use https://parquet-viewer.xiangpeng.systems/ to check whether bloom filters are available.
Issue
row_groups_pruned_bloom_filter=1 total → 1 matched is ambiguous: we don't know whether the bloom filter was checked and found that the row group can't be pruned, or whether the bloom filter is not available at all.
A better way to display this: if the bloom filter is unavailable, don't display this metric.
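One possible shape for that behavior, as a rough sketch only: `format_pruning_metrics` is a hypothetical helper, not part of DataFusion, and it assumes the bloom filter counts are passed as `None` when no bloom filter was available for the scan.

```python
from typing import Optional, Tuple

# (total, matched) counts for one pruning step.
Counts = Tuple[int, int]


def format_pruning_metrics(statistics: Counts, bloom_filter: Optional[Counts]) -> str:
    """Render pruning metrics, omitting the bloom filter entry when it is None."""
    parts = [
        f"row_groups_pruned_statistics={statistics[0]} total → {statistics[1]} matched"
    ]
    if bloom_filter is not None:
        parts.append(
            f"row_groups_pruned_bloom_filter={bloom_filter[0]} total → {bloom_filter[1]} matched"
        )
    return ",".join(parts)


# No bloom filter available: only the statistics metric is shown.
print(format_pruning_metrics((1, 1), None))
# Bloom filter checked but unable to prune: both metrics are shown.
print(format_pruning_metrics((1, 1), (1, 1)))
```

With this convention, seeing `row_groups_pruned_bloom_filter=1 total → 1 matched` would always mean the bloom filter was consulted and could not prune the row group.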
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response