Add `clickbench_pushdown` benchmark #16731

alamb · 2025-07-09T21:02:26Z

Which issue does this PR close?

Related to Enable parquet filter pushdown (filter_pushdown) by default #3463
Closes Add a datafusion benchmark for filter_pushdown #16729

Rationale for this change

In order to enable filter_pushdown by default, we need to ensure it doesn't regress existing performance.

However, it has been very hard to make forward progress on improving filter pushdown because all our benchmarks compare filter pushdown enabled against filter pushdown disabled, so the bar for change is quite high.
Here is the most recent example:

POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) #16711

It seems obvious, but the right metric for improvements to filter pushdown is a comparison against a baseline where filter pushdown is already on. However, we don't have any such benchmark (see #16729 and #16730 for why the existing benchmarks are not good enough).

What changes are included in this PR?

Add a benchmark (clickbench_pushdown) that turns on filter_pushdown and reorder_filters.

You can run it like this:

`./benchmarks/bench.sh run clickbench_pushdown`

Which then invokes

+ cargo run --release --bin dfbench -- clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json
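For A/B runs it can help to build the same `dfbench` invocation with and without `--pushdown`. The following is a minimal sketch, not part of this PR: the helper name `dfbench_clickbench_command` and the relative paths are hypothetical placeholders, and only flags that appear in the command above (plus `--query`, which is used in the test runs in this description) are emitted.

```python
import shlex


def dfbench_clickbench_command(data_path, queries_path, output_path, *,
                               pushdown, iterations=5, query=None):
    """Build the argv for a ClickBench run (hypothetical helper, not in the PR)."""
    cmd = ["cargo", "run", "--release", "--bin", "dfbench", "--", "clickbench"]
    if pushdown:
        # The only difference between the two variants being compared.
        cmd.append("--pushdown")
    cmd += [
        "--iterations", str(iterations),
        "--path", data_path,
        "--queries-path", queries_path,
        "-o", output_path,
    ]
    if query is not None:
        # Restrict the run to a single query, e.g. 30.
        cmd += ["--query", str(query)]
    return cmd


# Placeholder paths (assumptions); substitute the locations of your checkout.
for pushdown in (True, False):
    cmd = dfbench_clickbench_command(
        "benchmarks/data/hits_partitioned",
        "benchmarks/queries/clickbench/queries",
        "benchmarks/results/example/clickbench_partitioned.json",
        pushdown=pushdown,
        query=30,
    )
    print(shlex.join(cmd))
```

The sketch only prints the two command lines; passing `cmd` to `subprocess.run` would execute them.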

Are these changes tested?

I tested it manually. You can see that Q30 takes longer when --pushdown is enabled, as expected.

with --pushdown:

...     Running `target/profiling/dfbench clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: true, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
Query 30 avg time: 492.42 ms

without --pushdown:

...
     Running `target/profiling/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
Query 30 avg time: 283.43 ms
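To put a number on the Q30 difference, the two runs can be compared directly from their per-iteration log lines. This is a small sketch using only the timings quoted above; the regular expression and variable names are illustrative and are not part of `dfbench`.

```python
import re
from statistics import mean

# Per-iteration log lines copied from the two Q30 runs above.
WITH_PUSHDOWN_LOG = """\
Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
"""

WITHOUT_PUSHDOWN_LOG = """\
Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
"""

ITERATION_RE = re.compile(r"Query \d+ iteration \d+ took ([\d.]+) ms")


def iteration_times_ms(log):
    """Extract the per-iteration wall-clock times (in ms) from a log excerpt."""
    return [float(ms) for ms in ITERATION_RE.findall(log)]


with_pushdown = iteration_times_ms(WITH_PUSHDOWN_LOG)
without_pushdown = iteration_times_ms(WITHOUT_PUSHDOWN_LOG)

avg_with = mean(with_pushdown)
avg_without = mean(without_pushdown)
slowdown = avg_with / avg_without

print(f"with --pushdown:    {avg_with:.2f} ms")
print(f"without --pushdown: {avg_without:.2f} ms")
print(f"slowdown:           {slowdown:.2f}x")
```

On these numbers Q30 is roughly 1.74x slower with `--pushdown`. The recomputed averages can differ from the reported `avg time` in the last digit, presumably because the per-iteration times are printed rounded to 0.1 ms.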

Are there any user-facing changes?

No, this is a development process change only.

zhuqi-lucas

LGTM, thank you @alamb!

This is very helpful for monitoring performance in the pushdown case!

alamb · 2025-07-10T12:27:43Z

I tested this benchmark with our filter pushdown work here, and I think it is useful:

#16711 (comment)

Thank you @zhuqi-lucas for the review

alamb

Thank you @timsaucer and @zhuqi-lucas

Add clickbench_pushdown benchmark

5cda900

alamb marked this pull request as draft July 9, 2025 21:02

alamb marked this pull request as ready for review July 9, 2025 21:39

alamb mentioned this pull request Jul 9, 2025

POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) #16711

Closed

zhuqi-lucas approved these changes Jul 10, 2025


alamb added 2 commits July 10, 2025 08:29

adjust benchmark name

7ffd8e9

Merge branch 'main' into alamb/new_filter_pushdown

e18b1a3

timsaucer approved these changes Jul 15, 2025


alamb commented Jul 15, 2025


alamb merged commit 18a30ce into apache:main Jul 15, 2025
27 checks passed

alamb deleted the alamb/new_filter_pushdown branch July 15, 2025 17:48
