@alamb alamb commented Jul 9, 2025

Which issue does this PR close?

Rationale for this change

In order to enable filter_pushdown by default, we need to ensure it doesn't regress existing performance.

However, it has been very hard to make forward progress on improving filter pushdown because all our benchmarks compare runs with filter pushdown on against runs with it off, so the bar for change is quite high.
Here is the most recent example:

It seems obvious, but the right way to measure improvements to filter pushdown is to compare against a baseline where filter pushdown is already on. However, we don't have any such benchmark (see #16729 and #16730 for why the existing benchmarks are not good enough).

What changes are included in this PR?

Add a benchmark (clickbench_pushdown) that turns on filter_pushdown and reorder_filters.

You can run it like this:

`./benchmarks/bench.sh run clickbench_pushdown`

Which then invokes

+ cargo run --release --bin dfbench -- clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json

Are these changes tested?

I tested it manually. You can see the Q30 time increase when --pushdown is enabled, as expected.

with --pushdown:

...     Running `target/profiling/dfbench clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: true, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
Query 30 avg time: 492.42 ms

without --pushdown:

...
     Running `target/profiling/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
Query 30 avg time: 283.43 ms
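
To put a number on the Q30 difference, here is a minimal sketch that compares the two runs. It is not part of the PR or of `dfbench`; the timings are copied by hand from the output above, and the `slowdown` helper is a hypothetical name used only for illustration.

```python
from statistics import mean

# Per-iteration Q30 timings in milliseconds, copied from the two runs above.
with_pushdown = [546.0, 503.1, 488.2, 462.6, 462.3]
without_pushdown = [305.7, 289.1, 287.7, 266.3, 268.3]


def slowdown(candidate, baseline):
    """Ratio of mean times; a value above 1.0 means `candidate` is slower."""
    return mean(candidate) / mean(baseline)


ratio = slowdown(with_pushdown, without_pushdown)
print(f"with --pushdown:    {mean(with_pushdown):.1f} ms")
print(f"without --pushdown: {mean(without_pushdown):.1f} ms")
print(f"slowdown:           {ratio:.2f}x")
```

With these numbers Q30 is roughly 1.7x slower with --pushdown. The means are recomputed from the rounded per-iteration values, so they can differ from the reported averages in the last digit.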

Are there any user-facing changes?

No, this is a development process change only.

@alamb alamb marked this pull request as draft July 9, 2025 21:02
@alamb alamb marked this pull request as ready for review July 9, 2025 21:39
@zhuqi-lucas zhuqi-lucas left a comment

LGTM, thank you @alamb!

This is very helpful for monitoring performance in the pushdown case!

alamb commented Jul 10, 2025

I tested this benchmark with our filter pushdown work here, and I think it is useful

#16711 (comment)

Thank you @zhuqi-lucas for the review

@alamb alamb left a comment

Thank you @timsaucer and @zhuqi-lucas

@alamb alamb merged commit 18a30ce into apache:main Jul 15, 2025
27 checks passed
@alamb alamb deleted the alamb/new_filter_pushdown branch July 15, 2025 17:48

Development

Successfully merging this pull request may close these issues.

Add a datafusion benchmark for filter_pushdown

3 participants