Add parquet-filter and sort benchmarks to dfbench #7120

Merged · 7 commits into apache:main · Aug 14, 2023

Conversation

@alamb (Contributor) commented on Jul 27, 2023

Note: this looks like a large change, but it is mostly moving code around rather than changing any logic.

Which issue does this PR close?

Part of #7052

Rationale for this change

see #7052

TLDR is that making benchmarks easier to run means more people will find them and run them :)

What changes are included in this PR?

  1. Combine / consolidate the parquet filter pushdown and sort benchmarks
  2. Update documentation
  3. Inline the help text into the tool

Like #7054, this PR keeps the old entrypoint (`parquet`) working as well.

So these two commands do the same thing (run the filter pushdown benchmark):

```shell
# New
cargo run --bin dfbench -- parquet-filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp

# Old
cargo run --bin parquet filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
```

Likewise for the sort benchmark:

```shell
# New
cargo run --bin dfbench sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp

# Old
cargo run --bin parquet sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
```
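For readers unfamiliar with how a single binary exposes multiple benchmarks like this, here is a minimal, self-contained sketch of the subcommand pattern. It is not the actual dfbench source: the struct, field names, and defaults are hypothetical, and it assumes clap 4 with the `derive` feature.

```rust
use clap::{Args, Parser};

/// Hypothetical options shared by the benchmarks in this sketch.
#[derive(Debug, Args)]
struct RunOpt {
    /// Number of times to run each benchmark query
    #[arg(long, default_value_t = 3)]
    iterations: usize,
    /// Number of partitions to process in parallel
    #[arg(long, default_value_t = 2)]
    partitions: usize,
    /// Controls the size of the generated dataset
    #[arg(long, default_value_t = 1.0)]
    scale_factor: f64,
    /// Directory for the generated data files
    #[arg(long)]
    path: std::path::PathBuf,
}

/// Each benchmark is one variant; clap renames variants to kebab-case,
/// so these become the `parquet-filter` and `sort` subcommands.
#[derive(Debug, Parser)]
enum Options {
    /// Test performance of parquet filter pushdown
    ParquetFilter(RunOpt),
    /// Test performance of sorting
    Sort(RunOpt),
}

fn main() {
    match Options::parse() {
        Options::ParquetFilter(opt) => println!("run parquet-filter: {opt:?}"),
        Options::Sort(opt) => println!("run sort: {opt:?}"),
    }
}
```

Under this pattern, one way to keep the old `parquet` entrypoint working is a second `[[bin]]` target whose `main` dispatches to the same benchmark code.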

The README now shows the tool's inlined help text:

```shell
cargo run --bin dfbench -- parquet-filter --help
```

```text
dfbench-parquet-filter 28.0.0
Test performance of parquet filter pushdown

The queries are executed on a synthetic dataset generated during
the benchmark execution and designed to simulate web server access
logs.

Example

dfbench parquet-filter --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `scale_factor`
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different
`ParquetScanOptions` settings.

Example output:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
```
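To make the loop in the example output concrete, here is a hedged sketch of how each filter might be run once per scan-option combination. The struct and the three combinations mirror the output above; the timing loop and row counts are placeholders, not the actual benchmark code.

```rust
use std::time::Instant;

/// Mirrors the `ParquetScanOptions` values printed in the example output.
#[derive(Debug, Clone, Copy)]
struct ParquetScanOptions {
    pushdown_filters: bool,
    reorder_predicates: bool,
    enable_page_index: bool,
}

/// The three combinations from the output: baseline (no pushdown),
/// pushdown with predicate reordering, and pushdown without reordering.
fn scan_option_combinations() -> Vec<ParquetScanOptions> {
    vec![
        ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false },
        ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true },
        ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true },
    ]
}

fn main() {
    let iterations = 3;
    for opts in scan_option_combinations() {
        println!("Using scan options {opts:?}");
        for i in 0..iterations {
            let start = Instant::now();
            // Placeholder: execute the filter query with `opts` here and
            // count the rows it returns.
            let rows: u64 = 0;
            println!(
                "Iteration {i} returned {rows} rows in {} ms",
                start.elapsed().as_millis()
            );
        }
    }
}
```

Comparing the baseline runs (all ~10.7M rows returned) with the pushdown runs (~1.8M rows) shows the filter being applied during the scan rather than afterwards.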

Are these changes tested?

I tested them manually, both alone and with `bench.sh`.

Are there any user-facing changes?

No, this is a development tool.

@alamb alamb marked this pull request as ready for review July 27, 2023 21:52
@alamb alamb marked this pull request as draft July 28, 2023 10:06
@alamb alamb marked this pull request as ready for review July 28, 2023 14:53
@alamb (Contributor, Author) commented on Aug 14, 2023

Thanks @Dandandan 🙏

@alamb merged commit 2ec0bc1 into apache:main on Aug 14, 2023
@alamb deleted the alamb/parquet_and_sort branch on Aug 17, 2023