Add parquet-filter and sort benchmarks to dfbench #7120

Merged · 7 commits into apache:main · Aug 14, 2023

Conversation

@alamb (Contributor) commented on Jul 27, 2023

Note: this looks like a large change, but it is mostly moving code around rather than changing any logic.

Which issue does this PR close?

Part of #7052

Rationale for this change

see #7052

TLDR is that making benchmarks easier to run means more people will find them and run them :)

What changes are included in this PR?

  1. Combine / consolidate the parquet filter pushdown and sort benchmarks
  2. Update documentation
  3. Inline the help text into the tool

Like #7054, this PR keeps the old entrypoint (`parquet`) working as well.

So these two commands do the same thing (run the filter pushdown benchmark):

```shell
# New
cargo run --bin dfbench -- parquet-filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp

# Old
cargo run --bin parquet filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
```

Likewise for the sort benchmark:

```shell
# New
cargo run --bin dfbench sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp

# Old
cargo run --bin parquet sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
```
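For readers unfamiliar with how a single binary exposes multiple benchmarks like this, here is a minimal, self-contained sketch of the subcommand pattern. It is not the actual dfbench source: the struct, field names, and defaults are hypothetical, and it assumes clap 4 with the `derive` feature.

```rust
use clap::{Args, Parser};

/// Hypothetical options shared by the benchmarks in this sketch.
#[derive(Debug, Args)]
struct RunOpt {
    /// Number of times to run each benchmark query
    #[arg(long, default_value_t = 3)]
    iterations: usize,
    /// Number of partitions to process in parallel
    #[arg(long, default_value_t = 2)]
    partitions: usize,
    /// Controls the size of the generated dataset
    #[arg(long, default_value_t = 1.0)]
    scale_factor: f64,
    /// Directory for the generated data files
    #[arg(long)]
    path: std::path::PathBuf,
}

/// Each benchmark is one variant; clap renames variants to kebab-case,
/// so these become the `parquet-filter` and `sort` subcommands.
#[derive(Debug, Parser)]
enum Options {
    /// Test performance of parquet filter pushdown
    ParquetFilter(RunOpt),
    /// Test performance of sorting
    Sort(RunOpt),
}

fn main() {
    match Options::parse() {
        Options::ParquetFilter(opt) => println!("run parquet-filter: {opt:?}"),
        Options::Sort(opt) => println!("run sort: {opt:?}"),
    }
}
```

Under this pattern, one way to keep the old `parquet` entrypoint working is a second `[[bin]]` target whose `main` dispatches to the same benchmark code.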

The README now shows the tool's inlined help text:

```shell
cargo run --bin dfbench -- parquet-filter --help
```

```text
dfbench-parquet-filter 28.0.0
Test performance of parquet filter pushdown

The queries are executed on a synthetic dataset generated during
the benchmark execution and designed to simulate web server access
logs.

Example

dfbench parquet-filter --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `scale_factor`
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different
`ParquetScanOptions` settings.

Example output:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
```
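To make the loop in the example output concrete, here is a hedged sketch of how each filter might be run once per scan-option combination. The struct and the three combinations mirror the output above; the timing loop and row counts are placeholders, not the actual benchmark code.

```rust
use std::time::Instant;

/// Mirrors the `ParquetScanOptions` values printed in the example output.
#[derive(Debug, Clone, Copy)]
struct ParquetScanOptions {
    pushdown_filters: bool,
    reorder_predicates: bool,
    enable_page_index: bool,
}

/// The three combinations from the output: baseline (no pushdown),
/// pushdown with predicate reordering, and pushdown without reordering.
fn scan_option_combinations() -> Vec<ParquetScanOptions> {
    vec![
        ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false },
        ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true },
        ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true },
    ]
}

fn main() {
    let iterations = 3;
    for opts in scan_option_combinations() {
        println!("Using scan options {opts:?}");
        for i in 0..iterations {
            let start = Instant::now();
            // Placeholder: execute the filter query with `opts` here and
            // count the rows it returns.
            let rows: u64 = 0;
            println!(
                "Iteration {i} returned {rows} rows in {} ms",
                start.elapsed().as_millis()
            );
        }
    }
}
```

Comparing the baseline runs (all ~10.7M rows returned) with the pushdown runs (~1.8M rows) shows the filter being applied during the scan rather than afterwards.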

Are these changes tested?

I tested them manually, both alone and with `bench.sh`.

Are there any user-facing changes?

No, this is a development tool.

@alamb alamb marked this pull request as ready for review July 27, 2023 21:52
@alamb alamb marked this pull request as draft July 28, 2023 10:06
@alamb alamb marked this pull request as ready for review July 28, 2023 14:53
@alamb (Contributor, Author) commented on Aug 14, 2023

Thanks @Dandandan 🙏

@alamb merged commit 2ec0bc1 into apache:main on Aug 14, 2023
@alamb deleted the alamb/parquet_and_sort branch on Aug 17, 2023