Add parquet-filter and sort benchmarks to dfbench #7120

Merged
merged 7 commits into from
Aug 14, 2023
Changes from 6 commits
166 changes: 92 additions & 74 deletions benchmarks/README.md
@@ -229,31 +229,14 @@ This will produce output like
└──────────────┴──────────────┴──────────────┴───────────────┘
```

### Expected output
# Benchmark Runner

The result of query 1 should produce the following output when executed against the SF=1 dataset.
The `dfbench` program contains subcommands to run the various
benchmarks. When benchmarking, it should always be built in release
mode using `--release`.

```
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| A | F | 37734107 | 56586554400.73001 | 53758257134.870026 | 55909065222.82768 | 25.522005853257337 | 38273.12973462168 | 0.049985295838396455 | 1478493 |
| N | F | 991417 | 1487504710.3799996 | 1413082168.0541 | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622 | 38854 |
| N | O | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803 | 0.049996589476752576 | 2920373 |
| R | F | 37719753 | 56568041380.90001 | 53741292684.60399 | 55889619119.83194 | 25.50579361269077 | 38250.854626099666 | 0.05000940583012587 | 1478870 |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
Query 1 iteration 0 took 1956.1 ms
Query 1 avg time: 1956.11 ms
```

# Benchmark Descriptions

## `dfbench`

The `dfbench` program contains subcommands to run various benchmarks.

Full help can be found in the relevant sub command. For example to get help for tpch,
run `cargo run --release --bin dfbench tpch --help`
Full help for each benchmark can be found in the relevant
subcommand. For example, to get help for tpch, run

```shell
cargo run --release --bin dfbench --help
@@ -265,13 +248,95 @@ USAGE:
dfbench <SUBCOMMAND>

SUBCOMMANDS:
clickbench Run the clickbench benchmark
help Prints this message or the help of the given subcommand(s)
tpch Run the tpch benchmark.
tpch-convert Convert tpch .slt files to .parquet or .csv files
clickbench Run the clickbench benchmark
help Prints this message or the help of the given subcommand(s)
parquet-filter Test performance of parquet filter pushdown
sort Test performance of parquet filter pushdown
tpch Run the tpch benchmark.
tpch-convert Convert tpch .slt files to .parquet or .csv files

```

# Benchmarks

The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience.

## ClickBench

The ClickBench[1] benchmarks are widely cited in the industry and
focus on grouping / aggregation / filtering. This runner uses the
scripts and queries from [2].

[1]: https://github.com/ClickHouse/ClickBench
[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

## Parquet Filter

Test performance of parquet filter pushdown

The queries are executed on a synthetic dataset generated during
the benchmark execution and designed to simulate web server access
logs.

Example

dfbench parquet-filter --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through `--scale-factor`
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter, the query is run with several different
`ParquetScanOptions` settings.

Example output:

```
Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
```
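
To compare scan configurations, it can help to average the per-iteration timings. The following is a minimal sketch, not part of `dfbench`: the helper name `mean_ms` and the short labels are assumptions made for illustration, and the timings are copied from the example output above.

```rust
/// Mean of a slice of per-iteration timings, in milliseconds.
/// (Hypothetical helper, not part of dfbench.)
fn mean_ms(timings: &[u64]) -> f64 {
    if timings.is_empty() {
        return 0.0;
    }
    timings.iter().sum::<u64>() as f64 / timings.len() as f64
}

fn main() {
    // Timings taken from the example output; labels are shorthand for
    // the ParquetScanOptions printed before each group of iterations.
    let runs: [(&str, [u64; 3]); 3] = [
        ("no pushdown", [1303, 1288, 1266]),
        ("pushdown + reorder + page index", [1970, 2002, 1988]),
        ("pushdown + page index", [1940, 1986, 1947]),
    ];

    for (label, timings) in runs.iter() {
        println!("{}: {:.1} ms", label, mean_ms(timings));
    }
}
```

In this particular sample run, the configurations with `pushdown_filters: true` return fewer rows (1781686 instead of 10699521) but take longer per iteration, which is the kind of difference this benchmark is meant to surface.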

## Sort

Test performance of sorting large datasets

This test sorts a synthetic dataset generated during the
benchmark execution, designed to simulate sorting web server
access logs. Such sorting is often done during data transformation
steps.

The tests sort the entire dataset using several different sort
orders.
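
As a rough illustration of what "several different sort orders" over access-log data means, here is a minimal sketch using only the Rust standard library. It does not use DataFusion and is not how the benchmark is implemented; the `LogRow` struct, its column names, and the three comparison functions are all hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical access-log row; the column names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct LogRow {
    service: String,
    request_bytes: u64,
    timestamp: u64,
}

fn row(service: &str, request_bytes: u64, timestamp: u64) -> LogRow {
    LogRow { service: service.to_string(), request_bytes, timestamp }
}

/// Single-column order: timestamp ascending.
fn by_timestamp(a: &LogRow, b: &LogRow) -> Ordering {
    a.timestamp.cmp(&b.timestamp)
}

/// Multi-column order: service ascending, then timestamp ascending.
fn by_service_then_time(a: &LogRow, b: &LogRow) -> Ordering {
    a.service.cmp(&b.service).then(a.timestamp.cmp(&b.timestamp))
}

/// Descending order on a numeric column.
fn by_bytes_desc(a: &LogRow, b: &LogRow) -> Ordering {
    b.request_bytes.cmp(&a.request_bytes)
}

/// Sort a copy of the whole dataset with the given order.
fn sorted_by(rows: &[LogRow], order: fn(&LogRow, &LogRow) -> Ordering) -> Vec<LogRow> {
    let mut out = rows.to_vec();
    out.sort_by(order);
    out
}

fn main() {
    let rows = vec![row("api", 120, 3), row("web", 500, 1), row("api", 300, 2)];

    println!("{:?}", sorted_by(&rows, by_timestamp));
    println!("{:?}", sorted_by(&rows, by_service_then_time));
    println!("{:?}", sorted_by(&rows, by_bytes_desc));
}
```

Each call sorts the entire dataset from scratch, mirroring the idea that every sort order is measured independently.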

## TPCH

Run the tpch benchmark.

This benchmark is derived from the [TPC-H][1] version
[2.17.1]. The data and answers are generated using `tpch-gen` from
[2].

[1]: http://www.tpc.org/tpch/
[2]: https://github.com/databricks/tpch-dbgen.git
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf


# Older Benchmarks

## NYC Taxi Benchmark

These benchmarks are based on the [New York Taxi and Limousine Commission][2] data set.
Expand Down Expand Up @@ -317,50 +382,3 @@ h2o groupby query 1 took 1669 ms

[1]: http://www.tpc.org/tpch/
[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

## Parquet benchmarks

This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.

To run filter benchmarks, run:

```base
cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
```

This will generate the synthetic dataset at `./data/logs.parquet`. The size of the dataset can be controlled through the `size_factor`
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different `ParquetScanOption` settings.

Example run:

```
Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
```

Similarly, to run sorting benchmarks, run:

```base
cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
```

This proceeds in the same way as the filter benchmarks: each sort expression
combination will be run using the same set of `ParquetScanOption` as the
filter benchmarks.
1 change: 1 addition & 0 deletions benchmarks/bench.sh
@@ -182,6 +182,7 @@ main() {
# navigate to the appropriate directory
pushd "${DATAFUSION_DIR}/benchmarks" > /dev/null
mkdir -p "${RESULTS_DIR}"
mkdir -p "${DATA_DIR}"
case "$BENCHMARK" in
all)
run_tpch "1"
6 changes: 5 additions & 1 deletion benchmarks/src/bin/dfbench.rs
@@ -28,14 +28,16 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
#[global_allocator]
static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;

use datafusion_benchmarks::{clickbench, tpch};
use datafusion_benchmarks::{clickbench, parquet_filter, sort, tpch};

#[derive(Debug, StructOpt)]
#[structopt(about = "benchmark command")]
enum Options {
Tpch(tpch::RunOpt),
TpchConvert(tpch::ConvertOpt),
Clickbench(clickbench::RunOpt),
ParquetFilter(parquet_filter::RunOpt),
Sort(sort::RunOpt),
}

// Main benchmark runner entrypoint
@@ -47,5 +49,7 @@ pub async fn main() -> Result<()> {
Options::Tpch(opt) => opt.run().await,
Options::TpchConvert(opt) => opt.run().await,
Options::Clickbench(opt) => opt.run().await,
Options::ParquetFilter(opt) => opt.run().await,
Options::Sort(opt) => opt.run().await,
}
}
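
The change above follows a simple pattern: one enum variant per subcommand, plus one `match` arm that dispatches to it. The sketch below shows the same shape using only the standard library. It is an assumption-laden illustration, not the PR's code: the real binary derives its parser with `StructOpt` and each variant carries a `RunOpt` whose `run()` is awaited, whereas here the `Subcommand` enum and `parse_subcommand` function are hypothetical stand-ins. The kebab-case names are taken from the `dfbench --help` output in the README.

```rust
/// Hypothetical stand-in for the `Options` enum in dfbench.rs,
/// without the per-benchmark option structs.
#[derive(Debug, PartialEq)]
enum Subcommand {
    Tpch,
    TpchConvert,
    Clickbench,
    ParquetFilter,
    Sort,
}

/// Map a command-line name (as listed by `dfbench --help`) to a variant.
fn parse_subcommand(name: &str) -> Option<Subcommand> {
    match name {
        "tpch" => Some(Subcommand::Tpch),
        "tpch-convert" => Some(Subcommand::TpchConvert),
        "clickbench" => Some(Subcommand::Clickbench),
        "parquet-filter" => Some(Subcommand::ParquetFilter),
        "sort" => Some(Subcommand::Sort),
        _ => None,
    }
}

fn main() {
    // Default to "sort" when no argument is given, purely for the demo.
    let arg = std::env::args().nth(1).unwrap_or_else(|| "sort".to_string());

    match parse_subcommand(&arg) {
        // In the real binary each arm would call `opt.run().await`.
        Some(cmd) => println!("would run {:?}", cmd),
        None => eprintln!("unknown subcommand: {}", arg),
    }
}
```

Because the `match` in the real `main` is exhaustive over the enum, adding a variant such as `ParquetFilter` without a matching arm fails to compile, which is why the PR touches both places.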