Add parquet-filter and sort benchmarks to dfbench (#7120)
* Add parquet-filter and sort benchmarks to dfbench

* fix

* fix docs

* fix ci bench

* Update docs
alamb authored Aug 14, 2023
1 parent 563a1dc commit 2ec0bc1
Showing 13 changed files with 593 additions and 397 deletions.
150 changes: 84 additions & 66 deletions benchmarks/README.md
@@ -229,31 +229,14 @@ This will produce output like
└──────────────┴──────────────┴──────────────┴───────────────┘
```

### Expected output
# Benchmark Runner

The result of query 1 should produce the following output when executed against the SF=1 dataset.
The `dfbench` program contains subcommands to run the various
benchmarks. When benchmarking, it should always be built in release
mode using `--release`.

```
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| A | F | 37734107 | 56586554400.73001 | 53758257134.870026 | 55909065222.82768 | 25.522005853257337 | 38273.12973462168 | 0.049985295838396455 | 1478493 |
| N | F | 991417 | 1487504710.3799996 | 1413082168.0541 | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622 | 38854 |
| N | O | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803 | 0.049996589476752576 | 2920373 |
| R | F | 37719753 | 56568041380.90001 | 53741292684.60399 | 55889619119.83194 | 25.50579361269077 | 38250.854626099666 | 0.05000940583012587 | 1478870 |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
Query 1 iteration 0 took 1956.1 ms
Query 1 avg time: 1956.11 ms
```

# Benchmark Descriptions

## `dfbench`

The `dfbench` program contains subcommands to run various benchmarks.

Full help can be found in the relevant sub command. For example to get help for tpch,
run `cargo run --release --bin dfbench tpch --help`
Full help for each benchmark is available from the relevant
subcommand (for example, `dfbench tpch --help`). The top-level help
lists the available subcommands:

```shell
cargo run --release --bin dfbench --help
@@ -265,61 +248,52 @@ USAGE:
dfbench <SUBCOMMAND>

SUBCOMMANDS:
clickbench Run the clickbench benchmark
help Prints this message or the help of the given subcommand(s)
tpch Run the tpch benchmark.
tpch-convert Convert tpch .slt files to .parquet or .csv files
clickbench Run the clickbench benchmark
help Prints this message or the help of the given subcommand(s)
parquet-filter Test performance of parquet filter pushdown
sort Test performance of parquet filter pushdown
tpch Run the tpch benchmark.
tpch-convert Convert tpch .slt files to .parquet or .csv files

```
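
Each subcommand also accepts `--help`. As a minimal sketch, the helper below prints the help command for every subcommand listed above without running it. The function name `print_help_cmds` is illustrative and is not part of the repository.

```shell
# Illustrative helper (not part of the repository): print, without running,
# the help command for each dfbench subcommand listed in the help output above.
print_help_cmds() {
  for sub in clickbench parquet-filter sort tpch tpch-convert; do
    echo "cargo run --release --bin dfbench -- $sub --help"
  done
}

print_help_cmds
```

Copy any printed line into a shell in the `benchmarks` directory to see the options for that benchmark.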

## h2o benchmarks
# Benchmarks

```bash
cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
```
The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience.

Example run:
## ClickBench

```
Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
Executing select id1, sum(v1) as v1 from x group by id1
+-------+--------+
| id1 | v1 |
+-------+--------+
| id063 | 199420 |
| id094 | 200127 |
| id044 | 198886 |
...
| id093 | 200132 |
| id003 | 199047 |
+-------+--------+
The ClickBench[1] benchmarks are widely cited in the industry and
focus on grouping / aggregation / filtering. This runner uses the
scripts and queries from [2].

h2o groupby query 1 took 1669 ms
```
[1]: https://github.com/ClickHouse/ClickBench
[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

[1]: http://www.tpc.org/tpch/
[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
## Parquet Filter

## Parquet benchmarks
Test performance of parquet filter pushdown

This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
The queries are executed on a synthetic dataset generated during
the benchmark execution and designed to simulate web server access
logs.

To run filter benchmarks, run:
Example

```base
cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
```
dfbench parquet-filter --path ./data --scale-factor 1.0

This will generate the synthetic dataset at `./data/logs.parquet`. The size of the dataset can be controlled through the `size_factor`
generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `--scale-factor` option
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different `ParquetScanOption` settings.
Each filter query is run with several different
`ParquetScanOptions` settings (`pushdown_filters`,
`reorder_predicates`, and `enable_page_index`).

Example run:
Example output:

```
Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
@@ -337,12 +311,56 @@ Iteration 2 returned 1781686 rows in 1947 ms
...
```

Similarly, to run sorting benchmarks, run:
## Sort
Test performance of sorting large datasets

This test sorts a synthetic dataset generated during the
benchmark execution, designed to simulate sorting web server
access logs. Such sorting is often done during data transformation
steps.

The tests sort the entire dataset using several different sort
orders.
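
By analogy with the parquet-filter example, a sort run might be invoked as sketched below. This assumes the `sort` subcommand accepts the same `--path` and `--scale-factor` options as `parquet-filter`; check `dfbench sort --help` for the actual options. The variable names are illustrative.

```shell
# Assumption: `sort` takes --path and --scale-factor, mirroring the
# parquet-filter example; verify with `dfbench sort --help`.
DATA_PATH="./data"      # directory where the synthetic dataset is written
SCALE_FACTOR="1.0"      # dataset size multiplier

SORT_CMD="cargo run --release --bin dfbench -- sort --path $DATA_PATH --scale-factor $SCALE_FACTOR"

# Print the command; to run it, execute the printed line from the benchmarks directory.
echo "$SORT_CMD"
```

Building with `--release` matters here, since sort timings from a debug build are not representative.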

## TPCH

Run the tpch benchmark.

This benchmark is derived from the [TPC-H][1] version
[2.17.1]. The data and answers are generated using `tpch-gen` from
[2].

[1]: http://www.tpc.org/tpch/
[2]: https://github.com/databricks/tpch-dbgen.git
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf


# Older Benchmarks

## h2o benchmarks

```bash
cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
```

Example run:

```
Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
Executing select id1, sum(v1) as v1 from x group by id1
+-------+--------+
| id1 | v1 |
+-------+--------+
| id063 | 199420 |
| id094 | 200127 |
| id044 | 198886 |
...
| id093 | 200132 |
| id003 | 199047 |
+-------+--------+
```base
cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
h2o groupby query 1 took 1669 ms
```

This proceeds in the same way as the filter benchmarks: each sort expression
combination will be run using the same set of `ParquetScanOption` as the
filter benchmarks.
[1]: http://www.tpc.org/tpch/
[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
1 change: 1 addition & 0 deletions benchmarks/bench.sh
@@ -182,6 +182,7 @@ main() {
# navigate to the appropriate directory
pushd "${DATAFUSION_DIR}/benchmarks" > /dev/null
mkdir -p "${RESULTS_DIR}"
mkdir -p "${DATA_DIR}"
case "$BENCHMARK" in
all)
run_tpch "1"
6 changes: 5 additions & 1 deletion benchmarks/src/bin/dfbench.rs
@@ -28,14 +28,16 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
#[global_allocator]
static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;

use datafusion_benchmarks::{clickbench, tpch};
use datafusion_benchmarks::{clickbench, parquet_filter, sort, tpch};

#[derive(Debug, StructOpt)]
#[structopt(about = "benchmark command")]
enum Options {
Tpch(tpch::RunOpt),
TpchConvert(tpch::ConvertOpt),
Clickbench(clickbench::RunOpt),
ParquetFilter(parquet_filter::RunOpt),
Sort(sort::RunOpt),
}

// Main benchmark runner entrypoint
@@ -47,5 +49,7 @@ pub async fn main() -> Result<()> {
Options::Tpch(opt) => opt.run().await,
Options::TpchConvert(opt) => opt.run().await,
Options::Clickbench(opt) => opt.run().await,
Options::ParquetFilter(opt) => opt.run().await,
Options::Sort(opt) => opt.run().await,
}
}