Add parquet-filter and sort benchmarks to dfbench (#7120)

* Add parquet-filter and sort benchmarks to dfbench
* fix
* fix docs
* fix ci bench
* Update docs
apache · Aug 14, 2023 · 2ec0bc1
1 parent 563a1dc
Showing 13 changed files with 593 additions and 397 deletions.
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -229,31 +229,14 @@ This will produce output like
 └──────────────┴──────────────┴──────────────┴───────────────┘
 ```
 
-### Expected output
+# Benchmark Runner
 
-The result of query 1 should produce the following output when executed against the SF=1 dataset.
+The `dfbench` program contains subcommands to run the various
+benchmarks. When benchmarking, it should always be built in release
+mode using `--release`.
 
-```
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-| l_returnflag | l_linestatus | sum_qty  | sum_base_price     | sum_disc_price     | sum_charge         | avg_qty            | avg_price          | avg_disc             | count_order |
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-| A            | F            | 37734107 | 56586554400.73001  | 53758257134.870026 | 55909065222.82768  | 25.522005853257337 | 38273.12973462168  | 0.049985295838396455 | 1478493     |
-| N            | F            | 991417   | 1487504710.3799996 | 1413082168.0541    | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622  | 38854       |
-| N            | O            | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803   | 0.049996589476752576 | 2920373     |
-| R            | F            | 37719753 | 56568041380.90001  | 53741292684.60399  | 55889619119.83194  | 25.50579361269077  | 38250.854626099666 | 0.05000940583012587  | 1478870     |
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-Query 1 iteration 0 took 1956.1 ms
-Query 1 avg time: 1956.11 ms
-```
-
-# Benchmark Descriptions
-
-## `dfbench`
-
-The `dfbench` program contains subcommands to run various benchmarks.
-
-Full help can be found in the relevant sub command. For example to get help for tpch,
-run `cargo run --release  --bin dfbench tpch --help`
+Full help for each benchmark can be found in the relevant
+subcommand, for example `dfbench tpch --help`. To list all subcommands, run
 
 ```shell
 cargo run --release --bin dfbench  --help
@@ -265,61 +248,52 @@ USAGE:
     dfbench <SUBCOMMAND>
 
 SUBCOMMANDS:
-    clickbench      Run the clickbench benchmark
-    help            Prints this message or the help of the given subcommand(s)
-    tpch            Run the tpch benchmark.
-    tpch-convert    Convert tpch .slt files to .parquet or .csv files
+    clickbench        Run the clickbench benchmark
+    help              Prints this message or the help of the given subcommand(s)
+    parquet-filter    Test performance of parquet filter pushdown
+    sort              Test performance of sorting large datasets
+    tpch              Run the tpch benchmark.
+    tpch-convert      Convert tpch .slt files to .parquet or .csv files
 
 ```
 
-## h2o benchmarks
+# Benchmarks
 
-```bash
-cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
-```
+The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience.
 
-Example run:
+## ClickBench
 
-```
-Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
-Executing select id1, sum(v1) as v1 from x group by id1
-+-------+--------+
-| id1   | v1     |
-+-------+--------+
-| id063 | 199420 |
-| id094 | 200127 |
-| id044 | 198886 |
-...
-| id093 | 200132 |
-| id003 | 199047 |
-+-------+--------+
+The ClickBench[1] benchmarks are widely cited in the industry and
+focus on grouping / aggregation / filtering. This runner uses the
+scripts and queries from [2].
 
-h2o groupby query 1 took 1669 ms
-```
+[1]: https://github.com/ClickHouse/ClickBench
+[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
 
-[1]: http://www.tpc.org/tpch/
-[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
+## Parquet Filter
 
-## Parquet benchmarks
+Test performance of parquet filter pushdown
 
-This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
-The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+The queries are executed on a synthetic dataset generated during
+the benchmark execution and designed to simulate web server access
+logs.
 
-To run filter benchmarks, run:
+Example
 
-```base
-cargo run --release --bin parquet -- filter  --path ./data --scale-factor 1.0
-```
+dfbench parquet-filter  --path ./data --scale-factor 1.0
 
-This will generate the synthetic dataset at `./data/logs.parquet`. The size of the dataset can be controlled through the `size_factor`
+generates the synthetic dataset at `./data/logs.parquet`. The size
+of the dataset can be controlled through the `--scale-factor` option
 (with the default value of `1.0` generating a ~1GB parquet file).
 
-For each filter we will run the query using different `ParquetScanOption` settings.
+For each filter, the query is run using different
+`ParquetScanOptions` settings.
 
-Example run:
+Example output:
 
 ```
-Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
+Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
+batch_size: 8192, scale_factor: 1.0 }
 Generated test dataset with 10699521 rows
 Executing with filter 'request_method = Utf8("GET")'
 Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
@@ -337,12 +311,56 @@ Iteration 2 returned 1781686 rows in 1947 ms
 ...
 ```
 
-Similarly, to run sorting benchmarks, run:
+## Sort
+Test performance of sorting large datasets
+
+This test sorts a synthetic dataset generated during the
+benchmark execution, designed to simulate sorting web server
+access logs. Such sorting is often done during data transformation
+steps.
+
+The tests sort the entire dataset using several different sort
+orders.
+
+## TPCH
+
+Run the tpch benchmark.
+
+This benchmark is derived from the [TPC-H][1] version
+[2.17.1]. The data and answers are generated using `tpch-gen` from
+[2].
+
+[1]: http://www.tpc.org/tpch/
+[2]: https://github.com/databricks/tpch-dbgen.git
+[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
+
+
+# Older Benchmarks
+
+## h2o benchmarks
+
+```bash
+cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
+```
+
+Example run:
+
+```
+Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
+Executing select id1, sum(v1) as v1 from x group by id1
++-------+--------+
+| id1   | v1     |
++-------+--------+
+| id063 | 199420 |
+| id094 | 200127 |
+| id044 | 198886 |
+...
+| id093 | 200132 |
+| id003 | 199047 |
++-------+--------+
 
-```base
-cargo run --release --bin parquet -- sort  --path ./data --scale-factor 1.0
+h2o groupby query 1 took 1669 ms
 ```
 
-This proceeds in the same way as the filter benchmarks: each sort expression
-combination will be run using the same set of `ParquetScanOption` as the
-filter benchmarks.
+[1]: http://www.tpc.org/tpch/
+[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
diff --git a/benchmarks/bench.sh b/benchmarks/bench.sh
@@ -182,6 +182,7 @@ main() {
             # navigate to the appropriate directory
             pushd "${DATAFUSION_DIR}/benchmarks" > /dev/null
             mkdir -p "${RESULTS_DIR}"
+            mkdir -p "${DATA_DIR}"
             case "$BENCHMARK" in
                 all)
                     run_tpch "1"

diff --git a/benchmarks/src/bin/dfbench.rs b/benchmarks/src/bin/dfbench.rs
@@ -28,14 +28,16 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 #[global_allocator]
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;
 
-use datafusion_benchmarks::{clickbench, tpch};
+use datafusion_benchmarks::{clickbench, parquet_filter, sort, tpch};
 
 #[derive(Debug, StructOpt)]
 #[structopt(about = "benchmark command")]
 enum Options {
     Tpch(tpch::RunOpt),
     TpchConvert(tpch::ConvertOpt),
     Clickbench(clickbench::RunOpt),
+    ParquetFilter(parquet_filter::RunOpt),
+    Sort(sort::RunOpt),
 }
 
 // Main benchmark runner entrypoint
@@ -47,5 +49,7 @@ pub async fn main() -> Result<()> {
         Options::Tpch(opt) => opt.run().await,
         Options::TpchConvert(opt) => opt.run().await,
         Options::Clickbench(opt) => opt.run().await,
+        Options::ParquetFilter(opt) => opt.run().await,
+        Options::Sort(opt) => opt.run().await,
     }
 }
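
The `dfbench.rs` change above adds one enum variant and one `match` arm per benchmark. As a rough illustration of that dispatch pattern, here is a minimal std-only sketch. It is not the code from this commit: the real binary derives its parser with `StructOpt` and runs each benchmark asynchronously, whereas the `RunOpt` fields, the `parse` and `parse_run_opt` helpers, the default values, and the string returned by `run` are all hypothetical stand-ins chosen for the example.

```rust
/// Hypothetical stand-in for the per-benchmark option structs
/// (the real `parquet_filter::RunOpt` and `sort::RunOpt` are not shown in this diff).
#[derive(Debug, PartialEq)]
struct RunOpt {
    path: String,
    scale_factor: f32,
}

/// One variant per subcommand, mirroring the shape of `Options` in dfbench.rs.
#[derive(Debug, PartialEq)]
enum Options {
    ParquetFilter(RunOpt),
    Sort(RunOpt),
}

/// Parse `--path` and `--scale-factor` pairs; the defaults here are assumptions.
fn parse_run_opt(args: &[&str]) -> Result<RunOpt, String> {
    let mut opt = RunOpt { path: "./data".to_string(), scale_factor: 1.0 };
    let mut iter = args.iter();
    while let Some(flag) = iter.next() {
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {flag}"))?;
        match *flag {
            "--path" => opt.path = value.to_string(),
            "--scale-factor" => {
                opt.scale_factor = value
                    .parse::<f32>()
                    .map_err(|e| format!("bad scale factor {value}: {e}"))?;
            }
            other => return Err(format!("unknown flag {other}")),
        }
    }
    Ok(opt)
}

/// Pick the enum variant from the first argument (the subcommand name).
fn parse(args: &[&str]) -> Result<Options, String> {
    let (name, rest) = args
        .split_first()
        .ok_or_else(|| "missing subcommand".to_string())?;
    match *name {
        "parquet-filter" => Ok(Options::ParquetFilter(parse_run_opt(rest)?)),
        "sort" => Ok(Options::Sort(parse_run_opt(rest)?)),
        other => Err(format!("unknown subcommand {other}")),
    }
}

impl Options {
    /// Dispatch on the variant; a real runner would execute the benchmark here.
    fn run(&self) -> String {
        match self {
            Options::ParquetFilter(opt) => format!("parquet-filter on {}", opt.path),
            Options::Sort(opt) => format!("sort on {}", opt.path),
        }
    }
}

fn main() {
    let args = ["parquet-filter", "--path", "./data", "--scale-factor", "1.0"];
    match parse(&args) {
        Ok(options) => println!("{}", options.run()),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

With this shape, adding a benchmark means adding a variant, a parse arm, and a `run` arm, which is the same three-touch change the diff makes for `ParquetFilter` and `Sort`.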