apache · alamb · Aug 14, 2023 · Jul 27, 2023 · Jul 27, 2023 · Jul 28, 2023
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -229,31 +229,14 @@ This will produce output like
 └──────────────┴──────────────┴──────────────┴───────────────┘
 ```
 
-### Expected output
+# Benchmark Runner
 
-The result of query 1 should produce the following output when executed against the SF=1 dataset.
+The `dfbench` program contains subcommands to run the various
+benchmarks. When benchmarking, it should always be built in release
+mode using `--release`.
 
-```
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-| l_returnflag | l_linestatus | sum_qty  | sum_base_price     | sum_disc_price     | sum_charge         | avg_qty            | avg_price          | avg_disc             | count_order |
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-| A            | F            | 37734107 | 56586554400.73001  | 53758257134.870026 | 55909065222.82768  | 25.522005853257337 | 38273.12973462168  | 0.049985295838396455 | 1478493     |
-| N            | F            | 991417   | 1487504710.3799996 | 1413082168.0541    | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622  | 38854       |
-| N            | O            | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803   | 0.049996589476752576 | 2920373     |
-| R            | F            | 37719753 | 56568041380.90001  | 53741292684.60399  | 55889619119.83194  | 25.50579361269077  | 38250.854626099666 | 0.05000940583012587  | 1478870     |
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-Query 1 iteration 0 took 1956.1 ms
-Query 1 avg time: 1956.11 ms
-```
-
-# Benchmark Descriptions
-
-## `dfbench`
-
-The `dfbench` program contains subcommands to run various benchmarks.
-
-Full help can be found in the relevant sub command. For example to get help for tpch,
-run `cargo run --release  --bin dfbench tpch --help`
+Full help for each benchmark is available from the relevant subcommand, for
+example `dfbench tpch --help`. To list all subcommands, run
 
 ```shell
 cargo run --release --bin dfbench  --help
@@ -265,13 +248,95 @@ USAGE:
     dfbench <SUBCOMMAND>
 
 SUBCOMMANDS:
-    clickbench      Run the clickbench benchmark
-    help            Prints this message or the help of the given subcommand(s)
-    tpch            Run the tpch benchmark.
-    tpch-convert    Convert tpch .slt files to .parquet or .csv files
+    clickbench        Run the clickbench benchmark
+    help              Prints this message or the help of the given subcommand(s)
+    parquet-filter    Test performance of parquet filter pushdown
+    sort              Test performance of sorting large datasets
+    tpch              Run the tpch benchmark.
+    tpch-convert      Convert tpch .slt files to .parquet or .csv files
 
 ```
 
+# Benchmarks
+
+The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience.
+
+## ClickBench
+
+The ClickBench[1] benchmarks are widely cited in the industry and
+focus on grouping / aggregation / filtering. This runner uses the
+scripts and queries from [2].
+
+[1]: https://github.com/ClickHouse/ClickBench
+[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
+
+## Parquet Filter
+
+Test performance of parquet filter pushdown
+
+The queries are executed on a synthetic dataset generated during
+the benchmark execution and designed to simulate web server access
+logs.
+
+Example
+
+dfbench parquet-filter  --path ./data --scale-factor 1.0
+
+generates the synthetic dataset at `./data/logs.parquet`. The size
+of the dataset can be controlled through the `--scale-factor` option
+(with the default value of `1.0` generating a ~1GB parquet file).
+
+For each filter, the query is run using different
+`ParquetScanOptions` settings.
+
+Example output:
+
+```
+Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
+batch_size: 8192, scale_factor: 1.0 }
+Generated test dataset with 10699521 rows
+Executing with filter 'request_method = Utf8("GET")'
+Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
+Iteration 0 returned 10699521 rows in 1303 ms
+Iteration 1 returned 10699521 rows in 1288 ms
+Iteration 2 returned 10699521 rows in 1266 ms
+Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
+Iteration 0 returned 1781686 rows in 1970 ms
+Iteration 1 returned 1781686 rows in 2002 ms
+Iteration 2 returned 1781686 rows in 1988 ms
+Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
+Iteration 0 returned 1781686 rows in 1940 ms
+Iteration 1 returned 1781686 rows in 1986 ms
+Iteration 2 returned 1781686 rows in 1947 ms
+...
+```
+
+## Sort
+Test performance of sorting large datasets
+
+This test sorts a synthetic dataset generated during the
+benchmark execution, designed to simulate sorting web server
+access logs. Such sorting is often done during data transformation
+steps.
+
+The tests sort the entire dataset using several different sort
+orders.
+
+## TPCH
+
+Run the tpch benchmark.
+
+This benchmark is derived from the [TPC-H][1] version
+[2.17.1]. The data and answers are generated using `tpch-gen` from
+[2].
+
+[1]: http://www.tpc.org/tpch/
+[2]: https://github.com/databricks/tpch-dbgen.git
+[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
+
+
+# Older Benchmarks
+
 ## NYC Taxi Benchmark
 
 These benchmarks are based on the [New York Taxi and Limousine Commission][2] data set.
@@ -317,50 +382,3 @@ h2o groupby query 1 took 1669 ms
 
 [1]: http://www.tpc.org/tpch/
 [2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
-
-## Parquet benchmarks
-
-This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
-The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
-
-To run filter benchmarks, run:
-
-```base
-cargo run --release --bin parquet -- filter  --path ./data --scale-factor 1.0
-```
-
-This will generate the synthetic dataset at `./data/logs.parquet`. The size of the dataset can be controlled through the `size_factor`
-(with the default value of `1.0` generating a ~1GB parquet file).
-
-For each filter we will run the query using different `ParquetScanOption` settings.
-
-Example run:
-
-```
-Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
-Generated test dataset with 10699521 rows
-Executing with filter 'request_method = Utf8("GET")'
-Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
-Iteration 0 returned 10699521 rows in 1303 ms
-Iteration 1 returned 10699521 rows in 1288 ms
-Iteration 2 returned 10699521 rows in 1266 ms
-Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
-Iteration 0 returned 1781686 rows in 1970 ms
-Iteration 1 returned 1781686 rows in 2002 ms
-Iteration 2 returned 1781686 rows in 1988 ms
-Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
-Iteration 0 returned 1781686 rows in 1940 ms
-Iteration 1 returned 1781686 rows in 1986 ms
-Iteration 2 returned 1781686 rows in 1947 ms
-...
-```
-
-Similarly, to run sorting benchmarks, run:
-
-```base
-cargo run --release --bin parquet -- sort  --path ./data --scale-factor 1.0
-```
-
-This proceeds in the same way as the filter benchmarks: each sort expression
-combination will be run using the same set of `ParquetScanOption` as the
-filter benchmarks.
diff --git a/benchmarks/bench.sh b/benchmarks/bench.sh
@@ -182,6 +182,7 @@ main() {
             # navigate to the appropriate directory
             pushd "${DATAFUSION_DIR}/benchmarks" > /dev/null
             mkdir -p "${RESULTS_DIR}"
+            mkdir -p "${DATA_DIR}"
             case "$BENCHMARK" in
                 all)
                     run_tpch "1"

diff --git a/benchmarks/src/bin/dfbench.rs b/benchmarks/src/bin/dfbench.rs
@@ -28,14 +28,16 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 #[global_allocator]
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;
 
-use datafusion_benchmarks::{clickbench, tpch};
+use datafusion_benchmarks::{clickbench, parquet_filter, sort, tpch};
 
 #[derive(Debug, StructOpt)]
 #[structopt(about = "benchmark command")]
 enum Options {
     Tpch(tpch::RunOpt),
     TpchConvert(tpch::ConvertOpt),
     Clickbench(clickbench::RunOpt),
+    ParquetFilter(parquet_filter::RunOpt),
+    Sort(sort::RunOpt),
 }
 
 // Main benchmark runner entrypoint
@@ -47,5 +49,7 @@ pub async fn main() -> Result<()> {
         Options::Tpch(opt) => opt.run().await,
         Options::TpchConvert(opt) => opt.run().await,
         Options::Clickbench(opt) => opt.run().await,
+        Options::ParquetFilter(opt) => opt.run().await,
+        Options::Sort(opt) => opt.run().await,
     }
 }
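
For readers following the `dfbench.rs` change, the dispatch pattern can be sketched without any dependencies. This is only an illustration under stated assumptions: the real binary derives its argument parsing with StructOpt and each variant wraps a `RunOpt`/`ConvertOpt` struct, whereas `parse_subcommand` and `describe` below are hypothetical stand-ins, and the variants carry no options. The kebab-case names and the descriptions are taken from the README text in this diff.

```rust
// Std-only sketch of the subcommand dispatch in dfbench.rs.
// Assumption: `parse_subcommand` and `describe` are hypothetical helpers;
// the real code uses `#[derive(StructOpt)]` and calls `opt.run().await`.

#[derive(Debug, PartialEq)]
enum Options {
    Tpch,
    TpchConvert,
    Clickbench,
    ParquetFilter,
    Sort,
}

/// Map a kebab-case subcommand name to its enum variant.
fn parse_subcommand(name: &str) -> Option<Options> {
    match name {
        "tpch" => Some(Options::Tpch),
        "tpch-convert" => Some(Options::TpchConvert),
        "clickbench" => Some(Options::Clickbench),
        "parquet-filter" => Some(Options::ParquetFilter),
        "sort" => Some(Options::Sort),
        _ => None,
    }
}

/// Stand-in for running a benchmark: returns its README description.
fn describe(opt: &Options) -> &'static str {
    match opt {
        Options::Tpch => "Run the tpch benchmark.",
        Options::TpchConvert => "Convert tpch .slt files to .parquet or .csv files",
        Options::Clickbench => "Run the clickbench benchmark",
        Options::ParquetFilter => "Test performance of parquet filter pushdown",
        Options::Sort => "Test performance of sorting large datasets",
    }
}

fn main() {
    // Use the first CLI argument as the subcommand, defaulting to "sort".
    let name = std::env::args().nth(1).unwrap_or_else(|| "sort".to_string());
    match parse_subcommand(&name) {
        Some(opt) => println!("{}: {}", name, describe(&opt)),
        None => eprintln!("unknown subcommand: {}", name),
    }
}
```

Adding a benchmark then means one new variant plus one arm in each `match`, which mirrors the two-line additions for `ParquetFilter` and `Sort` in the diff above.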