[website] Add TPC-DS benchmark documentation

apache · Jun 24, 2022 · 213a2fa · 213a2fa
1 parent dc0b5e4
commit 213a2fa
Showing 3 changed files with 185 additions and 0 deletions.
diff --git a/website/www/site/content/en/documentation/sdks/java.md b/website/www/site/content/en/documentation/sdks/java.md
@@ -44,6 +44,7 @@ The Java SDK has the following extensions:
 - [join-library](/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
 - [sorter](/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.
 - [Nexmark](/documentation/sdks/java/testing/nexmark) is a benchmark suite that runs in batch and streaming modes.
+- [TPC-DS](/documentation/sdks/java/testing/tpcds) is a SQL benchmark suite that runs in batch mode.
 - [euphoria](/documentation/sdks/java/euphoria) is easy to use Java 8 DSL for BEAM.
 
 In addition several [3rd party Java libraries](/documentation/sdks/java-thirdparty/) exist.

diff --git a/website/www/site/content/en/documentation/sdks/java/testing/tpcds.md b/website/www/site/content/en/documentation/sdks/java/testing/tpcds.md
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems, such as RDBMSs, Apache Spark, and Apache Flink
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+The TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question that illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and are used to compare the completeness and performance of
+SQL implementations.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with the CLI tool `dsdgen`
+- Input datasets can be generated in different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constrains the minimum amount of generated data to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of the TPC-DS benchmark. There are several reasons to have it in Beam:
+
+- Compare performance gains or regressions of Beam SQL across different runners or runner versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) pass when run with the Beam SQL transform, since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99 
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is pre-generated in two data formats (CSV and Parquet) and stored in Google Cloud Storage (`gs://beam-tpcds`).
+
+### Runtime
+The TPC-DS extension for Beam can only be run in **Batch** mode and currently supports these runners:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+        The Gradle project name of the runner, such as ":runners:spark:3" or
+        ":runners:flink:1.13". The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated and stored on an accessible file system before running the suite. The query results are written to output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+ --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+ --dataDirectory=<path to dir>
+
+Path to results directory:
+
+ --resultsDirectory=<path to dir>
+
+Format of input files:
+
+ --sourceType=<CSV|PARQUET>
+
+Queries to run (comma-separated list of query numbers, or `all` for all queries):
+
+ --queries=<1,2,...N|all>
+
+Number of queries **N** to run in parallel:
+
+ --tpcParallel=N
+
+## Running TPC-DS
+
+The following examples show how to run the TPC-DS benchmark on different runners.
+
+Running the suite on the SparkRunner (local) with Query3 against the 1GB dataset in Parquet format:
+
+ ./gradlew :sdks:java:testing:tpcds:run \
+  -Ptpcds.runner=":runners:spark:3" \
+  -Ptpcds.args="
+  --runner=SparkRunner
+  --dataSize=1GB
+  --sourceType=PARQUET
+  --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+  --resultsDirectory=/tmp/beam-tpcds/results/spark/
+  --tpcParallel=1
+  --queries=3"
+
+Running the suite on the FlinkRunner (local) with Query7 and Query10 in parallel against the 10GB dataset in CSV format:
+
+ ./gradlew :sdks:java:testing:tpcds:run \
+  -Ptpcds.runner=":runners:flink:1.13" \
+  -Ptpcds.args="
+  --runner=FlinkRunner
+  --parallelism=2
+  --dataSize=10GB
+  --sourceType=CSV
+  --dataDirectory=gs://beam-tpcds/datasets/csv
+  --resultsDirectory=/tmp/beam-tpcds/results/flink/
+  --tpcParallel=2
+  --queries=7,10"
+
+Running the suite on the DataflowRunner with all queries against the 100GB dataset in Parquet format:
+
+ ./gradlew :sdks:java:testing:tpcds:run \
+  -Ptpcds.runner=":runners:google-cloud-dataflow-java" \
+  -Ptpcds.args="
+  --runner=DataflowRunner
+  --region=<region_name>
+  --project=<project_name>
+  --numWorkers=4
+  --maxNumWorkers=4
+  --autoscalingAlgorithm=NONE
+  --dataSize=100GB
+  --sourceType=PARQUET
+  --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+  --resultsDirectory=/tmp/beam-tpcds/results/dataflow/
+  --tpcParallel=4
+  --queries=all"
+
+## TPC-DS dashboards
+TBD
+
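The launch parameters documented in `tpcds.md` above can also be assembled by a small script. The following is a minimal sketch and is not part of this commit: `build_tpcds_args` and `print_gradle_command` are hypothetical helper names, and only the option names, the Gradle task, and the example values come from the documentation. It joins the options with spaces rather than newlines, on the unverified assumption that the build treats both alike, and it only prints the command instead of running it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: joins the documented TPC-DS options into one string
# suitable for -Ptpcds.args. The argument order is an assumption of this
# sketch; only the option names come from the documentation.
build_tpcds_args() {
  local runner="$1" data_size="$2" source_type="$3"
  local data_dir="$4" results_dir="$5" parallel="$6" queries="$7"

  # Accept only the four dataset sizes listed in the documentation.
  case "$data_size" in
    1GB|10GB|100GB|1000GB) ;;
    *) echo "unsupported dataSize: $data_size" >&2; return 1 ;;
  esac

  # Accept only the two documented input formats.
  case "$source_type" in
    CSV|PARQUET) ;;
    *) echo "unsupported sourceType: $source_type" >&2; return 1 ;;
  esac

  printf '%s\n' "--runner=$runner --dataSize=$data_size --sourceType=$source_type --dataDirectory=$data_dir --resultsDirectory=$results_dir --tpcParallel=$parallel --queries=$queries"
}

# Hypothetical helper: prints (does not execute) the Gradle invocation for a
# given runner project name and argument string.
print_gradle_command() {
  local gradle_runner="$1" tpcds_args="$2"
  printf '%s\n' "./gradlew :sdks:java:testing:tpcds:run -Ptpcds.runner=\"$gradle_runner\" -Ptpcds.args=\"$tpcds_args\""
}

# Example values taken from the SparkRunner example in the documentation.
args="$(build_tpcds_args SparkRunner 1GB PARQUET \
  gs://beam-tpcds/datasets/parquet/partitioned \
  /tmp/beam-tpcds/results/spark/ 1 3)"
print_gradle_command ":runners:spark:3" "$args"
```

Validating `--dataSize` and `--sourceType` before building the command catches a mistyped value locally, before a Gradle build or a remote job is started. Runner-specific options such as `--parallelism` or `--region` are left out of the sketch and would need to be appended by the caller.
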
diff --git a/website/www/site/layouts/partials/section-menu/en/sdks.html b/website/www/site/layouts/partials/section-menu/en/sdks.html
@@ -24,6 +24,7 @@
  <li><a href="/documentation/sdks/java-extensions/">Java SDK extensions</a></li>
  <li><a href="/documentation/sdks/java-thirdparty/">Java 3rd party extensions</a></li>
  <li><a href="/documentation/sdks/java/testing/nexmark/">Nexmark benchmark suite</a></li>
+ <li><a href="/documentation/sdks/java/testing/tpcds/">TPC-DS benchmark suite</a></li>
  <li><a href="/documentation/sdks/java-multi-language-pipelines/">Java multi-language pipelines quickstart</a></li>
  </ul>
 </li>