diff --git a/website/www/site/content/en/documentation/sdks/java.md b/website/www/site/content/en/documentation/sdks/java.md
index 71d6385ac0881..7b24c13090fb8 100644
--- a/website/www/site/content/en/documentation/sdks/java.md
+++ b/website/www/site/content/en/documentation/sdks/java.md
@@ -44,6 +44,7 @@ The Java SDK has the following extensions:
 - [join-library](/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
 - [sorter](/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.
 - [Nexmark](/documentation/sdks/java/testing/nexmark) is a benchmark suite that runs in batch and streaming modes.
+- [TPC-DS](/documentation/sdks/java/testing/tpcds) is a SQL benchmark suite that runs in batch mode.
 - [euphoria](/documentation/sdks/java/euphoria) is easy to use Java 8 DSL for BEAM.
 
 In addition several [3rd party Java libraries](/documentation/sdks/java-thirdparty/) exist.
diff --git a/website/www/site/content/en/documentation/sdks/java/testing/tpcds.md b/website/www/site/content/en/documentation/sdks/java/testing/tpcds.md
new file mode 100644
index 0000000000000..45db199bc4414
--- /dev/null
+++ b/website/www/site/content/en/documentation/sdks/java/testing/tpcds.md
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc.
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+The TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and are used to compare SQL implementations in terms of
+completeness and performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with the CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constrains the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of the TPC-DS benchmark, and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) pass when run with the Beam SQL transform, since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is pre-generated in two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds).
+
+### Runtime
+The TPC-DS extension for Beam can only be run in **Batch** mode and currently supports these runners:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+        The Gradle project name of the runner, such as ":runners:spark:3" or
+        ":runners:flink:1.13". The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored on an accessible file system. The query results are written to output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+    --dataDirectory=
+
+Path to results directory:
+
+    --resultsDirectory=
+
+Format of input files:
+
+    --sourceType=
+
+Queries to run (comma-separated list of query numbers or `all` for all queries):
+
+    --queries=<1,2,...N|all>
+
+Number of queries **N** that run in parallel:
+
+    --tpcParallel=N
+
+## Running TPC-DS
+
+Below are some examples of how to run the TPC-DS benchmark on different runners.
+
+Running the suite on the SparkRunner (local) with Query3 against the 1GB dataset in Parquet format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:spark:3" \
+        -Ptpcds.args="
+            --runner=SparkRunner
+            --dataSize=1GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/spark/
+            --tpcParallel=1
+            --queries=3"
+
+Running the suite on the FlinkRunner (local) with Query7 and Query10 in parallel against the 10GB dataset in CSV format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:flink:1.13" \
+        -Ptpcds.args="
+            --runner=FlinkRunner
+            --parallelism=2
+            --dataSize=10GB
+            --sourceType=CSV
+            --dataDirectory=gs://beam-tpcds/datasets/csv
+            --resultsDirectory=/tmp/beam-tpcds/results/flink/
+            --tpcParallel=2
+            --queries=7,10"
+
+Running the suite on the DataflowRunner with all queries against the 100GB dataset in Parquet format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:google-cloud-dataflow-java" \
+        -Ptpcds.args="
+            --runner=DataflowRunner
+            --region=
+            --project=
+            --numWorkers=4
+            --maxNumWorkers=4
+            --autoscalingAlgorithm=NONE
+            --dataSize=100GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/dataflow/
+            --tpcParallel=4
+            --queries=all"
+
+## TPC-DS dashboards
+TBD
+
diff --git a/website/www/site/layouts/partials/section-menu/en/sdks.html b/website/www/site/layouts/partials/section-menu/en/sdks.html
index d1bd51b7189d9..a659cc16010ee 100644
--- a/website/www/site/layouts/partials/section-menu/en/sdks.html
+++ b/website/www/site/layouts/partials/section-menu/en/sdks.html
@@ -24,6 +24,7 @@
  • Java SDK extensions
  • Java 3rd party extensions
  • Nexmark benchmark suite
+ • TPC-DS benchmark suite
  • Java multi-language pipelines quickstart