[website] Add TPC-DS benchmark documentation
aromanenko-dev committed Jun 24, 2022
1 parent dc0b5e4 commit 213a2fa
Showing 3 changed files with 185 additions and 0 deletions.
1 change: 1 addition & 0 deletions website/www/site/content/en/documentation/sdks/java.md
@@ -44,6 +44,7 @@ The Java SDK has the following extensions:
- [join-library](/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
- [sorter](/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.
- [Nexmark](/documentation/sdks/java/testing/nexmark) is a benchmark suite that runs in batch and streaming modes.
- [TPC-DS](/documentation/sdks/java/testing/tpcds) is a SQL benchmark suite that runs in batch mode.
- [euphoria](/documentation/sdks/java/euphoria) is easy to use Java 8 DSL for BEAM.

In addition several [3rd party Java libraries](/documentation/sdks/java-thirdparty/) exist.
183 changes: 183 additions & 0 deletions website/www/site/content/en/documentation/sdks/java/testing/tpcds.md
@@ -0,0 +1,183 @@
---
type: languages
title: "TPC-DS benchmark suite"
aliases: /documentation/sdks/java/tpcds/
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# TPC Benchmark™ DS (TPC-DS) benchmark suite

## What it is

> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
> purpose decision support system."
- Industry standard benchmark (OLAP/Data Warehouse)
- http://www.tpc.org/tpcds/
- Implemented for many analytical processing systems: RDBMS, Apache Spark, Apache Flink, etc.
- Wide range of different queries (SQL)
- Existing tools to generate input data of different sizes

## Table schemas
TBD

## The queries

The TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
question, which illustrates the business context in which the query could be used.

All queries are “templated” with random input parameters and are used to compare SQL implementations in terms of
completeness and performance.

## Input data
Input data source:

- Input files (CSV) are generated with the CLI tool `dsdgen`
- Input datasets can be generated in different sizes:
  - 1GB / 10GB / 100GB / 1000GB
- The tool constrains the minimum amount of generated data to 1GB
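
As a rough sketch of the generation step, the helper below maps a dataset size label to a `dsdgen` scale factor and prints the command it would run, without executing it. It assumes that `dsdgen` has been built from the TPC-DS toolkit and that its `-scale` option takes the size in GB and `-dir` the output directory. The function name `tpcds_dsdgen_cmd` and the output path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a size label (e.g. "10GB") to a dsdgen command line.
# Assumption: dsdgen's -scale option takes the dataset size in GB and
# -dir names the output directory.
tpcds_dsdgen_cmd() {
  size="$1"
  out_dir="$2"
  case "$size" in
    # Only the sizes listed in this document are accepted.
    1GB|10GB|100GB|1000GB) scale="${size%GB}" ;;
    *) echo "unsupported size: $size" >&2; return 1 ;;
  esac
  echo "dsdgen -scale $scale -dir $out_dir"
}

# Print (do not run) the command for a 10GB dataset.
tpcds_dsdgen_cmd 10GB /tmp/tpcds/10GB
```

Printing the command instead of running it keeps the sketch safe to try before the toolkit is installed.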

## TPC-DS extension in Beam

### Reasons

Beam provides a simplified implementation of the TPC-DS benchmark. There are several reasons to have it in Beam:

- Compare the performance boost or degradation of Beam SQL for different runners or their versions
- Run Beam SQL on different runtime environments
- Detect missing Beam SQL features or incompatibilities
- Find performance issues in Beam

### Queries
All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.

For the moment, 28 out of 103 SQL queries (99 + 4) pass when run with the Beam SQL transform, since not all
SQL-99 operations are supported.

Currently supported queries:
- 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
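
When scripting a run, it can help to keep the supported query numbers in one place. The sketch below copies the list exactly as written in this document, checks membership, and joins the numbers into a `--queries` argument. The variable and function names (`SUPPORTED_QUERIES`, `is_supported_query`) are hypothetical and not part of the TPC-DS extension.

```shell
#!/bin/sh
# Supported query numbers, copied from the list in this document.
SUPPORTED_QUERIES="3 7 10 22 25 26 29 35 38 40 42 43 50 52 55 69 78 79 83 84 87 93 96 97 99"

# Hypothetical helper: succeed only if the given query number is in the list.
is_supported_query() {
  case " $SUPPORTED_QUERIES " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Join the list with commas to form a --queries argument.
queries_arg="--queries=$(echo "$SUPPORTED_QUERIES" | tr ' ' ',')"
echo "$queries_arg"
```

For example, `is_supported_query 7` succeeds while `is_supported_query 1` fails, so a wrapper script can skip unsupported queries before launching a pipeline.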

### Tables
All TPC-DS table schemas are stored in the provided artifacts.

### Input data
Input data is pre-generated in two data formats (CSV and Parquet) and stored in Google Cloud Storage (`gs://beam-tpcds`).

### Runtime
The TPC-DS extension for Beam can only be run in **Batch** mode and currently supports these runners:
- Spark Runner
- Flink Runner
- Dataflow Runner

## TPC-DS output

TBD

## Benchmark launch configuration

The TPC-DS launcher accepts the `--runner` argument as usual for programs that
use Beam PipelineOptions to manage their command line arguments. In addition
to this, the necessary dependencies must be configured.

When running via Gradle, the following two parameters control the execution:

    -P tpcds.args
        The command line to pass to the TPC-DS main program.

    -P tpcds.runner
        The Gradle project name of the runner, such as ":runners:spark:3" or
        ":runners:flink:1.13". The project names can be found in the root
        `settings.gradle.kts`.

Test data has to be generated before running a suite and stored in an accessible file system. The query results will be written to output files.

### Common configuration parameters

Size of input dataset (1GB / 10GB / 100GB / 1000GB):

    --dataSize=<1GB|10GB|100GB|1000GB>

Path to input datasets directory:

    --dataDirectory=<path to dir>

Path to results directory:

    --resultsDirectory=<path to dir>

Format of input files:

    --sourceType=<CSV|PARQUET>

Run queries (comma separated list of query numbers or `all` for all queries):

    --queries=<1,2,...N|all>

Number of queries **N** that are running in parallel:

    --tpcParallel=N
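
The common parameters can be assembled and sanity-checked in a small wrapper before they are passed to Gradle. The following is a minimal sketch, not part of the TPC-DS extension: the function name `build_tpcds_args` is hypothetical, the default of `1` for `--tpcParallel` is this helper's own choice, and it assumes the launcher accepts the arguments separated by plain whitespace. The script only prints the resulting Gradle command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper: build the value for -Ptpcds.args from the common parameters.
# Arguments: runner, data size, source type, data dir, results dir, queries, [parallel]
build_tpcds_args() {
  runner="$1"; data_size="$2"; source_type="$3"
  data_dir="$4"; results_dir="$5"; queries="$6"
  parallel="${7:-1}"  # this helper's own default, not a documented one

  # Reject values outside the sets listed in this document.
  case "$data_size" in
    1GB|10GB|100GB|1000GB) ;;
    *) echo "invalid --dataSize: $data_size" >&2; return 1 ;;
  esac
  case "$source_type" in
    CSV|PARQUET) ;;
    *) echo "invalid --sourceType: $source_type" >&2; return 1 ;;
  esac

  echo "--runner=$runner --dataSize=$data_size --sourceType=$source_type --dataDirectory=$data_dir --resultsDirectory=$results_dir --tpcParallel=$parallel --queries=$queries"
}

# Dry run: print the Gradle command for Query3 on the local SparkRunner.
args="$(build_tpcds_args SparkRunner 1GB PARQUET \
  gs://beam-tpcds/datasets/parquet/partitioned \
  /tmp/beam-tpcds/results/spark/ 3)"
echo "./gradlew :sdks:java:testing:tpcds:run -Ptpcds.runner=\":runners:spark:3\" -Ptpcds.args=\"$args\""
```

Validating `--dataSize` and `--sourceType` up front catches typos before a Gradle build is started.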

## Running TPC-DS

The following examples show how to run the TPC-DS benchmark on different runners.

Running the suite on the SparkRunner (local) with Query3 against a 1GB dataset in Parquet format:

    ./gradlew :sdks:java:testing:tpcds:run \
        -Ptpcds.runner=":runners:spark:3" \
        -Ptpcds.args="
            --runner=SparkRunner
            --dataSize=1GB
            --sourceType=PARQUET
            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
            --resultsDirectory=/tmp/beam-tpcds/results/spark/
            --tpcParallel=1
            --queries=3"

Running the suite on the FlinkRunner (local) with Query7 and Query10 in parallel against a 10GB dataset in CSV format:

    ./gradlew :sdks:java:testing:tpcds:run \
        -Ptpcds.runner=":runners:flink:1.13" \
        -Ptpcds.args="
            --runner=FlinkRunner
            --parallelism=2
            --dataSize=10GB
            --sourceType=CSV
            --dataDirectory=gs://beam-tpcds/datasets/csv
            --resultsDirectory=/tmp/beam-tpcds/results/flink/
            --tpcParallel=2
            --queries=7,10"

Running the suite on the DataflowRunner with all queries against a 100GB dataset in Parquet format:

    ./gradlew :sdks:java:testing:tpcds:run \
        -Ptpcds.runner=":runners:google-cloud-dataflow-java" \
        -Ptpcds.args="
            --runner=DataflowRunner
            --region=<region_name>
            --project=<project_name>
            --numWorkers=4
            --maxNumWorkers=4
            --autoscalingAlgorithm=NONE
            --dataSize=100GB
            --sourceType=PARQUET
            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
            --resultsDirectory=/tmp/beam-tpcds/results/dataflow/
            --tpcParallel=4
            --queries=all"

## TPC-DS dashboards
TBD

@@ -24,6 +24,7 @@
<li><a href="/documentation/sdks/java-extensions/">Java SDK extensions</a></li>
<li><a href="/documentation/sdks/java-thirdparty/">Java 3rd party extensions</a></li>
<li><a href="/documentation/sdks/java/testing/nexmark/">Nexmark benchmark suite</a></li>
<li><a href="/documentation/sdks/java/testing/tpcds/">TPC-DS benchmark suite</a></li>
<li><a href="/documentation/sdks/java-multi-language-pipelines/">Java multi-language pipelines quickstart</a></li>
</ul>
</li>