From 5ed671c8426ab3466c1870aef5c1a8e7368f339a Mon Sep 17 00:00:00 2001
From: Cheng Pan
Date: Mon, 15 Aug 2022 21:14:45 +0800
Subject: [PATCH] [KYUUBI #3228] [Subtask] Connectors for Spark SQL Query Engine -> TPC-DS

### _Why are the changes needed?_

Document Kyuubi Spark TPC-DS connector

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [x] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #3228 from pan3793/tpcds-doc.

Closes #3228

0cafdf88 [Cheng Pan] [Subtask] Connectors for Spark SQL Query Engine -> TPC-DS

Authored-by: Cheng Pan
Signed-off-by: Cheng Pan
---
 docs/connector/spark/tpcds.rst | 108 +++++++++++++++++++++++++++++++++
 docs/connector/spark/tpdcs.rst |  34 -----------
 2 files changed, 108 insertions(+), 34 deletions(-)
 create mode 100644 docs/connector/spark/tpcds.rst
 delete mode 100644 docs/connector/spark/tpdcs.rst

diff --git a/docs/connector/spark/tpcds.rst b/docs/connector/spark/tpcds.rst
new file mode 100644
index 00000000000..e52e56c08b3
--- /dev/null
+++ b/docs/connector/spark/tpcds.rst
@@ -0,0 +1,108 @@
+.. Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+
+TPC-DS
+======
+
+TPC-DS is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent
+data modifications. The queries and the data populating the database have been chosen to have broad industry-wide
+relevance.
+
+.. tip::
+   This article assumes that you are familiar with the basic concepts and operation of `TPC-DS`_.
+   For TPC-DS details not covered in this article, refer to its `Official Documentation`_.
+
+This connector can be used to test the capabilities and query syntax of Spark without configuring access to an external
+data source. When you query a TPC-DS table, the connector generates the data on the fly using a deterministic algorithm.
+
+Go to `Try Kyuubi`_ to explore TPC-DS data instantly!
+
+TPC-DS Integration
+------------------
+
+To integrate the Kyuubi Spark SQL engine with TPC-DS through the
+Apache Spark DataSource V2 and Catalog APIs, you need to:
+
+- Reference the TPC-DS connector :ref:`dependencies<spark-tpcds-deps>`
+- Set the Spark catalog :ref:`configurations<spark-tpcds-conf>`
+
+.. _spark-tpcds-deps:
+
+Dependencies
+************
+
+The **classpath** of the Kyuubi Spark SQL engine with TPC-DS support consists of:
+
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+2. a copy of the Spark distribution
+3. kyuubi-spark-connector-tpcds-\ |release|\ _2.12.jar, which can be found in `Maven Central`_
+
+To make the TPC-DS connector package visible on the runtime classpath of engines, use one of these methods:
+
+1. Put the TPC-DS connector package into ``$SPARK_HOME/jars`` directly
+2. Set spark.jars=kyuubi-spark-connector-tpcds-\ |release|\ _2.12.jar
+
+.. _spark-tpcds-conf:
+
+Configurations
+**************
+
+To add TPC-DS tables as a catalog, set the following configurations in ``$SPARK_HOME/conf/spark-defaults.conf``:
+
+.. code-block:: properties
+
+   # (required) Register a catalog named `tpcds` for the spark engine.
+   spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog
+
+   # (optional) Excluded database list from the catalog, all available databases are:
+   #            sf0, tiny, sf1, sf10, sf30, sf100, sf300, sf1000, sf3000, sf10000, sf30000, sf100000.
+   spark.sql.catalog.tpcds.excludeDatabases=sf10000,sf30000
+
+   # (optional) When true, use CHAR/VARCHAR, otherwise use STRING. It affects output of the table schema,
+   #            e.g. `SHOW CREATE TABLE <table>`, `DESC <table>`.
+   spark.sql.catalog.tpcds.useAnsiStringType=false
+
+   # (optional) TPC-DS changed table schemas in v2.6.0; turn off this option to use the old table schemas.
+   #            See details at: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf
+   spark.sql.catalog.tpcds.useTableSchema_2_6=true
+
+   # (optional) Maximum bytes per task; consider reducing it if you want higher parallelism.
+   spark.sql.catalog.tpcds.read.maxPartitionBytes=128m
+
+TPC-DS Operations
+-----------------
+
+List databases under the `tpcds` catalog.
+
+.. code-block:: sql
+
+   SHOW DATABASES IN tpcds;
+
+List tables under the `tpcds.sf1` database.
+
+.. code-block:: sql
+
+   SHOW TABLES IN tpcds.sf1;
+
+Switch the current database to `tpcds.sf1` and run a query against it.
+
+.. code-block:: sql
+
+   USE tpcds.sf1;
+   SELECT * FROM store_sales;
+
+.. _Official Documentation: https://www.tpc.org/tpcds/
+.. _Try Kyuubi: https://try.kyuubi.cloud/
+.. _Maven Central: https://repo1.maven.org/maven2/org/apache/kyuubi/kyuubi-spark-connector-tpcds_2.12/
\ No newline at end of file
diff --git a/docs/connector/spark/tpdcs.rst b/docs/connector/spark/tpdcs.rst
deleted file mode 100644
index 58d83b1abec..00000000000
--- a/docs/connector/spark/tpdcs.rst
+++ /dev/null
@@ -1,34 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-..    http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
-
-TPC-DS
-=====
-
-TPC-DS Integration
--------------------
-
-.. _spark-tpcds-deps:
-
-Dependencies
-************
-
-.. _spark-tpcds-conf:
-
-Configurations
-**************
-
-
-TPC-DS Operations
-------------------