[KYUUBI #3068][DOC] Add the Hudi connector doc for Spark SQL Query Engine

### _Why are the changes needed?_

Add the Hudi connector doc for Spark SQL Query Engine

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #3099 from deadwind4/hudi-spark-doc.

Closes #3068

fcd2cf6 [Luning Wang] update doc
0ee870d [Luning Wang] [KYUUBI #3068][DOC] Add the Hudi connector doc for Spark SQL Query Engine

Authored-by: Luning Wang <wang4luning@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
a49a authored and yaooqinn committed Jul 21, 2022
1 parent 4b640b7 commit f1312ea
Showing 1 changed file with 77 additions and 1 deletion.

docs/connector/spark/hudi.rst
@@ -16,21 +16,97 @@
`Hudi`_
========

Apache Hudi (pronounced “hoodie”) is the next generation streaming data lake platform.
Apache Hudi brings core warehouse and database functionality directly to a data lake.

.. tip::
   This article assumes that you have mastered the basic knowledge and operation of `Hudi`_.
   For the knowledge about Hudi not mentioned in this article,
   you can obtain it from its `Official Documentation`_.

By using Kyuubi, we can run SQL queries against Hudi more conveniently; the queries are
easier to understand and to extend than manipulating Hudi with Spark directly.

Hudi Integration
----------------

To enable the integration of the Kyuubi Spark SQL engine and Hudi through
Catalog APIs, you need to:

- Reference the Hudi :ref:`dependencies`
- Set the Spark extension and catalog :ref:`configurations`

.. _dependencies:

Dependencies
************

The **classpath** of the Kyuubi Spark SQL engine with Hudi support consists of

1. kyuubi-spark-sql-engine-|release|.jar, the engine jar deployed with Kyuubi distributions
2. a copy of the Spark distribution
3. hudi-spark<spark.version>-bundle_<scala.version>-<hudi.version>.jar (example: hudi-spark3.2-bundle_2.12-0.11.1.jar), which can be found in `Maven Central`_

In order to make the Hudi packages visible for the runtime classpath of engines, we can use one of these methods:

1. Put the Hudi packages into ``$SPARK_HOME/jars`` directly
2. Set ``spark.jars=/path/to/hudi-spark-bundle``
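For example, with the second method, the bundle jar can be referenced in
``$SPARK_HOME/conf/spark-defaults.conf``. The path and version below are only
placeholders; adjust them to your environment:

.. code-block:: properties

   # assumed download location of the Hudi Spark bundle
   spark.jars=/opt/hudi/hudi-spark3.2-bundle_2.12-0.11.1.jar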

.. _configurations:

Configurations
**************

To activate functionality of Hudi, we can set the following configurations:

.. code-block:: properties

   # Spark 3.2
   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
   spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog

   # Spark 3.1
   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension

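These Spark configurations can also be supplied on the Kyuubi side, e.g. in
``$KYUUBI_HOME/conf/kyuubi-defaults.conf``, so that every engine launched by
Kyuubi picks them up. A minimal sketch for Spark 3.2 (the jar path is a
placeholder):

.. code-block:: properties

   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
   spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
   spark.jars=/opt/hudi/hudi-spark3.2-bundle_2.12-0.11.1.jar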
Hudi Operations
---------------

Taking ``Create Table`` as an example,

.. code-block:: sql

   CREATE TABLE hudi_cow_nonpcf_tbl (
     uuid INT,
     name STRING,
     price DOUBLE
   ) USING HUDI;

Taking ``Query Data`` as an example,

.. code-block:: sql

   SELECT * FROM hudi_cow_nonpcf_tbl WHERE uuid < 20;

Taking ``Insert Data`` as an example,

.. code-block:: sql

   INSERT INTO hudi_cow_nonpcf_tbl SELECT 1, 'a1', 20;

Taking ``Update Data`` as an example,

.. code-block:: sql

   UPDATE hudi_cow_nonpcf_tbl SET name = 'foo', price = price * 2 WHERE uuid = 1;

Taking ``Delete Data`` as an example,

.. code-block:: sql

   DELETE FROM hudi_cow_nonpcf_tbl WHERE uuid = 1;

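Hudi also supports upserts via ``MERGE INTO``. As a minimal sketch (the source
table ``updates_tbl`` below is a hypothetical table with the same schema as
``hudi_cow_nonpcf_tbl``):

.. code-block:: sql

   MERGE INTO hudi_cow_nonpcf_tbl AS target
   USING updates_tbl AS source
   ON target.uuid = source.uuid
   WHEN MATCHED THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *;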
.. _Hudi: https://hudi.apache.org/
.. _Official Documentation: https://hudi.apache.org/docs/overview
.. _Maven Central: https://mvnrepository.com/artifact/org.apache.hudi
