[KYUUBI #3068][DOC] Add the Hudi connector doc for Spark SQL Query Engine

### _Why are the changes needed?_

Add the Hudi connector doc for Spark SQL Query Engine

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #3099 from deadwind4/hudi-spark-doc.

Closes #3068

fcd2cf6 [Luning Wang] update doc
0ee870d [Luning Wang] [KYUUBI #3068][DOC] Add the Hudi connector doc for Spark SQL Query Engine

Authored-by: Luning Wang <wang4luning@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
a49a authored and yaooqinn committed Jul 21, 2022
1 parent 4b640b7 commit f1312ea
Showing 1 changed file with 77 additions and 1 deletion.

docs/connector/spark/hudi.rst
@@ -16,21 +16,97 @@
`Hudi`_
========

Apache Hudi (pronounced “hoodie”) is the next generation streaming data lake platform.
Apache Hudi brings core warehouse and database functionality directly to a data lake.

.. tip::
   This article assumes that you have mastered the basic knowledge and operation of `Hudi`_.
   For the knowledge about Hudi not mentioned in this article,
   you can obtain it from its `Official Documentation`_.

By using Kyuubi, we can run SQL queries against Hudi more conveniently; the queries are
easier to understand and to extend than manipulating Hudi with Spark directly.

Hudi Integration
----------------

To enable the integration of the Kyuubi Spark SQL engine and Hudi through
Catalog APIs, you need to:

- Reference the Hudi :ref:`dependencies`
- Set the Spark extension and catalog :ref:`configurations`

.. _dependencies:

Dependencies
************

The **classpath** of the Kyuubi Spark SQL engine with Hudi support consists of

1. kyuubi-spark-sql-engine-|release|.jar, the engine jar deployed with Kyuubi distributions
2. a copy of the Spark distribution
3. hudi-spark<spark.version>-bundle_<scala.version>-<hudi.version>.jar (example: hudi-spark3.2-bundle_2.12-0.11.1.jar), which can be found in `Maven Central`_

In order to make the Hudi packages visible for the runtime classpath of engines, we can use one of these methods:

1. Put the Hudi packages into ``$SPARK_HOME/jars`` directly
2. Set ``spark.jars=/path/to/hudi-spark-bundle``
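For example, with the second method, the bundle jar can be referenced in
``$SPARK_HOME/conf/spark-defaults.conf``. The path and version below are only
placeholders; adjust them to your environment:

.. code-block:: properties

   # assumed download location of the Hudi Spark bundle
   spark.jars=/opt/hudi/hudi-spark3.2-bundle_2.12-0.11.1.jar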

.. _configurations:

Configurations
**************

To activate functionality of Hudi, we can set the following configurations:

.. code-block:: properties

   # Spark 3.2
   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
   spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog

   # Spark 3.1
   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension

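These Spark configurations can also be supplied on the Kyuubi side, e.g. in
``$KYUUBI_HOME/conf/kyuubi-defaults.conf``, so that every engine launched by
Kyuubi picks them up. A minimal sketch for Spark 3.2 (the jar path is a
placeholder):

.. code-block:: properties

   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
   spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
   spark.jars=/opt/hudi/hudi-spark3.2-bundle_2.12-0.11.1.jar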
Hudi Operations
---------------

Taking ``Create Table`` as an example,

.. code-block:: sql

   CREATE TABLE hudi_cow_nonpcf_tbl (
     uuid INT,
     name STRING,
     price DOUBLE
   ) USING HUDI;

Taking ``Query Data`` as an example,

.. code-block:: sql

   SELECT * FROM hudi_cow_nonpcf_tbl WHERE uuid < 20;

Taking ``Insert Data`` as an example,

.. code-block:: sql

   INSERT INTO hudi_cow_nonpcf_tbl SELECT 1, 'a1', 20;

Taking ``Update Data`` as an example,

.. code-block:: sql

   UPDATE hudi_cow_nonpcf_tbl SET name = 'foo', price = price * 2 WHERE uuid = 1;

Taking ``Delete Data`` as an example,

.. code-block:: sql

   DELETE FROM hudi_cow_nonpcf_tbl WHERE uuid = 1;

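Hudi also supports upserts via ``MERGE INTO``. As a minimal sketch (the source
table ``updates_tbl`` below is a hypothetical table with the same schema as
``hudi_cow_nonpcf_tbl``):

.. code-block:: sql

   MERGE INTO hudi_cow_nonpcf_tbl AS target
   USING updates_tbl AS source
   ON target.uuid = source.uuid
   WHEN MATCHED THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *;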
.. _Hudi: https://hudi.apache.org/
.. _Official Documentation: https://hudi.apache.org/docs/overview
.. _Maven Central: https://mvnrepository.com/artifact/org.apache.hudi
