
Commit 9028303

Hudi uniform support (#2333)
#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [x] Other (Uniform)

## Description

This change adds support for Hudi in UniForm. The changes were mostly copied from [OneTable](https://github.com/onetable-io/onetable), which already has a working Delta-to-Hudi conversion.

## How was this patch tested?

Some basic tests are added.

## Does this PR introduce _any_ user-facing changes?

Yes, this allows users to expose their Delta tables as Hudi.
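A minimal sketch of the user-facing flow (the table name and path are illustrative; the same steps appear in the new hudi/README.md added by this commit):

```scala
// Illustrative only: create a Delta table with Hudi UniForm enabled and write to it.
// The property name and SQL syntax are the ones shown in hudi/README.md below.
spark.sql("""
  CREATE TABLE delta_table_with_hudi (col1 INT) USING DELTA
  TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'hudi')
  LOCATION '/tmp/delta-table-with-hudi'
""")
spark.sql("INSERT INTO delta_table_with_hudi VALUES (1)")

// The same location can then be loaded through a Hudi-enabled Spark session
// (see the reading instructions in hudi/README.md).
```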
1 parent 9c302b0 commit 9028303

File tree: 15 files changed, +1452 -5 lines

build.sbt (+52 -1)
```diff
@@ -484,6 +484,57 @@ lazy val icebergShaded = (project in file("icebergShaded"))
     // Make the 'compile' invoke the 'assembly' task to generate the uber jar.
   )
 
+lazy val hudi = (project in file("hudi"))
+  .dependsOn(spark % "compile->compile;test->test;provided->provided")
+  .settings (
+    name := "delta-hudi",
+    commonSettings,
+    scalaStyleSettings,
+    releaseSettings,
+    libraryDependencies ++= Seq(
+      "org.apache.hudi" % "hudi-java-client" % "0.14.0" % "compile" excludeAll(
+        ExclusionRule(organization = "org.apache.hadoop"),
+        ExclusionRule(organization = "org.apache.zookeeper"),
+      ),
+      "org.apache.spark" %% "spark-avro" % sparkVersion % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+    ),
+    assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+    assembly / logLevel := Level.Info,
+    assembly / test := {},
+    assembly / assemblyMergeStrategy := {
+      // Project hudi `dependsOn` spark and accidentally brings it in, along with its
+      // compile-time dependencies (like delta-storage). We want these excluded from the
+      // delta-hudi jar.
+      case PathList("io", "delta", xs @ _*) =>
+        // - delta-storage will bring in classes: io/delta/storage
+        // - delta-spark will bring in classes: io/delta/exceptions/, io/delta/implicits,
+        //   io/delta/package, io/delta/sql, io/delta/tables
+        MergeStrategy.discard
+      case PathList("com", "databricks", xs @ _*) =>
+        // delta-spark will bring in com/databricks/spark/util
+        MergeStrategy.discard
+      case PathList("org", "apache", "spark", "sql", "delta", "hudi", xs @ _*) =>
+        MergeStrategy.first
+      case PathList("org", "apache", "spark", xs @ _*) =>
+        MergeStrategy.discard
+      // Discard `module-info.class` to fix the `different file contents found` error.
+      // TODO Upgrade SBT to 1.5 which will do this automatically
+      case "module-info.class" => MergeStrategy.discard
+      // Discard unused `parquet.thrift` so that we don't conflict with the file used by the user
+      case "parquet.thrift" => MergeStrategy.discard
+      // Hudi metadata writer requires this service file to be present on the classpath
+      case "META-INF/services/org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory" => MergeStrategy.first
+      // Discard the jackson service configs that we don't need. These files are not shaded, so
+      // adding them may conflict with other jackson versions used by the user.
+      case PathList("META-INF", "services", xs @ _*) => MergeStrategy.discard
+      case x =>
+        MergeStrategy.first
+    },
+    // Make the 'compile' invoke the 'assembly' task to generate the uber jar.
+    Compile / packageBin := assembly.value
+  )
+
 lazy val hive = (project in file("connectors/hive"))
   .dependsOn(standaloneCosmetic)
   .settings (

@@ -1120,7 +1171,7 @@ val createTargetClassesDir = taskKey[Unit]("create target classes dir")
 
 // Don't use these groups for any other projects
 lazy val sparkGroup = project
-  .aggregate(spark, contribs, storage, storageS3DynamoDB, iceberg, testDeltaIcebergJar, sharing)
+  .aggregate(spark, contribs, storage, storageS3DynamoDB, iceberg, testDeltaIcebergJar, sharing, hudi)
   .settings(
     // crossScalaVersions must be set to Nil on the aggregating project
     crossScalaVersions := Nil,
```
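As a usage note on the new module (a sketch under stated assumptions: the `io.delta` group id, the artifact name derived from `name := "delta-hudi"`, and the snapshot version used elsewhere in this commit are not confirmed by the diff itself), a downstream sbt build could depend on the connector roughly like this:

```scala
// Hypothetical coordinates for the published artifact; adjust to whatever release actually
// ships. The README below instead passes the assembly jar directly via --jars.
libraryDependencies += "io.delta" %% "delta-hudi" % "3.2.0-SNAPSHOT"
```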

hudi/README.md (+22)

# Converting to Hudi with UniForm

## Create a table with Hudi UniForm enabled

Using spark-sql you can create a table and insert a few records into it. You will need to include the delta-hudi-assembly jar on the classpath.
```
spark-sql --packages io.delta:delta-spark_2.12:3.2.0-SNAPSHOT --jars delta-hudi-assembly_2.12-3.2.0-SNAPSHOT.jar --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
Then you can create a table with Hudi UniForm enabled.
```
CREATE TABLE `delta_table_with_hudi` (col1 INT) USING DELTA TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'hudi') LOCATION '/tmp/delta-table-with-hudi';
```
And insert a record into it.
```
INSERT INTO delta_table_with_hudi VALUES (1);
```

## Read the table with Hudi

Hudi does not currently support Spark 3.5.x, so you will need to launch a spark-shell with Spark 3.4.x or earlier.
Instructions for launching the spark-shell with Hudi can be found [here](https://hudi.apache.org/docs/quick-start-guide#spark-shellsql).
After launching the shell, you can read the table by enabling the Hudi metadata table in the reader and loading from the path used in the create-table step.
```scala
val df = spark.read.format("hudi").option("hoodie.metadata.enable", "true").load("/tmp/delta-table-with-hudi")
```
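As a quick sanity check (a sketch reusing the path and the single row from the steps above), the Hudi-side read should return the record written through Delta, and re-running the same read after further Delta inserts should pick up the new commits:

```scala
// Run in the Hudi-enabled spark-shell (Spark 3.4.x or earlier), assuming the table at
// /tmp/delta-table-with-hudi was created and populated as shown above.
val hudiDf = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "true") // read through the Hudi metadata table
  .load("/tmp/delta-table-with-hudi")

hudiDf.select("col1").show() // expected to show the single value 1 inserted via Delta
```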
