[Delta-Iceberg] Fix delta-iceberg jar to not pull in delta-spark and delta-storage jars #2022

Closed

scottsand-db commented on Sep 5, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (delta-iceberg)

Description

Resolves #1903

Previously, the delta-iceberg jar was incorrectly including all of the classes from delta-spark and delta-storage.

You could run

wget https://repo1.maven.org/maven2/io/delta/delta-iceberg_2.13/3.0.0rc1/delta-iceberg_2.13-3.0.0rc1.jar
jar tvf delta-iceberg_2.13-3.0.0rc1.jar

and see

com/databricks/spark/util/MetricDefinitions.class
...
io/delta/storage/internal/ThreadUtils.class
...
org/apache/spark/sql/delta/DeltaLog.class

This PR fixes that by updating two sbt-assembly settings:

  1. assemblyExcludedJars: excludes the jars we don't want. This only works for jars that come from libraryDependencies, not for projects pulled in via .dependsOn.
  2. assemblyMergeStrategy: discards the remaining unwanted classes by pattern matching on their paths.
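
As a rough illustration of how these two settings fit together, here is a minimal build.sbt sketch. It is not the configuration merged in this PR: the jar-name prefix `some-unwanted-lib` is hypothetical, and the discarded path patterns are simply taken from the jar listing above. A blanket rule for org/apache/spark/sql/delta is deliberately left out, on the assumption that delta-iceberg may keep classes of its own under that package.

```scala
// build.sbt fragment: a sketch only, not the configuration merged in this PR.
// Assumes the sbt-assembly plugin is enabled; in .sbt files the plugin's keys
// (assembly, PathList, MergeStrategy) are imported automatically.

// 1. Drop whole jars from the assembled jar. This only affects jars that
//    reached the classpath through libraryDependencies.
assembly / assemblyExcludedJars := {
  val cp = (assembly / fullClasspath).value
  // "some-unwanted-lib" is a hypothetical artifact name used for illustration.
  cp.filter(_.data.getName.startsWith("some-unwanted-lib"))
}

// 2. Discard classes that arrive through .dependsOn(...) by matching on paths.
assembly / assemblyMergeStrategy := {
  case PathList("com", "databricks", xs @ _*) => MergeStrategy.discard
  case PathList("io", "delta", "storage", xs @ _*) => MergeStrategy.discard
  case other =>
    // Fall back to the previously configured strategy for everything else.
    val previousStrategy = (assembly / assemblyMergeStrategy).value
    previousStrategy(other)
}
```

The fallback case matters: without it, every path not matched explicitly would have no merge strategy at all.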

How was this patch tested?

Unit Test

Added a new test suite and sbt project. The new project depends on the assembled version of the delta-iceberg jar. The test suite loads that jar and analyses its classes.
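
A check of this kind could look roughly like the sketch below. It is not the test suite added in this PR: the object name `JarContentCheck`, its helpers, and the choice of forbidden prefixes (taken from the jar listing earlier in this description) are all assumptions made for illustration.

```scala
import java.util.jar.JarFile
// JavaConverters is deprecated on Scala 2.13 but compiles on both 2.12 and 2.13.
import scala.collection.JavaConverters._

// Hypothetical helper, not the suite from this PR.
object JarContentCheck {
  // Path prefixes that should not appear in the assembled jar (assumption,
  // based on the entries shown in the jar listing in this description).
  val forbiddenPrefixes: Seq[String] = Seq("com/databricks/", "io/delta/storage/")

  // Returns every entry name that starts with one of the forbidden prefixes.
  def leakedEntries(entryNames: Seq[String], forbidden: Seq[String]): Seq[String] =
    entryNames.filter(name => forbidden.exists(prefix => name.startsWith(prefix)))

  // Lists all entry names in the jar at jarPath, closing the file afterwards.
  def entryNames(jarPath: String): Seq[String] = {
    val jar = new JarFile(jarPath)
    try jar.entries().asScala.map(_.getName).toList
    finally jar.close()
  }
}
```

A test would then assert that `JarContentCheck.leakedEntries(JarContentCheck.entryNames(pathToAssembledJar), JarContentCheck.forbiddenPrefixes)` is empty, where `pathToAssembledJar` stands for wherever the build places the assembled jar.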

QA

Published the jars locally and ran through a simple end-to-end UniForm example.

========== Delta ========== 

build/sbt storage/publishM2
build/sbt spark/publishM2
build/sbt iceberg/publishM2

spark-shell --packages io.delta:delta-spark_2.12:3.0.0-SNAPSHOT,io.delta:delta-storage:3.0.0-SNAPSHOT,io.delta:delta-iceberg_2.12:3.0.0-SNAPSHOT \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

val tablePath = "/Users/scott.sandre/uniform_tables/table_000"

sql(s"CREATE TABLE delta.`$tablePath` (col1 INT, col2 INT) USING DELTA TBLPROPERTIES ('delta.universalFormat.enabledFormats'='iceberg')")

sql(s"INSERT INTO delta.`$tablePath` VALUES (1, 1), (2,2), (3, 3)")

sql(s"SELECT * FROM delta.`$tablePath`").show()
+----+----+
|col1|col2|
+----+----+
|   3|   3|
|   2|   2|
|   1|   1|
+----+----+

========== Iceberg ==========

spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
	--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
	--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
	--conf spark.sql.catalog.spark_catalog.type=hive \
	--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
	--conf spark.sql.catalog.local.type=hadoop \
	--conf spark.sql.catalog.local.warehouse=/Users/scott.sandre/iceberg_warehouse

spark.read.format("iceberg").load("/Users/scott.sandre/uniform_tables/table_000").show()
+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
+----+----+

Does this PR introduce any user-facing changes?

Fixes a bug where the delta-iceberg jar bundled the classes of delta-spark and delta-storage.

scottsand-db self-assigned this on Sep 5, 2023
scottsand-db changed the title from "[WIP] [Spark] Fix delta-iceberg jar to not pull in delta-spark and delta-storage jars" to "[Delta-Iceberg] Fix delta-iceberg jar to not pull in delta-spark and delta-storage jars" on Sep 11, 2023
scottsand-db added this to the 3.0.0 milestone on Sep 11, 2023
* limitations under the License.
*/

package org.apache.spark.sql.delta

the package name and the file path don't seem to match.

Successfully merging this pull request may close these issues.

[BUG] Iceberg JAR contains the spark and storage packages