74 changes: 50 additions & 24 deletions plugins/spark/README.md
@@ -28,32 +28,29 @@ REST endpoints, and provides implementations for Apache Spark's
Right now, the plugin only provides support for Spark 3.5, Scala versions 2.12 and 2.13,
and depends on iceberg-spark-runtime 1.9.0.

# Build Plugin Jar
A task `createPolarisSparkJar` is added to build a jar for the Polaris Spark plugin; the jar is named:
`polaris-spark-<sparkVersion>_<scalaVersion>-<polarisVersion>-bundle.jar`. For example:
`polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`.

- `./gradlew :polaris-spark-3.5_2.12:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.12.
- `./gradlew :polaris-spark-3.5_2.13:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.13.

The resulting jar is located at `plugins/spark/v3.5/build/<scala_version>/libs` after the build.

# Start Spark with Local Polaris Service using built Jar
Once the jar is built, we can manually test it with Spark and a local Polaris service.

# Start Spark with local Polaris service using the Polaris Spark plugin
The following command starts a Polaris server for local testing; it runs on localhost:8181 with the default
realm `POLARIS` and root credentials `root:secret`:
realm `POLARIS` and root credentials `root:s3cr3t`:
```shell
./gradlew run
```
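
To verify the server is up, you can request an OAuth token from the catalog API. This is a quick sketch, not part of the original README, assuming the default realm and the `root:s3cr3t` credentials above:

```shell
# Request a token with the root credentials (assumes the defaults above).
curl -s -X POST http://localhost:8181/api/catalog/v1/oauth/tokens \
-d 'grant_type=client_credentials' \
-d 'client_id=root' \
-d 'client_secret=s3cr3t' \
-d 'scope=PRINCIPAL_ROLE:ALL'
```

A JSON response containing an `access_token` field confirms the server and credentials work.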

Once the local server is running, the following command can be used to start the spark-shell with the built Spark client
jar, and to use the local Polaris server as a Catalog.
Once the local server is running, you can start Spark with the Polaris Spark plugin using either the `--packages`
option with the Polaris Spark package, or the `--jars` option with the Polaris Spark bundle JAR.

The following sections explain how to build and run Spark with both the Polaris package and the bundle JAR.

# Build and run with Polaris Spark package locally
The Polaris Spark client source code is located in `plugins/spark/v3.5/spark`. To use the Polaris Spark package
with Spark, you first need to publish the source JAR to your local Maven repository.

Run the following commands to build the Polaris Spark project and publish the source JAR to your local Maven repository (a quick verification sketch follows the list):
- `./gradlew assemble` -- build the whole Polaris project without running tests
- `./gradlew publishToMavenLocal` -- publish the Polaris project source JAR to your local Maven repository
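
If the publish succeeded, the artifacts should appear in your local Maven repository. A hypothetical check, assuming Scala 2.12 (the exact version directory comes from `versions.txt`):

```shell
# List the locally published Polaris Spark artifacts.
ls ~/.m2/repository/org/apache/polaris/polaris-spark-3.5_2.12/
```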

```shell
bin/spark-shell \
--jars <path-to-spark-client-jar> \
--packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--packages org.apache.polaris:polaris-spark-<spark_version>_<scala_version>:<polaris_version>,org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
@@ -66,17 +63,20 @@ bin/spark-shell \
--conf spark.sql.sources.useV1SourceList=''
```

Assume the path to the built Spark client jar is
`/polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`
and the name of the catalog is `polaris`. The cli command will look like following:
The Polaris version is defined in the `versions.txt` file located in the root directory of the Polaris project.
Assume the following values:
- `spark_version`: 3.5
- `scala_version`: 2.12
- `polaris_version`: 1.1.0-incubating-SNAPSHOT
- `catalog-name`: `polaris`

The Spark command would look like the following:

```shell
bin/spark-shell \
--jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar \
--packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--packages org.apache.polaris:polaris-spark-3.5_2.12:1.1.0-incubating-SNAPSHOT,org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.polaris.warehouse=<catalog-name> \
--conf spark.sql.catalog.polaris.warehouse=polaris \
--conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
@@ -86,6 +86,32 @@ bin/spark-shell \
--conf spark.sql.sources.useV1SourceList=''
```
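
Outside of Spark, you can also sanity-check the catalog over the Iceberg REST API directly. A sketch, assuming a catalog named `polaris` has already been created on the server and `$TOKEN` holds a token obtained as in the earlier `curl` example:

```shell
# List namespaces in the `polaris` catalog via the Iceberg REST API.
curl -s http://localhost:8181/api/catalog/v1/polaris/namespaces \
-H "Authorization: Bearer $TOKEN"
```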

# Build and run with Polaris Spark bundle JAR
The polaris-spark project also provides a Spark bundle JAR for the `--jars` use case. The resulting JAR follows this naming format:
`polaris-spark-<spark_version>_<scala_version>-<polaris_version>-bundle.jar`
For example:
`polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar`

Run `./gradlew assemble` to build the entire Polaris project without running tests. After the build completes,
the bundle JAR can be found under `plugins/spark/v3.5/spark/build/<scala_version>/libs/`.
To start Spark using the bundle JAR, specify it with the `--jars` option as shown below:

```shell
bin/spark-shell \
--jars <path-to-spark-client-jar> \
--packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
--conf spark.sql.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.<catalog-name>=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.<catalog-name>.uri=http://localhost:8181/api/catalog \
--conf spark.sql.catalog.<catalog-name>.credential="root:s3cr3t" \
--conf spark.sql.catalog.<catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.<catalog-name>.token-refresh-enabled=true \
--conf spark.sql.sources.useV1SourceList=''
```

> **Review comment (Contributor):** nit: why not use polaris as the example catalog name?
>
> **Author reply:** This is the shell command without filling in the actual values; I added an example in the README with specific values filled in.
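
For reference, a filled-in version of the command above, assuming the same example values as the packages section (Scala 2.12, Polaris version 1.1.0-incubating-SNAPSHOT, catalog name `polaris`) and an illustrative checkout at `/polaris`:

```shell
bin/spark-shell \
--jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar \
--packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.polaris.warehouse=polaris \
--conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
--conf spark.sql.catalog.polaris.credential="root:s3cr3t" \
--conf spark.sql.catalog.polaris.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.polaris.token-refresh-enabled=true \
--conf spark.sql.sources.useV1SourceList=''
```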

# Limitations
The Polaris Spark client supports catalog management for both Iceberg and Delta tables; it routes all Iceberg table
requests to the Iceberg REST endpoints, and routes all Delta table requests to the Generic Table REST endpoints.
@@ -265,7 +265,7 @@
"from pyspark.sql import SparkSession\n",
"\n",
"spark = (SparkSession.builder\n",
" .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar\")\n",
" .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar\") # TODO: add a way to automatically discover the Jar\n",
" .config(\"spark.jars.packages\", \"org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.2.1\")\n",
" .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n",
" .config('spark.sql.iceberg.vectorization.enabled', 'false')\n",
File renamed without changes.
104 changes: 14 additions & 90 deletions plugins/spark/v3.5/spark/build.gradle.kts
@@ -89,96 +89,20 @@ tasks.register<ShadowJar>("createPolarisSparkJar") {
from(sourceSets.main.get().output)
configurations = listOf(project.configurations.runtimeClasspath.get())

// Optimization: Minimize the JAR (remove unused classes from dependencies)
// The iceberg-spark-runtime plugin is always packaged along with our polaris-spark plugin,
// therefore excluded from the optimization.
minimize { exclude(dependency("org.apache.iceberg:iceberg-spark-runtime-*.*")) }

// Always run the license file addition after this task completes
finalizedBy("addLicenseFilesToJar")
}

// Post-processing task to add our project's LICENSE and NOTICE files to the jar and remove any
// other LICENSE or NOTICE files that were shaded in.
tasks.register("addLicenseFilesToJar") {
dependsOn("createPolarisSparkJar")

doLast {
val shadowTask = tasks.named("createPolarisSparkJar", ShadowJar::class.java).get()
val jarFile = shadowTask.archiveFile.get().asFile
val tempDir =
File(
"${project.layout.buildDirectory.get().asFile}/tmp/jar-cleanup-${shadowTask.archiveBaseName.get()}-${shadowTask.archiveClassifier.get()}"
)
val projectLicenseFile = File(projectDir, "LICENSE")
val projectNoticeFile = File(projectDir, "NOTICE")

// Validate that required license files exist
if (!projectLicenseFile.exists()) {
throw GradleException("Project LICENSE file not found at: ${projectLicenseFile.absolutePath}")
}
if (!projectNoticeFile.exists()) {
throw GradleException("Project NOTICE file not found at: ${projectNoticeFile.absolutePath}")
}

logger.info("Processing jar: ${jarFile.absolutePath}")
logger.info("Using temp directory: ${tempDir.absolutePath}")

// Clean up temp directory
if (tempDir.exists()) {
tempDir.deleteRecursively()
}
tempDir.mkdirs()

// Extract the jar
copy {
from(zipTree(jarFile))
into(tempDir)
}

fileTree(tempDir)
.matching {
include("**/*LICENSE*")
include("**/*NOTICE*")
}
.forEach { file ->
logger.info("Removing license file: ${file.relativeTo(tempDir)}")
file.delete()
}

// Remove META-INF/licenses directory if it exists
val licensesDir = File(tempDir, "META-INF/licenses")
if (licensesDir.exists()) {
licensesDir.deleteRecursively()
logger.info("Removed META-INF/licenses directory")
}

// Copy our project's license files to root
copy {
from(projectLicenseFile)
into(tempDir)
}
logger.info("Added project LICENSE file")

copy {
from(projectNoticeFile)
into(tempDir)
}
logger.info("Added project NOTICE file")

// Delete the original jar
jarFile.delete()

// Create new jar with only project LICENSE and NOTICE files
ant.withGroovyBuilder {
"jar"("destfile" to jarFile.absolutePath) { "fileset"("dir" to tempDir.absolutePath) }
}

logger.info("Recreated jar with only project LICENSE and NOTICE files")

// Clean up temp directory
tempDir.deleteRecursively()
}
// recursively remove all LICENSE and NOTICE files under META-INF, including
// directories containing 'license' in the name
exclude("META-INF/**/*LICENSE*")
exclude("META-INF/**/*NOTICE*")
// exclude the top-level LICENSE, LICENSE-*.txt and NOTICE
exclude("LICENSE*")
exclude("NOTICE*")

// add Polaris' customized LICENSE and NOTICE for the bundle jar at the top level. Note that the
// customized LICENSE and NOTICE files are called BUNDLE-LICENSE and BUNDLE-NOTICE,
// and are renamed to LICENSE and NOTICE after inclusion; this avoids the files
// being excluded by the exclude patterns above.
from("${projectDir}/BUNDLE-LICENSE") { rename { "LICENSE" } }
from("${projectDir}/BUNDLE-NOTICE") { rename { "NOTICE" } }
}

// ensure the shadow jar job (which will automatically run license addition) is run for both
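
One way to sanity-check the shading rules above is to list the jar's entries after a build. A sketch, assuming the Scala 2.12 bundle jar has been built:

```shell
# Expect only the top-level LICENSE and NOTICE (copied from BUNDLE-LICENSE/BUNDLE-NOTICE).
jar tf plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-*-bundle.jar \
| grep -iE 'license|notice'
```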