Commit 33d9940

Improve the bundle jar license and notice remove using exclude (#1991)
1 parent fae17a0 commit 33d9940

File tree

5 files changed: +65 -115 lines changed


plugins/spark/README.md

Lines changed: 50 additions & 24 deletions
@@ -28,32 +28,29 @@ REST endpoints, and provides implementations for Apache Spark's
 Right now, the plugin only provides support for Spark 3.5, Scala version 2.12 and 2.13,
 and depends on iceberg-spark-runtime 1.9.0.

-# Build Plugin Jar
-A task createPolarisSparkJar is added to build a jar for the Polaris Spark plugin, the jar is named as:
-`polaris-spark-<sparkVersion>_<scalaVersion>-<polarisVersion>-bundle.jar`. For example:
-`polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`.
-
-- `./gradlew :polaris-spark-3.5_2.12:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.12.
-- `./gradlew :polaris-spark-3.5_2.13:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.13.
-
-The result jar is located at plugins/spark/v3.5/build/<scala_version>/libs after the build.
-
-# Start Spark with Local Polaris Service using built Jar
-Once the jar is built, we can manually test it with Spark and a local Polaris service.
-
+# Start Spark with local Polaris service using the Polaris Spark plugin
 The following command starts a Polaris server for local testing, it runs on localhost:8181 with default
-realm `POLARIS` and root credentials `root:secret`:
+realm `POLARIS` and root credentials `root:s3cr3t`:
 ```shell
 ./gradlew run
 ```

-Once the local server is running, the following command can be used to start the spark-shell with the built Spark client
-jar, and to use the local Polaris server as a Catalog.
+Once the local server is running, you can start Spark with the Polaris Spark plugin using either the `--packages`
+option with the Polaris Spark package, or the `--jars` option with the Polaris Spark bundle JAR.
+
+The following sections explain how to build and run Spark with both the Polaris package and the bundle JAR.
+
+# Build and run with the Polaris Spark package locally
+The Polaris Spark client source code is located in plugins/spark/v3.5/spark. To use the Polaris Spark package
+with Spark, you first need to publish the source JAR to your local Maven repository.
+
+Run the following commands to build the Polaris Spark project and publish the source JAR to your local Maven repository:
+- `./gradlew assemble` -- build the whole Polaris project without running tests
+- `./gradlew publishToMavenLocal` -- publish the Polaris project source JAR to the local Maven repository

 ```shell
 bin/spark-shell \
-  --jars <path-to-spark-client-jar> \
-  --packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
+  --packages org.apache.polaris:polaris-spark-<spark_version>_<scala_version>:<polaris_version>,org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
   --conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
@@ -66,17 +63,20 @@ bin/spark-shell \
   --conf spark.sql.sources.useV1SourceList=''
 ```

-Assume the path to the built Spark client jar is
-`/polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`
-and the name of the catalog is `polaris`. The cli command will look like following:
+The Polaris version is defined in the `versions.txt` file located in the root directory of the Polaris project.
+Assume the following values:
+- `spark_version`: 3.5
+- `scala_version`: 2.12
+- `polaris_version`: 1.1.0-incubating-SNAPSHOT
+- `catalog-name`: `polaris`
+The Spark command would look like the following:

 ```shell
 bin/spark-shell \
-  --jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar \
-  --packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
+  --packages org.apache.polaris:polaris-spark-3.5_2.12:1.1.0-incubating-SNAPSHOT,org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
-  --conf spark.sql.catalog.polaris.warehouse=<catalog-name> \
+  --conf spark.sql.catalog.polaris.warehouse=polaris \
   --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
   --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
   --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
@@ -86,6 +86,32 @@ bin/spark-shell \
   --conf spark.sql.sources.useV1SourceList=''
 ```

+# Build and run with the Polaris Spark bundle JAR
+The polaris-spark project also provides a Spark bundle JAR for the `--jars` use case. The resulting JAR follows this naming format:
+polaris-spark-<spark_version>_<scala_version>-<polaris_version>-bundle.jar
+For example:
+polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar
+
+Run `./gradlew assemble` to build the entire Polaris project without running tests. After the build completes,
+the bundle JAR can be found under: plugins/spark/v3.5/spark/build/<scala_version>/libs/.
+To start Spark using the bundle JAR, specify it with the `--jars` option as shown below:
+
+```shell
+bin/spark-shell \
+  --jars <path-to-spark-client-jar> \
+  --packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
+  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
+  --conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
+  --conf spark.sql.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation=vended-credentials \
+  --conf spark.sql.catalog.<catalog-name>=org.apache.polaris.spark.SparkCatalog \
+  --conf spark.sql.catalog.<catalog-name>.uri=http://localhost:8181/api/catalog \
+  --conf spark.sql.catalog.<catalog-name>.credential="root:s3cr3t" \
+  --conf spark.sql.catalog.<catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
+  --conf spark.sql.catalog.<catalog-name>.token-refresh-enabled=true \
+  --conf spark.sql.sources.useV1SourceList=''
+```
+
 # Limitations
 The Polaris Spark client supports catalog management for both Iceberg and Delta tables, it routes all Iceberg table
 requests to the Iceberg REST endpoints, and routes all Delta table requests to the Generic Table REST endpoints.
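
The `--packages` flow above resolves the Polaris Spark client from the local Maven repository after `./gradlew publishToMavenLocal`. A quick way to verify the publish step is to list the published artifact directory; a minimal sketch, assuming the default `~/.m2/repository` location and the example versions above (Spark 3.5, Scala 2.12, 1.1.0-incubating-SNAPSHOT):

```shell
# Sketch: confirm publishToMavenLocal produced the client artifact.
# Assumes the default local Maven repository at ~/.m2/repository and the
# example versions used in the README above.
ls ~/.m2/repository/org/apache/polaris/polaris-spark-3.5_2.12/1.1.0-incubating-SNAPSHOT/
```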

plugins/spark/v3.5/getting-started/notebooks/SparkPolaris.ipynb

Lines changed: 1 addition & 1 deletion
@@ -265,7 +265,7 @@
 "from pyspark.sql import SparkSession\n",
 "\n",
 "spark = (SparkSession.builder\n",
-"    .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar\")\n",
+"    .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar\") # TODO: add a way to automatically discover the Jar\n",
 "    .config(\"spark.jars.packages\", \"org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.2.1\")\n",
 "    .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n",
 "    .config('spark.sql.iceberg.vectorization.enabled', 'false')\n",
File renamed without changes.
File renamed without changes.

plugins/spark/v3.5/spark/build.gradle.kts

Lines changed: 14 additions & 90 deletions
@@ -89,96 +89,20 @@ tasks.register<ShadowJar>("createPolarisSparkJar") {
   from(sourceSets.main.get().output)
   configurations = listOf(project.configurations.runtimeClasspath.get())

-  // Optimization: Minimize the JAR (remove unused classes from dependencies)
-  // The iceberg-spark-runtime plugin is always packaged along with our polaris-spark plugin,
-  // therefore excluded from the optimization.
-  minimize { exclude(dependency("org.apache.iceberg:iceberg-spark-runtime-*.*")) }
-
-  // Always run the license file addition after this task completes
-  finalizedBy("addLicenseFilesToJar")
-}
-
-// Post-processing task to add our project's LICENSE and NOTICE files to the jar and remove any
-// other LICENSE or NOTICE files that were shaded in.
-tasks.register("addLicenseFilesToJar") {
-  dependsOn("createPolarisSparkJar")
-
-  doLast {
-    val shadowTask = tasks.named("createPolarisSparkJar", ShadowJar::class.java).get()
-    val jarFile = shadowTask.archiveFile.get().asFile
-    val tempDir =
-      File(
-        "${project.layout.buildDirectory.get().asFile}/tmp/jar-cleanup-${shadowTask.archiveBaseName.get()}-${shadowTask.archiveClassifier.get()}"
-      )
-    val projectLicenseFile = File(projectDir, "LICENSE")
-    val projectNoticeFile = File(projectDir, "NOTICE")
-
-    // Validate that required license files exist
-    if (!projectLicenseFile.exists()) {
-      throw GradleException("Project LICENSE file not found at: ${projectLicenseFile.absolutePath}")
-    }
-    if (!projectNoticeFile.exists()) {
-      throw GradleException("Project NOTICE file not found at: ${projectNoticeFile.absolutePath}")
-    }
-
-    logger.info("Processing jar: ${jarFile.absolutePath}")
-    logger.info("Using temp directory: ${tempDir.absolutePath}")
-
-    // Clean up temp directory
-    if (tempDir.exists()) {
-      tempDir.deleteRecursively()
-    }
-    tempDir.mkdirs()
-
-    // Extract the jar
-    copy {
-      from(zipTree(jarFile))
-      into(tempDir)
-    }
-
-    fileTree(tempDir)
-      .matching {
-        include("**/*LICENSE*")
-        include("**/*NOTICE*")
-      }
-      .forEach { file ->
-        logger.info("Removing license file: ${file.relativeTo(tempDir)}")
-        file.delete()
-      }
-
-    // Remove META-INF/licenses directory if it exists
-    val licensesDir = File(tempDir, "META-INF/licenses")
-    if (licensesDir.exists()) {
-      licensesDir.deleteRecursively()
-      logger.info("Removed META-INF/licenses directory")
-    }
-
-    // Copy our project's license files to root
-    copy {
-      from(projectLicenseFile)
-      into(tempDir)
-    }
-    logger.info("Added project LICENSE file")
-
-    copy {
-      from(projectNoticeFile)
-      into(tempDir)
-    }
-    logger.info("Added project NOTICE file")
-
-    // Delete the original jar
-    jarFile.delete()
-
-    // Create new jar with only project LICENSE and NOTICE files
-    ant.withGroovyBuilder {
-      "jar"("destfile" to jarFile.absolutePath) { "fileset"("dir" to tempDir.absolutePath) }
-    }
-
-    logger.info("Recreated jar with only project LICENSE and NOTICE files")
-
-    // Clean up temp directory
-    tempDir.deleteRecursively()
-  }
+  // Recursively remove all LICENSE and NOTICE files under META-INF, including
+  // directories containing 'license' in the name.
+  exclude("META-INF/**/*LICENSE*")
+  exclude("META-INF/**/*NOTICE*")
+  // Exclude the top-level LICENSE, LICENSE-*.txt and NOTICE.
+  exclude("LICENSE*")
+  exclude("NOTICE*")
+
+  // Add the Polaris customized LICENSE and NOTICE for the bundle jar at the top level.
+  // Note that the customized files are called BUNDLE-LICENSE and BUNDLE-NOTICE and are
+  // renamed to LICENSE and NOTICE after inclusion; this avoids them being removed by
+  // the exclude patterns above.
+  from("${projectDir}/BUNDLE-LICENSE") { rename { "LICENSE" } }
+  from("${projectDir}/BUNDLE-NOTICE") { rename { "NOTICE" } }
 }

 // ensure the shadow jar job (which will automatically run license addition) is run for both
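
With the post-processing task replaced by `exclude` patterns plus the `BUNDLE-LICENSE`/`BUNDLE-NOTICE` rename, a quick way to spot-check the result is to list license-related entries in the built bundle JAR; only the top-level `LICENSE` and `NOTICE` should remain. A minimal sketch, assuming a Scala 2.12 build and the example version from the README:

```shell
# Sketch: list license/notice entries in the bundle JAR. After this change,
# only the top-level LICENSE and NOTICE (renamed from BUNDLE-LICENSE and
# BUNDLE-NOTICE) should appear; the path assumes a Scala 2.12 build.
jar tf plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar \
  | grep -iE 'license|notice'
```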
