
Commit 1dd2be9

Add getting-started for Polaris Spark Client with Delta tables (#1488)

1 parent e4eabb5

5 files changed: +1091 -6 lines

plugins/spark/README.md

Lines changed: 61 additions & 6 deletions
# Build Plugin Jar
A task createPolarisSparkJar is added to build a jar for the Polaris Spark plugin. The resulting jar is located at
plugins/spark/v3.5/build/<scala_version>/libs after the build.

# Start Spark with Local Polaris Service using built Jar
Once the jar is built, we can manually test it with Spark and a local Polaris service.

The following command starts a Polaris server for local testing. It runs on localhost:8181 with the default
realm `POLARIS` and root credentials `root:secret`:
```shell
./gradlew run
```
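
Note that the spark-shell configuration below assumes a catalog already exists on the server. If you need to create
one, the following sketch shows one way to do it against the local server using the root credentials above; the
catalog name `polaris`, the storage location, and the exact request payload are illustrative and may need adjusting
to your Polaris version's management API.

```shell
# Fetch an OAuth token for the root principal from the local server.
TOKEN=$(curl -s http://localhost:8181/api/catalog/v1/oauth/tokens \
  -d 'grant_type=client_credentials&client_id=root&client_secret=secret&scope=PRINCIPAL_ROLE:ALL' \
  | sed -E 's/.*"access_token" *: *"([^"]+)".*/\1/')

# Create a catalog named `polaris` backed by local file storage (payload shape is an
# assumption; check the management API spec of your Polaris build).
curl -s -X POST http://localhost:8181/api/management/v1/catalogs \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
        "catalog": {
          "name": "polaris",
          "type": "INTERNAL",
          "properties": {"default-base-location": "file:///tmp/polaris/"},
          "storageConfigInfo": {"storageType": "FILE", "allowedLocations": ["file:///tmp"]}
        }
      }'
```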

Once the local server is running, the following command can be used to start the spark-shell with the built Spark client
jar and to use the local Polaris server as a catalog:

```shell
bin/spark-shell \
--jars <path-to-spark-client-jar> \
--packages org.apache.hadoop:hadoop-aws:3.4.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
--conf spark.sql.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation=true \
--conf spark.sql.catalog.<catalog-name>=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.<catalog-name>.uri=http://localhost:8181/api/catalog \
--conf spark.sql.catalog.<catalog-name>.credential="root:secret" \
--conf spark.sql.catalog.<catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.<catalog-name>.token-refresh-enabled=true \
--conf spark.sql.catalog.<catalog-name>.type=rest \
--conf spark.sql.sources.useV1SourceList=''
```

Assume the path to the built Spark client jar is
`/polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-iceberg-1.8.1-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar`
and the name of the catalog is `polaris`. The CLI command will then look like the following:

```shell
bin/spark-shell \
--jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-iceberg-1.8.1-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar \
--packages org.apache.hadoop:hadoop-aws:3.4.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.polaris.warehouse=polaris \
--conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=true \
--conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
--conf spark.sql.catalog.polaris.credential="root:secret" \
--conf spark.sql.catalog.polaris.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.polaris.token-refresh-enabled=true \
--conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.sources.useV1SourceList=''
```
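
As a quick smoke test (not part of the original instructions; the namespace, table name, and location below are made
up), the same configuration can be passed to `bin/spark-sql` to create a namespace and a Delta table with an explicit
location:

```shell
bin/spark-sql \
  --jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-iceberg-1.8.1-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar \
  --packages org.apache.hadoop:hadoop-aws:3.4.0,io.delta:delta-spark_2.12:3.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.warehouse=polaris \
  --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=true \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.credential="root:secret" \
  --conf spark.sql.catalog.polaris.scope='PRINCIPAL_ROLE:ALL' \
  --conf spark.sql.catalog.polaris.token-refresh-enabled=true \
  --conf spark.sql.catalog.polaris.type=rest \
  --conf spark.sql.sources.useV1SourceList='' \
  -e "CREATE NAMESPACE IF NOT EXISTS polaris.delta_ns;
      CREATE TABLE polaris.delta_ns.quickstart_table (id INT) USING DELTA LOCATION 'file:///tmp/delta_ns/quickstart_table';
      INSERT INTO polaris.delta_ns.quickstart_table VALUES (1), (2);
      SELECT * FROM polaris.delta_ns.quickstart_table;"
```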

# Limitations
The Polaris Spark client supports catalog management for both Iceberg and Delta tables: it routes all Iceberg table
requests to the Iceberg REST endpoints and all Delta table requests to the Generic Table REST endpoints.

The current limitations of the Polaris Spark client are:
1) Create table as select (CTAS) is not supported for Delta tables. As a result, the `saveAsTable` method of `DataFrame`
   is also not supported, since it relies on CTAS support.
2) Creating a Delta table without an explicit location is not supported.
3) Renaming a Delta table is not supported.
4) `ALTER TABLE ... SET LOCATION`, `SET FILEFORMAT`, and `ADD PARTITION` are not supported for Delta tables.
5) For other non-Iceberg tables such as CSV, no specific guarantees are provided today.
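
This routing can also be observed directly over REST. The sketch below is hypothetical: the Iceberg path follows the
standard Iceberg REST spec, while the Generic Table path is assumed here to live under `polaris/v1` of the catalog
API; verify both against your server's OpenAPI spec. It reuses the `$TOKEN` variable from the catalog-creation sketch
and the `delta_ns` namespace from the smoke test above.

```shell
# Iceberg tables in the namespace are listed via the Iceberg REST endpoint ...
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8181/api/catalog/v1/polaris/namespaces/delta_ns/tables

# ... while Delta tables are expected under the Generic Table endpoint (assumed path).
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8181/api/catalog/polaris/v1/polaris/namespaces/delta_ns/generic-tables
```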

plugins/spark/v3.5/getting-started/README.md (new file)

Lines changed: 78 additions & 0 deletions
<!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements. See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership. The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied. See the License for the
 specific language governing permissions and limitations
 under the License.
-->

# Getting Started with Apache Spark and Apache Polaris with Delta and Iceberg

This getting-started guide provides a `docker-compose` file to set up [Apache Spark](https://spark.apache.org/) with Apache Polaris using
the new Polaris Spark Client.

The Polaris Spark Client enables management of both Delta and Iceberg tables using Apache Polaris.

A Jupyter notebook is started to run PySpark, and the Polaris Python client is also installed to call Polaris APIs
directly from Python.

## Build the Spark Client Jar and Polaris image
If the Spark client jar is not present locally under plugins/spark/v3.5/build/<scala_version>/libs, build the jar
using:
- `./gradlew assemble` -- builds the Polaris project, skipping tests.

If a Polaris image is not already present locally, build one with the following command:

```shell
./gradlew \
  :polaris-quarkus-server:assemble \
  :polaris-quarkus-server:quarkusAppPartsBuild --rerun \
  -Dquarkus.container-image.build=true
```
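
As a quick sanity check before starting the containers (the Scala 2.12 jar path matches what the Dockerfile below
copies, and the image name matches the `docker-compose` file; both are assumptions of this sketch):

```shell
# The Spark client jar produced by `./gradlew assemble` (use the 2.13 directory for Scala 2.13).
ls plugins/spark/v3.5/spark/build/2.12/libs/

# The Polaris image produced by the Gradle image build.
docker images apache/polaris
```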

## Run the `docker-compose` file

To start the `docker-compose` file, run this command from the repo's root directory:
```shell
docker-compose -f plugins/spark/v3.5/getting-started/docker-compose.yml up
```

This will spin up two container services:
* The `polaris` service for running Apache Polaris using an in-memory metastore
* The `jupyter` service for running a Jupyter notebook with PySpark

NOTE: Starting the containers for the first time may take a couple of minutes, because Spark 3.5.5 needs to be
downloaded. When working with Delta, the Polaris Spark Client requires delta-io >= 3.2.1, which in turn requires at
least Spark 3.5.3, but the current Jupyter Spark image only ships Spark 3.5.0.

### Run with AWS access setup
If you want to interact with an S3 bucket, make sure the following environment variables are set correctly in
your local environment before running the `docker-compose` file.
```
AWS_ACCESS_KEY_ID=<your_access_key>
AWS_SECRET_ACCESS_KEY=<your_secret_key>
```
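
For example, the variables can also be supplied inline when starting the stack (the values are placeholders):

```shell
AWS_ACCESS_KEY_ID=<your_access_key> \
AWS_SECRET_ACCESS_KEY=<your_secret_key> \
docker-compose -f plugins/spark/v3.5/getting-started/docker-compose.yml up
```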

## Access the Jupyter notebook interface
In the Jupyter notebook container log, look for the URL to access the Jupyter notebook. The URL should be in the
format `http://127.0.0.1:8888/lab?token=<token>`.

Open the Jupyter notebook in a browser.
Navigate to [`notebooks/SparkPolaris.ipynb`](http://127.0.0.1:8888/lab/tree/notebooks/SparkPolaris.ipynb) <!-- markdown-link-check-disable-line -->

If the above URL doesn't work, try replacing `127.0.0.1` with `localhost`, for example:
`http://localhost:8888/lab?token=<token>`.
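
If the log line has scrolled past, one hypothetical way to recover the tokenized URL from the `jupyter` service log
(the service name is taken from the `docker-compose` file below):

```shell
docker-compose -f plugins/spark/v3.5/getting-started/docker-compose.yml logs jupyter \
  | grep -o 'http://127.0.0.1:8888/lab?token=[A-Za-z0-9]*' | tail -n 1
```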

## Run the Jupyter notebook
You can now run all cells in the notebook or write your own code!

plugins/spark/v3.5/getting-started/docker-compose.yml (new file)

Lines changed: 54 additions & 0 deletions
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

services:
  polaris:
    image: apache/polaris:latest
    ports:
      - "8181:8181"
      - "8182"
    environment:
      AWS_REGION: us-west-2
      AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
      POLARIS_BOOTSTRAP_CREDENTIALS: default-realm,root,s3cr3t
      polaris.realm-context.realms: default-realm
      quarkus.otel.sdk.disabled: "true"
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8182/healthcheck"]
      interval: 10s
      timeout: 10s
      retries: 5
  jupyter:
    build:
      context: ../../../../  # this is needed to get the ./client
      dockerfile: ./plugins/spark/v3.5/getting-started/notebooks/Dockerfile
      network: host
    ports:
      - "8888:8888"
    depends_on:
      polaris:
        condition: service_healthy
    environment:
      AWS_REGION: us-west-2
      AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
      POLARIS_HOST: polaris
    volumes:
      - ./notebooks:/home/jovyan/notebooks
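
Once the stack is up, a hypothetical way to verify the `polaris` service health from the host: container port 8182 is
published on an ephemeral host port (only "8182" is listed above), so look up the mapping first.

```shell
# Resolve the host address for the management port, then hit the healthcheck endpoint.
MGMT=$(docker-compose -f plugins/spark/v3.5/getting-started/docker-compose.yml port polaris 8182)
curl "http://$MGMT/healthcheck"
```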

plugins/spark/v3.5/getting-started/notebooks/Dockerfile (new file)

Lines changed: 47 additions & 0 deletions
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

FROM jupyter/all-spark-notebook:spark-3.5.0

ENV LANGUAGE='en_US:en'

USER root

# Generic table support requires Delta >= 3.2.1, which needs at least Spark 3.5.3,
# so install Spark 3.5.5 over the base image's Spark 3.5.0.
RUN wget -q https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz \
    && tar -xzf spark-3.5.5-bin-hadoop3.tgz \
    && mv spark-3.5.5-bin-hadoop3 /opt/spark \
    && rm spark-3.5.5-bin-hadoop3.tgz

# Set environment variables
ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH

USER jovyan

COPY --chown=jovyan client /home/jovyan/client
COPY --chown=jovyan regtests/requirements.txt /tmp
COPY --chown=jovyan plugins/spark/v3.5/spark/build/2.12/libs /home/jovyan/polaris_libs
RUN pip install -r /tmp/requirements.txt
RUN cd client/python && poetry lock && \
    python3 -m poetry install && \
    pip install -e .

WORKDIR /home/jovyan/
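
The `docker-compose` file builds this image with the repository root as the build context; an equivalent manual build
(the tag name is made up) would be run from the repository root so that `client`, `regtests`, and the plugin libs
resolve:

```shell
docker build \
  -f plugins/spark/v3.5/getting-started/notebooks/Dockerfile \
  -t polaris-spark-jupyter:local \
  .
```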
