
Adding in demo of xtable with S3
Signed-off-by: Albert Wong <atwong@alumni.uci.edu>
Albert Wong authored and Albert Wong committed Jun 4, 2024
1 parent b58175b commit 46cce88
Showing 43 changed files with 6,646 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -41,3 +41,11 @@ target/
*.crc
demo/jars/*
demo/notebook/.ipynb_checkpoints/*

# Demo-s3 files
*.crc
demo-s3/jars/*.jar
demo-s3/notebook/.ipynb_checkpoints/*
demo-s3/spark/cache/**
demo-s3/spark/cache/**/*
demo-s3/spark/jars/*
258 changes: 258 additions & 0 deletions demo-s3/README.md
@@ -0,0 +1,258 @@
# Running a Local Demo
This is a modified version of the Docker Demo that stores its table data in S3-compatible object storage (MinIO).

Use `./start_demo.sh` to spin up a local notebook with a Scala interpreter, Hive Metastore, Presto, and Trino in Docker containers. The script first builds the XTable jars required for the demo and then starts the containers.

## Setup Instructions
1. Build the spark image
`docker build -t atwong/openjdk-11-spark-3.4-hive-2.3.10-hadoop-2.10.2:0.1.1 docker-image/openjdk-11-spark-3.4-hive-2.3.10-hadoop-2.10.2/.`
These versions were picked because they are the versions used to compile Apache Hudi 0.14.1; see https://github.com/apache/hudi/blob/release-0.14.1/pom.xml for details.

2. Start the demo
`./start_demo.sh`
As you can see in the docker-compose file, the demo brings up MinIO as the object store, scripts that create the initial S3 bucket (s3://warehouse), the Apache Hive Metastore, and the Spark container you just built.

3. Access the spark container and create the open table format data in S3
`docker exec -it spark /usr/bin/bash`

3.1 Create the initial Apache Hudi data
```
pyspark --packages com.amazonaws:aws-java-sdk-s3:1.11.271,org.apache.hadoop:hadoop-client:2.10.2,org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
```
What is important to note is that the Hadoop client has to match the Hadoop version in your Spark container (e.g. org.apache.hadoop:hadoop-client:2.10.2 means Hadoop 2.10.2), the Hudi Spark bundle has to match your Spark version (e.g. org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 means Spark 3.4 with Hudi 0.14.1), and the AWS Java SDK has to match the version that hadoop-aws was compiled against (you can see the compile dependency at https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.10.2).

This is the PySpark script to run in the shell:
```
from pyspark.sql.types import *
from pyspark import SparkContext
from pyspark.sql import SparkSession
import os

# Demo database/table names and the destination bucket on MinIO
database_name = "hudi_db"
table_name = "people"
base_path = "s3a://warehouse/"

records = [
    (1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
    (2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
    (3, 'Michael', 35, 'ORD', '2023-09-28 00:00:00'),
    (4, 'Andrew', 40, 'NYC', '2023-10-28 00:00:00'),
    (5, 'Bob', 28, 'SEA', '2023-09-23 00:00:00'),
    (6, 'Charlie', 31, 'DFW', '2023-08-29 00:00:00')
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("create_ts", StringType(), True)
])

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(records, schema)

hudi_options = {
    # 'hoodie.datasource.hive_sync.enable': 'true',
    # 'hoodie.datasource.hive_sync.mode': 'hms',
    # 'hoodie.datasource.hive_sync.database': database_name,
    # 'hoodie.datasource.hive_sync.table': table_name,
    # 'hoodie.datasource.hive_sync.metastore.uris': 'thrift://hive-metastore:9083',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.datasource.write.hive_style_partitioning': 'true'
}

# Write the DataFrame as a Hudi table under s3a://warehouse/people
(
    df.write
    .format("hudi")
    .options(**hudi_options)
    .mode('overwrite')
    .save(f"{base_path}/{table_name}")
)
```
Note that the extra options needed to sync Hudi to HMS are commented out. If you uncomment those lines, you don't need to run the `hudi-hive-sync/run_sync_tool.sh` tool. For this demo, we assume we only write the Hudi files to S3 and will sync them with `hudi-hive-sync/run_sync_tool.sh` later.
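For reference, here is a sketch of the sync-on-write variant of `hudi_options` (the hive_sync keys come from the commented lines above; `hoodie.datasource.hive_sync.partition_fields` is my addition for the city-partitioned table, so verify it against the Hudi 0.14.1 docs):
```
hudi_options_with_sync = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    # register the table in HMS as part of the write
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.metastore.uris': 'thrift://hive-metastore:9083',
    # assumed to be needed for the partitioned table - check the Hudi docs
    'hoodie.datasource.hive_sync.partition_fields': 'city',
}
```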

You can browse the data you've written by going to http://localhost:9000 using admin/password.
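You can also verify the write from the same pyspark session by reading the table back (a quick sanity check, nothing demo-specific):
```
# Read the Hudi table back from MinIO to confirm the write landed under
# s3a://warehouse/people; the _hoodie_* metadata columns are dropped for readability.
df_check = spark.read.format("hudi").load(f"{base_path}/{table_name}")
df_check.select("id", "name", "age", "city", "create_ts").show()
```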

3.2 Use XTable to scan the Apache Hudi data and generate the Iceberg and Delta Lake metadata
```
export AWS_SECRET_ACCESS_KEY=password
export AWS_ACCESS_KEY_ID=admin
export ENDPOINT=http://minio:9000
export AWS_REGION=us-east-1
cd /opt/xtable/jars/; java -jar xtable-utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig xtable_hudi.yaml -p core-site.xml
```
Check out both config files passed on the command line: `xtable_hudi.yaml` holds the XTable dataset config, and `core-site.xml` configures access to S3A (MinIO).

4. Access the spark container and sync the data in S3 to HMS
`docker exec -it spark /usr/bin/bash`

4.1 Download compiled hudi-hive-sync-bundle
```
pyspark --packages org.apache.hudi:hudi-hive-sync-bundle:0.14.1,com.amazonaws:aws-java-sdk-s3:1.11.271,org.apache.hadoop:hadoop-client:2.10.2,org.apache.hadoop:hadoop-aws:2.10.2
```
I'm using pyspark here simply as a way to download the Hudi Hive sync libraries into `~/.ivy2/jars`; you can exit the shell once the packages are resolved.

4.2 Download missing packages to get hudi-hive-sync/run_sync_tool.sh to work
```
pyspark --packages org.apache.parquet:parquet-avro:1.10.1,com.esotericsoftware:kryo-shaded:4.0.2
```
Again, I'm using pyspark to download the remaining libraries that `run_sync_tool.sh` needs into `~/.ivy2/jars`.

4.3 Modify run_sync_tool.sh to add the Ivy-downloaded libraries (fetched via pyspark above) to the classpath
```
vi /opt/hudi/hudi-sync/hudi-hive-sync/run_sync_tool.sh
```
Change the line from
```
java -cp $HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR} org.apache.hudi.hive.HiveSyncTool "$@"
```
to
```
java -cp ~/.ivy2/jars/*:$HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR} org.apache.hudi.hive.HiveSyncTool "$@"
```

4.4 Add Hadoop settings so the sync tool can access S3
```
vi /hadoop/etc/hadoop/core-site.xml
```
It should look like this
```
root@spark:/opt/xtable/jars# cat /hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://minio:9000</value>
  </property>
  <!-- Minio properties -->
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>admin</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>password</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>
```

4.5 Run hudi-hive-sync/run_sync_tool.sh to register Apache Hudi files into Apache Hive MetaStore
```
cd /opt/hudi/hudi-sync/hudi-hive-sync
./run_sync_tool.sh \
--metastore-uris 'thrift://hive-metastore:9083' \
--partitioned-by city \
--base-path 's3a://warehouse/people' \
--database hudi_db \
--table people \
--sync-mode hms
```
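Before wiring up the query engines, it can be worth confirming that the sync actually landed in HMS. Here is a minimal PySpark sketch (my addition, not part of the demo; it assumes this Spark build includes Hive support and reuses the metastore URI from the command above) that you can save to a file and run with `spark-submit` inside the spark container:
```
# check_hms_sync.py - list the tables that run_sync_tool.sh registered in HMS.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hms-sync-check")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("SHOW TABLES IN hudi_db").show()
```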

4.6 Run spark-sql with Delta Lake libraries to register Delta Lake files into Apache Hive MetaStore (HMS)
```
spark-sql --packages com.amazonaws:aws-java-sdk-s3:1.11.271,org.apache.hadoop:hadoop-aws:2.10.2,io.delta:delta-core_2.12:2.4.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--conf "spark.sql.catalogImplementation=hive" \
--conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
```
Run this when you get the spark-sql prompt
```
CREATE SCHEMA delta_db LOCATION 's3a://warehouse/';
CREATE TABLE delta_db.people USING DELTA LOCATION 's3a://warehouse/people';
```
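If you'd rather sanity-check the Delta metadata XTable produced from Python instead of spark-sql, here is a minimal pyspark sketch (assuming pyspark is launched with the same Delta Lake and S3A packages and credentials settings as the spark-sql command above):
```
# Read the Delta table directly by path; this bypasses HMS and only confirms
# that XTable wrote valid Delta Lake metadata under s3a://warehouse/people.
delta_df = spark.read.format("delta").load("s3a://warehouse/people")
delta_df.show()
```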

4.7 Run spark-sql with Apache Iceberg libraries to register Apache Iceberg files into Apache Hive MetaStore (HMS)
```
spark-sql --packages com.amazonaws:aws-java-sdk-s3:1.11.271,org.apache.hadoop:hadoop-aws:2.10.2,org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2 \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" \
--conf "spark.sql.catalog.spark_catalog.type=hive" \
--conf "spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.hive_prod.type=hive" \
--conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
```
Run this when you get the spark-sql prompt
```
CREATE SCHEMA iceberg_db;
CALL hive_prod.system.register_table(
table => 'hive_prod.iceberg_db.people',
metadata_file => 's3a://warehouse/people/metadata/v2.metadata.json'
);
```

5. Access the Trino container and check how Trino is configured for S3
`docker exec -it trino /usr/bin/bash`

5.1 Review the Trino catalog configuration for S3
```
cat /etc/trino/catalog/hudi.properties
cat /etc/trino/catalog/iceberg.properties
cat /etc/trino/catalog/delta.properties
```
The most important item is adding the "hive.s3" configurations.
```
hive.s3.aws-access-key=admin
hive.s3.aws-secret-key=password
hive.s3.endpoint=http://minio:9000
```

## Accessing Services
### Trino
You can access the local Trino container by running `docker exec -it trino trino`

Run the following SQL to view the data.
```
select * from hudi.hudi_db.people;
select * from delta.delta_db.people;
select * from iceberg.iceberg_db.people;
```
### Presto
You can access the local Presto container by running `docker exec -it presto presto-cli --server localhost:8082`