
Deltalake with Glue PoC DEMO

Introduction

This repo deploys a sample environment that demonstrates how to use Delta Lake on top of AWS Glue to handle streaming data with ACID transactions. AWS Glue is a managed AWS service that offers users an easy way to set up a Spark environment for ETL jobs. Combining the convenience of AWS Glue with the ACID-transaction features of open-source Delta Lake gives users a very attractive way to build a large-scale, database-like data lake: on one hand, large amounts of unstructured or semi-structured data can be fed into such a data lake, and on the other hand, data lake operators do not need to manually maintain or operate the underlying infrastructure. This Glue-plus-Delta-Lake approach resolves several traditional data lake pain points, and hopefully this PoC demo paves the way to building an enterprise-grade data lake in this fashion.

Architecture

(Architecture diagram)

Prerequisites

You will need Git, Terraform, and AWS credentials configured for the target account, plus an RSA key pair for SSH access to the sample Linux instance (see Step 2).

Setup

Step 1

Execute the following command to download the source code:

git clone --depth=1 https://github.com/wei-zhong90/deltalake-aws-poc-demo.git

Step 2

Modify the Terraform input variable values to suit your own needs:

cd deltalake-aws-poc-demo/infrastructure
vim example.auto.tfvars

In the text editor, you will see several variables to fill in. Pay particular attention to the public key variable: it should contain an RSA public key that will be used to SSH into the sample Linux instance, from which you can generate sample data and run your own customized tests.
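As a rough illustration, the file might look like the sketch below. The Kafka topic variable names (kafka_test_topic and kafka_test_topic_2) are the ones Step 4 refers back to; the other names here are assumptions, so check variables.tf for the module's actual interface.

public_key         = "ssh-rsa AAAA... user@example.com"  # RSA public key for SSH access (name assumed)
kafka_test_topic   = "topic-1"                           # first sample Kafka topic
kafka_test_topic_2 = "topic-2"                           # second sample Kafka topic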

Step 3

Deploy the Terraform code (a fresh clone needs terraform init first):

terraform init
terraform apply

The Terraform code actually accepts many more input variables than the ones listed in example.auto.tfvars; only the required ones are included there. For instance, you can set a few variables to reuse an existing VPC instead of creating a brand-new VPC environment for the PoC architecture, as sketched below.
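A hypothetical example of such an override; these variable names are illustrative only, so consult variables.tf for the real ones:

create_vpc = false                      # skip creating a new VPC (hypothetical flag)
vpc_id     = "vpc-0123456789abcdef0"    # existing VPC to deploy into
subnet_ids = ["subnet-aaaa1111", "subnet-bbbb2222"]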

Step 4

Once Step 3 has finished successfully, log in to the newly deployed Linux instance; you will find this git repo already downloaded under the home directory. Now modify the .env file to point the sample data generator at the correct Kafka cluster:

vim .env

The BOOTSTRAP_SERVERS value can be found in the output values from Step 3. For the TOPIC and SECOND_TOPIC variables, simply fill in the same values you used in the previous step (kafka_test_topic and kafka_test_topic_2 in example.auto.tfvars).
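For illustration, a filled-in .env might look like this (the broker string is a placeholder; use the actual Step 3 output):

BOOTSTRAP_SERVERS=b-1.example.kafka.us-east-1.amazonaws.com:9092,b-2.example.kafka.us-east-1.amazonaws.com:9092
TOPIC=topic-1
SECOND_TOPIC=topic-2

Then that is it. With the .env file in place, we can officially start the data generator and validate the whole Delta-Lake-on-Glue architecture: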

python3 -m pip install -r requirements.txt
python3 generator.py
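The repo ships its own generator.py; purely as a sketch of the pattern such a generator follows (assuming the kafka-python and python-dotenv packages, which may differ from what requirements.txt actually pins), it boils down to something like:

# Minimal illustrative sketch of a Kafka data generator; not the repo's
# actual generator.py. Assumes the kafka-python and python-dotenv packages.
import json
import os
import time

from dotenv import load_dotenv
from kafka import KafkaProducer

load_dotenv()  # reads BOOTSTRAP_SERVERS, TOPIC and SECOND_TOPIC from .env

producer = KafkaProducer(
    bootstrap_servers=os.environ["BOOTSTRAP_SERVERS"].split(","),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    record = {"ts": time.time(), "value": 42}  # placeholder payload
    producer.send(os.environ["TOPIC"], record)
    producer.send(os.environ["SECOND_TOPIC"], record)
    producer.flush()
    time.sleep(1)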

Supplement

Not every component is deployed by the Terraform code. The Airflow code needs to be deployed to MWAA manually, through the console or the AWS CLI. Some data maintenance Glue jobs are also not included in the Terraform code; we simply put the Spark scripts of those Glue jobs in this directory to help with understanding. More features are to be added in the future.
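As a rough sketch of the core pattern such a Glue streaming job follows (assuming the Delta Lake libraries are available to the job; the topic, bucket, and paths below are placeholders, not the repo's actual values):

# Illustrative PySpark pattern for a Glue streaming job that writes Kafka
# records into a Delta Lake table; topic, bucket, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("kafka-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the raw Kafka stream and keep the message payload as a string column.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>")
    .option("subscribe", "topic-1")
    .load()
    .select(col("value").cast("string").alias("payload"))
)

# Append the stream into a Delta table on S3; Delta Lake provides the ACID
# guarantees on top of the object store.
(
    stream.writeStream.format("delta")
    .option("checkpointLocation", "s3://<bucket>/checkpoints/topic-1/")
    .start("s3://<bucket>/delta/topic-1/")
)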

Disclaimer

The code in this repo is for demo purposes. There are still a lot of improvements that can be made, especially regarding the Terraform code. If you find any bugs or inconvenient errors, please feel free to report them to me at weiaws@amazon.com or to my colleague Nick at huxiaod@amazon.com. Do not use this code directly in a production environment under any circumstances.
