Minimal example integrating several open-source technologies to create a working open data lakehouse based on Apache Iceberg.
The core technologies used are:
Component | Description | Version | URL |
---|---|---|---|
Trino | Federated query engine | v443 | http://localhost:8080 |
MinIO | Object store | v2023.08.23 | http://localhost:9000 |
Hive MetaStore (HMS) | Metadata repository | v3.1.3 | |
Apache Spark | Distributed computation engine | v3.4.1 | |
Apache Iceberg | Analytics table open format | v1.6 | |
Jupyter notebook | Web-based computational documents | v1.0.0 | http://localhost:8000/tree |
Mage AI | Job orchestrator | v0.9.73 | http://localhost:6789 |
All software components are distributed as Docker images. A docker-compose file automates the deployment of all components. Each component's configuration files are inside the `docker` directory.
The `notebooks` folder contains a Jupyter notebook showing a simplified Spark-based ETL batch job that generates data products stored in MinIO object storage. All data products are registered in the Hive MetaStore (HMS) service as Apache Iceberg tables.
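The core pattern of such a job is roughly the minimal sketch below. This is not the notebook's actual code: the input path, column names, and aggregation are assumptions for illustration, while the `hms` catalog and the `sales_db.sales_summary` table match the objects described later in this README.

```python
# Minimal sketch of a Spark batch ETL step that writes an Iceberg table
# registered in HMS. Input path, columns, and aggregation are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the session is already configured with the `hms` Iceberg catalog
# (see the spark-defaults.conf notes further below).
spark = SparkSession.builder.appName("sales-etl").getOrCreate()

# Read a raw CSV sample, e.g. one of the files from the datasets folder.
raw = spark.read.option("header", True).csv("/home/iceberg/datasets/sales.csv")

# A trivial aggregation standing in for the real transformation logic.
summary = raw.groupBy("product_id").agg(F.count("*").alias("num_sales"))

# Create the target database if needed and write the result as an Iceberg table.
spark.sql("CREATE DATABASE IF NOT EXISTS hms.sales_db")
summary.writeTo("hms.sales_db.sales_summary").createOrReplace()
```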
Both Spark (for data creation) and Trino (for interactive querying) access the lakehouse through the Hive MetaStore (HMS), providing seamless integration of ETL workloads and interactive queries.
The `datasets` folder contains small files with samples of the datasets used.
Only the technologies above are currently integrated; other complementary and promising projects will be considered in the future, such as:
Component | Description |
---|---|
OpenMetadata | Metadata management (catalog, lineage) |
Open Policy Agent (OPA) | Centralized access policy repository and enforcement system |
dbt | SQL-first transformation workflow |
Start all Docker containers with:

`docker-compose up -d`
Since this repo is for teaching purposes, a `.env` file is provided. The access keys it contains are completely disposable and not used in any real system. For an easy setup, it is recommended to keep those access keys unaltered, so the component configuration files do not need to be changed.
Initializing the open lakehouse requires invoking the MinIO object store API. To provision an object store access key, an access secret, and an S3 bucket called `lakehouse` in the MinIO server, just type:

`docker-compose exec minio bash /opt/bin/init_lakehouse.sh`
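The provisioning itself is performed by the `init_lakehouse.sh` script shipped in the repo. As a rough illustration of the same idea (not the script's actual contents), the sketch below creates the `lakehouse` bucket with the MinIO Python SDK (the `minio` package); the endpoint and placeholder credentials are assumptions that should be replaced with the values from `.env`, and access-key provisioning is left to the script or the MinIO console.

```python
# Illustrative only: create the `lakehouse` bucket via the MinIO Python SDK.
# Endpoint and credentials below are placeholders, not the repo's actual values.
from minio import Minio

client = Minio(
    "localhost:9000",           # MinIO API endpoint from the components table
    access_key="<ACCESS_KEY>",  # use the values provided in the .env file
    secret_key="<SECRET_KEY>",
    secure=False,               # the local setup runs over plain HTTP
)

if not client.bucket_exists("lakehouse"):
    client.make_bucket("lakehouse")
    print("Bucket 'lakehouse' created")
else:
    print("Bucket 'lakehouse' already exists")
```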
Since a Python kernel is included in the `spark-iceberg` Docker image, all coding examples are written in Python and SQL. No Scala is used, which keeps the code easy to read.
Open the notebook called Spark ETL and execute it. The following database objects will be created in the lakehouse:

- `sales_db` database with the `sales_summary` table
- `trip_db` database with the `trips` table
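As a quick sanity check from the same notebook (a hedged sketch that assumes the standard `spark` session variable and the `hms` catalog described below), the created objects can be listed and sampled:

```python
# List the tables created by the ETL notebook in each database of the hms catalog.
spark.sql("SHOW TABLES IN hms.sales_db").show()
spark.sql("SHOW TABLES IN hms.trip_db").show()

# Peek at a few rows of one of the Iceberg tables.
spark.sql("SELECT * FROM hms.sales_db.sales_summary LIMIT 10").show()
```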
The Spark `hms` catalog is configured in this file and passed to the Spark container as the `spark-defaults.conf` file. This file sets Iceberg as the default table format for the catalog and HMS as the catalog's metastore.
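For readers who prefer to see the catalog wiring in code rather than in `spark-defaults.conf`, the sketch below shows an equivalent programmatic SparkSession configuration. The service host name, port, and warehouse path are assumptions and must match the docker-compose service definitions; S3A endpoint and credential settings are omitted here and are expected to come from the actual configuration files.

```python
# Equivalent, programmatic version of the `hms` Iceberg catalog configuration.
# Host name, port, and warehouse path are assumptions; adjust them to match
# the docker-compose services.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-hms-example")
    # Enable Iceberg's SQL extensions.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Define the `hms` catalog as an Iceberg catalog backed by the Hive MetaStore.
    .config("spark.sql.catalog.hms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hms.type", "hive")
    .config("spark.sql.catalog.hms.uri", "thrift://hive-metastore:9083")
    # Store table data and metadata in the MinIO `lakehouse` bucket.
    .config("spark.sql.catalog.hms.warehouse", "s3a://lakehouse/warehouse")
    .getOrCreate()
)
```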
The Trino CLI client is installed in the `trino` container. To access Trino from the command line, just connect to it:

`docker-compose exec trino trino`
A very convenient way of connecting to Trino is through its JDBC driver. Using a generic JDBC client such as DBeaver Community is the recommended way of querying Trino.
Some sample SQL queries for the data model exposed by Trino can be found in the `sql` folder.
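Besides the CLI and JDBC, Trino can also be queried from Python with the `trino` client package. The sketch below is a minimal example; the catalog name (`iceberg`), user, and schema are assumptions that must match the Trino catalog configuration shipped in the `docker` directory.

```python
# Minimal sketch: query the lakehouse through Trino's Python DB-API client.
# Catalog, schema, and user below are assumptions; align them with the
# Trino catalog configured in the docker directory.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="demo",          # any user name works when authentication is disabled
    catalog="iceberg",
    schema="sales_db",
)

cur = conn.cursor()
cur.execute("SELECT * FROM sales_summary LIMIT 10")
for row in cur.fetchall():
    print(row)
```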
- https://iceberg.apache.org/spark-quickstart/
- https://tabular.io/blog/docker-spark-and-iceberg/
- https://blog.min.io/manage-iceberg-tables-with-spark/
- https://blog.min.io/iceberg-acid-transactions/
- https://tabular.io/blog/rest-catalog-docker/
- https://blog.min.io/lakehouse-architecture-iceberg-minio/
- https://tabular.io/blog/iceberg-fileio/?ref=blog.min.io
- https://tabular.io/guides-and-papers/
- https://www.datamesh-architecture.com/tech-stacks/minio-trino
- https://dev.to/alexmercedcoder/configuring-apache-spark-for-apache-iceberg-2d41
- https://iceberg.apache.org/docs/latest/configuration/#catalog-properties