Minimal example integrating several open-source technologies to create a working open data lakehouse based on Apache Iceberg.
The core technologies used are:
Component | Description | Version | URL |
---|---|---|---|
Trino | Federated query engine | v443 | http://localhost:8080 |
MinIO | Object store | v2023.08.23 | http://localhost:9000 |
Hive MetaStore (HMS) | Metadata repository | v3.1.3 | |
Apache Spark | Distributed computation engine | v3.4.1 | |
Apache Iceberg | Analytics table open format | v1.6 | |
Jupyter notebook | Web-based computational documents | v1.0.0 | http://localhost:8000/tree |
Mage AI | Job orchestrator | v0.9.73 | http://localhost:6789 |
All software components are distributed as Docker images. A docker-compose file automates the deployment of all components. Each component's configuration files are inside the `docker` directory.
The `notebooks` folder contains a Jupyter notebook showing a simplified Spark-based ETL batch job that generates data products stored in MinIO object storage. All data products are registered in the Hive MetaStore (HMS) service as Apache Iceberg tables.
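The core pattern of such a job is roughly the minimal sketch below. This is not the notebook's actual code: the input path, column names, and aggregation are assumptions for illustration, while the `hms` catalog and the `sales_db.sales_summary` table match the objects described later in this README.

```python
# Minimal sketch of a Spark batch ETL step that writes an Iceberg table
# registered in HMS. Input path, columns, and aggregation are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the session is already configured with the `hms` Iceberg catalog
# (see the spark-defaults.conf notes further below).
spark = SparkSession.builder.appName("sales-etl").getOrCreate()

# Read a raw CSV sample, e.g. one of the files from the datasets folder.
raw = spark.read.option("header", True).csv("/home/iceberg/datasets/sales.csv")

# A trivial aggregation standing in for the real transformation logic.
summary = raw.groupBy("product_id").agg(F.count("*").alias("num_sales"))

# Create the target database if needed and write the result as an Iceberg table.
spark.sql("CREATE DATABASE IF NOT EXISTS hms.sales_db")
summary.writeTo("hms.sales_db.sales_summary").createOrReplace()
```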
Both Spark (for data creation) and Trino (for interactive querying) access the lakehouse through the Hive MetaStore (HMS), providing seamless integration of ETL workloads and interactive queries.
The `datasets` folder contains small files with samples of the datasets used.
Only the technologies above are currently integrated; other complementary and promising projects will be considered in the future, such as:
Component | Description |
---|---|
OpenMetadata | Metadata management (catalog, lineage) |
Open Policy Agent (OPA) | Centralized access policy repository and enforcement system |
dbt | SQL-first transformation workflow |
Start all Docker containers with:

`docker-compose up -d`
Since this repo is for teaching purposes, a `.env` file is provided. The access keys it contains are completely disposable and not used in any real system. For an easy setup, it is recommended to keep those access keys unaltered, so the component configuration files do not need to be changed.
Initializing the open lakehouse requires invoking the MinIO object store API. To provision an object store access key, an access secret, and an S3 bucket called `lakehouse` in the MinIO server, just type:

`docker-compose exec minio bash /opt/bin/init_lakehouse.sh`
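The provisioning itself is performed by the `init_lakehouse.sh` script shipped in the repo. As a rough illustration of the same idea (not the script's actual contents), the sketch below creates the `lakehouse` bucket with the MinIO Python SDK (the `minio` package); the endpoint and placeholder credentials are assumptions that should be replaced with the values from `.env`, and access-key provisioning is left to the script or the MinIO console.

```python
# Illustrative only: create the `lakehouse` bucket via the MinIO Python SDK.
# Endpoint and credentials below are placeholders, not the repo's actual values.
from minio import Minio

client = Minio(
    "localhost:9000",           # MinIO API endpoint from the components table
    access_key="<ACCESS_KEY>",  # use the values provided in the .env file
    secret_key="<SECRET_KEY>",
    secure=False,               # the local setup runs over plain HTTP
)

if not client.bucket_exists("lakehouse"):
    client.make_bucket("lakehouse")
    print("Bucket 'lakehouse' created")
else:
    print("Bucket 'lakehouse' already exists")
```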
Since a Python kernel is included in the `spark-iceberg` Docker image, all coding examples are written in Python and SQL. No Scala is used, which keeps the code easy to read.
Open the notebook called Spark ETL and execute it. The following database objects will be created in the lakehouse:

- `sales_db` database with the `sales_summary` table
- `trip_db` database with the `trips` table
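As a quick sanity check from the same notebook (a hedged sketch that assumes the standard `spark` session variable and the `hms` catalog described below), the created objects can be listed and sampled:

```python
# List the tables created by the ETL notebook in each database of the hms catalog.
spark.sql("SHOW TABLES IN hms.sales_db").show()
spark.sql("SHOW TABLES IN hms.trip_db").show()

# Peek at a few rows of one of the Iceberg tables.
spark.sql("SELECT * FROM hms.sales_db.sales_summary LIMIT 10").show()
```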
The Spark `hms` catalog is configured in this file and passed to the Spark container as the `spark-defaults.conf` file. This file sets Iceberg as the default table format for the catalog and HMS as the catalog's metastore.
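For readers who prefer to see the catalog wiring in code rather than in `spark-defaults.conf`, the sketch below shows an equivalent programmatic SparkSession configuration. The service host name, port, and warehouse path are assumptions and must match the docker-compose service definitions; S3A endpoint and credential settings are omitted here and are expected to come from the actual configuration files.

```python
# Equivalent, programmatic version of the `hms` Iceberg catalog configuration.
# Host name, port, and warehouse path are assumptions; adjust them to match
# the docker-compose services.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-hms-example")
    # Enable Iceberg's SQL extensions.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Define the `hms` catalog as an Iceberg catalog backed by the Hive MetaStore.
    .config("spark.sql.catalog.hms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hms.type", "hive")
    .config("spark.sql.catalog.hms.uri", "thrift://hive-metastore:9083")
    # Store table data and metadata in the MinIO `lakehouse` bucket.
    .config("spark.sql.catalog.hms.warehouse", "s3a://lakehouse/warehouse")
    .getOrCreate()
)
```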
The Trino CLI client is installed in the `trino` container. To access Trino from the command line, just connect to it:

`docker-compose exec trino trino`
A very convenient way of connecting to Trino is through its JDBC driver. Using a generic JDBC client such as DBeaver Community is the recommended way of querying Trino.
Some sample SQL queries for the data model exposed by Trino can be found in the `sql` folder.
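Besides the CLI and JDBC, Trino can also be queried from Python with the `trino` client package. The sketch below is a minimal example; the catalog name (`iceberg`), user, and schema are assumptions that must match the Trino catalog configuration shipped in the `docker` directory.

```python
# Minimal sketch: query the lakehouse through Trino's Python DB-API client.
# Catalog, schema, and user below are assumptions; align them with the
# Trino catalog configured in the docker directory.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="demo",          # any user name works when authentication is disabled
    catalog="iceberg",
    schema="sales_db",
)

cur = conn.cursor()
cur.execute("SELECT * FROM sales_summary LIMIT 10")
for row in cur.fetchall():
    print(row)
```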
- https://iceberg.apache.org/spark-quickstart/
- https://tabular.io/blog/docker-spark-and-iceberg/
- https://blog.min.io/manage-iceberg-tables-with-spark/
- https://blog.min.io/iceberg-acid-transactions/
- https://tabular.io/blog/rest-catalog-docker/
- https://blog.min.io/lakehouse-architecture-iceberg-minio/
- https://tabular.io/blog/iceberg-fileio/?ref=blog.min.io
- https://tabular.io/guides-and-papers/
- https://www.datamesh-architecture.com/tech-stacks/minio-trino
- https://dev.to/alexmercedcoder/configuring-apache-spark-for-apache-iceberg-2d41
- https://iceberg.apache.org/docs/latest/configuration/#catalog-properties