A3Data/a3lab-deltalake

Project logo

Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.



📝 Table of Contents

About
Architecture ELT Delta Lake
Upserts Delta Lake
Built Using
Authors

🧐 About

This project implements a simple ETL process using PySpark with the Delta Lake framework. The PySpark job consumes a filesystem named landing-zone containing files in JSON format. Along the way we use time travel techniques, write data in Delta format for table management control, and more.

🔧 Architecture ELT Delta Lake

(Architecture diagram)

The job reads data from a filesystem called landing-zone using the Delta Lake dependencies, which are JAR packages declared in the Spark session config; this is what makes the Delta Lake framework available. After the script runs, the data is written to the directory specified in the code. Inside the written table there is a directory called _delta_log, which stores incremental files of table metadata, named something like 00000000000000000000.json, 00000000000000000001.json, and so on. Each JSON file under the _delta_log folder holds information such as added/removed Parquet files (for atomicity), stats (for optimized performance and data skipping), partitionBy (for partition pruning), readVersions (for time travel), and commitInfo (for auditing).
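The sketch below illustrates this ingestion step under assumed names: the app name, the landing-zone/users and lake/bronze/users paths, and the delta-core package version are placeholders, not taken from the repository. It wires the Delta Lake package into the Spark session config, reads the JSON files from the landing zone, and writes them in Delta format, which creates the _delta_log directory next to the Parquet files.

```python
from pyspark.sql import SparkSession

# Delta Lake package and extensions wired into the Spark session config
spark = (
    SparkSession.builder.appName("landing-zone-ingestion")  # hypothetical app name
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Read the raw JSON files from the landing zone (path is an assumption)
raw_users = spark.read.json("landing-zone/users/*.json")

# Write in Delta format: Parquet data files plus the _delta_log metadata directory
raw_users.write.format("delta").mode("overwrite").save("lake/bronze/users")
```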


The next step reads the data in Delta format, which yields a performance gain because the data is stored as Parquet and benefits from one of the framework's great advantages, _delta_log metadata management. In this processing step, unnecessary columns are removed and the tables are prepared with joins for MDW modeling, with the data normalized into datasets.
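A minimal sketch of this processing step, assuming hypothetical table paths, column names, and a user_id join key (none of which come from the repository): read the Delta tables, drop unneeded columns, join them, and write the result back in Delta format.

```python
# Hypothetical bronze tables produced by the ingestion step
users = spark.read.format("delta").load("lake/bronze/users")
orders = spark.read.format("delta").load("lake/bronze/orders")

# Remove columns that are not needed downstream (column names are assumptions)
users_clean = users.drop("raw_payload", "ingestion_debug")

# Join the tables for the MDW-style model and persist the result in Delta format
user_orders = users_clean.join(orders, on="user_id", how="inner")
user_orders.write.format("delta").mode("overwrite").save("lake/silver/user_orders")
```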

The last step is responsible for enriching the data: this is where we treat and refine the data for the business area, or whoever will consume it. In this script I left an example of how to use time travel through the parameter passed to the reader we declared, .option("versionAsOf", "0").
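A short sketch of that time travel read, reusing the hypothetical lake/silver/user_orders path from the previous example: versionAsOf tells the Delta reader which table version recorded in _delta_log to load.

```python
# Time travel: read the table as of version 0 (the first commit in _delta_log)
first_version = (
    spark.read.format("delta")
    .option("versionAsOf", "0")
    .load("lake/silver/user_orders")
)
first_version.show()
```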


🔧 Upserts Delta Lake

(Upserts diagram)

This section shows how the Delta Lake upsert logic works and how we can adopt the new, modern Lakehouse architecture. For this I made a Jupyter notebook available, in which we read data in JSON format related to the users of a system. Our main goal is to read these JSON files and convert them into a Delta table. After that we can access the metadata generated inside a directory called _delta_log, and we can reach the table through the DeltaTable.forPath method. After instantiating the Delta table, we merge the new data using whenMatchedUpdateAll, with a condition that defines how the records are compared, and finally update our Delta table.
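A sketch of that upsert flow, assuming a user_id key and hypothetical paths rather than the notebook's actual names: it instantiates the existing Delta table with DeltaTable.forPath, merges a new batch of JSON records on the key, updates matching rows with whenMatchedUpdateAll, and inserts the rest with whenNotMatchedInsertAll.

```python
from delta.tables import DeltaTable

# New batch of user records arriving as JSON (path is an assumption)
updates = spark.read.json("landing-zone/users_updates/*.json")

# Instantiate the existing Delta table through its path
users_table = DeltaTable.forPath(spark, "lake/bronze/users")

# Merge on the key: update rows that already exist, insert the ones that do not
(
    users_table.alias("target")
    .merge(updates.alias("source"), "target.user_id = source.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```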

⛏️ Built Using

PySpark
Delta Lake
Jupyter Notebook

✍️ Authors
