In this project, I implemented a data engineering pipeline based on the Medallion Architecture, using Apache Spark, Azure Databricks, Data Build Tool (DBT), and Azure Data Factory (ADF). The entire system is deployed on Azure, with Azure Data Lake storing both raw and processed data.

sowrabh-m/Data_Pipeline_Spark_Azure_DBT


Modern Data Engineering with Medallion Architecture

Project Overview

This project sets up an end-to-end data engineering pipeline using Apache Spark, Azure Databricks, and Data Build Tool (DBT) on the Azure cloud platform. Leveraging the Medallion Architecture, our pipeline encompasses data ingestion, integration, and transformation processes designed to prepare data for advanced analytics.

Architecture

(System architecture diagram)

Components

  • Apache Spark: Utilized for large-scale data processing.
  • Azure Databricks: Provides a high-performance analytics platform.
  • DBT (Data Build Tool): Used for data modeling and transformations within the data lakehouse.
  • Azure Data Factory: Manages data pipelines for data integration and transformation.

Data Layers

  • Bronze: Raw data ingestion and storage.
  • Silver: Data cleaning and enrichment.
  • Gold: Aggregated data optimized for business intelligence.
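The Bronze → Silver → Gold progression can be sketched in miniature with plain Python (the real pipeline does this with Spark and DBT models; the record fields and names below are illustrative, not from the repository):

```python
# Conceptual sketch of the medallion layers, assuming hypothetical
# "transaction" records with a user and an amount.

# Bronze: raw records as ingested, possibly dirty or incomplete.
bronze = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": None},    # bad record: missing amount
    {"user": "a", "amount": "4.5"},
]

# Silver: clean and enrich — drop invalid rows, cast string amounts to floats.
silver = [
    {"user": r["user"], "amount": float(r["amount"])}
    for r in bronze
    if r["amount"] is not None
]

# Gold: aggregate for business intelligence — total amount per user.
gold: dict[str, float] = {}
for r in silver:
    gold[r["user"]] = gold.get(r["user"], 0.0) + r["amount"]

print(gold)  # user "b" is absent because its only record was dropped in Silver
```

In the actual pipeline each step would be a Spark job or DBT model writing to its own Data Lake zone, but the shape of the flow — ingest raw, clean and type, then aggregate — is the same.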

Workflow Commands

dbt run         # Run transformation models
dbt test        # Execute data tests
dbt snapshot    # Manage slowly changing dimensions
dbt docs generate # Generate project documentation
dbt docs serve   # Serve documentation locally
