This simple ETL skeleton ingests updated data from both the same and different Excel files. PySpark serves as the bridge that reads data from the Excel files and ingests it into a PostgreSQL database, and the pipeline is automated with Apache Airflow as the scheduler.
- Ubuntu Server (WSL)
- Java 8
- Python3
- PostgreSQL (or any other open-source database)
- PostgreSQL JDBC driver 42.7.3
- Libraries listed in requirements.txt
- Create a Virtual Environment: Set up a virtual environment and install the necessary packages from requirements.txt.
- Install Airflow: Install Airflow from PyPI by following the official installation guide. In this project, Airflow logs are stored in the project's working directory for debugging purposes.
- Ensure Data Source Availability: Prepare your data source. In this project, dummy data was created in Excel with created_date and updated_date columns so that updated records can be captured later (see the data-generation sketch after this list).
- Create ETL Script: Write a Python script that reads, transforms, and ingests the data into PostgreSQL (a PySpark sketch follows this list).
- Add Debugging Column: Add a new ingestion_timestamp column to your table for debugging purposes (see the timestamp snippet below).
- Create DAG: Set up an Airflow DAG to automate the entire ETL process (an example DAG is sketched below).