The purpose of this project is to build a dynamic ETL data pipeline with built-in automation and monitoring. The pipeline is built from reusable tasks, which allows for easy backfills. It uses custom operators to stage the data, load the data warehouse, and, as a final step, run checks on the data to catch any discrepancies in the datasets.
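To give a flavour of the automation side, the DAG definition typically concentrates the retry and scheduling behaviour in its `default_args`. The snippet below is only a sketch with assumed values (DAG id, owner, start date and schedule); the project's actual configuration may differ.

```python
# Illustrative DAG configuration; dag_id, owner, start_date and schedule_interval
# are assumed values, not necessarily those used in this project.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "sparkify",                  # assumed owner name
    "start_date": datetime(2019, 1, 12),  # assumed start date
    "depends_on_past": False,             # runs are independent, which keeps backfills simple
    "retries": 3,                         # retry failed tasks automatically
    "retry_delay": timedelta(minutes=5),
    "email_on_retry": False,
}

dag = DAG(
    "sparkify_etl_dag",                   # hypothetical dag_id
    default_args=default_args,
    description="Load and transform data in Redshift with Airflow",
    schedule_interval="@hourly",
    catchup=False,                        # enable (or trigger runs manually) to backfill
)
```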
The project consists of a Redshift Postgres database in the cluster with staging tables that hold all the data retrieved from the S3 bucket and copied over to the tables. It also contains a fact table `songplays` and four dimension tables, namely `users`, `songs`, `artists` and `time`. These tables store data from the staging tables that has been transformed to provide the relevant data.
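As a rough illustration of the data model, the fact table definition might look like the sketch below; the column names and types are inferred from the project description, and the actual DDL in the repository may differ.

```python
# Hypothetical DDL for the songplays fact table; column names and types are
# inferred from the project description, not taken from the repository.
songplays_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id  VARCHAR   NOT NULL PRIMARY KEY,
        start_time   TIMESTAMP NOT NULL,
        user_id      INTEGER,
        level        VARCHAR,
        song_id      VARCHAR,
        artist_id    VARCHAR,
        session_id   INTEGER,
        location     VARCHAR,
        user_agent   VARCHAR
    );
"""
```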
The pipeline uses one main DAG containing several tasks that call four custom operators. These operators are (a minimal sketch of one of them follows the list):
- Stage Operator - Loads JSON formatted files from S3 to Amazon Redshift.
- Fact Operator - Uses the provided SQL helper class to run data transformations for the fact table.
- Dimension Operator - Uses the provided SQL helper class to run data transformations for the dimension tables.
- Data Quality Operator - Runs checks on the data itself.
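As an illustration, a data quality operator along these lines could count the rows in each target table and fail the task when a table comes back empty. This is only a sketch assuming Airflow 1.10-style imports and an illustrative connection id; the operators in the repository may be implemented differently.

```python
# Minimal sketch of a data quality operator; the connection id and the check
# itself (a simple row-count test) are illustrative, not the project's exact code.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fails the task if any of the given tables contains no rows."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} returned no rows".format(table))
            self.log.info("Data quality check on %s passed with %s records", table, records[0][0])
```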
The data that gets extracted needs to be transformed to fit the data model of the target destination tables. For instance, the source timestamp data is in unix format and needs to be converted to a proper timestamp, from which the year, month, day, hour and similar values can be extracted to fill the relevant columns of the target time and songplays tables. The script also needs to cater for duplicates, ensuring that they are not part of the final data loaded into the tables.
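As a rough sketch of that transformation (the staging table name `staging_events`, its `ts` column and the assumption that the timestamps are unix epoch milliseconds are all illustrative; the real queries live in the provided SQL helper class), the time dimension load could do the conversion and the deduplication in one statement:

```python
# Hypothetical time-dimension insert; staging_events / ts are assumed names and
# the source timestamps are assumed to be unix epoch milliseconds.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT
        start_time,
        EXTRACT(hour      FROM start_time),
        EXTRACT(day       FROM start_time),
        EXTRACT(week      FROM start_time),
        EXTRACT(month     FROM start_time),
        EXTRACT(year      FROM start_time),
        EXTRACT(dayofweek FROM start_time)
    FROM (
        SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second' AS start_time
        FROM staging_events
        WHERE ts IS NOT NULL
    ) AS events;
"""
```

The `SELECT DISTINCT` is what keeps duplicate timestamps out of the final table.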
The datasets used are retrieved from the S3 bucket and are in JSON format. There are two datasets, namely `log_data` and `song_data`. The `song_data` dataset is a subset of the Million Song Dataset, while `log_data` contains generated log files based on the songs in `song_data`.
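Under the hood, getting those JSON files into the staging tables comes down to a Redshift `COPY` from S3. A minimal sketch is shown below; the bucket path, credential placeholders and the `JSON 'auto'` option are assumptions, not the project's actual values.

```python
# Illustrative COPY statement for staging the log data; the S3 path and the
# credential placeholders are not the project's real values, and a JSONPaths
# file may be needed instead of JSON 'auto' depending on the column names.
stage_events_copy = """
    COPY staging_events
    FROM 's3://your-bucket/log_data'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""
```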
In order to have a copy of the project up and running locally, you will need to take note of the following:
- Python 2.7 or greater.
- AWS Account.
- Follow the instructions given here to set up Apache Airflow locally.
- After setting up Airflow, run the commands `airflow scheduler` and `airflow webserver`, which will spin up the web server on localhost at port 8080 (http://0.0.0.0:8080).
- Create a Redshift cluster.
- Retrieve your `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` and add them, as well as your Redshift database credentials, to `Connections` on the Admin tab of the Airflow UI.
- You can then go ahead and trigger the DAG, then sit back and watch as the magic happens.
- Python and Apache Airflow.
- Jonathan Kamau - Github Profile