
Sparkify Amazon Redshift ETL

This repository contains the Data Warehouse project submission for the Udacity Data Engineering Nanodegree.

Context

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity in the app, and a directory of JSON metadata on the songs in the app.

In this project, we load the data from S3 into staging tables on Redshift and execute SQL statements that build the analytics tables from those staging tables.
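Loading into the staging tables is done with Redshift's COPY command, which reads the JSON files from S3 in parallel. As a minimal sketch in the style of sql_queries.py (the exact statements live there; the IAM role ARN comes from dwh.cfg, and the bucket region is an assumption):

```python
# Sketch of the staging COPY for the event logs (not the authoritative
# version; see sql_queries.py). The IAM_ROLE placeholder is filled from
# dwh.cfg, and REGION 'us-west-2' is an assumption about the bucket.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '<IAM-role-ARN-from-dwh.cfg>'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
"""
```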

Project Structure

Main files used in the project:

| Filename | Description |
| --- | --- |
| assets | Assets folder containing the images used in this markdown presentation |
| create_tables.py | Python script for re-creating the database tables used in this exercise |
| dwh.cfg | Config file with the information needed to connect to AWS and use the created Redshift cluster |
| etl.py | Python script for the ETL process: loading the data from S3 into staging tables and transforming it into the analytics tables |
| README.md | This file, describing the repository and its contents |
| sql_queries.py | The SQL queries for dropping and creating the tables, as well as inserting into and selecting from them during the ETL process |

Support/Utility files used in the project:

| Filename | Description |
| --- | --- |
| demo | Folder containing demo code from the study material presented by Udacity |
| dwh-iac.cfg | Config file with the information needed to connect to AWS and set up the Redshift cluster |
| redshift-cluster-iac-create.py | Python script that sets up the Redshift cluster as infrastructure as code (see the sketch after this table) |
| redshift-cluster-iac-destroy.py | Python script that tears down the Redshift cluster infrastructure |
| redshift-cluster-iac-samples.ipynb | Notebook with example code for setting up the Redshift cluster as infrastructure as code |
| sql_queries-samples.ipynb | Notebook with sample code for exploring the imported staging and analytics data |
| zip_workspace.ipynb | Supporting notebook that compresses this workspace into a single zip file for easily downloading the repository from the workspace |
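As a rough sketch of what the two IaC scripts do, assuming they use boto3 and read their settings from dwh-iac.cfg (section and option names here are illustrative, not the real ones):

```python
import configparser

import boto3

# Read AWS credentials and cluster settings from the IaC config file.
# Section/option names are illustrative; see dwh-iac.cfg for the real ones.
config = configparser.ConfigParser()
config.read('dwh-iac.cfg')

redshift = boto3.client(
    'redshift',
    region_name='us-west-2',  # assumed region
    aws_access_key_id=config.get('AWS', 'KEY'),
    aws_secret_access_key=config.get('AWS', 'SECRET'),
)

# redshift-cluster-iac-create.py: provision the cluster (values are examples).
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='dwh',
    ClusterIdentifier='dwhCluster',
    MasterUsername=config.get('CLUSTER', 'DB_USER'),
    MasterUserPassword=config.get('CLUSTER', 'DB_PASSWORD'),
    IamRoles=[config.get('IAM_ROLE', 'ARN')],
)

# redshift-cluster-iac-destroy.py: tear the cluster down again.
redshift.delete_cluster(
    ClusterIdentifier='dwhCluster',
    SkipFinalClusterSnapshot=True,
)
```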

Schema Design

The data from S3 is loaded into two staging tables before being transformed and loaded into the analytics fact and dimension tables.

Staging Tables

  • staging_events: artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userAgent, userId
  • staging_songs: num_songs, artist_id, artist_latitude, artist_longitude, artist_location, artist_name, song_id, title, duration, year
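As a sketch of what the staging DDL in sql_queries.py looks like for the event logs (the column types are assumptions inferred from the log data):

```python
# Sketch of the staging_events DDL in the style of sql_queries.py.
# Column types are assumptions; the authoritative DDL is in sql_queries.py.
staging_events_table_create = """
    CREATE TABLE IF NOT EXISTS staging_events (
        artist        VARCHAR,
        auth          VARCHAR,
        firstName     VARCHAR,
        gender        VARCHAR,
        itemInSession INTEGER,
        lastName      VARCHAR,
        length        FLOAT,
        level         VARCHAR,
        location      VARCHAR,
        method        VARCHAR,
        page          VARCHAR,
        registration  BIGINT,
        sessionId     INTEGER,
        song          VARCHAR,
        status        INTEGER,
        ts            BIGINT,
        userAgent     VARCHAR,
        userId        INTEGER
    );
"""
```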

Analytics Tables

For the analytics tables, a star schema is used: a single fact table holding measures of user activity (songplays), and related dimension tables (songs, artists, users, and time), each containing the keys, values, and attributes of its dimension.

This separates the measurable, quantitative data (in the fact table) from the descriptive attributes related to it (in the dimension tables).

*(Figure: star schema diagram showing the songplays fact table linked to the songs, artists, users, and time dimension tables; see the assets folder.)*
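A sketch of the fact table DDL (the column names, types, and constraints here are assumptions; sql_queries.py is the authoritative source):

```python
# Sketch of the songplays fact table DDL. IDENTITY(0,1) is a Redshift
# auto-incrementing surrogate key; all columns below are assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT       NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```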

How to run

Setup Configurations

Set up the dwh.cfg file with the connection information and the IAM role ARN:

```ini
[CLUSTER]
HOST=''
DB_NAME=''
DB_USER=''
DB_PASSWORD=''
DB_PORT=5439

[IAM_ROLE]
ARN=''

[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'
```
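Both create_tables.py and etl.py read this file with Python's configparser and open a psycopg2 connection; a minimal sketch, assuming the [CLUSTER] options appear in the order shown above:

```python
import configparser

import psycopg2

# Read the connection settings from dwh.cfg.
config = configparser.ConfigParser()
config.read('dwh.cfg')

# Unpack HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT in file order.
conn = psycopg2.connect(
    "host={} dbname={} user={} password={} port={}".format(
        *config['CLUSTER'].values()
    )
)
cur = conn.cursor()
```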

Create tables

Using the terminal/console, run the create_tables script:

```sh
python create_tables.py
```

This connects to the Redshift cluster, drops all of the staging, fact, and dimension tables described in sql_queries.py, re-creates them from their CREATE statements, and closes the connection.
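A rough sketch of the script's structure, assuming sql_queries.py exposes its DROP and CREATE statements as two lists named drop_table_queries and create_table_queries (the list names are assumptions):

```python
import configparser

import psycopg2

# Assumed names for the query lists exported by sql_queries.py.
from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    # Drop the staging, fact, and dimension tables if they exist.
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    # Re-create every table from its CREATE statement.
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()
        )
    )
    cur = conn.cursor()
    drop_tables(cur, conn)
    create_tables(cur, conn)
    conn.close()


if __name__ == "__main__":
    main()
```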

Load data

Using the terminal/console, run the etl script:

```sh
python etl.py
```

This connects to the Redshift cluster, loads the data from S3 into the staging tables, and then transforms and inserts that data into the fact and dimension tables.
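The structure mirrors create_tables.py; a sketch, assuming the COPY and INSERT ... SELECT statements are exported from sql_queries.py as copy_table_queries and insert_table_queries (the list names are assumptions):

```python
# Assumed names for the query lists exported by sql_queries.py.
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    # COPY the JSON data from S3 into the two staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    # Transform the staged data into the fact and dimension tables
    # via INSERT ... SELECT statements.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()
```

The main() entry point is the same as in create_tables.py: read dwh.cfg, connect with psycopg2, run the two functions above in order, and close the connection.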