Data Modeling with PostgreSQL

Introduction

For this project I built a PostgreSQL database for the fictitious music streaming app Sparkify. It was part of my Udacity Nanodegree in Data Engineering.

The Task

To better understand its listeners' preferences and to provide them with the best content, Sparkify has been collecting data about listener activity and storing data about the songs in its app. To give the Sparkify analytics team an easy way to analyse and query this data, it needs to be converted from its current JSON format into a structured database.

The Datasets

The database is based on two datasets.

The song dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
The logfile dataset consists of log files in JSON format generated by an Event Simulator based on the songs in the song dataset. It simulates activity logs from a music streaming app based on specified configurations. The log files are partitioned by year and month.
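Since both datasets are spread over nested folders, the first step of any load is to find the files and parse them. The following is a minimal sketch of that step, not the code from etl.py: the function names `find_json_files` and `read_records` are hypothetical, and the reader assumes each file holds one JSON object per line.

```python
import json
import os


def find_json_files(root):
    """Return a sorted list of all .json file paths below root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def read_records(path):
    """Parse a file that holds one JSON object per line (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        # Blank lines are skipped so a trailing newline does not break parsing.
        return [json.loads(line) for line in handle if line.strip()]
```

Walking the tree with `os.walk` keeps the sketch independent of how deep the partition folders are nested.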

The Database Schema

The database (db) is modeled as a star schema. A star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions, which are descriptive attributes related to the fact data.

For the Sparkify database, 'songplays' is the fact table and 'songs', 'artists', 'users', and 'time' are the dimension tables. The ERD in PostgresDB.png shows how they relate.
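To make the fact/dimension split concrete, here is a small sketch of such a schema and one analytic query against it. Only the five table names come from this project; every column name and the sample rows are illustrative assumptions, and the sketch uses Python's built-in sqlite3 module so it runs without a PostgreSQL server.

```python
import sqlite3

# Column names are assumptions for illustration; only the table names
# are taken from the project description.
SCHEMA = """
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE songs   (song_id TEXT PRIMARY KEY, title TEXT,
                      artist_id TEXT REFERENCES artists(artist_id));
CREATE TABLE time    (start_time TEXT PRIMARY KEY, hour INTEGER, weekday INTEGER);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    start_time  TEXT    REFERENCES time(start_time),
    user_id     INTEGER REFERENCES users(user_id),
    song_id     TEXT    REFERENCES songs(song_id),
    artist_id   TEXT    REFERENCES artists(artist_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample rows: one per dimension, then one fact row pointing at them.
conn.execute("INSERT INTO artists VALUES ('AR1', 'Example Artist')")
conn.execute("INSERT INTO songs VALUES ('SO1', 'Example Song', 'AR1')")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'Lovelace')")
conn.execute("INSERT INTO time VALUES ('2018-11-01 21:01:46', 21, 3)")
conn.execute(
    "INSERT INTO songplays VALUES (1, '2018-11-01 21:01:46', 1, 'SO1', 'AR1')"
)

# A typical star-schema query: aggregate the fact table, label via a dimension.
plays_per_artist = conn.execute(
    """
    SELECT a.name, COUNT(*)
    FROM songplays AS sp
    JOIN artists AS a ON a.artist_id = sp.artist_id
    GROUP BY a.name
    """
).fetchall()
print(plays_per_artist)
```

The fact table stores only keys into the dimension tables, so analytic queries reduce to a join from 'songplays' to whichever dimension supplies the labels.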

The ETL Pipeline

Three files perform all tasks needed for the ETL process:

sql_queries.py contains all SQL queries needed to create, populate, and drop the database and its tables
create_tables.py uses these queries to first drop any existing version of the Sparkify db and then create the db and its tables from scratch
etl.py opens the JSON files and populates the db tables with the respective data
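The division of labour between the three files can be sketched as follows. This is a simplified illustration and not the repository's code: the query lists, the `songs` columns, and the function names `reset_tables` and `load_songs` are hypothetical, and sqlite3 stands in for PostgreSQL so the sketch runs anywhere.

```python
import sqlite3

# Hypothetical stand-ins for what a module like sql_queries.py might define.
drop_table_queries = ["DROP TABLE IF EXISTS songs"]
create_table_queries = [
    "CREATE TABLE IF NOT EXISTS songs (song_id TEXT PRIMARY KEY, title TEXT)"
]
# sqlite3 uses '?' placeholders; psycopg2 would use '%s' instead.
song_insert = "INSERT INTO songs (song_id, title) VALUES (?, ?)"


def reset_tables(conn):
    """Drop and recreate all tables (the role described for create_tables.py)."""
    for query in drop_table_queries + create_table_queries:
        conn.execute(query)
    conn.commit()


def load_songs(conn, records):
    """Insert parsed JSON records into the songs table (the role of etl.py)."""
    rows = [(record["song_id"], record["title"]) for record in records]
    conn.executemany(song_insert, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
reset_tables(conn)
load_songs(conn, [{"song_id": "SO1", "title": "Example Song"}])
```

Keeping the SQL strings in one place means the setup script and the ETL script share the same table definitions, and rerunning the reset step always starts from an empty database.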

How to run the project

Install PostgreSQL
Download all files, including the /data folder
Run python create_tables.py to create the db and its tables, as described above
Then run python etl.py to populate the db tables, as described above

Additional files in this repo

analyse_data.ipynb provides some basic analysis queries to better understand the data in the Sparkify database
test.ipynb provides simple test queries to check the correct initialization and population of the database tables
PostgresDB.png is an image of the ERD to give you a better overview of the database structure
/data is the folder for the data files used in this project
