For this project, I built a Postgres database for the fictitious music streaming app Sparkify. It was part of my Udacity Nanodegree in Data Engineering.
To better understand their listeners' preferences and to provide them with the best content, Sparkify has been collecting data about listener activity and stores data about the songs in their app. To give the Sparkify analytics team an easy way to analyse and query this data, it needs to be converted from its current JSON format into a structured database.
The database is based on two datasets.
- The song dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
- The logfile dataset consists of log files in JSON format generated by an Event Simulator based on the songs in the song dataset. It simulates activity logs from a music streaming app based on specified configurations. The log files are partitioned by year and month.
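Because both datasets are spread over nested folders, a loader first has to collect the files. The following is a minimal sketch of that step, not the project's actual code. It assumes that every non-empty line of a file is one complete JSON object; the folder names, file names, and the fields `title` and `page` are made up for the demo.

```python
import json
import tempfile
from pathlib import Path


def find_json_files(root):
    """Return every *.json file below root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.json"))


def read_records(path):
    """Parse a file in which every non-empty line is one JSON object (assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def build_demo_tree(root):
    """Create a tiny, made-up directory tree that mimics the partitioning."""
    # Hypothetical song file, nested by the first three letters of its track ID.
    song_file = Path(root) / "song_data" / "A" / "B" / "C" / "TRABC0000000000000.json"
    # Hypothetical log file, nested by year and month.
    log_file = Path(root) / "log_data" / "2018" / "11" / "events.json"
    song_file.parent.mkdir(parents=True)
    log_file.parent.mkdir(parents=True)
    song_file.write_text(json.dumps({"title": "Demo Song"}) + "\n", encoding="utf-8")
    log_lines = [json.dumps({"page": "NextSong"}), json.dumps({"page": "Home"})]
    log_file.write_text("\n".join(log_lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        for path in find_json_files(tmp):
            # Print each file with the number of records it holds.
            print(path.relative_to(tmp), len(read_records(path)))
```

Searching recursively keeps the loader independent of how deep the partition folders are nested.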
The database (db) is modeled after the star schema. A star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions, which are descriptive attributes related to the fact data.
For the Sparkify database, the 'songplays' table is the fact table and the 'songs', 'artists', 'users', and 'time' tables are the dimension tables. The ERD in PostgresDB.png shows the same structure.
To perform all necessary tasks for the ETL processes, there are three files:
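As a rough illustration of the fact/dimension split, the sketch below builds the five tables in an in-memory SQLite database and runs one analytical join. SQLite is used here only as a stand-in so that the example runs without a PostgreSQL server. The table names come from this README, but all column names and sample rows are assumptions made for the example.

```python
import sqlite3

# Table names are from the README; every column is a hypothetical placeholder.
SCHEMA = """
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, level TEXT);
CREATE TABLE songs   (song_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE time    (start_time INTEGER PRIMARY KEY, hour INTEGER);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    start_time  INTEGER REFERENCES time (start_time),
    user_id     INTEGER REFERENCES users (user_id),
    song_id     TEXT REFERENCES songs (song_id),
    artist_id   TEXT REFERENCES artists (artist_id)
);
"""


def build_demo_db():
    """Create the schema in memory and fill it with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "free"), (2, "paid")])
    conn.execute("INSERT INTO songs VALUES (?, ?)", ("S1", "Demo Song"))
    conn.execute("INSERT INTO artists VALUES (?, ?)", ("A1", "Demo Artist"))
    conn.executemany("INSERT INTO time VALUES (?, ?)", [(100, 9), (200, 10), (300, 11)])
    conn.executemany(
        "INSERT INTO songplays VALUES (?, ?, ?, ?, ?)",
        [(1, 100, 1, "S1", "A1"), (2, 200, 2, "S1", "A1"), (3, 300, 2, "S1", "A1")],
    )
    return conn


def plays_per_level(conn):
    """Join the fact table to one dimension and aggregate the plays."""
    query = """
        SELECT u.level, COUNT(*) AS plays
        FROM songplays AS sp
        JOIN users AS u ON u.user_id = sp.user_id
        GROUP BY u.level
        ORDER BY u.level
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(plays_per_level(build_demo_db()))  # [('free', 1), ('paid', 2)]
```

The fact table only stores keys and events, so each question about users, songs, artists, or time needs just one join to the matching dimension.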
- sql_queries.py contains all necessary SQL queries to create, populate, and delete the database and its tables
- create_tables.py accesses the SQL queries to first drop an existing version of the Sparkify db and then newly creates the db and its tables
- etl.py opens the JSON files and populates the db tables with the respective data
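The division of labour between the three files can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: SQLite stands in for PostgreSQL so the example runs anywhere, the sketch resets only tables rather than the whole database, and the constant names, function names, and the two `songs` columns are hypothetical.

```python
import sqlite3

# Hypothetical stand-ins for the statements kept in sql_queries.py.
DROP_TABLE_QUERIES = ["DROP TABLE IF EXISTS songs"]
CREATE_TABLE_QUERIES = [
    "CREATE TABLE IF NOT EXISTS songs (song_id TEXT PRIMARY KEY, title TEXT)"
]
# SQLite syntax; PostgreSQL would use "ON CONFLICT (song_id) DO NOTHING",
# and psycopg2 expects %s placeholders instead of ?.
SONG_INSERT = "INSERT OR IGNORE INTO songs (song_id, title) VALUES (?, ?)"


def reset_tables(conn):
    """Drop old tables first, then create them again (the create_tables.py step)."""
    for query in DROP_TABLE_QUERIES + CREATE_TABLE_QUERIES:
        conn.execute(query)
    conn.commit()


def load_songs(conn, records):
    """Insert already parsed JSON records into songs (the etl.py step)."""
    rows = [(record["song_id"], record["title"]) for record in records]
    conn.executemany(SONG_INSERT, rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    reset_tables(conn)
    # Made-up records; the duplicate S1 is skipped by the insert statement.
    demo = [
        {"song_id": "S1", "title": "First"},
        {"song_id": "S1", "title": "First"},
        {"song_id": "S2", "title": "Second"},
    ]
    load_songs(conn, demo)
    print(conn.execute("SELECT COUNT(*) FROM songs").fetchone()[0])  # 2
```

Keeping the SQL strings in one module means the reset script and the ETL script share a single definition of each table.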
- install a PostgreSQL database
- download all files, including the /data folder
- first, run python create_tables.py to create the db and tables, as described above
- second, run python etl.py to fill the db tables, as described above
- analyse_data.ipynb provides some basic analysis queries to better understand the data in the Sparkify database
- test.ipynb provides simple test queries to check the correct initialization and population of the database tables
- PostgresDB.png is an image of the ERD to give you a better overview of the database structure
- /data is the folder for the data files used in this project
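A population check of the kind test.ipynb is described as performing might look like the sketch below. It is an assumption about what such a check could do, not a copy of the notebook: it counts the rows of each table and reports the empty ones. SQLite again stands in for PostgreSQL, and the single `id` column in the demo tables is a placeholder.

```python
import sqlite3

# Table names as listed in this README.
TABLES = ("songplays", "users", "songs", "artists", "time")


def row_counts(conn, tables=TABLES):
    """Return {table: number of rows}; names come from a fixed tuple, not user input."""
    counts = {}
    for table in tables:
        counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return counts


def empty_tables(conn, tables=TABLES):
    """List the tables that exist but hold no rows yet."""
    return [table for table, count in row_counts(conn, tables).items() if count == 0]


def build_demo_db():
    """Create placeholder tables and populate only one of them."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY)')
    conn.execute('INSERT INTO "users" (id) VALUES (1)')
    return conn


if __name__ == "__main__":
    # Every table except users is reported as empty.
    print(empty_tables(build_demo_db()))
```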