Sparkify is a fictional company whose data (a subset of real data from the Million Song Dataset) is available as JSON files. This code performs ETL on that dataset and loads the data into a Postgres database, which can then be used for analytics.
- Install Postgres and update the following line in `etl.py` with your service URL, credentials, and database name (see the connection sketch after this list):

  ```
  host=127.0.0.1 dbname=sparkifydb user=student password=student
  ```
- Install Python 3
- Create a virtual environment and activate it (optional, but recommended; example commands after this list)
- Run pip to install the dependencies:

  ```
  pip install -r requirements.txt
  ```
- To initialize the database, run the following (sketched after this list):

  ```
  python create_tables.py
  ```
- To start the ETL process, run:

  ```
  python etl.py
  ```
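The connection line in `etl.py` is in libpq keyword/value format, the form accepted by `psycopg2`. A minimal sketch of opening such a connection, assuming `psycopg2` is the driver in use (values are the defaults above; adjust host, database name, and credentials to your setup):

```python
import psycopg2

# libpq keyword/value connection string; swap in your own host, DB name, and credentials.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
```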
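To create and activate a virtual environment, one option is Python's built-in `venv` module; for example, on macOS/Linux (the environment name `venv` is just an example):

```
python3 -m venv venv
source venv/bin/activate
```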
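And a minimal sketch of what `create_tables.py` typically does in this kind of project, assuming `sql_queries.py` exports `create_table_queries` and `drop_table_queries` as lists of SQL strings (names assumed, not confirmed by the code):

```python
import psycopg2

from sql_queries import create_table_queries, drop_table_queries

def main():
    # Hypothetical flow: connect, drop any existing tables, then recreate them fresh.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
    conn.close()

if __name__ == "__main__":
    main()
```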
Fact and dimension tables were defined in a star schema for analytics purposes: `songplays` is the fact table, and `users`, `songs`, `artists`, and `time` are dimension tables.

- `songplays`: records in log data associated with song plays, i.e. records with page `NextSong` (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
- `users`: users in the app (user_id, first_name, last_name, gender, level)
- `songs`: songs in the music database (song_id, title, artist_id, year, duration)
- `artists`: artists in the music database (artist_id, name, location, latitude, longitude)
- `time`: timestamps of records in `songplays` broken down into specific units (start_time, hour, day, week, month, year, weekday)
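With this star schema, typical analytics questions reduce to simple joins from the fact table to a dimension. A hedged example (table and column names as listed above, again assuming `psycopg2`) that counts the five most-played songs:

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
# Join the songplays fact table to the songs dimension and rank titles by play count.
cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)
conn.close()
```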
The main files for running the project are:
- `test.ipynb`: displays the first few rows of each table so you can check the database.
- `create_tables.py`: drops and creates the tables. Run this file to reset your tables before each run of the ETL script (`etl.py`).
- `etl.ipynb`: reads and processes a single file from `song_data` and `log_data` and loads the data into the tables. The notebook contains detailed instructions on the ETL process for each table.
- `etl.py`: reads and processes files from `song_data` and `log_data` and loads them into the tables (a sketch of the driver pattern follows below).
- `sql_queries.py`: contains all SQL queries, and is imported into the last three files above.
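As a rough sketch of the driver pattern `etl.py` typically follows (the helper name `process_data`, the callback, and the directory layout are assumptions, not guaranteed to match the actual code): it walks a data directory, applies a per-file processing function, and commits after each file.

```python
import glob
import os

def process_data(cur, conn, filepath, func):
    """Walk `filepath`, apply `func` to every JSON file found, committing after each one."""
    # Collect the absolute paths of all JSON files under the directory tree.
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(os.path.abspath(f) for f in glob.glob(os.path.join(root, "*.json")))
    print(f"{len(all_files)} files found in {filepath}")
    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)  # e.g. a hypothetical process_song_file(cur, datafile)
        conn.commit()
        print(f"{i}/{len(all_files)} files processed.")
```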