Sparkify is a fictional company whose data (a subset of real data from the Million Song Dataset) is available as JSON files. This code performs ETL on that dataset and loads the data into a Postgres database, which can then be used for analytics.
- Install Postgres and update the following line in `etl.py` with your service URL, credentials, and database name (see the connection sketch after this list):

  ```
  host=127.0.0.1 dbname=sparkifydb user=student password=student
  ```
- Install Python 3
- Create a virtual environment and activate it (optional, but recommended; example commands after this list)
- Run pip to install the dependencies:

  ```
  pip install -r requirements.txt
  ```
- To initialize the database, run the following (sketched after this list):

  ```
  python create_tables.py
  ```
- To start the ETL process, run:

  ```
  python etl.py
  ```
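The connection line in `etl.py` is in libpq keyword/value format, the form accepted by `psycopg2`. A minimal sketch of opening such a connection, assuming `psycopg2` is the driver in use (values are the defaults above; adjust host, database name, and credentials to your setup):

```python
import psycopg2

# libpq keyword/value connection string; swap in your own host, DB name, and credentials.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
```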
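To create and activate a virtual environment, one option is Python's built-in `venv` module; for example, on macOS/Linux (the environment name `venv` is just an example):

```
python3 -m venv venv
source venv/bin/activate
```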
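And a minimal sketch of what `create_tables.py` typically does in this kind of project, assuming `sql_queries.py` exports `create_table_queries` and `drop_table_queries` as lists of SQL strings (names assumed, not confirmed by the code):

```python
import psycopg2

from sql_queries import create_table_queries, drop_table_queries

def main():
    # Hypothetical flow: connect, drop any existing tables, then recreate them fresh.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
    conn.close()

if __name__ == "__main__":
    main()
```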
Fact and dimension tables were defined in a star schema for analytics purposes: `songplays` is the fact table, and `users`, `songs`, `artists`, and `time` are dimension tables.

- `songplays`: records in log data associated with song plays, i.e. records with page `NextSong` (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
- `users`: users in the app (user_id, first_name, last_name, gender, level)
- `songs`: songs in the music database (song_id, title, artist_id, year, duration)
- `artists`: artists in the music database (artist_id, name, location, latitude, longitude)
- `time`: timestamps of records in `songplays` broken down into specific units (start_time, hour, day, week, month, year, weekday)
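With this star schema, typical analytics questions reduce to simple joins from the fact table to a dimension. A hedged example (table and column names as listed above, again assuming `psycopg2`) that counts the five most-played songs:

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
# Join the songplays fact table to the songs dimension and rank titles by play count.
cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)
conn.close()
```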
The main files for running the project are:
- `test.ipynb`: displays the first few rows of each table so you can check the database.
- `create_tables.py`: drops and creates the tables. Run this file to reset your tables before each run of the ETL script (`etl.py`).
- `etl.ipynb`: reads and processes a single file from `song_data` and `log_data` and loads the data into the tables. The notebook contains detailed instructions on the ETL process for each table.
- `etl.py`: reads and processes files from `song_data` and `log_data` and loads them into the tables (a sketch of the driver pattern follows below).
- `sql_queries.py`: contains all SQL queries, and is imported into the last three files above.
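As a rough sketch of the driver pattern `etl.py` typically follows (the helper name `process_data`, the callback, and the directory layout are assumptions, not guaranteed to match the actual code): it walks a data directory, applies a per-file processing function, and commits after each file.

```python
import glob
import os

def process_data(cur, conn, filepath, func):
    """Walk `filepath`, apply `func` to every JSON file found, committing after each one."""
    # Collect the absolute paths of all JSON files under the directory tree.
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(os.path.abspath(f) for f in glob.glob(os.path.join(root, "*.json")))
    print(f"{len(all_files)} files found in {filepath}")
    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)  # e.g. a hypothetical process_song_file(cur, datafile)
        conn.commit()
        print(f"{i}/{len(all_files)} files processed.")
```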