Description

Built an Apache Cassandra database to analyze data on songs and user activity, and find what songs are users listening to.

Included

ELT.py: Main,run to execute the 3 queries.

ETLfunctions.py

CQLqueries.py

The data (included in) resides in a event_data directory of CSV files on user activity on the app.

How to install

Lastest version of Java

Apache Cassandra

Download the latest version of cassandra: http://cassandra.apache.org/download/

install Cassandra

Linux installation: https://www.vultr.com/docs/how-to-install-apache-cassandra-3-11-x-on-ubuntu-16-04-lts
Windows installation: https://www.guru99.com/download-install-cassandra.html

Cassandra-driver

Installing from PyPI: pip install cassandra-driver https://datastax.github.io/python-driver/installation.html

How to run

Go to the Apache Cassandra directory and start a Cassandra server in the cmd line using cassandra -f
Once server is running, run script ETL.py.

ETL.py description

Build, fill and analyze the database by creating an ELT pipeline using a python driver.

Steps:

Extract the data from the logs contained in the folder for which we specify the path
Load the data in a new csv file: event_datafile_new.csv
Create a denormalized dataset to be able to query our data
Load the data in the tables modeled to execute the queries
Run the queries

event_datafile_new.csv contains the following columns:

artist
firstName of user
gender of user
item number in session
last name of user
length of the song
level (paid or free song)
location of the user
sessionId
song title
userId

Queries and tables

song_in_session

Query#1: Get artist, song title and song 's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
This song_in_session table was modeled to to execute Query #1, queries based on sessionid and iteminsession. Recall that Cassandra does not support entries with similar keys. Since a user could listen to several songs within a session, I chose to use sessionid and iteminsession as primary keys.

song_in_session table

> sessionId*  
> itemInSession*  
> artist  
> song  
> length

Sample query(query1): Get artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4.

song_playlist_session

Query #2: Give me only the following: name of artist, song(sorted by itemInSession) and user(first and last name) for userid = 10, sessionid = 182
This song_playlist_session table was modeled to allow queries based on userid and sessionid, allowing us to understand what users typically listen to.Recall that Cassandra does not support entries with similar keys. Since a user could start several sessions, I chose to use userid and sessionid as primary keys, I added iteminsession as a partionning key to be able to retrieve the songs in order they were listened to within the session (ordered by iteminsession).

song_playlist_session table

> userid*  
> sessionid* 
> artist  
> song  
> firstname
> lastname

Sample query(query2): Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182.

songs_listened

Query #3: Give me every user name(first and last) in my music app history who listened to the song 'All Hands Against His Own' This songs_listened table was modeled to allow queries based on song allowing us to see which users listened to a specific song. Recall that Cassandra does not support entries with similar keys.Since several users could listen to the same song, and that users could have the same name, I chose to use userid and song as primary keys.

songs_listened table

> song*  
> userid*    
> firstname  
> lastname

Sample query(query3): Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'.

Author

Philippe Habra

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Description

Included

How to install

Lastest version of Java

Java download

Java SE Development Kit 8 Downloads

Pandas

Apache Cassandra

install Cassandra

Cassandra-driver

How to run

ETL.py description

Queries and tables

song_in_session

song_in_session table

song_playlist_session

song_playlist_session table

songs_listened

songs_listened table

Author

License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.idea		.idea
__pycache__		__pycache__
event_data		event_data
CQLqueries.py		CQLqueries.py
ETL.py		ETL.py
ETLfunctions.py		ETLfunctions.py
Readme.md		Readme.md
event_datafile_new.csv		event_datafile_new.csv
functions.py		functions.py

itsphily/ApacheCassandra_python

Folders and files

Latest commit

History

Repository files navigation

Description

Included

How to install

Lastest version of Java

Java download

Java SE Development Kit 8 Downloads

Pandas

Apache Cassandra

install Cassandra

Cassandra-driver

How to run

ETL.py description

Queries and tables

song_in_session

song_in_session table

song_playlist_session

song_playlist_session table

songs_listened

songs_listened table

Author

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages