Analyzing LastFM Songs using PySpark

This is an introductory project on PySpark and PySpark SQL. The goal is to become familiar with building a Spark session, loading data, transforming it, and finally querying it.

There are two datasets: listening.csv and genre.csv. listening.csv is a collection of songs that LastFM users have listened to. genre.csv contains the genre of every major artist who appears in listening.csv.

Since listening.csv is a large file (1 GB), a Google Colab environment was used to run the Python code.

The datasets are open source and can be found at: https://drive.google.com/drive/folders/1ZhV5LxuRBYiD81HJHQTheQawXZ5MkPim