abdkumar/spotify-stream-analytics

spotify-stream-analytics

Generates a synthetic Spotify music-streaming dataset and builds dashboards on it. Fake listening events, built on track and artist data from the Spotify API, are emitted to Kafka. Spark consumes and processes the Kafka data and saves it to the data lake. Airflow orchestrates the pipeline. dbt moves the data to Snowflake and transforms it for the dashboards.
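
The first hop of this pipeline, emitting events to Kafka, can be sketched without a running broker. This is a minimal sketch, not the repository's code: it assumes JSON-encoded events keyed by user ID, and the topic name `listen_events`, the field names, and the `emit` helper are all hypothetical. In a real deployment the `send` callable would wrap a Kafka client's produce call.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical topic name; the repository's real topic may differ.
TOPIC = "listen_events"

# A sender takes (topic, key bytes, value bytes). Wrapping a real Kafka
# producer to match this shape is an assumption of this sketch.
Sender = Callable[[str, bytes, bytes], None]


def emit(events: Iterable[Dict], send: Sender, topic: str = TOPIC) -> int:
    """Serialize each event as JSON and pass it to `send`, keyed by user ID."""
    count = 0
    for event in events:
        key = str(event["user_id"]).encode("utf-8")
        value = json.dumps(event, sort_keys=True).encode("utf-8")
        send(topic, key, value)
        count += 1
    return count


# In-memory stand-in for a broker, useful for local testing.
sent: List[Tuple[str, bytes, bytes]] = []
n = emit(
    [{"user_id": 1, "track_id": "track-001",
      "listened_at": "2024-01-01T00:00:00+00:00"}],
    lambda topic, key, value: sent.append((topic, key, value)),
)
print(n, sent[0][0], sent[0][1])
```

Keying by user ID means that, with Kafka's default partitioning, one user's events land in the same partition and keep their relative order.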

Dataset Simulation

  • Songs: Artist and track data pulled from the Spotify API, extracted from a set of playlists. Each track includes title, artist, album, ID, release date, etc.
  • Users: User demographic data with randomized first and last names, gender, and location details.
  • Interactions: Simulated near-real-time listening events linking users to the songs they "listened" to.
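
The three datasets above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the repository's generator: the name and city pools, the field names, the helper functions, and the placeholder `TRACKS` list (standing in for tracks fetched from the Spotify API) are all hypothetical.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical value pools; the real generator may use other sources.
FIRST_NAMES = ["Ava", "Liam", "Noah", "Mia"]
LAST_NAMES = ["Smith", "Patel", "Garcia", "Kim"]
GENDERS = ["F", "M"]
CITIES = ["Austin", "Toronto", "Berlin"]

# Placeholder tracks standing in for data pulled from the Spotify API.
TRACKS = [{"track_id": f"track-{i:03d}"} for i in range(1, 6)]


def make_users(n, rng):
    """Create n users with randomized names, gender, and location."""
    return [
        {
            "user_id": i,
            "first_name": rng.choice(FIRST_NAMES),
            "last_name": rng.choice(LAST_NAMES),
            "gender": rng.choice(GENDERS),
            "city": rng.choice(CITIES),
        }
        for i in range(1, n + 1)
    ]


def make_listen_events(users, tracks, n, rng, start):
    """Create n listen events with strictly increasing timestamps."""
    events = []
    ts = start
    for _ in range(n):
        ts += timedelta(seconds=rng.randint(1, 30))
        events.append(
            {
                # Seeded UUID so runs are reproducible.
                "event_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
                "user_id": rng.choice(users)["user_id"],
                "track_id": rng.choice(tracks)["track_id"],
                "listened_at": ts.isoformat(),
            }
        )
    return events


rng = random.Random(42)
users = make_users(3, rng)
events = make_listen_events(
    users, TRACKS, 5, rng, datetime(2024, 1, 1, tzinfo=timezone.utc)
)
for event in events:
    print(event)
```

Passing a seeded `random.Random` instance keeps the output reproducible, which helps when comparing downstream transformations across runs.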

Feel free to explore and analyze the datasets included in this repository to uncover patterns, trends, and valuable insights in the realm of music and user interactions. If you have any questions or need further information about the dataset, please refer to the documentation provided or reach out to the project contributors.

Tools & Technologies

Architecture

Final Result

Project Flow

  • Set up a free Azure account and Azure Key Vault - Setup
  • Set up Terraform and create resources - Setup
  • SSH into the VM (kafka-vm)
    • Set up the Kafka server - Setup
    • Set up a Spotify API account and generate Spotify stream events data - Setup
    • Set up the Spark streaming job - Setup
  • Set up the Snowflake warehouse - Setup
  • Set up the Databricks workspace and CDC (Change Data Capture) job - Setup
  • SSH into another VM (airflow-vm)

How can I make this better?!

A lot can still be done :).

  • Use managed infrastructure
    • Confluent Cloud for Kafka
  • Write data quality tests
  • Include CI/CD
  • Add more visualizations
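
As a starting point for the "Write data quality tests" item, here is a minimal sketch of three checks in plain Python. They mirror the intent of dbt's built-in generic tests `not_null`, `unique`, and `relationships`, but they are illustrative stand-ins rather than dbt code, and the column names and sample rows are hypothetical.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


def not_null(rows: List[Row], column: str) -> List[Row]:
    """Return rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique(rows: List[Row], column: str) -> List[object]:
    """Return each non-null value of `column` that appears more than once."""
    seen: Set[object] = set()
    dupes: List[object] = []
    for r in rows:
        value = r.get(column)
        if value is None:
            continue  # nulls are the job of not_null, not unique
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes


def relationships(rows: List[Row], column: str, parent_keys: Set[object]) -> List[Row]:
    """Return rows whose non-null `column` value has no match in parent_keys."""
    return [
        r for r in rows
        if r.get(column) is not None and r.get(column) not in parent_keys
    ]


# Hypothetical sample rows with one failure per check.
events = [
    {"event_id": "e1", "user_id": 1, "track_id": "track-001"},
    {"event_id": "e1", "user_id": 2, "track_id": "track-002"},     # duplicate id
    {"event_id": "e3", "user_id": None, "track_id": "track-001"},  # null user
    {"event_id": "e4", "user_id": 1, "track_id": "track-999"},     # unknown track
]
known_tracks = {"track-001", "track-002"}

print(not_null(events, "user_id"))
print(unique(events, "event_id"))
print(relationships(events, "track_id", known_tracks))
```

Each check returns the failing rows or values instead of raising, so a pipeline step can decide whether to warn or fail.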
