Spotify Dataflow Solution

This project combines a data pipeline and a machine learning system that predicts the genre of Spotify songs. Its components work together to ingest, process, and analyze Spotify data, and to train and serve a genre prediction model.

Project Structure

The project is divided into three main components:

- Data Pipeline (spotify_dataflow)
- Model Training (spotify_genre_training)
- Model Serving (spotify_genre_serving)

1. Data Pipeline (spotify_dataflow)

This component is responsible for ingesting data from the Spotify API, storing it in a data lake, and transforming it for analysis and model training.

Key features:

- Uses Apache Airflow for orchestration
- Stores data in MinIO (S3-compatible object storage)
- Transforms data using dbt (data build tool)
- Uses Trino for distributed SQL queries

This data pipeline is based on the outstanding content created by Victor Outtes (https://nw.ax/s5A). Be sure to check out his work!

2. Model Training (spotify_genre_training)

This component trains a machine learning model to predict the genre of songs based on features extracted from the Spotify data.

Key features:

- Uses scikit-learn for model training
- Implements a dummy classifier as a baseline model
- Uses MLflow for experiment tracking and model versioning
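To make the idea of a dummy baseline concrete, here is a minimal pure-Python sketch of a most-frequent-class predictor. It is similar in spirit to scikit-learn's DummyClassifier with strategy="most_frequent", but which strategy this project actually uses is an assumption. The class name, the toy features, and the genre labels are all hypothetical.

```python
from collections import Counter


class MostFrequentBaseline:
    """Predicts the most common training label for every input."""

    def fit(self, features, labels):
        # Features are ignored on purpose: a dummy baseline only looks at labels.
        if not labels:
            raise ValueError("labels must not be empty")
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        # One prediction per input row, always the majority label.
        return [self.majority_ for _ in features]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy data; the real features come from the transformed Spotify data.
train_X = [[0.1], [0.4], [0.3], [0.9]]
train_y = ["rock", "pop", "rock", "jazz"]
test_X = [[0.2], [0.8]]
test_y = ["rock", "pop"]

baseline = MostFrequentBaseline().fit(train_X, train_y)
predictions = baseline.predict(test_X)
baseline_accuracy = accuracy(test_y, predictions)
```

Any real model should beat the accuracy of a baseline like this; if it does not, the features are probably not carrying useful signal.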

See the README in the spotify_genre_training directory for details.

3. Model Serving (spotify_genre_serving)

This component serves the trained model as an API for making predictions.

Key features:

- Uses FastAPI to create a RESTful API
- Loads the latest model version from MLflow

See the README in the spotify_genre_serving directory for details.

Setup and Installation

1. Clone the repository.
2. Install Docker and Docker Compose.
3. Set the following environment variables in a .env file:
   - SPOTIFY_CLIENT_ID
   - SPOTIFY_SECRET
4. Build and start the services:

   # 1. The Airflow container runs the model training image, so build it before starting Docker Compose
   docker build -t spotify-model-training:latest spotify_song_genre_predictor/spotify_genre_training/

   # 2. Start Docker Compose
   docker compose up --build -d --remove-orphans

5. Access the various components:
   - Airflow: http://localhost:8080
   - MinIO: http://localhost:9001
   - MLflow: http://localhost:5000
   - Model Serving API: http://localhost:8000
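Before starting the services, it can help to confirm that the two Spotify variables listed above are actually set. The following is a rough sketch of such a check; the function names are hypothetical, and the parser handles only simple KEY=VALUE lines, not the full .env syntax that Docker Compose accepts.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Variable names taken from the setup steps.
REQUIRED = ("SPOTIFY_CLIENT_ID", "SPOTIFY_SECRET")


def missing_variables(text, required=REQUIRED):
    """Return the required names that are absent or empty in the .env text."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]


# Hypothetical .env content with an empty secret.
example = "SPOTIFY_CLIENT_ID=abc123\n# secret not set yet\nSPOTIFY_SECRET=\n"
missing = missing_variables(example)
```

Against a real file, the same check would be `missing_variables(open(".env").read())`; an empty list means both variables are present.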

Data Pipeline

The data pipeline is orchestrated using Apache Airflow. The main DAG performs the following steps:

1. Ingests data from the Spotify API
2. Stores raw data in MinIO
3. Transforms data using dbt
4. Triggers the model training process

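The ordering of the steps above can be illustrated with a small dependency graph. This is a plain-Python sketch using the standard library's graphlib, not the project's Airflow DAG; the step names and the `run_step` placeholder are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key maps a step to the set of steps it depends on.
dependencies = {
    "ingest_spotify_api": set(),
    "store_raw_in_minio": {"ingest_spotify_api"},
    "transform_with_dbt": {"store_raw_in_minio"},
    "trigger_model_training": {"transform_with_dbt"},
}


def run_step(name):
    # Placeholder: in the real pipeline each step is an Airflow task.
    return f"ran {name}"


# static_order() yields every step only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
results = [run_step(step) for step in execution_order]
```

An orchestrator such as Airflow resolves task dependencies in the same spirit, and adds scheduling, retries, and monitoring on top.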
Usage

- Access the Airflow UI to trigger and monitor the data pipeline
- Use the MLflow UI to view experiment results and model versions
- Make predictions using the FastAPI endpoint:

POST http://localhost:8000/predict
{
    "song_name": "Example Song",
    "album_name": "Example Album",
    "artist_name": "Example Artist"
}
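The request above could be sent from Python with only the standard library, roughly as follows. The helper names are hypothetical, and the response is assumed to be JSON; its exact fields are not documented here.

```python
import json
import urllib.request


def build_predict_request(song_name, album_name, artist_name,
                          base_url="http://localhost:8000"):
    """Build a POST request for the /predict endpoint without sending it."""
    payload = {
        "song_name": song_name,
        "album_name": album_name,
        "artist_name": artist_name,
    }
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(song_name, album_name, artist_name, timeout=10):
    """Send the request and decode the body, assuming a JSON response."""
    request = build_predict_request(song_name, album_name, artist_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_predict_request("Example Song", "Example Album", "Example Artist")
```

With the serving container running, `predict("Example Song", "Example Album", "Example Artist")` would return whatever the API responds with.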

License

This project is licensed under the MIT License.

wellescastro/spotify-dataflow

Folders and files

Latest commit

History

Repository files navigation

Spotify Dataflow Solution

Project Structure

1. Data Pipeline (spotify_dataflow)

2. Model Training (spotify_genre_training)

3. Model Serving (spotify_genre_serving)

Setup and Installation

Data Pipeline

Usage

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages