This project has been developed as part of the MLOps Zoomcamp course provided by DataTalks.Club.
The dataset was downloaded from Kaggle, and a preliminary data analysis was performed (see the notebooks folder) to gain insights for further development of the project.
Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉
The project has been structured with the following folders and files:
- `.github`: contains the CI/CD files (GitHub Actions)
- `data`: dataset and test sample for testing the model
- `integration_tests`: prediction integration test with docker-compose
- `lambda`: test of the lambda handler with and w/o docker
- `model`: full pipeline from preprocessing to prediction and monitoring using MLflow, Prefect, Grafana, Adminer, and docker-compose
- `notebooks`: EDA and Modeling performed at the beginning of the project to establish a baseline
- `tests`: unit tests
- `terraform`: IaC stream-based pipeline infrastructure in AWS using Terraform
- `Makefile`: set of execution tasks
- `pyproject.toml`: linting and formatting
- `setup.py`: project installation module
- `requirements.txt`: project requirements
The dataset was obtained from Kaggle and contains various columns with car details and prices. To prepare the data for modeling, an Exploratory Data Analysis was conducted to decide how to preprocess the numerical and categorical features, and suitable scalers and encoders were chosen for the preprocessing pipeline. Subsequently, a GridSearch was performed to select the best regression models. RandomForestRegressor and GradientBoostingRegressor were the top performers, each achieving an R² value of approximately 0.9.
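For readers unfamiliar with the metric, the sketch below shows how an R² score is computed in plain Python. It is only an illustration: the function name `r2_score` here is a local helper (the project itself most likely relies on scikit-learn's implementation), and the prices are made-up numbers, not values from the project's dataset.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean price
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up car prices for illustration only
actual = [10000.0, 15000.0, 20000.0, 25000.0]
predicted = [11000.0, 14000.0, 21000.0, 24000.0]
score = r2_score(actual, predicted)
print(f"R2 = {score:.3f}")
```

A score near 1 means the model explains most of the variance in prices, while a score of 0 means it does no better than always predicting the mean.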
Afterward, the models were tested, registered, and deployed using MLflow, Prefect, and Flask. Model monitoring was set up with Grafana and Adminer. The project infrastructure was then provisioned with Terraform, using AWS services such as Kinesis Streams (Producer & Consumer), Lambda (Serving API), S3 Bucket (Model artifacts), and ECR (Image Registry).
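To make the streaming setup more concrete, here is a minimal sketch of how a Lambda function might consume records from a Kinesis stream. It is not the handler from the `lambda` folder: the payload fields (`car_id`, `features`), the `predict` stub, and the `make_event` helper are all hypothetical. The only part taken from AWS itself is the event layout, where each record carries its payload base64-encoded under `record["kinesis"]["data"]`.

```python
import base64
import json
from typing import Any, Dict, List


def predict(features: Dict[str, Any]) -> float:
    """Hypothetical stand-in for the registered model; always returns 0.0."""
    return 0.0


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, List[Dict[str, Any]]]:
    predictions = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside each record
        raw = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        car = json.loads(raw)
        predictions.append(
            {"car_id": car.get("car_id"), "price": predict(car["features"])}
        )
    return {"predictions": predictions}


def make_event(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build a Kinesis-shaped test event around a JSON payload (test helper)."""
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")
    return {"Records": [{"kinesis": {"data": data}}]}


# Local smoke test with a made-up car record
sample_event = make_event({"car_id": 1, "features": {"horsepower": 111}})
result = lambda_handler(sample_event, None)
print(result)
```

A real consumer would load the model from S3 and would typically publish its predictions to an output stream. Both steps are left out here to keep the sketch runnable without AWS access.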
Finally, to streamline the development process, a fully automated CI/CD pipeline was created using GitHub Actions.
The Python version used for this project is Python 3.9.
- Clone the repo (or download it as a zip file): `git clone https://github.com/benitomartin/mlops-car-prices.git`
- Create the virtual environment named `main-env` using Conda with Python version 3.9: `conda create -n main-env python=3.9`, then `conda activate main-env`
- Install `setuptools` and `wheel`: `conda install setuptools wheel`
- Execute the `setup.py` script and install the project dependencies included in the requirements.txt: `pip install .` or `make install`
Each project folder contains a README.md file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that an AWS Account, credentials, and proper policies with full access to EC2, S3, ECR, Lambda, and Kinesis are necessary for the projects to function correctly. Make sure to configure the appropriate credentials to interact with AWS services.
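As a quick pre-flight check, a small script like the one below can report whether the standard AWS environment variables are set before running any of the folders. This is a sketch under one assumption, namely that credentials are supplied through environment variables. If they live in the shared credentials file (`~/.aws/credentials`) instead, these variables may be legitimately absent. The helper name `missing_aws_vars` is made up for this example.

```python
import os
from typing import List, Mapping

# Standard environment variable names read by the AWS CLI and SDKs
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars(os.environ)
if missing:
    print("Missing AWS settings: " + ", ".join(missing))
else:
    print("AWS environment variables are set.")
```

The check only confirms that the variables exist. It does not verify that the credentials are valid or that the attached policies grant the access listed above.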
The following best practices were implemented:
- ✅ Problem description: The project is well described and it's clear and understandable
- ✅ Cloud: The project is developed on the cloud and IaC tools are used for provisioning the infrastructure
- ✅ Experiment tracking and model registry: Both experiment tracking and model registry are used
- ✅ Workflow orchestration: Fully deployed workflow
- ✅ Model deployment: The model deployment code is containerized and can be deployed to the cloud
- ✅ Model monitoring: Basic model monitoring that calculates and reports metrics
- ✅ Reproducibility: Instructions are clear, it's easy to run the code, and it works. The versions for all the dependencies are specified.
- ✅ Best practices:
  - There are unit tests
  - There is an integration test
  - Linter and code formatting are used
  - There is a Makefile
  - There is a CI/CD pipeline