This is a personal MLOps project based on a Kaggle dataset for medical insurance costs prediction. It contains several AWS SageMaker pipelines from preprocessing till deployment, inference and monitoring.
Feel free to ⭐ and clone this repo 😉
The project has been structured with the following folders and files:

- `.github/workflows`: contains the CI/CD files (GitHub Actions)
- `aws_pipelines`: AWS pipelines from preprocessing till deployment and monitoring
  - `preprocessing_pipeline.py`: data preprocessing - ✅
  - `training_pipeline.py`: model training - ✅
  - `tuning_pipeline.py`: model fine-tuning - ✅
  - `evaluate_pipeline.py`: model evaluation - ✅
  - `register_pipeline.py`: model registry - ✅
  - `cond_register_pipeline.py`: conditional model registry (based on an MAE threshold) - ✅
  - `deployment_pipeline.py`: automatic model deployment - ✅
  - `manual_deployment_pipeline.py`: manual model deployment (requires manual approval on AWS) - ✅
  - `inference_pipeline.py`: automatic model deployment and endpoint creation - ✅
  - `data_quality_pipeline.py`: model registry with a data quality baseline - ✅
  - `model_quality_pipeline.py`: model registry with data and model quality baselines - ✅
  - `monitoring_pipeline.py`: creation of data and model monitoring schedules - ✅
- `data`: raw and clean data
- `notebooks`: Exploratory Data Analysis
- `src`: code scripts for processing, training, evaluation, serving (Flask), Lambda, inference and endpoint testing
- `.env_sample`: sample environment variables
- `.flake8`: flake8 requirements
- `.gitattributes`: gitattributes
- `Makefile`: install requirements, formatting, testing, linting, coverage report and clean up
- `pyproject.toml`: linting and formatting configuration
- `requirements.txt`: project requirements
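The conditional registration pipeline gates the model on an MAE threshold before it reaches the Model Registry. As a minimal stand-alone sketch of that decision logic (the function name and the threshold value are illustrative, not taken from the repo):

```python
def should_register(mae: float, threshold: float) -> bool:
    """Register the model only when its mean absolute error
    is at or below the chosen threshold."""
    return mae <= threshold


# Illustrative values: a model with MAE 2500 on the charges target
# passes a hypothetical threshold of 3000, one with MAE 5000 does not.
print(should_register(2500.0, 3000.0))  # True
print(should_register(5000.0, 3000.0))  # False
```

In the actual pipeline this comparison is expressed as a SageMaker condition step, so the registration branch only runs when the evaluation report satisfies the threshold.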
The dataset was obtained from Kaggle and contains 1338 rows and 7 columns for predicting health insurance costs. To prepare the data for modelling, an Exploratory Data Analysis was conducted. For modeling, the categorical features were encoded, TensorFlow was used as the model, and a mean absolute error (MAE) threshold was selected for model registry.
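For illustration, the categorical encoding can be sketched with plain-Python one-hot indicators. The helper below is a simplified stand-in for the actual preprocessing script (column names follow the Kaggle insurance dataset; the sample values are just shaped like a dataset row):

```python
def one_hot(row: dict, column: str, categories: list) -> dict:
    """Replace a categorical column with 0/1 indicator columns."""
    encoded = {k: v for k, v in row.items() if k != column}
    for cat in categories:
        encoded[f"{column}_{cat}"] = 1 if row[column] == cat else 0
    return encoded


# A sample row shaped like the insurance dataset
row = {"age": 19, "sex": "female", "bmi": 27.9, "children": 0,
       "smoker": "yes", "region": "southwest", "charges": 16884.92}

for col, cats in [("sex", ["female", "male"]),
                  ("smoker", ["yes", "no"]),
                  ("region", ["northeast", "northwest",
                              "southeast", "southwest"])]:
    row = one_hot(row, col, cats)

print(row["sex_female"], row["smoker_yes"], row["region_southwest"])  # 1 1 1
```

The numeric columns (`age`, `bmi`, `children`) pass through unchanged, leaving a fully numeric feature vector for the TensorFlow model.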
The Python version used for this project is Python 3.10.
- Clone the repo (or download it as a zip file):

  ```bash
  git clone https://github.com/benitomartin/mlops-aws-insurance.git
  ```

- Create the virtual environment named `main-env` using Conda with Python version 3.10:

  ```bash
  conda create -n main-env python=3.10
  conda activate main-env
  ```

- Execute the `Makefile` script or install the project dependencies included in `requirements.txt`:

  ```bash
  pip install -r requirements.txt

  # or

  make install
  ```
Additionally, please note that an AWS account, credentials, and proper policies with full access to SageMaker, S3, and Lambda are necessary for the project to function correctly. Make sure to configure the appropriate credentials to interact with AWS services.
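The repository ships a `.env_sample` file for this purpose. As a rough, hypothetical illustration of the kind of variables such a file holds (the names below are placeholders, not the repo's actual keys; copy `.env_sample` to `.env` and fill in your own values):

```shell
# Hypothetical environment variables -- check .env_sample for the real keys
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
AWS_REGION=<your-region>            # e.g. eu-central-1
S3_BUCKET=<your-artifact-bucket>    # pipeline input/output artifacts
SAGEMAKER_ROLE=<execution-role-arn> # role with SageMaker/S3/Lambda access
```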
All pipelines were deployed on AWS SageMaker, as well as the Model Registry and Endpoints. At the end of each pipeline there is a line that must be uncommented to run it on AWS:

```python
# Start the pipeline execution (if required)
evaluation_pipeline.start()
```
Additionally, the experiments were tracked on Comet ML.