
MLOPS MEDICAL INSURANCE COSTS PREDICTION ⚱️


This is a personal MLOps project based on a Kaggle dataset for medical insurance cost prediction. It contains several AWS SageMaker pipelines covering preprocessing through deployment, inference, and monitoring.

Feel free to ⭐ and clone this repo 😉

Tech Stack

Visual Studio Code, Jupyter Notebook, Python, Pandas, NumPy, Matplotlib, scikit-learn, Flask, Anaconda, Linux, AWS, Git

Project Structure

The project has been structured with the following folders and files:

  • .github/workflows: contains the CI/CD files (GitHub Actions)
  • aws_pipelines: AWS pipelines from preprocessing to deployment and monitoring (a sketch of the conditional registration logic follows this list)
    • preprocessing_pipeline.py: data preprocessing
    • training_pipeline.py: model training
    • tuning_pipeline.py: model fine-tuning
    • evaluate_pipeline.py: model evaluation
    • register_pipeline.py: model registry
    • cond_register_pipeline.py: model conditional registry (based on an MAE threshold)
    • deployment_pipeline.py: model automatic deployment
    • manual_deployment_pipeline.py: model manual deployment (requires manual approval on AWS)
    • inference_pipeline.py: model automatic deployment and endpoint creation
    • data_quality_pipeline.py: model registry with data quality baseline
    • model_quality_pipeline.py: model registry with data and model quality baseline
    • monitoring_pipeline.py: data and model monitor schedules creation
  • data: raw and clean data
  • Notebooks: Exploratory Data Analysis
  • src: code scripts for processing, training, evaluation, serving (Flask), Lambda, inference, and endpoint testing
  • .env_sample: sample environment variables
  • .flake8: Flake8 configuration
  • .gitattributes: Git attributes
  • Makefile: install requirements, formatting, testing, linting, coverage report, and clean up
  • pyproject.toml: linting and formatting configuration
  • requirements.txt: project requirements
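
As an illustration, this is roughly how the conditional registration in cond_register_pipeline.py can be wired together with the SageMaker Python SDK. The step names, the evaluation.json property file layout, and the default threshold are placeholders rather than the repository's exact values, and evaluate_step/register_step stand in for steps defined earlier in the pipeline file:

# Sketch of a conditional model registration step (names and values are placeholders)
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.parameters import ParameterFloat
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile

# Property file the evaluation step writes its metrics to (layout assumed)
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

# MAE threshold the model must beat to be registered (default value is illustrative)
mae_threshold = ParameterFloat(name="MAEThreshold", default_value=3000.0)

# Read the MAE produced by the evaluation step
mae = JsonGet(
    step_name=evaluate_step.name,          # evaluate_step: the evaluation ProcessingStep
    property_file=evaluation_report,
    json_path="regression_metrics.mae.value",
)

# Register the model only when the MAE is at or below the threshold
condition_step = ConditionStep(
    name="CheckMAE",
    conditions=[ConditionLessThanOrEqualTo(left=mae, right=mae_threshold)],
    if_steps=[register_step],              # register_step: the RegisterModel step
    else_steps=[],
)

pipeline = Pipeline(
    name="cond-register-pipeline",
    parameters=[mae_threshold],
    steps=[evaluate_step, condition_step],
)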

Project Description

The dataset was obtained from Kaggle and contains 1338 rows and 7 columns for predicting health insurance costs. An Exploratory Data Analysis was conducted to prepare the data for modeling. For modeling, the categorical features were encoded, TensorFlow was used as the model, and a mean absolute error (MAE) threshold was selected as the criterion for model registration.
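
As a rough illustration of that modeling setup, the sketch below one-hot encodes the categorical columns and fits a small TensorFlow regression network on the Kaggle data. The file name data/insurance.csv, the network architecture, and the hyperparameters are assumptions, not the repository's exact code:

# Sketch of the encoding + TensorFlow regression setup (file name and
# hyperparameters are assumptions)
import pandas as pd
import tensorflow as tf
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("data/insurance.csv")
X, y = df.drop(columns=["charges"]), df["charges"]

# Scale the numeric features, one-hot encode the categorical ones
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"]),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Small fully connected regression network trained on MAE
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae", metrics=["mae"])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)

print("Test MAE:", model.evaluate(X_test, y_test, verbose=0)[1])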

Project Set Up

The Python version used for this project is Python 3.10.

  1. Clone the repo (or download it as a zip file):

    git clone https://github.com/benitomartin/mlops-aws-insurance.git
  2. Create the virtual environment named main-env using Conda with Python version 3.10:

    conda create -n main-env python=3.10
    conda activate main-env
  3. Install the project dependencies included in requirements.txt, either with pip or via the Makefile:

    pip install -r requirements.txt
    
    or
    
    make install

Additionally, please note that an AWS account, credentials, and proper policies with full access to SageMaker, S3, and Lambda are necessary for the project to function correctly. Make sure to configure the appropriate credentials to interact with AWS services.
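
As a quick sanity check that your credentials are picked up, you can ask STS who you are (a sketch; it assumes the AWS CLI or environment variables are already configured):

import boto3

# Print the account and caller identity resolved from the configured credentials
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])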

Pipeline Deployment

All pipelines were deployed on AWS SageMaker, together with the Model Registry and Endpoints. At the end of each pipeline there is a line that must be uncommented to run it on AWS:

# Start the pipeline execution (if required)
evaluation_pipeline.start()
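
If the pipeline definition has not been created or updated on SageMaker yet, it needs to be upserted first. A sketch of the full flow, where the role ARN is a placeholder for your own SageMaker execution role:

# Register (or update) the pipeline definition, then start a run and wait for it
evaluation_pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/SageMakerRole")  # placeholder ARN
execution = evaluation_pipeline.start()
execution.wait()
print(execution.describe()["PipelineExecutionStatus"])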

Additionally, the experiments were tracked on Comet ML.
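
A minimal sketch of how metrics can be logged to Comet ML during training; the project name is a placeholder, and the API key is assumed to be available via the COMET_API_KEY environment variable:

from comet_ml import Experiment

# Create an experiment (reads COMET_API_KEY from the environment)
experiment = Experiment(project_name="mlops-aws-insurance")  # placeholder project name
experiment.log_parameter("epochs", 50)   # example hyperparameter
experiment.log_metric("mae", 2500.0)     # example metric value
experiment.end()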