Scalable ML Pipeline for Production - US Census Data using FastAPI


Update

Sadly, the Heroku free tier has been discontinued; I will try to move this over to another cloud platform.

Predicting Salary with US Census Data - Deploying ML model on Heroku with FastAPI

(CI status badge: example workflow)

Summary of API

This GitHub repository contains an online API for a simple classification model on the Census Income Data Set to predict salary. The API is live and deployed on Heroku at https://uscensus-fastapi.herokuapp.com/. The app is a fast, type-checked, auto-documented API built with FastAPI. I've also created a simple front-end for the API using Anvil; you can interact with the app through the front-end at https://census-salary-predictor.anvil.app/
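For illustration, here is a minimal sketch of what an endpoint like this can look like in FastAPI; the model, field, and route names mirror the request example further below, but the code itself is an assumption rather than the repo's exact implementation:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CensusRecord(BaseModel):
    # Pydantic type-checks every incoming request against this schema,
    # and FastAPI auto-generates the interactive /docs page from it.
    age: int
    workclass: str
    fnlgt: int
    education: str
    education_num: int
    marital_status: str
    occupation: str
    relationship: str
    race: str
    sex: str
    capital_gain: int
    capital_loss: int
    hours_per_week: int
    native_country: str

@app.post("/prediction")
def predict(record: CensusRecord):
    # A trained classifier would be loaded at startup and applied here;
    # the fixed return value is a placeholder.
    return {"prediction": "<=50K"}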

The machine learning model is a very simple random forest classifier and can easily be replaced with better models. However, the point of this project was to:

  • implement production frameworks such as Continuous Integration and Continuous Deployment (CI/CD)
  • ensure pipelines pass unit tests before deployment
  • test both the local and the live API
  • use remote data storage with AWS S3 and implement DVC (Data Version Control) alongside git.

This is a project completed as part of the Udacity Machine Learning DevOps Engineer Nanodegree.

POST requests to live API on Heroku

POST requests are used to send data to the API. You can request a salary prediction with, for example:

curl -X 'POST' \
  'https://uscensus-fastapi.herokuapp.com/prediction' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "age": 20,
  "workclass": "Self-emp-not-inc",
  "fnlgt": 205100,
  "education": "HS-grad",
  "education_num": 9,
  "marital_status": "Married-civ-spouse",
  "occupation": "Exec-managerial",
  "relationship": "Wife",
  "race": "White",
  "sex": "Female",
  "capital_gain": 0,
  "capital_loss": 0,
  "hours_per_week": 40,
  "native_country": "United-States"
}'
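The same request can be made from Python with the requests library (the payload mirrors the curl example above):

import requests

payload = {
    "age": 20,
    "workclass": "Self-emp-not-inc",
    "fnlgt": 205100,
    "education": "HS-grad",
    "education_num": 9,
    "marital_status": "Married-civ-spouse",
    "occupation": "Exec-managerial",
    "relationship": "Wife",
    "race": "White",
    "sex": "Female",
    "capital_gain": 0,
    "capital_loss": 0,
    "hours_per_week": 40,
    "native_country": "United-States",
}

response = requests.post(
    "https://uscensus-fastapi.herokuapp.com/prediction",
    json=payload,  # sent as the JSON body; Content-Type is set automatically
)
print(response.status_code, response.json())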

Coverage reporting

Coverage is now assessed automatically with pytest-cov on every pushed commit. The report can be seen on the GitHub Actions page, in the most recent build workflow under the pytest heading.
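The kind of unit test that pytest runs (and that pytest-cov measures) can be sketched with FastAPI's TestClient; the module name main and the existence of a GET route at / are assumptions here:

from fastapi.testclient import TestClient

from main import app  # assumption: the FastAPI app object lives in main.py

client = TestClient(app)

def test_root_is_up():
    # Exercises the app in-process, without starting a server.
    response = client.get("/")
    assert response.status_code == 200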

Use of remote storage

Models and data are stored in an AWS S3 bucket and pulled by DVC on Heroku when the API starts.

CI/CD

Continuous Integration and Continuous Deployment (CI/CD) practices were used. Every pushed commit triggers a GitHub workflow, and unit tests are run with pytest before the master branch is automatically deployed to Heroku.

The badge above tracks whether CI is passing. More details can be found on the Actions page.

Other files (for project rubric)

If you want to do the same: how to run on Heroku

We need to give Heroku the ability to pull in data from DVC when the app starts. We install a buildpack that allows apt packages to be installed, then define an Aptfile that contains a path to a DVC release. That is, in the CLI run:

heroku buildpacks:add --index 1 heroku-community/apt

Then, in your project root, create a file called Aptfile that specifies the DVC release you want installed, e.g. https://github.com/iterative/dvc/releases/download/2.0.18/dvc_2.0.18_amd64.deb

Add the following code block to your main.py:

import os

# On Heroku the DYNO environment variable is set, so this block only
# runs on the dyno, not during local development.
if "DYNO" in os.environ and os.path.isdir(".dvc"):
    os.system("dvc config core.no_scm true")  # the deployed slug is not a git repo
    if os.system("dvc pull") != 0:
        exit("dvc pull failed")
    # remove DVC internals once the data has been pulled
    os.system("rm -r .dvc .apt/usr/lib/dvc")
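This runs once at startup, so the dyno pulls the model and data files tracked by DVC from S3 before the API begins serving requests.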
