Next: Coming soon
- Introduction to MLOps
- Environment setup
- (Optional) Training a model from scratch and reading parquet files
- Course overview
- Maturity model
MLOps is a set of best practices for bringing Machine Learning to production.
Machine Learning projects can be simplified to just 3 steps:
- Design - is ML the right tool for solving our problem?
- We want to predict the duration of a taxi trip. Do we need to use ML or can we use a simpler rule-based model?
- Train - if we do need ML, then we train and evaluate the best model.
- Operate - model deployment, management and monitoring.
MLOps is helpful in all 3 stages.
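One way to approach the design question is to measure how far a simple rule gets us before reaching for ML. The sketch below is only an illustration: the trips are made-up values, and the single average speed in `ASSUMED_SPEED_MPH` is an assumption rather than something taken from the course data.

```python
import math

# Hypothetical trips as (distance in miles, actual duration in minutes).
# These values are made up for illustration only.
TRIPS = [(1.0, 6.0), (2.5, 14.0), (5.0, 22.0), (8.0, 30.0)]

# Assumption: one average speed is good enough for a first rule.
ASSUMED_SPEED_MPH = 12.0


def rule_based_duration(distance_miles, speed_mph=ASSUMED_SPEED_MPH):
    """Estimate trip duration in minutes from distance alone."""
    return distance_miles / speed_mph * 60.0


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


predictions = [rule_based_duration(distance) for distance, _ in TRIPS]
baseline_rmse = rmse(predictions, [duration for _, duration in TRIPS])
print(f"Rule-based RMSE: {baseline_rmse:.2f} minutes")
```

If a trained model cannot beat this baseline error by a meaningful margin, the rule may well be enough for the problem.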
The video linked above shows how to set up a Linux VM instance in Amazon Web Services.
If you'd rather work with Google Cloud Platform, you may check out the instructions in this gist. Please note that the gist was meant for the Data Engineering Zoomcamp and assumes that the reader has some familiarity with GCP and Linux shell commands. You may check out my Data Engineering Zoomcamp notes for a refresher on GCP.
Alternatively, you may also use any other cloud vendor or set up a local environment. The requirements for this course are:
- Docker and Docker Compose
- If you're using a local environment with a GUI, then Docker Desktop is the recommended download for both components.
- Anaconda
- We will use Python 3.9 for this course.
- We will also need Jupyter Notebook.
- For the optional videos and the homework you will also need pandas, scikit-learn, fastparquet, matplotlib and seaborn.
- You may check out my Python environments cheatsheet for a refresher on how to use Anaconda to install Python.
- (Optional) Visual Studio Code and the Remote - SSH extension
- These requirements are not necessary but they make it much easier to connect to remote instances and redirect ports. These notes will assume that you're using both.
Note: Any additional requirements will be listed as needed during the course.
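Once the environment is set up, a quick sanity check can confirm that the Python version and packages listed above are available. This is a minimal sketch using only the standard library; the helper names `missing_packages` and `python_version_ok` are hypothetical, and the list uses import names (scikit-learn is imported as `sklearn`, Jupyter Notebook as `notebook`).

```python
import importlib.util
import sys

# Import names of the packages listed in the requirements above.
REQUIRED_PACKAGES = [
    "notebook",
    "pandas",
    "sklearn",
    "fastparquet",
    "matplotlib",
    "seaborn",
]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(expected=(3, 9)):
    """Check that the running interpreter matches the expected major.minor."""
    return sys.version_info[:2] == expected


print("Python 3.9:", python_version_ok())
print("Missing packages:", missing_packages(REQUIRED_PACKAGES))
```

Running it inside the activated Anaconda environment should report no missing packages; anything listed can be installed before starting the videos.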
This course builds on the Machine Learning Zoomcamp and the Data Engineering Zoomcamp, so for brevity these notes will not cover content that has already been covered there. You may check out my notes for reference:
The following videos show how to train a model for the purpose of using it later in the course. They use New York City's TLC Trip Record Data for training. You may skip them if you're already familiar with the previous zoomcamps, but the videos link nicely into the next section and illustrate the purpose of the course's contents.
You may also access the resulting files from the following links:
When data scientists experiment with Jupyter Notebooks for creating models, they often don't follow best practices, and the notebooks end up unstructured due to the nature of experimentation. For example, cells are re-run with slightly different values and previous results may be lost, or the cell execution order may be inconsistent.
Module 2 covers experiment tracking: by using tools such as MLflow we will create experiment trackers (which keep a history of our runs, such as the cells that we've rerun multiple times) and model registries (for storing the models we've created during the experiments), instead of relying on our memory or janky setups such as external spreadsheets or convoluted naming schemes for our files.
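To make the idea of an experiment tracker concrete, here is a toy stand-in written with only the standard library. It is not MLflow's API: the helpers `log_run`, `load_runs` and `best_run`, the `runs.jsonl` file name, and the parameter and metric values are all hypothetical. It only shows the core idea of recording the parameters and metrics of every run so that they can be compared later.

```python
import json
import tempfile
import time
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one run (parameters and metrics) to a JSON Lines file."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(log_path):
    """Read back every run that has been logged so far."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(log_path, metric):
    """Return the run with the lowest value for the given metric."""
    return min(load_runs(log_path), key=lambda run: run["metrics"][metric])


# Usage with made-up values; a temporary directory keeps the demo tidy.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "runs.jsonl"
    log_run(log_file, {"alpha": 0.1}, {"rmse": 7.2})
    log_run(log_file, {"alpha": 0.01}, {"rmse": 6.8})
    print(best_run(log_file, "rmse")["params"])
```

A real tracker adds much more on top of this (artifacts, a UI, a model registry), which is why the course relies on a dedicated tool instead of a hand-rolled log.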
Module 3 covers orchestration and ML pipelines: by using tools such as Prefect and Kubeflow we can break down our notebooks into separate identifiable steps and connect them in order to create an ML pipeline which we can parametrize with the data and models we want and easily execute.
flowchart LR
subgraph "ML pipeline"
direction LR
A[Load and prepare data]
B[Vectorize]
C[Train]
A --> B --> C
end
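The three steps in the flowchart can be sketched as plain Python functions chained together. This is a minimal illustration rather than Prefect or Kubeflow code: the trip records, the field names, the 1 to 60 minute filter and the lookup-table "model" are all assumptions chosen to keep the example dependency-free.

```python
from collections import defaultdict


def load_and_prepare(raw_trips):
    """Keep only trips with a plausible duration (assumed 1 to 60 minutes)."""
    return [trip for trip in raw_trips if 1 <= trip["duration"] <= 60]


def vectorize(trips):
    """Turn each trip into a (feature key, target) pair."""
    return [
        (f'{trip["pickup_zone"]}_{trip["dropoff_zone"]}', trip["duration"])
        for trip in trips
    ]


def train(examples):
    """'Train' a lookup model: the mean duration for each feature key."""
    durations = defaultdict(list)
    for key, target in examples:
        durations[key].append(target)
    return {key: sum(values) / len(values) for key, values in durations.items()}


def run_pipeline(raw_trips, steps=(load_and_prepare, vectorize, train)):
    """Run the steps in order, feeding each output into the next step."""
    result = raw_trips
    for step in steps:
        result = step(result)
    return result


# Hypothetical records, made up for illustration only.
RAW_TRIPS = [
    {"pickup_zone": 1, "dropoff_zone": 2, "duration": 10.0},
    {"pickup_zone": 1, "dropoff_zone": 2, "duration": 14.0},
    {"pickup_zone": 3, "dropoff_zone": 4, "duration": 0.2},
    {"pickup_zone": 3, "dropoff_zone": 4, "duration": 20.0},
]

model = run_pipeline(RAW_TRIPS)
print(model)
```

Because the data and the list of steps are parameters of `run_pipeline`, the same pipeline can be re-run on a different month of data or with a different training step without editing the other steps.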
Module 4 covers serving the models: we will learn how to deploy models in different ways.
Module 5 covers model monitoring: we will see how to check whether our model is performing well and how to generate alerts to warn us of performance drops and failures, and even automate retraining and redeploying models without human input.
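A performance alert can be as simple as comparing a recent error metric against a reference value. The sketch below assumes that we know the RMSE measured at training time (`baseline_rmse`) and that a 20% degradation is worth an alert; both the `check_performance` helper and the `tolerance` default are hypothetical choices, not part of any monitoring tool.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def check_performance(predictions, targets, baseline_rmse, tolerance=0.2):
    """Flag an alert when the current RMSE exceeds the baseline by the tolerance."""
    current = rmse(predictions, targets)
    threshold = baseline_rmse * (1 + tolerance)
    return {"rmse": current, "threshold": threshold, "alert": current > threshold}


# Made-up predictions and actual durations (in minutes) for a recent batch.
report = check_performance(
    predictions=[10.0, 20.0, 15.0],
    targets=[12.0, 35.0, 14.0],
    baseline_rmse=5.0,
)
if report["alert"]:
    print(f"RMSE {report['rmse']:.2f} is above the threshold {report['threshold']:.2f}")
```

In an automated setup, the `alert` flag could be what triggers a notification or a retraining run.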
Module 6 covers best practices, such as how to properly maintain and package code, how to deploy successfully, etc.
Module 7 covers processes: we will see how to properly communicate between all the stakeholders of an ML project (scientists, engineers, etc.) and how to work together.
The different levels of MLOps maturity that we will discuss during the course are based on the levels listed in this Microsoft Azure article. These levels are:
- No MLOps
- No automation whatsoever, sloppy Jupyter Notebooks.
- Good enough for Proof Of Concept projects.
- DevOps but no MLOps
- Releases are automated, unit tests and integration tests exist, CI/CD, operational metrics.
- However, none of these are ML-aware, so there is no experiment tracking and no reproducibility, and the data scientists are still separated from the engineers.
- Good for POC and production for some projects.
- Automated training
- Training pipeline, experiment tracking, model registry. Low-friction deployment.
- Data scientists work with engineers in the same team.
- Automated deployment
- Easy to deploy model, very low friction.
- Data prep → Train model → Deploy model
- A/B tests (not covered in this course).
- Model monitoring.
- The Microsoft article actually places this feature in maturity level 4 but for our purposes it makes more sense to have it here.
- Full MLOps automation
- Automated training and deployment. All of the above combined.
Be aware that not every project or even every part of a project needs to have the highest maturity level possible because it could exceed the project's resource budget. Pragmatism is key.