This tutorial assumes the user is familiar with Kedro. We will refer to the kedro>=0.18.0, <0.19.0 project template files.
If you are working with an older Kedro version, see:
- the branch main-kedro-0.16 for kedro>=0.16.0, <0.17.0
- the branch main-kedro-0.17 for kedro>=0.17.0, <0.18.0
This tutorial shows how to use the kedro-mlflow plugin as an mlops framework.
Specifically, it will focus on how one can use the pipeline_ml_factory
to maintain consistency between training and inference and prepare deployment. It will show best practices on code organization to ensure easy transition to deployment and strong reproducibility.
We will not emphasize the advanced versioning capabilities kedro-mlflow provides (including automatic parameter tracking), but have a look at the documentation to see what it is capable of!
This is NOT a Kaggle competition. I will not try to create the best model (nor even a good model) to solve this problem. This should not be considered a demonstration of data science best practices on how to train a model. Each pipeline node has a specific educational purpose: to illustrate one use case kedro-mlflow can handle.
- Clone the repo:
git clone https://github.com/Galileo-Galilei/kedro-mlflow-tutorial
cd kedro-mlflow-tutorial
- Install dependencies:
conda create -n kedro_mlflow_tutorial python=3.9
conda activate kedro_mlflow_tutorial
pip install -e src
Note: You don't need to call the kedro mlflow init command as you would in a freshly created repo, since the mlflow.yml is already pre-configured.
We will use the IMDB movie review dataset as an example. This dataset contains 50k movie reviews with an associated "positive" or "negative" label manually assigned by a human.
We will train a binary classifier to predict the sentiment associated with a movie review.
You can find many notebooks on Kaggle to learn more about this dataset.
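To give a concrete picture of the data flowing through the pipelines, here is a minimal sketch of what the instances and labels datasets boil down to. The "text" column matches the payload used later in the serving example; the "label" column name is illustrative, since the real datasets are produced by the etl pipelines:

```python
import pandas as pd

# Illustrative only: the real datasets are produced by the etl_instances and etl_labels pipelines.
instances = pd.DataFrame({"text": ["This movie is cool", "awful film"]})  # raw reviews sent by users
labels = pd.DataFrame({"label": ["positive", "negative"]})                # human annotations, training only
```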
The project is divided into 3 applications (i.e. subfolders of src/kedro_mlflow_tutorial/pipelines). The reasons for such a division are detailed in kedro-mlflow's documentation:
- etl: This app contains 2 pipelines:
    - etl_instances: this pipeline creates the "instances" dataset, which is the input of the ml app. The instances dataset represents the raw data any user of your ml pipeline should send you. In practice this is a business object (here, raw text), not something that requires heavy preprocessing from your user.
    - etl_labels: this pipeline creates the "labels" dataset, which is another input of the ml app since our task is supervised. It must be a different pipeline from the one which creates the instances, because when you deploy your model the labels will not be available!
- ml: This app contains 2 pipelines:
    - training: this pipeline trains and persists your machine learning model, as well as any necessary artifacts (here, a tokenizer) which will be reused by inference.
    - inference: this pipeline is the one you will deploy to your end user. It takes instances and returns predictions.
    - it is possible to add many other pipelines, e.g. monitoring (takes instances and returns stats on predictions), evaluation (takes instances and labels and returns updated model metrics), explanation (takes instances and returns model explanations like shap values, activation maps...). We do not add such pipelines to keep the example simple.
- user_app: This application is composed of as many pipelines as you have use cases for your model. In this example, we will stick to a single pipeline:
    - user_app: this pipeline takes an mlflow model (the entire inference pipeline), or directly the predictions, and performs all the business logic (a minimal sketch is given below).
For the sake of simplicity and for educational purposes, we will keep the etl and user_app pipelines very simple and focus on the ml pipelines. In real life, etl and user_app may be very complex.
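To make the user_app idea more concrete, here is a minimal sketch of a business-logic node consuming the predictions. The function name, the column name and the rule are purely illustrative, not the actual tutorial code:

```python
import pandas as pd


def flag_negative_reviews(predictions: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical business rule: keep only the reviews predicted as "negative"
    # so that a (fictional) support team can follow up on them.
    return predictions[predictions["prediction"] == "negative"]
```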
To create the instances and labels datasets, run the etl pipelines:
kedro run --pipeline=etl_instances
kedro run --pipeline=etl_labels
Since they are persisted in the catalog.yml file, you will be able to reuse them afterwards.
Note: You can change the huggingface_split parameter in globals.yml and rerun the pipelines to create test data.
The key part is to convert your training pipeline from a kedro Pipeline object to a kedro-mlflow PipelineML object.
This can be done in the pipeline_registry.py file thanks to the pipeline_ml_factory helper function.
The register_pipelines hook of the pipeline_registry.py looks like this (the snippet below is slightly simplified for readability):
from typing import Dict

from kedro.pipeline import Pipeline
from kedro_mlflow.pipeline import pipeline_ml_factory

from kedro_mlflow_tutorial.pipelines.ml_app.pipeline import create_ml_pipeline

...  # other imports and the PROJECT_VERSION definition are omitted for readability

def register_pipelines() -> Dict[str, Pipeline]:
    ...
    ml_pipeline = create_ml_pipeline()
    inference_pipeline = ml_pipeline.only_nodes_with_tags("inference")
    training_pipeline_ml = pipeline_ml_factory(
        training=ml_pipeline.only_nodes_with_tags("training"),
        inference=inference_pipeline,
        input_name="instances",
        log_model_kwargs=dict(
            artifact_path="kedro_mlflow_tutorial",
            conda_env={
                "python": "3.9.12",
                "build_dependencies": ["pip"],
                "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"],
            },
            signature="auto",
        ),
    )
    ...
    return {
        "training": training_pipeline_ml,
    }
Let's break it down:
- we already have training and inference pipelines written in pure kedro, which are filtered out of a bigger pipeline through their tags
- we "bind" these two pipelines with the pipeline_ml_factory function, which takes the following arguments:
    - training: the pipeline that will be executed when launching the kedro run --pipeline=training command.
    - inference: the pipeline which will be logged in mlflow as a Mlflow Model at the end of the training pipeline.
    - input_name: the name in the catalog.yml of the dataset which contains the data (either to train on for the training pipeline or to predict on for the inference pipeline). This must be the same name for both pipelines.
    - log_model_kwargs: any argument you want to pass to mlflow.pyfunc.log_model(), e.g.:
        - artifact_path (optional): the name of the folder containing the model in mlflow.
        - conda_env (optional): the conda environment with the packages you need for inference. You can pass a python dictionary, or a path to your requirements.txt or conda.yml file.
        - signature (optional): the mlflow signature of your input_name dataset (instances in our example). This is an object that contains the column names and types. It is used to perform a consistency check when predicting with the model on new data. If you set it to "auto", kedro-mlflow will automatically infer it from the training data. This is experimental and sometimes comes with bugs; you can set it to None to avoid using it (see the sketch below for how to build a signature manually).
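If the automatic signature inference misbehaves, you can also build the signature yourself with mlflow and pass it through log_model_kwargs instead of "auto". A minimal sketch, where the sample DataFrame is illustrative:

```python
import pandas as pd
from mlflow.models.signature import infer_signature

# Illustrative sample of the "instances" dataset: a single "text" column.
sample_instances = pd.DataFrame({"text": ["This movie is cool", "awful film"]})

# Explicit signature (column names and types) that could be passed as
# log_model_kwargs=dict(..., signature=signature) instead of signature="auto".
signature = infer_signature(sample_instances)
```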
The ml application (which contains both the training and inference pipelines) can be created step by step. The goal is to tag each node as either ["training"], ["inference"] or ["training", "inference"]. This makes it possible to share nodes and ensure consistency between the two pipelines.
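In kedro, this tagging is simply the tags argument of node. A minimal sketch, with illustrative node functions rather than the actual tutorial code:

```python
from kedro.pipeline import Pipeline, node


def lowerize(instances):  # illustrative deterministic preprocessing
    ...


def train_model(clean_text, labels):  # illustrative training step
    ...


ml_pipeline = Pipeline(
    [
        # Deterministic preprocessing reused at prediction time: tag it for both pipelines.
        node(lowerize, inputs="instances", outputs="clean_text", tags=["training", "inference"]),
        # Model fitting must only happen during training.
        node(train_model, inputs=["clean_text", "labels"], outputs="classifier", tags=["training"]),
    ]
)
```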
You can encounter the following use cases:
- a preprocessing node which performs deterministic operations with no parameters (lowerize text, remove punctuation...). Tag these nodes as ["training", "inference"] to ensure they will be used in both pipelines.
- a preprocessing node which performs deterministic operations with parameters shared between inference and training (e.g. remove stopwords listed in a given file). Tag such nodes as ["training", "inference"] to ensure they will be used in both pipelines, and persist the shared parameters in the catalog.yml:
# catalog.yml
english_stopwords:
  type: yaml.YAMLDataSet # <- This must be any Kedro Dataset other than "MemoryDataSet"
  filepath: data/01_raw/stopwords.yml # <- This must be a local path, no matter what your mlflow storage is (S3 or other)
- a preprocessing node which produces an object fitted on the data, which will be reused to apply the processing to new data. Examples include a tokenizer, an encoder, a vectorizer, and obviously the machine learning model itself... Such operations must absolutely be split into two different nodes (see the sketch just below):
    - a fit_object node (the name does not matter) which creates the object. It will be tagged as ["training"] only, because the object must not be refitted on new data. This object must be persisted in the catalog.yml for further reuse:
      # catalog.yml
      label_encoder:
        type: pickle.PickleDataSet # <- This must be any Kedro Dataset other than "MemoryDataSet"
        filepath: data/06_models/label_encoder.pkl # <- This must be a local path, no matter what your mlflow storage is (S3 or other)
    - a transform_data node (the name does not matter) which applies the object to the data. It will be tagged as ["training", "inference"] because it needs to be applied in both pipelines.
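A minimal sketch of such a fit/transform split, using a scikit-learn CountVectorizer as an illustrative stand-in for the tutorial's tokenizer:

```python
from sklearn.feature_extraction.text import CountVectorizer


def fit_vectorizer(training_texts):
    # Tagged ["training"] only: the vectorizer must not be refitted on new data.
    vectorizer = CountVectorizer()
    vectorizer.fit(training_texts)
    return vectorizer  # persist it in catalog.yml (e.g. as pickle.PickleDataSet)


def vectorize_text(texts, vectorizer):
    # Tagged ["training", "inference"]: the same fitted vectorizer transforms both
    # the training data and the new instances at prediction time.
    return vectorizer.transform(texts)
```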
- Some training-specific operations (split between train and test datasets, hyperparameter tuning...) which are tagged as ["training"] only.
- Some post-training analysis to assess the model's performance. Since these nodes are not used in inference, tag them as ["training"]. You can still log their outputs as artifacts to make comparison between runs easier:
# catalog.yml
xgb_feature_importance:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: matplotlib.MatplotlibWriter
    filepath: data/08_reporting/xgb_feature_importance.png
- Some post-processing on the predictions for better rendering. For instance, you may want to decode the predictions to return a string (e.g. "positive", "negative") instead of an array of probabilities. The situation is exactly the same as for the previous node: if you have fitted an object on data before training the model, persist it in the catalog.yml and apply this object to decode your predictions in a node after predicting with the ml model (a sketch is given below the catalog entry):
# catalog.yml
label_encoder:
  type: pickle.PickleDataSet
  filepath: data/06_models/label_encoder.pkl
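A minimal sketch of such a decoding node, assuming the persisted label_encoder is a scikit-learn LabelEncoder (the function name is illustrative):

```python
def decode_predictions(predictions, label_encoder):
    # Tagged ["inference"] (or ["training", "inference"]): turns class indices back
    # into human-readable labels such as "positive" / "negative".
    return label_encoder.inverse_transform(predictions)
```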
Once you have declared your training pipeline as a PipelineML object, the associated inference pipeline will be logged automatically at the end of the execution along with:
- all the persisted objects which are inputs for your inference pipeline (encoder, tokenizer...) as mlflow artifacts
- the conda environment with required packages
- the signature of the model
- your inference pipeline as a python function in a pickle file
- Run the pipeline
kedro run --pipeline=training
- Open the UI
kedro mlflow ui
- Navigate to the last "training" run:
The parameters have been automatically recorded! For the metrics, you can set them in the catalog.yml.
- Go to the artifacts section:
You can see:
- the kedro_mlflow_tutorial folder, which is the artifact_path we declared in the pipeline_ml_factory function.
- a folder with all the artifacts needed for inference which were produced by the training pipeline
- the MLmodel file which contains mlflow metadata, including the model signature we declared in pipeline_ml_factory
- the conda.yaml file which contains the environment we declared in pipeline_ml_factory
- the python_model.pkl object which contains the inference pipeline function we declared in pipeline_ml_factory
In this screenshot, we can also see the extra image "xgb_feature_importance.png" logged after model training.
By following these simple steps (basically ~5 lines of code to declare our training and inference pipelines in pipeline_registry.py with pipeline_ml_factory), we get perfect synchronicity between our training and inference pipelines. Each code change (adding a node or modifying a function), parameter change or data change (through artifact fitting) is automatically resolved. You are now sure that you will be able to predict from any old run in one line of code!
If anyone else wants to reuse your model from python, mlflow's load_model function is what you need:
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from mlflow.pyfunc import load_model

PROJECT_PATH = r"<your/project/path>"
RUN_ID = "<your-run-id>"

bootstrap_project(PROJECT_PATH)
session = KedroSession.create(
    project_path=PROJECT_PATH,
    package_name="kedro_mlflow_tutorial",
)
local_context = session.load_context()  # setup mlflow config
instances = local_context.catalog.load("instances")
model = load_model(f"runs:/{RUN_ID}/kedro_mlflow_tutorial")
predictions = model.predict(instances)
The predictions object is a pandas.DataFrame and can be handled as usual.
Say that you want to reuse this trained model in a kedro Pipeline, like in the user_app. The easiest way to do it is to add the model to the catalog.yml file:
pipeline_inference_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  artifact_path: kedro_mlflow_tutorial  # the name of your mlflow folder = the artifact_path declared in pipeline_ml_factory
  run_id: <your-run-id>  # put it in globals.yml to help people find out what to modify
Then you can reuse it in a node to predict with this model, which is the entire inference pipeline as it was when you launched the training.
An example is given in the user_app folder.
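A minimal sketch of such a node, assuming the pipeline_inference_model catalog entry declared above (the function name is illustrative):

```python
def predict_with_logged_model(pipeline_inference_model, instances):
    # pipeline_inference_model is loaded by kedro-mlflow as an mlflow pyfunc model,
    # i.e. the whole inference pipeline (preprocessing + model + decoding) logged at training time.
    return pipeline_inference_model.predict(instances)
```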
To try it out:
- rerun the etl_instances pipeline with the globals parameter huggingface_split: test to create test data,
- put your own run_id in globals.yml,
- launch the kedro run --pipeline=user_app command to see the predictions on the test data. Enjoy modifying the user_app to add your own monitoring!
The two previous scenarios assume that your end user will use python (or, even more restrictively, kedro) to load the model and predict with it. For many applications, the real "user app" which consumes your pipeline is not written in python, and is not even aware of your code.
Fortunately, mlflow provides helpers to serve the model as an API with one line of code:
mlflow models serve -m "runs:/<your-model-run-id>/kedro_mlflow_tutorial"
This will serve your model as an API (beware: there are known issues on windows). You can test it with:
curl -d "{\"columns\":[\"text\"],\"index\":[0,1],\"data\":[[\"This movie is cool\"],[\"awful film\"]]}" -H "Content-Type: application/json" localhost:5000/invocations
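If you prefer testing from python, here is a minimal equivalent using requests, assuming the default local port 5000 and the same pandas "split" payload accepted by this mlflow version:

```python
import requests

# Same payload as the curl example: a pandas DataFrame in "split" orientation.
payload = {
    "columns": ["text"],
    "index": [0, 1],
    "data": [["This movie is cool"], ["awful film"]],
}
response = requests.post("http://localhost:5000/invocations", json=payload)
print(response.json())  # predictions returned by the served inference pipeline
```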
The most common way to deploy it is to dockerize it, but this is beyond the scope of this tutorial. Mlflow provides a lot of documentation on deployment to different target platforms.