Exercise 3: Create your Machine Learning Pipeline

Goals

Learn about DVC
Configure a Machine Learning Pipeline with DVC to fetch raw data and train an ML model
Create a pipeline in GoCD to automate your ML training pipeline
Add automated tests to evaluate and govern your ML models
Combine both GoCD pipelines to promote and deploy the new model to production

Step by Step instructions

Configure DVC to use your GCP bucket for remote storage (replace X with your user ID):

dvc remote modify default url gs://cd4ml-bucket-X

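Create your Machine Learning pipeline with DVC. The third command has no -f flag, so DVC names its stage file after the first output (model.pkl.dvc), which is the name the later steps refer to: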

dvc run -f input.dvc -d src/download_data.py -o data/raw/store47-2016.csv python src/download_data.py
dvc run -f split.dvc -d data/raw/store47-2016.csv -d src/splitter.py -o data/splitter/train.csv -o data/splitter/validation.csv python src/splitter.py
dvc run -d data/splitter/train.csv -d data/splitter/validation.csv -d src/decision_tree.py -o data/decision_tree/model.pkl -M results/metrics.json python src/decision_tree.py
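The three commands above form a chain: each stage's outputs (-o, -M) are the next stage's dependencies (-d). As a rough illustration of how an execution order falls out of those declarations, here is a minimal Python sketch. It is not DVC's implementation; the `STAGES` dictionary and the `stage_order` helper are hypothetical names that simply restate the flags from the commands, and the stage name `model.pkl.dvc` is taken from the later steps of this exercise.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs copied from the three `dvc run` commands above.
STAGES = {
    "input.dvc": {
        "deps": ["src/download_data.py"],
        "outs": ["data/raw/store47-2016.csv"],
    },
    "split.dvc": {
        "deps": ["data/raw/store47-2016.csv", "src/splitter.py"],
        "outs": ["data/splitter/train.csv", "data/splitter/validation.csv"],
    },
    "model.pkl.dvc": {
        "deps": [
            "data/splitter/train.csv",
            "data/splitter/validation.csv",
            "src/decision_tree.py",
        ],
        "outs": ["data/decision_tree/model.pkl", "results/metrics.json"],
    },
}


def stage_order(stages):
    """Order stages so each one runs after the stages that produce its inputs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on another stage when one of its deps is that stage's output.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))  # ['input.dvc', 'split.dvc', 'model.pkl.dvc']
```

Source files such as src/splitter.py are not produced by any stage, so they add no edges to the graph; they only matter when deciding whether a stage is out of date.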

Add, commit, and push your changes:

git add .
git commit -m "Creating ML pipeline"
git push

Create a machine learning training pipeline in GoCD:

Go to GoCD's "Admin" > "Pipelines" menu and create a new pipeline. Give it a name related to your username (e.g. ml-pipeline-X, replacing X with your user ID).
Configure your GitHub repository URL (e.g. https://github.com/<github-user>/continuous-intelligence-workshop.git) as a Git material, and use the existing ml-pipeline-gcp-template template when configuring the stages.

Click "Finish"

Combine both pipelines:

Go back and edit your original ci-workshop-app-X pipeline.
In the "Materials" tab add your new pipeline as a new material (double-click to get the correct auto suggestion).
Expand the "build-and-publish" stage, and click on the "build" job.
Update the second build task to pull the latest model with DVC instead of downloading a static version from Google Storage: replace the command python src/download_data.py --model with GOOGLE_APPLICATION_CREDENTIALS=./secret.json dvc pull model.pkl.dvc

Save and go back to the main Dashboard page

Unpause the machine learning pipeline to train and publish your model.

WARNING: The pipeline should fail because the model training accuracy is not good enough!

Improving our Model

In your code, change the model training approach to use a Random Forest algorithm by editing the src/decision_tree.py file and replacing Model.DECISION_TREE with Model.RANDOM_FOREST on the last line of the file.
Re-run your DVC pipeline locally:

dvc repro model.pkl.dvc
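Because src/decision_tree.py is a declared dependency of the training stage, editing it makes that stage out of date, while the download and split stages can be skipped. DVC detects this by comparing checksums recorded in the stage files against the current files. The following Python sketch illustrates the idea only; `md5_of` and `changed_deps` are hypothetical helpers, and DVC's real logic covers more cases (directories, outputs, caching).

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_deps(recorded, root="."):
    """List dependencies whose current checksum differs from the recorded one.

    `recorded` maps a relative path to the checksum stored at the last run.
    A missing file also counts as changed.
    """
    changed = []
    for rel_path, old_checksum in recorded.items():
        current = Path(root) / rel_path
        if not current.exists() or md5_of(current) != old_checksum:
            changed.append(rel_path)
    return changed
```

If `changed_deps` returns an empty list for a stage, nothing it depends on has moved and the stage does not need to run again; any non-empty result means the stage, and every stage downstream of it, should be reproduced.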

Add, commit, and push your changes, and watch your pipeline execute and go green:

git add .
git commit -m "Improving model algorithm"
git push

Once the machine learning pipeline succeeds, it will trigger a new application deployment pipeline, which will pull the new improved model and deploy it to production. Visit your application again to verify that you get better predictions!
Done! Go to the next exercise
