- Learn about DVC
- Configure a Machine Learning Pipeline with DVC to fetch raw data and train an ML model
- Create a pipeline in GoCD to automate your ML training pipeline
- Add automated tests to evaluate and govern your ML models
- Combine both GoCD pipelines to promote and deploy the new model to production
- Configure DVC to use your GCP bucket for remote storage (replace X with your user ID):
dvc remote modify default url gs://cd4ml-bucket-X
- Create your Machine Learning pipeline with dvc:
dvc run -f input.dvc -d src/download_data.py -o data/raw/store47-2016.csv python src/download_data.py
dvc run -f split.dvc -d data/raw/store47-2016.csv -d src/splitter.py -o data/splitter/train.csv -o data/splitter/validation.csv python src/splitter.py
dvc run -d data/splitter/train.csv -d data/splitter/validation.csv -d src/decision_tree.py -o data/decision_tree/model.pkl -M results/metrics.json python src/decision_tree.py
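The three commands above chain together through their `-d` (dependency) and `-o` (output) files. The sketch below is a simplified illustration of that chaining, not DVC's actual implementation (DVC compares file checksums rather than taking a changed file name). The `STAGES` table and the `stages_to_rerun` helper are hypothetical names introduced only for this example; the file paths are copied from the commands.

```python
# Dependencies (-d) and outputs (-o, -M) copied from the three dvc run commands.
STAGES = {
    "input.dvc": {
        "deps": ["src/download_data.py"],
        "outs": ["data/raw/store47-2016.csv"],
    },
    "split.dvc": {
        "deps": ["data/raw/store47-2016.csv", "src/splitter.py"],
        "outs": ["data/splitter/train.csv", "data/splitter/validation.csv"],
    },
    # The third command has no -f; later steps refer to this stage as model.pkl.dvc.
    "model.pkl.dvc": {
        "deps": [
            "data/splitter/train.csv",
            "data/splitter/validation.csv",
            "src/decision_tree.py",
        ],
        # -M declares results/metrics.json as a metrics output of the stage.
        "outs": ["data/decision_tree/model.pkl", "results/metrics.json"],
    },
}


def stages_to_rerun(changed_file, stages=STAGES):
    """Return the stages invalidated by a change to one file.

    The result follows the declaration order of `stages`, which matches the
    execution order for this linear pipeline.
    """
    dirty_files = {changed_file}
    dirty_stages = set()
    progress = True
    while progress:
        progress = False
        for name, stage in stages.items():
            if name in dirty_stages:
                continue
            # A stage is stale if any of its dependencies changed.
            if dirty_files.intersection(stage["deps"]):
                dirty_stages.add(name)
                # Its outputs will be regenerated, so downstream stages are stale too.
                dirty_files.update(stage["outs"])
                progress = True
    return [name for name in stages if name in dirty_stages]
```

Under this model, editing `src/decision_tree.py` invalidates only the training stage, while editing `src/download_data.py` invalidates all three.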
- Add, commit, and push your changes:
git add .
git commit -m "Creating ML pipeline"
git push
- Create a machine learning training pipeline in GoCD:
  - Go to GoCD's "Admin" > "Pipelines" menu and create a new pipeline. Give it a name related to your username (e.g. ml-pipeline-X, replacing X with your user ID).
  - Configure your GitHub repository URL (e.g. https://github.com/<github-user>/continuous-intelligence-workshop.git) as a Git material, and use the existing ml-pipeline-gcp-template template when configuring the stages.
  - Click "Finish".
- Combine both pipelines:
  - Go back to edit your original ci-workshop-app-X pipeline again.
  - In the "Materials" tab, add your new pipeline as a new material (double-click to get the correct auto suggestion).
  - Expand the "build-and-publish" stage, and click on the "build" job.
  - Update the second build task to pull the latest model using DVC instead of downloading a static version from Google Storage. Replace the command
    python src/download_data.py --model
    with
    GOOGLE_APPLICATION_CREDENTIALS=./secret.json dvc pull model.pkl.dvc
  - Save and go back to the main Dashboard page.
- Unpause the machine learning pipeline to train and publish your model.
WARNING: The pipeline should fail because the model training accuracy is not good enough!
- In your code, change the model training approach to use a Random Forest algorithm by editing the src/decision_tree.py file and replacing Model.DECISION_TREE with Model.RANDOM_FOREST on the last line of the file.
- Re-run your dvc pipeline locally:
dvc repro model.pkl.dvc
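The algorithm switch from `Model.DECISION_TREE` to `Model.RANDOM_FOREST` might follow a pattern like the one sketched below. This is a guess at the general shape, not the contents of the workshop's `src/decision_tree.py`: the enum values, the `make_estimator` helper, the choice of scikit-learn regressors, and the `n_estimators` setting are all assumptions made for illustration.

```python
from enum import Enum


class Model(Enum):
    # Member names match the ones mentioned in the text; the values are assumed.
    DECISION_TREE = "decision_tree"
    RANDOM_FOREST = "random_forest"


def make_estimator(model):
    """Build an untrained scikit-learn regressor for the chosen algorithm."""
    # Imported here so the enum itself can be used without scikit-learn installed.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.tree import DecisionTreeRegressor

    if model is Model.DECISION_TREE:
        return DecisionTreeRegressor()
    if model is Model.RANDOM_FOREST:
        # Assumed hyperparameter; tune it for your data.
        return RandomForestRegressor(n_estimators=50)
    raise ValueError(f"Unsupported model: {model}")
```

With a structure like this, changing the algorithm is a one-argument edit at the call site, which is why the step above only touches a single line. Because the training script is a dependency of the model stage, `dvc repro` detects the edit and retrains.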
- Add, commit, and push your changes, and watch your pipeline execute and go green:
git add .
git commit -m "Improving model algorithm"
git push
- Once the machine learning pipeline succeeds, it will trigger a new application deployment pipeline, which will pull the new improved model and deploy it to production. Visit your application again to verify that you get better predictions!
- Done! Go to the next exercise