This repository serves as both an onboarding document and a template repository for quick-starting machine learning experimentation at the Autonomous Systems Research Group at Microsoft.
Note: Use the table of contents icon in the top left corner of this document to jump to a specific section quickly.
- Click the green button titled Use this template at the top right, and name your new repository.
- You can clone your repo when it looks like example_repo_generated_from_ml_template.
Note that after you create a repository from the template, it will take about 20 seconds for an automated GitHub Action to clean up the generated repository using an auto-commit. Please ensure your repository looks like example_repo_generated_from_ml_template before cloning it.
For the template repository, we will use:
- Pytorch Lightning
- For minimizing boilerplate code
- OmegaConf
- Please go through OmegaConf's github readme for tutorials.
- For config management
Note: we have an archived branch called hydra which uses hydra for config management.
- Logging
- We primarily use tensorboard. Amulet automatically patches tensorboard scalars to MLFlow for viewing metrics in Azure ML Studio.
- Conda and Docker
- For development
- VSCode
  - Extensions:
    - Hit Ctrl+Shift+P, type Show Recommended Extensions, and install them from the sidebar. Or click "yes" when you get a VS Code pop-up to install the recommended extensions, which are specified in .vscode/extensions.json. Follow this doc for more details. Python, Pylance, Docker, GitLens, YAML, and the Remote development extension pack are strongly recommended.
  - Debugging:
    - Please follow VSCode docs and tutorials on Python debugging.
    - A minimal debugging configuration has been provided in .vscode/launch.json. Please see VSCode docs on launch.json configs and config options.
- Conda
  - Recommended for local development and debugging.
  - Note: For CUDA 11.6, see Creating the conda environment from scratch (click to expand) below.
# create env
conda env create --file docker/environment.yml
# activate it
conda activate ml_template
# install this repo
(ml_template) $ pip install -e .
# install pre-commit (recommended). Scroll down to the #Developing section for details.
(ml_template) $ pre-commit install
Note: If you install additional packages in your environment manually, you should update environment.yml correspondingly by running conda env export | grep -v "^prefix: " > docker/environment.yml.
Creating the conda environment from scratch (click to expand)
conda update -n base -c defaults conda
conda create --name ml_template python=3.9
conda activate ml_template
conda install pip
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install pytorch-lightning -c conda-forge
pip install omegaconf \
  pytest \
  sh \
  pre-commit \
  mlflow \
  azureml-mlflow \
  azureml-core \
  torch_tb_profiler \
  opencv-python \
  black isort flake8 \
  psutil \
  rich
conda env export | grep -v "^prefix: " > docker/environment.yml
pre-commit install
pre-commit run --all-files
pip install -e .
For CUDA 11.6:
conda update -n base -c defaults conda
conda create --name ml_template_cu116 python=3.9
conda activate ml_template_cu116
conda install pip
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning
pip install omegaconf \
  pytest \
  sh \
  pre-commit \
  mlflow \
  azureml-mlflow \
  azureml-core \
  torch_tb_profiler \
  opencv-python \
  black isort flake8 \
  psutil \
  rich
conda env export | grep -v "^prefix: " > docker/environment_cu116.yml
pre-commit install
pre-commit run --all-files
pip install -e .
Upgrading pytorch and cudatoolkit (click to expand)
conda remove pytorch torchvision torchaudio cudatoolkit
# then follow pytorch installation steps, for example:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
# then update pytorch lightning:
pip install pytorch-lightning --upgrade
pip install pytorch-lightning[extra] --upgrade
pip install -U jsonargparse[signatures] --upgrade
- Docker
  - While submitting jobs to AzureML, we take our local conda environment and overlay it on an appropriate docker base image. For a new project or a custom conda environment, you can build the docker image locally as explained in a note later in this section. Optionally, the docker image build can be automated by CI (as explained later) if your project has a frequently updated conda environment.
  - For ml_template, we have three docker images built automatically on each commit to the main branch or to a branch corresponding to a Pull Request. Docker images are pushed to the PRIVATEAZURECONTAINERREGISTRYNAME container registry under ml_template. To automate this for a repository generated from this template, please create an Azure Pipeline that uses azure-pipelines.yml.
  - The following tags correspond to the latest commit on the main branch.
-
Tag | Dockerfile | docker pull command | Base Image |
---|---|---|---|
latest or latest-azureml | azureml | docker pull PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest | mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:latest |
latest-nightly or latest-azureml-nightly | azureml_nightly | docker pull PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-nightly | mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:latest |
latest-nvidia | nvidia | docker pull PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-nvidia | nvcr.io/nvidia/pytorch:22.06-py3 |
- Building docker images and running docker containers locally can be useful for reproducing, on your local machine, issues that might occur while submitting to AzureML. Please peruse public documentation on docker + vscode.
# pull image with [azureml image](https://hub.docker.com/_/microsoft-azureml?tab=description) as base with docker/environment.yml on top
docker pull PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest
# (optional) pull image with nvidia pytorch image as base. See the note below for more details.
docker pull PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-nvidia
# run image
docker run -it --gpus=all -v <PATH_TO_THIS_REPO>:<PATH_TO_THIS_REPO> PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest
# (optional, recommended) give your container a name
docker run -it --rm --name=MYFANCYCONTAINERNAME --gpus=all -v <PATH_TO_THIS_REPO>:<PATH_TO_THIS_REPO> PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest
# setup the repo (run inside the container)
pip install -e .
# install pre-commit (recommended). Scroll down to the "Developing" section for details.
pre-commit install
More details on docker image tags for Pull Request and main branch builds (click to expand)
Similar to the main branch, for each pull request, we have:
- PR-<pr_number>-latest aka PR-<pr_number>-latest-azureml
- PR-<pr_number>-latest-nightly aka PR-<pr_number>-latest-azureml-nightly
- PR-<pr_number>-latest-nvidia
And finally, for both main and PR branches, we have tags corresponding to git commit hashes:
- main-<gitcommithash>-azureml and PR-<pr_number>-<gitcommithash>-azureml
- main-<gitcommithash>-azureml-nightly and PR-<pr_number>-<gitcommithash>-azureml-nightly
- main-<gitcommithash>-nvidia and PR-<pr_number>-<gitcommithash>-nvidia
For example:
- main-7fadad2b-azureml, main-7fadad2b-azureml-nightly, main-7fadad2b-nvidia: correspond to commit 7fadad2b on the main branch with our three different dockerfiles
- PR-50-latest-azureml, PR-50-latest-azureml-nightly, PR-50-latest-nvidia: correspond to the latest commit on PR#50 with our three different dockerfiles
- PR-50-eef3b90-azureml, PR-50-eef3b90-azureml-nightly, PR-50-eef3b90-nvidia: correspond to commit eef3b90 on PR#50 with our three different dockerfiles
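The tag naming scheme above can be sketched as a small shell helper. This is only an illustration of the naming pattern, not something taken from azure-pipelines.yml; the function name image_tag and its argument order are hypothetical.

```shell
#!/bin/sh
# image_tag BRANCH COMMIT FLAVOR  (hypothetical helper, for illustration only)
#   BRANCH: "main" or "PR-<pr_number>"
#   COMMIT: a git commit hash, or "latest"
#   FLAVOR: azureml | azureml-nightly | nvidia
image_tag() {
  if [ "$1" = "main" ] && [ "$2" = "latest" ]; then
    # On main, the latest tags carry no branch prefix (e.g. latest-azureml).
    echo "latest-$3"
  else
    echo "$1-$2-$3"
  fi
}

image_tag main 7fadad2b azureml         # main-7fadad2b-azureml
image_tag PR-50 latest azureml-nightly  # PR-50-latest-azureml-nightly
image_tag PR-50 eef3b90 nvidia          # PR-50-eef3b90-nvidia
```

The short aliases (latest for latest-azureml, latest-nightly for latest-azureml-nightly, and their PR equivalents) are not generated by this sketch.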
Building and understanding our Dockerfiles (click to expand)
-
We have three docker files:
- azureml base:
- nvidia pytorch base:
- Both of the azureml images grab a base image from here and put the user's conda environment (docker/environment.yml) on top of the base image.
  - In the latest-azureml version, packages in your local conda environment should match the docker image exactly.
  - In the latest-azureml-nightly image, pytorch (including cudatoolkit) and pytorch lightning are updated to the nightly versions.
- The nvidia pytorch base image grabs a base image from here (here for details), which already has the latest version of pytorch. Instead of using the user's conda environment, this dockerfile uses pip to install pytorch lightning and other dependencies on top of the base image. So this image can have different package versions than your conda environment.
All docker images accept a build argument to update the base image version easily:
- azureml images:
  - take the base azure image name's suffix and tag. See available options here.
  - example: openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:latest, and so on.
- nvidia pytorch image:
  - takes the base nvidia image name's tag only.
  - see available tags here and the release notes for their contents
  - examples: 22.06-py3, 22.05-py3, and so on.
Please review the arguments in the dockerfiles carefully. These can also be seen by reading through azure-pipelines.yml.
Building the azure-ml base + conda env images locally:
cd docker;
docker build \
-f Dockerfile_base_azureml \
--build-arg BASE_IMAGE=openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:latest \
-t PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-azureml .
# note that in the PRIVATEAZURECONTAINERREGISTRYNAME acr, latest is equivalent to latest-azureml tag. So, we can just re-tag the image:
docker tag PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-azureml PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest
For the CUDA 11.6 version:
cd docker;
docker build \
-f Dockerfile_base_azureml_cu116 \
--build-arg BASE_IMAGE=openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04:latest \
-t PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-azureml-cu116 .
# as with latest and latest-azureml, we can re-tag the image so that latest-cu116 points to latest-azureml-cu116:
docker tag PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-azureml-cu116 PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-cu116
Building the nvidia-pytorch image locally:
# build the nvidia-pytorch image locally
cd docker;
docker build \
-f Dockerfile_base_nvidia \
--build-arg BASE_IMAGE=22.06-py3 \
-t PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io/ml_template:latest-nvidia .
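The three local builds above differ only in the Dockerfile, the BASE_IMAGE build argument, and the tag. A minimal sketch of a helper that prints the matching docker build command for each flavor follows; the function name build_cmd and the REGISTRY variable are hypothetical, and the command is only echoed (a dry run), not executed.

```shell
#!/bin/sh
# Registry placeholder copied from this document; replace with your own.
REGISTRY="PRIVATEAZURECONTAINERREGISTRYNAME.azurecr.io"

# build_cmd FLAVOR  (hypothetical helper): print the docker build command
# for one of the flavors described above, to be run from the docker/ directory.
build_cmd() {
  case "$1" in
    azureml)
      dockerfile="Dockerfile_base_azureml"
      base="openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04:latest" ;;
    azureml-cu116)
      dockerfile="Dockerfile_base_azureml_cu116"
      base="openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04:latest" ;;
    nvidia)
      dockerfile="Dockerfile_base_nvidia"
      base="22.06-py3" ;;
    *)
      echo "unknown flavor: $1" >&2
      return 1 ;;
  esac
  echo "docker build -f $dockerfile --build-arg BASE_IMAGE=$base -t $REGISTRY/ml_template:latest-$1 ."
}

# Print the command for the nvidia flavor without running it.
build_cmd nvidia
```

Printing the command first makes it easy to check the Dockerfile and base image pairing before starting a long build.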
Developing inside docker containers with VSCode: (click to expand)
-
Note: This method can be used on an Azure VM or locally with no change and uses docker
Follow the steps below:
- Connect to your remote Azure VM using VS Code
- Open the workspace within a docker container for development, either using the popup as shown in the animation above, or by searching for (Re)Build and (Re)open in container in the command palette (hit Ctrl+Shift+P to open the command palette)
- After setup is complete, it is time to set up the repository:
pip install -e .
pre-commit install
Note: By default, the devcontainer uses the azureml-conda base image. We can also use the nvidia base image by modifying the dockerfile line in devcontainer.json. Similarly, we can edit the dockerfile's build arguments there.
-
Understanding OmegaConf and config files
- Please review OmegaConf's github readme for their documentation, slides (for ver 2.1), and a live tutorial.
- Single GPU
python src/train.py base=configs/train.yaml trainer.num_nodes=1 trainer.devices=1
- Multiple GPUs
python src/train.py base=configs/train.yaml trainer.num_nodes=1 trainer.devices=4
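The two commands above differ only in the trainer.devices override. As a sketch, that override could be derived from the CUDA_VISIBLE_DEVICES environment variable. The helper names count_visible_gpus and train_cmd are hypothetical, falling back to one device when the variable is unset or empty is an assumption, and the command is printed rather than run.

```shell
#!/bin/sh
# count_visible_gpus (hypothetical helper): number of comma-separated entries
# in CUDA_VISIBLE_DEVICES, or 1 if the variable is unset or empty (assumption).
count_visible_gpus() {
  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    # Split on commas and count the non-empty entries.
    echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
  else
    echo 1
  fi
}

# train_cmd (hypothetical helper): print the single-node training command
# with trainer.devices set from the visible GPU count.
train_cmd() {
  echo "python src/train.py base=configs/train.yaml trainer.num_nodes=1 trainer.devices=$(count_visible_gpus)"
}

# Example: with four visible GPUs, the printed command ends in trainer.devices=4.
( CUDA_VISIBLE_DEVICES=0,1,2,3; train_cmd )
```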
Note: This section used internal tools for job submission to Azure ML workspaces. This section is not supported publicly at the time of writing. However, one may peruse existing public documentation on azure ml.
The template has some basic tests in the tests/ directory. To run them:
# run all tests
pytest
# run single test
pytest tests/test_dev_fast_run.py
List of tests implemented:
- fast_dev_run: a simple check that runs your trainer on a single batch of the train, valid, and test datasets.
While adding new features, it can also be useful to quickly check that your code works by running:
python src/train.py base=configs/train.yaml --fast_dev_run=True
We use:
- black for code formatting
- isort for import ordering
- pycln for removing unused imports
- Running locally:
$ cd ml_template
$ black .
$ isort .
$ pycln --all .
pre-commit hooks automate black autoformatting and ensuring PEP8 compliance.
- Setting up:
$ cd ml_template
$ pre-commit install
- Running: After the above step, pre-commit will run automatically when you git commit. If the run fails with errors in red, you can check the edits made by pre-commit with git diff. If the changes look good, (1) git add those files again, and then (2) run git commit again.
Optionally, you can also run pre-commit manually by:
$ pre-commit run --all-files
- Updating hooks: Use the autoupdate command to keep the versions of formatters in .pre-commit-config.yaml up to date.
$ pre-commit autoupdate
-
Github Actions
- Pre-commit checks
- Template cleanup: When a new repository is generated using this template, this action replaces README.md with README_template.md to keep Microsoft links internal.
- Azure Pipelines
  - Create an azure devops pipeline for your repository. This automates building your docker images and also runs pytest on them.
  - The azure pipeline logs can be seen on the Azure DevOps webpage, but not directly in the github UI. Pull Request example:
    - You can click View more details on Azure Pipelines under the Checks section of a github PR.
    - See PR#6/checks for an example.
  - See the job BuildDockerImageAndPush in azure-pipelines.yml. It will build the image in docker/Dockerfile and push it to a private azure container registry. See the docker section under #running-locally for details.
- conda environment.yml update: If you install packages in conda, update docker/environment.yml by running conda env export | grep -v "^prefix: " > docker/environment.yml, and send a PR.
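To illustrate what the grep -v "^prefix: " filter does: the output of conda env export ends with a prefix: line holding the absolute path of the environment on the exporting machine, and the filter drops that line so the file stays portable. The sample export text below is made up for illustration, and the helper names sample_export and strip_prefix are hypothetical.

```shell
#!/bin/sh
# sample_export (hypothetical): stands in for the output of `conda env export`.
# The package list and path are invented for this example.
sample_export() {
  cat <<'EOF'
name: ml_template
channels:
  - pytorch
dependencies:
  - python=3.9
prefix: /home/alice/miniconda3/envs/ml_template
EOF
}

# strip_prefix (hypothetical): the same filter used in the command above.
# It removes every line that starts with "prefix: ".
strip_prefix() {
  grep -v "^prefix: "
}

# Prints the sample export without the machine-specific prefix line.
sample_export | strip_prefix
```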
-
Pytorch Lightning:
-
Pytorch v/s Pytorch Lightning
-
Template / reference repositories
- https://github.com/ashleve/lightning-hydra-template
- https://github.com/lkhphuc/lightning-hydra-template
- Pytorch lightning bolts
- Look inside the code for datamodules, datasets, models, etc: https://github.com/PyTorchLightning/lightning-bolts/tree/master/pl_bolts
-
-
Pytorch Geometric:
-
Pytorch data, datapipes, dataloaders: