This recipe shows how to run High Performance ML Algorithms (HPMLA) on CPUs across Azure VMs via Open MPI.
Please refer to this set of sample configuration files for this recipe.
The pool configuration should enable the following properties:
- `vm_size` should be a CPU-only instance, for example, `STANDARD_D2_V2`
- `inter_node_communication_enabled` must be set to `true`
- `max_tasks_per_node` must be set to `1` or omitted
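As a point of reference, a minimal pool configuration sketch in Batch Shipyard's YAML format might look like the following; the pool `id`, platform image, and `vm_count` values are illustrative assumptions, and only the three properties listed above are prescribed by this recipe.

```yaml
# Hypothetical pool.yaml sketch; id, platform image, and vm_count are placeholders.
pool_specification:
  id: hpmla-cpu-openmpi            # illustrative pool id
  vm_configuration:
    platform_image:                # assumed Linux platform image
      publisher: Canonical
      offer: UbuntuServer
      sku: "16.04-LTS"
  vm_size: STANDARD_D2_V2          # CPU-only instance as required
  vm_count:
    dedicated: 2                   # illustrative node count
  inter_node_communication_enabled: true
  max_tasks_per_node: 1
```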
The global configuration should set the following properties:
- `docker_images` array must have a reference to a valid HPMLA Docker image that can be run with OpenMPI. The image denoted with the `0.0.1` tag found in `msmadl/symsgd:0.0.1` is compatible with Azure Batch Shipyard VMs.
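A corresponding global configuration sketch might look like the following. The `azureblob` shared data volume shown here anticipates the `shared_data_volumes` task property discussed below; the storage account link name (`mystorageaccount`), volume name (`azureblob_vol`), and container name are illustrative placeholders, not values from this recipe.

```yaml
# Hypothetical config.yaml sketch; storage link, volume, and container names are placeholders.
batch_shipyard:
  storage_account_settings: mystorageaccount   # credentials link defined elsewhere
global_resources:
  docker_images:
    - msmadl/symsgd:0.0.1                      # must match the task's docker_image
  volumes:
    shared_data_volumes:
      azureblob_vol:                           # referenced by the task definition
        volume_driver: azureblob
        storage_account_settings: mystorageaccount
        azure_blob_container_name: mycontainer # container holding the shredded training data
        container_path: $AZ_BATCH_NODE_SHARED_DIR/azblob
```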
The jobs configuration should set the following properties within the `tasks` array, which should have a task definition containing:
- `docker_image` should be the name of the Docker image for this container invocation. For this example, this should be `msmadl/symsgd:0.0.1`. Please note that the `docker_images` array in the global configuration should match this image name.
- `command` should contain the command to pass to the Docker run invocation. For this HPMLA training example with the `msmadl/symsgd:0.0.1` Docker image, the application `command` to run would be: `"/parasail/run_parasail.sh -w /parasail/supersgd -l 1e-4 -k 32 -m 1e-2 -e 10 -r 10 -f $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file> -t 1 -g 1 -d $AZ_BATCH_TASK_WORKING_DIR/models -b $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file>"` (see the jobs sketch after this list)
  `run_parasail.sh` takes these parameters:
  - `-w` the HPMLA superSGD directory
  - `-l` learning rate
  - `-k` approximation rank constant
  - `-m` model combiner convergence threshold
  - `-e` total epochs
  - `-r` rounds per epoch
  - `-f` training file prefix
  - `-t` number of threads
  - `-g` log global models every this many epochs
  - `-d` log global models to this directory at the host
  - `-b` location for the algorithm's binary
- The training data will need to be shredded to match the number of VMs and the thread count per VM, and then deployed to a mounted Azure blob container to which the VM Docker images have read/write access. For example, with 2 VMs and one thread per VM, the data would be split into 2 partitions sharing the training file prefix passed to `-f`. A basic Python script that can be used to shred and deploy the training data to a blob container, along with other data shredding files, can be found here.
- `shared_data_volumes` should contain the shared data volume with an `azureblob` volume driver as specified in the global configuration file found here.
- `multi_instance` property must be defined
  - `num_instances` should be set to `pool_current_dedicated` or `pool_current_low_priority`
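Putting the task properties above together, a jobs configuration sketch could look like this. The job `id` is illustrative, `azureblob_vol` assumes the volume name from the global configuration sketch above, and `<container_name>` stands in for the container name from the data shredding configuration file.

```yaml
# Hypothetical jobs.yaml sketch; job id and container name are placeholders.
job_specifications:
  - id: hpmla-train                 # illustrative job id
    tasks:
      - docker_image: msmadl/symsgd:0.0.1
        shared_data_volumes:
          - azureblob_vol           # azureblob volume from the global configuration
        multi_instance:
          num_instances: pool_current_dedicated
        command: >-
          /parasail/run_parasail.sh -w /parasail/supersgd -l 1e-4 -k 32
          -m 1e-2 -e 10 -r 10
          -f $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name> -t 1 -g 1
          -d $AZ_BATCH_TASK_WORKING_DIR/models
          -b $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name>
```

With `num_instances: pool_current_dedicated`, the MPI task spans all dedicated nodes in the pool at execution time; substitute `pool_current_low_priority` for pools provisioned with low-priority nodes.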
Supplementary files can be found here.
You must agree to the following licenses prior to use: