The RDMA-capable instances (Important): https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#rdma-capable-instances
The cluster configuration options to enable RDMA (Important): https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#cluster-configuration-options
The network topology considerations (Important): https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#network-topology-considerations
Based on the knowledge section, mark each RDMA-capable machine with a specific label in layout.yaml. You can choose any label name you like.
This tutorial uses the following label:
machine-list:
  - hostname: example-hosts
    hostip: x.x.x.x
    machine-type: example
    k8s-role: worker
    pai-worker: "true"
    # The label of RDMA-capable machines in this example
    rdma: "true"
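To make the role of the label concrete, here is a minimal Python sketch of how a `key=value` filter such as `rdma=true` could select machines from the layout. It is an illustration only, not paictl's actual implementation: the function name `machines_with_label` and the second host `cpu-host` are hypothetical, and the layout is written as an already-parsed dictionary instead of being read from layout.yaml.

```python
def machines_with_label(layout, key, value):
    """Return hostnames of machines whose label `key` equals `value`."""
    return [
        machine["hostname"]
        for machine in layout.get("machine-list", [])
        if str(machine.get(key)) == value
    ]


# The layout as it might look after parsing layout.yaml (assumed structure).
layout = {
    "machine-list": [
        {"hostname": "example-hosts", "hostip": "x.x.x.x",
         "machine-type": "example", "k8s-role": "worker",
         "pai-worker": "true", "rdma": "true"},
        # Hypothetical machine without the RDMA label.
        {"hostname": "cpu-host", "hostip": "x.x.x.x",
         "machine-type": "example", "k8s-role": "worker",
         "pai-worker": "true"},
    ]
}

print(machines_with_label(layout, "rdma", "true"))
```

Only machines that carry the label are selected, so an unlabeled machine is skipped by any step that filters on `rdma=true`.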
cd pai/
sudo ./paictl.py utility sftp-copy -p /path/to/cluster/config -n Azure-RDMA-enable.sh -s src/azure-rdma -d /tmp -f rdma=true
cd pai/
sudo ./paictl.py utility ssh -p /path/to/cluster/config -f rdma=true -c "sudo /bin/bash /tmp/Azure-RDMA-enable.sh"
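If you use a different label or config path, the two commands above change in the same places. The following Python sketch only builds and prints the command lines so you can review them before running anything; it does not execute them. The helper name `rdma_enable_commands` and its parameters are hypothetical, and the flags are copied from the commands in this tutorial.

```python
import shlex


def rdma_enable_commands(config_path, label="rdma=true", dest="/tmp"):
    """Build the copy and run commands for the Azure RDMA enable script."""
    script = "Azure-RDMA-enable.sh"
    copy_cmd = [
        "sudo", "./paictl.py", "utility", "sftp-copy",
        "-p", config_path, "-n", script,
        "-s", "src/azure-rdma", "-d", dest, "-f", label,
    ]
    run_cmd = [
        "sudo", "./paictl.py", "utility", "ssh",
        "-p", config_path, "-f", label,
        "-c", f"sudo /bin/bash {dest}/{script}",
    ]
    return [copy_cmd, run_cmd]


# Dry run: print the commands, quoted for a shell, without executing them.
for cmd in rdma_enable_commands("/path/to/cluster/config"):
    print(shlex.join(cmd))
```

Both commands must use the same label filter and the same destination directory, otherwise the second command may run on machines that never received the script.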
After completing the steps above, ask your cluster owner to reboot the RDMA machines.
In services-configuration.yaml, uncomment the configuration field cluster.common.az-rdma and set its value to "true".
For example, modify it as follows:
cluster:
  #
  common:
  #  clusterid: pai
  #
  #  # HDFS, zookeeper data path on your cluster machine.
  #  data-path: "/datastorage"
  #
  #  # Enable QoS feature or not. Default value is "true"
  #  qos-switch: "true"
  #
  #  # If your cluster is created by Azure and the machine is rdma enabled.
  #  # Set this configuration as "true", the rdma environment will be set into your container.
    az-rdma: "true"
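A quick way to confirm the edit took effect is to check the parsed configuration. The sketch below assumes the file has already been loaded into a Python dictionary, and the function name `az_rdma_enabled` is hypothetical. Treating a missing field as disabled is also an assumption, not documented behavior.

```python
def az_rdma_enabled(services_config):
    """Return True if cluster.common.az-rdma is set to "true"."""
    # A key whose children are all commented out parses as None,
    # so fall back to an empty dict at each level.
    cluster = services_config.get("cluster") or {}
    common = cluster.get("common") or {}
    return str(common.get("az-rdma", "false")).lower() == "true"


# Configuration with the field uncommented, as in the example above.
print(az_rdma_enabled({"cluster": {"common": {"az-rdma": "true"}}}))
# Configuration where everything under `common` is still commented out.
print(az_rdma_enabled({"cluster": {"common": None}}))
```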
- If you want to enable the Azure RDMA feature in your cluster, make sure all worker machines in the cluster are Azure RDMA capable.
    - TODO: YARN should only schedule RDMA jobs to Azure RDMA-capable machines.
- After enabling the Azure RDMA feature, every time you add a machine to or remove a machine from the cluster, restart restserver to refresh its machine list.
    - TODO: Make restserver able to update its machine list from the configmap in a loop.
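Because the first note requires every worker to be RDMA capable, it can help to check the layout before enabling the feature. The following sketch lists workers that lack the label. It assumes the layout is already parsed into a dictionary and that workers are marked with `pai-worker: "true"` as in the example above; the function name `workers_missing_rdma` and the hostnames are hypothetical.

```python
def workers_missing_rdma(layout, label="rdma"):
    """Return hostnames of worker machines that lack the RDMA label."""
    return [
        machine["hostname"]
        for machine in layout.get("machine-list", [])
        if machine.get("pai-worker") == "true" and machine.get(label) != "true"
    ]


# Assumed parsed layout with one labeled and one unlabeled worker.
layout = {
    "machine-list": [
        {"hostname": "worker-a", "pai-worker": "true", "rdma": "true"},
        {"hostname": "worker-b", "pai-worker": "true"},
        {"hostname": "master-a", "pai-master": "true"},
    ]
}

missing = workers_missing_rdma(layout)
if missing:
    print("Workers without the RDMA label:", ", ".join(missing))
else:
    print("All workers carry the RDMA label.")
```

Note that the label only records what you declared in layout.yaml; it does not verify that the underlying Azure instance is an RDMA-capable size.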