The RDMA-capable instances (Important):
The cluster configuration options to enable RDMA (Important):
The network topology considerations (Important):
Based on the knowledge section, mark each RDMA-capable machine with a specific label in layout.yaml. You can choose any label name you like.
For example, this tutorial uses the following label.
- hostname: example-hosts
  hostip: x.x.x.x
  machine-type: example
  k8s-role: worker
  pai-worker: "true"
  # The label of RDMA-capable machines in this example
  rdma: "true"
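The label works as a selector: the commands in this tutorial pass `rdma=true` as a filter so that only the labeled machines are touched. The sketch below illustrates that idea only. It assumes the filter is a plain `key=value` equality match against each machine entry, and the `MACHINES` list and `select_by_label` function are hypothetical names that are not part of the PAI tooling.

```python
from typing import Dict, List

# Hypothetical in-memory form of the machine entries from layout.yaml.
MACHINES: List[Dict[str, str]] = [
    {"hostname": "worker-a", "k8s-role": "worker", "pai-worker": "true", "rdma": "true"},
    {"hostname": "worker-b", "k8s-role": "worker", "pai-worker": "true"},
    {"hostname": "master-a", "k8s-role": "master"},
]


def select_by_label(machines: List[Dict[str, str]], selector: str) -> List[Dict[str, str]]:
    """Return the machines whose entry matches a 'key=value' selector.

    Assumption: the filter is an exact string match on one label.
    """
    key, sep, value = selector.partition("=")
    if not sep or not key:
        raise ValueError(f"selector must look like key=value, got {selector!r}")
    # Machines without the label are skipped because .get() returns None.
    return [m for m in machines if m.get(key) == value]


if __name__ == "__main__":
    for machine in select_by_label(MACHINES, "rdma=true"):
        print(machine["hostname"])
```

Under this assumption, a machine that lacks the label is never selected, so forgetting to label a host means the later steps silently skip it.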
cd pai/
sudo ./ utility sftp-copy -p /path/to/cluster/config -n -s src/azure-rdma -d /tmp -f rdma=true
cd pai/
sudo ./ utility ssh -p /path/to/cluster/config -f rdma=true -c "sudo /bin/bash /tmp/"
Please ask your cluster owner to reboot the RDMA-capable machines after the following steps.
In services-configuration.yaml, uncomment the configuration field
and set its value to "true".
For example, modify it as follows.
# clusterid: pai
# # HDFS, zookeeper data path on your cluster machine.
# data-path: "/datastorage"
# # Enable QoS feature or not. Default value is "true"
# qos-switch: "true"
# # If your cluster is created by Azure and the machine is rdma enabled.
# # Set this configuration as "true", the rdma environment will be set into your container.
az-rdma: "true"
- To enable the Azure RDMA feature in your cluster, ensure that all worker machines in the cluster are Azure RDMA capable.
- TODO: YARN should schedule RDMA jobs only to Azure RDMA-capable machines.
- After enabling the Azure RDMA feature, restart restserver every time you add a machine to or remove a machine from the cluster, so that its machinelist is refreshed.
- TODO: Make restserver able to update the machinelist through configmap in a loop.
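Because every worker must be RDMA capable before the feature is enabled, it can help to check the machine list first. The following is a minimal sketch, not part of PAI. It assumes workers are the entries with `pai-worker: "true"` and that RDMA-capable machines carry the `rdma: "true"` label used in this tutorial. The function name `workers_missing_rdma` and the sample data are hypothetical.

```python
from typing import Dict, List


def workers_missing_rdma(machines: List[Dict[str, str]], label: str = "rdma") -> List[str]:
    """Return hostnames of worker machines that do not carry the RDMA label.

    Assumption: a worker is any entry with pai-worker set to "true".
    """
    return [
        m.get("hostname", "<unknown>")
        for m in machines
        if m.get("pai-worker") == "true" and m.get(label) != "true"
    ]


if __name__ == "__main__":
    # Hypothetical machine entries, written as they might appear after
    # loading layout.yaml into Python dictionaries.
    sample = [
        {"hostname": "worker-a", "pai-worker": "true", "rdma": "true"},
        {"hostname": "worker-b", "pai-worker": "true"},
        {"hostname": "master-a", "k8s-role": "master"},
    ]
    missing = workers_missing_rdma(sample)
    if missing:
        print("Workers without the RDMA label:", ", ".join(missing))
    else:
        print("All workers carry the RDMA label.")
```

An empty result means every worker is labeled. The check only confirms the label is present, not that the underlying instance actually supports RDMA.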