CNI compliant plugin for network namespace aware RDMA interfaces.
RDMA CNI plugin allows network namespace isolation for RDMA workloads in a containerized environment.
RDMA CNI plugin is intended to be run as a chained CNI plugin (introduced in CNI Specifications v0.3.0
).
It ensures isolation of RDMA traffic from other workloads in the system by moving the associated RDMA interfaces of the
provided network interface to the container's network namespace path.
The main use-case (for now...) is for containerized SR-IOV workloads orchestrated by Kubernetes
that perform RDMA and wish to leverage network namespace
isolation of RDMA devices introduced in linux kernel 5.3.0
.
SR-IOV capable NIC which supports RDMA.
ConnectX®-4 and above
Linux distribution
Kernel based on 5.3.0
or newer, RDMA modules loaded in the system.
rdma-core
package provides means to automatically load relevant modules
on system start.
Note: For deployments that use Mellanox out-of-tree driver (Mellanox OFED), Mellanox OFED version
4.7
or newer is required. In this case it is not required to use a Kernel based on5.3.0
or newer.
iproute2
package based on kernel 5.3.0
or newer
installed on the system.
Note: It is recommended that the required packages are installed by your system's package manager.
Note: For deployments using Mellanox OFED,
iproute2
package is bundled with the driver under/opt/mellanox/iproute2/
Please refer to the relevant link on how to deploy each component. For a Kubernetes deployment, each SR-IOV capable worker node should have:
- SR-IOV network device plugin deployed and configured with an RDMA enabled resource
- Multus CNI
v3.4.1
or newer deployed and configured - Per fabric SR-IOV supporting CNI deployed
- Ethernet: SR-IOV CNI
Note:: Kubernetes version 1.16 or newer is required for deploying as daemonset
{
"cniVersion": "0.3.1",
"type": "rdma",
"args": {
"cni": {
"debug": true
}
}
}
Note: "args" keyword is optional.
It is recommended to set RDMA subsystem namespace awareness mode to exclusive
on OS boot.
Set RDMA subsystem namespace awareness mode to exclusive
via ib_core
module parameter:
~$ echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
Set RDMA subsystem namespace awareness mode to exclusive
via rdma tool:
~$ rdma system set netns exclusive
Note: When changing RDMA subsystem netns mode, kernel requires that no network namespaces to exist in the system.
~$ kubectl apply -f ./deployment/rdma-cni-daemonset.yaml
Pod definition can be found in the example below.
~$ kubectl apply -f ./examples/my_rdma_test_pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-test-pod
annotations:
k8s.v1.cni.cncf.io/networks: sriov-rdma-net
spec:
containers:
- name: rdma-app
image: centos/tools
imagePullPolicy: IfNotPresent
command: [ "/bin/bash", "-c", "--" ]
args: [ "while true; do sleep 300000; done;" ]
resources:
requests:
mellanox.com/sriov_rdma: '1'
limits:
mellanox.com/sriov_rdma: '1'
The following yaml
defines an RDMA enabled SR-IOV resource pool named: mellanox.com/sriov_rdma
apiVersion: v1
kind: ConfigMap
metadata:
name: sriovdp-config
namespace: kube-system
data:
config.json: |
{
"resourceList": [
{
"resourcePrefix": "mellanox.com",
"resourceName": "sriov_rdma",
"selectors": {
"isRdma": true,
"vendors": ["15b3"],
"pfNames": ["enp4s0f0"]
}
}
]
}
The following yaml
defines a network, sriov-network
, associated with an rdma enabled resurce, mellanox.com/sriov_rdma
.
The CNI plugins that will be executed in a chain are for Pods that request this network are: sriov, rdma CNIs
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: sriov-rdma-net
annotations:
k8s.v1.cni.cncf.io/resourceName: mellanox.com/sriov_rdma
spec:
config: '{
"cniVersion": "0.3.1",
"name": "sriov-rdma-net",
"plugins": [{
"type": "sriov",
"ipam": {
"type": "host-local",
"subnet": "10.56.217.0/24",
"routes": [{
"dst": "0.0.0.0/0"
}],
"gateway": "10.56.217.1"
}
},
{
"type": "rdma"
}]
}'
It is recommended to use the same go version as defined in .travis.yml
to avoid potential build related issues during development (newer version will most likely work as well).
~$ git clone https://github.com/k8snetworkplumbingwg/rdma-cni.git
~$ cd rdma-cni
~$ make
Upon a successful build, rdma
binary can be found under ./build
.
For small deployments (e.g a kubernetes test cluster/AIO K8s deployment) you can:
- Copy
rdma
binary to the CNI dir in each worker node. - Build container image, push to your own image repo then modify the deployment template and deploy.
~$ make tests
~$ make image
For Mellanox Hardware, due to kernel limitation, it is required to pre-allocate MACs for all VFs in the deployment if an RDMA workload wishes to utilize RMDA CM to establish connection.
This is done in the following manner:
Set VF administrative MAC address :
$ ip link set <pf-netdev> vf <vf-index> mac <mac-address>
Unbind/Bind VF driver :
$ echo <vf-pci-address> > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo <vf-pci-address> > /sys/bus/pci/drivers/mlx5_core/bind
Example:
$ ip link set enp4s0f0 vf 3 mac 02:03:00:00:48:56
$ echo 0000:03:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo 0000:03:00.5 > /sys/bus/pci/drivers/mlx5_core/bind
Doing so will populate the VF's node and pord GUID required for RDMA CM to establish connection.