Adding Nvidia network operator for DGX #1113

Merged: 8 commits, Apr 13, 2022
4 changes: 4 additions & 0 deletions docs/k8s-cluster/README.md
@@ -220,6 +220,10 @@ Kubeflow is a popular way for multiple users to run ML workloads. It exposes a J

For more information on Kubeflow, please refer to the [official documentation](https://www.kubeflow.org/docs/about/kubeflow/).

### NVIDIA Network Operator

NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in a Kubernetes cluster. High-performance networking in Kubernetes requires many components, such as Multus CNI, device drivers, and device plugins, to be installed correctly. NVIDIA Network Operator manages all of those components under one operator framework to simplify the deployment, operation, and management of NVIDIA networking for Kubernetes. To deploy NVIDIA Network Operator, please refer to the [NVIDIA Network Operator Deployment Guide in DeepOps](nvidia-network-operator.md). For more information, please refer to its [GitHub](https://github.com/Mellanox/network-operator) page and this [solution guide](https://docs.nvidia.com/networking/display/COKAN10/Network+Operator).

## Cluster Maintenance

DeepOps uses [Kubespray](https://github.com/kubernetes-sigs/kubespray) to deploy Kubernetes and therefore common cluster actions (such as adding nodes, removing them, draining and upgrading the cluster) should be performed with it. Kubespray is included as a submodule in the submodules/kubespray directory.
234 changes: 234 additions & 0 deletions docs/k8s-cluster/nvidia-network-operator.md
@@ -0,0 +1,234 @@
Deploy NVIDIA Network Operator with DeepOps
===========================================

## Overview

NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in a Kubernetes cluster. One of the key components is SR-IOV, which partitions a single PCIe device into multiple Virtual Functions (VFs) and attaches them directly to Kubernetes pods without going through a virtualization layer on the hosts, enabling high-performance communication between workloads. High-performance networking in Kubernetes also requires a few other components, such as Multus CNI, device drivers, and device plugins. NVIDIA Network Operator manages all of those components under one operator framework to simplify the deployment, operation, and management of NVIDIA networking for Kubernetes.

Here are the key components that NVIDIA Network Operator deploys together:

* SR-IOV Virtual Function (VF) activation
* Multus CNI
* SR-IOV CNI for Kubernetes
* SR-IOV device plugin for Kubernetes
* Helm chart for NVIDIA Network Operator

This playbook also installs the latest Kubeflow MPI-Operator (currently API version v2beta1) for multi-node MPI jobs.

Currently only InfiniBand networking is supported in this implementation; RoCE networking support will be added shortly.


## Requirements and Tested Environment

This playbook was developed and tested in the following environment:

* NVIDIA DGX servers with DGX OS 5.1
* Mellanox ConnectX-6 VPI HCA
* Ansible 2.9.27 (deployed by DeepOps)
* Kubernetes v1.21.6 (deployed by DeepOps)
* Helm version v3.6.3 (deployed by DeepOps)
* NVIDIA Network Operator v1.1.0
* InfiniBand networking (Ethernet networking support will be added in the future)

## Deployment Steps

1. Make sure the underlying InfiniBand network works properly between the Kubernetes nodes. It's recommended to run some bare-metal micro-benchmark tests to verify the IB network is working as expected; for example, the NVIDIA perftest package can be used for that purpose.
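
A minimal bandwidth check with the perftest tools might look like the following sketch; the IB device name and hostnames are placeholders for your environment:

```sh
# On the first node (e.g. gpu01), start the server side of the test.
# "mlx5_0" is a placeholder for the IB device under test.
ib_write_bw -d mlx5_0 --report_gbits

# On a second node (e.g. gpu02), run the client side against the first node.
ib_write_bw -d mlx5_0 --report_gbits gpu01
```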

2. Enable IB port virtualization in the IB subnet manager (opensm). In our lab this is done on an IB switch:

```sh
IB_Switch (config) # ib sm virt enable
```
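
If opensm runs on a host rather than a managed switch, the equivalent setting is `virt_enabled 2` in the opensm configuration; a sketch, assuming the default config path on your system:

```sh
# Enable virtualization support in a host-based opensm and restart it.
# /etc/opensm/opensm.conf is a common location; verify the path first.
# If the option is not present, add the line "virt_enabled 2" manually.
sudo sed -i 's/^virt_enabled .*/virt_enabled 2/' /etc/opensm/opensm.conf
sudo systemctl restart opensm
```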

3. Verify SR-IOV is enabled in the BIOS and on the HCAs.

Use the following command to verify that SR-IOV and VFs are enabled on the ConnectX-6 HCAs; "0000:05:00.0" is the HCA's PCIe bus address.

```sh
sudo mlxconfig -d 0000:05:00.0 q | grep -i "sriov\|vfs"
```
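
If SR-IOV or the number of VFs is not configured as expected, the firmware settings can typically be changed with `mlxconfig` followed by a reboot; a sketch, with the PCIe address and VF count taken from the examples in this guide:

```sh
# Enable SR-IOV and provision 8 VFs in the HCA firmware, then reboot
# for the change to take effect.
sudo mlxconfig -d 0000:05:00.0 set SRIOV_EN=1 NUM_OF_VFS=8
sudo reboot
```
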
4. Set up the Kubernetes cluster

Kubernetes installation is done by the DeepOps Ansible playbooks. For more information on Ansible and why we use it, consult the [Ansible Guide](ANSIBLE.md).

- Install and configure DeepOps on the management node:

```sh
git clone https://github.com/NVIDIA/deepops.git
cd deepops/
./scripts/setup.sh
vi config/inventory
```
Configure the Ansible inventory by editing the `config/inventory` file, and verify connectivity to all nodes.
> NOTE: Be warned that `/etc/hostname` and `/etc/hosts` on each host will be modified to the name(s) specified in the inventory file, so it is best to use the actual names of the hosts.

When modifying the inventory, if the hosts are not accessible from the management node by their hostname, supply an `ansible_host` with its IP address. Example of the inventory file:

```yml
# in config/inventory...
[all]
mgmt01 ansible_host=192.168.1.11
gpu01 ansible_host=192.168.2.11
gpu02 ansible_host=192.168.3.11
...
[kube-master]
mgmt01
[kube-node]
gpu01
gpu02
```
- Add or modify user(s) across the cluster if necessary:
The Ansible scripts assume a consistent user that has access to all nodes in the cluster.
> Note: If a user with the same username, uid, and password exists on each node, skip this step. It is critical for the user to exist with the same uid across all nodes.

```sh
# The default user is `nvidia` with password `deepops`
# Modify this user/password in config/group_vars/all.yml as desired
vi config/group_vars/all.yml
```

Run the users playbook to create/modify the user across all nodes.

```sh
# NOTE: If SSH requires a password, add: `-k`
# NOTE: If sudo on remote machine requires a password, add: `-K`
# NOTE: If SSH user is different than current user, add: `-u <user>`
ansible-playbook -b playbooks/generic/users.yml
```
Verify the configuration

```sh
ansible all -m raw -a "hostname"
```

- Deploy and verify the Kubernetes cluster
Install Kubernetes using Ansible and Kubespray:

```sh
# NOTE: If SSH requires a password, add: `-k`
# NOTE: If sudo on remote machine requires a password, add: `-K`
# NOTE: If SSH user is different than current user, add: `-u ubuntu`
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
```
Please refer to the [DeepOps Kubernetes Deployment Guide](https://github.com/NVIDIA/deepops/blob/master/docs/kubernetes-cluster.md) for more information.

Verify that the Kubernetes cluster is working with the `kubectl get nodes` command:
```sh
nvidia@mgmt01:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
mgmt01 Ready master 1d8h v1.21.6
gpu01 Ready <none> 1d8h v1.21.6
gpu02 Ready <none> 1d8h v1.21.6
nvidia@mgmt01:~$
```
5. Deploy NVIDIA Network Operator
Before running the playbook, please update the "roles/nvidia-network-operator/vars/main.yml" file according to your hardware and network configuration. This is what we used in our environment:
```yml
num_vf: 8
vendor_id: "15b3"
link_type: "ib"
mtu: 4096

intf_resources:
- if_name: "ibs1"
pf_name: "ibs1"
res_name: "resibs1"
ip_addr: "192.168.101.0/24"
- if_name: "ibp12s0"
pf_name: "ibp12s0"
res_name: "resibp12s0"
ip_addr: "192.168.102.0/24"
...
## "15b3" is Mellanox vendor code for ConnectX cards.
```
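
To find the right values for your own systems, the physical interface names and the PCI vendor ID can be inspected on each node. A sketch; `ibdev2netdev` is assumed to be available (it ships with MLNX_OFED and DGX OS):

```sh
# Map IB devices to their netdev names; these become pf_name/if_name above.
ibdev2netdev

# List Mellanox devices with their [vendor:device] IDs; the vendor part
# (15b3) is the vendor_id used above.
lspci -nn | grep -i mellanox
```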
Run the playbook:
```sh
ansible-playbook playbooks/k8s-cluster/nvidia-network-operator.yaml
```
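
After the playbook completes, it may take a few minutes for the operator to roll out its components and for the SR-IOV resources to be advertised on the nodes. A few checks, using the node and resource names from the examples above:

```sh
# All network-operator pods should eventually reach the Running state.
kubectl -n network-operator get pods

# The node policies and IB networks created by the role.
kubectl -n network-operator get sriovnetworknodepolicies,sriovibnetworks

# SR-IOV resources (e.g. nvidia.com/resibs1) should appear as allocatable.
kubectl describe node gpu01 | grep -i "nvidia.com/res"
```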

## Running the Workload

The cluster is now ready to run multi-node workloads. The last step is to add the related interface configuration to the job file before launching your job.

### Using SR-IOV interfaces

Below is what the relevant section of the job file looks like after adding the SR-IOV interface configuration. A private Docker registry at 192.168.1.11 is used to host and manage the test images in this example; please refer to the Docker registry [documentation](https://docs.docker.com/registry/deploying/) for more details. Other container registries can be used as well.

```yml
Worker:
replicas: 2
template:
metadata:
annotations:
k8s.v1.cni.cncf.io/networks: ibs1,ibp12s0
spec:
containers:
- image: 192.168.1.11:5000/nccl-test
name: nccl-benchmark
securityContext:
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/resibs1: "1"
nvidia.com/resibp12s0: "1"
nvidia.com/gpu: 8
env:
- name: NCCL_IB_DISABLE
value: "0"
- name: NCCL_NET_GDR_LEVEL
value: "2"
```
`nvidia.com/resibs1` is the network resource where SR-IOV is enabled; it is also defined in "roles/nvidia-network-operator/vars/main.yaml" in this repository.

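For reference, the `Worker` section above sits inside an MPIJob from the MPI-Operator (API version v2beta1). A hypothetical outer skeleton of `nccl-test.yaml` is sketched below; the launcher command, slot count, and image name are placeholders to adapt to your cluster:

```yml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test
spec:
  slotsPerWorker: 8          # GPUs per worker in this example
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: 192.168.1.11:5000/nccl-test
            name: nccl-launcher
            # Placeholder command; adjust -np and the nccl-tests arguments
            # to match your worker and GPU counts.
            command: ["mpirun", "--allow-run-as-root", "-np", "16",
                      "all_reduce_perf", "-b", "8", "-e", "4G", "-f", "2", "-g", "1"]
    Worker:
      # ... the Worker block shown above goes here ...
```
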
Now you can launch the job with the usual Kubernetes command:

```sh
nvidia@mgmt01:~$ kubectl create -f nccl-test.yaml
```
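
The job can then be monitored with standard Kubernetes and MPI-Operator commands; a sketch, with the launcher pod name to be taken from your own cluster:

```sh
# Check the MPIJob status.
kubectl get mpijobs

# Watch the launcher and worker pods come up.
kubectl get pods -o wide

# Follow the NCCL output from the launcher pod (substitute the actual
# launcher pod name reported by "kubectl get pods").
kubectl logs -f <nccl-test-launcher-pod>
```
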
### NCCL AllReduce Test Result

Below is an NCCL all-reduce test result from a DGX-1 cluster with 4 x 100G HCA interfaces. NCCL delivers near-line-rate performance:

```sh
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 42.96 0.00 0.00 2e-07 32.87 0.00 0.00 1e-07
16 4 float sum 37.39 0.00 0.00 1e-07 32.98 0.00 0.00 1e-07
32 8 float sum 39.11 0.00 0.00 1e-07 34.82 0.00 0.00 1e-07
64 16 float sum 41.81 0.00 0.00 1e-07 34.66 0.00 0.00 6e-08
128 32 float sum 33.23 0.00 0.01 6e-08 38.19 0.00 0.01 6e-08
256 64 float sum 38.90 0.01 0.01 6e-08 33.20 0.01 0.01 6e-08
512 128 float sum 34.32 0.01 0.03 6e-08 32.05 0.02 0.03 6e-08
1024 256 float sum 38.84 0.03 0.05 2e-07 37.46 0.03 0.05 2e-07
2048 512 float sum 36.95 0.06 0.10 2e-07 37.23 0.06 0.10 2e-07
4096 1024 float sum 39.67 0.10 0.19 5e-07 42.29 0.10 0.18 5e-07
8192 2048 float sum 47.62 0.17 0.32 5e-07 45.39 0.18 0.34 5e-07
16384 4096 float sum 45.50 0.36 0.68 5e-07 46.02 0.36 0.67 5e-07
32768 8192 float sum 53.73 0.61 1.14 5e-07 58.91 0.56 1.04 5e-07
65536 16384 float sum 62.27 1.05 1.97 5e-07 66.98 0.98 1.83 5e-07
131072 32768 float sum 69.76 1.88 3.52 5e-07 74.26 1.76 3.31 5e-07
262144 65536 float sum 72.19 3.63 6.81 5e-07 77.27 3.39 6.36 5e-07
524288 131072 float sum 106.2 4.94 9.26 5e-07 104.7 5.01 9.39 5e-07
1048576 262144 float sum 127.9 8.20 15.38 5e-07 126.9 8.26 15.49 5e-07
2097152 524288 float sum 154.5 13.58 25.46 5e-07 153.4 13.67 25.63 5e-07
4194304 1048576 float sum 228.7 18.34 34.38 5e-07 229.6 18.27 34.25 5e-07
8388608 2097152 float sum 399.6 20.99 39.36 5e-07 407.6 20.58 38.59 5e-07
16777216 4194304 float sum 751.9 22.31 41.84 5e-07 749.7 22.38 41.96 5e-07
33554432 8388608 float sum 1437.3 23.35 43.77 5e-07 1431.7 23.44 43.94 5e-07
67108864 16777216 float sum 2677.0 25.07 47.00 5e-07 2732.0 24.56 46.06 5e-07
134217728 33554432 float sum 5292.9 25.36 47.55 5e-07 5300.1 25.32 47.48 5e-07
268435456 67108864 float sum 10540 25.47 47.75 5e-07 10545 25.46 47.73 5e-07
536870912 134217728 float sum 21099 25.45 47.71 5e-07 21010 25.55 47.91 5e-07
1073741824 268435456 float sum 41998 25.57 47.94 5e-07 41949 25.60 47.99 5e-07
2147483648 536870912 float sum 83868 25.61 48.01 5e-07 83730 25.65 48.09 5e-07
4294967296 1073741824 float sum 167263 25.68 48.15 5e-07 167543 25.64 48.07 5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 18.5822
```
Enjoy!

> Note: This is not a performance benchmark, so we did not fine-tune any hardware or software stack parameters. The results should be considered out-of-the-box numbers that can be observed in typical customer environments with the solution documented here. For more information about NCCL, see the following [blog post](https://devblogs.nvidia.com/scaling-deep-learning-training-nccl/).

10 changes: 10 additions & 0 deletions playbooks/k8s-cluster/nvidia-network-operator.yaml
@@ -0,0 +1,10 @@
---
## Playbook for installing nvidia-network-operator
#
- hosts: kube-master[0]
become: true
become_method: sudo
tasks:
- include_role:
name: nvidia-network-operator
tasks_from: main
41 changes: 41 additions & 0 deletions roles/nvidia-network-operator/tasks/main.yaml
@@ -0,0 +1,41 @@
---
# Deploy network operator related tasks on master node
#

- name: Label the nodes
shell: kubectl label --overwrite nodes {{ item }} node-role.kubernetes.io/worker=
with_items: "{{ groups['kube-node'] }}"

## required as the DeepOps openshift role doesn't work
- name: Install openshift
> **Collaborator:** Rather than install this here, add the existing openshift role as a dep. See roles/nvidia-gpu-operator/meta/main.yml for an example.
>
> **Contributor (Author):** I actually tried that existing role but it failed; we can take a closer look if taking that approach.
shell: pip3 install openshift

- name: Add helm repo for network operator
kubernetes.core.helm_repository:
name: mellanox
repo_url: "{{ nvidia_network_operator_url }}"

- name: Deploy network operator helm chart
kubernetes.core.helm:
name: network-operator
release_namespace: network-operator
chart_version: "{{ nvidia_network_operator_version }}"
chart_ref: mellanox/network-operator
create_namespace: true
update_repo_cache: true
wait: true
values: "{{ lookup('template', 'values.yaml') | from_yaml }}"

- name: Create network node policy
include_tasks: sriovnetworknodepolicy.yaml
with_items: "{{ intf_resources }}"

- name: Create IB network attachment definition
include_tasks: sriovibnetwork.yaml
with_items: "{{ intf_resources }}"

- name: Install latest Kubeflow MPI-Operator
k8s:
state: present
definition: "{{ lookup('url', mpi_raw_url ~ '/mpi-operator.yaml', split_lines=False) }}"
run_once: true
30 changes: 30 additions & 0 deletions roles/nvidia-network-operator/tasks/sriovibnetwork.yaml
@@ -0,0 +1,30 @@
---
## This example configures the IB interfaces used by the K8s cluster.
##
- name: Create sriov network definition
vars:
spec_yaml: |
resourceName: "{{ item.res_name }}"
linkState: "enable"
networkNamespace: "default"
ipam: |
{
"type": "whereabouts",
"datastore": "kubernetes",
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
},
"range": "{{ item.ip_addr }}",
"log_file": "/var/log/whereabouts.log",
"log_level": "info"
}
k8s:
state: present
definition:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
name: "{{ item.if_name }}"
namespace: network-operator
spec: "{{ spec_yaml | from_yaml }}"
run_once: true
28 changes: 28 additions & 0 deletions roles/nvidia-network-operator/tasks/sriovnetworknodepolicy.yaml
@@ -0,0 +1,28 @@
---
## This example configures SriovNetworkNodePolicy for IB interfaces
##
- name: Create sriovnetwork node policy
vars:
spec_yaml: |-
deviceType: netdevice
mtu: {{ mtu |int }}
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
nicSelector:
vendor: "{{ vendor_id }}"
pfNames: ["{{ item.pf_name }}"]
linkType: "{{ link_type }}"
isRdma: true
numVfs: {{ num_vf |int }}
priority: 90
resourceName: "{{ item.res_name }}"
k8s:
state: present
definition:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: "{{ item.if_name }}"
namespace: network-operator
spec: "{{ spec_yaml | from_yaml }}"
run_once: true
27 changes: 27 additions & 0 deletions roles/nvidia-network-operator/templates/values.yaml
@@ -0,0 +1,27 @@
---
#
# Default setting for DGX systems with IB networking
#

nfd:
enabled: true
sriovNetworkOperator:
enabled: true

# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
deploy: false
rdmaSharedDevicePlugin:
deploy: false
sriovDevicePlugin:
deploy: false

secondaryNetwork:
deploy: true
multus:
deploy: true
cniPlugins:
deploy: true
ipamPlugin:
deploy: true