
GPU Operator 1.9 support with DGX inclusion #1074

Merged (28 commits) on Mar 8, 2022

Conversation

@supertetelman (Collaborator) commented Dec 9, 2021:

  • Bump GPU Operator
  • Bump Device Plugin
  • Remove node-prep role as GPU Operator now handles that logic internally
  • Introduce flags to enable/disable GPU Operator components (migmanager, dcgm, driver, toolkit); see the config sketch after this list
  • Dynamically detect DGX systems and disable toolkit & driver features
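
A rough sketch of how these toggles might look in a DeepOps config file; the flag names appear elsewhere in this PR, while the values shown are illustrative:

# Toggle individual GPU Operator components
gpu_operator_enable_dcgm: false
gpu_operator_enable_migmanager: true

# Set to true for DGX and other systems with pre-installed drivers so the
# GPU Operator runs with its driver and toolkit components disabled
gpu_operator_preinstalled_nvidia_software: true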

Test matrix:

  • Install K8s on DGX, verify the system is properly detected and the GPU jobs run (a sample GPU test job is sketched after this list)
  • Install on non-DGX without GPU Operator, verify new device plugin versions work
  • Install on non-DGX with GPU Operator, verify new gpu operator versions work
  • Install with components disabled to verify flags are pushed through
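
For reference, a minimal GPU test job of the kind that could be used to verify GPU scheduling; the job name and image tag are illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi-test                # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04   # illustrative tag
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1          # schedules onto a node exposing GPUs via the device plugin

Applying this with kubectl and checking the job logs for nvidia-smi output is enough to confirm GPUs are being exposed to pods.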

TODO:

  • GPU Operator monitoring integration needs to be completed before the GPU Operator install can be set as the default
  • Additional tests should be introduced to uninstall and re-install the GPU Operator with different flags
  • The automated testing should be double-checked to make sure we aren't double-testing things
  • NVIDIA AI Enterprise does not yet support GPU Operator v1.9.0, so we should hold off on merging this for a few weeks until that support is added

@@ -1,4 +1,15 @@
---
- name: Check for DGX packages
@supertetelman (Collaborator, Author) commented:

This is not going to work.

I was testing this on a single DGX Station, but in a real deployment scenario the admin node will not have an /etc/dgx-release file even if all the workers are DGX systems.

We need to come up with a better method to determine if there are any DGX OS systems within the cluster.

Contributor commented:

we could still run a task like this on every node, then tag them as 'dgx' in k8s
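
A rough sketch of that approach, assuming the usual stat-and-label pattern (the label key, delegation target, and use of inventory_hostname as the node name are illustrative assumptions):

- name: check for DGX OS on each node
  stat:
    path: /etc/dgx-release
  register: dgx_release

- name: tag DGX nodes in Kubernetes
  command: kubectl label node {{ inventory_hostname }} deepops/dgx=true --overwrite   # illustrative label key
  delegate_to: "{{ groups['kube-master'][0] }}"   # assumes a kube-master group exists
  when: dgx_release.stat.exists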

@supertetelman (Collaborator, Author) commented:

I bet the GPU Operator is already adding a tag for that, let me check.
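
One quick way to check, assuming the operator's gpu-feature-discovery / node-feature-discovery labels (e.g. nvidia.com/gpu.present) are applied; <node-name> is a placeholder:

$ kubectl get node <node-name> --show-labels | tr ',' '\n' | grep nvidia.com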

@@ -1,14 +1,18 @@
---
Contributor commented:

do we still need this role? you removed the only place it's referenced

@supertetelman (Collaborator, Author) commented:

In theory, no. Potentially, if someone wants to roll back and use an older version of the GPU Operator.

- name: unload nouveau
  modprobe:
    name: nouveau
    state: absent
  ignore_errors: true

- name: blacklist nouveau
- name: blocklist nouveau
Contributor commented:

'blacklist' is correct, not 'blocklist'
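
For context, 'blacklist' is also the literal modprobe configuration keyword. A sketch of what such a task typically writes (the path and options line are illustrative, not necessarily what this role does):

- name: blacklist nouveau
  copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf   # illustrative path
    content: |
      blacklist nouveau
      options nouveau modeset=0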

@supertetelman (Collaborator, Author) commented:

That goes against the new NVIDIA inclusive language policy.

    path: /etc/dgx-release
  register: is_dgx

- name: Set DGX specific GPU Operator flags
Contributor commented:

is this going to work with a heterogeneous set of GPU nodes (DGX and non-DGX)? this looks like it would leave non-DGX nodes without a driver and toolkit

@supertetelman (Collaborator, Author) commented:

No, it will not work. Need to figure that out.

@supertetelman (Collaborator, Author) commented:

This implementation will not change the current behavior unless gpu_operator_preinstalled_nvidia_software is set to true.

If gpu_operator_preinstalled_nvidia_software is true (which it should be for any cluster with a DGX or pre-installed drivers), the nvidia-driver and nvidia-docker playbooks will be run on all nodes, ensuring they have the software installed without modifying nodes that are DGX systems. The GPU Operator will then run with driver.enabled=false and toolkit.enabled=false.
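
In GPU Operator Helm chart terms, that corresponds to values along these lines (a sketch; the exact wiring through the DeepOps playbooks may differ):

# GPU Operator chart values when drivers and the container toolkit are pre-installed on the hosts
driver:
  enabled: false
toolkit:
  enabled: false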

@supertetelman (Collaborator, Author) commented:

GPU Operator now works with deploy_monitoring.sh.

I tested the stack with the default settings (device plugin) and with the GPU Operator on a DGX (drivers/containers already installed).

@supertetelman changed the title from "[WIP] GPU Operator 1.9 support with DGX inclusion" to "GPU Operator 1.9 support with DGX inclusion" on Feb 12, 2022
@supertetelman marked this pull request as ready for review on February 12, 2022 04:02
@ajdecon self-assigned this on Feb 16, 2022
@ajdecon (Collaborator) left a comment:

Summary

Tested using a single-node cluster with a DGX A100 running DGX OS 5.2.0.

  • Successfully deployed Kubernetes with the GPU Operator 1.9.
  • Successfully deployed monitoring stack and confirmed that GPU metrics are being collected (see the spot-check after this list)
  • Confirmed that NFS client provisioner is correctly deployed
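
One way to spot-check the GPU metrics, assuming the default dcgm-exporter port (9400) and metric names; the pod name is taken from the listing below:

$ kubectl -n gpu-operator-resources port-forward nvidia-dcgm-exporter-kld6w 9400:9400 &
$ curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL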

Demonstration of monitoring dashboard: (screenshot)

Check that NFS client provisioner is present:

$ kubectl get sc
NAME                   PROVISIONER                                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client (default)   cluster.local/nfs-subdir-external-provisioner   Delete          Immediate           true                   6m32s

Verify that the GPU Operator deployment is successful:

$ kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-pqf2w                                       1/1     Running     0          8m43s
gpu-operator-84b88fc49c-cmhp7                                     1/1     Running     0          9m8s
nvidia-cuda-validator-99zj7                                       0/1     Completed   0          8m34s
nvidia-dcgm-exporter-kld6w                                        1/1     Running     0          8m43s
nvidia-device-plugin-daemonset-rzctw                              1/1     Running     0          8m43s
nvidia-device-plugin-validator-jrvfg                              0/1     Completed   0          8m22s
nvidia-gpu-operator-node-feature-discovery-master-74db7c56vbwgt   1/1     Running     0          9m8s
nvidia-gpu-operator-node-feature-discovery-worker-6hhhn           1/1     Running     0          9m9s
nvidia-mig-manager-5fg42                                          1/1     Running     0          7m43s
nvidia-operator-validator-ps4bn                                   1/1     Running     0          8m43s

Requested changes and notes

In order to successfully deploy, I had to make a change to the hostlist specification for the nvidia-driver.yml and nvidia-docker.yml playbooks. See inline comments.

It's also worth noting that I had to employ a workaround for DGX OS 5.2.0, due to #1110. This PR doesn't need to be changed for that issue (the fix depends on a Kubespray upgrade in the next release), but I wanted to note it here for testing purposes.

$ ansible-playbook -b -e docker_version=latest -e containerd_version=latest playbooks/k8s-cluster.yml

@@ -125,15 +125,19 @@

# Install NVIDIA driver on GPU servers
- include: nvidia-software/nvidia-driver.yml
  vars:
    hostlist: "{{ k8s-cluster }}"
Collaborator commented:

Should be:

vars:
  hostlist: "k8s-cluster"

(The value should be the literal inventory group name; "{{ k8s-cluster }}" gets templated as a Jinja expression, and the hyphenated group name isn't a valid variable reference.)


# Install NVIDIA container runtime on GPU servers
- include: container/nvidia-docker.yml
  vars:
    hostlist: "{{ k8s-cluster }}"
Collaborator commented:

Should be:

vars:
  hostlist: "k8s-cluster"

gpu_operator_enable_dcgm: false
gpu_operator_enable_migmanager: true

# Set to true fo DGX and other systems with pre-installed drivers
Collaborator commented:

Nit:

< Set to true fo DGX and other systems with pre-installed drivers
---
> Set to true for DGX and other systems with pre-installed drivers

@ajdecon (Collaborator) left a comment:

LGTM!

@ajdecon merged commit dbf9956 into NVIDIA:master on Mar 8, 2022
@ajdecon mentioned this pull request on Apr 26, 2022