
GPU Operator 1.9 support with DGX inclusion #1074

Merged (28 commits) on Mar 8, 2022

Conversation

@supertetelman (Collaborator) commented Dec 9, 2021:

  • Bump GPU Operator
  • Bump Device Plugin
  • Remove node-prep role as GPU Operator now handles that logic internally
  • Introduce flags to enable/disable GPU Operator components (migmanager, dcgm, driver, toolkit); see the config sketch after this list
  • Dynamically detect DGX systems and disable toolkit & driver features
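
A rough sketch of how these toggles might look in a DeepOps config file; the flag names appear elsewhere in this PR, while the values shown are illustrative:

# Toggle individual GPU Operator components
gpu_operator_enable_dcgm: false
gpu_operator_enable_migmanager: true

# Set to true for DGX and other systems with pre-installed drivers so the
# GPU Operator runs with its driver and toolkit components disabled
gpu_operator_preinstalled_nvidia_software: true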

Test matrix:

  • Install K8s on DGX, verify the system is properly detected and the GPU jobs run (a sample GPU test job is sketched after this list)
  • Install on non-DGX without GPU Operator, verify new device plugin versions work
  • Install on non-DGX with GPU Operator, verify new gpu operator versions work
  • Install with components disabled to verify flags are pushed through
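
For reference, a minimal GPU test job of the kind that could be used to verify GPU scheduling; the job name and image tag are illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi-test                # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04   # illustrative tag
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1          # schedules onto a node exposing GPUs via the device plugin

Applying this with kubectl and checking the job logs for nvidia-smi output is enough to confirm GPUs are being exposed to pods.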

TODO:

  • GPU Operator monitoring integration needs to be completed before the GPU Operator install can be set as the default
  • Additional tests should be introduced to uninstall and re-install the GPU Operator with different flags
  • The automated testing should be double-checked to make sure we aren't double-testing things
  • NVIDIA AI Enterprise does not yet support GPU Operator v1.9.0, so we should hold off on merging this for a few weeks until that support is added

@@ -1,4 +1,15 @@
---
- name: Check for DGX packages
@supertetelman (Collaborator, Author) commented:

This is not going to work.

I was testing this on a single DGX Station, but in a real deployment scenario the admin node will not have an /etc/dgx-release file even if all the workers are DGX systems.

We need to come up with a better method to determine if there are any DGX OS systems within the cluster.

Contributor commented:

we could still run a task like this on every node, then tag them as 'dgx' in k8s
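
A rough sketch of that approach, assuming the usual stat-and-label pattern (the label key, delegation target, and use of inventory_hostname as the node name are illustrative assumptions):

- name: check for DGX OS on each node
  stat:
    path: /etc/dgx-release
  register: dgx_release

- name: tag DGX nodes in Kubernetes
  command: kubectl label node {{ inventory_hostname }} deepops/dgx=true --overwrite   # illustrative label key
  delegate_to: "{{ groups['kube-master'][0] }}"   # assumes a kube-master group exists
  when: dgx_release.stat.exists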

@supertetelman (Collaborator, Author) commented:

I bet the GPU Operator is already adding a tag for that, let me check.
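
One quick way to check, assuming the operator's gpu-feature-discovery / node-feature-discovery labels (e.g. nvidia.com/gpu.present) are applied; <node-name> is a placeholder:

$ kubectl get node <node-name> --show-labels | tr ',' '\n' | grep nvidia.com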

@@ -1,14 +1,18 @@
---
Contributor commented:

do we still need this role? you removed the only place it's referenced

@supertetelman (Collaborator, Author) commented:

In theory, no. Potentially, if someone wants to roll back and use an older version of the GPU Operator.

- name: unload nouveau
  modprobe:
    name: nouveau
    state: absent
  ignore_errors: true

- name: blacklist nouveau
- name: blocklist nouveau
Contributor commented:

'blacklist' is correct, not 'blocklist'
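
For context, 'blacklist' is also the literal modprobe configuration keyword. A sketch of what such a task typically writes (the path and options line are illustrative, not necessarily what this role does):

- name: blacklist nouveau
  copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf   # illustrative path
    content: |
      blacklist nouveau
      options nouveau modeset=0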

@supertetelman (Collaborator, Author) commented:

That goes against the new NVIDIA inclusive language policy.

    path: /etc/dgx-release
  register: is_dgx

- name: Set DGX specific GPU Operator flags
Contributor commented:

is this going to work with a heterogeneous set of GPU nodes (DGX and non-DGX)? this looks like it would leave non-DGX nodes without a driver and toolkit

@supertetelman (Collaborator, Author) commented:

No, it will not work. Need to figure that out.

@supertetelman (Collaborator, Author) commented:

This implementation will not change the current behavior unless gpu_operator_preinstalled_nvidia_software is set to true.

If gpu_operator_preinstalled_nvidia_software is true (which it should be for any cluster with a DGX or pre-installed drivers), the nvidia-driver and nvidia-docker playbooks will be run on all nodes, ensuring they have the software installed without modifying nodes that are DGX systems. The GPU Operator will then run with driver.enabled=false and toolkit.enabled=false.
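
In GPU Operator Helm chart terms, that corresponds to values along these lines (a sketch; the exact wiring through the DeepOps playbooks may differ):

# GPU Operator chart values when drivers and the container toolkit are pre-installed on the hosts
driver:
  enabled: false
toolkit:
  enabled: false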

@supertetelman (Collaborator, Author) commented:

GPU Operator now works with deploy_monitoring.sh.

I tested the stack with the default settings (device plugin) and with the GPU Operator on a DGX (drivers/containers already installed).

@supertetelman changed the title from "[WIP] GPU Operator 1.9 support with DGX inclusion" to "GPU Operator 1.9 support with DGX inclusion" on Feb 12, 2022
@supertetelman marked this pull request as ready for review on February 12, 2022 04:02
@ajdecon self-assigned this on Feb 16, 2022
@ajdecon (Collaborator) left a comment:

Summary

Tested using a single-node cluster with a DGX A100 running DGX OS 5.2.0.

  • Successfully deployed Kubernetes with the GPU Operator 1.9.
  • Successfully deployed monitoring stack and confirmed that GPU metrics are being collected (see the spot-check after this list)
  • Confirmed that NFS client provisioner is correctly deployed
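
One way to spot-check the GPU metrics, assuming the default dcgm-exporter port (9400) and metric names; the pod name is taken from the listing below:

$ kubectl -n gpu-operator-resources port-forward nvidia-dcgm-exporter-kld6w 9400:9400 &
$ curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL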

Demonstration of monitoring dashboard: (screenshot)

Check that NFS client provisioner is present:

$ kubectl get sc
NAME                   PROVISIONER                                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client (default)   cluster.local/nfs-subdir-external-provisioner   Delete          Immediate           true                   6m32s

Verify that the GPU Operator deployment is successful:

$ kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-pqf2w                                       1/1     Running     0          8m43s
gpu-operator-84b88fc49c-cmhp7                                     1/1     Running     0          9m8s
nvidia-cuda-validator-99zj7                                       0/1     Completed   0          8m34s
nvidia-dcgm-exporter-kld6w                                        1/1     Running     0          8m43s
nvidia-device-plugin-daemonset-rzctw                              1/1     Running     0          8m43s
nvidia-device-plugin-validator-jrvfg                              0/1     Completed   0          8m22s
nvidia-gpu-operator-node-feature-discovery-master-74db7c56vbwgt   1/1     Running     0          9m8s
nvidia-gpu-operator-node-feature-discovery-worker-6hhhn           1/1     Running     0          9m9s
nvidia-mig-manager-5fg42                                          1/1     Running     0          7m43s
nvidia-operator-validator-ps4bn                                   1/1     Running     0          8m43s

Requested changes and notes

In order to successfully deploy, I had to make a change to the hostlist specification for the nvidia-driver.yml and nvidia-docker.yml playbooks. See inline comments.

It's also worth noting that I had to employ a workaround for DGX OS 5.2.0, due to #1110. This PR doesn't need to be changed for that issue (the fix depends on a Kubespray upgrade in the next release), but I wanted to note it here for testing purposes.

$ ansible-playbook -b -e docker_version=latest -e containerd_version=latest playbooks/k8s-cluster.yml

@@ -125,15 +125,19 @@

# Install NVIDIA driver on GPU servers
- include: nvidia-software/nvidia-driver.yml
  vars:
    hostlist: "{{ k8s-cluster }}"
Collaborator commented:

Should be:

vars:
  hostlist: "k8s-cluster"

(The value should be the literal inventory group name; "{{ k8s-cluster }}" gets templated as a Jinja expression, and the hyphenated group name isn't a valid variable reference.)


# Install NVIDIA container runtime on GPU servers
- include: container/nvidia-docker.yml
  vars:
    hostlist: "{{ k8s-cluster }}"
Collaborator commented:

Should be:

vars:
  hostlist: "k8s-cluster"

gpu_operator_enable_dcgm: false
gpu_operator_enable_migmanager: true

# Set to true fo DGX and other systems with pre-installed drivers
Collaborator commented:

Nit:

< Set to true fo DGX and other systems with pre-installed drivers
---
> Set to true for DGX and other systems with pre-installed drivers

@ajdecon (Collaborator) left a comment:

LGTM!

@ajdecon merged commit dbf9956 into NVIDIA:master on Mar 8, 2022
@ajdecon mentioned this pull request on Apr 26, 2022