GPU Operator 1.9 support with DGX inclusion #1074
Conversation
@@ -1,4 +1,15 @@
---
- name: Check for DGX packages
This is not going to work.
I was testing this on a single DGX Station, but in a real deployment scenario the admin node will not have an /etc/dgx-release file even if all the workers are DGX systems.
We need to come up with a better method to determine whether there are any DGX OS systems within the cluster.
we could still run a task like this on every node, then tag them as 'dgx' in k8s
I bet the GPU Operator is already adding a tag for that, let me check.
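A rough sketch of the per-node approach suggested above: stat /etc/dgx-release on every node, then label the DGX nodes in Kubernetes. The label key and the "kube-master" group name here are illustrative, not settled conventions:

- hosts: all
  tasks:
    - name: Check for DGX release file
      stat:
        path: /etc/dgx-release
      register: dgx_release

    - name: Label DGX nodes in Kubernetes
      command: kubectl label node {{ inventory_hostname }} deepops/dgx=true --overwrite
      delegate_to: "{{ groups['kube-master'][0] }}"
      when: dgx_release.stat.exists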
@@ -1,14 +1,18 @@
---
do we still need this role? you removed the only place it's referenced
In theory, no. If someone wants to roll back and use an older version of the GPU Operator, potentially.
- name: unload nouveau
  modprobe:
    name: nouveau
    state: absent
  ignore_errors: true

- name: blacklist nouveau
- name: blocklist nouveau
'blacklist' is correct, not 'blocklist'
That goes against the new NVIDIA inclusive language policy.
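Worth noting either way: the modprobe configuration keyword itself is the literal word "blacklist", so only the Ansible task name is free to change. A minimal sketch of the task (the config file path here is an assumption):

- name: blocklist nouveau
  copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf   # path assumed for illustration
    content: |
      blacklist nouveau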
    path: /etc/dgx-release
  register: is_dgx

- name: Set DGX specific GPU Operator flags
is this going to work with a heterogeneous set of GPU nodes (DGX and non-DGX)? this looks like it would leave non-DGX nodes without a driver and toolkit
No, it will not work. Need to figure that out.
This implementation will not change the current behavior unless …
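For context, a minimal sketch of the Helm values the pre-installed case implies, assuming the chart's standard driver and toolkit toggles. Because these values apply chart-wide, disabling them leaves any non-DGX node without a driver, which is the heterogeneity problem raised above:

driver:
  enabled: false   # host already provides the NVIDIA driver
toolkit:
  enabled: false   # host already provides the NVIDIA container toolkit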
GPU Operator now works with deploy_monitoring.sh. I tested the stack with the default settings (device-plugin) and the GPU Operator on a DGX (drivers/containers already installed).
Summary
Tested using a single-node cluster with a DGX A100 running DGX OS 5.2.0.
- Successfully deployed Kubernetes with the GPU Operator 1.9.
- Successfully deployed the monitoring stack and confirmed that GPU metrics are being collected.
- Confirmed that the NFS client provisioner is correctly deployed.

Demonstration of the monitoring dashboard: (screenshot)
Check that the NFS client provisioner is present:
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
nfs-client (default) cluster.local/nfs-subdir-external-provisioner Delete Immediate true 6m32s
Verify that the GPU Operator deployment is successful:
$ kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-pqf2w 1/1 Running 0 8m43s
gpu-operator-84b88fc49c-cmhp7 1/1 Running 0 9m8s
nvidia-cuda-validator-99zj7 0/1 Completed 0 8m34s
nvidia-dcgm-exporter-kld6w 1/1 Running 0 8m43s
nvidia-device-plugin-daemonset-rzctw 1/1 Running 0 8m43s
nvidia-device-plugin-validator-jrvfg 0/1 Completed 0 8m22s
nvidia-gpu-operator-node-feature-discovery-master-74db7c56vbwgt 1/1 Running 0 9m8s
nvidia-gpu-operator-node-feature-discovery-worker-6hhhn 1/1 Running 0 9m9s
nvidia-mig-manager-5fg42 1/1 Running 0 7m43s
nvidia-operator-validator-ps4bn 1/1 Running 0 8m43s
Requested changes and notes
In order to successfully deploy, I had to make a change to the hostlist specification for the nvidia-driver.yml and nvidia-docker.yml playbooks. See inline comments.
It's also worth noting that I had to employ a workaround for DGX OS 5.2.0, due to #1110. This PR doesn't need to be changed for that issue; it depends on a Kubespray upgrade in the next release, but I wanted to note it here for testing purposes.
$ ansible-playbook -b -e docker_version=latest -e containerd_version=latest playbooks/k8s-cluster.yml
playbooks/k8s-cluster.yml
@@ -125,15 +125,19 @@

# Install NVIDIA driver on GPU servers
- include: nvidia-software/nvidia-driver.yml
  vars:
    hostlist: "{{ k8s-cluster }}"
Should be:

  vars:
    hostlist: "k8s-cluster"

"{{ k8s-cluster }}" is evaluated as a Jinja2 expression, where the hyphen parses as subtraction of two undefined variables, so the group name has to be passed as a literal string.
playbooks/k8s-cluster.yml
# Install NVIDIA container runtime on GPU servers
- include: container/nvidia-docker.yml
  vars:
    hostlist: "{{ k8s-cluster }}"
Should be:

  vars:
    hostlist: "k8s-cluster"
gpu_operator_enable_dcgm: false
gpu_operator_enable_migmanager: true

# Set to true fo DGX and other systems with pre-installed drivers
Nit:
< Set to true fo DGX and other systems with pre-installed drivers
---
> Set to true for DGX and other systems with pre-installed drivers
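For reference, a sketch of how the surrounding config block could read with the typo fixed and the new toggle shown; the variable name gpu_operator_preinstalled_nvidia_software is assumed from context, not confirmed by the diff:

gpu_operator_enable_dcgm: false
gpu_operator_enable_migmanager: true

# Set to true for DGX and other systems with pre-installed drivers
gpu_operator_preinstalled_nvidia_software: false   # variable name assumed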
LGTM!
Test matrix:
TODO: