GPU Operator automation with NVIDIA AI Enterprise #1059
@iamadrigal : LGTM!
Tested on a three-node VM cluster with two worker nodes hosting A100 GPUs (thanks for providing the test platform!). I ran the following checks to validate the deployment:
```
# On the k8s control plane node:
nvidia@deepops-admin:~$ sudo kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-4rbnl 1/1 Running 0 19m
gpu-feature-discovery-gwg2v 1/1 Running 0 19m
gpu-operator-7bcc547564-vf5nq 1/1 Running 0 19m
nvidia-container-toolkit-daemonset-9rxrm 1/1 Running 0 19m
nvidia-container-toolkit-daemonset-tm24d 1/1 Running 0 19m
nvidia-cuda-validator-kgcxl 0/1 Completed 0 17m
nvidia-cuda-validator-mdldv 0/1 Completed 0 17m
nvidia-dcgm-exporter-n48tb 1/1 Running 0 19m
nvidia-dcgm-exporter-r5q5c 1/1 Running 0 19m
nvidia-device-plugin-daemonset-7p6zc 1/1 Running 0 19m
nvidia-device-plugin-daemonset-w47mb 1/1 Running 0 19m
nvidia-device-plugin-validator-df8hb 0/1 Completed 0 16m
nvidia-device-plugin-validator-ftvhl 0/1 Completed 0 17m
nvidia-driver-daemonset-2www4 1/1 Running 0 19m
nvidia-driver-daemonset-vs9f5 1/1 Running 0 19m
nvidia-gpu-operator-node-feature-discovery-master-74db7c56lxmd7 1/1 Running 0 19m
nvidia-gpu-operator-node-feature-discovery-worker-77nt8 1/1 Running 0 19m
nvidia-gpu-operator-node-feature-discovery-worker-bld82 1/1 Running 0 19m
nvidia-gpu-operator-node-feature-discovery-worker-tv4vl 1/1 Running 0 19m
nvidia-mig-manager-bhqkl 1/1 Running 0 17m
nvidia-mig-manager-ff52n 1/1 Running 0 16m
nvidia-operator-validator-7nhgn 1/1 Running 0 19m
nvidia-operator-validator-flpvc 1/1 Running 0 19m
nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-cuda-validator-kgcxl
cuda workload validation is successful
nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-device-plugin-validator-df8hb
device-plugin workload validation is successful
nvidia@deepops-admin:~$ sudo kubectl exec -n gpu-operator-resources --stdin --tty nvidia-device-plugin-daemonset-7p6zc -- /bin/bash
[root@nvidia-device-plugin-daemonset-7p6zc /]# nvidia-smi -L
GPU 0: GRID A100-2-10C (UUID: GPU-1dfdcca5-28f9-11b2-904f-4c80d4e1ed0c)
  MIG 2g.10gb Device  0: (UUID: MIG-f1029a2c-305f-5223-af78-3136dd9fde27)
```
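As a further end-to-end check beyond the operator's own validator pods, one can schedule a workload against an advertised MIG resource. A minimal sketch, assuming the device plugin runs with the mixed MIG strategy and therefore advertises the `nvidia.com/mig-2g.10gb` resource corresponding to the devices shown above; the pod name and CUDA image tag are illustrative:

```yaml
# mig-smoke-test.yaml -- hypothetical validation pod (name and image are examples)
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-smi
      image: nvidia/cuda:11.4.2-base-ubuntu20.04  # any CUDA base image works here
      command: ["nvidia-smi", "-L"]               # should list one MIG 2g.10gb device
      resources:
        limits:
          nvidia.com/mig-2g.10gb: 1  # resource name assumes the "mixed" MIG strategy
```

After `sudo kubectl apply -f mig-smoke-test.yaml`, the pod's logs should show the same `MIG 2g.10gb` device listing as the exec session above.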
Note: to deploy successfully, I had to make the following changes to a default DeepOps configuration:
```yaml
# Required configuration for NVAIE
deepops_gpu_operator_enabled: true
gpu_operator_nvaie_enable: true
gpu_operator_chart_version: "1.8.1"
gpu_operator_driver_registry: "nvcr.io/nvaie"
gpu_operator_driver_version: "470.63.01"
gpu_operator_registry_email: "<my-email-address>"
gpu_operator_registry_password: "<my-ngc-key>"
gpu_operator_nvaie_nls_token: "<my-nls-token>"
```
For documentation purposes, I'd suggest adding the lines above to the config.example/group_vars/k8s-cluster.yml file, commented out, as an example of how to configure NVAIE (see the sketch below).
I don't think that's a blocker to approving this PR, but if you want to add those lines to the example config, I'll re-approve once it's done!
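For reference, a sketch of what that commented-out block in config.example/group_vars/k8s-cluster.yml could look like (variable names and values are copied from the working configuration above; the placeholders are illustrative):

```yaml
# Example NVIDIA AI Enterprise (NVAIE) configuration for the GPU Operator.
# Uncomment and fill in the placeholders to enable NVAIE on a vGPU cluster.
# deepops_gpu_operator_enabled: true
# gpu_operator_nvaie_enable: true
# gpu_operator_chart_version: "1.8.1"
# gpu_operator_driver_registry: "nvcr.io/nvaie"
# gpu_operator_driver_version: "470.63.01"
# gpu_operator_registry_email: "<your-email-address>"
# gpu_operator_registry_password: "<your-ngc-api-key>"
# gpu_operator_nvaie_nls_token: "<your-nls-client-token>"
```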
This PR contains modifications to enable GPU Operator configuration when deploying DeepOps on vGPU clusters running NVIDIA AI Enterprise.