GPU Operator automation with NVIDIA AI Enterprise #1059

Merged
merged 4 commits on Dec 3, 2021

Conversation

iamadrigal (Contributor) commented:

This PR contains modifications to enable GPU Operator configuration when using deepops on vGPU clusters using NVIDIA AI Enterprise.

@ajdecon (Collaborator) left a comment:


@iamadrigal: LGTM!

Tested on a three-node VM cluster with two worker nodes hosting A100 GPUs (thanks for providing the test platform!). I ran the following checks to validate the deployment:

# On the k8s control plane node:

nvidia@deepops-admin:~$ sudo kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4rbnl                                       1/1     Running     0          19m
gpu-feature-discovery-gwg2v                                       1/1     Running     0          19m
gpu-operator-7bcc547564-vf5nq                                     1/1     Running     0          19m
nvidia-container-toolkit-daemonset-9rxrm                          1/1     Running     0          19m
nvidia-container-toolkit-daemonset-tm24d                          1/1     Running     0          19m
nvidia-cuda-validator-kgcxl                                       0/1     Completed   0          17m
nvidia-cuda-validator-mdldv                                       0/1     Completed   0          17m
nvidia-dcgm-exporter-n48tb                                        1/1     Running     0          19m
nvidia-dcgm-exporter-r5q5c                                        1/1     Running     0          19m
nvidia-device-plugin-daemonset-7p6zc                              1/1     Running     0          19m
nvidia-device-plugin-daemonset-w47mb                              1/1     Running     0          19m
nvidia-device-plugin-validator-df8hb                              0/1     Completed   0          16m
nvidia-device-plugin-validator-ftvhl                              0/1     Completed   0          17m
nvidia-driver-daemonset-2www4                                     1/1     Running     0          19m
nvidia-driver-daemonset-vs9f5                                     1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-master-74db7c56lxmd7   1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-77nt8           1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-bld82           1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-tv4vl           1/1     Running     0          19m
nvidia-mig-manager-bhqkl                                          1/1     Running     0          17m
nvidia-mig-manager-ff52n                                          1/1     Running     0          16m
nvidia-operator-validator-7nhgn                                   1/1     Running     0          19m
nvidia-operator-validator-flpvc                                   1/1     Running     0          19m

nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-cuda-validator-kgcxl
cuda workload validation is successful

nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-device-plugin-validator-df8hb
device-plugin workload validation is successful


nvidia@deepops-admin:~$ sudo kubectl exec -n gpu-operator-resources --stdin --tty nvidia-device-plugin-daemonset-7p6zc -- /bin/bash
[root@nvidia-device-plugin-daemonset-7p6zc /]# nvidia-smi -L
GPU 0: GRID A100-2-10C (UUID: GPU-1dfdcca5-28f9-11b2-904f-4c80d4e1ed0c)
  MIG 2g.10gb     Device  0: (UUID: MIG-f1029a2c-305f-5223-af78-3136dd9fde27)

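As one more optional sanity check, a small CUDA test pod along the lines of the sketch below could confirm that the vGPU resources are actually schedulable. This is only an illustration: the resource name assumes the device plugin's default/single MIG strategy, and the sample image tag is taken from the usual GPU Operator validation examples rather than from this PR.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Illustrative sample image; any small CUDA workload image works here
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      resources:
        limits:
          # Assumption: default/single MIG strategy; with a mixed strategy
          # this would be a MIG-specific resource such as nvidia.com/mig-2g.10gb
          nvidia.com/gpu: 1

Applying this pod and checking its logs should print a successful vector-add result, much like the CUDA validator output above.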
Note: to deploy successfully, I had to make the following changes to the default DeepOps configuration:

# Required configuration for NVAIE
deepops_gpu_operator_enabled: true
gpu_operator_nvaie_enable: true
gpu_operator_chart_version: "1.8.1"
gpu_operator_driver_registry: "nvcr.io/nvaie"
gpu_operator_driver_version: "470.63.01"
gpu_operator_registry_email: "<my-email-address>"
gpu_operator_registry_password: "<my-ngc-key>"
gpu_operator_nvaie_nls_token: "<my-nls-token>"

For documentation purposes, I'd suggest adding the lines above to the config.example/group_vars/k8s-cluster.yml file, commented out, as an example of how to configure NVAIE.

I don't think that's a blocker to approving the PR, but if you want to add those lines to the example config, I'll re-approve once that's done!
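For reference, a sketch of how that commented-out example might look in config.example/group_vars/k8s-cluster.yml (placeholder values only; the chart and driver versions are simply the ones used in this test, not verified defaults):

# (Optional) NVIDIA AI Enterprise (NVAIE) support for the GPU Operator
# deepops_gpu_operator_enabled: true
# gpu_operator_nvaie_enable: true
# gpu_operator_chart_version: "1.8.1"
# gpu_operator_driver_registry: "nvcr.io/nvaie"
# gpu_operator_driver_version: "470.63.01"
# gpu_operator_registry_email: "<my-email-address>"
# gpu_operator_registry_password: "<my-ngc-key>"
# gpu_operator_nvaie_nls_token: "<my-nls-token>"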
