GPU Operator automation with NVIDIA AI Enterprise #1059

Merged
merged 4 commits on Dec 3, 2021

Conversation

iamadrigal (Contributor) commented:

This PR contains modifications to enable GPU Operator configuration when using deepops on vGPU clusters using NVIDIA AI Enterprise.

@ajdecon (Collaborator) left a comment:


@iamadrigal: LGTM!

Tested on a three-node VM cluster with two worker nodes hosting A100 GPUs (thanks for providing the test platform!). I ran the following checks to validate the deployment:

# On the k8s control plane node:

nvidia@deepops-admin:~$ sudo kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4rbnl                                       1/1     Running     0          19m
gpu-feature-discovery-gwg2v                                       1/1     Running     0          19m
gpu-operator-7bcc547564-vf5nq                                     1/1     Running     0          19m
nvidia-container-toolkit-daemonset-9rxrm                          1/1     Running     0          19m
nvidia-container-toolkit-daemonset-tm24d                          1/1     Running     0          19m
nvidia-cuda-validator-kgcxl                                       0/1     Completed   0          17m
nvidia-cuda-validator-mdldv                                       0/1     Completed   0          17m
nvidia-dcgm-exporter-n48tb                                        1/1     Running     0          19m
nvidia-dcgm-exporter-r5q5c                                        1/1     Running     0          19m
nvidia-device-plugin-daemonset-7p6zc                              1/1     Running     0          19m
nvidia-device-plugin-daemonset-w47mb                              1/1     Running     0          19m
nvidia-device-plugin-validator-df8hb                              0/1     Completed   0          16m
nvidia-device-plugin-validator-ftvhl                              0/1     Completed   0          17m
nvidia-driver-daemonset-2www4                                     1/1     Running     0          19m
nvidia-driver-daemonset-vs9f5                                     1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-master-74db7c56lxmd7   1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-77nt8           1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-bld82           1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-tv4vl           1/1     Running     0          19m
nvidia-mig-manager-bhqkl                                          1/1     Running     0          17m
nvidia-mig-manager-ff52n                                          1/1     Running     0          16m
nvidia-operator-validator-7nhgn                                   1/1     Running     0          19m
nvidia-operator-validator-flpvc                                   1/1     Running     0          19m

nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-cuda-validator-kgcxl
cuda workload validation is successful

nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-device-plugin-validator-df8hb
device-plugin workload validation is successful


nvidia@deepops-admin:~$ sudo kubectl exec -n gpu-operator-resources --stdin --tty nvidia-device-plugin-daemonset-7p6zc -- /bin/bash
[root@nvidia-device-plugin-daemonset-7p6zc /]# nvidia-smi -L
GPU 0: GRID A100-2-10C (UUID: GPU-1dfdcca5-28f9-11b2-904f-4c80d4e1ed0c)
  MIG 2g.10gb     Device  0: (UUID: MIG-f1029a2c-305f-5223-af78-3136dd9fde27)

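As one more optional sanity check, a small CUDA test pod along the lines of the sketch below could confirm that the vGPU resources are actually schedulable. This is only an illustration: the resource name assumes the device plugin's default/single MIG strategy, and the sample image tag is taken from the usual GPU Operator validation examples rather than from this PR.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Illustrative sample image; any small CUDA workload image works here
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      resources:
        limits:
          # Assumption: default/single MIG strategy; with a mixed strategy
          # this would be a MIG-specific resource such as nvidia.com/mig-2g.10gb
          nvidia.com/gpu: 1

Applying this pod and checking its logs should print a successful vector-add result, much like the CUDA validator output above.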
Note: to deploy successfully, I had to make the following changes to the default DeepOps configuration:

# Required configuration for NVAIE
deepops_gpu_operator_enabled: true
gpu_operator_nvaie_enable: true
gpu_operator_chart_version: "1.8.1"
gpu_operator_driver_registry: "nvcr.io/nvaie"
gpu_operator_driver_version: "470.63.01"
gpu_operator_registry_email: "<my-email-address>"
gpu_operator_registry_password: "<my-ngc-key>"
gpu_operator_nvaie_nls_token: "<my-nls-token>"

For documentation purposes, I'd suggest adding the lines above to the config.example/group_vars/k8s-cluster.yml file, commented out, as an example of how to configure NVAIE.

I don't think that's a blocker to approving the PR, but if you want to add those lines to the example config, I'll re-approve once that's done!
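For reference, a sketch of how that commented-out example might look in config.example/group_vars/k8s-cluster.yml (placeholder values only; the chart and driver versions are simply the ones used in this test, not verified defaults):

# (Optional) NVIDIA AI Enterprise (NVAIE) support for the GPU Operator
# deepops_gpu_operator_enabled: true
# gpu_operator_nvaie_enable: true
# gpu_operator_chart_version: "1.8.1"
# gpu_operator_driver_registry: "nvcr.io/nvaie"
# gpu_operator_driver_version: "470.63.01"
# gpu_operator_registry_email: "<my-email-address>"
# gpu_operator_registry_password: "<my-ngc-key>"
# gpu_operator_nvaie_nls_token: "<my-nls-token>"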
