include::../attributes.txt[]

[.topic]
[#ml-eks-k8s-device-plugin]
= Install Kubernetes device plugin for GPUs
:info_titleabbrev: Install device plugin for GPUs

Kubernetes https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/[device plugins] have been the primary mechanism for advertising specialized infrastructure such as GPUs and network adapters as consumable resources for Kubernetes workloads. While https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/[Dynamic Resource Allocation] (DRA) is positioned as the future of device management in Kubernetes, most specialized infrastructure providers are early in their support for DRA drivers. Kubernetes device plugins remain a widely available approach for using GPUs in Kubernetes clusters today.
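
A device plugin advertises a named extended resource (such as `nvidia.com/gpu`) on each node where the device is present, and workloads consume that resource through requests and limits. As a minimal sketch of the consumption side (the procedures below include complete manifests), a container requests a GPU like this:

[source,yaml]
----
# Sketch: a container consumes a device-plugin resource through limits.
# The scheduler places the Pod only on a node with an unallocated GPU.
resources:
  limits:
    nvidia.com/gpu: 1
----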

== Considerations

* When using the EKS-optimized AL2023 AMIs with NVIDIA GPUs, you must install the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA Kubernetes device plugin]. You can install and manage it with Helm, your choice of Kubernetes tooling, or the NVIDIA GPU Operator.
* When using the EKS-optimized Bottlerocket AMIs with NVIDIA GPUs, you do not need to install the NVIDIA Kubernetes device plugin, because it is already included in those AMIs. This also applies when you use GPU instances with EKS Auto Mode.
* When using the EKS-optimized AL2023 or Bottlerocket AMIs with AWS Inferentia or Trainium accelerators, you must install the Neuron Kubernetes device plugin, and can optionally install the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension]. For more information, see the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Neuron documentation for running on EKS].

[#eks-nvidia-device-plugin]
== Install NVIDIA Kubernetes device plugin

The following procedure describes how to install the NVIDIA Kubernetes device plugin and run a sample test on NVIDIA GPU instances.

=== Prerequisites

* An existing EKS cluster
* NVIDIA GPU nodes running in the cluster using the EKS-optimized AL2023 NVIDIA AMI
* Helm installed in your command-line environment. See the <<helm,Setup Helm instructions>>.

=== Procedure

. Add the `nvdp` Helm chart repository.
+
[source,bash]
----
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
----
+
. Update your local Helm repository to make sure that you have the most recent charts.
+
[source,bash]
----
helm repo update
----
+
. Get the latest version of the NVIDIA Kubernetes device plugin.
+
[source,bash]
----
helm search repo nvdp --devel
----
+
[source,bash]
----
NAME                         CHART VERSION  APP VERSION  DESCRIPTION
nvdp/gpu-feature-discovery   0.17.4         0.17.4       ...
nvdp/nvidia-device-plugin    0.17.4         0.17.4       ...
----
+
. Install the NVIDIA Kubernetes device plugin on your cluster, replacing `0.17.4` with the latest version returned by the command above.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia \
  --create-namespace \
  --version [.replaceable]`0.17.4` \
  --set gfd.enabled=true
----
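+
Optionally, confirm that the Helm release deployed before checking the DaemonSet. This is a quick sanity check, not a required step:
+
[source,bash]
----
helm list --namespace nvidia
----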
+
. Verify that the NVIDIA Kubernetes device plugin is running in your cluster. The example output below is from a cluster with two nodes.
+
[source,bash]
----
kubectl get ds -n nvidia nvdp-nvidia-device-plugin
----
+
[source,bash]
----
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvdp-nvidia-device-plugin   2         2         2       2            2           <none>          11m
----
+
. Verify that your nodes have allocatable GPUs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
----
+
[source,bash]
----
NAME                                           GPU
ip-192-168-11-225.us-west-2.compute.internal   1
ip-192-168-24-96.us-west-2.compute.internal    1
----
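+
Because the install command enables GPU Feature Discovery (`gfd.enabled=true`), GPU nodes also receive labels describing their hardware. As an optional check (label names assume the default GFD configuration), you can print the GPU product label for each node:
+
[source,bash]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,PRODUCT:.metadata.labels.nvidia\.com/gpu\.product"
----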
+
. Create a file named `nvidia-smi.yaml` with the following contents. This manifest launches a https://docs.aws.amazon.com/linux/al2023/ug/minimal-container.html[minimal AL2023 container image] that runs `nvidia-smi` on a node.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-demo
    image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
    command: ['/bin/sh', '-c']
    args: ['nvidia-smi && tail -f /dev/null']
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: 'nvidia.com/gpu'
    operator: 'Equal'
    value: 'true'
    effect: 'NoSchedule'
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f nvidia-smi.yaml
----
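+
The container keeps running after `nvidia-smi` completes (because of the trailing `tail -f /dev/null`), so you can wait for the Pod to become ready before reading its logs:
+
[source,bash]
----
kubectl wait --for=condition=Ready pod/nvidia-smi --timeout=120s
----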
. Once the Pod is running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs nvidia-smi
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XXX.XX             Driver Version: XXX.XXX.XX     CUDA Version: XX.X     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   27C    P8             11W /  72W  |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
----
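. When you finish testing, delete the sample Pod.
+
[source,bash]
----
kubectl delete -f nvidia-smi.yaml
----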

[#eks-neuron-device-plugin]
== Install Neuron Kubernetes device plugin

The following procedure describes how to install the Neuron Kubernetes device plugin and run a sample test on an Inferentia instance.

=== Prerequisites

* An existing EKS cluster
* Nodes with AWS Inferentia or Trainium accelerators running in the cluster using the EKS-optimized AL2023 Neuron AMI or Bottlerocket AMI
* Helm installed in your command-line environment. See the <<helm,Setup Helm instructions>>.

=== Procedure

. Install the Neuron Kubernetes device plugin on your cluster. The `npd.enabled=false` setting skips the chart's optional node problem detection component.
+
[source,bash]
----
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
  --set "npd.enabled=false"
----
+
. Verify that the Neuron Kubernetes device plugin is running in your cluster. The example output below is from a cluster with a single Neuron node.
+
[source,bash]
----
kubectl get ds -n kube-system neuron-device-plugin
----
+
[source,bash]
----
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin   1         1         1       1            1           <none>          72s
----
+
. Verify that your nodes have allocatable NeuronCores with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
----
+
[source,bash]
----
NAME                                           NeuronCore
ip-192-168-47-173.us-west-2.compute.internal   2
----
+
. Verify that your nodes have allocatable NeuronDevices with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron"
----
+
[source,bash]
----
NAME                                           NeuronDevice
ip-192-168-47-173.us-west-2.compute.internal   1
----
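+
Workloads can request either whole Neuron devices (`aws.amazon.com/neuron`, as the test manifest below does) or individual NeuronCores. A minimal sketch of a container resource request for a single NeuronCore:
+
[source,yaml]
----
# Sketch: request one NeuronCore rather than a whole Neuron device.
resources:
  limits:
    aws.amazon.com/neuroncore: 1
----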
+
. Create a file named `neuron-ls.yaml` with the following contents. This manifest launches a https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html[Neuron Monitor] container that has the `neuron-ls` tool installed.
+
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: neuron-ls
spec:
  restartPolicy: Never
  containers:
  - name: neuron-container
    image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
    command: ["/bin/sh"]
    args: ["-c", "neuron-ls"]
    resources:
      limits:
        aws.amazon.com/neuron: 1
  tolerations:
  - key: "aws.amazon.com/neuron"
    operator: "Exists"
    effect: "NoSchedule"
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f neuron-ls.yaml
----
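+
This Pod runs `neuron-ls` once and then exits, so you can wait until its phase is `Succeeded` before reading its logs (the `--for=jsonpath` form requires a reasonably recent kubectl):
+
[source,bash]
----
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/neuron-ls --timeout=120s
----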
. After the Pod has finished running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs neuron-ls
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
instance-type: inf2.xlarge
instance-id: ...
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
----
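. When you finish testing, delete the sample Pod.
+
[source,bash]
----
kubectl delete -f neuron-ls.yaml
----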