Commit 03c98d7

Merge pull request #1143 from csplinter/docs-gpu-amis

Update EKS-optimized GPU AMI docs

2 parents e87e8a0 + 7a41584

6 files changed: +430 −194 lines changed

Lines changed in this file: 267 additions, 0 deletions
include::../attributes.txt[]

[.topic]
[#ml-eks-k8s-device-plugin]
= Install Kubernetes device plugin for GPUs
:info_titleabbrev: Install device plugin for GPUs

Kubernetes https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/[device plugins] have been the primary mechanism for advertising specialized infrastructure such as GPUs, network interfaces, and other network adapters as consumable resources for Kubernetes workloads. While https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/[Dynamic Resource Allocation] (DRA) is positioned as the future of device management in Kubernetes, most specialized infrastructure providers are still early in their support for DRA drivers. Kubernetes device plugins remain a widely available approach for using GPUs in Kubernetes clusters today.
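
Once a device plugin registers with the kubelet, the hardware it manages appears as an extended resource in the node's status, and Pods consume it through resource requests and limits. As a minimal sketch, you can confirm what a node advertises with the following command; `node-name` is a placeholder for one of your nodes.

[source,bash,subs="verbatim,attributes,quotes"]
----
# List the allocatable resources on a node, including any device plugin
# resources such as nvidia.com/gpu or aws.amazon.com/neuron.
kubectl get node [.replaceable]`node-name` -o jsonpath='{.status.allocatable}'
----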

== Considerations

* When using the EKS-optimized AL2023 AMIs with NVIDIA GPUs, you must install the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA Kubernetes device plugin]. You can install and manage it with Helm, your choice of Kubernetes tooling, or the NVIDIA GPU operator (a sketch of the GPU operator install follows this list).
* When using the EKS-optimized Bottlerocket AMIs with NVIDIA GPUs, you do not need to install the NVIDIA Kubernetes device plugin, because it is already included in those AMIs. This also applies when you use GPU instances with EKS Auto Mode.
* When using the EKS-optimized AL2023 or Bottlerocket AMIs with AWS Inferentia or Trainium accelerators, you must install the Neuron Kubernetes device plugin, and can optionally install the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension]. For more information, see the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Neuron documentation for running on EKS].
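
If you prefer the NVIDIA GPU operator route mentioned above, it can also be installed with Helm. The following is a minimal sketch, assuming the standard NVIDIA Helm repository; it disables the operator's driver install because the EKS-optimized accelerated AMIs already include the NVIDIA driver.

[source,bash]
----
# Add the NVIDIA Helm repository.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU operator. driver.enabled=false assumes the node AMI
# already ships the NVIDIA driver, as the EKS-optimized accelerated AMIs do.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false
----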

[#eks-nvidia-device-plugin]
== Install NVIDIA Kubernetes device plugin

The following procedure describes how to install the NVIDIA Kubernetes device plugin and run a sample test on NVIDIA GPU instances.

=== Prerequisites

* An existing EKS cluster
* NVIDIA GPU nodes running in the cluster using the EKS-optimized AL2023 NVIDIA AMI
* Helm installed in your command-line environment; see the <<helm,Setup Helm instructions>>

=== Procedure

. Add the `nvdp` Helm chart repository.
+
[source,bash]
----
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
----
+
. Update your local Helm repository to make sure that you have the most recent charts.
+
[source,bash]
----
helm repo update
----
+
. Get the latest version of the NVIDIA Kubernetes device plugin.
+
[source,bash]
----
helm search repo nvdp --devel
----
+
[source,bash]
----
NAME                          CHART VERSION   APP VERSION   DESCRIPTION
nvdp/gpu-feature-discovery    0.17.4          0.17.4        ...
nvdp/nvidia-device-plugin     0.17.4          0.17.4        ...
----
+
. Install the NVIDIA Kubernetes device plugin on your cluster, replacing `0.17.4` with the latest version from the command above.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia \
  --create-namespace \
  --version [.replaceable]`0.17.4` \
  --set gfd.enabled=true
----
+
. Verify that the NVIDIA Kubernetes device plugin is running in your cluster. The example below shows the output for a cluster with two GPU nodes.
+
[source,bash]
----
kubectl get ds -n nvidia nvdp-nvidia-device-plugin
----
+
[source,bash]
----
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvdp-nvidia-device-plugin   2         2         2       2            2           <none>          11m
----
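+
Optionally, you can check the device plugin's logs for registration errors. This is a quick sanity check; `ds/nvdp-nvidia-device-plugin` is the DaemonSet created by the Helm install above.
+
[source,bash]
----
# Tail logs from one Pod of the device plugin DaemonSet.
kubectl logs -n nvidia ds/nvdp-nvidia-device-plugin --tail=20
----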
+
. Verify that your nodes have allocatable GPUs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
----
+
[source,bash]
----
NAME                                           GPU
ip-192-168-11-225.us-west-2.compute.internal   1
ip-192-168-24-96.us-west-2.compute.internal    1
----
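+
Because the install step above enabled GPU feature discovery (`gfd.enabled=true`), you can also inspect the GPU labels it applies to nodes. The label names below are ones GPU feature discovery typically publishes; treat them as illustrative.
+
[source,bash]
----
# List nodes with the GPU product and count labels added by GPU feature discovery.
kubectl get nodes -L nvidia.com/gpu.product -L nvidia.com/gpu.count
----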
+
. Create a file named `nvidia-smi.yaml` with the following contents. This manifest launches a https://docs.aws.amazon.com/linux/al2023/ug/minimal-container.html[minimal AL2023 container image] that runs `nvidia-smi` on a node.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-demo
    image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
    command: ['/bin/sh', '-c']
    args: ['nvidia-smi && tail -f /dev/null']
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: 'nvidia.com/gpu'
    operator: 'Equal'
    value: 'true'
    effect: 'NoSchedule'
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f nvidia-smi.yaml
----
. After the Pod has started running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs nvidia-smi
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XXX.XX             Driver Version: XXX.XXX.XX     CUDA Version: XX.X     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   27C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
----
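+
. (Optional) When you are done testing, delete the sample Pod.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl delete -f nvidia-smi.yaml
----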

[#eks-neuron-device-plugin]
== Install Neuron Kubernetes device plugin

The following procedure describes how to install the Neuron Kubernetes device plugin and run a sample test on an Inferentia instance.

=== Prerequisites

* An existing EKS cluster
* Neuron nodes running in the cluster using the EKS-optimized AL2023 Neuron AMI or Bottlerocket AMI
* Helm installed in your command-line environment; see the <<helm,Setup Helm instructions>>

=== Procedure

. Install the Neuron Kubernetes device plugin on your cluster.
+
[source,bash]
----
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
  --set "npd.enabled=false"
----
+
. Verify that the Neuron Kubernetes device plugin is running in your cluster. The example below shows the output for a cluster with a single Neuron node.
+
[source,bash]
----
kubectl get ds -n kube-system neuron-device-plugin
----
+
[source,bash]
----
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin   1         1         1       1            1           <none>          72s
----
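+
Optionally, check the device plugin's logs to confirm that it registered the Neuron devices; `ds/neuron-device-plugin` is the DaemonSet name shown above.
+
[source,bash]
----
# Tail logs from one Pod of the Neuron device plugin DaemonSet.
kubectl logs -n kube-system ds/neuron-device-plugin --tail=20
----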
+
. Verify that your nodes have allocatable NeuronCores with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
----
+
[source,bash]
----
NAME                                           NeuronCore
ip-192-168-47-173.us-west-2.compute.internal   2
----
+
. Verify that your nodes have allocatable NeuronDevices with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron"
----
+
[source,bash]
----
NAME                                           NeuronDevice
ip-192-168-47-173.us-west-2.compute.internal   1
----
+
. Create a file named `neuron-ls.yaml` with the following contents. This manifest launches a https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html[Neuron Monitor] container that has the `neuron-ls` tool installed.
+
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: neuron-ls
spec:
  restartPolicy: Never
  containers:
  - name: neuron-container
    image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
    command: ["/bin/sh"]
    args: ["-c", "neuron-ls"]
    resources:
      limits:
        aws.amazon.com/neuron: 1
  tolerations:
  - key: "aws.amazon.com/neuron"
    operator: "Exists"
    effect: "NoSchedule"
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f neuron-ls.yaml
----
. After the Pod has finished running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs neuron-ls
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
instance-type: inf2.xlarge
instance-id: ...
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
----
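+
. (Optional) When you are done testing, delete the sample Pod.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl delete -f neuron-ls.yaml
----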
