include::../attributes.txt[]

[.topic]
[#ml-eks-k8s-device-plugin]
= Install Kubernetes device plugin for GPUs
:info_titleabbrev: Install device plugin for GPUs

Kubernetes https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/[device plugins] have been the primary mechanism for advertising specialized hardware, such as GPUs and network adapters, as consumable resources for Kubernetes workloads. While https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/[Dynamic Resource Allocation] (DRA) is positioned as the future of device management in Kubernetes, most specialized hardware providers are still early in their support for DRA drivers. Kubernetes device plugins remain a widely available approach for using GPUs in Kubernetes clusters today.
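
After a device plugin is installed, workloads consume the devices it advertises by requesting the plugin's extended resource name in their Pod spec. The following manifest is an illustrative sketch of that pattern only: `example.com/device` is a placeholder resource name, not a real plugin. The NVIDIA and Neuron device plugins covered in this topic advertise `nvidia.com/gpu` and `aws.amazon.com/neuron`, respectively.

[source,yaml]
----
# Illustrative sketch only. "example.com/device" is a placeholder extended
# resource name; substitute the name advertised by your device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  containers:
    - name: app
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c', 'sleep 3600']
      resources:
        limits:
          # The scheduler places this Pod only on a node that has an
          # unallocated device of this type.
          example.com/device: 1
----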

== Considerations

* When using the EKS-optimized AL2023 AMIs with NVIDIA GPUs, you must install the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA Kubernetes device plugin]. You can install and manage the NVIDIA Kubernetes device plugin with Helm, your choice of Kubernetes tooling, or the NVIDIA GPU operator.
* When using the EKS-optimized Bottlerocket AMIs with NVIDIA GPUs, you do not need to install the NVIDIA Kubernetes device plugin, as it is already included in the EKS-optimized Bottlerocket AMIs. This includes when you use GPU instances with EKS Auto Mode.
* When using the EKS-optimized AL2023 or Bottlerocket AMIs with AWS Inferentia or Trainium accelerators, you must install the Neuron Kubernetes device plugin, and can optionally install the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension]. For more information, see the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Neuron documentation for running on EKS].

[#eks-nvidia-device-plugin]
== Install NVIDIA Kubernetes device plugin

The following procedure describes how to install the NVIDIA Kubernetes device plugin and run a sample test on NVIDIA GPU instances.

=== Prerequisites

* An existing EKS cluster
* NVIDIA GPU nodes running in the cluster using the EKS-optimized AL2023 NVIDIA AMI
* Helm installed in your command-line environment. For instructions, see <<helm,Setup Helm instructions>>.

=== Procedure

. Add the `nvdp` Helm chart repository.
+
[source,bash]
----
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
----
+
. Update your local Helm repository to make sure that you have the most recent charts.
+
[source,bash]
----
helm repo update
----
+
. Get the latest version of the NVIDIA Kubernetes device plugin.
+
[source,bash]
----
helm search repo nvdp --devel
----
+
[source,bash]
----
NAME                         CHART VERSION   APP VERSION   DESCRIPTION
nvdp/gpu-feature-discovery   0.17.4          0.17.4        ...
nvdp/nvidia-device-plugin    0.17.4          0.17.4        ...
----
+
. Install the NVIDIA Kubernetes device plugin on your cluster, replacing `0.17.4` with the latest version from the command above.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia \
  --create-namespace \
  --version [.replaceable]`0.17.4` \
  --set gfd.enabled=true
----
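+
Optionally, confirm that the Helm release deployed successfully with the following command.
+
[source,bash]
----
helm list -n nvidia
----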
+
. Verify that the NVIDIA Kubernetes device plugin is running in your cluster. The following example output shows a cluster with two nodes.
+
[source,bash]
----
kubectl get ds -n nvidia nvdp-nvidia-device-plugin
----
+
[source,bash]
----
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvdp-nvidia-device-plugin   2         2         2       2            2           <none>          11m
----
+
. Verify that your nodes have allocatable GPUs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
----
+
[source,bash]
----
NAME                                           GPU
ip-192-168-11-225.us-west-2.compute.internal   1
ip-192-168-24-96.us-west-2.compute.internal    1
----
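+
Optionally, you can inspect the GPU capacity of a single node in more detail. The node name below is an example taken from the output above; substitute a node name from your cluster.
+
[source,bash]
----
kubectl describe node ip-192-168-11-225.us-west-2.compute.internal | grep nvidia.com/gpu
----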
+
. Create a file named `nvidia-smi.yaml` with the following contents. This manifest launches a https://docs.aws.amazon.com/linux/al2023/ug/minimal-container.html[minimal AL2023 container image] that runs `nvidia-smi` on a node.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi && tail -f /dev/null']
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f nvidia-smi.yaml
----
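+
The container keeps running after `nvidia-smi` completes because of the `tail -f /dev/null` command, so you can optionally wait for the Pod to become ready before viewing its logs. The 120-second timeout is an arbitrary example.
+
[source,bash]
----
kubectl wait --for=condition=Ready pod/nvidia-smi --timeout=120s
----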
. After the Pod is running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs nvidia-smi
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XXX.XX             Driver Version: XXX.XXX.XX     CUDA Version: XX.X     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   27C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
----
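. (Optional) When you finish testing, delete the sample Pod with the following command.
+
[source,bash]
----
kubectl delete -f nvidia-smi.yaml
----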

[#eks-neuron-device-plugin]
== Install Neuron Kubernetes device plugin

The following procedure describes how to install the Neuron Kubernetes device plugin and run a sample test on an Inferentia instance.

=== Prerequisites

* An existing EKS cluster
* AWS Inferentia or Trainium (Neuron) nodes running in the cluster using the EKS-optimized AL2023 Neuron AMI or Bottlerocket AMI
* Helm installed in your command-line environment. For instructions, see <<helm,Setup Helm instructions>>.

=== Procedure

. Install the Neuron Kubernetes device plugin on your cluster.
+
[source,bash]
----
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
  --set "npd.enabled=false"
----
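+
Optionally, confirm that the Helm release deployed successfully with the following command. The release is installed into your current namespace, typically `default`.
+
[source,bash]
----
helm list
----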
+
. Verify that the Neuron Kubernetes device plugin is running in your cluster. The following example output shows a cluster with a single Neuron node.
+
[source,bash]
----
kubectl get ds -n kube-system neuron-device-plugin
----
+
[source,bash]
----
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin   1         1         1       1            1           <none>          72s
----
+
. Verify that your nodes have allocatable NeuronCores with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
----
+
[source,bash]
----
NAME                                           NeuronCore
ip-192-168-47-173.us-west-2.compute.internal   2
----
+
. Verify that your nodes have allocatable NeuronDevices with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron"
----
+
[source,bash]
----
NAME                                           NeuronDevice
ip-192-168-47-173.us-west-2.compute.internal   1
----
+
. Create a file named `neuron-ls.yaml` with the following contents. This manifest launches a https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html[Neuron Monitor] container that has the `neuron-ls` tool installed.
+
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: neuron-ls
spec:
  restartPolicy: Never
  containers:
    - name: neuron-container
      image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
      command: ["/bin/sh"]
      args: ["-c", "neuron-ls"]
      resources:
        limits:
          aws.amazon.com/neuron: 1
  tolerations:
    - key: "aws.amazon.com/neuron"
      operator: "Exists"
      effect: "NoSchedule"
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f neuron-ls.yaml
----
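+
Because the container runs `neuron-ls` once and exits, you can optionally wait for the Pod to complete before viewing its logs. The 120-second timeout is an arbitrary example.
+
[source,bash]
----
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/neuron-ls --timeout=120s
----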
. After the Pod has finished running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs neuron-ls
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
instance-type: inf2.xlarge
instance-id: ...
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
----
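. (Optional) When you finish testing, delete the sample Pod with the following command.
+
[source,bash]
----
kubectl delete -f neuron-ls.yaml
----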