Commit 03c98d7

Merge pull request #1143 from csplinter/docs-gpu-amis

Update EKS-optimized GPU AMI docs

2 parents e87e8a0 + 7a41584

6 files changed: +430 −194 lines changed

Lines changed in this file: 267 additions, 0 deletions
include::../attributes.txt[]

[.topic]
[#ml-eks-k8s-device-plugin]
= Install Kubernetes device plugin for GPUs
:info_titleabbrev: Install device plugin for GPUs

Kubernetes https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/[device plugins] have been the primary mechanism for advertising specialized infrastructure such as GPUs, network interfaces, and other network adapters as consumable resources for Kubernetes workloads. While https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/[Dynamic Resource Allocation] (DRA) is positioned as the future of device management in Kubernetes, most specialized infrastructure providers are still early in their support for DRA drivers. Kubernetes device plugins remain a widely available approach for using GPUs in Kubernetes clusters today.
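
Once a device plugin registers with the kubelet, the hardware it manages appears as an extended resource in the node's status, and Pods consume it through resource requests and limits. As a minimal sketch, you can confirm what a node advertises with the following command; `node-name` is a placeholder for one of your nodes.

[source,bash,subs="verbatim,attributes,quotes"]
----
# List the allocatable resources on a node, including any device plugin
# resources such as nvidia.com/gpu or aws.amazon.com/neuron.
kubectl get node [.replaceable]`node-name` -o jsonpath='{.status.allocatable}'
----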

== Considerations

* When using the EKS-optimized AL2023 AMIs with NVIDIA GPUs, you must install the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA Kubernetes device plugin]. You can install and manage it with Helm, your choice of Kubernetes tooling, or the NVIDIA GPU operator (a sketch of the GPU operator install follows this list).
* When using the EKS-optimized Bottlerocket AMIs with NVIDIA GPUs, you do not need to install the NVIDIA Kubernetes device plugin, because it is already included in those AMIs. This also applies when you use GPU instances with EKS Auto Mode.
* When using the EKS-optimized AL2023 or Bottlerocket AMIs with AWS Inferentia or Trainium accelerators, you must install the Neuron Kubernetes device plugin, and can optionally install the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension]. For more information, see the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Neuron documentation for running on EKS].
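
If you prefer the NVIDIA GPU operator route mentioned above, it can also be installed with Helm. The following is a minimal sketch, assuming the standard NVIDIA Helm repository; it disables the operator's driver install because the EKS-optimized accelerated AMIs already include the NVIDIA driver.

[source,bash]
----
# Add the NVIDIA Helm repository.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU operator. driver.enabled=false assumes the node AMI
# already ships the NVIDIA driver, as the EKS-optimized accelerated AMIs do.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false
----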

[#eks-nvidia-device-plugin]
== Install NVIDIA Kubernetes device plugin

The following procedure describes how to install the NVIDIA Kubernetes device plugin and run a sample test on NVIDIA GPU instances.

=== Prerequisites

* An existing EKS cluster
* NVIDIA GPU nodes running in the cluster using the EKS-optimized AL2023 NVIDIA AMI
* Helm installed in your command-line environment; see the <<helm,Setup Helm instructions>>

=== Procedure

. Add the `nvdp` Helm chart repository.
+
[source,bash]
----
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
----
+
. Update your local Helm repository to make sure that you have the most recent charts.
+
[source,bash]
----
helm repo update
----
+
. Get the latest version of the NVIDIA Kubernetes device plugin.
+
[source,bash]
----
helm search repo nvdp --devel
----
+
[source,bash]
----
NAME                          CHART VERSION   APP VERSION   DESCRIPTION
nvdp/gpu-feature-discovery    0.17.4          0.17.4        ...
nvdp/nvidia-device-plugin     0.17.4          0.17.4        ...
----
+
. Install the NVIDIA Kubernetes device plugin on your cluster, replacing `0.17.4` with the latest version from the command above.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia \
  --create-namespace \
  --version [.replaceable]`0.17.4` \
  --set gfd.enabled=true
----
+
. Verify that the NVIDIA Kubernetes device plugin is running in your cluster. The example below shows the output for a cluster with two GPU nodes.
+
[source,bash]
----
kubectl get ds -n nvidia nvdp-nvidia-device-plugin
----
+
[source,bash]
----
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvdp-nvidia-device-plugin   2         2         2       2            2           <none>          11m
----
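+
Optionally, you can check the device plugin's logs for registration errors. This is a quick sanity check; `ds/nvdp-nvidia-device-plugin` is the DaemonSet created by the Helm install above.
+
[source,bash]
----
# Tail logs from one Pod of the device plugin DaemonSet.
kubectl logs -n nvidia ds/nvdp-nvidia-device-plugin --tail=20
----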
+
. Verify that your nodes have allocatable GPUs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
----
+
[source,bash]
----
NAME                                           GPU
ip-192-168-11-225.us-west-2.compute.internal   1
ip-192-168-24-96.us-west-2.compute.internal    1
----
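+
Because the install step above enabled GPU feature discovery (`gfd.enabled=true`), you can also inspect the GPU labels it applies to nodes. The label names below are ones GPU feature discovery typically publishes; treat them as illustrative.
+
[source,bash]
----
# List nodes with the GPU product and count labels added by GPU feature discovery.
kubectl get nodes -L nvidia.com/gpu.product -L nvidia.com/gpu.count
----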
+
. Create a file named `nvidia-smi.yaml` with the following contents. This manifest launches a https://docs.aws.amazon.com/linux/al2023/ug/minimal-container.html[minimal AL2023 container image] that runs `nvidia-smi` on a node.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-demo
    image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
    command: ['/bin/sh', '-c']
    args: ['nvidia-smi && tail -f /dev/null']
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: 'nvidia.com/gpu'
    operator: 'Equal'
    value: 'true'
    effect: 'NoSchedule'
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f nvidia-smi.yaml
----
. After the Pod has started running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs nvidia-smi
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XXX.XX             Driver Version: XXX.XXX.XX     CUDA Version: XX.X     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   27C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
----
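+
. (Optional) When you are done testing, delete the sample Pod.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl delete -f nvidia-smi.yaml
----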

[#eks-neuron-device-plugin]
== Install Neuron Kubernetes device plugin

The following procedure describes how to install the Neuron Kubernetes device plugin and run a sample test on an Inferentia instance.

=== Prerequisites

* An existing EKS cluster
* Neuron nodes running in the cluster using the EKS-optimized AL2023 Neuron AMI or Bottlerocket AMI
* Helm installed in your command-line environment; see the <<helm,Setup Helm instructions>>

=== Procedure

. Install the Neuron Kubernetes device plugin on your cluster.
+
[source,bash]
----
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
  --set "npd.enabled=false"
----
+
. Verify that the Neuron Kubernetes device plugin is running in your cluster. The example below shows the output for a cluster with a single Neuron node.
+
[source,bash]
----
kubectl get ds -n kube-system neuron-device-plugin
----
+
[source,bash]
----
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin   1         1         1       1            1           <none>          72s
----
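+
Optionally, check the device plugin's logs to confirm that it registered the Neuron devices; `ds/neuron-device-plugin` is the DaemonSet name shown above.
+
[source,bash]
----
# Tail logs from one Pod of the Neuron device plugin DaemonSet.
kubectl logs -n kube-system ds/neuron-device-plugin --tail=20
----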
+
. Verify that your nodes have allocatable NeuronCores with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
----
+
[source,bash]
----
NAME                                           NeuronCore
ip-192-168-47-173.us-west-2.compute.internal   2
----
+
. Verify that your nodes have allocatable NeuronDevices with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron"
----
+
[source,bash]
----
NAME                                           NeuronDevice
ip-192-168-47-173.us-west-2.compute.internal   1
----
+
. Create a file named `neuron-ls.yaml` with the following contents. This manifest launches a https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html[Neuron Monitor] container that has the `neuron-ls` tool installed.
+
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: neuron-ls
spec:
  restartPolicy: Never
  containers:
  - name: neuron-container
    image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
    command: ["/bin/sh"]
    args: ["-c", "neuron-ls"]
    resources:
      limits:
        aws.amazon.com/neuron: 1
  tolerations:
  - key: "aws.amazon.com/neuron"
    operator: "Exists"
    effect: "NoSchedule"
----
+
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl apply -f neuron-ls.yaml
----
. After the Pod has finished running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl logs neuron-ls
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes"]
----
instance-type: inf2.xlarge
instance-id: ...
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
----
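+
. (Optional) When you are done testing, delete the sample Pod.
+
[source,bash,subs="verbatim,attributes"]
----
kubectl delete -f neuron-ls.yaml
----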
