This how-to shows how to deploy a Job that runs nvidia-smi on a specific node in a Kubernetes cluster.
Before you start

Note: This how-to assumes that the NVIDIA GPU Operator is already deployed in the cluster; a quick way to check is shown after the list below. You also need:

- A Kubernetes cluster.
- A system with kubectl and a text editor that can reach the cluster.
- A cluster node with an NVIDIA GPU.
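A minimal check, assuming the GPU Operator was installed into the gpu-operator namespace (adjust the namespace and node name to your installation):
kubectl get pods -n gpu-operator
kubectl describe node <node name> | grep nvidia.com/gpu
The first command should show the operator pods in a Running or Completed state; if the device plugin is running, the second should list nvidia.com/gpu under the node's capacity and allocatable resources.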
Use your favorite text editor and a system with kubectl that can access the cluster.
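If you want to confirm that kubectl is pointed at the right cluster before you begin, you can run, for example:
kubectl cluster-info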
- Create a file to describe the Job:
touch nvidia-smi-job.yaml
- Open the file in a text editor. For example:
nano nvidia-smi-job.yaml
- Paste the following Job description into the file.
Make sure to change the metadata.namespace value to the namespace in which you want to launch the Job. Similarly, change spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values to the name of the node on which you want to run the Job (commands for listing namespaces and node names are shown after the manifest).
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
  namespace: playground
spec:
  template:
    metadata:
      name: nvidia-smi
    spec:
      containers:
        - name: nvidia-smi
          image: 'nvcr.io/nvidia/cuda:12.1.0-runtime-ubuntu20.04'
          command:
            - nvidia-smi
      restartPolicy: Never
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - us-5501-nb.supermicro.com
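If you are unsure which values to use, you can list the namespaces and the nodes' kubernetes.io/hostname labels:
kubectl get namespaces
kubectl get nodes -L kubernetes.io/hostname
The added column shows each node's kubernetes.io/hostname label, which is the value that goes under values in the node affinity rule.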
- Deploy the Job to the cluster:
kubectl apply -f nvidia-smi-job.yaml
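To check that the Job was created and wait for it to finish, you can run, for example:
kubectl get -n <namespace> job nvidia-smi
kubectl wait -n <namespace> --for=condition=complete job/nvidia-smi --timeout=120s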
- Examine the logs of the Job.
Get the pods with the first command, then use the pod's name in the second command to get the logs of the pod created for the Job:
kubectl get -n <namespace> po
kubectl logs -n <namespace> <pod name> -f
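If the Job ran successfully, the logs should show nvidia-smi's usual status output for the GPU on the selected node. As an optional cleanup step, you can delete the Job once you have read the logs (this also removes its pod):
kubectl delete -n <namespace> job nvidia-smi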