- Enables fine-grained GPU allocation for containers
## System Requirements
See the [ROCm System Requirements](https://rocm.docs.amd.com/projects/install-on
## Quick Start
To deploy the device plugin, run it on all nodes equipped with AMD GPUs. The simplest way to do this is by creating a Kubernetes [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). A pre-built Docker image is available on [DockerHub](https://hub.docker.com/r/rocm/k8s-device-plugin), and a predefined YAML file named [k8s-ds-amdgpu-dp.yaml](https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml) is included in this repository.
Create a DaemonSet in your Kubernetes cluster with the following command:
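Assuming the repository has been cloned locally, the included manifest can be applied as follows (a minimal sketch using the `k8s-ds-amdgpu-dp.yaml` file named above):

```bash
# Create the device plugin DaemonSet from the manifest included in this repository
kubectl create -f k8s-ds-amdgpu-dp.yaml
```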
Alternatively, you can pull directly from the web:
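For example, applying the same manifest straight from its raw GitHub URL (the file referenced above), without cloning the repository:

```bash
# Apply the DaemonSet manifest directly from GitHub
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
```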
To enable the experimental device health check, use `k8s-ds-amdgpu-dp-health.yaml` **after** setting `--allow-privileged=true` for kube-apiserver.
### Deploy the Node Labeler (Optional)
For enhanced GPU discovery and scheduling, deploy the AMD GPU Node Labeler:
```bash
kubectl create -f k8s-ds-amdgpu-labeller.yaml
```
This will automatically label nodes with GPU-specific information such as VRAM size, compute units, and device IDs.
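These labels can then be used to steer scheduling. Below is a minimal sketch of a pod spec pinned to nodes carrying a particular device ID via a `nodeSelector`; the label key and value shown are assumptions for illustration, so inspect the labels the labeller actually applied with `kubectl get nodes --show-labels` before relying on them:

```yaml
# Illustrative pod spec targeting labelled nodes.
# The nodeSelector label key/value below are assumed examples --
# verify the real labels on your cluster first.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-selector-example   # hypothetical name
spec:
  nodeSelector:
    amd.com/gpu.device-id: "66af"   # assumed label; check your nodes
  containers:
  - name: app
    image: rocm/pytorch:latest
    resources:
      limits:
        amd.com/gpu: 1   # request one AMD GPU from the device plugin
```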
### Verify Installation
After deploying the device plugin, verify that your AMD GPUs are properly recognized as schedulable resources:
```bash
# List all nodes with their AMD GPU capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:"status.capacity.amd\.com/gpu"

NAME          GPU
k8s-node-01   8
```
## Example Workload
You can restrict workloads to a node with a GPU by adding `resources.limits` to the pod definition. An example pod definition is provided in [example/pod/pytorch.yaml](https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/example/pod/pytorch.yaml). Create the pod by running:
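For example, assuming the manifest has been downloaded locally as `pytorch.yaml` (or point `kubectl` at the raw URL above instead):

```bash
# Create the example PyTorch pod from the downloaded manifest
kubectl create -f pytorch.yaml
```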
After the pod is running, view the benchmark results with:
```bash
kubectl logs pytorch-gpu-pod-example
```
## Contributing
We welcome contributions to this project! Please refer to the [Development Guidelines](contributing/development.md) for details on how to get involved.