Commit 8c28060

AMD-melliott authored and y2kenny-amd committed
Additional edits and updates
1 parent 62d040e commit 8c28060

12 files changed, +581 -358 lines changed

docs/advanced/health-checks.md

-90
This file was deleted.

docs/advanced/monitoring.md

-44
This file was deleted.

docs/advanced/node-labelling.md

-118
This file was deleted.

docs/index.md

+23 -16
@@ -6,8 +6,7 @@ The AMD GPU Device Plugin for Kubernetes enables the use of AMD GPUs as schedula

 - Implements the Kubernetes Device Plugin API for AMD GPUs
 - Exposes AMD GPUs as `amd.com/gpu` resources in Kubernetes
-- Supports automated node labeling with GPU-specific information
-- Provides optional health monitoring for GPUs
+- Provides automated node labeling with detailed GPU properties (device ID, VRAM, compute units, etc.)
 - Enables fine-grained GPU allocation for containers

 ## System Requirements
@@ -20,8 +19,6 @@ See the [ROCm System Requirements](https://rocm.docs.amd.com/projects/install-on

 ## Quick Start

-Deploy the device plugin on your Kubernetes cluster:
-
 To deploy the device plugin, run it on all nodes equipped with AMD GPUs. The simplest way to do this is by creating a Kubernetes [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). A pre-built Docker image is available on [DockerHub](https://hub.docker.com/r/rocm/k8s-device-plugin), and a predefined YAML file named [k8s-ds-amdgpu-dp.yaml](https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml) is included in this repository.

 Create a DaemonSet in your Kubernetes cluster with the following command:
@@ -36,20 +33,34 @@ Alternatively, you can pull directly from the web:
 kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
 ```

-To enable the experimental device health check, use `k8s-ds-amdgpu-dp-health.yaml` **after** setting `--allow-privileged=true` for kube-apiserver.
+### Deploy the Node Labeler (Optional)

-## Example Workload
+For enhanced GPU discovery and scheduling, deploy the AMD GPU Node Labeler:
+
+```bash
+kubectl create -f k8s-ds-amdgpu-labeller.yaml
+```
+
+This will automatically label nodes with GPU-specific information such as VRAM size, compute units, and device IDs.

-You can restrict workloads to a node with a GPU by adding `resources.limits` to the pod definition. An example pod definition is provided in `example/pod/alexnet-gpu.yaml`. Create the pod by running:
+### Verify Installation
+
+After deploying the device plugin, verify that your AMD GPUs are properly recognized as schedulable resources:

 ```bash
-kubectl create -f alexnet-gpu.yaml
+# List all nodes with their AMD GPU capacity
+kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:"status.capacity.amd\.com/gpu"
+
+NAME          GPU
+k8s-node-01   8
 ```

-or
+## Example Workload
+
+You can restrict workloads to a node with a GPU by adding `resources.limits` to the pod definition. An example pod definition is provided in [example/pod/pytorch.yaml](https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/example/pod/pytorch.yaml). Create the pod by running:

 ```bash
-kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/example/pod/tensorflow-gpu.yaml
+kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/example/pod/pytorch.yaml
 ```

 Check the pod status with:
@@ -61,13 +72,9 @@ kubectl describe pods
 After the pod is running, view the benchmark results with:

 ```bash
-kubectl logs rocm-test-pod rocm-test-container
+kubectl logs pytorch-gpu-pod-example
 ```

-## Health Checks
-
-This plugin extends more granular health detection per GPU using the exporter health service over a gRPC socket service mounted on `/var/lib/amd-metrics-exporter/`.
-
 ## Contributing

-We welcome contributions to this project! Please refer to the [Contributing Guide](contributing/development.md) for details on how to get involved.
+We welcome contributions to this project! Please refer to the [Development Guidelines](contributing/development.md) for details on how to get involved.
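
The verification step added to `docs/index.md` prints one row per node. As a rough sketch of how that check could be scripted, the helper below totals the GPU column of the same table. The function name `sum_amd_gpus` is hypothetical and not part of this repository, and the sketch assumes that nodes without the device plugin report a non-numeric value such as `<none>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): total the GPU column of
# the table printed by the `kubectl get nodes -o custom-columns=...` command.
# Reads the table on stdin. The header row is skipped, and rows whose GPU
# field is not a plain integer (assumed to be "<none>" on nodes without the
# device plugin) are ignored.
sum_amd_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:"status.capacity.amd\.com/gpu" | sum_amd_gpus
```

A total of zero suggests that the DaemonSet has not registered any `amd.com/gpu` resources yet.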

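The Example Workload text mentions `resources.limits` but the diff does not show what such a pod definition looks like. The sketch below writes a minimal manifest that requests one `amd.com/gpu`. The function name, pod name, container name, image tag, and command are all placeholders chosen for illustration, and they are not the contents of `example/pod/pytorch.yaml`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a minimal pod manifest that requests one AMD GPU
# through resources.limits. Every name and the image below are assumptions
# for illustration, not values taken from the repository's example files.
write_gpu_pod_manifest() {
  path="$1"
  cat > "$path" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: rocm/pytorch:latest
      command: ["python3", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        limits:
          amd.com/gpu: 1
EOF
}

# Usage (requires cluster access):
#   write_gpu_pod_manifest /tmp/amd-gpu-demo.yaml
#   kubectl create -f /tmp/amd-gpu-demo.yaml
#   kubectl logs amd-gpu-demo
```

Because `amd.com/gpu` is an extended resource, the scheduler should place this pod only on a node that advertises at least one free GPU.
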
docs/sphinx/_toc.yml.in

-5
@@ -8,11 +8,6 @@ subtrees:
       - file: user-guide/installation
       - file: user-guide/configuration
       - file: user-guide/examples
-  - caption: Advanced
-    entries:
-      - file: advanced/health-checks
-      - file: advanced/monitoring
-      - file: advanced/node-labelling
   - caption: Contributing
     entries:
       - file: contributing/development
