Added Release Notes, Known Limitations and Contributing section to docs #31

Merged
merged 3 commits into from
Dec 23, 2024
3 changes: 3 additions & 0 deletions .wordlist.txt
@@ -52,3 +52,6 @@ verison
webhook
CRD
uninstallation
OpenShift
Autobuild
NMC
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -5,8 +5,8 @@
external_projects = ["amd-gpu-operator"]
external_projects_current_project = "amd-gpu-operator"

- project = "AMD Instinct Hub"
- version = "1.0.0"
+ project = "AMD Instinct Documentation"
+ version = "1.1.0"
release = version
html_title = f"AMD GPU Operator {version}"
author = "Advanced Micro Devices, Inc."
53 changes: 53 additions & 0 deletions docs/contributing/documentation-build-guide.md
@@ -0,0 +1,53 @@
# Documentation Build Guide

This guide is for developers who want to contribute to the AMD GPU Operator documentation available at https://dcgpu.docs.amd.com/projects/gpu-operator. The docs use [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) as their base. The steps below show how to build and serve the docs locally for testing.

## Building and Serving the Docs

1. Create a Python Virtual Environment (optional, but recommended)

```bash
python3 -m venv .venv/docs
source .venv/docs/bin/activate
# On Windows, run this instead: source .venv/docs/Scripts/activate
```

2. Install required packages for docs

```bash
pip install -r docs/sphinx/requirements.txt
```

3. Build the docs

```bash
python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
```

4. Serve docs locally on port 8000

```bash
python3 -m http.server -d ./docs/_build/html/
```

5. You can now view the docs site by going to http://localhost:8000

## Auto-building the docs

The following steps watch the docs directory and rebuild the documentation each time you change a documentation file:

1. Install Sphinx Autobuild package

```bash
pip install sphinx-autobuild
```

2. Run the autobuild (will also serve the docs on port 8000 automatically)

```bash
sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
```

## Troubleshooting

1. **Navigation Menu not displaying new links**

If you recently added a new link to the navigation menu, pages that have not changed since the last build may not display the new link. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu is regenerated for all pages.
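
The clean rebuild described above can also be scripted. The following is a minimal sketch, assuming the build commands from this guide are run from the repository root, so that the doctree cache lives in `_build/` and the HTML output in `docs/_build/`. The helper names `clean_build_dirs` and `rebuild_docs` are hypothetical and are not part of the repository.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Build directories produced by the commands in this guide (assumption: the
# commands are run from the repository root).
BUILD_DIRS = ("_build", "docs/_build")


def clean_build_dirs(repo_root: Path) -> list:
    """Delete existing build directories and return the ones that were removed."""
    removed = []
    for relative in BUILD_DIRS:
        target = repo_root / relative
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed


def rebuild_docs(repo_root: Path) -> None:
    """Re-run the Sphinx build command from step 3 of this guide."""
    subprocess.run(
        [sys.executable, "-m", "sphinx", "-b", "html",
         "-d", "_build/doctrees", "-D", "language=en",
         "./docs/", "docs/_build/html"],
        cwd=repo_root,
        check=True,  # raise if the Sphinx build fails
    )
```

To use it, call `clean_build_dirs(Path.cwd())` followed by `rebuild_docs(Path.cwd())` from the repository root, with the docs virtual environment activated.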
207 changes: 207 additions & 0 deletions docs/contributing/documentation-standards.md
@@ -0,0 +1,207 @@
# Documentation Standards

## Voice and Tone

### Writing Style

- Use active voice
- Write in second person ("you") for instructions
- Maintain professional, technical tone
- Be concise and direct
- Use present tense

Examples:

```diff
- The configuration file will be created by the operator
+ The operator creates the configuration file

- One should ensure that all prerequisites are met
+ Ensure all prerequisites are met
```

### Terminology Standards

#### Product Names

- "AMD GPU Operator" (not "GPU operator" or "gpu-operator")
- "Kubernetes" (not "kubernetes" or "K8s")
- "OpenShift" (not "Openshift" or "openshift")
- "AMD ROCm™" (not "ROCM" or "rocm")
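
A check like this can be automated. The sketch below is one possible approach, not an existing tool in this repository: the function name `find_discouraged_terms` is hypothetical, and the mapping simply restates the list above. It scans raw text, so it will also flag matches inside code blocks and URLs (for example `kubernetes.io`), which would need to be excluded in real use.

```python
import re

# Discouraged spellings from the list above, mapped to the preferred form.
# Matching is case-sensitive, so "AMD GPU Operator" itself is never flagged.
DISCOURAGED = {
    "GPU operator": "AMD GPU Operator",
    "gpu-operator": "AMD GPU Operator",
    "kubernetes": "Kubernetes",
    "K8s": "Kubernetes",
    "Openshift": "OpenShift",
    "openshift": "OpenShift",
    "ROCM": "AMD ROCm™",
    "rocm": "AMD ROCm™",
}

PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in DISCOURAGED) + r")\b"
)


def find_discouraged_terms(text):
    """Return (line_number, found, preferred) for each discouraged spelling."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            found = match.group(1)
            findings.append((line_number, found, DISCOURAGED[found]))
    return findings
```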

#### Technical Terms

| Term | Usage Notes |
|------|-------------|
| AMD GPU driver | Standard term for the driver. Don't use "AMDGPU driver" or "GPU driver" alone |
| worker node | Standard term for cluster nodes. Don't use "worker" or "node" alone |
| DeviceConfig | One word, capital 'D' and 'C' when referring to the resource |
| container image | Use instead of just "image" |
| pod | Lowercase unless starting a sentence |
| namespace | Lowercase unless starting a sentence |

#### Acronym Usage

Always expand acronyms on first use in each document:

- NFD (Node Feature Discovery)
- KMM (Kernel Module Management)
- CRD (Custom Resource Definition)
- CR (Custom Resource)
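
The first-use rule can be checked mechanically. The following is a rough sketch under stated assumptions: the function name `unexpanded_acronyms` is hypothetical, and an acronym is treated as expanded only when its first occurrence is written as `ACRONYM (Expansion)` or `Expansion (ACRONYM)`.

```python
import re

# Acronyms and expansions from the list above.
ACRONYMS = {
    "NFD": "Node Feature Discovery",
    "KMM": "Kernel Module Management",
    "CRD": "Custom Resource Definition",
    "CR": "Custom Resource",
}


def unexpanded_acronyms(text, acronyms=ACRONYMS):
    """Return acronyms whose first use is not accompanied by their expansion."""
    missing = []
    for acronym, expansion in acronyms.items():
        first_use = re.search(rf"\b{re.escape(acronym)}\b", text)
        if first_use is None:
            continue  # acronym is not used in this document
        # Form 1: "NFD (Node Feature Discovery)"
        acronym_first = text.startswith(f" ({expansion})", first_use.end())
        # Form 2: "Node Feature Discovery (NFD)"
        prefix = f"{expansion} ("
        start = first_use.start() - len(prefix)
        expansion_first = (
            start >= 0
            and text.startswith(prefix, start)
            and text.startswith(")", first_use.end())
        )
        if not (acronym_first or expansion_first):
            missing.append(acronym)
    return missing
```

Running the function on the text of a single document returns the acronyms that still need an expansion, in the order they are listed in `ACRONYMS`.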

## Formatting Standards

### Headers

- Use title case for all headers
- Add blank line before and after headers

```markdown
# Main Title

## Section Title

### Subsection Title
```

### Code Blocks

- Always specify language for syntax highlighting
- Use inline code format (`code`) for:
  - Command names
  - File names
  - Variable names
  - Resource names
- Use block code format (```) for:
  - Command examples
  - YAML/JSON examples
  - Configuration files
  - Output examples

Examples:

````markdown
Install using `helm`:

```bash
helm install amd-gpu-operator rocm/gpu-operator-helm
```

Create a configuration:

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: example
```
````

### Lists

- Maintain consistent indentation (2 spaces)
- End each list item with punctuation
- Add blank line between list items if they contain multiple sentences or code blocks

### Admonitions

Use consistent formatting for notes, warnings, and tips:

````markdown
```{note}
Important supplementary information.
```

```{warning}
Critical information about potential problems.
```

```{tip}
Helpful advice for better usage.
```
````

### Tables

- Use tables for structured information
- Include header row
- Align columns consistently
- Add blank lines before and after tables

Example:

```markdown
| Parameter | Description | Default |
|-----------|-------------|---------|
| `image` | Container image path | `rocm/gpu-operator:latest` |
| `version` | Driver version | `6.2.0` |
```

## Document Structure

### Standard Sections

Every document should include these sections in order:

1. Title (H1)
2. Brief overview/introduction
3. Prerequisites (if applicable)
4. Main content
5. Verification steps (if applicable)
6. Troubleshooting (if applicable)

### Example Template

```markdown
# Feature Title

Brief description of the feature or component.

## Prerequisites

- Required components
- Required permissions
- Required resources

## Overview

Detailed description of the feature.

## Configuration

Configuration steps and examples.

## Verification

Steps to verify successful implementation.

## Troubleshooting

Common issues and solutions.
```

## File Naming

- Use lowercase
- Use hyphens for spaces
- Be descriptive but concise
- Include category prefix when applicable

Examples:

- `install-kubernetes.md`
- `upgrade-operator.md`
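
As an illustration, the naming rules can be expressed as a regular expression. This is a sketch only: the function name `is_valid_doc_name` is hypothetical, and it assumes documentation files use the `.md` extension and that digits are allowed alongside lowercase letters.

```python
import re

# Lowercase words (letters and digits) joined by single hyphens, ending in ".md".
FILE_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\.md")


def is_valid_doc_name(file_name):
    """Return True if file_name is lowercase and uses hyphens for spaces."""
    return FILE_NAME_PATTERN.fullmatch(file_name) is not None
```

The pattern cannot judge whether a name is descriptive or carries the right category prefix, so those rules still need human review.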

## Links and References

- Use relative links for internal documentation
- Use absolute links for external references
- Include link text that makes sense out of context

Examples:

```markdown
[Installation Guide](../install/kubernetes)
[Kubernetes Documentation](https://kubernetes.io/docs)
```
79 changes: 79 additions & 0 deletions docs/knownlimitations.md
@@ -0,0 +1,79 @@
# Known Issues and Limitations

1. **The GPU Operator driver install includes only the DKMS package**
   - ***Impact:*** Applications that require ROCm packages will need to have the respective packages installed separately.
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** None, as this is the intended behaviour
</br></br>

2. **When using the GPU Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install**
   - ***Impact:*** The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in `dmesg`.
   - ***Affected Configurations:*** Nodes with driver version >= ROCm 6.2.x
   - ***Workaround:*** Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+.
</br></br>

3. **The GPU Operator is unable to install the amdgpu driver if a driver is already installed**
   - ***Impact:*** The driver install fails if the amdgpu in-box driver is present or a driver is already installed
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted, or amdgpu drivers should not be pre-installed on the node. [Blacklist the in-box driver](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded, or remove the pre-installed driver.
</br></br>

4. **When the GPU Operator is used in SKIP driver install mode, if the amdgpu module is removed while the device plugin is installed, the device plugin will not reflect the active GPUs available on the server**
   - ***Impact:*** Workload scheduling is affected, because workloads may be scheduled on nodes that do not have an active GPU.
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** Restart the deployed device plugin pod.
</br></br>

5. **Worker nodes whose kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed**
   - ***Impact:*** Node upgrade will not proceed automatically and requires manual intervention
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:

     ```bash
     kubectl cordon <node-name>
     ```

   </br>

6. **When the GPU Operator is installed with the Metrics Exporter enabled, driver upgrades are blocked because the exporter is actively using the amdgpu module**

   - ***Impact:*** Driver upgrade is blocked
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows:

     1. Label all nodes with a new label:

        ```bash
        kubectl label nodes --all amd.com/device-metrics-exporter=true
        ```

     2. Patch the DeviceConfig to include new selectors for the Metrics Exporter:

        ```bash
        kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'
        ```

     3. Remove the `amd.com/device-metrics-exporter` label from the node on which you want to disable the exporter:

        ```bash
        kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-
        ```

   </br>

7. **Due to an issue with KMM 2.2, deletion of the DeviceConfig Custom Resource gets stuck in Red Hat OpenShift**

   - ***Impact:*** The DeviceConfig Custom Resource cannot be deleted if the node reboots during uninstall.
   - ***Affected Configurations:*** This issue only affects Red Hat OpenShift
   - ***Workaround:*** This issue will be fixed in the next release of KMM. For the time being, you can use a KMM version earlier than 2.2, or manually remove the status from the NodeModulesConfig (NMC):

     1. List all the NMC resources and pick the correct one (there is one NMC per node, named after the node it relates to).

        ```bash
        oc get nmc -A
        ```

     2. Edit the NMC.

        ```bash
        oc edit nmc <nmc name>
        ```

     3. Remove all data related to your module from the NMC status, then save. This should allow the module to finally be deleted.
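
For the Metrics Exporter workaround in item 6 above, the JSON merge patch is easy to mistype on the command line. The sketch below is an optional convenience rather than part of the documented procedure: it builds the same patch as a Python dictionary and prints a shell-quoted `kubectl patch` command. The DeviceConfig name `gpu-operator` and the namespace `kube-amd-gpu` are taken from the example in item 6 and may differ in your cluster.

```python
import json
import shlex

# Labels taken from the workaround in item 6 above.
selector = {
    "feature.node.kubernetes.io/amd-gpu": "true",
    "amd.com/device-metrics-exporter": "true",
}

# Merge patch that sets the Metrics Exporter node selector on the DeviceConfig.
patch = {"spec": {"metricsExporter": {"selector": selector}}}

command = [
    "kubectl", "patch", "deviceconfig", "gpu-operator",
    "-n", "kube-amd-gpu",
    "--type=merge",
    "-p", json.dumps(patch, separators=(",", ":")),
]

# shlex.join quotes the JSON payload so the command can be pasted into a shell.
print(shlex.join(command))
```

Building the payload with `json.dumps` guarantees that the braces and quotes are balanced, which is the part most likely to go wrong when the patch is typed by hand.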