ROCm · sajmera-pensando · Dec 23, 2024
diff --git a/.wordlist.txt b/.wordlist.txt
@@ -52,3 +52,6 @@ verison
 webhook
 CRD
 uninstallation
+OpenShift
+Autobuild
+NMC
diff --git a/docs/conf.py b/docs/conf.py
@@ -5,8 +5,8 @@
 external_projects = ["amd-gpu-operator"]
 external_projects_current_project = "amd-gpu-operator"
 
-project = "AMD Instinct Hub"
-version = "1.0.0"
+project = "AMD Instinct Documentation"
+version = "1.1.0"
 release = version
 html_title = f"AMD GPU Operator {version}"
 author = "Advanced Micro Devices, Inc."

diff --git a/docs/contributing/documentation-build-guide.md b/docs/contributing/documentation-build-guide.md
@@ -0,0 +1,53 @@
+# Documentation Build Guide
+
+This guide is for developers who want to contribute to the AMD GPU Operator documentation, available at https://dcgpu.docs.amd.com/projects/gpu-operator. The docs use [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) as their base. The steps below show how to build and serve the docs locally for testing.
+
+## Building and Serving the Docs
+
+1. Create a Python Virtual Environment (optional, but recommended)
+
+    ```bash
+    python3 -m venv .venv/docs
+    source .venv/docs/bin/activate  # on Windows: source .venv/docs/Scripts/activate
+    ```
+
+2. Install required packages for docs
+
+    ```bash
+    pip install -r docs/sphinx/requirements.txt
+    ```
+
+3. Build the docs
+
+    ```bash
+    python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
+    ```
+
+4. Serve docs locally on port 8000
+
+    ```bash
+    python3 -m http.server -d ./docs/_build/html/
+    ```
+
+5. You can now view the docs site by going to http://localhost:8000
+
+## Auto-building the docs
+The steps below watch the docs directory and rebuild the documentation each time you change a documentation file:
+
+1. Install Sphinx Autobuild package
+
+    ```bash
+    pip install sphinx-autobuild
+    ```
+
+2. Run the autobuild (will also serve the docs on port 8000 automatically)
+
+    ```bash
+    sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
+    ```
+
+## Troubleshooting
+
+1. **Navigation Menu not displaying new links**
+
+    If you recently added a new link to the navigation menu, previously unchanged pages may not display it. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu is regenerated for all pages.
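Reviewer sketch, not part of this PR: the troubleshooting step above could be scripted. The helper below assumes it is run from the repository root. The guide only says to delete `_build/`, so removing both `_build/` (which holds the `-d _build/doctrees` cache from step 3) and `docs/_build/` is an assumption. The function names are hypothetical, and `sys.executable` stands in for `python3`.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Paths copied from the build command in step 3 of the guide.
DOCTREE_DIR = Path("_build/doctrees")
HTML_DIR = Path("docs/_build/html")


def sphinx_command(source_dir="./docs/"):
    """Return the Sphinx invocation from step 3 as an argv list."""
    return [
        sys.executable, "-m", "sphinx",
        "-b", "html",
        "-d", str(DOCTREE_DIR),
        "-D", "language=en",
        source_dir,
        str(HTML_DIR),
    ]


def clean(root="."):
    """Delete both build directories under root; return the paths removed."""
    removed = []
    for rel in (DOCTREE_DIR.parent, HTML_DIR.parent):
        target = Path(root) / rel
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed


def clean_rebuild(root="."):
    """Remove stale build output, then rebuild every page from scratch."""
    clean(root)
    subprocess.run(sphinx_command(), cwd=root, check=True)
```

Calling `clean_rebuild()` from the repository root, inside the virtual environment from step 1, should leave a fresh site in `docs/_build/html`.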
diff --git a/docs/contributing/documentation-standards.md b/docs/contributing/documentation-standards.md
@@ -0,0 +1,207 @@
+# Documentation Standards
+
+## Voice and Tone
+
+### Writing Style
+
+- Use active voice
+- Write in second person ("you") for instructions
+- Maintain professional, technical tone
+- Be concise and direct
+- Use present tense
+
+Examples:
+
+```diff
+- The configuration file will be created by the operator
++ The operator creates the configuration file
+
+- One should ensure that all prerequisites are met
++ Ensure all prerequisites are met
+```
+
+### Terminology Standards
+
+#### Product Names
+
+- "AMD GPU Operator" (not "GPU operator" or "gpu-operator")
+- "Kubernetes" (not "kubernetes" or "K8s")
+- "OpenShift" (not "Openshift" or "openshift")
+- "AMD ROCm™" (not "ROCM" or "rocm")
+
+#### Technical Terms
+
+| Term | Usage Notes |
+|------|-------------|
+| AMD GPU driver | Standard term for the driver. Don't use "AMDGPU driver" or "GPU driver" alone |
+| worker node | Standard term for cluster nodes. Don't use "worker" or "node" alone |
+| DeviceConfig | One word, capital 'D' and 'C' when referring to the resource |
+| container image | Use instead of just "image" |
+| pod | Lowercase unless starting a sentence |
+| namespace | Lowercase unless starting a sentence |
+
+#### Acronym Usage
+
+Always expand acronyms on first use in each document:
+
+- NFD (Node Feature Discovery)
+- KMM (Kernel Module Management)
+- CRD (Custom Resource Definition)
+- CR (Custom Resource)
+
+## Formatting Standards
+
+### Headers
+
+- Use title case for all headers
+- Add blank line before and after headers
+
+```markdown
+# Main Title
+
+## Section Title
+
+### Subsection Title
+```
+
+### Code Blocks
+
+- Always specify language for syntax highlighting
+- Use inline code format (`code`) for:
+  - Command names
+  - File names
+  - Variable names
+  - Resource names
+- Use block code format (```) for:
+  - Command examples
+  - YAML/JSON examples
+  - Configuration files
+  - Output examples
+
+Examples:
+
+````markdown
+Install using `helm`:
+
+```bash
+helm install amd-gpu-operator rocm/gpu-operator-helm
+```
+
+Create a configuration:
+
+```yaml
+apiVersion: amd.com/v1alpha1
+kind: DeviceConfig
+metadata:
+  name: example
+```
+````
+
+### Lists
+
+- Maintain consistent indentation (2 spaces)
+- End each list item with punctuation
+- Add blank line between list items if they contain multiple sentences or code blocks
+
+### Admonitions
+
+Use consistent formatting for notes, warnings, and tips:
+
+````markdown
+```{note}
+Important supplementary information.
+```
+
+```{warning}
+Critical information about potential problems.
+```
+
+```{tip}
+Helpful advice for better usage.
+```
+````
+
+### Tables
+
+- Use tables for structured information
+- Include header row
+- Align columns consistently
+- Add blank lines before and after tables
+
+Example:
+
+```markdown
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `image` | Container image path | `rocm/gpu-operator:latest` |
+| `version` | Driver version | `6.2.0` |
+```
+
+## Document Structure
+
+### Standard Sections
+
+Every document should include these sections in order:
+
+1. Title (H1)
+2. Brief overview/introduction
+3. Prerequisites (if applicable)
+4. Main content
+5. Verification steps (if applicable)
+6. Troubleshooting (if applicable)
+
+### Example Template
+
+```markdown
+# Feature Title
+
+Brief description of the feature or component.
+
+## Prerequisites
+
+- Required components
+- Required permissions
+- Required resources
+
+## Overview
+
+Detailed description of the feature.
+
+## Configuration
+
+Configuration steps and examples.
+
+## Verification
+
+Steps to verify successful implementation.
+
+## Troubleshooting
+
+Common issues and solutions.
+```
+
+## File Naming
+
+- Use lowercase
+- Use hyphens for spaces
+- Be descriptive but concise
+- Include category prefix when applicable
+
+Examples:
+
+- `install-kubernetes.md`
+- `upgrade-operator.md`
+
+## Links and References
+
+- Use relative links for internal documentation
+- Use absolute links for external references
+- Include link text that makes sense out of context
+
+Examples:
+
+```markdown
+[Installation Guide](../install/kubernetes)
+[Kubernetes Documentation](https://kubernetes.io/docs)
+```
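Reviewer sketch, not part of this PR: some of the standards above could be checked mechanically. The example below covers only the File Naming rules and the Product Names list. The regular expression is an assumed reading of "lowercase" and "hyphens for spaces", and the function names are hypothetical.

```python
import re

# Lowercase words joined by hyphens, with a .md extension (assumed pattern).
FILENAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*\.md$")

# Discouraged spelling -> preferred spelling, from the Product Names list.
PRODUCT_NAMES = {
    "K8s": "Kubernetes",
    "kubernetes": "Kubernetes",
    "Openshift": "OpenShift",
    "openshift": "OpenShift",
    "ROCM": "ROCm",
    "rocm": "ROCm",
}


def valid_filename(name):
    """Return True if name follows the assumed file naming pattern."""
    return FILENAME_RE.fullmatch(name) is not None


def find_product_name_issues(text):
    """Return (line number, found spelling, preferred spelling) tuples."""
    issues = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for wrong, right in PRODUCT_NAMES.items():
            # Case-sensitive whole-word match, so "ROCm" itself is not flagged.
            if re.search(rf"\b{re.escape(wrong)}\b", line):
                issues.append((line_no, wrong, right))
    return issues
```

This sketch does not skip code spans or URLs, so strings such as `rocm/gpu-operator-helm` or `kubernetes.io` would be reported as false positives. A real check would need to exclude them.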
diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md
@@ -0,0 +1,79 @@
+# Known Issues and Limitations
+
+1. **GPU operator driver installs only DKMS package**
+   - ***Impact:*** Applications that require ROCm packages must install those packages separately.
+   - ***Affected Configurations:*** All configurations
+   - ***Workaround:*** None as this is the intended behaviour
+</br></br>
+
+2. **When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install**
+   - ***Impact:*** Due to a ROCm bug, the node requires a reboot when an upgrade is initiated. Driver install failures may appear in dmesg.
+   - ***Affected Configurations:*** Nodes with driver version >= ROCm 6.2.x
+   - ***Workaround:*** Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+.
+</br></br>
+
+3. **GPU Operator is unable to install the amdgpu driver if a driver is already installed**
+   - ***Impact:*** Driver install fails if the in-box amdgpu driver is present or already installed
+   - ***Affected Configurations:*** All configurations
+   - ***Workaround:*** When installing the amdgpu driver using the GPU Operator, worker nodes must either have amdgpu blacklisted or have no amdgpu driver pre-installed. [Blacklist the in-box driver](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded, or remove the pre-installed driver.
+</br></br>
+
+4. **When GPU Operator is used in SKIP driver install mode and the amdgpu module is removed while the device plugin is installed, the device plugin does not reflect the active GPUs available on the server**
+   - ***Impact:*** Workload scheduling is affected: workloads may be scheduled on nodes that do not have an active GPU.
+   - ***Affected Configurations:*** All configurations
+   - ***Workaround:*** Restart the deployed device plugin pod.
+</br></br>
+
+5. **Worker nodes that need a kernel upgrade must be taken out of the cluster and re-added with the Operator installed**
+   - ***Impact:*** Node upgrade will not proceed automatically and requires manual intervention
+   - ***Affected Configurations:*** All configurations
+   - ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:
+
+    ```bash
+    kubectl cordon <node-name>
+    ```
+
+</br>
+
+6. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module**
+   - ***Impact:*** Driver upgrade is blocked
+   - ***Affected Configurations:*** All configurations
+   - ***Workaround:*** Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows:
+
+    1. Label all nodes with a new label:
+
+       ```bash
+       kubectl label nodes --all amd.com/device-metrics-exporter=true
+       ```
+
+    2. Patch DeviceConfig to include new selectors for metrics exporter:
+
+        ```bash
+        kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'
+        ```
+
+    3. Remove the amd.com/device-metrics-exporter label for the specific node you would like to disable the exporter on:
+
+        ```bash
+        kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-
+        ```
+
+</br>
+
+7. **Due to an issue with KMM 2.2, deletion of the DeviceConfig Custom Resource gets stuck in Red Hat OpenShift**
+   - ***Impact:*** The DeviceConfig Custom Resource cannot be deleted if the node reboots during uninstall.
+   - ***Affected Configurations:*** This issue only affects Red Hat OpenShift
+   - ***Workaround:*** This issue will be fixed in the next release of KMM. For the time being, use a version of KMM other than 2.2, or manually remove the status from the NMC:
+    1. List all the NMC resources and pick the correct one (there is one NMC per node, with the same name as the node it relates to).
+
+        ```bash
+        oc get nmc -A
+        ```
+
+    2. Edit the NMC.
+
+        ```bash
+        oc edit nmc <nmc name>
+        ```
+
+    3. Remove all data related to your module from the NMC status and save. This should allow the module to be deleted.
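Reviewer sketch, not part of this PR: the JSON merge patch in the Metrics Exporter workaround (item 6) is easy to misquote in a shell. One option is to build the payload with `json.dumps` and pass it to `kubectl` as a single argument. The resource name `gpu-operator` and namespace `kube-amd-gpu` are taken from the command in item 6. The function names are hypothetical, and the code does not run `kubectl` itself.

```python
import json
import shlex


def exporter_selector_patch():
    """Merge patch that adds the new label to the metrics exporter selector."""
    return {
        "spec": {
            "metricsExporter": {
                "selector": {
                    "feature.node.kubernetes.io/amd-gpu": "true",
                    "amd.com/device-metrics-exporter": "true",
                }
            }
        }
    }


def patch_command(name="gpu-operator", namespace="kube-amd-gpu"):
    """Return the kubectl patch invocation as an argv list (no shell quoting needed)."""
    payload = json.dumps(exporter_selector_patch(), separators=(",", ":"))
    return [
        "kubectl", "patch", "deviceconfig", name,
        "-n", namespace,
        "--type=merge",
        "-p", payload,
    ]


def as_shell(argv):
    """Render argv as one safely quoted shell command line for copy and paste."""
    return shlex.join(argv)
```

The list from `patch_command()` can be passed straight to `subprocess.run(..., check=True)`, while `as_shell(patch_command())` gives a quoted one-liner for documentation.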