From e84675e2937e4a56c2bc3e6d0ba4fe3eeb2e57af Mon Sep 17 00:00:00 2001
From: yansun1996
Date: Fri, 28 Mar 2025 23:23:34 +0000
Subject: [PATCH 01/24] [DOC] Add note that RVS test isn't compatible with
 partitioned GPU yet

---
 docs/test/auto-unhealthy-device-test.md | 4 ++++
 docs/test/manual-test.md                | 4 ++++
 docs/test/pre-start-job-test.md         | 4 ++++
 3 files changed, 12 insertions(+)

diff --git a/docs/test/auto-unhealthy-device-test.md b/docs/test/auto-unhealthy-device-test.md
index 0b6e9cb3..354cc0c7 100644
--- a/docs/test/auto-unhealthy-device-test.md
+++ b/docs/test/auto-unhealthy-device-test.md
@@ -4,6 +4,10 @@

 Test runner is periodically watching for the device health status from device metrics exporter per 30 seconds. Once exporter reported GPU status is unhealthy, test runner will start to run one-time test on the unhealthy GPU. The test result will be exported as Kubernetes event.

+```{warning}
+The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU please disable the test runner from ```DeviceConfig``` by setting ```spec/testRunner/enable``` to ```false```.
+```
+
 ## Configure test runner

 To start the Test Runner along with the GPU Operator, Device Metrics Exporter must be enabled since Test Runner is depending on the exported health status. Configure the ``` spec/metricsExporter/enable ``` field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter and configure the ``` spec/testRunner/enable ``` field in deviceconfig Custom Resource(CR) to enable/disable test runner.

diff --git a/docs/test/manual-test.md b/docs/test/manual-test.md
index c00ac288..7d14f1a9 100644
--- a/docs/test/manual-test.md
+++ b/docs/test/manual-test.md
@@ -4,6 +4,10 @@

 To start the manual test, directly use the test runner image to create the Kubernetes job and related resources, then the test will be triggered manually.

+```{warning}
+The Test Runner's RVS test recipes aren't compatible with partitioned GPU.
If you're using partitioned GPU please reset the GPU partition configuration and run the manual test against the non-partitioned GPU.
+```
+
 ## Use Case 1 - GPU is unhealthy on the node

 When any GPU on a specific worker node is unhealthy, you can manually trigger a test / benchmark run on that worker node to check more details on the unhealthy state. The test job requires RBAC config to grant the test runner access to export events and add node labels to the cluster. Here is an example of configuring the RBAC and Job resources:

diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md
index 2bad5332..f11e765d 100644
--- a/docs/test/pre-start-job-test.md
+++ b/docs/test/pre-start-job-test.md
@@ -4,6 +4,10 @@

 Test runner can be embedded as an init container within your Kubernetes workload pod definition. The init container will be executed before the actual workload containers start, in that way the system could be tested right before the workload start to use the hardware resource.

+```{warning}
+The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU, don't run the test runner as init container to perform the pre-start job test.
+```
+
 ## Configure pre-start init container

 The init container requires RBAC config to grant the pod access to export events and add node labels to the cluster.
Here is an example of configuring the RBAC and Job resources:

From 60959c586f1602f43e4767d96f1b7b4f864c19bd Mon Sep 17 00:00:00 2001
From: yansun1996
Date: Mon, 31 Mar 2025 09:01:10 +0000
Subject: [PATCH 02/24] Address comments

---
 docs/test/auto-unhealthy-device-test.md | 2 +-
 docs/test/manual-test.md                | 2 +-
 docs/test/pre-start-job-test.md         | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/test/auto-unhealthy-device-test.md b/docs/test/auto-unhealthy-device-test.md
index 354cc0c7..c610a32c 100644
--- a/docs/test/auto-unhealthy-device-test.md
+++ b/docs/test/auto-unhealthy-device-test.md
@@ -5,7 +5,7 @@
 Test runner is periodically watching for the device health status from device metrics exporter per 30 seconds. Once exporter reported GPU status is unhealthy, test runner will start to run one-time test on the unhealthy GPU. The test result will be exported as Kubernetes event.

 ```{warning}
-The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU please disable the test runner from ```DeviceConfig``` by setting ```spec/testRunner/enable``` to ```false```.
+The RVS test recipes in the Test Runner aren't compatible with partitioned GPUs. To address this, either disable the test runner by setting ```spec/testRunner/enable``` to ```false```, or configure the test runner to run only on nodes without partitioned GPUs by using ```spec/testRunner/selector```.
 ```

 ## Configure test runner
diff --git a/docs/test/manual-test.md b/docs/test/manual-test.md
index 7d14f1a9..c4ba4bae 100644
--- a/docs/test/manual-test.md
+++ b/docs/test/manual-test.md
@@ -5,7 +5,7 @@
 To start the manual test, directly use the test runner image to create the Kubernetes job and related resources, then the test will be triggered manually.

 ```{warning}
-The Test Runner's RVS test recipes aren't compatible with partitioned GPU.
If you're using partitioned GPU please reset the GPU partition configuration and run the manual test against the non-partitioned GPU.
+The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, please reset the GPU partition configuration and conduct the manual test on a non-partitioned GPU.
 ```

 ## Use Case 1 - GPU is unhealthy on the node
diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md
index f11e765d..d5133faa 100644
--- a/docs/test/pre-start-job-test.md
+++ b/docs/test/pre-start-job-test.md
@@ -5,7 +5,7 @@
 Test runner can be embedded as an init container within your Kubernetes workload pod definition. The init container will be executed before the actual workload containers start, in that way the system could be tested right before the workload start to use the hardware resource.

 ```{warning}
-The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU, don't run the test runner as init container to perform the pre-start job test.
+The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, avoid running the Test Runner as an init container for the pre-start job test.
``` ## Configure pre-start init container From 7a50f27fd6a85109453a6984ad68e21384d64791 Mon Sep 17 00:00:00 2001 From: vm Date: Fri, 28 Mar 2025 04:12:20 +0000 Subject: [PATCH 03/24] BootID support for Reboot during Driver Upgrade --- internal/controllers/mock_upgrademgr.go | 26 +++++++++++++++++++++++++ internal/controllers/upgrademgr.go | 23 ++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/internal/controllers/mock_upgrademgr.go b/internal/controllers/mock_upgrademgr.go index 7db0fa9c..03944030 100644 --- a/internal/controllers/mock_upgrademgr.go +++ b/internal/controllers/mock_upgrademgr.go @@ -216,6 +216,20 @@ func (mr *MockupgradeMgrHelperAPIMockRecorder) deleteRebootPod(ctx, nodeName, dc return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "deleteRebootPod", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).deleteRebootPod), ctx, nodeName, dc, force, genId) } +// getBootID mocks base method. +func (m *MockupgradeMgrHelperAPI) getBootID(nodeName string) string { + m.ctrl.T.Helper() + ret := m.ctrl.Call(m, "getBootID", nodeName) + ret0, _ := ret[0].(string) + return ret0 +} + +// getBootID indicates an expected call of getBootID. +func (mr *MockupgradeMgrHelperAPIMockRecorder) getBootID(nodeName any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "getBootID", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).getBootID), nodeName) +} + // getNode mocks base method. func (m *MockupgradeMgrHelperAPI) getNode(ctx context.Context, nodeName string) (*v1.Node, error) { m.ctrl.T.Helper() @@ -465,6 +479,18 @@ func (mr *MockupgradeMgrHelperAPIMockRecorder) isUpgradePolicyViolated(upgradeIn return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isUpgradePolicyViolated", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).isUpgradePolicyViolated), upgradeInProgress, upgradeFailedState, totalNodes, deviceConfig) } +// setBootID mocks base method. 
+func (m *MockupgradeMgrHelperAPI) setBootID(nodeName, bootID string) { + m.ctrl.T.Helper() + m.ctrl.Call(m, "setBootID", nodeName, bootID) +} + +// setBootID indicates an expected call of setBootID. +func (mr *MockupgradeMgrHelperAPIMockRecorder) setBootID(nodeName, bootID any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "setBootID", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).setBootID), nodeName, bootID) +} + // setNodeStatus mocks base method. func (m *MockupgradeMgrHelperAPI) setNodeStatus(ctx context.Context, nodeName string, status v1alpha1.UpgradeState) { m.ctrl.T.Helper() diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index a5e519b2..ad2ee41c 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -287,6 +287,8 @@ type upgradeMgrHelperAPI interface { setUpgradeStartTime(nodeName string) clearUpgradeStartTime(nodeName string) checkUpgradeTimeExceeded(ctx context.Context, nodeName string, deviceConfig *amdv1alpha1.DeviceConfig) bool + getBootID(nodeName string) string + setBootID(nodeName string, bootID string) clearNodeStatus() isInit() bool } @@ -297,6 +299,7 @@ type upgradeMgrHelper struct { drainHelper *drain.Helper nodeStatus *sync.Map nodeUpgradeStartTime *sync.Map + nodeBootID *sync.Map init bool currentSpec driverSpec } @@ -313,6 +316,7 @@ func newUpgradeMgrHelperHandler(client client.Client, k8sInterface kubernetes.In k8sInterface: k8sInterface, nodeStatus: new(sync.Map), nodeUpgradeStartTime: new(sync.Map), + nodeBootID: new(sync.Map), } } @@ -527,6 +531,18 @@ func (h *upgradeMgrHelper) checkUpgradeTimeExceeded(ctx context.Context, nodeNam return false } +func (h *upgradeMgrHelper) getBootID(nodeName string) string { + if value, ok := h.nodeBootID.Load(nodeName); ok { + return value.(string) + } + + return "" +} + +func (h *upgradeMgrHelper) setBootID(nodeName string, currentbootID string) { + 
h.nodeBootID.Store(nodeName, currentbootID) +} + func (h *upgradeMgrHelper) getNodeStatus(nodeName string) amdv1alpha1.UpgradeState { if value, ok := h.nodeStatus.Load(nodeName); ok { @@ -867,6 +883,8 @@ func (h *upgradeMgrHelper) handleNodeReboot(ctx context.Context, node *v1.Node, // Wait for the driver upgrade to complete waitForDriverUpgrade() + currentBootID := node.Status.NodeInfo.BootID + h.setBootID(node.Name, currentBootID) if err := h.client.Create(ctx, rebootPod); err != nil { logger.Error(err, fmt.Sprintf("Node: %v State: %v RebootPod Create failed with Error: %v", node.Name, h.getNodeStatus(node.Name), err)) // Mark the state as failed @@ -888,6 +906,11 @@ func (h *upgradeMgrHelper) handleNodeReboot(ctx context.Context, node *v1.Node, } } + if nodeObj.Status.NodeInfo.BootID != h.getBootID(node.Name) { + h.setBootID(node.Name, nodeObj.Status.NodeInfo.BootID) + logger.Info(fmt.Sprintf("Node: %v has rebooted", node.Name)) + return + } // If node is NotReady, proceed; otherwise, wait for the next tick if nodeNotReady { logger.Info(fmt.Sprintf("Node: %v has moved to NotReady", node.Name)) From b140f1f815f1ded37262af7820d3e05f7de42ff3 Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 26 Mar 2025 07:09:14 +0000 Subject: [PATCH 04/24] Device Plugin Usage documentation from GPU Operator --- docs/device_plugin/device-plugin.md | 112 ++++++++++++++++++++++++++++ docs/sphinx/_toc.yml | 3 + docs/sphinx/_toc.yml.in | 3 + 3 files changed, 118 insertions(+) create mode 100644 docs/device_plugin/device-plugin.md diff --git a/docs/device_plugin/device-plugin.md b/docs/device_plugin/device-plugin.md new file mode 100644 index 00000000..4ecfb97b --- /dev/null +++ b/docs/device_plugin/device-plugin.md @@ -0,0 +1,112 @@ +# Device Plugin + +## Configure device plugin + +To start the Device Plugin along with the GPU Operator configure fields under the ``` spec/devicePlugin ``` field in deviceconfig Custom Resource(CR) + +```yaml + devicePlugin: + # Specify the device plugin image 
+  # default value is rocm/k8s-device-plugin:latest
+  devicePluginImage: rocm/k8s-device-plugin:latest
+
+  # The device plugin arguments are used to pass supported flags and their values when starting the device plugin daemonset
+  devicePluginArguments:
+    resource_naming_strategy: single
+
+  # Specify the node labeller image
+  # default value is rocm/k8s-device-plugin:labeller-latest
+  nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
+
+  # Specify whether to bring up the node labeller component
+  # default value is true
+  enableNodeLabeller: true
+
+```
+
+The **device-plugin** pods start after the **DeviceConfig** CR is updated:
+
+```bash
+#kubectl get pods -n kube-amd-gpu
+NAME                                                              READY   STATUS    RESTARTS   AGE
+amd-gpu-operator-gpu-operator-charts-controller-manager-77tpmgn   1/1     Running   0          4h9m
+amd-gpu-operator-kmm-controller-6d459dffcf-lbgtt                  1/1     Running   0          4h9m
+amd-gpu-operator-kmm-webhook-server-5fdc8b995-qgj49               1/1     Running   0          4h9m
+amd-gpu-operator-node-feature-discovery-gc-78989c896-7lh8t        1/1     Running   0          3h48m
+amd-gpu-operator-node-feature-discovery-master-b8bffc48b-6rnz6    1/1     Running   0          4h9m
+amd-gpu-operator-node-feature-discovery-worker-m9lwn              1/1     Running   0          4h9m
+test-deviceconfig-device-plugin-rk5f4                             1/1     Running   0          134m
+test-deviceconfig-node-labeller-bxk7x                             1/1     Running   0          134m
+```
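As a usage sketch for reviewers (illustrative, not part of this patch): workloads consume whatever resource names the device plugin advertises through the pod spec `resources` block. The pod name, container image, and command below are assumptions; only the `amd.com/gpu` resource name comes from this documentation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: rocm-smoke
      image: rocm/dev-ubuntu-22.04   # illustrative image
      command: ["rocm-smi"]
      resources:
        limits:
          amd.com/gpu: 1         # resource name exposed by the device plugin
```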
+
+Note: The Device Plugin name will be prefixed with the name of your DeviceConfig custom resource.
+
+## Device Plugin DeviceConfig
+
+| Field Name | Details |
+|----------------------------------|----------------------------------------------|
+| **DevicePluginImage** | Device plugin image |
+| **DevicePluginImagePullPolicy** | One of Always, Never, IfNotPresent. |
+| **NodeLabellerImage** | Node labeller image |
+| **NodeLabellerImagePullPolicy** | One of Always, Never, IfNotPresent. |
+| **EnableNodeLabeller** | Enable/Disable node labeller with True/False |
+| **DevicePluginArguments** | The flag/values to pass on to Device Plugin |
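A hedged sketch (reviewer note, not part of the patch) tying the fields in the table above together in one `DeviceConfig` fragment; the field paths follow the table, while the concrete values are illustrative:

```yaml
devicePlugin:
  devicePluginImage: rocm/k8s-device-plugin:latest
  devicePluginImagePullPolicy: IfNotPresent   # Always / Never / IfNotPresent
  nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  nodeLabellerImagePullPolicy: IfNotPresent
  enableNodeLabeller: true
  devicePluginArguments:
    resource_naming_strategy: mixed           # or "single" (the default)
```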
+
+1. Both the `ImagePullPolicy` fields default to `Always` if the `:latest` tag is specified on the respective image, and default to `IfNotPresent` otherwise. This is the default Kubernetes behavior for `ImagePullPolicy`.
+
+2. `DevicePluginArguments` is of type `map[string]string`. The currently supported key-value pairs to set under `DevicePluginArguments` are:
+   -> "resource_naming_strategy": {"single", "mixed"}
+
+## How to choose Resource Naming Strategy
+
+To customize the way the device plugin reports GPU resources to Kubernetes as allocatable resources, use the `single` or `mixed` resource naming strategy in the **DeviceConfig** CR.
+Before choosing a strategy, please note the definitions of homogeneous and heterogeneous nodes:
+
+Homogeneous node: A node whose GPUs all follow the same compute-memory partition style.
+   -> Example: A node of 8 GPUs where all 8 GPUs follow the CPX-NPS4 partition style
+
+Heterogeneous node: A node whose GPUs follow different compute-memory partition styles.
+   -> Example: A node of 8 GPUs where 5 GPUs follow SPX-NPS1 and 3 GPUs follow CPX-NPS1
+
+### Single
+
+In `single` mode, the device plugin reports all GPUs (regardless of whether they are whole GPUs or partitions of a GPU) under the resource name `amd.com/gpu`.
+This mode is supported for homogeneous nodes but not for heterogeneous nodes.
+
+A node which has 8 GPUs where none of the GPUs are partitioned will report its resources as:
+
+```bash
+amd.com/gpu: 8
+```
+
+A node which has 8 GPUs where all GPUs are partitioned using the CPX-NPS4 style will report its resources as:
+
+```bash
+amd.com/gpu: 64
+```
+
+### Mixed
+
+In `mixed` mode, the device plugin reports each GPU under a name which matches its partition style.
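Reviewer note (illustrative only, not part of the patch): the counting rules this documentation describes can be sketched as a small pure function. The `gpu` struct and partition counts here are assumptions made for the sketch; the resource names and the 8-GPU examples follow the documentation text.

```go
package main

import "fmt"

// gpu describes one physical GPU for this sketch: its partition style
// and how many partitions it currently exposes (1 when unpartitioned).
type gpu struct {
	style      string // e.g. "spx_nps1", "cpx_nps4"; "" when unpartitioned
	partitions int
}

// reportedResources mimics the two naming strategies: "single" sums
// everything under amd.com/gpu, "mixed" groups by partition style.
func reportedResources(strategy string, gpus []gpu) map[string]int {
	out := map[string]int{}
	for _, g := range gpus {
		name := "amd.com/gpu"
		if strategy == "mixed" && g.style != "" {
			name = "amd.com/" + g.style
		}
		out[name] += g.partitions
	}
	return out
}

func main() {
	// 8 GPUs, all CPX-NPS4 (8 compute partitions each) under "single":
	// everything is summed as amd.com/gpu: 64
	homog := make([]gpu, 8)
	for i := range homog {
		homog[i] = gpu{style: "cpx_nps4", partitions: 8}
	}
	fmt.Println(reportedResources("single", homog))

	// 5 SPX-NPS1 GPUs + 3 CPX-NPS1 GPUs under "mixed":
	// amd.com/spx_nps1 counts 5, amd.com/cpx_nps1 counts 3*8 = 24
	var het []gpu
	for i := 0; i < 5; i++ {
		het = append(het, gpu{style: "spx_nps1", partitions: 1})
	}
	for i := 0; i < 3; i++ {
		het = append(het, gpu{style: "cpx_nps1", partitions: 8})
	}
	fmt.Println(reportedResources("mixed", het))
}
```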
+This mode is supported for both homogeneous nodes and heterogeneous nodes.
+
+A node which has 8 GPUs which are all partitioned using the CPX-NPS4 style will report its resources as:
+
+```bash
+amd.com/cpx_nps4: 64
+```
+
+A node which has 8 GPUs where 5 GPUs follow SPX-NPS1 and 3 GPUs follow CPX-NPS1 will report its resources as:
+
+```bash
+amd.com/spx_nps1: 5
+amd.com/cpx_nps1: 24
+```
+
+#### **Notes**
+
+- If `resource_naming_strategy` is not passed using the `DevicePluginArguments` field in the CR, then the device plugin will internally default to the `single` resource naming strategy. This maintains backwards compatibility with earlier releases of the device plugin, which reported the resource name `amd.com/gpu`.
+- If a node has GPUs which do not support partitioning, such as MI210, then the GPUs are reported under the resource name `amd.com/gpu` regardless of the resource naming strategy.
+- These resource names, for example `amd.com/cpx_nps1`, must be used when requesting resources in a pod spec.
\ No newline at end of file
diff --git a/docs/sphinx/_toc.yml b/docs/sphinx/_toc.yml
index a232e7ab..62786ea4 100644
--- a/docs/sphinx/_toc.yml
+++ b/docs/sphinx/_toc.yml
@@ -44,6 +44,9 @@ subtrees:
       - file: test/manual-test
       - file: test/pre-start-job-test
       - file: test/appendix-test-recipe
+  - caption: Device Plugin
+    entries:
+      - file: device_plugin/device-plugin
   - caption: Specialized Networks
     entries:
       - file: specialized_networks/airgapped-install
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index a232e7ab..62786ea4 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -44,6 +44,9 @@ subtrees:
       - file: test/manual-test
       - file: test/pre-start-job-test
       - file: test/appendix-test-recipe
+  - caption: Device Plugin
+    entries:
+      - file: device_plugin/device-plugin
   - caption: Specialized Networks
     entries:
       - file: specialized_networks/airgapped-install

From 1a99bba075542ee6a04411d909f82a638a10c5a5 Mon Sep 17 00:00:00 2001
From: yansun1996
Date: Wed, 26 Mar 2025 20:13:14 +0000
Subject: [PATCH 05/24] Optimize the docs and filename for blacklist function

---
 api/v1alpha1/deviceconfig_types.go               |  4 +++-
 .../amd-gpu-operator.clusterserviceversion.yaml  |  7 +++++--
 bundle/manifests/amd.com_deviceconfigs.yaml      |  5 ++++-
 config/crd/bases/amd.com_deviceconfigs.yaml      |  5 ++++-
 .../amd-gpu-operator.clusterserviceversion.yaml  |  5 ++++-
 helm-charts-k8s/Chart.lock                       |  2 +-
 helm-charts-k8s/crds/deviceconfig-crd.yaml       |  5 ++++-
 helm-charts-openshift/Chart.lock                 |  2 +-
 helm-charts-openshift/crds/deviceconfig-crd.yaml |  5 ++++-
 internal/nodelabeller/nodelabeller.go            | 12 +++++++++---
 10 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/api/v1alpha1/deviceconfig_types.go b/api/v1alpha1/deviceconfig_types.go
index 503c0939..b6f186c0 100644
--- a/api/v1alpha1/deviceconfig_types.go
+++ b/api/v1alpha1/deviceconfig_types.go
@@ -94,7 +94,9 @@ type DriverSpec struct {
 	// +kubebuilder:default=true
 	Enable *bool `json:"enable,omitempty"`

-	// blacklist amdgpu drivers on the host
+	// blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+	// This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+	// Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
 	//+operator-sdk:csv:customresourcedefinitions:type=spec,displayName="BlacklistDrivers",xDescriptors={"urn:alm:descriptor:com.amd.deviceconfigs:blacklistDrivers"}
 	Blacklist *bool `json:"blacklist,omitempty"`

diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
index 45078acb..3a6cd86b 100644
--- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
+++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
@@ -30,7 +30,7 @@ metadata:
         }
       ]
     capabilities: Basic Install
-    createdAt: "2025-03-25T06:19:27Z"
+    createdAt: "2025-03-26T20:10:59Z"
    operatorframework.io/suggested-namespace: openshift-amd-gpu
    operators.operatorframework.io/builder: operator-sdk-v1.32.0
    operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
    repository: https://github.com/ROCm/gpu-operator
@@ -229,7 +229,10 @@ spec:
        path: driver.amdgpuInstallerRepoURL
        x-descriptors:
        - urn:alm:descriptor:com.amd.deviceconfigs:amdgpuInstallerRepoURL
-      - description: blacklist amdgpu drivers on the host
+      - description: blacklist amdgpu drivers on the host. Node reboot is required
+          to apply the blacklist on the worker nodes. This does not work on OpenShift
+          clusters; OpenShift users should use the Machine Config Operator (MCO)
+          resource to configure the amdgpu blacklist.
Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
        displayName: BlacklistDrivers
        path: driver.blacklist
        x-descriptors:
diff --git a/bundle/manifests/amd.com_deviceconfigs.yaml b/bundle/manifests/amd.com_deviceconfigs.yaml
index c9123ffe..d2669dc1 100644
--- a/bundle/manifests/amd.com_deviceconfigs.yaml
+++ b/bundle/manifests/amd.com_deviceconfigs.yaml
@@ -342,7 +342,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+                Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
              type: boolean
            enable:
              default: true
diff --git a/config/crd/bases/amd.com_deviceconfigs.yaml b/config/crd/bases/amd.com_deviceconfigs.yaml
index 24c2b053..7916a7e6 100644
--- a/config/crd/bases/amd.com_deviceconfigs.yaml
+++ b/config/crd/bases/amd.com_deviceconfigs.yaml
@@ -338,7 +338,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+                Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
              type: boolean
            enable:
              default: true
diff --git a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml
index a9f4d685..f91b8a24 100644
--- a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml
+++ b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml
@@ -200,7 +200,10 @@ spec:
        path: driver.amdgpuInstallerRepoURL
        x-descriptors:
        - urn:alm:descriptor:com.amd.deviceconfigs:amdgpuInstallerRepoURL
-      - description: blacklist amdgpu drivers on the host
+      - description: blacklist amdgpu drivers on the host. Node reboot is required
+          to apply the blacklist on the worker nodes. This does not work on OpenShift
+          clusters; OpenShift users should use the Machine Config Operator (MCO)
+          resource to configure the amdgpu blacklist.
Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
        displayName: BlacklistDrivers
        path: driver.blacklist
        x-descriptors:
diff --git a/helm-charts-k8s/Chart.lock b/helm-charts-k8s/Chart.lock
index 54b4cb8c..f42b6cfb 100644
--- a/helm-charts-k8s/Chart.lock
+++ b/helm-charts-k8s/Chart.lock
@@ -6,4 +6,4 @@ dependencies:
   repository: file://./charts/kmm
   version: v1.0.0
 digest: sha256:f9a315dd2ce3d515ebf28c8e9a6a82158b493ca2686439ec381487761261b597
-generated: "2025-03-25T06:19:17.248998622Z"
+generated: "2025-03-26T20:10:45.247725094Z"
diff --git a/helm-charts-k8s/crds/deviceconfig-crd.yaml b/helm-charts-k8s/crds/deviceconfig-crd.yaml
index 502f4b89..81c564c1 100644
--- a/helm-charts-k8s/crds/deviceconfig-crd.yaml
+++ b/helm-charts-k8s/crds/deviceconfig-crd.yaml
@@ -346,7 +346,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+                Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
              type: boolean
            enable:
              default: true
diff --git a/helm-charts-openshift/Chart.lock b/helm-charts-openshift/Chart.lock
index 6e9b718d..8eb0ba07 100644
--- a/helm-charts-openshift/Chart.lock
+++ b/helm-charts-openshift/Chart.lock
@@ -6,4 +6,4 @@ dependencies:
   repository: file://./charts/kmm
   version: v1.0.0
 digest: sha256:25200c34a5cc846a1275e5bf3fc637b19e909dc68de938189c5278d77d03f5ac
-generated: "2025-03-25T06:19:26.060856628Z"
+generated: "2025-03-26T20:10:56.781691243Z"
diff --git a/helm-charts-openshift/crds/deviceconfig-crd.yaml b/helm-charts-openshift/crds/deviceconfig-crd.yaml
index 502f4b89..81c564c1 100644
--- a/helm-charts-openshift/crds/deviceconfig-crd.yaml
+++ b/helm-charts-openshift/crds/deviceconfig-crd.yaml
@@ -346,7 +346,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+ Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module type: boolean enable: default: true diff --git a/internal/nodelabeller/nodelabeller.go b/internal/nodelabeller/nodelabeller.go index 959bf39f..81293fd9 100644 --- a/internal/nodelabeller/nodelabeller.go +++ b/internal/nodelabeller/nodelabeller.go @@ -52,6 +52,8 @@ const ( defaultNodeLabellerImage = "rocm/k8s-device-plugin:labeller-latest" defaultUbiNodeLabellerImage = "rocm/k8s-node-labeller:rhubi-latest" defaultInitContainerImage = "busybox:1.36" + defaultBlacklistFileName = "blacklist-amdgpu.conf" + openShiftBlacklistFileName = "blacklist-amdgpu-by-operator.conf" ) //go:generate mockgen -source=nodelabeller.go -package=nodelabeller -destination=mock_nodelabeller.go NodeLabeller @@ -129,15 +131,19 @@ func (nl *nodeLabeller) SetNodeLabellerAsDesired(ds *appsv1.DaemonSet, devConfig }, } - var initContainerCommand []string + blackListFileName := defaultBlacklistFileName + if nl.isOpenShift { + blackListFileName = openShiftBlacklistFileName + } + var initContainerCommand []string if devConfig.Spec.Driver.Blacklist != nil && *devConfig.Spec.Driver.Blacklist { // if users want to apply the blacklist, init container will add the amdgpu to the blacklist - initContainerCommand = []string{"sh", "-c", "echo \"# added by gpu operator \nblacklist amdgpu\" > /host-etc/modprobe.d/blacklist-amdgpu.conf; while [ ! -d /host-sys/class/kfd ] || [ ! -d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done"} + initContainerCommand = []string{"sh", "-c", fmt.Sprintf("echo \"# added by gpu operator \nblacklist amdgpu\" > /host-etc/modprobe.d/%v; while [ ! -d /host-sys/class/kfd ] || [ ! 
-d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done", blackListFileName)} } else { // if users disabled the KMM driver, or disabled the blacklist // init container will remove any hanging amdgpu blacklist entry from the list - initContainerCommand = []string{"sh", "-c", "rm -f /host-etc/modprobe.d/blacklist-amdgpu.conf; while [ ! -d /host-sys/class/kfd ] || [ ! -d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done"} + initContainerCommand = []string{"sh", "-c", fmt.Sprintf("rm -f /host-etc/modprobe.d/%v; while [ ! -d /host-sys/class/kfd ] || [ ! -d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done", blackListFileName)} } initContainerImage := defaultInitContainerImage From 042ba4868fa6e007c89757b73ade21b7f61a9dc0 Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 2 Apr 2025 05:37:22 +0000 Subject: [PATCH 06/24] Rhubi based utils container --- internal/utils_container/Dockerfile | 36 ++++++----------------------- 1 file changed, 7 insertions(+), 29 deletions(-) diff --git a/internal/utils_container/Dockerfile b/internal/utils_container/Dockerfile index 59e84fda..ada5a760 100644 --- a/internal/utils_container/Dockerfile +++ b/internal/utils_container/Dockerfile @@ -1,31 +1,9 @@ -# Base image -FROM alpine:3.20.3 +FROM registry.access.redhat.com/ubi9/ubi:9.3 -# Install build dependencies -RUN apk add --no-cache \ - bash \ - build-base \ - automake \ - autoconf \ - libtool \ - pkgconfig \ - gettext-dev \ - bison \ - wget \ - tar \ - flex \ - linux-headers +# Install nsenter from util-linux package +RUN dnf install -y util-linux && \ + cp /usr/bin/nsenter /nsenter && \ + dnf clean all -# Set working directory -WORKDIR /tmp - -RUN wget https://github.com/util-linux/util-linux/archive/v2.40.tar.gz && tar -xzf v2.40.tar.gz - -# Build and install nsenter only -WORKDIR /tmp/util-linux-2.40 -RUN ./autogen.sh && \ - ./configure --disable-all-programs 
--enable-nsenter && \ - make nsenter && \ - cp nsenter /nsenter - -ENTRYPOINT ["/nsenter"] +# Set entrypoint to nsenter +ENTRYPOINT ["/nsenter"] \ No newline at end of file From 027cb95d5253dea6f0b21196937811cb05eabc5a Mon Sep 17 00:00:00 2001 From: Sriram Ravishankar <79412470+sriram-30@users.noreply.github.com> Date: Wed, 2 Apr 2025 11:25:23 +0530 Subject: [PATCH 07/24] use ubi minimal image for smaller size --- internal/utils_container/Dockerfile | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/internal/utils_container/Dockerfile b/internal/utils_container/Dockerfile index ada5a760..a40f740b 100644 --- a/internal/utils_container/Dockerfile +++ b/internal/utils_container/Dockerfile @@ -1,9 +1,9 @@ -FROM registry.access.redhat.com/ubi9/ubi:9.3 +FROM registry.access.redhat.com/ubi9/ubi-minimal:9.3 # Install nsenter from util-linux package -RUN dnf install -y util-linux && \ +RUN microdnf install -y util-linux && \ cp /usr/bin/nsenter /nsenter && \ - dnf clean all + microdnf clean all # Set entrypoint to nsenter -ENTRYPOINT ["/nsenter"] \ No newline at end of file +ENTRYPOINT ["/nsenter"] From 51e8a3ee2a33153a8409c10779c774066c69d922 Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Wed, 2 Apr 2025 23:23:03 +0000 Subject: [PATCH 08/24] Push OLM changes for certification on OperatorHub --- ...md-gpu-operator.clusterserviceversion.yaml | 47 +++++++++++++++---- ...md-gpu-operator.clusterserviceversion.yaml | 45 +++++++++++++++--- 2 files changed, 77 insertions(+), 15 deletions(-) diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml index 3a6cd86b..134634b9 100644 --- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml +++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml @@ -29,12 +29,30 @@ metadata: } } ] - capabilities: Basic Install - createdAt: "2025-03-26T20:10:59Z" + capabilities: Seamless Upgrades + categories: AI/Machine 
Learning,Monitoring + containerImage: docker.io/rocm/gpu-operator:v1.2.0 + createdAt: "2025-04-02T23:22:18Z" + description: |- + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter + For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) + devicePluginImage: docker.io/rocm/k8s-device-plugin:rhubi-latest + features.operators.openshift.io/disconnected: "true" + features.operators.openshift.io/fips-compliant: "false" + features.operators.openshift.io/proxy-aware: "true" + features.operators.openshift.io/tls-profiles: "false" + features.operators.openshift.io/token-auth-aws: "false" + features.operators.openshift.io/token-auth-azure: "false" + features.operators.openshift.io/token-auth-gcp: "false" + metricsExporterImage: docker.io/rocm/device-metrics-exporter:v1.2.0 + nodelabellerImage: docker.io/rocm/k8s-device-plugin:labeller-rhubi-latest + operatorframework.io/cluster-monitoring: "true" operatorframework.io/suggested-namespace: openshift-amd-gpu + operators.openshift.io/valid-subscription: '[]' operators.operatorframework.io/builder: operator-sdk-v1.32.0 operators.operatorframework.io/project_layout: go.kubebuilder.io/v3 repository: https://github.com/ROCm/gpu-operator + support: Advanced Micro Devices, Inc. 
name: amd-gpu-operator.v1.2.0 namespace: placeholder spec: @@ -611,7 +629,7 @@ spec: - urn:alm:descriptor:com.amd.deviceconfigs:nodeModuleStatus version: v1alpha1 description: |- - Operator responsible for deploying AMD GPU kernel drivers and device plugin + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) displayName: amd-gpu-operator icon: @@ -1115,11 +1133,24 @@ spec: - supported: true type: AllNamespaces keywords: - - amd-gpu-operator + - AMD + - GPU + - AI + - Deep Learning + - Hardware + - Driver + - Monitoring links: - - name: Amd Gpu Operator - url: https://amd-gpu-operator.domain - maturity: alpha + - name: AMD GPU Operator + url: https://github.com/ROCm/gpu-operator + maintainers: + - email: Yan.Sun3@amd.com + name: Yan Sun + - email: farshad.ghodsian@amd.com + name: Farshad Ghodsian + - email: shrey.ajmera@amd.com + name: Shrey Ajmera + maturity: stable provider: - name: amd-gpu-operator + name: Advanced Micro Devices, Inc. 
version: 1.2.0 diff --git a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml index f91b8a24..878483bd 100644 --- a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml +++ b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml @@ -3,9 +3,27 @@ kind: ClusterServiceVersion metadata: annotations: alm-examples: '[]' - capabilities: Basic Install + capabilities: Seamless Upgrades + categories: AI/Machine Learning,Monitoring + containerImage: docker.io/rocm/gpu-operator:v1.2.0 + description: |- + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter + For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) + devicePluginImage: docker.io/rocm/k8s-device-plugin:rhubi-latest + features.operators.openshift.io/disconnected: "true" + features.operators.openshift.io/fips-compliant: "false" + features.operators.openshift.io/proxy-aware: "true" + features.operators.openshift.io/tls-profiles: "false" + features.operators.openshift.io/token-auth-aws: "false" + features.operators.openshift.io/token-auth-azure: "false" + features.operators.openshift.io/token-auth-gcp: "false" + metricsExporterImage: docker.io/rocm/device-metrics-exporter:v1.2.0 + nodelabellerImage: docker.io/rocm/k8s-device-plugin:labeller-rhubi-latest + operatorframework.io/cluster-monitoring: "true" operatorframework.io/suggested-namespace: openshift-amd-gpu + operators.openshift.io/valid-subscription: '[]' repository: https://github.com/ROCm/gpu-operator + support: Advanced Micro Devices, Inc. 
name: amd-gpu-operator.v0.0.0 namespace: placeholder spec: @@ -582,7 +600,7 @@ spec: - urn:alm:descriptor:com.amd.deviceconfigs:nodeModuleStatus version: v1alpha1 description: |- - Operator responsible for deploying AMD GPU kernel drivers and device plugin + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) displayName: amd-gpu-operator icon: @@ -602,11 +620,24 @@ spec: - supported: true type: AllNamespaces keywords: - - amd-gpu-operator + - AMD + - GPU + - AI + - Deep Learning + - Hardware + - Driver + - Monitoring links: - - name: Amd Gpu Operator - url: https://amd-gpu-operator.domain - maturity: alpha + - name: AMD GPU Operator + url: https://github.com/ROCm/gpu-operator + maintainers: + - email: Yan.Sun3@amd.com + name: Yan Sun + - email: farshad.ghodsian@amd.com + name: Farshad Ghodsian + - email: shrey.ajmera@amd.com + name: Shrey Ajmera + maturity: stable provider: - name: amd-gpu-operator + name: Advanced Micro Devices, Inc. 
version: 0.0.0 From 6490f63cdd04ea31dcf2752f25f62ee5df5ea66e Mon Sep 17 00:00:00 2001 From: im-AbhiP <8828883+im-AbhiP@users.noreply.github.com> Date: Thu, 3 Apr 2025 19:10:37 -0700 Subject: [PATCH 09/24] New doc additions to metric and test runner section (#112) * Added Test Runner overview page, ECC error injection test page, compatibility matrix on index page, added missing intramfs rebuild step on Driver Installation page, updated the TOC to reflect new additions * Fixed linting/markdown errors --- docs/drivers/installation.md | 7 + docs/index.md | 42 +++++- docs/metrics/ecc-error-injection.md | 199 ++++++++++++++++++++++++++++ docs/sphinx/_toc.yml.in | 3 + docs/test/test-runner-overview.md | 34 +++++ 5 files changed, 283 insertions(+), 2 deletions(-) create mode 100644 docs/metrics/ecc-error-injection.md create mode 100644 docs/test/test-runner-overview.md diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index 890da553..ead38e4d 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -18,12 +18,19 @@ Before installing the AMD GPU driver: Before installing the out-of-tree AMD GPU driver, you must blacklist the inbox AMD GPU driver: +- These commands need to either be run as `root` or by using `sudo` - Create blacklist configuration file on worker nodes: ```bash echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf ``` +- After blacklist configuration file, you need to rebuild the initramfs for the change to take effect: + +```bash +echo update-initramfs -u -k all +``` + - Reboot the worker node to apply the blacklist - Verify the blacklisting: diff --git a/docs/index.md b/docs/index.md index 3a8340ea..9348b933 100644 --- a/docs/index.md +++ b/docs/index.md @@ -13,8 +13,46 @@ The AMD GPU Operator simplifies the deployment and management of AMD Instinct GP ## Compatibility -- **Kubernetes**: 1.29.0 -- Please refer to the [ROCm 
documentation](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatibility matrix for the AMD GPU DKMS driver. +### Supported Hardware + +| **GPUs** | | +| --- | --- | +| AMD Instinct™ MI300X | ✅ Supported | +| AMD Instinct™ MI250 | ✅ Supported | +| AMD Instinct™ MI210 | ✅ Supported | + +### OS & Platform Support Matrix + +Below is a matrix of supported Operating systems and the corresponding Kubernetes version that have been validated to work. We will continue to add more Operating Systems and future versions of Kubernetes with each release of the AMD GPU Operator and Metrics Exporter. + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Operating System | Kubernetes | Red Hat OpenShift |
+| --- | --- | --- |
+| Ubuntu 22.04 LTS | 1.29—1.31 | |
+| Ubuntu 24.04 LTS | 1.29—1.31 | |
+| Red Hat Core OS (RHCOS) | | 4.16—4.17 |
+ +Please refer to the [ROCM documentaiton](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatability matrix for the AMD GPU DKMS driver. ## Prerequisites diff --git a/docs/metrics/ecc-error-injection.md b/docs/metrics/ecc-error-injection.md new file mode 100644 index 00000000..f3f17926 --- /dev/null +++ b/docs/metrics/ecc-error-injection.md @@ -0,0 +1,199 @@ +## ECC Error Injection Testing + +The Metric Exporter has the capability to check for unhealthy GPUs via the monitoring of ECC Errors that can occur when a GPU is not functioning as expected. When an ECC error is detected the Metrics Exporter will now mark the offending GPU as unhealthy and add a node label to indicate which GPU on the node is unhealthy. The Kubernetes Device Plugin also listens to the health metrics coming from the Metrics Exporter to determine GPU status, marking GPUs as schedulable if healthy and unschedulable if unhealthy. + +This health check workflow runs automatically on every node the Device Metrics Exporter is running on, with the Metrics Exporter polling GPUs every 30 seconds and the device plugin checking health status at the same interval, ensuring updates within one minute. Users can customize the default ECC error threshold (set to 0) via the `HealthThresholds` field in the metrics exporter ConfigMap. As part of this workflow healthy GPUs are made available for Kubernetes job scheduling, while ensuring no new jobs are scheduled on an unhealthy GPUs. + +## To do error injection follow these steps + +We have added a new `metricsclient` to the Device Metrics Exporter pod that can be used to inject ECC errors into an otherwise healthy GPU for testing the above health check workflow. This is fairly simple and don't worry this does not harm your GPU as any errors that are being injected are debugging in nature and not real errors. The steps to do this have been outlined below: + +### 1. 
Set Node Name + +Use an environment variable to set the Kubernetes node name to indicate which node you want to test error injection on: + +```bash +NODE_NAME= +``` + +Replace with the name of the node you want to test. If you are running this from the same node you want to test you can grab the hostname using: + +```bash +NODE_NAME=$(hostname) +``` + +### 2. Set Metrics Exporter Pod Name + +Since you have to execute the `metricsclient` from directly within the Device Metrics Exporter pod we need to get the Metrics Exporter pod name running on the node: + +```bash +METRICS_POD=$(kubectl get pods -n kube-amd-gpu --field-selector spec.nodeName=$NODE_NAME --no-headers -o custom-columns=":metadata.name" | grep '^gpu-operator-metrics-exporter-' | head -n 1) +``` + +### 3. Check Metrics Client to see GPU Health + +Now that you have the name of the metrics exporter pod you can use the metricsclient to check the current health of all GPUs on the node: + +```bash +kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- metricsclient +``` + +You should see a list of all the GPUs on that node along with their corresponding status. In most cases all GPUs should report as being `healthy`. + +```bash +ID Health Associated Workload +------------------------------------------------ +1 healthy [] +0 healthy [] +7 healthy [] +6 healthy [] +5 healthy [] +4 healthy [] +3 healthy [] +2 healthy [] +------------------------------------------------ +``` + +### 4. Inject ECC Errors on GPU 0 + +In order to simulate errors on a GPU we will be using a json file that specifies a GPU ID along with counters for several ECC Uncorrectable error fields that are being monitored by the Device Metrics Exporter. In the below example you can see that we are specifying `GPU 0` and injecting 1 `GPU_ECC_UNCORRECT_SEM` error and 2 `GPU_ECC_UNCORRECT_FUSE` errors. We use the `metricslient -ecc-file-path ` command to specify the json file we want to inject into the metrics table. 
To create the json file and execute the metricsclient command all in in one go run the following: + +```bash +kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- sh -c 'cat > /tmp/ecc.json < /tmp/delete_ecc.json < Date: Thu, 3 Apr 2025 19:27:21 -0700 Subject: [PATCH 10/24] Revert "New doc additions to metric and test runner section (#112)" (#113) This reverts commit 249d688f519f363cf7698db132d6e3ab4be34a27. --- docs/drivers/installation.md | 7 - docs/index.md | 42 +----- docs/metrics/ecc-error-injection.md | 199 ---------------------------- docs/sphinx/_toc.yml.in | 3 - docs/test/test-runner-overview.md | 34 ----- 5 files changed, 2 insertions(+), 283 deletions(-) delete mode 100644 docs/metrics/ecc-error-injection.md delete mode 100644 docs/test/test-runner-overview.md diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index ead38e4d..890da553 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -18,19 +18,12 @@ Before installing the AMD GPU driver: Before installing the out-of-tree AMD GPU driver, you must blacklist the inbox AMD GPU driver: -- These commands need to either be run as `root` or by using `sudo` - Create blacklist configuration file on worker nodes: ```bash echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf ``` -- After blacklist configuration file, you need to rebuild the initramfs for the change to take effect: - -```bash -echo update-initramfs -u -k all -``` - - Reboot the worker node to apply the blacklist - Verify the blacklisting: diff --git a/docs/index.md b/docs/index.md index 9348b933..3a8340ea 100644 --- a/docs/index.md +++ b/docs/index.md @@ -13,46 +13,8 @@ The AMD GPU Operator simplifies the deployment and management of AMD Instinct GP ## Compatibility -### Supported Hardware - -| **GPUs** | | -| --- | --- | -| AMD Instinct™ MI300X | ✅ Supported | -| AMD Instinct™ MI250 | ✅ Supported | -| AMD Instinct™ MI210 | ✅ Supported | - -### OS & Platform Support 
Matrix - -Below is a matrix of supported Operating systems and the corresponding Kubernetes version that have been validated to work. We will continue to add more Operating Systems and future versions of Kubernetes with each release of the AMD GPU Operator and Metrics Exporter. - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Operating System | Kubernetes | Red Hat OpenShift |
-| --- | --- | --- |
-| Ubuntu 22.04 LTS | 1.29—1.31 | |
-| Ubuntu 24.04 LTS | 1.29—1.31 | |
-| Red Hat Core OS (RHCOS) | | 4.16—4.17 |
- -Please refer to the [ROCM documentaiton](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatability matrix for the AMD GPU DKMS driver. +- **Kubernetes**: 1.29.0 +- Please refer to the [ROCm documentation](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatibility matrix for the AMD GPU DKMS driver. ## Prerequisites diff --git a/docs/metrics/ecc-error-injection.md b/docs/metrics/ecc-error-injection.md deleted file mode 100644 index f3f17926..00000000 --- a/docs/metrics/ecc-error-injection.md +++ /dev/null @@ -1,199 +0,0 @@ -## ECC Error Injection Testing - -The Metric Exporter has the capability to check for unhealthy GPUs via the monitoring of ECC Errors that can occur when a GPU is not functioning as expected. When an ECC error is detected the Metrics Exporter will now mark the offending GPU as unhealthy and add a node label to indicate which GPU on the node is unhealthy. The Kubernetes Device Plugin also listens to the health metrics coming from the Metrics Exporter to determine GPU status, marking GPUs as schedulable if healthy and unschedulable if unhealthy. - -This health check workflow runs automatically on every node the Device Metrics Exporter is running on, with the Metrics Exporter polling GPUs every 30 seconds and the device plugin checking health status at the same interval, ensuring updates within one minute. Users can customize the default ECC error threshold (set to 0) via the `HealthThresholds` field in the metrics exporter ConfigMap. As part of this workflow healthy GPUs are made available for Kubernetes job scheduling, while ensuring no new jobs are scheduled on an unhealthy GPUs. - -## To do error injection follow these steps - -We have added a new `metricsclient` to the Device Metrics Exporter pod that can be used to inject ECC errors into an otherwise healthy GPU for testing the above health check workflow. 
This is fairly simple and don't worry this does not harm your GPU as any errors that are being injected are debugging in nature and not real errors. The steps to do this have been outlined below: - -### 1. Set Node Name - -Use an environment variable to set the Kubernetes node name to indicate which node you want to test error injection on: - -```bash -NODE_NAME= -``` - -Replace with the name of the node you want to test. If you are running this from the same node you want to test you can grab the hostname using: - -```bash -NODE_NAME=$(hostname) -``` - -### 2. Set Metrics Exporter Pod Name - -Since you have to execute the `metricsclient` from directly within the Device Metrics Exporter pod we need to get the Metrics Exporter pod name running on the node: - -```bash -METRICS_POD=$(kubectl get pods -n kube-amd-gpu --field-selector spec.nodeName=$NODE_NAME --no-headers -o custom-columns=":metadata.name" | grep '^gpu-operator-metrics-exporter-' | head -n 1) -``` - -### 3. Check Metrics Client to see GPU Health - -Now that you have the name of the metrics exporter pod you can use the metricsclient to check the current health of all GPUs on the node: - -```bash -kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- metricsclient -``` - -You should see a list of all the GPUs on that node along with their corresponding status. In most cases all GPUs should report as being `healthy`. - -```bash -ID Health Associated Workload ------------------------------------------------- -1 healthy [] -0 healthy [] -7 healthy [] -6 healthy [] -5 healthy [] -4 healthy [] -3 healthy [] -2 healthy [] ------------------------------------------------- -``` - -### 4. Inject ECC Errors on GPU 0 - -In order to simulate errors on a GPU we will be using a json file that specifies a GPU ID along with counters for several ECC Uncorrectable error fields that are being monitored by the Device Metrics Exporter. 
In the below example you can see that we are specifying `GPU 0` and injecting 1 `GPU_ECC_UNCORRECT_SEM` error and 2 `GPU_ECC_UNCORRECT_FUSE` errors. We use the `metricslient -ecc-file-path ` command to specify the json file we want to inject into the metrics table. To create the json file and execute the metricsclient command all in in one go run the following: - -```bash -kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- sh -c 'cat > /tmp/ecc.json < /tmp/delete_ecc.json < Date: Thu, 3 Apr 2025 10:14:15 +0000 Subject: [PATCH 11/24] Reboot Loop issue if control node needs to go down for driver upgrade --- api/v1alpha1/deviceconfig_types.go | 1 + ...md-gpu-operator.clusterserviceversion.yaml | 2 +- bundle/manifests/amd.com_deviceconfigs.yaml | 2 ++ config/crd/bases/amd.com_deviceconfigs.yaml | 2 ++ helm-charts-k8s/Chart.lock | 2 +- helm-charts-k8s/crds/deviceconfig-crd.yaml | 2 ++ helm-charts-openshift/Chart.lock | 2 +- .../crds/deviceconfig-crd.yaml | 2 ++ .../controllers/device_config_reconciler.go | 10 +++++++- internal/controllers/mock_upgrademgr.go | 14 +++++++++++ internal/controllers/upgrademgr.go | 25 +++++++++++++++---- 11 files changed, 55 insertions(+), 9 deletions(-) diff --git a/api/v1alpha1/deviceconfig_types.go b/api/v1alpha1/deviceconfig_types.go index b6f186c0..4a5d0597 100644 --- a/api/v1alpha1/deviceconfig_types.go +++ b/api/v1alpha1/deviceconfig_types.go @@ -597,6 +597,7 @@ type ModuleStatus struct { LastTransitionTime string `json:"lastTransitionTime,omitempty"` Status UpgradeState `json:"status,omitempty"` UpgradeStartTime string `json:"upgradeStartTime,omitempty"` + BootId string `json:"bootId,omitempty"` } // DeviceConfigStatus defines the observed state of Module. 
diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml index 134634b9..c73d0351 100644 --- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml +++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml @@ -32,7 +32,7 @@ metadata: capabilities: Seamless Upgrades categories: AI/Machine Learning,Monitoring containerImage: docker.io/rocm/gpu-operator:v1.2.0 - createdAt: "2025-04-02T23:22:18Z" + createdAt: "2025-04-07T07:07:00Z" description: |- Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) diff --git a/bundle/manifests/amd.com_deviceconfigs.yaml b/bundle/manifests/amd.com_deviceconfigs.yaml index d2669dc1..606fa1d8 100644 --- a/bundle/manifests/amd.com_deviceconfigs.yaml +++ b/bundle/manifests/amd.com_deviceconfigs.yaml @@ -931,6 +931,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/config/crd/bases/amd.com_deviceconfigs.yaml b/config/crd/bases/amd.com_deviceconfigs.yaml index 7916a7e6..64427582 100644 --- a/config/crd/bases/amd.com_deviceconfigs.yaml +++ b/config/crd/bases/amd.com_deviceconfigs.yaml @@ -927,6 +927,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/helm-charts-k8s/Chart.lock b/helm-charts-k8s/Chart.lock index f42b6cfb..dd529b75 100644 --- a/helm-charts-k8s/Chart.lock +++ b/helm-charts-k8s/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:f9a315dd2ce3d515ebf28c8e9a6a82158b493ca2686439ec381487761261b597 -generated: 
"2025-03-26T20:10:45.247725094Z" +generated: "2025-04-07T07:06:50.661624221Z" diff --git a/helm-charts-k8s/crds/deviceconfig-crd.yaml b/helm-charts-k8s/crds/deviceconfig-crd.yaml index 81c564c1..ff9c1c79 100644 --- a/helm-charts-k8s/crds/deviceconfig-crd.yaml +++ b/helm-charts-k8s/crds/deviceconfig-crd.yaml @@ -932,6 +932,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/helm-charts-openshift/Chart.lock b/helm-charts-openshift/Chart.lock index 8eb0ba07..d4a86324 100644 --- a/helm-charts-openshift/Chart.lock +++ b/helm-charts-openshift/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:25200c34a5cc846a1275e5bf3fc637b19e909dc68de938189c5278d77d03f5ac -generated: "2025-03-26T20:10:56.781691243Z" +generated: "2025-04-07T07:06:59.305455465Z" diff --git a/helm-charts-openshift/crds/deviceconfig-crd.yaml b/helm-charts-openshift/crds/deviceconfig-crd.yaml index 81c564c1..ff9c1c79 100644 --- a/helm-charts-openshift/crds/deviceconfig-crd.yaml +++ b/helm-charts-openshift/crds/deviceconfig-crd.yaml @@ -932,6 +932,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/internal/controllers/device_config_reconciler.go b/internal/controllers/device_config_reconciler.go index 2e782fb5..7486a8b7 100644 --- a/internal/controllers/device_config_reconciler.go +++ b/internal/controllers/device_config_reconciler.go @@ -593,9 +593,11 @@ func (dcrh *deviceConfigReconcilerHelper) getDeviceConfigOwnedKMMModule(ctx cont func (dcrh *deviceConfigReconcilerHelper) updateDeviceConfigNodeStatus(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig, nodes *v1.NodeList) error { logger := log.FromContext(ctx) previousUpgradeTimes := 
make(map[string]string) + previousBootIds := make(map[string]string) // Persist the UpgradeStartTime for nodeName, moduleStatus := range devConfig.Status.NodeModuleStatus { previousUpgradeTimes[nodeName] = moduleStatus.UpgradeStartTime + previousBootIds[nodeName] = moduleStatus.BootId } devConfig.Status.NodeModuleStatus = map[string]amdv1alpha1.ModuleStatus{} @@ -610,7 +612,12 @@ func (dcrh *deviceConfigReconcilerHelper) updateDeviceConfigNodeStatus(ctx conte if upgradeStartTime == "" { upgradeStartTime = previousUpgradeTimes[node.Name] } - devConfig.Status.NodeModuleStatus[node.Name] = amdv1alpha1.ModuleStatus{Status: dcrh.upgradeMgrHandler.GetNodeStatus(node.Name), UpgradeStartTime: upgradeStartTime} + bootId := dcrh.upgradeMgrHandler.GetNodeBootId(node.Name) + //If operator restarted during Upgrade, then fetch previous known bootId since the internal maps would have been cleared + if bootId == "" { + bootId = previousBootIds[node.Name] + } + devConfig.Status.NodeModuleStatus[node.Name] = amdv1alpha1.ModuleStatus{Status: dcrh.upgradeMgrHandler.GetNodeStatus(node.Name), UpgradeStartTime: upgradeStartTime, BootId: bootId} nmc := kmmv1beta1.NodeModulesConfig{} err := dcrh.client.Get(ctx, types.NamespacedName{Name: node.Name}, &nmc) @@ -632,6 +639,7 @@ func (dcrh *deviceConfigReconcilerHelper) updateDeviceConfigNodeStatus(ctx conte LastTransitionTime: module.LastTransitionTime.String(), Status: dcrh.upgradeMgrHandler.GetNodeStatus(node.Name), UpgradeStartTime: upgradeStartTime, + BootId: bootId, } } } diff --git a/internal/controllers/mock_upgrademgr.go b/internal/controllers/mock_upgrademgr.go index 03944030..748a33d6 100644 --- a/internal/controllers/mock_upgrademgr.go +++ b/internal/controllers/mock_upgrademgr.go @@ -57,6 +57,20 @@ func (m *MockupgradeMgrAPI) EXPECT() *MockupgradeMgrAPIMockRecorder { return m.recorder } +// GetNodeBootId mocks base method. 
+func (m *MockupgradeMgrAPI) GetNodeBootId(nodeName string) string { + m.ctrl.T.Helper() + ret := m.ctrl.Call(m, "GetNodeBootId", nodeName) + ret0, _ := ret[0].(string) + return ret0 +} + +// GetNodeBootId indicates an expected call of GetNodeBootId. +func (mr *MockupgradeMgrAPIMockRecorder) GetNodeBootId(nodeName any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "GetNodeBootId", reflect.TypeOf((*MockupgradeMgrAPI)(nil).GetNodeBootId), nodeName) +} + // GetNodeStatus mocks base method. func (m *MockupgradeMgrAPI) GetNodeStatus(nodeName string) v1alpha1.UpgradeState { m.ctrl.T.Helper() diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index ad2ee41c..25dd5d6b 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -74,6 +74,7 @@ type upgradeMgrAPI interface { HandleDelete(ctx context.Context, deviceConfig *amdv1alpha1.DeviceConfig, nodes *v1.NodeList) (ctrl.Result, error) GetNodeStatus(nodeName string) amdv1alpha1.UpgradeState GetNodeUpgradeStartTime(nodeName string) string + GetNodeBootId(nodeName string) string } func newUpgradeMgrHandler(client client.Client, k8sConfig *rest.Config) upgradeMgrAPI { @@ -108,16 +109,25 @@ func (n *upgradeMgr) HandleUpgrade(ctx context.Context, deviceConfig *amdv1alpha if deviceConfig.Spec.Driver.UpgradePolicy.RebootRequired != nil && *deviceConfig.Spec.Driver.UpgradePolicy.RebootRequired { nodeObj, err := n.helper.getNode(ctx, nodeName) if err == nil { - log.FromContext(ctx).Info("Reboot is required for driver upgrade, triggering node reboot") - n.helper.handleNodeReboot(ctx, nodeObj, deviceConfig) + // trigger reboot only for nodes which are in UpgradeStarted but haven't rebooted yet + if nodeObj.Status.NodeInfo.BootID == moduleStatus.BootId { + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Reboot is required for driver upgrade, triggering node reboot", nodeName)) + n.helper.handleNodeReboot(ctx, nodeObj, 
deviceConfig) + // for nodes which are in UpgradeStarted but already rebooted. Schedule the reboot pod deletion + } else { + currentBootID := nodeObj.Status.NodeInfo.BootID + n.helper.setBootID(nodeObj.Name, currentBootID) + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Node already rebooted, scheduling reboot pod deletion", nodeName)) + go n.helper.deleteRebootPod(ctx, nodeName, deviceConfig, false, deviceConfig.Generation) + } } } else { - log.FromContext(ctx).Info("Resetting Upgrade State to UpgradeStateEmpty") + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Resetting Upgrade State to UpgradeStateEmpty", nodeName)) n.helper.setNodeStatus(ctx, nodeName, amdv1alpha1.UpgradeStateEmpty) } } else if moduleStatus.Status == amdv1alpha1.UpgradeStateRebootInProgress { // Operator restarted during upgrade operation. Schedule the reboot pod deletion - log.FromContext(ctx).Info("Reboot is in progress, scheduling reboot pod deletion") + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Reboot is in progress, scheduling reboot pod deletion", nodeName)) n.helper.setNodeStatus(ctx, nodeName, moduleStatus.Status) go n.helper.deleteRebootPod(ctx, nodeName, deviceConfig, false, deviceConfig.Generation) } else { @@ -244,11 +254,16 @@ func (n *upgradeMgr) GetNodeStatus(nodeName string) (status amdv1alpha1.UpgradeS return n.helper.getNodeStatus(nodeName) } -// GetNodeStaGetNodeUpgradeStartTimetus returns the time when upgrade started on the node +// GetNodeUpgradeStartTime returns the time when upgrade started on the node func (n *upgradeMgr) GetNodeUpgradeStartTime(nodeName string) string { return n.helper.getUpgradeStartTime(nodeName) } +// GetNodeBootId returns the last known bootid of the node +func (n *upgradeMgr) GetNodeBootId(nodeName string) string { + return n.helper.getBootID(nodeName) +} + /*=========================================== Upgrade Manager Helper APIs ==========================================*/ //go:generate mockgen -source=upgrademgr.go 
-package=controllers -destination=mock_upgrademgr.go upgradeMgrHelperAPI From 6433730dfd7b8ee25987ee1d04972630299df2c2 Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Sun, 6 Apr 2025 02:02:08 +0000 Subject: [PATCH 12/24] Add warning to describe the known GPU scheduling issue for pre-start job check --- docs/test/pre-start-job-test.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md index d5133faa..d9b67750 100644 --- a/docs/test/pre-start-job-test.md +++ b/docs/test/pre-start-job-test.md @@ -8,6 +8,14 @@ Test runner can be embedded as an init container within your Kubernetes workload The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, avoid running the Test Runner as an init container for the pre-start job test. ``` +```{warning} +* Known Issue: Within a pod, the initContainer and workload container might not be assigned the same GPUs. + +* Workaround: The example in this document remains applicable if both initContainer and workload containers request all GPUs on the same node. + +* Future Solution: With the introduction of [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/), both initContainer and workload container will be able to share the same set of GPUs. +``` + ## Configure pre-start init container The init container requires RBAC config to grant the pod access to export events and add node labels to the cluster. 
Here is an example of configuring the RBAC and Job resources: From 1ad40e323e61bccfc64dd8870f34c274bee283e1 Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Mon, 7 Apr 2025 19:15:00 +0000 Subject: [PATCH 13/24] Address comment --- docs/test/pre-start-job-test.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md index d9b67750..a376ba73 100644 --- a/docs/test/pre-start-job-test.md +++ b/docs/test/pre-start-job-test.md @@ -85,8 +85,8 @@ spec: image: docker.io/rocm/test-runner:v1.2.0-beta.0 imagePullPolicy: IfNotPresent resources: - limits: - amd.com/gpu: 1 # requesting a GPU + requests: + amd.com/gpu: 8 # requesting all GPUs on the worker node env: - name: TEST_TRIGGER value: "PRE_START_JOB_CHECK" # Set the TEST_TRIGGER environment variable to PRE_START_JOB_CHECK for test runner as init container @@ -108,8 +108,8 @@ spec: command: ["/bin/sh", "-c", "--"] args: ["sleep 6000"] resources: - limits: - amd.com/gpu: 1 # requesting a GPU + requests: + amd.com/gpu: 8 # requesting all GPUs on the worker node ``` ## Check test runner init container From 1773fc9628fbe2818ef1a9af33c65612856e39c0 Mon Sep 17 00:00:00 2001 From: vm Date: Tue, 8 Apr 2025 09:12:07 +0000 Subject: [PATCH 14/24] Doc on known limitation --- docs/knownlimitations.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md index 051b48cd..c84430e8 100644 --- a/docs/knownlimitations.md +++ b/docs/knownlimitations.md @@ -85,6 +85,13 @@ - **Recommendation:** Ensure nodes are fully stable before triggering an upgrade, and if necessary, manually update node labels to enforce the new driver version. Refer to driver upgrade documentation for more details.

+13. **Driver Upgrade Issue when maxParallelUpgrades equals the total number of worker nodes in Red Hat OpenShift** + + - **Impact:** Driver upgrades cannot be performed + - **Affected Configurations:** This issue only affects Red Hat OpenShift when the image registry pod is running on one of the worker nodes, or when a KMM build pod is required to run on one of the worker nodes + - **Recommendation:** Set maxParallelUpgrades to a number less than the total number of worker nodes +
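As a concrete illustration of the recommendation above, a cluster with four worker nodes could cap parallel upgrades like this (a hedged sketch — the exact `DeviceConfig` field paths and value types under `spec/driver/upgradePolicy` are assumptions and may differ between operator versions):

```yaml
# Hypothetical DeviceConfig fragment: on a 4-worker-node cluster, keep
# maxParallelUpgrades below 4 so the image registry pod / KMM build pod
# always have an available node to run on during the rolling upgrade.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-deviceconfig
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true
    upgradePolicy:
      enable: true
      maxParallelUpgrades: 3   # less than the total number of worker nodes (4)
      maxUnavailableNodes: 1
```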

+ ## Fixed Issues 1. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module (Fixed in v1.2.0)** From 81090a4fd03ab7f6ad5a0402ef107a689df4083b Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Tue, 8 Apr 2025 09:49:28 +0000 Subject: [PATCH 15/24] Add note for blacklisting amdgpu on OpenShift cluster in full example --- docs/drivers/installation.md | 2 ++ docs/fulldeviceconfig.rst | 7 +++++-- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index 890da553..ed1d9041 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -96,6 +96,8 @@ spec: # enable operator to install out-of-tree amdgpu kernel module enable: true # blacklist is required for installing out-of-tree amdgpu kernel module + # Not working for OpenShift cluster. OpenShift users please use the Machine Config Operator (MCO) resource to configure amdgpu blacklist. 
+ # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module blacklist: true # Specify your repository to host driver image # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you diff --git a/docs/fulldeviceconfig.rst b/docs/fulldeviceconfig.rst index 8d8c1d95..00d52de9 100644 --- a/docs/fulldeviceconfig.rst +++ b/docs/fulldeviceconfig.rst @@ -38,8 +38,11 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM driver: # Set to false to skip driver installation to use inbox or pre-installed driver on worker nodes # Set to true to enable operator to install out-of-tree amdgpu kernel module - enable: false - blacklist: false # Set to true to blacklist the amdgpu kernel module which is required for installing out-of-tree driver + enable: false + # Set to true to blacklist the amdgpu kernel module which is required for installing out-of-tree driver + # Not working for OpenShift cluster. OpenShift users please use the Machine Config Operator (MCO) resource to configure amdgpu blacklist. + # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module + blacklist: false # Specify your repository to host driver image # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you image: docker.io/username/repo From 36c6e9616fc1f9fcc72f0a3aa60c70d6c5b9728f Mon Sep 17 00:00:00 2001 From: Nitish Bhat Date: Mon, 7 Apr 2025 16:03:22 -0700 Subject: [PATCH 16/24] Expose ContainerPort in Metrics Exporter Pod (#534) - ContainerPort lists the ports to expose from the Container. Not specifying a port DOES NOT prevent that port from being exposed.
The device metrics exporter container starts a metrics server on the port specified by the METRICS_EXPORTER_PORT on the default "0.0.0.0" address in the container which exposes the port. Look at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/ for more information on this behavior. Co-authored-by: Nitish Bhat --- internal/metricsexporter/metricsexporter.go | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/internal/metricsexporter/metricsexporter.go b/internal/metricsexporter/metricsexporter.go index 7ead7bca..c57341fb 100644 --- a/internal/metricsexporter/metricsexporter.go +++ b/internal/metricsexporter/metricsexporter.go @@ -240,7 +240,7 @@ func (nl *metricsExporter) SetMetricsExporterAsDesired(ds *appsv1.DaemonSet, dev if internalPort == port { internalPort = port - 1 } - // Bind service port to localhost only + // Bind service port to localhost only, don't expose port in ContainerPort containers[0].Args = []string{"--bind=127.0.0.1:" + fmt.Sprintf("%v", int32(internalPort))} containers[0].Env[1].Value = fmt.Sprintf("%v", internalPort) @@ -292,12 +292,26 @@ func (nl *metricsExporter) SetMetricsExporterAsDesired(ds *appsv1.DaemonSet, dev }, Args: args, VolumeMounts: volumeMounts, + Ports: []v1.ContainerPort{ + { + Name: "exporter-port", + Protocol: v1.ProtocolTCP, + ContainerPort: port, + }, + }, }) // Provide elevated privilege only when rbac-proxy is enabled serviceaccount = kubeRbacSAName } else { containers[0].Env[1].Value = fmt.Sprintf("%v", port) + containers[0].Ports = []v1.ContainerPort{ + { + Name: "exporter-port", + Protocol: v1.ProtocolTCP, + ContainerPort: port, + }, + } } gracePeriod := int64(1) From 53d34c06f8e163e2ec4670b0ead56b896266ad64 Mon Sep 17 00:00:00 2001 From: Nitish Bhat Date: Tue, 8 Apr 2025 19:18:15 +0000 Subject: [PATCH 17/24] Change default cpu/memory resource limits for Controller Manager For larger deployments, the default CPU limits of 500m (half a core) and memory limit of 384Mi 
in the cluster might be insufficient. It has been bumped up with this change and documentation has been added to alert the user to modify these values in helm if they have larger clusters. --- docs/installation/kubernetes-helm.md | 40 +++++++++++++++++++ hack/k8s-patch/metadata-patch/values.yaml | 8 ++-- .../metadata-patch/values.yaml | 8 ++-- helm-charts-k8s/values.yaml | 8 ++-- helm-charts-openshift/values.yaml | 8 ++-- 5 files changed, 56 insertions(+), 16 deletions(-) diff --git a/docs/installation/kubernetes-helm.md b/docs/installation/kubernetes-helm.md index c1415324..8681222f 100644 --- a/docs/installation/kubernetes-helm.md +++ b/docs/installation/kubernetes-helm.md @@ -163,6 +163,10 @@ The following parameters are able to be configued when using the Helm Chart. In | controllerManager.manager.image.tag | string | `"v1.2.0"` | AMD GPU operator controller manager image tag | | controllerManager.manager.imagePullPolicy | string | `"Always"` | Image pull policy for AMD GPU operator controller manager pod | | controllerManager.manager.imagePullSecrets | string | `""` | Image pull secret name for pulling AMD GPU operator controller manager image if registry needs credential to pull image | +| controllerManager.manager.resources.limits.cpu | string | `"1000m"` | CPU limits for the controller manager. Consider increasing for large clusters | +| controllerManager.manager.resources.limits.memory | string | `"1Gi"` | Memory limits for the controller manager. Consider increasing if experiencing OOM issues | +| controllerManager.manager.resources.requests.cpu | string | `"100m"` | CPU requests for the controller manager. Adjust based on observed CPU usage | +| controllerManager.manager.resources.requests.memory | string | `"256Mi"` | Memory requests for the controller manager. 
Adjust based on observed memory usage | | controllerManager.nodeSelector | object | `{}` | Node selector for AMD GPU operator controller manager deployment | | installdefaultNFDRule | bool | `true` | Default NFD rule will detect amd gpu based on pci vendor ID | | kmm.enabled | bool | `true` | Set to true/false to enable/disable the installation of kernel module management (KMM) operator | @@ -258,6 +262,42 @@ Verify that nodes with AMD GPU hardware are properly labeled: kubectl get nodes -L feature.node.kubernetes.io/amd-gpu ``` +## Resource Configuration + +### Controller Manager Resource Settings + +The AMD GPU Operator controller manager component has default resource limits and requests configured for typical usage scenarios. You may need to adjust these values based on your specific cluster environment: + +```yaml +controllerManager: + manager: + resources: + limits: + cpu: 1000m + memory: 1Gi + requests: + cpu: 100m + memory: 256Mi +``` + +#### When to Adjust Resource Settings + +You should consider adjusting the controller manager resource settings in these scenarios: + +- **Large clusters**: If managing a large number of nodes or GPU devices, consider increasing both CPU and memory limits +- **Memory pressure**: If you observe OOM (Out of Memory) kills in controller manager pods, increase the memory limit and request +- **CPU pressure**: If the controller manager is experiencing throttling or slow response times during operations, increase the CPU limit and request +- **Resource-constrained environments**: For smaller development or test clusters, you may reduce these values to conserve resources + +You can apply resource changes by updating your values.yaml file and upgrading the Helm release: + +```bash +helm upgrade amd-gpu-operator amd/gpu-operator-helm \ + --namespace kube-amd-gpu \ + --version=v1.0.0 \ + -f values.yaml +``` + ## Install Custom Resource After the installation of AMD GPU Operator, you need to create the `DeviceConfig` custom resource in 
order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```. For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](../drivers/installation). Here are some examples for common deployment scenarios. diff --git a/hack/k8s-patch/metadata-patch/values.yaml b/hack/k8s-patch/metadata-patch/values.yaml index 71bfd56c..6e6e0a0d 100644 --- a/hack/k8s-patch/metadata-patch/values.yaml +++ b/hack/k8s-patch/metadata-patch/values.yaml @@ -47,11 +47,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi # -- Node selector for AMD GPU operator controller manager deployment nodeSelector: {} # -- Deployment affinity configs for controller manager diff --git a/hack/openshift-patch/metadata-patch/values.yaml b/hack/openshift-patch/metadata-patch/values.yaml index b0b937a9..2bdb27ad 100644 --- a/hack/openshift-patch/metadata-patch/values.yaml +++ b/hack/openshift-patch/metadata-patch/values.yaml @@ -26,11 +26,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi nodeSelector: {} affinity: nodeAffinity: diff --git a/helm-charts-k8s/values.yaml b/helm-charts-k8s/values.yaml index 71bfd56c..6e6e0a0d 100644 --- a/helm-charts-k8s/values.yaml +++ b/helm-charts-k8s/values.yaml @@ -47,11 +47,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi # -- Node selector for AMD GPU operator controller manager deployment nodeSelector: {} # -- Deployment affinity configs for controller manager diff --git a/helm-charts-openshift/values.yaml 
b/helm-charts-openshift/values.yaml index b0b937a9..2bdb27ad 100644 --- a/helm-charts-openshift/values.yaml +++ b/helm-charts-openshift/values.yaml @@ -26,11 +26,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi nodeSelector: {} affinity: nodeAffinity: From a7570fcfb31e7b93fa3e3720202c9585b6025014 Mon Sep 17 00:00:00 2001 From: Farshad Ghodsian <47931571+farshadghodsian@users.noreply.github.com> Date: Wed, 9 Apr 2025 11:42:41 -0400 Subject: [PATCH 18/24] Updated ReadTheDocs conf to support copy code block button --- docs/conf.py | 16 ++++++--- docs/requirements.txt | 1 - docs/sphinx/requirements.in | 4 +-- docs/sphinx/requirements.txt | 67 ++++++++++++++++++------------------ 4 files changed, 46 insertions(+), 42 deletions(-) delete mode 100644 docs/requirements.txt diff --git a/docs/conf.py b/docs/conf.py index c2086415..8fe2aaba 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -1,21 +1,27 @@ """Configuration file for the Sphinx documentation builder.""" +import os +html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "instinct.docs.amd.com") +html_context = {} +if os.environ.get("READTHEDOCS", "") == "True": + html_context["READTHEDOCS"] = True external_projects_local_file = "projects.yaml" external_projects_remote_repository = "" external_projects = ["amd-gpu-operator"] external_projects_current_project = "amd-gpu-operator" -project = "AMD Instinct Documentation" +project = "AMD GPU Operator" version = "1.2.0" release = version -html_title = f"AMD GPU Operator {version}" +html_title = f"{project} {version}" author = "Advanced Micro Devices, Inc." -copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." +copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved." 
# Required settings html_theme = "rocm_docs_theme" html_theme_options = { - "flavor": "instinct" + "flavor": "instinct", + "link_main_doc": True, # Add any additional theme options here } extensions = ["rocm_docs"] @@ -23,4 +29,4 @@ # Table of contents external_toc_path = "./sphinx/_toc.yml" -exclude_patterns = ['.venv'] +exclude_patterns = ['.venv'] \ No newline at end of file diff --git a/docs/requirements.txt b/docs/requirements.txt deleted file mode 100644 index 78600aa6..00000000 --- a/docs/requirements.txt +++ /dev/null @@ -1 +0,0 @@ -rocm-docs-core diff --git a/docs/sphinx/requirements.in b/docs/sphinx/requirements.in index 5efe4f66..e75ed236 100644 --- a/docs/sphinx/requirements.in +++ b/docs/sphinx/requirements.in @@ -1,2 +1,2 @@ -rocm-docs-core==1.17.1 -sphinx-reredirects +rocm-docs-core==1.18.1 +sphinx-reredirects \ No newline at end of file diff --git a/docs/sphinx/requirements.txt b/docs/sphinx/requirements.txt index fc912bea..cf7a01fd 100644 --- a/docs/sphinx/requirements.txt +++ b/docs/sphinx/requirements.txt @@ -15,38 +15,39 @@ attrs==25.1.0 # jsonschema # jupyter-cache # referencing -babel==2.17.0 +babel==2.16.0 # via # pydata-sphinx-theme # sphinx -beautifulsoup4==4.13.3 +beautifulsoup4==4.12.3 # via pydata-sphinx-theme -breathe==4.36.0 +breathe==4.35.0 # via rocm-docs-core -certifi==2025.1.31 +certifi==2024.8.30 # via requests cffi==1.17.1 # via # cryptography # pynacl -charset-normalizer==3.4.1 +charset-normalizer==3.4.0 # via requests -click==8.1.8 +click==8.1.7 # via # jupyter-cache # sphinx-external-toc comm==0.2.2 # via ipykernel -cryptography==44.0.2 +cryptography==43.0.3 # via pyjwt -debugpy==1.8.13 +debugpy==1.8.12 # via ipykernel -decorator==5.2.1 +decorator==5.1.1 # via ipython -deprecated==1.2.18 +deprecated==1.2.15 # via pygithub docutils==0.21.2 # via + # breathe # myst-parser # pydata-sphinx-theme # sphinx @@ -54,13 +55,13 @@ exceptiongroup==1.2.2 # via ipython executing==2.2.0 # via stack-data -fastjsonschema==2.21.1 
+fastjsonschema==2.20.0 # via # nbformat # rocm-docs-core -gitdb==4.0.12 +gitdb==4.0.11 # via gitpython -gitpython==3.1.44 +gitpython==3.1.43 # via rocm-docs-core greenlet==3.1.1 # via sqlalchemy @@ -74,13 +75,13 @@ importlib-metadata==8.6.1 # myst-nb ipykernel==6.29.5 # via myst-nb -ipython==8.33.0 +ipython==8.31.0 # via # ipykernel # myst-nb jedi==0.19.2 # via ipython -jinja2==3.1.6 +jinja2==3.1.4 # via # myst-parser # sphinx @@ -114,9 +115,9 @@ mdit-py-plugins==0.4.2 # via myst-parser mdurl==0.1.2 # via markdown-it-py -myst-nb==1.2.0 +myst-nb==1.1.2 # via rocm-docs-core -myst-parser==4.0.1 +myst-parser==4.0.0 # via myst-nb nbclient==0.10.2 # via @@ -132,7 +133,6 @@ nest-asyncio==1.6.0 packaging==24.2 # via # ipykernel - # pydata-sphinx-theme # sphinx parso==0.8.4 # via jedi @@ -142,7 +142,7 @@ platformdirs==4.3.6 # via jupyter-core prompt-toolkit==3.0.50 # via ipython -psutil==7.0.0 +psutil==6.1.1 # via ipykernel ptyprocess==0.7.0 # via pexpect @@ -150,19 +150,19 @@ pure-eval==0.2.3 # via stack-data pycparser==2.22 # via cffi -pydata-sphinx-theme==0.15.4 +pydata-sphinx-theme==0.16.0 # via # rocm-docs-core # sphinx-book-theme -pygithub==2.6.1 +pygithub==2.5.0 # via rocm-docs-core -pygments==2.19.1 +pygments==2.18.0 # via # accessible-pygments # ipython # pydata-sphinx-theme # sphinx -pyjwt[crypto]==2.10.1 +pyjwt[crypto]==2.10.0 # via pygithub pynacl==1.5.0 # via pygithub @@ -187,15 +187,15 @@ requests==2.32.3 # via # pygithub # sphinx -rocm-docs-core==1.17.1 +rocm-docs-core==1.18.1 # via -r requirements.in -rpds-py==0.23.1 +rpds-py==0.22.3 # via # jsonschema # referencing six==1.17.0 # via python-dateutil -smmap==5.0.2 +smmap==5.0.1 # via gitdb snowballstemmer==2.2.0 # via sphinx @@ -214,7 +214,7 @@ sphinx==8.1.3 # sphinx-external-toc # sphinx-notfound-page # sphinx-reredirects -sphinx-book-theme==1.1.4 +sphinx-book-theme==1.1.3 # via rocm-docs-core sphinx-copybutton==0.5.2 # via rocm-docs-core @@ -222,7 +222,7 @@ sphinx-design==0.6.1 # via rocm-docs-core 
sphinx-external-toc==1.0.1 # via rocm-docs-core -sphinx-notfound-page==1.1.0 +sphinx-notfound-page==1.0.4 # via rocm-docs-core sphinx-reredirects==0.1.5 # via -r requirements.in @@ -238,13 +238,13 @@ sphinxcontrib-qthelp==2.0.0 # via sphinx sphinxcontrib-serializinghtml==2.0.0 # via sphinx -sqlalchemy==2.0.38 +sqlalchemy==2.0.37 # via jupyter-cache stack-data==0.6.3 # via ipython tabulate==0.9.0 # via jupyter-cache -tomli==2.2.1 +tomli==2.1.0 # via sphinx tornado==6.4.2 # via @@ -262,20 +262,19 @@ traitlets==5.14.3 # nbformat typing-extensions==4.12.2 # via - # beautifulsoup4 # ipython # myst-nb # pydata-sphinx-theme # pygithub # referencing # sqlalchemy -urllib3==2.3.0 +urllib3==2.2.3 # via # pygithub # requests wcwidth==0.2.13 # via prompt-toolkit -wrapt==1.17.2 +wrapt==1.17.0 # via deprecated zipp==3.21.0 - # via importlib-metadata + # via importlib-metadata \ No newline at end of file From 039ce94782ace78af1f7ff604ae16534a4d3909b Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 9 Apr 2025 11:01:52 +0000 Subject: [PATCH 19/24] Evict pods consuming partition resource types --- internal/controllers/upgrademgr.go | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index 25dd5d6b..033734be 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -64,6 +64,25 @@ const ( defaultSAName = "amd-gpu-operator-utils-container" ) +var ( + computePartitionTypes = []string{"spx", "cpx", "dpx", "qpx", "tpx"} + memoryPartitionTypes = []string{"nps1", "nps4"} + validResources = buildValidResources() +) + +func buildValidResources() map[string]struct{} { + resources := map[string]struct{}{ + "amd.com/gpu": {}, + } + for _, compute := range computePartitionTypes { + for _, memory := range memoryPartitionTypes { + resourceName := fmt.Sprintf("amd.com/%s_%s", compute, memory) + resources[resourceName] = struct{}{} + } + } + return resources 
+} + type upgradeMgr struct { helper upgradeMgrHelperAPI } @@ -663,9 +682,11 @@ func (h *upgradeMgrHelper) getPodsToDrainOrDelete(ctx context.Context, deviceCon continue } for _, container := range pod.Spec.Containers { - if _, ok := container.Resources.Requests["amd.com/gpu"]; ok { - newPods = append(newPods, pod) - break + for resourceName := range container.Resources.Requests { + if _, ok := validResources[string(resourceName)]; ok { + newPods = append(newPods, pod) + break + } } } } From a636a9d669d21ef831fc74a267be5580142c63cf Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Thu, 10 Apr 2025 00:31:47 +0000 Subject: [PATCH 20/24] [DOC] Add note that updating driver image repo is not supported --- api/v1alpha1/deviceconfig_types.go | 1 + .../manifests/amd-gpu-operator.clusterserviceversion.yaml | 8 +++++--- bundle/manifests/amd.com_deviceconfigs.yaml | 1 + config/crd/bases/amd.com_deviceconfigs.yaml | 1 + .../bases/amd-gpu-operator.clusterserviceversion.yaml | 6 ++++-- docs/drivers/installation.md | 4 +++- docs/fulldeviceconfig.rst | 4 +++- helm-charts-k8s/Chart.lock | 2 +- helm-charts-k8s/crds/deviceconfig-crd.yaml | 1 + helm-charts-openshift/Chart.lock | 2 +- helm-charts-openshift/crds/deviceconfig-crd.yaml | 1 + 11 files changed, 22 insertions(+), 9 deletions(-) diff --git a/api/v1alpha1/deviceconfig_types.go b/api/v1alpha1/deviceconfig_types.go index 4a5d0597..b4b7ba04 100644 --- a/api/v1alpha1/deviceconfig_types.go +++ b/api/v1alpha1/deviceconfig_types.go @@ -117,6 +117,7 @@ type DriverSpec struct { // for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod // image tag will be in the format of --- // example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + // NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository //+operator-sdk:csv:customresourcedefinitions:type=spec,displayName="Image",xDescriptors={"urn:alm:descriptor:com.amd.deviceconfigs:image"} // +optional // +kubebuilder:validation:Pattern=`^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$` diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml index c73d0351..09886fe0 100644 --- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml +++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml @@ -32,7 +32,7 @@ metadata: capabilities: Seamless Upgrades categories: AI/Machine Learning,Monitoring containerImage: docker.io/rocm/gpu-operator:v1.2.0 - createdAt: "2025-04-07T07:07:00Z" + createdAt: "2025-04-10T00:25:51Z" description: |- Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) @@ -262,13 +262,15 @@ spec: path: driver.enable x-descriptors: - urn:alm:descriptor:com.amd.deviceconfigs:enable - - description: defines image that includes drivers and firmware blobs, don't + - description: 'defines image that includes drivers and firmware blobs, don''t include tag since it will be fully managed by operator for vanilla k8s the default value is image-registry:5000/$MOD_NAMESPACE/amdgpu_kmod for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 - and ubuntu-22.04-5.15.0-94-generic-6.1.3 + and ubuntu-22.04-5.15.0-94-generic-6.1.3 NOTE: Updating the driver image + repository is not supported. 
Please delete the existing DeviceConfig and + create a new one with the updated image repository' displayName: Image path: driver.image x-descriptors: diff --git a/bundle/manifests/amd.com_deviceconfigs.yaml b/bundle/manifests/amd.com_deviceconfigs.yaml index 606fa1d8..8a439b8d 100644 --- a/bundle/manifests/amd.com_deviceconfigs.yaml +++ b/bundle/manifests/amd.com_deviceconfigs.yaml @@ -360,6 +360,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: diff --git a/config/crd/bases/amd.com_deviceconfigs.yaml b/config/crd/bases/amd.com_deviceconfigs.yaml index 64427582..dfd71b78 100644 --- a/config/crd/bases/amd.com_deviceconfigs.yaml +++ b/config/crd/bases/amd.com_deviceconfigs.yaml @@ -356,6 +356,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: diff --git a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml index 878483bd..c49d9c30 100644 --- a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml +++ b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml @@ -233,13 +233,15 @@ spec: path: driver.enable x-descriptors: - urn:alm:descriptor:com.amd.deviceconfigs:enable - - description: defines image that includes drivers and firmware blobs, don't + - description: 'defines image that includes drivers and firmware blobs, don''t include tag since it will be fully managed by operator for vanilla k8s the default value is image-registry:5000/$MOD_NAMESPACE/amdgpu_kmod for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 - and ubuntu-22.04-5.15.0-94-generic-6.1.3 + and ubuntu-22.04-5.15.0-94-generic-6.1.3 NOTE: Updating the driver image + repository is not supported. 
Please delete the existing DeviceConfig and + create a new one with the updated image repository' displayName: Image path: driver.image x-descriptors: diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index ed1d9041..9825e546 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -100,7 +100,9 @@ spec: # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olmhtml#create-blacklist-for-installing-out-of-tree-kernel-module blacklist: true # Specify your repository to host driver image - # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # Note: + # 1. DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # 2. Updating the driver image repository is not supported. Please delete the existing DeviceConfig and create a new one with the updated image repository image: docker.io/username/repo # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access # you can create the docker-registry type secret by running command like: diff --git a/docs/fulldeviceconfig.rst b/docs/fulldeviceconfig.rst index 00d52de9..9f7b8441 100644 --- a/docs/fulldeviceconfig.rst +++ b/docs/fulldeviceconfig.rst @@ -44,7 +44,9 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module blacklist: false # Specify your repository to host driver image - # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # Note: + # 1. DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # 2. Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository image: docker.io/username/repo # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access # you can create the docker-registry type secret by running command like: diff --git a/helm-charts-k8s/Chart.lock b/helm-charts-k8s/Chart.lock index dd529b75..95811e74 100644 --- a/helm-charts-k8s/Chart.lock +++ b/helm-charts-k8s/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:f9a315dd2ce3d515ebf28c8e9a6a82158b493ca2686439ec381487761261b597 -generated: "2025-04-07T07:06:50.661624221Z" +generated: "2025-04-10T00:25:36.698574082Z" diff --git a/helm-charts-k8s/crds/deviceconfig-crd.yaml b/helm-charts-k8s/crds/deviceconfig-crd.yaml index ff9c1c79..24669303 100644 --- a/helm-charts-k8s/crds/deviceconfig-crd.yaml +++ b/helm-charts-k8s/crds/deviceconfig-crd.yaml @@ -364,6 +364,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: diff --git a/helm-charts-openshift/Chart.lock b/helm-charts-openshift/Chart.lock index d4a86324..ea8bd255 100644 --- a/helm-charts-openshift/Chart.lock +++ b/helm-charts-openshift/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:25200c34a5cc846a1275e5bf3fc637b19e909dc68de938189c5278d77d03f5ac -generated: "2025-04-07T07:06:59.305455465Z" +generated: "2025-04-10T00:25:48.698223085Z" diff --git a/helm-charts-openshift/crds/deviceconfig-crd.yaml b/helm-charts-openshift/crds/deviceconfig-crd.yaml index ff9c1c79..24669303 100644 --- a/helm-charts-openshift/crds/deviceconfig-crd.yaml +++ b/helm-charts-openshift/crds/deviceconfig-crd.yaml @@ -364,6 +364,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: From 50eac0758c6de8d59671034d21c635d992dcab85 Mon Sep 17 00:00:00 2001 From: vm Date: Fri, 11 Apr 2025 03:24:23 +0000 Subject: [PATCH 21/24] Handle auto driver upgrade on OpenShift when KMM self-delete the NMC --- internal/controllers/mock_upgrademgr.go | 14 +++++++++++ internal/controllers/upgrademgr.go | 32 +++++++++++++++++++++++++ 2 files changed, 46 insertions(+) diff --git a/internal/controllers/mock_upgrademgr.go b/internal/controllers/mock_upgrademgr.go index 748a33d6..33e8332e 100644 --- a/internal/controllers/mock_upgrademgr.go +++ b/internal/controllers/mock_upgrademgr.go @@ -394,6 +394,20 @@ func (mr *MockupgradeMgrHelperAPIMockRecorder) isNodeNew(ctx, node, deviceConfig return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isNodeNew", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).isNodeNew), ctx, node, deviceConfig) } +// isNodeNmcStatusMissing mocks base method. +func (m *MockupgradeMgrHelperAPI) isNodeNmcStatusMissing(ctx context.Context, node *v1.Node, deviceConfig *v1alpha1.DeviceConfig) bool { + m.ctrl.T.Helper() + ret := m.ctrl.Call(m, "isNodeNmcStatusMissing", ctx, node, deviceConfig) + ret0, _ := ret[0].(bool) + return ret0 +} + +// isNodeNmcStatusMissing indicates an expected call of isNodeNmcStatusMissing. +func (mr *MockupgradeMgrHelperAPIMockRecorder) isNodeNmcStatusMissing(ctx, node, deviceConfig any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isNodeNmcStatusMissing", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).isNodeNmcStatusMissing), ctx, node, deviceConfig) +} + // isNodeReady mocks base method. 
func (m *MockupgradeMgrHelperAPI) isNodeReady(ctx context.Context, node *v1.Node, deviceConfig *v1alpha1.DeviceConfig) bool { m.ctrl.T.Helper() diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index 033734be..8a25f5db 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -187,6 +187,12 @@ func (n *upgradeMgr) HandleUpgrade(ctx context.Context, deviceConfig *amdv1alpha continue } + // Untaint to let upgrade continue in case of KMM bug after node reboot + if n.helper.isNodeNmcStatusMissing(ctx, &nodeList.Items[i], deviceConfig) { + upgradeInProgress++ + continue + } + // 3. Handle Started Nodes if n.helper.isNodeStateUpgradeStarted(&nodeList.Items[i]) { upgradeInProgress++ @@ -292,6 +298,7 @@ type upgradeMgrHelperAPI interface { // Handle node state transitions isNodeReady(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool + isNodeNmcStatusMissing(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool isNodeNew(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool isNodeStateUpgradeStarted(node *v1.Node) bool isNodeStateInstallInProgress(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool @@ -405,6 +412,31 @@ func (h *upgradeMgrHelper) isNodeNew(ctx context.Context, node *v1.Node, deviceC return false } +// Handle Driver installation for nodes with nmc status missing +func (h *upgradeMgrHelper) isNodeNmcStatusMissing(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool { + + if nodeStatus, ok := deviceConfig.Status.NodeModuleStatus[node.Name]; ok { + currentState := h.getNodeStatus(node.Name) + // during the automatic upgrade, if node reboot was triggered, KMM could possibly remove the NMC status, making the ContainerImage empty + // 
https://github.com/rh-ecosystem-edge/kernel-module-management/blob/b57037ec1b8ceef9961ca1baeb9529121c6df398/internal/controllers/nmc_reconciler.go#L414-L419 + // at this moment the node status would be UpgradeStateInProgress with empty ContainerImage + // we still need to proceed with this status + if nodeStatus.ContainerImage == "" && currentState == amdv1alpha1.UpgradeStateInProgress { + + // Uncordon the node + if err := h.cordonOrUncordonNode(ctx, deviceConfig, node, false); err != nil { + // Move to failure state if uncordon fails + h.setNodeStatus(ctx, node.Name, amdv1alpha1.UpgradeStateUncordonFailed) + return false + } + + return true + } + } + + return false +} + // Handle Driver installation for ready nodes. func (h *upgradeMgrHelper) isNodeReady(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool { From 0efcd891c187a43037fec8e032ce987c9e2ebcd7 Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 9 Apr 2025 03:23:39 +0000 Subject: [PATCH 22/24] MaxParallel constraint with MaxUnavailable --- internal/controllers/upgrademgr.go | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index 8a25f5db..2407dbe2 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -548,7 +548,23 @@ func (h *upgradeMgrHelper) isUpgradePolicyViolated(upgradeInProgress int, upgrad return maxParallelUpdates, true } - return maxParallelUpdates, (upgradeInProgress >= maxParallelUpdates) || (upgradeFailedState >= maxUnavailableNodes) + // Remaining space for unavailable nodes + remainingUnavailable := maxUnavailableNodes - upgradeFailedState + + var maxParallelAllowed int + if maxParallelUpdates == 0 { + // "0 means Unlimited parallel" — so allow up to remaining unavailable + maxParallelAllowed = remainingUnavailable + } else { + // Take into consideration minimum between configured value and remaining unavailable + 
maxParallelAllowed = min(maxParallelUpdates, remainingUnavailable) + } + + if maxParallelAllowed == 0 || upgradeInProgress >= maxParallelAllowed { + return maxParallelAllowed, true + } + + return maxParallelAllowed, false } From a4c4cc99858f008d4a118f7b7600b803e9f2335e Mon Sep 17 00:00:00 2001 From: vm Date: Fri, 11 Apr 2025 05:26:50 +0000 Subject: [PATCH 23/24] Release note doc --- docs/knownlimitations.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md index c84430e8..a21fc6d3 100644 --- a/docs/knownlimitations.md +++ b/docs/knownlimitations.md @@ -92,6 +92,13 @@ - **Recommendation:** Please set maxParallel Upgrades to a number less than total number of worker nodes

+14. **Driver Install/Upgrade Issue if a node running the KMM build pod gets rebooted accidentally when rebootRequired is set to false**
+
+    - **Impact:** Unable to perform driver install/upgrade
+    - **Affected Configurations:** All configurations
+    - **Recommendation:** Please retrigger the driver install/upgrade and do not reboot nodes manually when rebootRequired is false
+

+ ## Fixed Issues 1. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module (Fixed in v1.2.0)** From f2acfb1d00b1e367c80f006edf3b26512d842a7c Mon Sep 17 00:00:00 2001 From: vm Date: Thu, 17 Apr 2025 09:35:31 +0000 Subject: [PATCH 24/24] Node labeller flags for partition related labels[DO NOT MERGE] --- internal/nodelabeller/nodelabeller.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/internal/nodelabeller/nodelabeller.go b/internal/nodelabeller/nodelabeller.go index 81293fd9..e745f6c3 100644 --- a/internal/nodelabeller/nodelabeller.go +++ b/internal/nodelabeller/nodelabeller.go @@ -175,7 +175,7 @@ func (nl *nodeLabeller) SetNodeLabellerAsDesired(ds *appsv1.DaemonSet, devConfig InitContainers: initContainers, Containers: []v1.Container{ { - Args: []string{"-c", "./k8s-node-labeller -vram -cu-count -simd-count -device-id -family -product-name -driver-version"}, + Args: []string{"-c", "./k8s-node-labeller -vram -cu-count -simd-count -device-id -family -product-name -driver-version -compute-memory-partition -compute-partitioning-supported -memory-partitioning-supported"}, Command: []string{"sh"}, Env: []v1.EnvVar{ {
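The `isUpgradePolicyViolated` change in patch 22 is easier to follow outside the diff. The sketch below restates the intended policy: the parallel-upgrade allowance is capped both by the configured `maxParallel` (where 0 means unlimited) and by whatever unavailable budget remains after subtracting already-failed nodes. The function and parameter names are illustrative only, not the operator's actual API.

```go
package main

import "fmt"

// isPolicyViolated returns the effective parallel-upgrade allowance and
// whether starting another upgrade would violate the policy.
// maxParallel == 0 means "unlimited", bounded only by the remaining
// unavailable budget (maxUnavailable - failed).
func isPolicyViolated(inProgress, failed, maxParallel, maxUnavailable int) (int, bool) {
	remaining := maxUnavailable - failed
	if remaining < 0 {
		remaining = 0
	}
	// Effective allowance is min(maxParallel, remaining), with 0 treated
	// as unlimited so only the unavailable budget applies.
	allowed := remaining
	if maxParallel != 0 && maxParallel < allowed {
		allowed = maxParallel
	}
	// Violated when no capacity remains or the in-flight count has
	// already reached the allowance.
	if allowed == 0 || inProgress >= allowed {
		return allowed, true
	}
	return allowed, false
}

func main() {
	a, v := isPolicyViolated(1, 0, 2, 3) // one in flight, room for one more
	fmt.Println(a, v)                    // 2 false
	a, v = isPolicyViolated(1, 1, 0, 2) // unlimited parallel, budget spent
	fmt.Println(a, v)                    // 1 true
}
```

Note the design choice this encodes: a failed node consumes unavailable budget permanently until it recovers, so even with `maxParallel: 0` the upgrade stalls once failures reach `maxUnavailable`, matching the "Remaining space for unavailable nodes" comment in the patch.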