From e84675e2937e4a56c2bc3e6d0ba4fe3eeb2e57af Mon Sep 17 00:00:00 2001
From: yansun1996
Date: Fri, 28 Mar 2025 23:23:34 +0000
Subject: [PATCH 01/24] [DOC] Add note that RVS test isn't compatible with
 partitioned GPU yet

---
 docs/test/auto-unhealthy-device-test.md | 4 ++++
 docs/test/manual-test.md                | 4 ++++
 docs/test/pre-start-job-test.md         | 4 ++++
 3 files changed, 12 insertions(+)

diff --git a/docs/test/auto-unhealthy-device-test.md b/docs/test/auto-unhealthy-device-test.md
index 0b6e9cb3..354cc0c7 100644
--- a/docs/test/auto-unhealthy-device-test.md
+++ b/docs/test/auto-unhealthy-device-test.md
@@ -4,6 +4,10 @@

 Test runner is periodically watching for the device health status from device metrics exporter per 30 seconds. Once exporter reported GPU status is unhealthy, test runner will start to run one-time test on the unhealthy GPU. The test result will be exported as Kubernetes event.

+```{warning}
+The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU please disable the test runner from ```DeviceConfig``` by setting ```spec/testRunner/enable``` to ```false```.
+```
+
 ## Configure test runner

 To start the Test Runner along with the GPU Operator, Device Metrics Exporter must be enabled since Test Runner is depending on the exported health status. Configure the ``` spec/metricsExporter/enable ``` field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter and configure the ``` spec/testRunner/enable ``` field in deviceconfig Custom Resource(CR) to enable/disable test runner.

diff --git a/docs/test/manual-test.md b/docs/test/manual-test.md
index c00ac288..7d14f1a9 100644
--- a/docs/test/manual-test.md
+++ b/docs/test/manual-test.md
@@ -4,6 +4,10 @@

 To start the manual test, directly use the test runner image to create the Kubernetes job and related resources, then the test will be triggered manually.

+```{warning}
+The Test Runner's RVS test recipes aren't compatible with partitioned GPU.
If you're using partitioned GPU please reset the GPU partition configuration and run the manual test against the non-partitioned GPU.
+```
+
 ## Use Case 1 - GPU is unhealthy on the node

 When any GPU on a specific worker node is unhealthy, you can manually trigger a test / benchmark run on that worker node to check more details on the unhealthy state. The test job requires RBAC config to grant the test runner access to export events and add node labels to the cluster. Here is an example of configuring the RBAC and Job resources:

diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md
index 2bad5332..f11e765d 100644
--- a/docs/test/pre-start-job-test.md
+++ b/docs/test/pre-start-job-test.md
@@ -4,6 +4,10 @@

 Test runner can be embedded as an init container within your Kubernetes workload pod definition. The init container will be executed before the actual workload containers start, in that way the system could be tested right before the workload start to use the hardware resource.

+```{warning}
+The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU, don't run the test runner as init container to perform the pre-start job test.
+```
+
 ## Configure pre-start init container

 The init container requires RBAC config to grant the pod access to export events and add node labels to the cluster.
Here is an example of configuring the RBAC and Job resources:

From 60959c586f1602f43e4767d96f1b7b4f864c19bd Mon Sep 17 00:00:00 2001
From: yansun1996
Date: Mon, 31 Mar 2025 09:01:10 +0000
Subject: [PATCH 02/24] Address comments

---
 docs/test/auto-unhealthy-device-test.md | 2 +-
 docs/test/manual-test.md                | 2 +-
 docs/test/pre-start-job-test.md         | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/test/auto-unhealthy-device-test.md b/docs/test/auto-unhealthy-device-test.md
index 354cc0c7..c610a32c 100644
--- a/docs/test/auto-unhealthy-device-test.md
+++ b/docs/test/auto-unhealthy-device-test.md
@@ -5,7 +5,7 @@
 Test runner is periodically watching for the device health status from device metrics exporter per 30 seconds. Once exporter reported GPU status is unhealthy, test runner will start to run one-time test on the unhealthy GPU. The test result will be exported as Kubernetes event.

 ```{warning}
-The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU please disable the test runner from ```DeviceConfig``` by setting ```spec/testRunner/enable``` to ```false```.
+The RVS test recipes in the Test Runner aren't compatible with partitioned GPUs. To address this, either disable the test runner by setting ```spec/testRunner/enable``` to ```false```, or configure the test runner to run only on nodes without partitioned GPUs by using ```spec/testRunner/selector```.
 ```

 ## Configure test runner
diff --git a/docs/test/manual-test.md b/docs/test/manual-test.md
index 7d14f1a9..c4ba4bae 100644
--- a/docs/test/manual-test.md
+++ b/docs/test/manual-test.md
@@ -5,7 +5,7 @@
 To start the manual test, directly use the test runner image to create the Kubernetes job and related resources, then the test will be triggered manually.

 ```{warning}
-The Test Runner's RVS test recipes aren't compatible with partitioned GPU.
If you're using partitioned GPU please reset the GPU partition configuration and run the manual test against the non-partitioned GPU.
+The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, please reset the GPU partition configuration and conduct the manual test on a non-partitioned GPU.
 ```

 ## Use Case 1 - GPU is unhealthy on the node
diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md
index f11e765d..d5133faa 100644
--- a/docs/test/pre-start-job-test.md
+++ b/docs/test/pre-start-job-test.md
@@ -5,7 +5,7 @@
 Test runner can be embedded as an init container within your Kubernetes workload pod definition. The init container will be executed before the actual workload containers start, in that way the system could be tested right before the workload start to use the hardware resource.

 ```{warning}
-The Test Runner's RVS test recipes aren't compatible with partitioned GPU. If you're using partitioned GPU, don't run the test runner as init container to perform the pre-start job test.
+The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, avoid running the Test Runner as an init container for the pre-start job test.
``` ## Configure pre-start init container From 7a50f27fd6a85109453a6984ad68e21384d64791 Mon Sep 17 00:00:00 2001 From: vm Date: Fri, 28 Mar 2025 04:12:20 +0000 Subject: [PATCH 03/24] BootID support for Reboot during Driver Upgrade --- internal/controllers/mock_upgrademgr.go | 26 +++++++++++++++++++++++++ internal/controllers/upgrademgr.go | 23 ++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/internal/controllers/mock_upgrademgr.go b/internal/controllers/mock_upgrademgr.go index 7db0fa9c..03944030 100644 --- a/internal/controllers/mock_upgrademgr.go +++ b/internal/controllers/mock_upgrademgr.go @@ -216,6 +216,20 @@ func (mr *MockupgradeMgrHelperAPIMockRecorder) deleteRebootPod(ctx, nodeName, dc return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "deleteRebootPod", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).deleteRebootPod), ctx, nodeName, dc, force, genId) } +// getBootID mocks base method. +func (m *MockupgradeMgrHelperAPI) getBootID(nodeName string) string { + m.ctrl.T.Helper() + ret := m.ctrl.Call(m, "getBootID", nodeName) + ret0, _ := ret[0].(string) + return ret0 +} + +// getBootID indicates an expected call of getBootID. +func (mr *MockupgradeMgrHelperAPIMockRecorder) getBootID(nodeName any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "getBootID", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).getBootID), nodeName) +} + // getNode mocks base method. func (m *MockupgradeMgrHelperAPI) getNode(ctx context.Context, nodeName string) (*v1.Node, error) { m.ctrl.T.Helper() @@ -465,6 +479,18 @@ func (mr *MockupgradeMgrHelperAPIMockRecorder) isUpgradePolicyViolated(upgradeIn return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isUpgradePolicyViolated", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).isUpgradePolicyViolated), upgradeInProgress, upgradeFailedState, totalNodes, deviceConfig) } +// setBootID mocks base method. 
+func (m *MockupgradeMgrHelperAPI) setBootID(nodeName, bootID string) { + m.ctrl.T.Helper() + m.ctrl.Call(m, "setBootID", nodeName, bootID) +} + +// setBootID indicates an expected call of setBootID. +func (mr *MockupgradeMgrHelperAPIMockRecorder) setBootID(nodeName, bootID any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "setBootID", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).setBootID), nodeName, bootID) +} + // setNodeStatus mocks base method. func (m *MockupgradeMgrHelperAPI) setNodeStatus(ctx context.Context, nodeName string, status v1alpha1.UpgradeState) { m.ctrl.T.Helper() diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index a5e519b2..ad2ee41c 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -287,6 +287,8 @@ type upgradeMgrHelperAPI interface { setUpgradeStartTime(nodeName string) clearUpgradeStartTime(nodeName string) checkUpgradeTimeExceeded(ctx context.Context, nodeName string, deviceConfig *amdv1alpha1.DeviceConfig) bool + getBootID(nodeName string) string + setBootID(nodeName string, bootID string) clearNodeStatus() isInit() bool } @@ -297,6 +299,7 @@ type upgradeMgrHelper struct { drainHelper *drain.Helper nodeStatus *sync.Map nodeUpgradeStartTime *sync.Map + nodeBootID *sync.Map init bool currentSpec driverSpec } @@ -313,6 +316,7 @@ func newUpgradeMgrHelperHandler(client client.Client, k8sInterface kubernetes.In k8sInterface: k8sInterface, nodeStatus: new(sync.Map), nodeUpgradeStartTime: new(sync.Map), + nodeBootID: new(sync.Map), } } @@ -527,6 +531,18 @@ func (h *upgradeMgrHelper) checkUpgradeTimeExceeded(ctx context.Context, nodeNam return false } +func (h *upgradeMgrHelper) getBootID(nodeName string) string { + if value, ok := h.nodeBootID.Load(nodeName); ok { + return value.(string) + } + + return "" +} + +func (h *upgradeMgrHelper) setBootID(nodeName string, currentbootID string) { + 
h.nodeBootID.Store(nodeName, currentbootID) +} + func (h *upgradeMgrHelper) getNodeStatus(nodeName string) amdv1alpha1.UpgradeState { if value, ok := h.nodeStatus.Load(nodeName); ok { @@ -867,6 +883,8 @@ func (h *upgradeMgrHelper) handleNodeReboot(ctx context.Context, node *v1.Node, // Wait for the driver upgrade to complete waitForDriverUpgrade() + currentBootID := node.Status.NodeInfo.BootID + h.setBootID(node.Name, currentBootID) if err := h.client.Create(ctx, rebootPod); err != nil { logger.Error(err, fmt.Sprintf("Node: %v State: %v RebootPod Create failed with Error: %v", node.Name, h.getNodeStatus(node.Name), err)) // Mark the state as failed @@ -888,6 +906,11 @@ func (h *upgradeMgrHelper) handleNodeReboot(ctx context.Context, node *v1.Node, } } + if nodeObj.Status.NodeInfo.BootID != h.getBootID(node.Name) { + h.setBootID(node.Name, nodeObj.Status.NodeInfo.BootID) + logger.Info(fmt.Sprintf("Node: %v has rebooted", node.Name)) + return + } // If node is NotReady, proceed; otherwise, wait for the next tick if nodeNotReady { logger.Info(fmt.Sprintf("Node: %v has moved to NotReady", node.Name)) From b140f1f815f1ded37262af7820d3e05f7de42ff3 Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 26 Mar 2025 07:09:14 +0000 Subject: [PATCH 04/24] Device Plugin Usage documentation from GPU Operator --- docs/device_plugin/device-plugin.md | 112 ++++++++++++++++++++++++++++ docs/sphinx/_toc.yml | 3 + docs/sphinx/_toc.yml.in | 3 + 3 files changed, 118 insertions(+) create mode 100644 docs/device_plugin/device-plugin.md diff --git a/docs/device_plugin/device-plugin.md b/docs/device_plugin/device-plugin.md new file mode 100644 index 00000000..4ecfb97b --- /dev/null +++ b/docs/device_plugin/device-plugin.md @@ -0,0 +1,112 @@ +# Device Plugin + +## Configure device plugin + +To start the Device Plugin along with the GPU Operator configure fields under the ``` spec/devicePlugin ``` field in deviceconfig Custom Resource(CR) + +```yaml + devicePlugin: + # Specify the device plugin image 
+  # default value is rocm/k8s-device-plugin:latest
+  devicePluginImage: rocm/k8s-device-plugin:latest
+
+  # The device plugin arguments are used to pass supported flags and their values when starting the device plugin daemonset
+  devicePluginArguments:
+    resource_naming_strategy: single
+
+  # Specify the node labeller image
+  # default value is rocm/k8s-device-plugin:labeller-latest
+  nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
+
+  # Specify whether to bring up the node labeller component
+  # default value is true
+  enableNodeLabeller: true
+
+```
+
+The **device-plugin** pods start after the **DeviceConfig** CR is updated:
+
+```bash
+#kubectl get pods -n kube-amd-gpu
+NAME                                                              READY   STATUS    RESTARTS   AGE
+amd-gpu-operator-gpu-operator-charts-controller-manager-77tpmgn   1/1     Running   0          4h9m
+amd-gpu-operator-kmm-controller-6d459dffcf-lbgtt                  1/1     Running   0          4h9m
+amd-gpu-operator-kmm-webhook-server-5fdc8b995-qgj49               1/1     Running   0          4h9m
+amd-gpu-operator-node-feature-discovery-gc-78989c896-7lh8t        1/1     Running   0          3h48m
+amd-gpu-operator-node-feature-discovery-master-b8bffc48b-6rnz6    1/1     Running   0          4h9m
+amd-gpu-operator-node-feature-discovery-worker-m9lwn              1/1     Running   0          4h9m
+test-deviceconfig-device-plugin-rk5f4                             1/1     Running   0          134m
+test-deviceconfig-node-labeller-bxk7x                             1/1     Running   0          134m
+```
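As a usage sketch for reviewers (illustrative, not part of this patch): workloads consume whatever resource names the device plugin advertises through the pod spec `resources` block. The pod name, container image, and command below are assumptions; only the `amd.com/gpu` resource name comes from this documentation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: rocm-smoke
      image: rocm/dev-ubuntu-22.04   # illustrative image
      command: ["rocm-smi"]
      resources:
        limits:
          amd.com/gpu: 1         # resource name exposed by the device plugin
```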
+
+Note: The Device Plugin name will be prefixed with the name of your DeviceConfig custom resource.
+
+## Device Plugin DeviceConfig
+
+| Field Name | Details |
+|----------------------------------|----------------------------------------------|
+| **DevicePluginImage** | Device plugin image |
+| **DevicePluginImagePullPolicy** | One of Always, Never, IfNotPresent. |
+| **NodeLabellerImage** | Node labeller image |
+| **NodeLabellerImagePullPolicy** | One of Always, Never, IfNotPresent. |
+| **EnableNodeLabeller** | Enable/Disable node labeller with True/False |
+| **DevicePluginArguments** | The flag/values to pass on to Device Plugin |
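A hedged sketch (reviewer note, not part of the patch) tying the fields in the table above together in one `DeviceConfig` fragment; the field paths follow the table, while the concrete values are illustrative:

```yaml
devicePlugin:
  devicePluginImage: rocm/k8s-device-plugin:latest
  devicePluginImagePullPolicy: IfNotPresent   # Always / Never / IfNotPresent
  nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  nodeLabellerImagePullPolicy: IfNotPresent
  enableNodeLabeller: true
  devicePluginArguments:
    resource_naming_strategy: mixed           # or "single" (the default)
```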
+
+1. Both the `ImagePullPolicy` fields default to `Always` if the `:latest` tag is specified on the respective image, and default to `IfNotPresent` otherwise. This is the default Kubernetes behavior for `ImagePullPolicy`.
+
+2. `DevicePluginArguments` is of type `map[string]string`. The currently supported key-value pairs to set under `DevicePluginArguments` are:
+   -> "resource_naming_strategy": {"single", "mixed"}
+
+## How to choose Resource Naming Strategy
+
+To customize the way the device plugin reports GPU resources to Kubernetes as allocatable resources, use the `single` or `mixed` resource naming strategy in the **DeviceConfig** CR.
+Before choosing a strategy, please note the definitions of homogeneous and heterogeneous nodes:
+
+Homogeneous node: A node whose GPUs all follow the same compute-memory partition style.
+   -> Example: A node of 8 GPUs where all 8 GPUs follow the CPX-NPS4 partition style
+
+Heterogeneous node: A node whose GPUs follow different compute-memory partition styles.
+   -> Example: A node of 8 GPUs where 5 GPUs follow SPX-NPS1 and 3 GPUs follow CPX-NPS1
+
+### Single
+
+In `single` mode, the device plugin reports all GPUs (regardless of whether they are whole GPUs or partitions of a GPU) under the resource name `amd.com/gpu`.
+This mode is supported for homogeneous nodes but not for heterogeneous nodes.
+
+A node which has 8 GPUs where none of the GPUs are partitioned will report its resources as:
+
+```bash
+amd.com/gpu: 8
+```
+
+A node which has 8 GPUs where all GPUs are partitioned using the CPX-NPS4 style will report its resources as:
+
+```bash
+amd.com/gpu: 64
+```
+
+### Mixed
+
+In `mixed` mode, the device plugin reports each GPU under a name which matches its partition style.
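Reviewer note (illustrative only, not part of the patch): the counting rules this documentation describes can be sketched as a small pure function. The `gpu` struct and partition counts here are assumptions made for the sketch; the resource names and the 8-GPU examples follow the documentation text.

```go
package main

import "fmt"

// gpu describes one physical GPU for this sketch: its partition style
// and how many partitions it currently exposes (1 when unpartitioned).
type gpu struct {
	style      string // e.g. "spx_nps1", "cpx_nps4"; "" when unpartitioned
	partitions int
}

// reportedResources mimics the two naming strategies: "single" sums
// everything under amd.com/gpu, "mixed" groups by partition style.
func reportedResources(strategy string, gpus []gpu) map[string]int {
	out := map[string]int{}
	for _, g := range gpus {
		name := "amd.com/gpu"
		if strategy == "mixed" && g.style != "" {
			name = "amd.com/" + g.style
		}
		out[name] += g.partitions
	}
	return out
}

func main() {
	// 8 GPUs, all CPX-NPS4 (8 compute partitions each) under "single":
	// everything is summed as amd.com/gpu: 64
	homog := make([]gpu, 8)
	for i := range homog {
		homog[i] = gpu{style: "cpx_nps4", partitions: 8}
	}
	fmt.Println(reportedResources("single", homog))

	// 5 SPX-NPS1 GPUs + 3 CPX-NPS1 GPUs under "mixed":
	// amd.com/spx_nps1 counts 5, amd.com/cpx_nps1 counts 3*8 = 24
	var het []gpu
	for i := 0; i < 5; i++ {
		het = append(het, gpu{style: "spx_nps1", partitions: 1})
	}
	for i := 0; i < 3; i++ {
		het = append(het, gpu{style: "cpx_nps1", partitions: 8})
	}
	fmt.Println(reportedResources("mixed", het))
}
```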
+This mode is supported for both homogeneous nodes and heterogeneous nodes.
+
+A node which has 8 GPUs which are all partitioned using the CPX-NPS4 style will report its resources as:
+
+```bash
+amd.com/cpx_nps4: 64
+```
+
+A node which has 8 GPUs where 5 GPUs follow SPX-NPS1 and 3 GPUs follow CPX-NPS1 will report its resources as:
+
+```bash
+amd.com/spx_nps1: 5
+amd.com/cpx_nps1: 24
+```
+
+#### **Notes**
+
+- If `resource_naming_strategy` is not passed using the `DevicePluginArguments` field in the CR, then the device plugin will internally default to the `single` resource naming strategy. This maintains backwards compatibility with earlier releases of the device plugin, which reported the resource name `amd.com/gpu`.
+- If a node has GPUs which do not support partitioning, such as MI210, then the GPUs are reported under the resource name `amd.com/gpu` regardless of the resource naming strategy.
+- These resource names, for example `amd.com/cpx_nps1`, must be used when requesting resources in a pod spec.
\ No newline at end of file
diff --git a/docs/sphinx/_toc.yml b/docs/sphinx/_toc.yml
index a232e7ab..62786ea4 100644
--- a/docs/sphinx/_toc.yml
+++ b/docs/sphinx/_toc.yml
@@ -44,6 +44,9 @@ subtrees:
       - file: test/manual-test
       - file: test/pre-start-job-test
       - file: test/appendix-test-recipe
+  - caption: Device Plugin
+    entries:
+      - file: device_plugin/device-plugin
   - caption: Specialized Networks
     entries:
       - file: specialized_networks/airgapped-install
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index a232e7ab..62786ea4 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -44,6 +44,9 @@ subtrees:
       - file: test/manual-test
       - file: test/pre-start-job-test
       - file: test/appendix-test-recipe
+  - caption: Device Plugin
+    entries:
+      - file: device_plugin/device-plugin
   - caption: Specialized Networks
     entries:
       - file: specialized_networks/airgapped-install

From 1a99bba075542ee6a04411d909f82a638a10c5a5 Mon Sep 17 00:00:00 2001
From: yansun1996
Date: Wed, 26 Mar 2025 20:13:14 +0000
Subject: [PATCH 05/24] Optimize the docs and filename for blacklist function

---
 api/v1alpha1/deviceconfig_types.go               |  4 +++-
 .../amd-gpu-operator.clusterserviceversion.yaml  |  7 +++++--
 bundle/manifests/amd.com_deviceconfigs.yaml      |  5 ++++-
 config/crd/bases/amd.com_deviceconfigs.yaml      |  5 ++++-
 .../amd-gpu-operator.clusterserviceversion.yaml  |  5 ++++-
 helm-charts-k8s/Chart.lock                       |  2 +-
 helm-charts-k8s/crds/deviceconfig-crd.yaml       |  5 ++++-
 helm-charts-openshift/Chart.lock                 |  2 +-
 helm-charts-openshift/crds/deviceconfig-crd.yaml |  5 ++++-
 internal/nodelabeller/nodelabeller.go            | 12 +++++++++---
 10 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/api/v1alpha1/deviceconfig_types.go b/api/v1alpha1/deviceconfig_types.go
index 503c0939..b6f186c0 100644
--- a/api/v1alpha1/deviceconfig_types.go
+++ b/api/v1alpha1/deviceconfig_types.go
@@ -94,7 +94,9 @@ type DriverSpec struct {
 	// +kubebuilder:default=true
 	Enable *bool `json:"enable,omitempty"`

-	// blacklist amdgpu drivers on the host
+	// blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+	// This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+	// Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
 	//+operator-sdk:csv:customresourcedefinitions:type=spec,displayName="BlacklistDrivers",xDescriptors={"urn:alm:descriptor:com.amd.deviceconfigs:blacklistDrivers"}
 	Blacklist *bool `json:"blacklist,omitempty"`

diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
index 45078acb..3a6cd86b 100644
--- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
+++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
@@ -30,7 +30,7 @@ metadata:
         }
       ]
     capabilities: Basic Install
-    createdAt: "2025-03-25T06:19:27Z"
+    createdAt: "2025-03-26T20:10:59Z"
    operatorframework.io/suggested-namespace: openshift-amd-gpu
    operators.operatorframework.io/builder: operator-sdk-v1.32.0
    operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
    repository: https://github.com/ROCm/gpu-operator
@@ -229,7 +229,10 @@ spec:
        path: driver.amdgpuInstallerRepoURL
        x-descriptors:
        - urn:alm:descriptor:com.amd.deviceconfigs:amdgpuInstallerRepoURL
-      - description: blacklist amdgpu drivers on the host
+      - description: blacklist amdgpu drivers on the host. Node reboot is required
+          to apply the blacklist on the worker nodes. This does not work on OpenShift
+          clusters; OpenShift users should use the Machine Config Operator (MCO)
+          resource to configure the amdgpu blacklist.
Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
        displayName: BlacklistDrivers
        path: driver.blacklist
        x-descriptors:
diff --git a/bundle/manifests/amd.com_deviceconfigs.yaml b/bundle/manifests/amd.com_deviceconfigs.yaml
index c9123ffe..d2669dc1 100644
--- a/bundle/manifests/amd.com_deviceconfigs.yaml
+++ b/bundle/manifests/amd.com_deviceconfigs.yaml
@@ -342,7 +342,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+                Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
              type: boolean
            enable:
              default: true
diff --git a/config/crd/bases/amd.com_deviceconfigs.yaml b/config/crd/bases/amd.com_deviceconfigs.yaml
index 24c2b053..7916a7e6 100644
--- a/config/crd/bases/amd.com_deviceconfigs.yaml
+++ b/config/crd/bases/amd.com_deviceconfigs.yaml
@@ -338,7 +338,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+                Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
              type: boolean
            enable:
              default: true
diff --git a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml
index a9f4d685..f91b8a24 100644
--- a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml
+++ b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml
@@ -200,7 +200,10 @@ spec:
        path: driver.amdgpuInstallerRepoURL
        x-descriptors:
        - urn:alm:descriptor:com.amd.deviceconfigs:amdgpuInstallerRepoURL
-      - description: blacklist amdgpu drivers on the host
+      - description: blacklist amdgpu drivers on the host. Node reboot is required
+          to apply the blacklist on the worker nodes. This does not work on OpenShift
+          clusters; OpenShift users should use the Machine Config Operator (MCO)
+          resource to configure the amdgpu blacklist.
Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
        displayName: BlacklistDrivers
        path: driver.blacklist
        x-descriptors:
diff --git a/helm-charts-k8s/Chart.lock b/helm-charts-k8s/Chart.lock
index 54b4cb8c..f42b6cfb 100644
--- a/helm-charts-k8s/Chart.lock
+++ b/helm-charts-k8s/Chart.lock
@@ -6,4 +6,4 @@ dependencies:
   repository: file://./charts/kmm
   version: v1.0.0
 digest: sha256:f9a315dd2ce3d515ebf28c8e9a6a82158b493ca2686439ec381487761261b597
-generated: "2025-03-25T06:19:17.248998622Z"
+generated: "2025-03-26T20:10:45.247725094Z"
diff --git a/helm-charts-k8s/crds/deviceconfig-crd.yaml b/helm-charts-k8s/crds/deviceconfig-crd.yaml
index 502f4b89..81c564c1 100644
--- a/helm-charts-k8s/crds/deviceconfig-crd.yaml
+++ b/helm-charts-k8s/crds/deviceconfig-crd.yaml
@@ -346,7 +346,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+                Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
              type: boolean
            enable:
              default: true
diff --git a/helm-charts-openshift/Chart.lock b/helm-charts-openshift/Chart.lock
index 6e9b718d..8eb0ba07 100644
--- a/helm-charts-openshift/Chart.lock
+++ b/helm-charts-openshift/Chart.lock
@@ -6,4 +6,4 @@ dependencies:
   repository: file://./charts/kmm
   version: v1.0.0
 digest: sha256:25200c34a5cc846a1275e5bf3fc637b19e909dc68de938189c5278d77d03f5ac
-generated: "2025-03-25T06:19:26.060856628Z"
+generated: "2025-03-26T20:10:56.781691243Z"
diff --git a/helm-charts-openshift/crds/deviceconfig-crd.yaml b/helm-charts-openshift/crds/deviceconfig-crd.yaml
index 502f4b89..81c564c1 100644
--- a/helm-charts-openshift/crds/deviceconfig-crd.yaml
+++ b/helm-charts-openshift/crds/deviceconfig-crd.yaml
@@ -346,7 +346,10 @@ spec:
                installer URL is https://repo.radeon.com/amdgpu-install by default
              type: string
            blacklist:
-              description: blacklist amdgpu drivers on the host
+              description: |-
+                blacklist amdgpu drivers on the host. Node reboot is required to apply the blacklist on the worker nodes.
+                This does not work on OpenShift clusters; OpenShift users should use the Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
+ Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module type: boolean enable: default: true diff --git a/internal/nodelabeller/nodelabeller.go b/internal/nodelabeller/nodelabeller.go index 959bf39f..81293fd9 100644 --- a/internal/nodelabeller/nodelabeller.go +++ b/internal/nodelabeller/nodelabeller.go @@ -52,6 +52,8 @@ const ( defaultNodeLabellerImage = "rocm/k8s-device-plugin:labeller-latest" defaultUbiNodeLabellerImage = "rocm/k8s-node-labeller:rhubi-latest" defaultInitContainerImage = "busybox:1.36" + defaultBlacklistFileName = "blacklist-amdgpu.conf" + openShiftBlacklistFileName = "blacklist-amdgpu-by-operator.conf" ) //go:generate mockgen -source=nodelabeller.go -package=nodelabeller -destination=mock_nodelabeller.go NodeLabeller @@ -129,15 +131,19 @@ func (nl *nodeLabeller) SetNodeLabellerAsDesired(ds *appsv1.DaemonSet, devConfig }, } - var initContainerCommand []string + blackListFileName := defaultBlacklistFileName + if nl.isOpenShift { + blackListFileName = openShiftBlacklistFileName + } + var initContainerCommand []string if devConfig.Spec.Driver.Blacklist != nil && *devConfig.Spec.Driver.Blacklist { // if users want to apply the blacklist, init container will add the amdgpu to the blacklist - initContainerCommand = []string{"sh", "-c", "echo \"# added by gpu operator \nblacklist amdgpu\" > /host-etc/modprobe.d/blacklist-amdgpu.conf; while [ ! -d /host-sys/class/kfd ] || [ ! -d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done"} + initContainerCommand = []string{"sh", "-c", fmt.Sprintf("echo \"# added by gpu operator \nblacklist amdgpu\" > /host-etc/modprobe.d/%v; while [ ! -d /host-sys/class/kfd ] || [ ! 
-d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done", blackListFileName)} } else { // if users disabled the KMM driver, or disabled the blacklist // init container will remove any hanging amdgpu blacklist entry from the list - initContainerCommand = []string{"sh", "-c", "rm -f /host-etc/modprobe.d/blacklist-amdgpu.conf; while [ ! -d /host-sys/class/kfd ] || [ ! -d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done"} + initContainerCommand = []string{"sh", "-c", fmt.Sprintf("rm -f /host-etc/modprobe.d/%v; while [ ! -d /host-sys/class/kfd ] || [ ! -d /host-sys/module/amdgpu/drivers/ ]; do echo \"amdgpu driver is not loaded \"; sleep 2 ;done", blackListFileName)} } initContainerImage := defaultInitContainerImage From 042ba4868fa6e007c89757b73ade21b7f61a9dc0 Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 2 Apr 2025 05:37:22 +0000 Subject: [PATCH 06/24] Rhubi based utils container --- internal/utils_container/Dockerfile | 36 ++++++----------------------- 1 file changed, 7 insertions(+), 29 deletions(-) diff --git a/internal/utils_container/Dockerfile b/internal/utils_container/Dockerfile index 59e84fda..ada5a760 100644 --- a/internal/utils_container/Dockerfile +++ b/internal/utils_container/Dockerfile @@ -1,31 +1,9 @@ -# Base image -FROM alpine:3.20.3 +FROM registry.access.redhat.com/ubi9/ubi:9.3 -# Install build dependencies -RUN apk add --no-cache \ - bash \ - build-base \ - automake \ - autoconf \ - libtool \ - pkgconfig \ - gettext-dev \ - bison \ - wget \ - tar \ - flex \ - linux-headers +# Install nsenter from util-linux package +RUN dnf install -y util-linux && \ + cp /usr/bin/nsenter /nsenter && \ + dnf clean all -# Set working directory -WORKDIR /tmp - -RUN wget https://github.com/util-linux/util-linux/archive/v2.40.tar.gz && tar -xzf v2.40.tar.gz - -# Build and install nsenter only -WORKDIR /tmp/util-linux-2.40 -RUN ./autogen.sh && \ - ./configure --disable-all-programs 
--enable-nsenter && \ - make nsenter && \ - cp nsenter /nsenter - -ENTRYPOINT ["/nsenter"] +# Set entrypoint to nsenter +ENTRYPOINT ["/nsenter"] \ No newline at end of file From 027cb95d5253dea6f0b21196937811cb05eabc5a Mon Sep 17 00:00:00 2001 From: Sriram Ravishankar <79412470+sriram-30@users.noreply.github.com> Date: Wed, 2 Apr 2025 11:25:23 +0530 Subject: [PATCH 07/24] use ubi minimal image for smaller size --- internal/utils_container/Dockerfile | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/internal/utils_container/Dockerfile b/internal/utils_container/Dockerfile index ada5a760..a40f740b 100644 --- a/internal/utils_container/Dockerfile +++ b/internal/utils_container/Dockerfile @@ -1,9 +1,9 @@ -FROM registry.access.redhat.com/ubi9/ubi:9.3 +FROM registry.access.redhat.com/ubi9/ubi-minimal:9.3 # Install nsenter from util-linux package -RUN dnf install -y util-linux && \ +RUN microdnf install -y util-linux && \ cp /usr/bin/nsenter /nsenter && \ - dnf clean all + microdnf clean all # Set entrypoint to nsenter -ENTRYPOINT ["/nsenter"] \ No newline at end of file +ENTRYPOINT ["/nsenter"] From 51e8a3ee2a33153a8409c10779c774066c69d922 Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Wed, 2 Apr 2025 23:23:03 +0000 Subject: [PATCH 08/24] Push OLM changes for certification on OperatorHub --- ...md-gpu-operator.clusterserviceversion.yaml | 47 +++++++++++++++---- ...md-gpu-operator.clusterserviceversion.yaml | 45 +++++++++++++++--- 2 files changed, 77 insertions(+), 15 deletions(-) diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml index 3a6cd86b..134634b9 100644 --- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml +++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml @@ -29,12 +29,30 @@ metadata: } } ] - capabilities: Basic Install - createdAt: "2025-03-26T20:10:59Z" + capabilities: Seamless Upgrades + categories: AI/Machine 
Learning,Monitoring + containerImage: docker.io/rocm/gpu-operator:v1.2.0 + createdAt: "2025-04-02T23:22:18Z" + description: |- + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter + For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) + devicePluginImage: docker.io/rocm/k8s-device-plugin:rhubi-latest + features.operators.openshift.io/disconnected: "true" + features.operators.openshift.io/fips-compliant: "false" + features.operators.openshift.io/proxy-aware: "true" + features.operators.openshift.io/tls-profiles: "false" + features.operators.openshift.io/token-auth-aws: "false" + features.operators.openshift.io/token-auth-azure: "false" + features.operators.openshift.io/token-auth-gcp: "false" + metricsExporterImage: docker.io/rocm/device-metrics-exporter:v1.2.0 + nodelabellerImage: docker.io/rocm/k8s-device-plugin:labeller-rhubi-latest + operatorframework.io/cluster-monitoring: "true" operatorframework.io/suggested-namespace: openshift-amd-gpu + operators.openshift.io/valid-subscription: '[]' operators.operatorframework.io/builder: operator-sdk-v1.32.0 operators.operatorframework.io/project_layout: go.kubebuilder.io/v3 repository: https://github.com/ROCm/gpu-operator + support: Advanced Micro Devices, Inc. 
name: amd-gpu-operator.v1.2.0 namespace: placeholder spec: @@ -611,7 +629,7 @@ spec: - urn:alm:descriptor:com.amd.deviceconfigs:nodeModuleStatus version: v1alpha1 description: |- - Operator responsible for deploying AMD GPU kernel drivers and device plugin + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) displayName: amd-gpu-operator icon: @@ -1115,11 +1133,24 @@ spec: - supported: true type: AllNamespaces keywords: - - amd-gpu-operator + - AMD + - GPU + - AI + - Deep Learning + - Hardware + - Driver + - Monitoring links: - - name: Amd Gpu Operator - url: https://amd-gpu-operator.domain - maturity: alpha + - name: AMD GPU Operator + url: https://github.com/ROCm/gpu-operator + maintainers: + - email: Yan.Sun3@amd.com + name: Yan Sun + - email: farshad.ghodsian@amd.com + name: Farshad Ghodsian + - email: shrey.ajmera@amd.com + name: Shrey Ajmera + maturity: stable provider: - name: amd-gpu-operator + name: Advanced Micro Devices, Inc. 
version: 1.2.0 diff --git a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml index f91b8a24..878483bd 100644 --- a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml +++ b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml @@ -3,9 +3,27 @@ kind: ClusterServiceVersion metadata: annotations: alm-examples: '[]' - capabilities: Basic Install + capabilities: Seamless Upgrades + categories: AI/Machine Learning,Monitoring + containerImage: docker.io/rocm/gpu-operator:v1.2.0 + description: |- + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter + For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) + devicePluginImage: docker.io/rocm/k8s-device-plugin:rhubi-latest + features.operators.openshift.io/disconnected: "true" + features.operators.openshift.io/fips-compliant: "false" + features.operators.openshift.io/proxy-aware: "true" + features.operators.openshift.io/tls-profiles: "false" + features.operators.openshift.io/token-auth-aws: "false" + features.operators.openshift.io/token-auth-azure: "false" + features.operators.openshift.io/token-auth-gcp: "false" + metricsExporterImage: docker.io/rocm/device-metrics-exporter:v1.2.0 + nodelabellerImage: docker.io/rocm/k8s-device-plugin:labeller-rhubi-latest + operatorframework.io/cluster-monitoring: "true" operatorframework.io/suggested-namespace: openshift-amd-gpu + operators.openshift.io/valid-subscription: '[]' repository: https://github.com/ROCm/gpu-operator + support: Advanced Micro Devices, Inc. 
name: amd-gpu-operator.v0.0.0 namespace: placeholder spec: @@ -582,7 +600,7 @@ spec: - urn:alm:descriptor:com.amd.deviceconfigs:nodeModuleStatus version: v1alpha1 description: |- - Operator responsible for deploying AMD GPU kernel drivers and device plugin + Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) displayName: amd-gpu-operator icon: @@ -602,11 +620,24 @@ spec: - supported: true type: AllNamespaces keywords: - - amd-gpu-operator + - AMD + - GPU + - AI + - Deep Learning + - Hardware + - Driver + - Monitoring links: - - name: Amd Gpu Operator - url: https://amd-gpu-operator.domain - maturity: alpha + - name: AMD GPU Operator + url: https://github.com/ROCm/gpu-operator + maintainers: + - email: Yan.Sun3@amd.com + name: Yan Sun + - email: farshad.ghodsian@amd.com + name: Farshad Ghodsian + - email: shrey.ajmera@amd.com + name: Shrey Ajmera + maturity: stable provider: - name: amd-gpu-operator + name: Advanced Micro Devices, Inc. 
version: 0.0.0 From 6490f63cdd04ea31dcf2752f25f62ee5df5ea66e Mon Sep 17 00:00:00 2001 From: im-AbhiP <8828883+im-AbhiP@users.noreply.github.com> Date: Thu, 3 Apr 2025 19:10:37 -0700 Subject: [PATCH 09/24] New doc additions to metric and test runner section (#112) * Added Test Runner overview page, ECC error injection test page, compatibility matrix on index page, added missing intramfs rebuild step on Driver Installation page, updated the TOC to reflect new additions * Fixed linting/markdown errors --- docs/drivers/installation.md | 7 + docs/index.md | 42 +++++- docs/metrics/ecc-error-injection.md | 199 ++++++++++++++++++++++++++++ docs/sphinx/_toc.yml.in | 3 + docs/test/test-runner-overview.md | 34 +++++ 5 files changed, 283 insertions(+), 2 deletions(-) create mode 100644 docs/metrics/ecc-error-injection.md create mode 100644 docs/test/test-runner-overview.md diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index 890da553..ead38e4d 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -18,12 +18,19 @@ Before installing the AMD GPU driver: Before installing the out-of-tree AMD GPU driver, you must blacklist the inbox AMD GPU driver: +- These commands need to either be run as `root` or by using `sudo` - Create blacklist configuration file on worker nodes: ```bash echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf ``` +- After blacklist configuration file, you need to rebuild the initramfs for the change to take effect: + +```bash +echo update-initramfs -u -k all +``` + - Reboot the worker node to apply the blacklist - Verify the blacklisting: diff --git a/docs/index.md b/docs/index.md index 3a8340ea..9348b933 100644 --- a/docs/index.md +++ b/docs/index.md @@ -13,8 +13,46 @@ The AMD GPU Operator simplifies the deployment and management of AMD Instinct GP ## Compatibility -- **Kubernetes**: 1.29.0 -- Please refer to the [ROCm 
documentation](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatibility matrix for the AMD GPU DKMS driver. +### Supported Hardware + +| **GPUs** | | +| --- | --- | +| AMD Instinct™ MI300X | ✅ Supported | +| AMD Instinct™ MI250 | ✅ Supported | +| AMD Instinct™ MI210 | ✅ Supported | + +### OS & Platform Support Matrix + +Below is a matrix of supported Operating systems and the corresponding Kubernetes version that have been validated to work. We will continue to add more Operating Systems and future versions of Kubernetes with each release of the AMD GPU Operator and Metrics Exporter. + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Operating System | Kubernetes | Red Hat OpenShift |
+| --- | --- | --- |
+| Ubuntu 22.04 LTS | 1.29—1.31 | |
+| Ubuntu 24.04 LTS | 1.29—1.31 | |
+| Red Hat Core OS (RHCOS) | | 4.16—4.17 |
+ +Please refer to the [ROCM documentaiton](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatability matrix for the AMD GPU DKMS driver. ## Prerequisites diff --git a/docs/metrics/ecc-error-injection.md b/docs/metrics/ecc-error-injection.md new file mode 100644 index 00000000..f3f17926 --- /dev/null +++ b/docs/metrics/ecc-error-injection.md @@ -0,0 +1,199 @@ +## ECC Error Injection Testing + +The Metric Exporter has the capability to check for unhealthy GPUs via the monitoring of ECC Errors that can occur when a GPU is not functioning as expected. When an ECC error is detected the Metrics Exporter will now mark the offending GPU as unhealthy and add a node label to indicate which GPU on the node is unhealthy. The Kubernetes Device Plugin also listens to the health metrics coming from the Metrics Exporter to determine GPU status, marking GPUs as schedulable if healthy and unschedulable if unhealthy. + +This health check workflow runs automatically on every node the Device Metrics Exporter is running on, with the Metrics Exporter polling GPUs every 30 seconds and the device plugin checking health status at the same interval, ensuring updates within one minute. Users can customize the default ECC error threshold (set to 0) via the `HealthThresholds` field in the metrics exporter ConfigMap. As part of this workflow healthy GPUs are made available for Kubernetes job scheduling, while ensuring no new jobs are scheduled on an unhealthy GPUs. + +## To do error injection follow these steps + +We have added a new `metricsclient` to the Device Metrics Exporter pod that can be used to inject ECC errors into an otherwise healthy GPU for testing the above health check workflow. This is fairly simple and don't worry this does not harm your GPU as any errors that are being injected are debugging in nature and not real errors. The steps to do this have been outlined below: + +### 1. 
Set Node Name + +Use an environment variable to set the Kubernetes node name to indicate which node you want to test error injection on: + +```bash +NODE_NAME= +``` + +Replace with the name of the node you want to test. If you are running this from the same node you want to test you can grab the hostname using: + +```bash +NODE_NAME=$(hostname) +``` + +### 2. Set Metrics Exporter Pod Name + +Since you have to execute the `metricsclient` from directly within the Device Metrics Exporter pod we need to get the Metrics Exporter pod name running on the node: + +```bash +METRICS_POD=$(kubectl get pods -n kube-amd-gpu --field-selector spec.nodeName=$NODE_NAME --no-headers -o custom-columns=":metadata.name" | grep '^gpu-operator-metrics-exporter-' | head -n 1) +``` + +### 3. Check Metrics Client to see GPU Health + +Now that you have the name of the metrics exporter pod you can use the metricsclient to check the current health of all GPUs on the node: + +```bash +kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- metricsclient +``` + +You should see a list of all the GPUs on that node along with their corresponding status. In most cases all GPUs should report as being `healthy`. + +```bash +ID Health Associated Workload +------------------------------------------------ +1 healthy [] +0 healthy [] +7 healthy [] +6 healthy [] +5 healthy [] +4 healthy [] +3 healthy [] +2 healthy [] +------------------------------------------------ +``` + +### 4. Inject ECC Errors on GPU 0 + +In order to simulate errors on a GPU we will be using a json file that specifies a GPU ID along with counters for several ECC Uncorrectable error fields that are being monitored by the Device Metrics Exporter. In the below example you can see that we are specifying `GPU 0` and injecting 1 `GPU_ECC_UNCORRECT_SEM` error and 2 `GPU_ECC_UNCORRECT_FUSE` errors. We use the `metricslient -ecc-file-path ` command to specify the json file we want to inject into the metrics table. 
To create the json file and execute the metricsclient command all in in one go run the following: + +```bash +kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- sh -c 'cat > /tmp/ecc.json < /tmp/delete_ecc.json < Date: Thu, 3 Apr 2025 19:27:21 -0700 Subject: [PATCH 10/24] Revert "New doc additions to metric and test runner section (#112)" (#113) This reverts commit 249d688f519f363cf7698db132d6e3ab4be34a27. --- docs/drivers/installation.md | 7 - docs/index.md | 42 +----- docs/metrics/ecc-error-injection.md | 199 ---------------------------- docs/sphinx/_toc.yml.in | 3 - docs/test/test-runner-overview.md | 34 ----- 5 files changed, 2 insertions(+), 283 deletions(-) delete mode 100644 docs/metrics/ecc-error-injection.md delete mode 100644 docs/test/test-runner-overview.md diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index ead38e4d..890da553 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -18,19 +18,12 @@ Before installing the AMD GPU driver: Before installing the out-of-tree AMD GPU driver, you must blacklist the inbox AMD GPU driver: -- These commands need to either be run as `root` or by using `sudo` - Create blacklist configuration file on worker nodes: ```bash echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf ``` -- After blacklist configuration file, you need to rebuild the initramfs for the change to take effect: - -```bash -echo update-initramfs -u -k all -``` - - Reboot the worker node to apply the blacklist - Verify the blacklisting: diff --git a/docs/index.md b/docs/index.md index 9348b933..3a8340ea 100644 --- a/docs/index.md +++ b/docs/index.md @@ -13,46 +13,8 @@ The AMD GPU Operator simplifies the deployment and management of AMD Instinct GP ## Compatibility -### Supported Hardware - -| **GPUs** | | -| --- | --- | -| AMD Instinct™ MI300X | ✅ Supported | -| AMD Instinct™ MI250 | ✅ Supported | -| AMD Instinct™ MI210 | ✅ Supported | - -### OS & Platform Support 
Matrix - -Below is a matrix of supported Operating systems and the corresponding Kubernetes version that have been validated to work. We will continue to add more Operating Systems and future versions of Kubernetes with each release of the AMD GPU Operator and Metrics Exporter. - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Operating System | Kubernetes | Red Hat OpenShift |
-| --- | --- | --- |
-| Ubuntu 22.04 LTS | 1.29—1.31 | |
-| Ubuntu 24.04 LTS | 1.29—1.31 | |
-| Red Hat Core OS (RHCOS) | | 4.16—4.17 |
- -Please refer to the [ROCM documentaiton](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatability matrix for the AMD GPU DKMS driver. +- **Kubernetes**: 1.29.0 +- Please refer to the [ROCm documentation](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatibility matrix for the AMD GPU DKMS driver. ## Prerequisites diff --git a/docs/metrics/ecc-error-injection.md b/docs/metrics/ecc-error-injection.md deleted file mode 100644 index f3f17926..00000000 --- a/docs/metrics/ecc-error-injection.md +++ /dev/null @@ -1,199 +0,0 @@ -## ECC Error Injection Testing - -The Metric Exporter has the capability to check for unhealthy GPUs via the monitoring of ECC Errors that can occur when a GPU is not functioning as expected. When an ECC error is detected the Metrics Exporter will now mark the offending GPU as unhealthy and add a node label to indicate which GPU on the node is unhealthy. The Kubernetes Device Plugin also listens to the health metrics coming from the Metrics Exporter to determine GPU status, marking GPUs as schedulable if healthy and unschedulable if unhealthy. - -This health check workflow runs automatically on every node the Device Metrics Exporter is running on, with the Metrics Exporter polling GPUs every 30 seconds and the device plugin checking health status at the same interval, ensuring updates within one minute. Users can customize the default ECC error threshold (set to 0) via the `HealthThresholds` field in the metrics exporter ConfigMap. As part of this workflow healthy GPUs are made available for Kubernetes job scheduling, while ensuring no new jobs are scheduled on an unhealthy GPUs. - -## To do error injection follow these steps - -We have added a new `metricsclient` to the Device Metrics Exporter pod that can be used to inject ECC errors into an otherwise healthy GPU for testing the above health check workflow. 
This is fairly simple and don't worry this does not harm your GPU as any errors that are being injected are debugging in nature and not real errors. The steps to do this have been outlined below: - -### 1. Set Node Name - -Use an environment variable to set the Kubernetes node name to indicate which node you want to test error injection on: - -```bash -NODE_NAME= -``` - -Replace with the name of the node you want to test. If you are running this from the same node you want to test you can grab the hostname using: - -```bash -NODE_NAME=$(hostname) -``` - -### 2. Set Metrics Exporter Pod Name - -Since you have to execute the `metricsclient` from directly within the Device Metrics Exporter pod we need to get the Metrics Exporter pod name running on the node: - -```bash -METRICS_POD=$(kubectl get pods -n kube-amd-gpu --field-selector spec.nodeName=$NODE_NAME --no-headers -o custom-columns=":metadata.name" | grep '^gpu-operator-metrics-exporter-' | head -n 1) -``` - -### 3. Check Metrics Client to see GPU Health - -Now that you have the name of the metrics exporter pod you can use the metricsclient to check the current health of all GPUs on the node: - -```bash -kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- metricsclient -``` - -You should see a list of all the GPUs on that node along with their corresponding status. In most cases all GPUs should report as being `healthy`. - -```bash -ID Health Associated Workload ------------------------------------------------- -1 healthy [] -0 healthy [] -7 healthy [] -6 healthy [] -5 healthy [] -4 healthy [] -3 healthy [] -2 healthy [] ------------------------------------------------- -``` - -### 4. Inject ECC Errors on GPU 0 - -In order to simulate errors on a GPU we will be using a json file that specifies a GPU ID along with counters for several ECC Uncorrectable error fields that are being monitored by the Device Metrics Exporter. 
In the below example you can see that we are specifying `GPU 0` and injecting 1 `GPU_ECC_UNCORRECT_SEM` error and 2 `GPU_ECC_UNCORRECT_FUSE` errors. We use the `metricslient -ecc-file-path ` command to specify the json file we want to inject into the metrics table. To create the json file and execute the metricsclient command all in in one go run the following: - -```bash -kubectl exec -n kube-amd-gpu $METRICS_POD -c metrics-exporter-container -- sh -c 'cat > /tmp/ecc.json < /tmp/delete_ecc.json < Date: Thu, 3 Apr 2025 10:14:15 +0000 Subject: [PATCH 11/24] Reboot Loop issue if control node needs to go down for driver upgrade --- api/v1alpha1/deviceconfig_types.go | 1 + ...md-gpu-operator.clusterserviceversion.yaml | 2 +- bundle/manifests/amd.com_deviceconfigs.yaml | 2 ++ config/crd/bases/amd.com_deviceconfigs.yaml | 2 ++ helm-charts-k8s/Chart.lock | 2 +- helm-charts-k8s/crds/deviceconfig-crd.yaml | 2 ++ helm-charts-openshift/Chart.lock | 2 +- .../crds/deviceconfig-crd.yaml | 2 ++ .../controllers/device_config_reconciler.go | 10 +++++++- internal/controllers/mock_upgrademgr.go | 14 +++++++++++ internal/controllers/upgrademgr.go | 25 +++++++++++++++---- 11 files changed, 55 insertions(+), 9 deletions(-) diff --git a/api/v1alpha1/deviceconfig_types.go b/api/v1alpha1/deviceconfig_types.go index b6f186c0..4a5d0597 100644 --- a/api/v1alpha1/deviceconfig_types.go +++ b/api/v1alpha1/deviceconfig_types.go @@ -597,6 +597,7 @@ type ModuleStatus struct { LastTransitionTime string `json:"lastTransitionTime,omitempty"` Status UpgradeState `json:"status,omitempty"` UpgradeStartTime string `json:"upgradeStartTime,omitempty"` + BootId string `json:"bootId,omitempty"` } // DeviceConfigStatus defines the observed state of Module. 
diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml index 134634b9..c73d0351 100644 --- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml +++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml @@ -32,7 +32,7 @@ metadata: capabilities: Seamless Upgrades categories: AI/Machine Learning,Monitoring containerImage: docker.io/rocm/gpu-operator:v1.2.0 - createdAt: "2025-04-02T23:22:18Z" + createdAt: "2025-04-07T07:07:00Z" description: |- Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) diff --git a/bundle/manifests/amd.com_deviceconfigs.yaml b/bundle/manifests/amd.com_deviceconfigs.yaml index d2669dc1..606fa1d8 100644 --- a/bundle/manifests/amd.com_deviceconfigs.yaml +++ b/bundle/manifests/amd.com_deviceconfigs.yaml @@ -931,6 +931,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/config/crd/bases/amd.com_deviceconfigs.yaml b/config/crd/bases/amd.com_deviceconfigs.yaml index 7916a7e6..64427582 100644 --- a/config/crd/bases/amd.com_deviceconfigs.yaml +++ b/config/crd/bases/amd.com_deviceconfigs.yaml @@ -927,6 +927,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/helm-charts-k8s/Chart.lock b/helm-charts-k8s/Chart.lock index f42b6cfb..dd529b75 100644 --- a/helm-charts-k8s/Chart.lock +++ b/helm-charts-k8s/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:f9a315dd2ce3d515ebf28c8e9a6a82158b493ca2686439ec381487761261b597 -generated: 
"2025-03-26T20:10:45.247725094Z" +generated: "2025-04-07T07:06:50.661624221Z" diff --git a/helm-charts-k8s/crds/deviceconfig-crd.yaml b/helm-charts-k8s/crds/deviceconfig-crd.yaml index 81c564c1..ff9c1c79 100644 --- a/helm-charts-k8s/crds/deviceconfig-crd.yaml +++ b/helm-charts-k8s/crds/deviceconfig-crd.yaml @@ -932,6 +932,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/helm-charts-openshift/Chart.lock b/helm-charts-openshift/Chart.lock index 8eb0ba07..d4a86324 100644 --- a/helm-charts-openshift/Chart.lock +++ b/helm-charts-openshift/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:25200c34a5cc846a1275e5bf3fc637b19e909dc68de938189c5278d77d03f5ac -generated: "2025-03-26T20:10:56.781691243Z" +generated: "2025-04-07T07:06:59.305455465Z" diff --git a/helm-charts-openshift/crds/deviceconfig-crd.yaml b/helm-charts-openshift/crds/deviceconfig-crd.yaml index 81c564c1..ff9c1c79 100644 --- a/helm-charts-openshift/crds/deviceconfig-crd.yaml +++ b/helm-charts-openshift/crds/deviceconfig-crd.yaml @@ -932,6 +932,8 @@ spec: description: ModuleStatus contains the status of driver module installed by operator on the node properties: + bootId: + type: string containerImage: type: string kernelVersion: diff --git a/internal/controllers/device_config_reconciler.go b/internal/controllers/device_config_reconciler.go index 2e782fb5..7486a8b7 100644 --- a/internal/controllers/device_config_reconciler.go +++ b/internal/controllers/device_config_reconciler.go @@ -593,9 +593,11 @@ func (dcrh *deviceConfigReconcilerHelper) getDeviceConfigOwnedKMMModule(ctx cont func (dcrh *deviceConfigReconcilerHelper) updateDeviceConfigNodeStatus(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig, nodes *v1.NodeList) error { logger := log.FromContext(ctx) previousUpgradeTimes := 
make(map[string]string) + previousBootIds := make(map[string]string) // Persist the UpgradeStartTime for nodeName, moduleStatus := range devConfig.Status.NodeModuleStatus { previousUpgradeTimes[nodeName] = moduleStatus.UpgradeStartTime + previousBootIds[nodeName] = moduleStatus.BootId } devConfig.Status.NodeModuleStatus = map[string]amdv1alpha1.ModuleStatus{} @@ -610,7 +612,12 @@ func (dcrh *deviceConfigReconcilerHelper) updateDeviceConfigNodeStatus(ctx conte if upgradeStartTime == "" { upgradeStartTime = previousUpgradeTimes[node.Name] } - devConfig.Status.NodeModuleStatus[node.Name] = amdv1alpha1.ModuleStatus{Status: dcrh.upgradeMgrHandler.GetNodeStatus(node.Name), UpgradeStartTime: upgradeStartTime} + bootId := dcrh.upgradeMgrHandler.GetNodeBootId(node.Name) + //If operator restarted during Upgrade, then fetch previous known bootId since the internal maps would have been cleared + if bootId == "" { + bootId = previousBootIds[node.Name] + } + devConfig.Status.NodeModuleStatus[node.Name] = amdv1alpha1.ModuleStatus{Status: dcrh.upgradeMgrHandler.GetNodeStatus(node.Name), UpgradeStartTime: upgradeStartTime, BootId: bootId} nmc := kmmv1beta1.NodeModulesConfig{} err := dcrh.client.Get(ctx, types.NamespacedName{Name: node.Name}, &nmc) @@ -632,6 +639,7 @@ func (dcrh *deviceConfigReconcilerHelper) updateDeviceConfigNodeStatus(ctx conte LastTransitionTime: module.LastTransitionTime.String(), Status: dcrh.upgradeMgrHandler.GetNodeStatus(node.Name), UpgradeStartTime: upgradeStartTime, + BootId: bootId, } } } diff --git a/internal/controllers/mock_upgrademgr.go b/internal/controllers/mock_upgrademgr.go index 03944030..748a33d6 100644 --- a/internal/controllers/mock_upgrademgr.go +++ b/internal/controllers/mock_upgrademgr.go @@ -57,6 +57,20 @@ func (m *MockupgradeMgrAPI) EXPECT() *MockupgradeMgrAPIMockRecorder { return m.recorder } +// GetNodeBootId mocks base method. 
+func (m *MockupgradeMgrAPI) GetNodeBootId(nodeName string) string { + m.ctrl.T.Helper() + ret := m.ctrl.Call(m, "GetNodeBootId", nodeName) + ret0, _ := ret[0].(string) + return ret0 +} + +// GetNodeBootId indicates an expected call of GetNodeBootId. +func (mr *MockupgradeMgrAPIMockRecorder) GetNodeBootId(nodeName any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "GetNodeBootId", reflect.TypeOf((*MockupgradeMgrAPI)(nil).GetNodeBootId), nodeName) +} + // GetNodeStatus mocks base method. func (m *MockupgradeMgrAPI) GetNodeStatus(nodeName string) v1alpha1.UpgradeState { m.ctrl.T.Helper() diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index ad2ee41c..25dd5d6b 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -74,6 +74,7 @@ type upgradeMgrAPI interface { HandleDelete(ctx context.Context, deviceConfig *amdv1alpha1.DeviceConfig, nodes *v1.NodeList) (ctrl.Result, error) GetNodeStatus(nodeName string) amdv1alpha1.UpgradeState GetNodeUpgradeStartTime(nodeName string) string + GetNodeBootId(nodeName string) string } func newUpgradeMgrHandler(client client.Client, k8sConfig *rest.Config) upgradeMgrAPI { @@ -108,16 +109,25 @@ func (n *upgradeMgr) HandleUpgrade(ctx context.Context, deviceConfig *amdv1alpha if deviceConfig.Spec.Driver.UpgradePolicy.RebootRequired != nil && *deviceConfig.Spec.Driver.UpgradePolicy.RebootRequired { nodeObj, err := n.helper.getNode(ctx, nodeName) if err == nil { - log.FromContext(ctx).Info("Reboot is required for driver upgrade, triggering node reboot") - n.helper.handleNodeReboot(ctx, nodeObj, deviceConfig) + // trigger reboot only for nodes which are in UpgradeStarted but haven't rebooted yet + if nodeObj.Status.NodeInfo.BootID == moduleStatus.BootId { + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Reboot is required for driver upgrade, triggering node reboot", nodeName)) + n.helper.handleNodeReboot(ctx, nodeObj, 
deviceConfig) + // for nodes which are in UpgradeStarted but already rebooted. Schedule the reboot pod deletion + } else { + currentBootID := nodeObj.Status.NodeInfo.BootID + n.helper.setBootID(nodeObj.Name, currentBootID) + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Node already rebooted, scheduling reboot pod deletion", nodeName)) + go n.helper.deleteRebootPod(ctx, nodeName, deviceConfig, false, deviceConfig.Generation) + } } } else { - log.FromContext(ctx).Info("Resetting Upgrade State to UpgradeStateEmpty") + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Resetting Upgrade State to UpgradeStateEmpty", nodeName)) n.helper.setNodeStatus(ctx, nodeName, amdv1alpha1.UpgradeStateEmpty) } } else if moduleStatus.Status == amdv1alpha1.UpgradeStateRebootInProgress { // Operator restarted during upgrade operation. Schedule the reboot pod deletion - log.FromContext(ctx).Info("Reboot is in progress, scheduling reboot pod deletion") + log.FromContext(ctx).Info(fmt.Sprintf("Node: %v: Reboot is in progress, scheduling reboot pod deletion", nodeName)) n.helper.setNodeStatus(ctx, nodeName, moduleStatus.Status) go n.helper.deleteRebootPod(ctx, nodeName, deviceConfig, false, deviceConfig.Generation) } else { @@ -244,11 +254,16 @@ func (n *upgradeMgr) GetNodeStatus(nodeName string) (status amdv1alpha1.UpgradeS return n.helper.getNodeStatus(nodeName) } -// GetNodeStaGetNodeUpgradeStartTimetus returns the time when upgrade started on the node +// GetNodeUpgradeStartTime returns the time when upgrade started on the node func (n *upgradeMgr) GetNodeUpgradeStartTime(nodeName string) string { return n.helper.getUpgradeStartTime(nodeName) } +// GetNodeBootId returns the last known bootid of the node +func (n *upgradeMgr) GetNodeBootId(nodeName string) string { + return n.helper.getBootID(nodeName) +} + /*=========================================== Upgrade Manager Helper APIs ==========================================*/ //go:generate mockgen -source=upgrademgr.go 
-package=controllers -destination=mock_upgrademgr.go upgradeMgrHelperAPI From 6433730dfd7b8ee25987ee1d04972630299df2c2 Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Sun, 6 Apr 2025 02:02:08 +0000 Subject: [PATCH 12/24] Add warning to describe the known GPU scheduling issue for pre-start job check --- docs/test/pre-start-job-test.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md index d5133faa..d9b67750 100644 --- a/docs/test/pre-start-job-test.md +++ b/docs/test/pre-start-job-test.md @@ -8,6 +8,14 @@ Test runner can be embedded as an init container within your Kubernetes workload The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, avoid running the Test Runner as an init container for the pre-start job test. ``` +```{warning} +* Known Issue: Within a pod, the initContainer and workload container might not be assigned the same GPUs. + +* Workaround: The example in this document remains applicable if both initContainer and workload containers request all GPUs on the same node. + +* Future Solution: With the introduction of [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/), both initContainer and workload container will be able to share the same set of GPUs. +``` + ## Configure pre-start init container The init container requires RBAC config to grant the pod access to export events and add node labels to the cluster. 
Here is an example of configuring the RBAC and Job resources: From 1ad40e323e61bccfc64dd8870f34c274bee283e1 Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Mon, 7 Apr 2025 19:15:00 +0000 Subject: [PATCH 13/24] Address comment --- docs/test/pre-start-job-test.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md index d9b67750..a376ba73 100644 --- a/docs/test/pre-start-job-test.md +++ b/docs/test/pre-start-job-test.md @@ -85,8 +85,8 @@ spec: image: docker.io/rocm/test-runner:v1.2.0-beta.0 imagePullPolicy: IfNotPresent resources: - limits: - amd.com/gpu: 1 # requesting a GPU + requests: + amd.com/gpu: 8 # requesting all GPUs on the worker node env: - name: TEST_TRIGGER value: "PRE_START_JOB_CHECK" # Set the TEST_TRIGGER environment variable to PRE_START_JOB_CHECK for test runner as init container @@ -108,8 +108,8 @@ spec: command: ["/bin/sh", "-c", "--"] args: ["sleep 6000"] resources: - limits: - amd.com/gpu: 1 # requesting a GPU + requests: + amd.com/gpu: 8 # requesting all GPUs on the worker node ``` ## Check test runner init container From 1773fc9628fbe2818ef1a9af33c65612856e39c0 Mon Sep 17 00:00:00 2001 From: vm Date: Tue, 8 Apr 2025 09:12:07 +0000 Subject: [PATCH 14/24] Doc on known limitation --- docs/knownlimitations.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md index 051b48cd..c84430e8 100644 --- a/docs/knownlimitations.md +++ b/docs/knownlimitations.md @@ -85,6 +85,13 @@ - **Recommendation:** Ensure nodes are fully stable before triggering an upgrade, and if necessary, manually update node labels to enforce the new driver version. Refer to driver upgrade documentation for more details.

+13. **Driver Upgrade Issue when maxParallelUpgrades equals the total number of worker nodes in Red Hat OpenShift** + + - **Impact:** Driver upgrades cannot be performed + - **Affected Configurations:** This issue only affects Red Hat OpenShift when the image registry pod is running on one of the worker nodes, or when a KMM build pod is required to run on one of the worker nodes + - **Recommendation:** Set maxParallelUpgrades to a number less than the total number of worker nodes +
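As a concrete illustration of the recommendation above, a cluster with four worker nodes could cap parallel upgrades like this (a hedged sketch — the exact `DeviceConfig` field paths and value types under `spec/driver/upgradePolicy` are assumptions and may differ between operator versions):

```yaml
# Hypothetical DeviceConfig fragment: on a 4-worker-node cluster, keep
# maxParallelUpgrades below 4 so the image registry pod / KMM build pod
# always have an available node to run on during the rolling upgrade.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-deviceconfig
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true
    upgradePolicy:
      enable: true
      maxParallelUpgrades: 3   # less than the total number of worker nodes (4)
      maxUnavailableNodes: 1
```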

+ ## Fixed Issues 1. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module (Fixed in v1.2.0)** From 81090a4fd03ab7f6ad5a0402ef107a689df4083b Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Tue, 8 Apr 2025 09:49:28 +0000 Subject: [PATCH 15/24] Add note for blacklisting amdgpu on OpenShift cluster in full example --- docs/drivers/installation.md | 2 ++ docs/fulldeviceconfig.rst | 7 +++++-- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index 890da553..ed1d9041 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -96,6 +96,8 @@ spec: # enable operator to install out-of-tree amdgpu kernel module enable: true # blacklist is required for installing out-of-tree amdgpu kernel module + # Not working for OpenShift cluster. OpenShift users please use the Machine Config Operator (MCO) resource to configure amdgpu blacklist. 
+ # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module blacklist: true # Specify your repository to host driver image # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you diff --git a/docs/fulldeviceconfig.rst b/docs/fulldeviceconfig.rst index 8d8c1d95..00d52de9 100644 --- a/docs/fulldeviceconfig.rst +++ b/docs/fulldeviceconfig.rst @@ -38,8 +38,11 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM driver: # Set to false to skip driver installation to use inbox or pre-installed driver on worker nodes # Set to true to enable operator to install out-of-tree amdgpu kernel module - enable: false - blacklist: false # Set to true to blacklist the amdgpu kernel module which is required for installing out-of-tree driver + enable: false + # Set to true to blacklist the amdgpu kernel module which is required for installing out-of-tree driver + # Not working for OpenShift cluster. OpenShift users please use the Machine Config Operator (MCO) resource to configure amdgpu blacklist. + # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module + blacklist: false # Specify your repository to host driver image # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you image: docker.io/username/repo From 36c6e9616fc1f9fcc72f0a3aa60c70d6c5b9728f Mon Sep 17 00:00:00 2001 From: Nitish Bhat Date: Mon, 7 Apr 2025 16:03:22 -0700 Subject: [PATCH 16/24] Expose ContainerPort in Metrics Exporter Pod (#534) - ContainerPort lists the ports to expose from the Container. Not specifying a port DOES NOT prevent that port from being exposed.
The device metrics exporter container starts a metrics server on the port specified by the METRICS_EXPORTER_PORT on the default "0.0.0.0" address in the container which exposes the port. Look at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/ for more information on this behavior. Co-authored-by: Nitish Bhat --- internal/metricsexporter/metricsexporter.go | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/internal/metricsexporter/metricsexporter.go b/internal/metricsexporter/metricsexporter.go index 7ead7bca..c57341fb 100644 --- a/internal/metricsexporter/metricsexporter.go +++ b/internal/metricsexporter/metricsexporter.go @@ -240,7 +240,7 @@ func (nl *metricsExporter) SetMetricsExporterAsDesired(ds *appsv1.DaemonSet, dev if internalPort == port { internalPort = port - 1 } - // Bind service port to localhost only + // Bind service port to localhost only, don't expose port in ContainerPort containers[0].Args = []string{"--bind=127.0.0.1:" + fmt.Sprintf("%v", int32(internalPort))} containers[0].Env[1].Value = fmt.Sprintf("%v", internalPort) @@ -292,12 +292,26 @@ func (nl *metricsExporter) SetMetricsExporterAsDesired(ds *appsv1.DaemonSet, dev }, Args: args, VolumeMounts: volumeMounts, + Ports: []v1.ContainerPort{ + { + Name: "exporter-port", + Protocol: v1.ProtocolTCP, + ContainerPort: port, + }, + }, }) // Provide elevated privilege only when rbac-proxy is enabled serviceaccount = kubeRbacSAName } else { containers[0].Env[1].Value = fmt.Sprintf("%v", port) + containers[0].Ports = []v1.ContainerPort{ + { + Name: "exporter-port", + Protocol: v1.ProtocolTCP, + ContainerPort: port, + }, + } } gracePeriod := int64(1) From 53d34c06f8e163e2ec4670b0ead56b896266ad64 Mon Sep 17 00:00:00 2001 From: Nitish Bhat Date: Tue, 8 Apr 2025 19:18:15 +0000 Subject: [PATCH 17/24] Change default cpu/memory resource limits for Controller Manager For larger deployments, the default CPU limits of 500m (half a core) and memory limit of 384Mi 
in the cluster might be insufficient. It has been bumped up with this change and documentation has been added to alert the user to modify these values in helm if they have larger clusters. --- docs/installation/kubernetes-helm.md | 40 +++++++++++++++++++ hack/k8s-patch/metadata-patch/values.yaml | 8 ++-- .../metadata-patch/values.yaml | 8 ++-- helm-charts-k8s/values.yaml | 8 ++-- helm-charts-openshift/values.yaml | 8 ++-- 5 files changed, 56 insertions(+), 16 deletions(-) diff --git a/docs/installation/kubernetes-helm.md b/docs/installation/kubernetes-helm.md index c1415324..8681222f 100644 --- a/docs/installation/kubernetes-helm.md +++ b/docs/installation/kubernetes-helm.md @@ -163,6 +163,10 @@ The following parameters are able to be configued when using the Helm Chart. In | controllerManager.manager.image.tag | string | `"v1.2.0"` | AMD GPU operator controller manager image tag | | controllerManager.manager.imagePullPolicy | string | `"Always"` | Image pull policy for AMD GPU operator controller manager pod | | controllerManager.manager.imagePullSecrets | string | `""` | Image pull secret name for pulling AMD GPU operator controller manager image if registry needs credential to pull image | +| controllerManager.manager.resources.limits.cpu | string | `"1000m"` | CPU limits for the controller manager. Consider increasing for large clusters | +| controllerManager.manager.resources.limits.memory | string | `"1Gi"` | Memory limits for the controller manager. Consider increasing if experiencing OOM issues | +| controllerManager.manager.resources.requests.cpu | string | `"100m"` | CPU requests for the controller manager. Adjust based on observed CPU usage | +| controllerManager.manager.resources.requests.memory | string | `"256Mi"` | Memory requests for the controller manager. 
Adjust based on observed memory usage | | controllerManager.nodeSelector | object | `{}` | Node selector for AMD GPU operator controller manager deployment | | installdefaultNFDRule | bool | `true` | Default NFD rule will detect amd gpu based on pci vendor ID | | kmm.enabled | bool | `true` | Set to true/false to enable/disable the installation of kernel module management (KMM) operator | @@ -258,6 +262,42 @@ Verify that nodes with AMD GPU hardware are properly labeled: kubectl get nodes -L feature.node.kubernetes.io/amd-gpu ``` +## Resource Configuration + +### Controller Manager Resource Settings + +The AMD GPU Operator controller manager component has default resource limits and requests configured for typical usage scenarios. You may need to adjust these values based on your specific cluster environment: + +```yaml +controllerManager: + manager: + resources: + limits: + cpu: 1000m + memory: 1Gi + requests: + cpu: 100m + memory: 256Mi +``` + +#### When to Adjust Resource Settings + +You should consider adjusting the controller manager resource settings in these scenarios: + +- **Large clusters**: If managing a large number of nodes or GPU devices, consider increasing both CPU and memory limits +- **Memory pressure**: If you observe OOM (Out of Memory) kills in controller manager pods, increase the memory limit and request +- **CPU pressure**: If the controller manager is experiencing throttling or slow response times during operations, increase the CPU limit and request +- **Resource-constrained environments**: For smaller development or test clusters, you may reduce these values to conserve resources + +You can apply resource changes by updating your values.yaml file and upgrading the Helm release: + +```bash +helm upgrade amd-gpu-operator amd/gpu-operator-helm \ + --namespace kube-amd-gpu \ + --version=v1.0.0 \ + -f values.yaml +``` + ## Install Custom Resource After the installation of AMD GPU Operator, you need to create the `DeviceConfig` custom resource in 
order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```. For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](../drivers/installation). Here are some examples for common deployment scenarios. diff --git a/hack/k8s-patch/metadata-patch/values.yaml b/hack/k8s-patch/metadata-patch/values.yaml index 71bfd56c..6e6e0a0d 100644 --- a/hack/k8s-patch/metadata-patch/values.yaml +++ b/hack/k8s-patch/metadata-patch/values.yaml @@ -47,11 +47,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi # -- Node selector for AMD GPU operator controller manager deployment nodeSelector: {} # -- Deployment affinity configs for controller manager diff --git a/hack/openshift-patch/metadata-patch/values.yaml b/hack/openshift-patch/metadata-patch/values.yaml index b0b937a9..2bdb27ad 100644 --- a/hack/openshift-patch/metadata-patch/values.yaml +++ b/hack/openshift-patch/metadata-patch/values.yaml @@ -26,11 +26,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi nodeSelector: {} affinity: nodeAffinity: diff --git a/helm-charts-k8s/values.yaml b/helm-charts-k8s/values.yaml index 71bfd56c..6e6e0a0d 100644 --- a/helm-charts-k8s/values.yaml +++ b/helm-charts-k8s/values.yaml @@ -47,11 +47,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi # -- Node selector for AMD GPU operator controller manager deployment nodeSelector: {} # -- Deployment affinity configs for controller manager diff --git a/helm-charts-openshift/values.yaml 
b/helm-charts-openshift/values.yaml index b0b937a9..2bdb27ad 100644 --- a/helm-charts-openshift/values.yaml +++ b/helm-charts-openshift/values.yaml @@ -26,11 +26,11 @@ controllerManager: effect: "NoSchedule" resources: limits: - cpu: 500m - memory: 384Mi + cpu: 1000m + memory: 1Gi requests: - cpu: 10m - memory: 64Mi + cpu: 100m + memory: 256Mi nodeSelector: {} affinity: nodeAffinity: From a7570fcfb31e7b93fa3e3720202c9585b6025014 Mon Sep 17 00:00:00 2001 From: Farshad Ghodsian <47931571+farshadghodsian@users.noreply.github.com> Date: Wed, 9 Apr 2025 11:42:41 -0400 Subject: [PATCH 18/24] Updated ReadTheDocs conf to support copy code block button --- docs/conf.py | 16 ++++++--- docs/requirements.txt | 1 - docs/sphinx/requirements.in | 4 +-- docs/sphinx/requirements.txt | 67 ++++++++++++++++++------------------ 4 files changed, 46 insertions(+), 42 deletions(-) delete mode 100644 docs/requirements.txt diff --git a/docs/conf.py b/docs/conf.py index c2086415..8fe2aaba 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -1,21 +1,27 @@ """Configuration file for the Sphinx documentation builder.""" +import os +html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "instinct.docs.amd.com") +html_context = {} +if os.environ.get("READTHEDOCS", "") == "True": + html_context["READTHEDOCS"] = True external_projects_local_file = "projects.yaml" external_projects_remote_repository = "" external_projects = ["amd-gpu-operator"] external_projects_current_project = "amd-gpu-operator" -project = "AMD Instinct Documentation" +project = "AMD GPU Operator" version = "1.2.0" release = version -html_title = f"AMD GPU Operator {version}" +html_title = f"{project} {version}" author = "Advanced Micro Devices, Inc." -copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." +copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved." 
# Required settings html_theme = "rocm_docs_theme" html_theme_options = { - "flavor": "instinct" + "flavor": "instinct", + "link_main_doc": True, # Add any additional theme options here } extensions = ["rocm_docs"] @@ -23,4 +29,4 @@ # Table of contents external_toc_path = "./sphinx/_toc.yml" -exclude_patterns = ['.venv'] +exclude_patterns = ['.venv'] \ No newline at end of file diff --git a/docs/requirements.txt b/docs/requirements.txt deleted file mode 100644 index 78600aa6..00000000 --- a/docs/requirements.txt +++ /dev/null @@ -1 +0,0 @@ -rocm-docs-core diff --git a/docs/sphinx/requirements.in b/docs/sphinx/requirements.in index 5efe4f66..e75ed236 100644 --- a/docs/sphinx/requirements.in +++ b/docs/sphinx/requirements.in @@ -1,2 +1,2 @@ -rocm-docs-core==1.17.1 -sphinx-reredirects +rocm-docs-core==1.18.1 +sphinx-reredirects \ No newline at end of file diff --git a/docs/sphinx/requirements.txt b/docs/sphinx/requirements.txt index fc912bea..cf7a01fd 100644 --- a/docs/sphinx/requirements.txt +++ b/docs/sphinx/requirements.txt @@ -15,38 +15,39 @@ attrs==25.1.0 # jsonschema # jupyter-cache # referencing -babel==2.17.0 +babel==2.16.0 # via # pydata-sphinx-theme # sphinx -beautifulsoup4==4.13.3 +beautifulsoup4==4.12.3 # via pydata-sphinx-theme -breathe==4.36.0 +breathe==4.35.0 # via rocm-docs-core -certifi==2025.1.31 +certifi==2024.8.30 # via requests cffi==1.17.1 # via # cryptography # pynacl -charset-normalizer==3.4.1 +charset-normalizer==3.4.0 # via requests -click==8.1.8 +click==8.1.7 # via # jupyter-cache # sphinx-external-toc comm==0.2.2 # via ipykernel -cryptography==44.0.2 +cryptography==43.0.3 # via pyjwt -debugpy==1.8.13 +debugpy==1.8.12 # via ipykernel -decorator==5.2.1 +decorator==5.1.1 # via ipython -deprecated==1.2.18 +deprecated==1.2.15 # via pygithub docutils==0.21.2 # via + # breathe # myst-parser # pydata-sphinx-theme # sphinx @@ -54,13 +55,13 @@ exceptiongroup==1.2.2 # via ipython executing==2.2.0 # via stack-data -fastjsonschema==2.21.1 
+fastjsonschema==2.20.0 # via # nbformat # rocm-docs-core -gitdb==4.0.12 +gitdb==4.0.11 # via gitpython -gitpython==3.1.44 +gitpython==3.1.43 # via rocm-docs-core greenlet==3.1.1 # via sqlalchemy @@ -74,13 +75,13 @@ importlib-metadata==8.6.1 # myst-nb ipykernel==6.29.5 # via myst-nb -ipython==8.33.0 +ipython==8.31.0 # via # ipykernel # myst-nb jedi==0.19.2 # via ipython -jinja2==3.1.6 +jinja2==3.1.4 # via # myst-parser # sphinx @@ -114,9 +115,9 @@ mdit-py-plugins==0.4.2 # via myst-parser mdurl==0.1.2 # via markdown-it-py -myst-nb==1.2.0 +myst-nb==1.1.2 # via rocm-docs-core -myst-parser==4.0.1 +myst-parser==4.0.0 # via myst-nb nbclient==0.10.2 # via @@ -132,7 +133,6 @@ nest-asyncio==1.6.0 packaging==24.2 # via # ipykernel - # pydata-sphinx-theme # sphinx parso==0.8.4 # via jedi @@ -142,7 +142,7 @@ platformdirs==4.3.6 # via jupyter-core prompt-toolkit==3.0.50 # via ipython -psutil==7.0.0 +psutil==6.1.1 # via ipykernel ptyprocess==0.7.0 # via pexpect @@ -150,19 +150,19 @@ pure-eval==0.2.3 # via stack-data pycparser==2.22 # via cffi -pydata-sphinx-theme==0.15.4 +pydata-sphinx-theme==0.16.0 # via # rocm-docs-core # sphinx-book-theme -pygithub==2.6.1 +pygithub==2.5.0 # via rocm-docs-core -pygments==2.19.1 +pygments==2.18.0 # via # accessible-pygments # ipython # pydata-sphinx-theme # sphinx -pyjwt[crypto]==2.10.1 +pyjwt[crypto]==2.10.0 # via pygithub pynacl==1.5.0 # via pygithub @@ -187,15 +187,15 @@ requests==2.32.3 # via # pygithub # sphinx -rocm-docs-core==1.17.1 +rocm-docs-core==1.18.1 # via -r requirements.in -rpds-py==0.23.1 +rpds-py==0.22.3 # via # jsonschema # referencing six==1.17.0 # via python-dateutil -smmap==5.0.2 +smmap==5.0.1 # via gitdb snowballstemmer==2.2.0 # via sphinx @@ -214,7 +214,7 @@ sphinx==8.1.3 # sphinx-external-toc # sphinx-notfound-page # sphinx-reredirects -sphinx-book-theme==1.1.4 +sphinx-book-theme==1.1.3 # via rocm-docs-core sphinx-copybutton==0.5.2 # via rocm-docs-core @@ -222,7 +222,7 @@ sphinx-design==0.6.1 # via rocm-docs-core 
sphinx-external-toc==1.0.1 # via rocm-docs-core -sphinx-notfound-page==1.1.0 +sphinx-notfound-page==1.0.4 # via rocm-docs-core sphinx-reredirects==0.1.5 # via -r requirements.in @@ -238,13 +238,13 @@ sphinxcontrib-qthelp==2.0.0 # via sphinx sphinxcontrib-serializinghtml==2.0.0 # via sphinx -sqlalchemy==2.0.38 +sqlalchemy==2.0.37 # via jupyter-cache stack-data==0.6.3 # via ipython tabulate==0.9.0 # via jupyter-cache -tomli==2.2.1 +tomli==2.1.0 # via sphinx tornado==6.4.2 # via @@ -262,20 +262,19 @@ traitlets==5.14.3 # nbformat typing-extensions==4.12.2 # via - # beautifulsoup4 # ipython # myst-nb # pydata-sphinx-theme # pygithub # referencing # sqlalchemy -urllib3==2.3.0 +urllib3==2.2.3 # via # pygithub # requests wcwidth==0.2.13 # via prompt-toolkit -wrapt==1.17.2 +wrapt==1.17.0 # via deprecated zipp==3.21.0 - # via importlib-metadata + # via importlib-metadata \ No newline at end of file From 039ce94782ace78af1f7ff604ae16534a4d3909b Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 9 Apr 2025 11:01:52 +0000 Subject: [PATCH 19/24] Evict pods consuming partition resource types --- internal/controllers/upgrademgr.go | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index 25dd5d6b..033734be 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -64,6 +64,25 @@ const ( defaultSAName = "amd-gpu-operator-utils-container" ) +var ( + computePartitionTypes = []string{"spx", "cpx", "dpx", "qpx", "tpx"} + memoryPartitionTypes = []string{"nps1", "nps4"} + validResources = buildValidResources() +) + +func buildValidResources() map[string]struct{} { + resources := map[string]struct{}{ + "amd.com/gpu": {}, + } + for _, compute := range computePartitionTypes { + for _, memory := range memoryPartitionTypes { + resourceName := fmt.Sprintf("amd.com/%s_%s", compute, memory) + resources[resourceName] = struct{}{} + } + } + return resources 
+} + type upgradeMgr struct { helper upgradeMgrHelperAPI } @@ -663,9 +682,11 @@ func (h *upgradeMgrHelper) getPodsToDrainOrDelete(ctx context.Context, deviceCon continue } for _, container := range pod.Spec.Containers { - if _, ok := container.Resources.Requests["amd.com/gpu"]; ok { - newPods = append(newPods, pod) - break + for resourceName := range container.Resources.Requests { + if _, ok := validResources[string(resourceName)]; ok { + newPods = append(newPods, pod) + break + } } } } From a636a9d669d21ef831fc74a267be5580142c63cf Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Thu, 10 Apr 2025 00:31:47 +0000 Subject: [PATCH 20/24] [DOC] Add note that updating driver image repo is not supported --- api/v1alpha1/deviceconfig_types.go | 1 + .../manifests/amd-gpu-operator.clusterserviceversion.yaml | 8 +++++--- bundle/manifests/amd.com_deviceconfigs.yaml | 1 + config/crd/bases/amd.com_deviceconfigs.yaml | 1 + .../bases/amd-gpu-operator.clusterserviceversion.yaml | 6 ++++-- docs/drivers/installation.md | 4 +++- docs/fulldeviceconfig.rst | 4 +++- helm-charts-k8s/Chart.lock | 2 +- helm-charts-k8s/crds/deviceconfig-crd.yaml | 1 + helm-charts-openshift/Chart.lock | 2 +- helm-charts-openshift/crds/deviceconfig-crd.yaml | 1 + 11 files changed, 22 insertions(+), 9 deletions(-) diff --git a/api/v1alpha1/deviceconfig_types.go b/api/v1alpha1/deviceconfig_types.go index 4a5d0597..b4b7ba04 100644 --- a/api/v1alpha1/deviceconfig_types.go +++ b/api/v1alpha1/deviceconfig_types.go @@ -117,6 +117,7 @@ type DriverSpec struct { // for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod // image tag will be in the format of --- // example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + // NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository //+operator-sdk:csv:customresourcedefinitions:type=spec,displayName="Image",xDescriptors={"urn:alm:descriptor:com.amd.deviceconfigs:image"} // +optional // +kubebuilder:validation:Pattern=`^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$` diff --git a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml index c73d0351..09886fe0 100644 --- a/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml +++ b/bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml @@ -32,7 +32,7 @@ metadata: capabilities: Seamless Upgrades categories: AI/Machine Learning,Monitoring containerImage: docker.io/rocm/gpu-operator:v1.2.0 - createdAt: "2025-04-07T07:07:00Z" + createdAt: "2025-04-10T00:25:51Z" description: |- Operator responsible for deploying AMD GPU kernel drivers, device plugin, device test runner and device metrics exporter For more information, visit [documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) @@ -262,13 +262,15 @@ spec: path: driver.enable x-descriptors: - urn:alm:descriptor:com.amd.deviceconfigs:enable - - description: defines image that includes drivers and firmware blobs, don't + - description: 'defines image that includes drivers and firmware blobs, don''t include tag since it will be fully managed by operator for vanilla k8s the default value is image-registry:5000/$MOD_NAMESPACE/amdgpu_kmod for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 - and ubuntu-22.04-5.15.0-94-generic-6.1.3 + and ubuntu-22.04-5.15.0-94-generic-6.1.3 NOTE: Updating the driver image + repository is not supported. 
Please delete the existing DeviceConfig and + create a new one with the updated image repository' displayName: Image path: driver.image x-descriptors: diff --git a/bundle/manifests/amd.com_deviceconfigs.yaml b/bundle/manifests/amd.com_deviceconfigs.yaml index 606fa1d8..8a439b8d 100644 --- a/bundle/manifests/amd.com_deviceconfigs.yaml +++ b/bundle/manifests/amd.com_deviceconfigs.yaml @@ -360,6 +360,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: diff --git a/config/crd/bases/amd.com_deviceconfigs.yaml b/config/crd/bases/amd.com_deviceconfigs.yaml index 64427582..dfd71b78 100644 --- a/config/crd/bases/amd.com_deviceconfigs.yaml +++ b/config/crd/bases/amd.com_deviceconfigs.yaml @@ -356,6 +356,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: diff --git a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml index 878483bd..c49d9c30 100644 --- a/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml +++ b/config/manifests/bases/amd-gpu-operator.clusterserviceversion.yaml @@ -233,13 +233,15 @@ spec: path: driver.enable x-descriptors: - urn:alm:descriptor:com.amd.deviceconfigs:enable - - description: defines image that includes drivers and firmware blobs, don't + - description: 'defines image that includes drivers and firmware blobs, don''t include tag since it will be fully managed by operator for vanilla k8s the default value is image-registry:5000/$MOD_NAMESPACE/amdgpu_kmod for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 - and ubuntu-22.04-5.15.0-94-generic-6.1.3 + and ubuntu-22.04-5.15.0-94-generic-6.1.3 NOTE: Updating the driver image + repository is not supported. 
Please delete the existing DeviceConfig and + create a new one with the updated image repository' displayName: Image path: driver.image x-descriptors: diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index ed1d9041..9825e546 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -100,7 +100,9 @@ spec: # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olmhtml#create-blacklist-for-installing-out-of-tree-kernel-module blacklist: true # Specify your repository to host driver image - # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # Note: + # 1. DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # 2. Updating the driver image repository is not supported. Please delete the existing DeviceConfig and create a new one with the updated image repository image: docker.io/username/repo # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access # you can create the docker-registry type secret by running command like: diff --git a/docs/fulldeviceconfig.rst b/docs/fulldeviceconfig.rst index 00d52de9..9f7b8441 100644 --- a/docs/fulldeviceconfig.rst +++ b/docs/fulldeviceconfig.rst @@ -44,7 +44,9 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module blacklist: false # Specify your repository to host driver image - # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # Note: + # 1. DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you + # 2. Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository image: docker.io/username/repo # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access # you can create the docker-registry type secret by running command like: diff --git a/helm-charts-k8s/Chart.lock b/helm-charts-k8s/Chart.lock index dd529b75..95811e74 100644 --- a/helm-charts-k8s/Chart.lock +++ b/helm-charts-k8s/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:f9a315dd2ce3d515ebf28c8e9a6a82158b493ca2686439ec381487761261b597 -generated: "2025-04-07T07:06:50.661624221Z" +generated: "2025-04-10T00:25:36.698574082Z" diff --git a/helm-charts-k8s/crds/deviceconfig-crd.yaml b/helm-charts-k8s/crds/deviceconfig-crd.yaml index ff9c1c79..24669303 100644 --- a/helm-charts-k8s/crds/deviceconfig-crd.yaml +++ b/helm-charts-k8s/crds/deviceconfig-crd.yaml @@ -364,6 +364,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: diff --git a/helm-charts-openshift/Chart.lock b/helm-charts-openshift/Chart.lock index d4a86324..ea8bd255 100644 --- a/helm-charts-openshift/Chart.lock +++ b/helm-charts-openshift/Chart.lock @@ -6,4 +6,4 @@ dependencies: repository: file://./charts/kmm version: v1.0.0 digest: sha256:25200c34a5cc846a1275e5bf3fc637b19e909dc68de938189c5278d77d03f5ac -generated: "2025-04-07T07:06:59.305455465Z" +generated: "2025-04-10T00:25:48.698223085Z" diff --git a/helm-charts-openshift/crds/deviceconfig-crd.yaml b/helm-charts-openshift/crds/deviceconfig-crd.yaml index ff9c1c79..24669303 100644 --- a/helm-charts-openshift/crds/deviceconfig-crd.yaml +++ b/helm-charts-openshift/crds/deviceconfig-crd.yaml @@ -364,6 +364,7 @@ spec: for OpenShift the default value is image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod image tag will be in the format of --- example tag is coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2 and ubuntu-22.04-5.15.0-94-generic-6.1.3 + NOTE: Updating the driver image repository is not supported. 
Please delete the existing DeviceConfig and create a new one with the updated image repository pattern: ^([a-z0-9]+(?:[._-][a-z0-9]+)*(:[0-9]+)?)(/[$a-zA-Z0-9_]+(?:[._-][$a-zA-Z0-9_]+)*)*(?::[a-z0-9._-]+)?(?:@[a-zA-Z0-9]+:[a-f0-9]+)?$ type: string imageRegistrySecret: From 50eac0758c6de8d59671034d21c635d992dcab85 Mon Sep 17 00:00:00 2001 From: vm Date: Fri, 11 Apr 2025 03:24:23 +0000 Subject: [PATCH 21/24] Handle auto driver upgrade on OpenShift when KMM self-delete the NMC --- internal/controllers/mock_upgrademgr.go | 14 +++++++++++ internal/controllers/upgrademgr.go | 32 +++++++++++++++++++++++++ 2 files changed, 46 insertions(+) diff --git a/internal/controllers/mock_upgrademgr.go b/internal/controllers/mock_upgrademgr.go index 748a33d6..33e8332e 100644 --- a/internal/controllers/mock_upgrademgr.go +++ b/internal/controllers/mock_upgrademgr.go @@ -394,6 +394,20 @@ func (mr *MockupgradeMgrHelperAPIMockRecorder) isNodeNew(ctx, node, deviceConfig return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isNodeNew", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).isNodeNew), ctx, node, deviceConfig) } +// isNodeNmcStatusMissing mocks base method. +func (m *MockupgradeMgrHelperAPI) isNodeNmcStatusMissing(ctx context.Context, node *v1.Node, deviceConfig *v1alpha1.DeviceConfig) bool { + m.ctrl.T.Helper() + ret := m.ctrl.Call(m, "isNodeNmcStatusMissing", ctx, node, deviceConfig) + ret0, _ := ret[0].(bool) + return ret0 +} + +// isNodeNmcStatusMissing indicates an expected call of isNodeNmcStatusMissing. +func (mr *MockupgradeMgrHelperAPIMockRecorder) isNodeNmcStatusMissing(ctx, node, deviceConfig any) *gomock.Call { + mr.mock.ctrl.T.Helper() + return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isNodeNmcStatusMissing", reflect.TypeOf((*MockupgradeMgrHelperAPI)(nil).isNodeNmcStatusMissing), ctx, node, deviceConfig) +} + // isNodeReady mocks base method. 
func (m *MockupgradeMgrHelperAPI) isNodeReady(ctx context.Context, node *v1.Node, deviceConfig *v1alpha1.DeviceConfig) bool { m.ctrl.T.Helper() diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index 033734be..8a25f5db 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -187,6 +187,12 @@ func (n *upgradeMgr) HandleUpgrade(ctx context.Context, deviceConfig *amdv1alpha continue } + // Untaint to let upgrade continue in case of KMM bug after node reboot + if n.helper.isNodeNmcStatusMissing(ctx, &nodeList.Items[i], deviceConfig) { + upgradeInProgress++ + continue + } + // 3. Handle Started Nodes if n.helper.isNodeStateUpgradeStarted(&nodeList.Items[i]) { upgradeInProgress++ @@ -292,6 +298,7 @@ type upgradeMgrHelperAPI interface { // Handle node state transitions isNodeReady(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool + isNodeNmcStatusMissing(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool isNodeNew(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool isNodeStateUpgradeStarted(node *v1.Node) bool isNodeStateInstallInProgress(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool @@ -405,6 +412,31 @@ func (h *upgradeMgrHelper) isNodeNew(ctx context.Context, node *v1.Node, deviceC return false } +// Handle Driver installation for nodes with nmc status missing +func (h *upgradeMgrHelper) isNodeNmcStatusMissing(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool { + + if nodeStatus, ok := deviceConfig.Status.NodeModuleStatus[node.Name]; ok { + currentState := h.getNodeStatus(node.Name) + // during the automatic upgrade, if node reboot was triggered, KMM could possibly remove the NMC status, making the ContainerImage empty + // 
https://github.com/rh-ecosystem-edge/kernel-module-management/blob/b57037ec1b8ceef9961ca1baeb9529121c6df398/internal/controllers/nmc_reconciler.go#L414-L419 + // at this moment the node status would be UpgradeStateInProgress with empty ContainerImage + // we still need to proceed with this status + if nodeStatus.ContainerImage == "" && currentState == amdv1alpha1.UpgradeStateInProgress { + + // Uncordon the node + if err := h.cordonOrUncordonNode(ctx, deviceConfig, node, false); err != nil { + // Move to failure state if uncordon fails + h.setNodeStatus(ctx, node.Name, amdv1alpha1.UpgradeStateUncordonFailed) + return false + } + + return true + } + } + + return false +} + // Handle Driver installation for ready nodes. func (h *upgradeMgrHelper) isNodeReady(ctx context.Context, node *v1.Node, deviceConfig *amdv1alpha1.DeviceConfig) bool { From 0efcd891c187a43037fec8e032ce987c9e2ebcd7 Mon Sep 17 00:00:00 2001 From: vm Date: Wed, 9 Apr 2025 03:23:39 +0000 Subject: [PATCH 22/24] MaxParallel constraint with MaxUnavailable --- internal/controllers/upgrademgr.go | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/internal/controllers/upgrademgr.go b/internal/controllers/upgrademgr.go index 8a25f5db..2407dbe2 100644 --- a/internal/controllers/upgrademgr.go +++ b/internal/controllers/upgrademgr.go @@ -548,7 +548,23 @@ func (h *upgradeMgrHelper) isUpgradePolicyViolated(upgradeInProgress int, upgrad return maxParallelUpdates, true } - return maxParallelUpdates, (upgradeInProgress >= maxParallelUpdates) || (upgradeFailedState >= maxUnavailableNodes) + // Remaining space for unavailable nodes + remainingUnavailable := maxUnavailableNodes - upgradeFailedState + + var maxParallelAllowed int + if maxParallelUpdates == 0 { + // "0 means Unlimited parallel" — so allow up to remaining unavailable + maxParallelAllowed = remainingUnavailable + } else { + // Take into consideration minimum between configured value and remaining unavailable + 
maxParallelAllowed = min(maxParallelUpdates, remainingUnavailable) + } + + if maxParallelAllowed == 0 || upgradeInProgress >= maxParallelAllowed { + return maxParallelAllowed, true + } + + return maxParallelAllowed, false } From a4c4cc99858f008d4a118f7b7600b803e9f2335e Mon Sep 17 00:00:00 2001 From: vm Date: Fri, 11 Apr 2025 05:26:50 +0000 Subject: [PATCH 23/24] Release note doc --- docs/knownlimitations.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md index c84430e8..a21fc6d3 100644 --- a/docs/knownlimitations.md +++ b/docs/knownlimitations.md @@ -92,6 +92,13 @@ - **Recommendation:** Please set maxParallel Upgrades to a number less than total number of worker nodes

+14. **Driver Install/Upgrade Issue if a node running the KMM build pod gets rebooted accidentally when rebootRequired is set to false**
+
+    - **Impact:** Unable to perform driver install/upgrade
+    - **Affected Configurations:** All configurations
+    - **Recommendation:** Please retrigger the driver install/upgrade and do not reboot nodes manually when rebootRequired is false
+

+ ## Fixed Issues 1. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module (Fixed in v1.2.0)** From f2acfb1d00b1e367c80f006edf3b26512d842a7c Mon Sep 17 00:00:00 2001 From: vm Date: Thu, 17 Apr 2025 09:35:31 +0000 Subject: [PATCH 24/24] Node labeller flags for partition related labels[DO NOT MERGE] --- internal/nodelabeller/nodelabeller.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/internal/nodelabeller/nodelabeller.go b/internal/nodelabeller/nodelabeller.go index 81293fd9..e745f6c3 100644 --- a/internal/nodelabeller/nodelabeller.go +++ b/internal/nodelabeller/nodelabeller.go @@ -175,7 +175,7 @@ func (nl *nodeLabeller) SetNodeLabellerAsDesired(ds *appsv1.DaemonSet, devConfig InitContainers: initContainers, Containers: []v1.Container{ { - Args: []string{"-c", "./k8s-node-labeller -vram -cu-count -simd-count -device-id -family -product-name -driver-version"}, + Args: []string{"-c", "./k8s-node-labeller -vram -cu-count -simd-count -device-id -family -product-name -driver-version -compute-memory-partition -compute-partitioning-supported -memory-partitioning-supported"}, Command: []string{"sh"}, Env: []v1.EnvVar{ {
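The `isUpgradePolicyViolated` change in patch 22 is easier to follow outside the diff. The sketch below restates the intended policy: the parallel-upgrade allowance is capped both by the configured `maxParallel` (where 0 means unlimited) and by whatever unavailable budget remains after subtracting already-failed nodes. The function and parameter names are illustrative only, not the operator's actual API.

```go
package main

import "fmt"

// isPolicyViolated returns the effective parallel-upgrade allowance and
// whether starting another upgrade would violate the policy.
// maxParallel == 0 means "unlimited", bounded only by the remaining
// unavailable budget (maxUnavailable - failed).
func isPolicyViolated(inProgress, failed, maxParallel, maxUnavailable int) (int, bool) {
	remaining := maxUnavailable - failed
	if remaining < 0 {
		remaining = 0
	}
	// Effective allowance is min(maxParallel, remaining), with 0 treated
	// as unlimited so only the unavailable budget applies.
	allowed := remaining
	if maxParallel != 0 && maxParallel < allowed {
		allowed = maxParallel
	}
	// Violated when no capacity remains or the in-flight count has
	// already reached the allowance.
	if allowed == 0 || inProgress >= allowed {
		return allowed, true
	}
	return allowed, false
}

func main() {
	a, v := isPolicyViolated(1, 0, 2, 3) // one in flight, room for one more
	fmt.Println(a, v)                    // 2 false
	a, v = isPolicyViolated(1, 1, 0, 2) // unlimited parallel, budget spent
	fmt.Println(a, v)                    // 1 true
}
```

Note the design choice this encodes: a failed node consumes unavailable budget permanently until it recovers, so even with `maxParallel: 0` the upgrade stalls once failures reach `maxUnavailable`, matching the "Remaining space for unavailable nodes" comment in the patch.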