@djoshy djoshy commented Nov 19, 2025

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults, and the cluster version.
  • The boot image controller now updates the current boot image value in bootImageSkewEnforcementStatus after a successful boot image update. Note that this requires skew enforcement to be in Automatic mode and all machinesets to be opted in for boot image updates.
  • The operator sets Upgradeable=False if the cluster is detected to be out of skew. This is done by comparing the boot image values in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and has them on by default (AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All when spec.managedBootImages is empty, then the skew enforcement status will be set to Automatic, with a boot image version estimated from the cluster version. The boot image controller will then perform a sync that updates the boot images (if required) and, after all resources have been successfully updated, updates the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
  status:
    bootImageSkewEnforcementStatus:
      automatic:
        ocpVersion: 4.21.0
      mode: Automatic
    conditions:
    - lastTransitionTime: "2025-11-19T22:06:06Z"
      message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
        | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateProgressing
    - lastTransitionTime: "2025-11-19T22:06:07Z"
      message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
        0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateDegraded
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: All
  • supports boot image updates but does not have them on by default (vSphere and Azure at the time of writing), i.e. status.managedBootImagesStatus is set to None when spec.managedBootImages is empty, then the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would look like this:
  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
  status:
    bootImageSkewEnforcementStatus:
      manual:
        mode: OCPVersion
        ocpVersion: 4.21.0
      mode: Manual
    conditions:
    - lastTransitionTime: "2025-11-19T22:06:06Z"
      message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
        | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateProgressing
    - lastTransitionTime: "2025-11-19T22:06:07Z"
      message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
        0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateDegraded
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: None

The admin can choose to opt in to boot image updates in this case (by setting spec.managedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. The object would then look like this:

  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
    managedBootImages:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: All
  status:
    bootImageSkewEnforcementStatus:
      automatic:
        ocpVersion: 4.21.0
      mode: Automatic
    conditions:
    - lastTransitionTime: "2025-11-19T22:06:06Z"
      message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
        | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateProgressing
    - lastTransitionTime: "2025-11-19T22:06:07Z"
      message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
        0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateDegraded
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: All
  • does not support boot image updates (all other platforms at the time of writing), i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin, then the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would look like this:
  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
  status:
    bootImageSkewEnforcementStatus:
      manual:
        mode: OCPVersion
        ocpVersion: 4.21.0
      mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      mode: OCPVersion
      ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      mode: OCPVersion
      ocpVersion: 4.21.2
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      mode: OCPVersion
      ocpVersion: 4.21.2

The above snippet applies if the admin chose to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion instead, like so:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      mode: RHCOSVersion
      rhcosVersion: 9.0.20251023-0
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      mode: RHCOSVersion
      rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.
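
As a rough illustration of the one-of rule above, here is a minimal Go sketch. The ManualSkew type and validateManual function are hypothetical names made up for this example; they are not the openshift/api types, and the real enforcement lives in the API validations.

```go
package main

import (
	"errors"
	"fmt"
)

// ManualSkew is a hypothetical stand-in for the "manual" stanza above,
// not the real openshift/api type.
type ManualSkew struct {
	Mode         string // "OCPVersion" or "RHCOSVersion"
	OCPVersion   string
	RHCOSVersion string
}

// validateManual accepts a stanza only when the version field named by
// Mode is set and the other version field is empty.
func validateManual(m ManualSkew) error {
	switch m.Mode {
	case "OCPVersion":
		if m.OCPVersion == "" {
			return errors.New("ocpVersion is required when mode is OCPVersion")
		}
		if m.RHCOSVersion != "" {
			return errors.New("rhcosVersion is not permitted when mode is OCPVersion")
		}
	case "RHCOSVersion":
		if m.RHCOSVersion == "" {
			return errors.New("rhcosVersion is required when mode is RHCOSVersion")
		}
		if m.OCPVersion != "" {
			return errors.New("ocpVersion is not permitted when mode is RHCOSVersion")
		}
	default:
		return fmt.Errorf("unsupported manual mode %q", m.Mode)
	}
	return nil
}

func main() {
	ok := ManualSkew{Mode: "RHCOSVersion", RHCOSVersion: "9.2.20251023-0"}
	both := ManualSkew{Mode: "OCPVersion", OCPVersion: "4.21.2", RHCOSVersion: "9.2.20251023-0"}
	fmt.Println("single version:", validateManual(ok))
	fmt.Println("both versions:", validateManual(both))
}
```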

The admin can also disable skew enforcement altogether by setting the mode to None in the spec:

spec:
  bootImageSkewEnforcement:
    mode: None
status:
  bootImageSkewEnforcementStatus:
    mode: None
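
To summarize the cases in this section, here is a minimal Go sketch of how the status mode could be derived. SkewSpec, SkewStatus and resolveStatus are hypothetical, flattened stand-ins invented for this example and do not match the real API types or the operator code. It also assumes that an explicit Manual or None in the spec always takes precedence over the platform default.

```go
package main

import "fmt"

// SkewSpec is a hypothetical, flattened view of spec.bootImageSkewEnforcement.
// An empty Mode means the admin has not set the field.
type SkewSpec struct {
	Mode        string // "", "Manual", or "None"
	VersionKind string // "OCPVersion" or "RHCOSVersion"
	Version     string
}

// SkewStatus is a hypothetical, flattened view of bootImageSkewEnforcementStatus.
type SkewStatus struct {
	Mode        string // "Automatic", "Manual", or "None"
	VersionKind string // empty in None mode
	Version     string
}

// resolveStatus mirrors the cases described above. managedMode is the
// effective machineset selection mode from status.managedBootImagesStatus:
// "All", "None", or "" on platforms without boot image update support.
func resolveStatus(spec SkewSpec, managedMode, clusterVersion string) SkewStatus {
	switch spec.Mode {
	case "None":
		// Skew enforcement is disabled by the admin.
		return SkewStatus{Mode: "None"}
	case "Manual":
		// The admin-provided value is copied to the status (assumed precedence).
		return SkewStatus{Mode: "Manual", VersionKind: spec.VersionKind, Version: spec.Version}
	}
	if managedMode == "All" {
		// Boot image updates are on: start from an estimate; the boot image
		// controller refines it after a successful update.
		return SkewStatus{Mode: "Automatic", VersionKind: "OCPVersion", Version: clusterVersion}
	}
	// Updates are off or unsupported: fall back to Manual with an estimate.
	return SkewStatus{Mode: "Manual", VersionKind: "OCPVersion", Version: clusterVersion}
}

func main() {
	fmt.Printf("%+v\n", resolveStatus(SkewSpec{}, "All", "4.21.0"))
	fmt.Printf("%+v\n", resolveStatus(SkewSpec{}, "None", "4.21.0"))
	fmt.Printf("%+v\n", resolveStatus(SkewSpec{Mode: "None"}, "All", "4.21.0"))
}
```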

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current thresholds for a skew violation are set to the point when OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator performs semver comparisons of the boot image versions stored in bootImageSkewEnforcementStatus against these thresholds and sets Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out-of-skew boot image version like so:

  spec:
    bootImageSkewEnforcement:
      manual:
        mode: RHCOSVersion
        rhcosVersion: 9.0.20251023-0
      mode: Manual

Now, examine the conditions field of the machine-config ClusterOperator (CO) object; it should indicate an issue preventing upgrades, like so:

$ oc get co machine-config -o yaml
...
  - lastTransitionTime: "2025-11-20T15:15:12Z"
    message: 'Upgrades have been disabled because the cluster is using RHCOS boot
      image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
      required RHEL version 9.2. To enable upgrades, please update your boot images
      following the documentation at [TODO: insert link], or disable boot image skew
      enforcement at [TODO: insert link]'
    reason: ClusterBootImageSkewError
    status: "False"
    type: Upgradeable

Next, set the boot image to one within the skew limits:

  spec:
    bootImageSkewEnforcement:
      manual:
        mode: RHCOSVersion
        rhcosVersion: 9.2.20251023-0
      mode: Manual

Then, the Upgradeable condition should be restored to True:

  - lastTransitionTime: "2025-11-20T15:19:25Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. The comparison should take place only in Automatic and Manual mode; however, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.
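
For reference, the threshold check described in this section can be sketched in Go roughly as follows. This is a simplified stand-in, not the operator's code: it compares the leading numeric components directly instead of using a semver library, and leadingNumbers, below and blocksUpgrade are hypothetical names. The 9.2 and 4.13.0 thresholds are the ones quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// leadingNumbers returns the first n dot-separated numeric components of v,
// ignoring anything after the first "-" (e.g. "9.0.20251023-0" -> 9, 0).
func leadingNumbers(v string, n int) ([]int, error) {
	core, _, _ := strings.Cut(v, "-")
	parts := strings.Split(core, ".")
	if len(parts) < n {
		return nil, fmt.Errorf("version %q has fewer than %d components", v, n)
	}
	out := make([]int, n)
	for i := 0; i < n; i++ {
		x, err := strconv.Atoi(parts[i])
		if err != nil {
			return nil, fmt.Errorf("version %q: %w", v, err)
		}
		out[i] = x
	}
	return out, nil
}

// below reports whether got is strictly lower than min, component by component.
func below(got, min []int) bool {
	for i := range min {
		if got[i] != min[i] {
			return got[i] < min[i]
		}
	}
	return false
}

// blocksUpgrade reports whether the recorded boot image version is out of
// skew. In None mode the check is skipped entirely.
func blocksUpgrade(mode, kind, version string) (bool, error) {
	if mode == "None" {
		return false, nil
	}
	switch kind {
	case "RHCOSVersion":
		got, err := leadingNumbers(version, 2) // RHEL major.minor
		if err != nil {
			return false, err
		}
		return below(got, []int{9, 2}), nil
	case "OCPVersion":
		got, err := leadingNumbers(version, 3)
		if err != nil {
			return false, err
		}
		return below(got, []int{4, 13, 0}), nil
	}
	return false, fmt.Errorf("unknown version kind %q", kind)
}

func main() {
	for _, v := range []string{"9.0.20251023-0", "9.2.20251023-0"} {
		blocked, err := blocksUpgrade("Manual", "RHCOSVersion", v)
		fmt.Println(v, "blocks upgrade:", blocked, "error:", err)
	}
}
```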

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice, because only the MCO can reliably determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is expected to first switch to Manual skew enforcement mode and then change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, because a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image (for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
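
Caveat 3 boils down to an all-or-nothing gate, which could be sketched in Go like this. MachineSetResult and shouldRecordBootImage are hypothetical names for illustration only, and treating an empty result list as "nothing out of date" is an assumption of this sketch rather than something taken from the controller.

```go
package main

import "fmt"

// MachineSetResult is a hypothetical summary of one machineset after a
// boot image sync; it is not a type from the boot image controller.
type MachineSetResult struct {
	Name    string
	Updated bool // reconciled onto the current boot image
	Skipped bool // e.g. marketplace or unknown boot image detected
}

// shouldRecordBootImage reports whether the controller may write the new
// boot image version into the skew enforcement status. A single skipped or
// unreconciled machineset is enough to hold the stored value back.
func shouldRecordBootImage(mode string, results []MachineSetResult) bool {
	if mode != "Automatic" {
		return false
	}
	for _, r := range results {
		if r.Skipped || !r.Updated {
			return false
		}
	}
	// Assumption: with no machinesets at all, nothing is out of date.
	return true
}

func main() {
	results := []MachineSetResult{
		{Name: "worker-a", Updated: true},
		{Name: "worker-marketplace", Skipped: true},
	}
	fmt.Println("record new version:", shouldRecordBootImage("Automatic", results))
}
```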

@openshift-ci-robot added the jira/valid-reference label Nov 19, 2025

openshift-ci-robot commented Nov 19, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.


@openshift-ci bot added the do-not-merge/work-in-progress label Nov 19, 2025

openshift-ci bot commented Nov 19, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Nov 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy


@openshift-ci bot added the approved label Nov 19, 2025

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.



openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.


@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires skew enforcement to be set to Automatic mode, and all machinesets to be opted in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • I've also added a few unit tests to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and has them on by default (AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty: the skew enforcement status will be set to Automatic, with a boot image version estimated from the cluster version. The boot image controller will then perform a sync which updates the boot images (if required); after all resources have been successfully updated, it updates the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but they are not on by default (vSphere and Azure at the time of writing), i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt in to boot image updates in this case (set spec.managedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. The object would then finally look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates (all other platforms at the time of writing), i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet applies if the admin has chosen to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

The admin can also choose to disable skew enforcement altogether by setting the mode to None in the spec:

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This piece works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current threshold for a skew violation is set to when OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify, set the mode to Manual with an out-of-skew boot image version like so:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the conditions field of the ClusterOperator (CO) object named machine-config; it should indicate an issue preventing upgrades, like so:

 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored to True:

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual mode; however, as Automatic is only generated on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice, because only the MCO can always determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is expected to first switch to Manual skew enforcement mode and then change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image (for example, across marketplace streams) in a given release, and it would involve a lot of per-platform plumbing to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires skew enforcement to be set to Automatic mode, and all machinesets to be opted in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and has them on by default (AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty: the skew enforcement status will be set to Automatic, with a boot image version estimated from the cluster version. The boot image controller will then perform a sync which updates the boot images (if required); after all resources have been successfully updated, it updates the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but they are not on by default (vSphere and Azure at the time of writing), i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt in to boot image updates in this case (set spec.managedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. The object would then finally look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates (all other platforms at the time of writing), i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet applies if the admin has chosen to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.

The admin can also choose to disable skew enforcement altogether by setting the mode to None in the spec:

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None
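
The platform-dependent defaults described in this section can be summarised with a small Go sketch. This is only an illustration of the behavior as described here, not the operator's actual code: the types and the deriveStatus function are hypothetical, the real API types live in openshift/api, and the sketch assumes that an explicit spec (Manual or None) always takes precedence over the platform default.

```go
package main

import "fmt"

// manualConfig is a hypothetical stand-in for the manual sub-field.
type manualConfig struct {
	Mode         string // "OCPVersion" or "RHCOSVersion"
	OCPVersion   string
	RHCOSVersion string
}

// skewSpec is a hypothetical stand-in for spec.bootImageSkewEnforcement.
// An empty Mode means the admin has not set the field.
type skewSpec struct {
	Mode   string // "", "Manual" or "None"
	Manual *manualConfig
}

// skewStatus is a hypothetical stand-in for bootImageSkewEnforcementStatus.
type skewStatus struct {
	Mode         string // "Automatic", "Manual" or "None"
	AutomaticOCP string // only set in Automatic mode
	Manual       *manualConfig
}

// deriveStatus mirrors the cases walked through above.
// allManaged is true when status.managedBootImagesStatus resolves to All,
// either by platform default or because the admin opted in.
func deriveStatus(spec skewSpec, allManaged bool, clusterVersion string) skewStatus {
	switch spec.Mode {
	case "None":
		return skewStatus{Mode: "None"}
	case "Manual":
		// Assumption: the status simply mirrors what the admin recorded.
		return skewStatus{Mode: "Manual", Manual: spec.Manual}
	}
	if allManaged {
		return skewStatus{Mode: "Automatic", AutomaticOCP: clusterVersion}
	}
	// No spec and boot image updates not enabled for all machinesets:
	// fall back to Manual with a version estimated from the cluster version.
	return skewStatus{
		Mode:   "Manual",
		Manual: &manualConfig{Mode: "OCPVersion", OCPVersion: clusterVersion},
	}
}

func main() {
	onByDefault := deriveStatus(skewSpec{}, true, "4.21.0")
	offByDefault := deriveStatus(skewSpec{}, false, "4.21.0")
	fmt.Println(onByDefault.Mode, onByDefault.AutomaticOCP)
	fmt.Println(offByDefault.Mode, offByDefault.Manual.OCPVersion)
}
```
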

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current threshold for a skew violation is set to when OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out-of-skew boot image version like so:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the conditions field of the ClusterOperator (CO) object named machine-config; it should indicate an issue preventing upgrades, like so:

 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored to True:

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual mode; however, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.
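
As a rough illustration of the upgrade block check, here is a minimal Go sketch. It is not the MCO's implementation: the function names (parseMajorMinor, atLeast, upgradeable) are hypothetical, the thresholds are simply the values quoted above (RHEL 9.2 and OCP 4.13.0), and the RHEL version is assumed to be the leading major.minor of the RHCOS version string, as the condition message above suggests. For brevity it compares only major.minor rather than doing a full semver comparison.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Thresholds quoted in the description; treated here as plain constants.
const (
	minRHELVersion = "9.2"
	minOCPVersion  = "4.13.0"
)

// parseMajorMinor extracts the leading "major.minor" pair from a version
// string such as "9.0.20251023-0" or "4.21.0".
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q has no minor component", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid major in %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid minor in %q: %w", v, err)
	}
	return major, minor, nil
}

// atLeast reports whether v >= floor, comparing major.minor only
// (a simplification of a full semver comparison).
func atLeast(v, floor string) (bool, error) {
	vMaj, vMin, err := parseMajorMinor(v)
	if err != nil {
		return false, err
	}
	fMaj, fMin, err := parseMajorMinor(floor)
	if err != nil {
		return false, err
	}
	if vMaj != fMaj {
		return vMaj > fMaj, nil
	}
	return vMin >= fMin, nil
}

// upgradeable returns false when the recorded boot image is out of skew.
// In None mode the check is skipped entirely.
func upgradeable(mode, ocpVersion, rhcosVersion string) (bool, error) {
	switch mode {
	case "None":
		return true, nil
	case "Automatic", "Manual":
		if rhcosVersion != "" {
			return atLeast(rhcosVersion, minRHELVersion)
		}
		return atLeast(ocpVersion, minOCPVersion)
	default:
		return false, fmt.Errorf("unknown skew enforcement mode %q", mode)
	}
}

func main() {
	for _, v := range []string{"9.0.20251023-0", "9.2.20251023-0"} {
		ok, err := upgradeable("Manual", "", v)
		fmt.Println(v, ok, err)
	}
}
```

The two versions used in main are the same ones used in the manual verification steps above, so the first is expected to be reported as out of skew and the second as within limits.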

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice, because only the MCO can always determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is expected to first switch to Manual skew enforcement mode and then change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image (for example, across marketplace streams) in a given release, and it would involve a lot of per-platform plumbing to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
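
The third caveat can be illustrated with a short Go sketch. The machineSetResult type and nextRecordedVersion function are hypothetical names used only for this example, and the logic is an assumption based on the description above rather than the boot image controller's real code: the recorded version only advances when the mode is Automatic and no machineset was skipped.

```go
package main

import "fmt"

// machineSetResult is a hypothetical summary of one machineset's sync.
type machineSetResult struct {
	Name    string
	Skipped bool   // true if the machineset was skipped for boot image updates
	Reason  string // e.g. "marketplace image" or "unknown boot image"
}

// nextRecordedVersion returns the version that should be stored in the skew
// enforcement status after a sync. The recorded value only moves to target
// when the mode is Automatic and every machineset was actually updated.
func nextRecordedVersion(mode, current, target string, results []machineSetResult) string {
	if mode != "Automatic" {
		return current
	}
	for _, r := range results {
		if r.Skipped {
			// One out-of-date machineset is enough to keep the old value.
			return current
		}
	}
	return target
}

func main() {
	results := []machineSetResult{
		{Name: "worker-a"},
		{Name: "worker-b", Skipped: true, Reason: "marketplace image"},
	}
	fmt.Println(nextRecordedVersion("Automatic", "4.20.0", "4.21.0", results))
	fmt.Println(nextRecordedVersion("Automatic", "4.20.0", "4.21.0", results[:1]))
}
```
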

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires the skew enforcement to be set to Automatic mode, and all machinesets to be opt-ed in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is to be detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and it is on by default(AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty. Then, skew enforcement status will be set to Automatic, with a boot image version estimated from cluster version. Then, the boot image controller will perform a sync which will update the boot image(if required) and after all resources have been successfully updated, it will update the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but is not on by default(vsphere and Azure at the time of writing) i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt-in for boot image updates in this case(set spec.ManagedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. This would mean the object would finally look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates(all other platforms at the time of writing) i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet is if an admin had chosen to record the OCPVersion. In manual mode, the admin can also choose to to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.

The admin can also choose to disable skew enforcement altogether by setting it None mode in spec.

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None

Verifying upgrade block

Upgrades will be blocked when the cluster is to determined out of skew. This mechanism works the same way in manual and automatic mode, although it is likely easier to verify in manual mode. The current thresholds for a skew violation is set to when OCP first moved to RHEL9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out of skew boot image version like so:

 spec:
   bootImageSkewEnforcement:
     manual:
  mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the CO object named machine-config's conditions field, it should show indicate an issue preventing upgrades like so:

 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
  mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored back to True

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual modes. However, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.
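For reviewers who want to reason about the check above without a cluster, here is a minimal sketch of the comparison. It assumes the RHEL version is the first two dot-separated components of the RHCOS version (as the "9.0.20251023-0(RHEL version: 9.0)" message suggests) and uses the 9.2 and 4.13.0 thresholds quoted above. All names (`upgradeBlocked`, `belowMinimum`, `parseNumeric`) are hypothetical and are not the functions in this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Thresholds quoted in the PR description; the constant names are hypothetical.
const (
	minRHELVersion = "9.2"
	minOCPVersion  = "4.13.0"
)

// parseNumeric returns the first `parts` dot-separated components of v as
// integers, ignoring anything after the first "-" (e.g. "9.0.20251023-0").
func parseNumeric(v string, parts int) ([]int, error) {
	v = strings.SplitN(v, "-", 2)[0]
	fields := strings.Split(v, ".")
	if len(fields) < parts {
		return nil, fmt.Errorf("version %q has fewer than %d components", v, parts)
	}
	out := make([]int, parts)
	for i := 0; i < parts; i++ {
		n, err := strconv.Atoi(fields[i])
		if err != nil {
			return nil, fmt.Errorf("version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// belowMinimum reports whether version is older than minimum, comparing only
// the first `parts` numeric components.
func belowMinimum(version, minimum string, parts int) (bool, error) {
	a, err := parseNumeric(version, parts)
	if err != nil {
		return false, err
	}
	b, err := parseNumeric(minimum, parts)
	if err != nil {
		return false, err
	}
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i], nil
		}
	}
	return false, nil
}

// upgradeBlocked mirrors the described behavior: no check in None mode,
// otherwise compare whichever version is recorded against its threshold.
func upgradeBlocked(mode, rhcosVersion, ocpVersion string) (bool, error) {
	if mode == "None" {
		return false, nil
	}
	if rhcosVersion != "" {
		return belowMinimum(rhcosVersion, minRHELVersion, 2)
	}
	if ocpVersion != "" {
		return belowMinimum(ocpVersion, minOCPVersion, 3)
	}
	return false, nil
}

func main() {
	for _, v := range []string{"9.0.20251023-0", "9.2.20251023-0"} {
		blocked, err := upgradeBlocked("Manual", v, "")
		fmt.Println(v, blocked, err)
	}
}
```

The two values in `main` are the ones used in the manual verification steps above; the first should report a block and the second should not.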

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice because only the MCO can reliably determine whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is first expected to go to Manual skew enforcement mode and then attempt to change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image(for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires the skew enforcement to be set to Automatic mode, and all machinesets to be opted in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and it is on by default(AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty. Then, skew enforcement status will be set to Automatic, with a boot image version estimated from cluster version. Then, the boot image controller will perform a sync which will update the boot image(if required) and after all resources have been successfully updated, it will update the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but they are not on by default (vSphere and Azure at the time of writing), i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt in for boot image updates in this case (set spec.managedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. This would mean the object would finally look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates(all other platforms at the time of writing) i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet applies if the admin has chosen to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.

The admin can also choose to disable skew enforcement altogether by setting the mode to None in the spec:

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None
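The platform-dependent cases walked through above can be condensed into a small decision function. This is only a sketch of the behavior as described in this PR description, not the operator's actual sync code: `SkewConfig` and `deriveSkewStatus` are hypothetical names, the cluster version is passed through unchanged as the "estimated" boot image version, and an explicit spec is assumed to always win over platform defaults.

```go
package main

import "fmt"

// SkewConfig is a simplified, hypothetical stand-in for the spec and status
// shapes shown in the YAML above; it is not the real openshift/api type.
type SkewConfig struct {
	Mode        string // "Automatic", "Manual", or "None"
	VersionKind string // "OCPVersion" or "RHCOSVersion"; empty in None mode
	Version     string
}

// deriveSkewStatus sketches how the status could be derived:
//   - an explicit spec (Manual or None) is mirrored into the status;
//   - otherwise, Automatic if boot image updates are on for all machinesets;
//   - otherwise, Manual seeded with a version taken from the cluster version.
//
// managedBootImagesMode is "All", "None", or "" on unsupported platforms.
func deriveSkewStatus(spec *SkewConfig, managedBootImagesMode, clusterVersion string) SkewConfig {
	if spec != nil {
		return *spec
	}
	if managedBootImagesMode == "All" {
		return SkewConfig{Mode: "Automatic", VersionKind: "OCPVersion", Version: clusterVersion}
	}
	return SkewConfig{Mode: "Manual", VersionKind: "OCPVersion", Version: clusterVersion}
}

func main() {
	// Default-on platform, empty spec.
	fmt.Printf("%+v\n", deriveSkewStatus(nil, "All", "4.21.0"))
	// Opt-in or unsupported platform, empty spec.
	fmt.Printf("%+v\n", deriveSkewStatus(nil, "", "4.21.0"))
	// Admin disabled enforcement in the spec.
	fmt.Printf("%+v\n", deriveSkewStatus(&SkewConfig{Mode: "None"}, "All", "4.21.0"))
}
```

The real API validations (for example, rejecting Automatic in the spec) are not modeled here.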

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current thresholds for a skew violation are set to when OCP first moved to RHEL9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out of skew boot image version like so:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the machine-config CO object's conditions field; it should indicate an issue preventing upgrades, like so:

$ oc get co machine-config -o yaml
...
 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored to True:

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual modes. However, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice because only the MCO can reliably determine whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is first expected to go to Manual skew enforcement mode and then attempt to change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image(for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
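Caveat 3 amounts to an all-or-nothing gate on the status update. A rough sketch of that gate follows, assuming a per-machineset result with `Updated` and `Skipped` flags. The `MachineSetResult` type and `shouldRecordBootImage` function are hypothetical illustrations, not the controller's real types, and treating an empty list as "nothing out of skew" is an assumption of this sketch.

```go
package main

import "fmt"

// MachineSetResult is a hypothetical per-machineset outcome of one sync.
type MachineSetResult struct {
	Name    string
	Updated bool // boot image now matches the target release
	Skipped bool // e.g. marketplace or unknown boot image detected
}

// shouldRecordBootImage reports whether the boot image value in the skew
// enforcement status may be bumped: only in Automatic mode, and only if no
// machineset was skipped or left un-updated.
func shouldRecordBootImage(mode string, results []MachineSetResult) bool {
	if mode != "Automatic" {
		return false
	}
	for _, r := range results {
		if r.Skipped || !r.Updated {
			return false
		}
	}
	return true
}

func main() {
	results := []MachineSetResult{
		{Name: "worker-a", Updated: true},
		{Name: "worker-b", Skipped: true}, // marketplace image
	}
	fmt.Println(shouldRecordBootImage("Automatic", results))
}
```

With one skipped machineset, as in `main`, the status value stays untouched, which matches the reasoning that a single out-of-skew resource keeps the cluster from being considered up to date.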

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy djoshy force-pushed the implement-skew-enforcement branch from dc9203e to 7b578ab Compare November 20, 2025 16:25
@djoshy djoshy marked this pull request as ready for review November 20, 2025 20:55
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2025
@djoshy
Contributor Author

djoshy commented Nov 21, 2025

/retest-required

This change adds logic to populate the BootImageSkewEnforcementStatus
field in the MachineConfiguration status based on spec configuration,
platform support, and cluster version information.
This adds new unit tests for TestSyncMachineConfiguration to test the
BootImageSkewEnforcementStatus sync logic added in the previous commit.
This commit updates the machine-set-boot-image controller to track and
update the BootImageSkewEnforcementStatus when in Automatic mode.
This commit implements upgrade blocking when boot image version skew
exceeds acceptable limits, via the ClusterOperator Upgradeable
condition.
This commit adds unit tests for the new Upgradeable guards added in the
previous commit.
@djoshy djoshy force-pushed the implement-skew-enforcement branch from 7b578ab to dddd5c7 Compare November 21, 2025 16:06
@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/bootstrap-unit dddd5c7 link false /test bootstrap-unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@djoshy
Contributor Author

djoshy commented Nov 25, 2025

/retest-required

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 29, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

3 participants