@QiWang19
Member

@QiWang19 QiWang19 commented Sep 29, 2025

Adjust the test so that upgrade checks do not fail when all nodes are ready. This allows for updates that do not require a node drain. Related change: shipping a default ClusterImagePolicy during upgrade (openshift/cluster-update-keys#85).

Signed-off-by: Qi Wang <qiwan@redhat.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress.") Sep 29, 2025
@openshift-ci
Contributor

openshift-ci bot commented Sep 29, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@QiWang19
Member Author

/test all

@QiWang19
Member Author

/retest-required

@QiWang19 QiWang19 marked this pull request as ready for review September 30, 2025 17:47
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress.") Sep 30, 2025
@QiWang19 QiWang19 changed the title from "Not fail upgrade checks if all nodes are ready" to "OCPNODE-3659: Not fail upgrade checks if all nodes are ready" Sep 30, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label ("Indicates that this PR references a valid Jira ticket of any type.") Sep 30, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 30, 2025

@QiWang19: This pull request references OCPNODE-3659 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

not fail upgrade checks if all nodes are ready. This allows for updates that do not require node drain.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@QiWang19
Member Author

/testwith openshift/cluster-update-keys/main/e2e-aws-upgrade openshift/cluster-update-keys#85 #30318

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/multi-pr-openshift-cluster-update-keys-85-openshift-cluster-update-keys-85-openshift-origin-30318-e2e-aws-upgrade/1973048767556882432

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] test passed

@openshift-ci
Contributor

openshift-ci bot commented Sep 30, 2025

@QiWang19: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | d3af51e | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-metal-ipi-ovn | d3af51e | link | false | /test e2e-metal-ipi-ovn |
| ci/prow/e2e-openstack-ovn | d3af51e | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | d3af51e | link | false | /test e2e-metal-ipi-ovn-kube-apiserver-rollout |
| ci/prow/e2e-aws-ovn-upgrade-rollback | d3af51e | link | false | /test e2e-aws-ovn-upgrade-rollback |
| ci/prow/e2e-aws-ovn-edge-zones | d3af51e | link | false | /test e2e-aws-ovn-edge-zones |
| ci/prow/e2e-aws-csi | d3af51e | link | false | /test e2e-aws-csi |
| ci/prow/e2e-aws-ovn-single-node | d3af51e | link | false | /test e2e-aws-ovn-single-node |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

framework.Logf("Waiting on pools to be upgraded")
if err := wait.PollImmediate(10*time.Second, 30*time.Minute, func() (bool, error) {

nodes, err := kubeClient.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
Member


It's hard to imagine a MachineConfigPool update that cordons and drains control-plane nodes where we never see a single master Node that's Unschedulable=True in this 10s poll loop, so looks good to me.
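The thread does not show the full check that the PR adds, so the following is only a minimal sketch of the idea discussed here: inside the poll loop, treat the cluster as healthy when every node is Ready and none is cordoned. The `node` type and `allNodesReadyAndSchedulable` are hypothetical stand-ins for illustration; the real test works on `corev1.Node` objects returned by `kubeClient.CoreV1().Nodes().List`.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for corev1.Node that keeps only
// the fields this check needs. It is not the type used by the real test.
type node struct {
	Name          string
	Ready         bool // true when the node's Ready condition is True
	Unschedulable bool // mirrors spec.unschedulable, set when a node is cordoned
}

// allNodesReadyAndSchedulable reports whether every node is Ready and none is
// cordoned. An empty list returns false so that a missing node list is never
// mistaken for a healthy cluster.
func allNodesReadyAndSchedulable(nodes []node) bool {
	if len(nodes) == 0 {
		return false
	}
	for _, n := range nodes {
		if !n.Ready || n.Unschedulable {
			return false
		}
	}
	return true
}

func main() {
	healthy := []node{{Name: "master-0", Ready: true}, {Name: "master-1", Ready: true}}
	draining := []node{{Name: "master-0", Ready: true, Unschedulable: true}, {Name: "master-1", Ready: true}}

	fmt.Println(allNodesReadyAndSchedulable(healthy))  // true
	fmt.Println(allNodesReadyAndSchedulable(draining)) // false
}
```

With a 10s poll interval, a drain of a control-plane node would be expected to show up as at least one `Unschedulable` node in some iteration, which is the point of the review comment above.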

Member

@wking wking left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm label ("Indicates that a PR is ready to be merged.") Oct 1, 2025
@QiWang19
Member Author

QiWang19 commented Oct 1, 2025

/assign @neisw

@QiWang19
Member Author

QiWang19 commented Oct 1, 2025

/verified by @QiWang19
verified by running /testwith jobs.

@openshift-ci-robot openshift-ci-robot added the verified label ("Signifies that the PR passed pre-merge verification criteria") Oct 1, 2025
@openshift-ci-robot

@QiWang19: This PR has been marked as verified by @QiWang19.

In response to this:

/verified by @QiWang19
verified by running /testwith jobs.


@neisw
Contributor

neisw commented Oct 1, 2025

/approve

@openshift-ci
Contributor

openshift-ci bot commented Oct 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neisw, QiWang19, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") Oct 1, 2025
@neisw
Contributor

neisw commented Oct 1, 2025

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD b2be961 and 2 for PR HEAD d3af51e in total

@openshift-trt

openshift-trt bot commented Oct 1, 2025

Job Failure Risk Analysis for sha: d3af51e

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-upgrade-rollback | High |
[Monitor:oc-adm-upgrade-status][sig-cli][OCPFeatureGate:UpgradeStatus] oc adm upgrade status snapshots reflect the cluster upgrade lifecycle
This test has passed 99.94% of 5114 runs on release 4.21 [Overall] in the last week.

@QiWang19
Member Author

QiWang19 commented Oct 2, 2025

/retest-required

@openshift-merge-bot openshift-merge-bot bot merged commit 01a7cc0 into openshift:main Oct 2, 2025
11 of 31 checks passed
QiWang19 added a commit to QiWang19/origin that referenced this pull request Nov 11, 2025
…atus

The previous PR (openshift#30318) allowed
non-drained updates but also required the node annotation
"machineconfiguration.openshift.io/state"="Done". This condition is too strict,
as the MCD may not be in the "Done" state when the nodes remain schedulable and fully functional.

Signed-off-by: Qi Wang <qiwan@redhat.com>
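The difference between the two conditions described in this commit message can be sketched as two predicates. This is an illustration only, not the code from either commit: `nodeStatus`, `strictOK`, and `relaxedOK` are hypothetical names, and the `"Working"` annotation value is used purely as an example of a state other than `"Done"`.

```go
package main

import "fmt"

// Annotation key quoted in the commit message above.
const mcdStateAnnotation = "machineconfiguration.openshift.io/state"

// nodeStatus is a simplified, hypothetical stand-in for the node fields
// that the upgrade check looks at.
type nodeStatus struct {
	Ready         bool
	Unschedulable bool
	Annotations   map[string]string
}

// strictOK models the earlier condition: the node must be Ready, schedulable,
// and annotated with state "Done".
func strictOK(n nodeStatus) bool {
	return n.Ready && !n.Unschedulable && n.Annotations[mcdStateAnnotation] == "Done"
}

// relaxedOK models the loosened condition: a Ready, schedulable node is
// accepted regardless of the state annotation.
func relaxedOK(n nodeStatus) bool {
	return n.Ready && !n.Unschedulable
}

func main() {
	// A node that is functional while its state annotation is not "Done".
	// "Working" is an illustrative value chosen for this example.
	n := nodeStatus{
		Ready:       true,
		Annotations: map[string]string{mcdStateAnnotation: "Working"},
	}
	fmt.Println(strictOK(n))  // false
	fmt.Println(relaxedOK(n)) // true
}
```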
wking added a commit to wking/cluster-update-keys that referenced this pull request Nov 19, 2025
…-openshift-cip""

This reverts commit 7a5dcee.

This one has taken us some time:

* 2025-08-27, 94f7582, openshift#82 was our first attempt at enabling the
  ClusterImagePolicy.
* ...but it tripped up the origin test suite, so it was reverted in
  2025-08-28, c40e7b9, openshift#83.
* Qi then hardened the test suite with openshift/origin@d3af51e4acb
  (not fail upgrade checks if all nodes are ready, 2025-09-29,
  openshift/origin#30318) and openshift/origin@2fd0d8e242 (Upgrade
  test add 2min grace period allow non-drain updates to complete,
  2025-11-12, openshift/origin#30480).
* With the tougher CI in place, we tried a second time with
  2025-11-17, 1f89a67, openshift#85.
* ...but still tripped up origin, with runs like [1] taking 2.25m
  (more than the 2m grace period):

    I1119 17:26:21.890667 1511 upgrade.go:629] Waiting on pools to be upgraded
    I1119 17:26:21.939178 1511 upgrade.go:792] Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    I1119 17:26:21.939259 1511 upgrade.go:666] Invariant violation detected: master pool requires update but nodes not ready. Waiting up to 2m0s for non-draining updates to complete
    I1119 17:26:31.984116 1511 upgrade.go:792] Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    ...
    I1119 17:28:21.981438 1511 upgrade.go:792] Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    I1119 17:28:21.981514 1511 upgrade.go:673] Invariant violation detected: the "master" pool should be updated before the CVO reports available at the new version

  and:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade/1991158541779472384/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/inspect/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/master.yaml | yaml2json | jq -r '.status.conditions[] | select(.type == "Updating") | .lastTransitionTime + " " + .status'
    2025-11-19T17:28:36Z False

  28:36 - 26:21 = 135s = 2.25m, which overshot the 2m grace period.
  The second attempt was reverted in 7a5dcee, openshift#87.

* Qi then hardened the test suite further with
  openshift/origin@c17e560263 (Update grace period for cluster upgrade
  to 10 minutes, 2025-11-19, openshift/origin#30506).
* This commit is taking a third attempt at enabling the
  ClusterImagePolicy.

[1]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade/1991158541779472384