-
Notifications
You must be signed in to change notification settings - Fork 565
Fix: OLM should not report Progressing=True during pod disruption from cluster upgrades #3692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
ab8fe24 to
e67282f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jianzhangbjz — thanks a lot for this PR! 🎉
I agree with the direction: OLM shouldn’t flip Progressing=True just because nodes are draining/rebooting or the API backend has a short wobble. I just added a comment about how identify the scenario see; #3692 (comment)
…m cluster upgrades
e67282f to
ae702b9
Compare
|
Thanks! Updated them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jianzhangbjz I think you addressed all concerns.
Thank you for looking on that it is LGTM for me
It would be nice either get a second reviewer from @perdasilva or @tmshort here before we move forward.
Thank you a lot for the nice work 🎉
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: camilamacedo86 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Description of the change:
1. Pod Disruption Detection
Added
isAPIServiceBackendDisrupted()function that checks if APIService unavailability is due to expected pod disruption:Disruption Signals Detected:
DeletionTimestampset (terminating during node drain)Pendingphase (being scheduled/created after eviction)ContainerCreatingorPodInitializingstate2. RetryableError Type
Introduced
RetryableErrorto distinguish:3. Updated Logic
areAPIServicesAvailable():RetryableErrorupdateInstallStatus():apiServiceErris retryableImpact
Behavior Change
Motivation for the change:
Fix OLM Progressing Condition Contract Violation During Cluster Upgrades
Problem
The OLM packageserver ClusterOperator was violating the documented Progressing condition contract during cluster upgrades.
Test Failure
Root Cause
During MCO-driven cluster upgrades:
Failed→ ClusterOperator reportsProgressing=TrueThis violates the contract because:
Architectural changes:
Testing remarks:
Reviewer Checklist
/doc[FLAKE]are truly flaky and have an issueAssisted-by: Claude Code