
Conversation


@jianzhangbjz jianzhangbjz commented Oct 30, 2025

Description of the change:

1. Pod Disruption Detection

Added an isAPIServiceBackendDisrupted() function that checks whether APIService unavailability is due to expected pod disruption (see the sketch after the signal list below):

Disruption Signals Detected:

  • Pod has DeletionTimestamp set (terminating during node drain)
  • Pod in Pending phase (being scheduled/created after eviction)
  • Container in ContainerCreating or PodInitializing state
  • Deployment has unavailable replicas with pods restarting
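
In sketch form (illustrative only; the parameters and exact checks below are assumptions based on the signal list above, not the PR's literal code):

```go
// Sketch: approximates the disruption signals listed above.
package olm

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isAPIServiceBackendDisrupted reports whether the APIService backend pods
// look like they are in an expected, transient disruption (drain, eviction,
// reschedule, restart) rather than a real failure.
func isAPIServiceBackendDisrupted(deployment *appsv1.Deployment, pods []*corev1.Pod) bool {
	for _, pod := range pods {
		// Terminating, e.g. during a node drain.
		if pod.DeletionTimestamp != nil {
			return true
		}
		// Being scheduled/created after eviction.
		if pod.Status.Phase == corev1.PodPending {
			return true
		}
		// Containers still coming up after a restart.
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil &&
				(w.Reason == "ContainerCreating" || w.Reason == "PodInitializing") {
				return true
			}
		}
	}
	// Deployment still reports unavailable replicas while pods churn.
	return deployment != nil && deployment.Status.UnavailableReplicas > 0
}
```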

2. RetryableError Type

Introduced RetryableError to distinguish (see the sketch after this list):

  • RetryableError: Expected transient failures (pod disruption) → Don't change CSV phase
  • Normal Error: Real failures → Mark as Failed as before
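
A minimal sketch of such a type, assuming the standard errors.As pattern (names are illustrative, not necessarily the PR's exact definitions):

```go
package olm

import "errors"

// RetryableError wraps an error that represents an expected, transient
// failure (e.g. pod disruption) rather than a real installation failure.
type RetryableError struct {
	Err error
}

func (e RetryableError) Error() string { return e.Err.Error() }
func (e RetryableError) Unwrap() error { return e.Err }

// IsRetryableError reports whether err is, or wraps, a RetryableError.
func IsRetryableError(err error) bool {
	var re RetryableError
	return errors.As(err, &re)
}
```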

3. Updated Logic

areAPIServicesAvailable() (see the sketch after this list):

  • When APIService unavailable, check for pod disruption
  • If disrupted: Return RetryableError
  • If not disrupted: Return normal error (existing behavior)
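
Roughly, reusing the RetryableError sketch above (the function name and parameters are stand-ins for the real availability check):

```go
package olm

import "fmt"

// checkAPIServiceAvailability mirrors the branching described above.
func checkAPIServiceAvailability(available, backendDisrupted bool) error {
	if available {
		return nil
	}
	err := fmt.Errorf("APIService not available")
	if backendDisrupted {
		// Expected transient disruption: signal the caller to requeue
		// without changing the CSV phase.
		return RetryableError{Err: err}
	}
	// Real failure: existing behavior, the caller marks the CSV as Failed.
	return err
}
```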

updateInstallStatus() (see the sketch after this list):

  • Check if apiServiceErr is retryable
  • If retryable: Requeue without changing CSV phase (NO Progressing=True)
  • If not retryable: Mark as Failed (existing behavior)
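
On the caller side the branch looks roughly like this (the csvFailed/requeue callbacks are hypothetical stand-ins for the real phase transition and requeue machinery):

```go
package olm

// handleAPIServiceError sketches the retryable check in the install-status path.
func handleAPIServiceError(apiServiceErr error, csvFailed func(reason string), requeue func()) {
	if apiServiceErr == nil {
		return
	}
	if IsRetryableError(apiServiceErr) {
		// Expected disruption: requeue and keep the current CSV phase,
		// so the ClusterOperator does not report Progressing=True.
		requeue()
		return
	}
	// Real failure: preserve existing behavior and mark the CSV as Failed.
	csvFailed(apiServiceErr.Error())
}
```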

Impact

Behavior Change

  Scenario                                              | Before                          | After
  ------------------------------------------------------|---------------------------------|------------------------------------------------
  APIService unavailable, pod disrupted during upgrade  | CSV→Failed, Progressing=True ❌ | CSV stays in current phase, Progressing=False ✅
  APIService unavailable, real failure                  | CSV→Failed, Progressing=True    | CSV→Failed, Progressing=True (unchanged)
  APIService available                                  | CSV→Succeeded                   | CSV→Succeeded (unchanged)

Motivation for the change:

Fix OLM Progressing Condition Contract Violation During Cluster Upgrades

Problem

The OLM packageserver ClusterOperator was violating the documented Progressing condition contract during cluster upgrades.

Test Failure

  [Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver
  should stay Progressing=False while MCO is Progressing=True

  Oct 29 06:32:38.018 W clusteroperator/operator-lifecycle-manager-packageserver
    condition/Progressing status/True Working toward 0.0.1-snapshot

Root Cause

During MCO-driven cluster upgrades:

  1. Node reboots (planned maintenance)
  2. PackageServer pod gets evicted and restarts (expected Kubernetes behavior)
  3. APIService temporarily unavailable during pod startup (~14 seconds)
  4. OLM detects unavailability → marks CSV as Failed → ClusterOperator reports Progressing=True
  5. System self-heals after pod restarts

This violates the contract because:

  • No new code being rolled out (same version)
  • No config changes being propagated
  • Just reconciling to previously known state after expected disruption

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

Assisted-by: Claude Code

@jianzhangbjz jianzhangbjz force-pushed the OCPBUGS-63672 branch 2 times, most recently from ab8fe24 to e67282f on October 30, 2025 at 05:54

@camilamacedo86 camilamacedo86 left a comment


Hey @jianzhangbjz, thanks a lot for this PR! 🎉
I agree with the direction: OLM shouldn't flip Progressing=True just because nodes are draining/rebooting or the API backend has a short wobble. I just added a comment about how to identify the scenario; see #3692 (comment)

@jianzhangbjz jianzhangbjz (Author) commented

Thanks! Updated them.


@camilamacedo86 camilamacedo86 left a comment


@jianzhangbjz I think you addressed all concerns.
Thank you for looking into that; it's LGTM from me.

It would be nice to get a second review from either @perdasilva or @tmshort here before we move forward.

Thank you a lot for the nice work 🎉

@openshift-ci openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged) label on Oct 30, 2025

openshift-ci bot commented Oct 30, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: camilamacedo86
Once this PR has been reviewed and has the lgtm label, please assign perdasilva for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
