
Conversation


@jianzhangbjz jianzhangbjz commented Oct 30, 2025

Description of the change:

1. Pod Disruption Detection

Added an isAPIServiceBackendDisrupted() function that checks whether APIService unavailability is due to expected pod disruption (see the sketch after the signal list below):

Disruption Signals Detected:

  • Pod has DeletionTimestamp set (terminating during node drain)
  • Pod in Pending phase (being scheduled/created after eviction)
  • Container in ContainerCreating or PodInitializing state
  • Deployment has unavailable replicas with pods restarting
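
In sketch form (illustrative only; the parameters and exact checks below are assumptions based on the signal list above, not the PR's literal code):

```go
// Sketch: approximates the disruption signals listed above.
package olm

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isAPIServiceBackendDisrupted reports whether the APIService backend pods
// look like they are in an expected, transient disruption (drain, eviction,
// reschedule, restart) rather than a real failure.
func isAPIServiceBackendDisrupted(deployment *appsv1.Deployment, pods []*corev1.Pod) bool {
	for _, pod := range pods {
		// Terminating, e.g. during a node drain.
		if pod.DeletionTimestamp != nil {
			return true
		}
		// Being scheduled/created after eviction.
		if pod.Status.Phase == corev1.PodPending {
			return true
		}
		// Containers still coming up after a restart.
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil &&
				(w.Reason == "ContainerCreating" || w.Reason == "PodInitializing") {
				return true
			}
		}
	}
	// Deployment still reports unavailable replicas while pods churn.
	return deployment != nil && deployment.Status.UnavailableReplicas > 0
}
```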

2. RetryableError Type

Introduced RetryableError to distinguish (see the sketch after this list):

  • RetryableError: Expected transient failures (pod disruption) → Don't change CSV phase
  • Normal Error: Real failures → Mark as Failed as before
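
A minimal sketch of such a type, assuming the standard errors.As pattern (names are illustrative, not necessarily the PR's exact definitions):

```go
package olm

import "errors"

// RetryableError wraps an error that represents an expected, transient
// failure (e.g. pod disruption) rather than a real installation failure.
type RetryableError struct {
	Err error
}

func (e RetryableError) Error() string { return e.Err.Error() }
func (e RetryableError) Unwrap() error { return e.Err }

// IsRetryableError reports whether err is, or wraps, a RetryableError.
func IsRetryableError(err error) bool {
	var re RetryableError
	return errors.As(err, &re)
}
```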

3. Updated Logic

areAPIServicesAvailable() (see the sketch after this list):

  • When APIService unavailable, check for pod disruption
  • If disrupted: Return RetryableError
  • If not disrupted: Return normal error (existing behavior)
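
Roughly, reusing the RetryableError sketch above (the function name and parameters are stand-ins for the real availability check):

```go
package olm

import "fmt"

// checkAPIServiceAvailability mirrors the branching described above.
func checkAPIServiceAvailability(available, backendDisrupted bool) error {
	if available {
		return nil
	}
	err := fmt.Errorf("APIService not available")
	if backendDisrupted {
		// Expected transient disruption: signal the caller to requeue
		// without changing the CSV phase.
		return RetryableError{Err: err}
	}
	// Real failure: existing behavior, the caller marks the CSV as Failed.
	return err
}
```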

updateInstallStatus() (see the sketch after this list):

  • Check if apiServiceErr is retryable
  • If retryable: Requeue without changing CSV phase (NO Progressing=True)
  • If not retryable: Mark as Failed (existing behavior)
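
On the caller side the branch looks roughly like this (the csvFailed/requeue callbacks are hypothetical stand-ins for the real phase transition and requeue machinery):

```go
package olm

// handleAPIServiceError sketches the retryable check in the install-status path.
func handleAPIServiceError(apiServiceErr error, csvFailed func(reason string), requeue func()) {
	if apiServiceErr == nil {
		return
	}
	if IsRetryableError(apiServiceErr) {
		// Expected disruption: requeue and keep the current CSV phase,
		// so the ClusterOperator does not report Progressing=True.
		requeue()
		return
	}
	// Real failure: preserve existing behavior and mark the CSV as Failed.
	csvFailed(apiServiceErr.Error())
}
```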

Impact

Behavior Change

  Scenario                                              | Before                          | After
  ------------------------------------------------------|---------------------------------|------------------------------------------------
  APIService unavailable, pod disrupted during upgrade  | CSV→Failed, Progressing=True ❌ | CSV stays in current phase, Progressing=False ✅
  APIService unavailable, real failure                  | CSV→Failed, Progressing=True    | CSV→Failed, Progressing=True (unchanged)
  APIService available                                  | CSV→Succeeded                   | CSV→Succeeded (unchanged)

Motivation for the change:

Fix OLM Progressing Condition Contract Violation During Cluster Upgrades

Problem

The OLM packageserver ClusterOperator was violating the documented Progressing condition contract during cluster upgrades.

Test Failure

  [Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver
  should stay Progressing=False while MCO is Progressing=True

  Oct 29 06:32:38.018 W clusteroperator/operator-lifecycle-manager-packageserver
    condition/Progressing status/True Working toward 0.0.1-snapshot

Root Cause

During MCO-driven cluster upgrades:

  1. Node reboots (planned maintenance)
  2. PackageServer pod gets evicted and restarts (expected Kubernetes behavior)
  3. APIService temporarily unavailable during pod startup (~14 seconds)
  4. OLM detects unavailability → marks CSV as Failed → ClusterOperator reports Progressing=True
  5. System self-heals after pod restarts

This violates the contract because:

  • No new code being rolled out (same version)
  • No config changes being propagated
  • Just reconciling to previously known state after expected disruption

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

Assisted-by: Claude Code

@jianzhangbjz jianzhangbjz force-pushed the OCPBUGS-63672 branch 2 times, most recently from ab8fe24 to e67282f on October 30, 2025 at 05:54

@camilamacedo86 camilamacedo86 left a comment


Hey @jianzhangbjz, thanks a lot for this PR! 🎉
I agree with the direction: OLM shouldn't flip Progressing=True just because nodes are draining/rebooting or the API backend has a short wobble. I just added a comment about how to identify the scenario; see #3692 (comment)

@jianzhangbjz jianzhangbjz (Author) commented

Thanks! Updated them.


@camilamacedo86 camilamacedo86 left a comment


@jianzhangbjz I think you addressed all concerns.
Thank you for looking into that; it's LGTM from me.

It would be nice to get a second review from either @perdasilva or @tmshort here before we move forward.

Thank you a lot for the nice work 🎉

@openshift-ci openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged) label on Oct 30, 2025

openshift-ci bot commented Oct 30, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: camilamacedo86
Once this PR has been reviewed and has the lgtm label, please assign perdasilva for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
