OCPBUGS-27477: Pausing Master MCP results in Alerts #4707

Conversation

isabella-janssen
Member

- What I did
Update the event message emitted when a required pool is paused with no outstanding updates.

- How to verify it
Pause a required machine config pool:

$ oc patch mcp/master --patch '{"spec": {"paused":true}}' --type=merge

Watch the machine-config-operator pod logs:

$ oc logs -n openshift-machine-config-operator -c machine-config-operator <pod-name> -f
     Starting MachineConfigOperator
     Event(v1.ObjectReference{Kind:"ClusterOperator", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"a22bdfc5-84c1-4063-b66d-87f4e86815e2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: RequiredPoolsFailed' Failed to resync 4.18.0-0.nightly-arm64-2024-11-06-051902 because: **the required MachineConfigPool master was paused with no pending updates but no further syncing will occur until it is unpaused**
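
For anyone scripting the verification steps, here is a minimal Go sketch that builds the same merge-patch body passed to oc patch above. The helper name pausePatch is hypothetical and not part of the MCO codebase, and the program only prints the commands rather than running them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pausePatch returns a JSON merge-patch body that sets spec.paused on a
// MachineConfigPool. Hypothetical helper: only the JSON shape comes from
// the oc patch command in the verification steps.
func pausePatch(paused bool) (string, error) {
	body := map[string]any{
		"spec": map[string]any{"paused": paused},
	}
	b, err := json.Marshal(body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Print the command for pausing and then for unpausing the master pool.
	for _, paused := range []bool{true, false} {
		p, err := pausePatch(paused)
		if err != nil {
			panic(err)
		}
		fmt.Printf("oc patch mcp/master --type=merge --patch '%s'\n", p)
	}
}
```

Passing false produces the patch that unpauses the pool once verification is done.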

- Description for the changelog
OCPBUGS-27477: Update warning message on pause of required MCP.

@openshift-ci-robot added the jira/severity-low, jira/valid-reference, and jira/invalid-bug labels on Nov 19, 2024
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-27477, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@@ -1731,6 +1731,9 @@ func (optr *Operator) syncRequiredMachineConfigPools(config *renderConfig, co *c
		}
		// If we don't account for pause here, we will spin in this loop until we hit the 10 minute timeout because paused pools can't sync.
		if pool.Spec.Paused {
			if isPoolStatusConditionTrue(pool, mcfgv1.MachineConfigPoolUpdated) {
				return false, fmt.Errorf("the required MachineConfigPool %s was paused with no pending updates; no further syncing will occur until it is unpaused", pool.Name)
Member Author


Note to reviewer:

  • The original bug requested that no error be raised when an MCP is paused.
  • In my view, users should still be told that they are pausing a required pool, so I changed the warning message to reassure them that no updates are outstanding while still stating that the paused pool is required.
  • I would love to discuss other perspectives.
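
To make the two cases concrete, here is a minimal, self-contained sketch of the branching described in this note. The pool struct and pausedPoolError are hypothetical stand-ins, not the operator's actual types. The Updated field is assumed to play the role of the MachineConfigPoolUpdated condition check in the diff, and the pending-update wording is copied from the event text reported during QE verification in this conversation.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for the MachineConfigPool fields the check reads.
type pool struct {
	Name    string
	Paused  bool
	Updated bool // assumed equivalent of the MachineConfigPoolUpdated condition being true
}

// pausedPoolError returns the message surfaced for a paused required pool,
// or nil when the pool is not paused.
func pausedPoolError(p pool) error {
	if !p.Paused {
		return nil
	}
	if p.Updated {
		// Nothing is pending: keep the warning, but reassure the user.
		return fmt.Errorf("the required MachineConfigPool %s was paused with no pending updates; no further syncing will occur until it is unpaused", p.Name)
	}
	// Updates are pending and cannot roll out while the pool is paused.
	return fmt.Errorf("error required MachineConfigPool %s is paused and cannot sync until it is unpaused", p.Name)
}

func main() {
	fmt.Println(pausedPoolError(pool{Name: "master", Paused: true, Updated: true}))
	fmt.Println(pausedPoolError(pool{Name: "master", Paused: true, Updated: false}))
}
```

Both branches still return an error, which matches the choice above to keep informing users that a required pool is paused.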

@isabella-janssen
Member Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label on Nov 19, 2024
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-27477, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr


@openshift-ci-robot removed the jira/invalid-bug label on Nov 19, 2024
@openshift-ci bot requested a review from sergiordlr on November 19, 2024 20:13
@isabella-janssen
Member Author

/retest-required

@isabella-janssen force-pushed the ocpbugs-27477-pausing-alert branch from 8c71c0b to 46fb474 on November 20, 2024 21:00
@sergiordlr

Verified using IPI on OSP

  1. Pause the master pool
  2. Check the events (no pending update)

     2s          Warning   OperatorDegraded: RequiredPoolsFailed   clusteroperator/machine-config                    Failed to resync 4.18.0-0.test-2024-11-29-131008-ci-ln-757z3h2-latest because: the required MachineConfigPool master was paused with no pending updates; no further syncing will occur until it is unpaused

  1. Pause the master pool
  2. Create an MC
  3. Check the events (pending update)

     162m        Warning   OperatorDegraded: RequiredPoolsFailed   /machine-config                                   Failed to resync 4.18.0-0.test-2024-11-29-131008-ci-ln-757z3h2-latest because: error required MachineConfigPool master is paused and cannot sync until it is unpaused

/label qe-approved

@openshift-ci bot added the qe-approved label on Nov 29, 2024
@openshift-ci-robot added the jira/invalid-bug label and removed the jira/valid-bug label on Nov 29, 2024
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-27477, which is invalid:

  • expected the bug to target either version "4.19." or "openshift-4.19.", but it targets "4.18.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@djoshy
Contributor

djoshy commented Dec 4, 2024

/lgtm
/approve

/jira refresh

@openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label on Dec 4, 2024
@openshift-ci-robot
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-27477, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr


@openshift-ci bot added the lgtm label on Dec 4, 2024
Contributor

openshift-ci bot commented Dec 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label on Dec 4, 2024
@isabella-janssen
Member Author

/hold

@openshift-ci bot added the do-not-merge/hold label on Dec 4, 2024
@yuqi-zhang
Contributor

Makes sense to me 👍. Are we tracking potential further improvements to this via Jira somewhere? I think an eventual end goal is to be clearer about which states are errors and which states are "we should let the user know, but it's not failing at this time".

@isabella-janssen
Member Author

> Makes sense to me 👍. Are we tracking potential further improvements to this via Jira somewhere? I think an eventual end goal is to be clearer about which states are errors and which states are "we should let the user know, but it's not failing at this time".

Thanks @yuqi-zhang! I have created tech debt story MCO-1462 to track the future-state goal. Please feel free to edit or move it as you see fit.

@isabella-janssen
Member Author

/unhold

/retest

@openshift-ci bot removed the do-not-merge/hold label on Dec 5, 2024
@isabella-janssen
Member Author

/hold

@openshift-ci bot added the do-not-merge/hold label on Dec 5, 2024
@isabella-janssen
Member Author

/unhold

@openshift-ci bot removed the do-not-merge/hold label on Dec 5, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 657b65f and 2 for PR HEAD 46fb474 in total

Contributor

openshift-ci bot commented Dec 6, 2024

@isabella-janssen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 46fb474 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot bot merged commit 6778953 into openshift:master on Dec 6, 2024
18 of 19 checks passed
@openshift-ci-robot
Contributor

@isabella-janssen: Jira Issue OCPBUGS-27477: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-27477 has been moved to the MODIFIED state.


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.19.0-202412060336.p0.g6778953.assembly.stream.el9.
All builds following this will include this PR.

@isabella-janssen
Member Author

/cherry-pick release-4.18,release-4.17,release-4.16

@isabella-janssen deleted the ocpbugs-27477-pausing-alert branch on December 11, 2024 19:37
@openshift-cherrypick-robot

@isabella-janssen: cannot checkout release-4.18,release-4.17,release-4.16: error checking out "release-4.18,release-4.17,release-4.16": exit status 1 error: pathspec 'release-4.18,release-4.17,release-4.16' did not match any file(s) known to git


@isabella-janssen
Member Author

/cherry-pick release-4.18 release-4.17 release-4.16
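
The failed and retried commands differ only in the separator between branch names. As a rough illustration, the sketch below assumes the arguments are split on whitespace, which would explain why the comma-separated form reached git as a single pathspec. targetBranches is a hypothetical helper, not the cherry-pick bot's actual code, and how the bot treats several space-separated branches is not shown in this conversation beyond the single new pull request it reported.

```go
package main

import (
	"fmt"
	"strings"
)

// targetBranches splits the arguments of a /cherry-pick comment on whitespace.
// Hypothetical helper that mirrors the observed behavior only.
func targetBranches(comment string) []string {
	args := strings.TrimPrefix(strings.TrimSpace(comment), "/cherry-pick")
	return strings.Fields(args)
}

func main() {
	// Comma-separated: the whole string stays one token.
	fmt.Printf("%q\n", targetBranches("/cherry-pick release-4.18,release-4.17,release-4.16"))
	// Space-separated: three separate branch names.
	fmt.Printf("%q\n", targetBranches("/cherry-pick release-4.18 release-4.17 release-4.16"))
}
```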

@openshift-cherrypick-robot

@isabella-janssen: new pull request created: #4748


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
7 participants