
OCPBUGS-27264: Only reconcile on Node updates with Label changes #2206

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from danwinship:node-updates
Jan 19, 2024

@danwinship
Contributor

@danwinship danwinship commented Jan 17, 2024

e2e-aws-ovn-shared-to-local-gateway-mode-migration and its opposite flake about 50% of the time with:

+ oc patch Network.operator.openshift.io cluster --type=merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":false}}}}}'
network.operator.openshift.io/cluster patched
+ oc wait co network --for=condition=PROGRESSING=True --timeout=60s
error: timed out waiting for the condition on clusteroperators/network 

This is because #1721 changed CNO to re-reconcile any time any Node object changes, but Nodes change a lot (specifically, their Conditions), so now we do a ton of unnecessary reconciling, causing CNO to be super busy and lag behind in processing events. This PR fixes it to only reconcile when node labels change.

(I don't know whether CNO lagginess causes any other problems currently?)
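For illustration only, here is a minimal sketch of the filtering idea described above, not the actual diff from this PR. The `Node` type, `labelsEqual`, and `shouldReconcile` are hypothetical stand-ins written against the standard library; the real change would live in CNO's Node watch/predicate code.

```go
package main

import "fmt"

// Node is a hypothetical, trimmed-down stand-in for a Kubernetes Node object.
type Node struct {
	Labels     map[string]string
	Conditions []string // stand-in for Node.Status.Conditions
}

// labelsEqual reports whether two label maps hold the same key/value pairs.
// A nil map and an empty map are treated as equal.
func labelsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// shouldReconcile is the update filter: only a label change triggers a
// reconcile; changes to Conditions (or anything else) are ignored.
func shouldReconcile(oldNode, newNode Node) bool {
	return !labelsEqual(oldNode.Labels, newNode.Labels)
}

func main() {
	base := Node{
		Labels:     map[string]string{"example.com/role": "worker"},
		Conditions: []string{"Ready"},
	}
	condOnly := Node{
		Labels:     map[string]string{"example.com/role": "worker"},
		Conditions: []string{"Ready", "MemoryPressure"},
	}
	relabeled := Node{
		Labels:     map[string]string{"example.com/role": "infra"},
		Conditions: []string{"Ready"},
	}

	fmt.Println(shouldReconcile(base, condOnly))  // false: only Conditions changed
	fmt.Println(shouldReconcile(base, relabeled)) // true: a label value changed
}
```

With a filter like this in front of the reconciler, the frequent Condition-only Node updates are dropped before they ever enqueue work.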

Reconciling every time any Node.Status.Condition changes means CNO
just ends up constantly reconciling.
@openshift-ci openshift-ci bot requested review from abhat and tssurya January 17, 2024 02:55
@openshift-ci openshift-ci bot added the approved label Jan 17, 2024
@danwinship danwinship marked this pull request as draft January 17, 2024 12:08
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Jan 17, 2024
@danwinship danwinship marked this pull request as ready for review January 17, 2024 13:52
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Jan 17, 2024
@openshift-ci openshift-ci bot requested a review from JacobTanenbaum January 17, 2024 13:54
@danwinship danwinship changed the title Only reconcile on Node updates with Label changes OCPBUGS-27264: Only reconcile on Node updates with Label changes Jan 17, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/valid-bug labels Jan 17, 2024
@openshift-ci-robot
Contributor

@danwinship: This pull request references Jira Issue OCPBUGS-27264, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

The bug has been updated to refer to the pull request using the external bug tracker.

@danwinship
Contributor Author

/retest

@openshift-ci-robot
Contributor

@danwinship: This pull request references Jira Issue OCPBUGS-27264, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

@danwinship
Contributor Author

/retest-required

Contributor

@tssurya tssurya left a comment

/lgtm
cc @wizhaoredhat
Wondering if this can be further reduced to checking only the necessary labels, but this LGTM.
CI seems to be passing for the gateway mode migration jobs.
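As a rough sketch of the narrower check suggested in this review comment (not something this PR implements), the filter could compare only a fixed set of label keys. The function name and the `example.com/...` keys below are made up for illustration; which labels CNO actually depends on is not stated in this thread.

```go
package main

import "fmt"

// relevantLabelsChanged reports whether any label listed in keys was added,
// removed, or changed between oldLabels and newLabels. Labels outside keys
// are ignored. Reading from a nil map is safe in Go, so nil inputs are fine.
func relevantLabelsChanged(oldLabels, newLabels map[string]string, keys []string) bool {
	for _, k := range keys {
		oldVal, oldOK := oldLabels[k]
		newVal, newOK := newLabels[k]
		if oldOK != newOK || oldVal != newVal {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical label key; the real set would come from the operator's code.
	keys := []string{"example.com/watched"}

	oldLabels := map[string]string{"example.com/watched": "a", "example.com/other": "x"}
	otherChanged := map[string]string{"example.com/watched": "a", "example.com/other": "y"}
	watchedChanged := map[string]string{"example.com/watched": "b", "example.com/other": "x"}

	fmt.Println(relevantLabelsChanged(oldLabels, otherChanged, keys))   // false
	fmt.Println(relevantLabelsChanged(oldLabels, watchedChanged, keys)) // true
}
```

The trade-off is that the key list has to be kept in sync with whatever the reconciler actually reads from Nodes, whereas comparing the full label map needs no such maintenance.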

@openshift-ci openshift-ci bot added the lgtm label Jan 18, 2024
@openshift-ci
Contributor

openshift-ci bot commented Jan 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, tssurya

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 2eed7cf and 2 for PR HEAD b81354d in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD c793774 and 1 for PR HEAD b81354d in total

@wizhaoredhat
Contributor

LGTM

@openshift-ci
Contributor

openshift-ci bot commented Jan 19, 2024

@danwinship: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | b81354d | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6

@danwinship
Contributor Author

/retest-required

@openshift-merge-bot openshift-merge-bot bot merged commit e53cc19 into openshift:master Jan 19, 2024
@openshift-ci-robot
Contributor

@danwinship: Jira Issue OCPBUGS-27264: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-27264 has been moved to the MODIFIED state.

@danwinship danwinship deleted the node-updates branch January 19, 2024 15:50
@danwinship
Contributor Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@danwinship: new pull request created: #2212

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-network-operator-container-v4.16.0-202401191549.p0.ge53cc19.assembly.stream for distgit cluster-network-operator.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-01-21-092529

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.

7 participants