
Adding 'wait not ready' to prevent premature upgrade moves #1542

Merged: 5 commits merged into aws:main from the wait-not-ready branch on Mar 28, 2022

Conversation

@maxdrib (Contributor) commented on Mar 21, 2022

Issue #, if available:

Description of changes:
This PR introduces a new method in the legacy cluster controller that ensures, before moving the cluster during an upgrade, that the control plane has actually started upgrading. Without it, eks-a may decide the control plane is ready before the upgrade has even started, proceed to move the components, and then hang indefinitely.
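
For context, the ordering this change is aiming for looks roughly like the sketch below. It is illustrative only: the interface, helper names, and timeouts are placeholders, not the controller's actual types or values.

```go
package upgrade

import (
	"context"
	"log"
)

// ClusterClient is a hypothetical subset of the client the controller would use.
type ClusterClient interface {
	WaitForControlPlaneNotReady(ctx context.Context, managementCluster, timeout, clusterName string) error
	WaitForControlPlaneReady(ctx context.Context, managementCluster, timeout, clusterName string) error
	MoveManagement(ctx context.Context, from, to string) error
}

// upgradeAndMove shows the ordering this change aims for: confirm the rollout has
// actually started (control plane goes not-ready) before waiting for readiness
// and moving the CAPI components.
func upgradeAndMove(ctx context.Context, c ClusterClient, mgmtCluster, clusterName string) error {
	// Without this step, a quick "ready" check can pass before the rollout begins,
	// and the move then races against a control plane that is about to churn.
	if err := c.WaitForControlPlaneNotReady(ctx, mgmtCluster, "15m", clusterName); err != nil {
		// A timeout here can simply mean no control plane change was rolled out.
		log.Printf("control plane never reported not-ready, continuing: %v", err)
	}
	if err := c.WaitForControlPlaneReady(ctx, mgmtCluster, "60m", clusterName); err != nil {
		return err
	}
	return c.MoveManagement(ctx, mgmtCluster, clusterName)
}
```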

Testing (if applicable):
Tested in the cloudstack branch with e2e tests, unit tests, and in a customer environment.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Mar 21, 2022
@maxdrib (Contributor, Author) commented on Mar 21, 2022

/cc @vivek-koppuru @jiayiwang7

logger.V(3).Info("Waiting for control plane upgrade to be in progress")
err = c.clusterClient.WaitForControlPlaneNotReady(ctx, managementCluster, ctrlPlaneInProgressStr, newClusterSpec.Cluster.Name)
if err != nil {
logger.V(3).Info("no control plane upgrading")
A reviewer (Member) commented:
This is a bit confusing to me. Won't it always be the case that we wait for the control plane now? So if we encounter an error here, is that an error we want to fail on instead?

Also, the message "no control plane upgrading" sounds a little confusing to me. It says that if this command fails, we don't anticipate any upgrading of the control plane. In that case, shouldn't we only check for this if we know that we are rolling out new control plane nodes?

@wanyufe (Member) commented on Mar 24, 2022


I think the challenge here is how we can know whether there is any upgrading of the control plane. I saw some code in the method EKSAClusterSpecChanged in the same file; it uses reflect.DeepEqual to compare the datacenter and ControlPlane machine configs as part of the logic to determine whether the EKSAClusterSpec changed. I have no confidence that this logic is correct in all conditions for determining whether the control plane is upgrading. If we can figure out a way to know, the first half of the method UpgradeCluster could be put into an if statement.
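
For reference, the comparison being discussed is roughly of this shape (a minimal sketch with made-up types; the real check compares the full datacenter and machine configs):

```go
package upgrade

import "reflect"

// machineConfig is a stand-in for the control plane machine config type; the
// real EKS-A types carry many more fields.
type machineConfig struct {
	Template string
	Count    int
}

// controlPlaneChanged mirrors the shape of the check discussed above: a deep
// comparison of old vs. new config to decide whether a rollout is expected.
// If a check like this were trusted, the wait-not-ready call could be wrapped
// in an if statement gated on its result.
func controlPlaneChanged(oldCfg, newCfg machineConfig) bool {
	return !reflect.DeepEqual(oldCfg, newCfg)
}
```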

Let's go back to the reason the extra check was introduced: we were checking control plane readiness too soon, before the applied changes start to impact the cluster.

Our first version to address this issue was to wait for some time (30s) before checking that the control plane is ready. This wait can be put into the RunPostControlPlaneUpgrade method, which every provider implements itself. In cloudstack it can wait 30s; other providers can do nothing.

The disadvantage of the fixed wait is that we always wait 30s, even if the control plane status goes to 'not ready' within 5s.
The advantage is that it is a provider-specific implementation, so it only impacts the cloudstack provider.
To answer your question:
If there are no control plane changes, the ready status never goes to 'not ready', and the method errors out after timing out; this is not an error that we want to fail on.
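
The fixed-sleep alternative would look roughly like this (a hypothetical provider hook for illustration, not code from this PR):

```go
package cloudstack

import (
	"context"
	"time"
)

// provider is a placeholder for the CloudStack provider type.
type provider struct{}

// RunPostControlPlaneUpgrade sketches the fixed-sleep alternative described
// above: the provider-specific hook simply waits 30s so the rollout has time
// to flip the control plane to not-ready before the generic readiness check
// runs. The drawback is that the full 30s is always paid, even when the status
// changes within a few seconds.
func (p *provider) RunPostControlPlaneUpgrade(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```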

@vivek-koppuru (Member) commented on Mar 24, 2022


I understand why the extra check is introduced, but imo we should trust EKSAClusterSpecChanged to know whether new control plane nodes are spinning up, and fix any misses in that logic. It is our code, so we should have confidence in it and change it if we have any concerns 😄

However, it might not be as simple as "the cluster spec changed", because we only get here after we detect that the cluster spec changed, so we would need to actually compare changes related to the KubeadmControlPlane. This is a blind wait where we only check whether it returned an error, so we might be able to enhance it by looking for the specific error that kubectl wait throws if the state never changes.
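
Something along these lines could narrow the blind wait (a sketch only; the matched error text is an assumption about kubectl's output and would need to be verified):

```go
package upgrade

import "strings"

// tolerateWaitTimeout sketches the suggestion above: only swallow the specific
// "timed out" failure from kubectl wait and surface every other error.
func tolerateWaitTimeout(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "timed out waiting for the condition") {
		// No not-ready transition was observed within the timeout; treat this
		// as "no control plane rollout happened" rather than a hard failure.
		return nil
	}
	return err
}
```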

A reviewer (Member) commented:

Also, the function above this is RunPostControlPlaneUpgrade, so should we technically move this call before that?

A reviewer (Member) replied:

The change to handle the timed-out error separately while waiting for the control plane to become not ready has been made. The wait-not-ready call has been moved to before RunPostControlPlaneUpgrade.

@vivek-koppuru (Member) left a comment:

/lgtm

@eks-distro-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisdoherty4, maxdrib, vivek-koppuru

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chrisdoherty4,vivek-koppuru]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot merged commit d2a4527 into aws:main Mar 28, 2022
@maxdrib maxdrib deleted the wait-not-ready branch April 14, 2022 18:42
Labels: approved, lgtm, size/M
5 participants