
Adding 'wait not ready' to prevent premature upgrade moves #1542

Merged: 5 commits merged into aws:main from the wait-not-ready branch on Mar 28, 2022

Conversation

@maxdrib (Contributor) commented on Mar 21, 2022

Issue #, if available:

Description of changes:
This PR introduces a new method in the legacy cluster controller that ensures, before moving the cluster during an upgrade, that the control plane has actually started upgrading. Without it, eks-a may decide the control plane is ready before the upgrade has even started, proceed to move the components, and then hang indefinitely.
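
For context, the ordering this change is aiming for looks roughly like the sketch below. It is illustrative only: the interface, helper names, and timeouts are placeholders, not the controller's actual types or values.

```go
package upgrade

import (
	"context"
	"log"
)

// ClusterClient is a hypothetical subset of the client the controller would use.
type ClusterClient interface {
	WaitForControlPlaneNotReady(ctx context.Context, managementCluster, timeout, clusterName string) error
	WaitForControlPlaneReady(ctx context.Context, managementCluster, timeout, clusterName string) error
	MoveManagement(ctx context.Context, from, to string) error
}

// upgradeAndMove shows the ordering this change aims for: confirm the rollout has
// actually started (control plane goes not-ready) before waiting for readiness
// and moving the CAPI components.
func upgradeAndMove(ctx context.Context, c ClusterClient, mgmtCluster, clusterName string) error {
	// Without this step, a quick "ready" check can pass before the rollout begins,
	// and the move then races against a control plane that is about to churn.
	if err := c.WaitForControlPlaneNotReady(ctx, mgmtCluster, "15m", clusterName); err != nil {
		// A timeout here can simply mean no control plane change was rolled out.
		log.Printf("control plane never reported not-ready, continuing: %v", err)
	}
	if err := c.WaitForControlPlaneReady(ctx, mgmtCluster, "60m", clusterName); err != nil {
		return err
	}
	return c.MoveManagement(ctx, mgmtCluster, clusterName)
}
```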

Testing (if applicable):
Tested in the cloudstack branch with e2e tests, unit tests, and in a customer environment.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Mar 21, 2022
@maxdrib (Contributor, Author) commented on Mar 21, 2022

/cc @vivek-koppuru @jiayiwang7

logger.V(3).Info("Waiting for control plane upgrade to be in progress")
err = c.clusterClient.WaitForControlPlaneNotReady(ctx, managementCluster, ctrlPlaneInProgressStr, newClusterSpec.Cluster.Name)
if err != nil {
logger.V(3).Info("no control plane upgrading")
A reviewer (Member) commented:
This is a bit confusing to me. Won't it always be the case that we wait for the control plane now? So if we encounter an error here, is that an error we want to fail on instead?

Also, the message "no control plane upgrading" sounds a little confusing to me. It says that if this command fails, we don't anticipate any upgrading of the control plane. In that case, shouldn't we only check for this if we know that we are rolling out new control plane nodes?

@wanyufe (Member) commented on Mar 24, 2022


I think the challenge here is how we can know whether there is any upgrading of the control plane. I saw some code in the method EKSAClusterSpecChanged in the same file; it uses reflect.DeepEqual to compare the datacenter and ControlPlane machine configs as part of the logic to determine whether the EKSAClusterSpec changed. I have no confidence that this logic is correct in all conditions for determining whether the control plane is upgrading. If we can figure out a way to know, the first half of the method UpgradeCluster could be put into an if statement.
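
For reference, the comparison being discussed is roughly of this shape (a minimal sketch with made-up types; the real check compares the full datacenter and machine configs):

```go
package upgrade

import "reflect"

// machineConfig is a stand-in for the control plane machine config type; the
// real EKS-A types carry many more fields.
type machineConfig struct {
	Template string
	Count    int
}

// controlPlaneChanged mirrors the shape of the check discussed above: a deep
// comparison of old vs. new config to decide whether a rollout is expected.
// If a check like this were trusted, the wait-not-ready call could be wrapped
// in an if statement gated on its result.
func controlPlaneChanged(oldCfg, newCfg machineConfig) bool {
	return !reflect.DeepEqual(oldCfg, newCfg)
}
```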

Let's go back to the reason the extra check was introduced: we were checking control plane readiness too soon, before the applied changes start to impact the cluster.

Our first version to address this issue was to wait for some time (30s) before checking that the control plane is ready. This wait can be put into the RunPostControlPlaneUpgrade method, which every provider implements itself. In cloudstack it can wait 30s; other providers can do nothing.

The disadvantage of the fixed wait is that we always wait 30s, even if the control plane status goes to 'not ready' within 5s.
The advantage is that it is a provider-specific implementation, so it only impacts the cloudstack provider.
To answer your question:
If there are no control plane changes, the ready status never goes to 'not ready', and the method errors out after timing out; this is not an error that we want to fail on.
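
The fixed-sleep alternative would look roughly like this (a hypothetical provider hook for illustration, not code from this PR):

```go
package cloudstack

import (
	"context"
	"time"
)

// provider is a placeholder for the CloudStack provider type.
type provider struct{}

// RunPostControlPlaneUpgrade sketches the fixed-sleep alternative described
// above: the provider-specific hook simply waits 30s so the rollout has time
// to flip the control plane to not-ready before the generic readiness check
// runs. The drawback is that the full 30s is always paid, even when the status
// changes within a few seconds.
func (p *provider) RunPostControlPlaneUpgrade(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```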

@vivek-koppuru (Member) commented on Mar 24, 2022


I understand why the extra check is introduced, but imo we should trust EKSAClusterSpecChanged to know whether new control plane nodes are spinning up, and fix any misses in that logic. It is our code, so we should have confidence in it and change it if we have any concerns 😄

However, it might not be as simple as "the cluster spec changed", because we only get here after we detect that the cluster spec changed, so we would need to actually compare changes related to the KubeadmControlPlane. This is a blind wait where we only check whether it returned an error, so we might be able to enhance it by looking for the specific error that kubectl wait throws if the state never changes.
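
Something along these lines could narrow the blind wait (a sketch only; the matched error text is an assumption about kubectl's output and would need to be verified):

```go
package upgrade

import "strings"

// tolerateWaitTimeout sketches the suggestion above: only swallow the specific
// "timed out" failure from kubectl wait and surface every other error.
func tolerateWaitTimeout(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "timed out waiting for the condition") {
		// No not-ready transition was observed within the timeout; treat this
		// as "no control plane rollout happened" rather than a hard failure.
		return nil
	}
	return err
}
```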

A reviewer (Member) commented:

Also, the function above this is RunPostControlPlaneUpgrade, so should we technically move this call before that?

A reviewer (Member) replied:

The change to handle the timed-out error separately while waiting for the control plane to become not ready has been made. The wait-not-ready call has been moved to before RunPostControlPlaneUpgrade.

@vivek-koppuru (Member) left a comment:

/lgtm

@eks-distro-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisdoherty4, maxdrib, vivek-koppuru

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chrisdoherty4,vivek-koppuru]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot merged commit d2a4527 into aws:main Mar 28, 2022
@maxdrib maxdrib deleted the wait-not-ready branch April 14, 2022 18:42
Labels: approved, lgtm, size/M
5 participants