Adding 'wait not ready' to prevent premature upgrade moves #1542
Conversation
logger.V(3).Info("Waiting for control plane upgrade to be in progress")
err = c.clusterClient.WaitForControlPlaneNotReady(ctx, managementCluster, ctrlPlaneInProgressStr, newClusterSpec.Cluster.Name)
if err != nil {
	logger.V(3).Info("no control plane upgrading")
This is a bit confusing to me. Won't we always wait for the control plane now? So if we encounter an error here, isn't that an error we want to fail on instead?
Also, the message "no control plane upgrading" sounds a little confusing to me. Is this saying that if this command fails, we don't anticipate any upgrade of the control plane? In that case, shouldn't we only run this check when we know we are rolling out new control plane nodes?
I think the challenge here is how we can know whether there is any upgrading of the control plane at all. I saw some code in the method EKSAClusterSpecChanged in the same file that uses reflect.DeepEqual to compare the datacenter and control plane machine configs as part of the logic to determine whether the EKSA cluster spec changed. I have no confidence that that logic is correct in all conditions for determining whether the control plane is being upgraded. If we can figure out a reliable way to know, the first half of the UpgradeCluster method could be put inside an if statement.
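For illustration, here is a minimal sketch of the kind of reflect.DeepEqual comparison being described; the type and field names are made up for the example and are not the actual eks-anywhere structs:

```go
package upgrade

import "reflect"

// MachineConfig is an illustrative stand-in for a machine config; the
// real eks-anywhere types carry many more fields.
type MachineConfig struct {
	Template string
	Count    int
}

// controlPlaneChanged is a hypothetical helper in the spirit of
// EKSAClusterSpecChanged: it reports whether the control plane machine
// config differs between the old and new specs.
func controlPlaneChanged(oldCfg, newCfg *MachineConfig) bool {
	return !reflect.DeepEqual(oldCfg, newCfg)
}
```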
Let's go back to the reason the extra check was introduced: we were checking control plane readiness too soon, before the applied changes had started to affect the cluster.
Our first version of a fix was to wait for a fixed period (30s) before checking that the control plane is ready. That wait could be put into the RunPostControlPlaneUpgrade method, for which every provider has its own implementation: CloudStack could wait 30s, and the others could do nothing (sketched below).
The disadvantage of the fixed wait is that we always wait the full 30s, even if the control plane status goes not-ready within 5s.
The advantage is that it's a provider-specific implementation, so it only impacts the CloudStack provider.
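A minimal sketch of that rejected fixed-wait approach, assuming a hypothetical hook interface (the names here are illustrative, not the real eks-anywhere provider interface):

```go
package upgrade

import (
	"context"
	"time"
)

// postUpgradeHook is a hypothetical shape for the per-provider hook
// discussed above; the real interface in eks-anywhere differs.
type postUpgradeHook interface {
	RunPostControlPlaneUpgrade(ctx context.Context) error
}

// cloudstackHook sleeps a fixed 30s so the control plane has time to
// report not-ready before readiness is checked again.
type cloudstackHook struct{}

func (cloudstackHook) RunPostControlPlaneUpgrade(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// noopHook is what other providers would use: no delay at all.
type noopHook struct{}

func (noopHook) RunPostControlPlaneUpgrade(context.Context) error { return nil }
```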
To answer your question: if there are no control plane changes, the ready status never goes to not-ready, and the method errors out after timing out. That is not an error we want to fail on.
I understand why the extra check is introduced, but imo we should trust EKSAClusterSpecChanged to tell us whether new control plane nodes are spinning up, and fix any misses based on that. It is our code, so we should have confidence in it and change it if we have any concerns about it 😄
However, it might not be as simple as "the cluster spec changed", because we only get here after we've already detected that the cluster spec changed; we would need to actually compare changes related to the KubeadmControlPlane. Right now this is a blind wait where we only check for an error, so we might be able to enhance it by looking for the specific error that kubectl wait throws when the state never changes.
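Something along these lines, perhaps; this is only a sketch, and the timeout sentinel (context.DeadlineExceeded here) is an assumption about what the wait helper would surface, not the actual error the code returns:

```go
package upgrade

import (
	"context"
	"errors"
	"fmt"
)

// errWaitTimeout stands in for whatever error the wait helper surfaces
// when kubectl wait times out; the real sentinel is an assumption here.
var errWaitTimeout = context.DeadlineExceeded

// tolerateTimeout fails on real errors but treats a timeout as "the
// control plane never left Ready", which is expected when nothing in
// the control plane actually changed.
func tolerateTimeout(ctx context.Context, waitNotReady func(context.Context) error) error {
	if err := waitNotReady(ctx); err != nil {
		if errors.Is(err, errWaitTimeout) {
			return nil // no control plane rollout detected; proceed
		}
		return fmt.Errorf("waiting for control plane to begin upgrading: %w", err)
	}
	return nil
}
```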
Also, the function above this is RunPostControlPlaneUpgrade, so technically should we move this wait to before that?
The change to treat the timed-out error separately while waiting for the control plane to go not-ready has been made, and the wait-not-ready step has been moved to before RunPostControlPlaneUpgrade.
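The resulting order, sketched with hypothetical interface and helper names that mirror the discussion above but are not the exact eks-anywhere code:

```go
package upgrade

import "context"

// clusterOps is a hypothetical interface standing in for the cluster
// client and provider; the real signatures in eks-anywhere differ.
type clusterOps interface {
	WaitForControlPlaneNotReady(ctx context.Context) error
	RunPostControlPlaneUpgrade(ctx context.Context) error
	WaitForControlPlaneReady(ctx context.Context) error
	MoveManagement(ctx context.Context) error
}

// upgradeFlow shows the reordered sequence: confirm the rollout has
// started (tolerating a timeout, which means no control plane change),
// run the provider hook, then wait for Ready and move components.
func upgradeFlow(ctx context.Context, c clusterOps, isTimeout func(error) bool) error {
	if err := c.WaitForControlPlaneNotReady(ctx); err != nil && !isTimeout(err) {
		return err
	}
	if err := c.RunPostControlPlaneUpgrade(ctx); err != nil {
		return err
	}
	if err := c.WaitForControlPlaneReady(ctx); err != nil {
		return err
	}
	return c.MoveManagement(ctx)
}
```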
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: chrisdoherty4, maxdrib, vivek-koppuru. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Issue #, if available:
Description of changes:
This PR introduces a new method to the legacy cluster controller which ensures that, prior to moving the cluster during an upgrade, the control plane has actually started upgrading. Without it, eks-a may decide the control plane is ready before the upgrade has even started, proceed to move the components, and then hang indefinitely.
Testing (if applicable):
Tested in the cloudstack branch with e2e tests, unit tests, and in a customer environment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.