MachineHealthChecks cause endless loop during upgrade in slow environment #3918
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue on Nov 15, 2022:
…achine health check (aws#3918)
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue on Nov 16, 2022:
Before this change, an unhealthy machine health check would time out after five minutes. This leads to an endless loop for a RollingUpgrade on a slow system. Disabling the health checks circumvents this issue, but is not ideal. This change adds an environment variable, EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS, which customers can use to make the upgrade process more reliable without sacrificing health checks.
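A minimal usage sketch of that environment variable, assuming the standard eksctl anywhere upgrade cluster invocation; the 30-minute value and the cluster.yaml path are placeholders, and the exact format accepted may differ by EKS Anywhere release.

    # Sketch only: EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS comes from the commit above;
    # the 30-minute value and the cluster.yaml path are illustrative placeholders.
    export EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS=30
    eksctl anywhere upgrade cluster -f cluster.yaml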
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue on Nov 17, 2022:
Before this change, an unhealthy machine health check would time out after five minutes. This leads to an endless loop for a RollingUpgrade on a slow system. Disabling the health checks circumvents this issue, but is not ideal. This change adds a --unhealthy-machine-timeout flag which customers can use to make the upgrade process more reliable without sacrificing health checks.
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue on Nov 18, 2022 (same commit message).
eks-distro-bot pushed a commit that referenced this issue on Nov 21, 2022 (same commit message).
panktishah26 pushed a commit to panktishah26/eks-anywhere that referenced this issue on Nov 24, 2022 (same commit message).
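A hedged sketch of the --unhealthy-machine-timeout flag in use; the duration format (30m) and the cluster spec path are assumptions, so check eksctl anywhere upgrade cluster --help on your release before relying on it.

    # Sketch only: the --unhealthy-machine-timeout flag name comes from the commits
    # above; the 30m value and the cluster.yaml path are illustrative.
    eksctl anywhere upgrade cluster -f cluster.yaml --unhealthy-machine-timeout 30m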
What happened:
User was attempting to upgrade a large management cluster with many workload clusters. The system slowed down significantly. New machines were being created as part of the RollingUpgrade, but they took a long time to get the Bootstrap Secret from CAPI. As a result, the MachineHealthCheck kept recycling those CAPI machines.
What you expected to happen:
I was expecting CAPI to run quickly and the MachineHealthChecks not to take down the machines so quickly. Disabling the MachineHealthCheck allowed the upgrade to proceed.
I think a potential solution here would be to make the MachineHealthCheck timeout configurable.
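One possible workaround along those lines (not taken from the issue): inspect the Cluster API MachineHealthCheck objects on the management cluster and raise their timeouts directly. The eksa-system namespace and the MachineHealthCheck name below are assumptions for illustration; nodeStartupTimeout is the standard Cluster API field that controls how long a machine may take to bootstrap before it is remediated.

    # Workaround sketch: list the MachineHealthChecks, then raise the startup timeout
    # so slow machines are not recycled while waiting for the bootstrap secret.
    # Namespace and object name are hypothetical.
    kubectl get machinehealthchecks -A
    kubectl patch machinehealthcheck my-cluster-kcp-unhealthy -n eksa-system \
      --type merge -p '{"spec":{"nodeStartupTimeout":"30m"}}'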
How to reproduce it (as minimally and precisely as possible):
Potentially:
Anything else we need to know?:
Environment: CloudStack 4.14