MachineHealthChecks cause endless loop during upgrade in slow environment #3918

maxdrib · 2022-11-02T20:47:51Z

What happened:
A user was attempting to upgrade a large management cluster with many workload clusters, and the system slowed down significantly. New machines were being created as part of the RollingUpgrade, but they took a long time to get the bootstrap secret from CAPI. As a result, the MachineHealthCheck kept recycling those CAPI machines before they could become healthy.
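The loop described above can be illustrated with a small sketch. It assumes every replacement machine needs the same amount of time to bootstrap and that the health check remediates any machine still unhealthy once its timeout expires. The function name, the 8-minute bootstrap time, and the cap on replacements are hypothetical and only serve the illustration; this is not CAPI's actual remediation logic.

```go
package main

import (
	"fmt"
	"time"
)

// replacementsUntilHealthy is a hypothetical model of the behaviour in this
// issue: each new machine needs bootstrapTime to become healthy, and the
// health check replaces it if it is still unhealthy after unhealthyTimeout.
// It returns the number of machines created and whether one became healthy.
func replacementsUntilHealthy(bootstrapTime, unhealthyTimeout time.Duration, maxReplacements int) (int, bool) {
	for n := 1; n <= maxReplacements; n++ {
		if bootstrapTime <= unhealthyTimeout {
			// The machine finishes bootstrapping before the timeout fires.
			return n, true
		}
		// Otherwise the machine is remediated and a new one is created,
		// which will hit the same timeout in a uniformly slow environment.
	}
	return maxReplacements, false
}

func main() {
	slowBootstrap := 8 * time.Minute // assumed value for a slow environment

	for _, timeout := range []time.Duration{5 * time.Minute, 30 * time.Minute} {
		n, ok := replacementsUntilHealthy(slowBootstrap, timeout, 10)
		fmt.Printf("timeout=%v machines=%d healthy=%v\n", timeout, n, ok)
	}
}
```

Under these assumptions, a timeout shorter than the bootstrap time never converges no matter how many machines are created, while a longer timeout succeeds with the first machine.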

What you expected to happen:
I expected CAPI to run quickly and the MachineHealthChecks not to take down the machines so soon. Disabling the MachineHealthCheck allowed the upgrade to proceed.

I think a potential solution here would be to make the MachineHealthCheck timeout configurable.

How to reproduce it (as minimally and precisely as possible):
Potentially:

Create a management cluster with 8 workload clusters on it, each having 3 etcd, 2 control plane (CP), and 3 worker nodes.
Trigger a rolling upgrade on the EKS-A cluster, for example by changing the Offering or image.

Anything else we need to know?:

Environment: CloudStack 4.14

EKS Anywhere Release: v0.8.x
EKS Distro Release:

Before this change, an unhealthy machine health check would time out after five minutes. This leads to an endless loop for a RollingUpgrade on a slow system. Disabling the health checks circumvents this issue, but is not ideal. This change adds an environment variable, EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS, which customers can use to make the upgrade process more reliable without sacrificing health checks.
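As a rough illustration of the environment-variable approach, the sketch below reads EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS and falls back to the five-minute default mentioned above. The helper name `parseTimeoutMins` and the validation rules (whole, positive minutes) are assumptions made for this example, not the code from the referenced commits.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// envVarName is the variable named in the commit description.
const envVarName = "EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS"

// defaultUnhealthyTimeout mirrors the five-minute behaviour described above.
const defaultUnhealthyTimeout = 5 * time.Minute

// parseTimeoutMins is a hypothetical helper: it converts the raw variable
// value (a whole number of minutes) into a duration, using the default when
// the variable is unset or empty.
func parseTimeoutMins(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultUnhealthyTimeout, nil
	}
	mins, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("parsing %s=%q: %w", envVarName, raw, err)
	}
	if mins <= 0 {
		return 0, fmt.Errorf("%s must be a positive number of minutes, got %d", envVarName, mins)
	}
	return time.Duration(mins) * time.Minute, nil
}

func main() {
	timeout, err := parseTimeoutMins(os.Getenv(envVarName))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("unhealthy condition timeout:", timeout)
}
```

Rejecting zero and negative values is a design choice for this sketch: a non-positive timeout would make the health check remediate machines immediately, which is the opposite of what the variable is meant to achieve.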

Before this change, an unhealthy machine health check would time out after five minutes. This leads to an endless loop for a RollingUpgrade on a slow system. Disabling the health checks circumvents this issue, but is not ideal. This change adds a --unhealthy-machine-timeout flag, which customers can use to make the upgrade process more reliable without sacrificing health checks.
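The flag-based variant could look roughly like the following. This sketch uses Go's standard `flag` package as a stand-in and assumes the flag takes a Go duration string such as `30m`; the real CLI's flag library, value format, and validation may differ. The function name `parseUpgradeArgs` is hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// parseUpgradeArgs is a hypothetical helper that extracts the timeout from
// command-line arguments, defaulting to the five minutes described above.
func parseUpgradeArgs(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("upgrade", flag.ContinueOnError)
	timeout := fs.Duration(
		"unhealthy-machine-timeout",
		5*time.Minute,
		"how long a machine may stay unhealthy before the health check remediates it",
	)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *timeout <= 0 {
		return 0, fmt.Errorf("--unhealthy-machine-timeout must be positive, got %v", *timeout)
	}
	return *timeout, nil
}

func main() {
	timeout, err := parseUpgradeArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("unhealthy machine timeout:", timeout)
}
```

Compared with an environment variable, a flag is visible in the command's help output and in shell history, which makes the chosen timeout easier to audit after an upgrade.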

maxdrib added team/cli area/cli Generic EKS-A CLI features labels Nov 3, 2022

maxdrib added this to the next+1 milestone Nov 3, 2022

maxdrib mentioned this issue Nov 9, 2022

Orphaned CloudStack VM's present in slow CloudStack environments kubernetes-sigs/cluster-api-provider-cloudstack#190

Closed

msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 15, 2022

Add environmental variable to configure the timeout of an unhealthy m…

a02a262

…achine health check (aws#3918)

msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 15, 2022

Add environmental variable to configure the timeout of an unhealthy m…

7362058

…achine health check (aws#3918)

msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 15, 2022

Add environmental variable to configure the timeout of an unhealthy m…

505403a

…achine health check (aws#3918)

msanjaq mentioned this issue Nov 16, 2022

Add flag to set machine health check timeout (#3918) #4123

Merged

eks-distro-bot closed this as completed in #4123 Nov 21, 2022
