MachineHealthChecks cause endless loop during upgrade in slow environment #3918

Closed
maxdrib opened this issue Nov 2, 2022 · 0 comments · Fixed by #4123
Labels
area/cli Generic EKS-A CLI features team/cli

maxdrib commented Nov 2, 2022

What happened:
A user was attempting to upgrade a large management cluster with many workload clusters. The system slowed down significantly. New machines were being created as part of the RollingUpgrade, but they took a long time to get the Bootstrap Secret from CAPI. As a result, the MachineHealthCheck kept recycling those CAPI machines.

What you expected to happen:
I expected CAPI to run quickly and the MachineHealthChecks not to take down machines so soon. Disabling the MachineHealthCheck allowed the upgrade to proceed.

I think a potential solution here would be to make the MachineHealthCheck timeout configurable.
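
To illustrate why a longer timeout would break the loop, here is a toy model in Python. It is only a sketch under stated assumptions: the function name `recycled_before_success` and the sample bootstrap times are hypothetical, and the model ignores everything CAPI actually does besides comparing a machine's startup time with the health check timeout.

```python
from typing import Optional, Sequence


def recycled_before_success(
    bootstrap_minutes: Sequence[float],
    timeout_minutes: float,
) -> Optional[int]:
    """Return how many machines are recycled before one becomes healthy.

    bootstrap_minutes[i] is the (hypothetical) time the i-th replacement
    machine needs to receive its bootstrap data and report healthy.
    Returns None if every machine in the sequence is recycled, which
    stands in for the endless loop described in this issue.
    """
    for recycled, needed in enumerate(bootstrap_minutes):
        # The machine survives only if it becomes healthy before the
        # health check gives up on it.
        if needed <= timeout_minutes:
            return recycled
        # Otherwise the health check removes it and a replacement starts
        # from scratch on the same slow system.
    return None


# Hypothetical slow environment: every machine needs more than 10 minutes.
slow_system = [12, 11, 13, 12]
print(recycled_before_success(slow_system, timeout_minutes=5))
print(recycled_before_success(slow_system, timeout_minutes=15))
```

With a timeout shorter than every bootstrap time, no machine in the sequence ever survives; once the timeout exceeds the bootstrap time, the first machine is kept.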

How to reproduce it (as minimally and precisely as possible):
Potentially:

  1. Create a management cluster with 8 workload clusters on it, each having 3 etcd, 2 control plane (CP), and 3 worker nodes.
  2. Trigger a rolling upgrade on the EKS-A cluster, for example by changing the Offering or image.

Anything else we need to know?:

Environment: CloudStack 4.14

  • EKS Anywhere Release: v0.8.x
  • EKS Distro Release:
maxdrib added the team/cli and area/cli (Generic EKS-A CLI features) labels Nov 3, 2022
maxdrib added this to the next+1 milestone Nov 3, 2022
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 15, 2022
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 16, 2022

Before this change, an unhealthy machine health check would time out
after five minutes. This leads to an endless loop for a RollingUpgrade
on a slow system. Disabling the health checks circumvents this issue,
but is not ideal. This change adds an environment variable,
EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS, which customers can use to make
the upgrade process more reliable without sacrificing health checks.
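
As a rough illustration of the idea in this commit message, the sketch below reads the timeout from EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS and falls back to the five-minute default. It is written in Python for brevity and is not the eks-anywhere implementation: the function name `unhealthy_condition_timeout` and the validation rules (whole, positive minutes) are assumptions.

```python
import os
from datetime import timedelta
from typing import Mapping, Optional

ENV_VAR = "EKSA_UNHEALTHY_CONDITION_TIMEOUT_MINS"
# Default taken from the commit message: five minutes.
DEFAULT_TIMEOUT = timedelta(minutes=5)


def unhealthy_condition_timeout(
    env: Optional[Mapping[str, str]] = None,
) -> timedelta:
    """Return the health check timeout, honoring the override if set.

    Assumption: the value is a whole, positive number of minutes.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        # Variable unset or empty: keep the existing behavior.
        return DEFAULT_TIMEOUT
    try:
        minutes = int(raw)
    except ValueError:
        raise ValueError(
            f"{ENV_VAR} must be a whole number of minutes, got {raw!r}"
        ) from None
    if minutes <= 0:
        raise ValueError(f"{ENV_VAR} must be positive, got {minutes}")
    return timedelta(minutes=minutes)


# Example: a slow environment raises the timeout to 20 minutes.
print(unhealthy_condition_timeout({ENV_VAR: "20"}))
```

Rejecting invalid values instead of silently falling back to the default is a design choice made for this sketch, so that a typo does not quietly restore the five-minute behavior.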
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 17, 2022
Before this change, an unhealthy machine health check would time out
after five minutes. This leads to an endless loop for a RollingUpgrade
on a slow system. Disabling the health checks circumvents this issue,
but is not ideal. This change adds a --unhealthy-machine-timeout flag
which customers can use to make the upgrade process more reliable
without sacrificing health checks.
msanjaq added a commit to msanjaq/eks-anywhere that referenced this issue Nov 18, 2022
eks-distro-bot pushed a commit that referenced this issue Nov 21, 2022
panktishah26 pushed a commit to panktishah26/eks-anywhere that referenced this issue Nov 24, 2022