Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flag to set machine health check timeout (#3918) #4123

Merged
merged 1 commit into from
Nov 21, 2022

Conversation

msanjaq
Copy link
Contributor

@msanjaq msanjaq commented Nov 16, 2022

Before this change, an unhealthy machine health check would timeout after five minutes. This leads to an endless loop for a RolingUpgrade on a slow system. Disabling the health checks circumvents this issue, but is not ideal. This change adds a flag --unhealthy-machine-timeout which customers can use to make the upgrade process more reliable without sacrificing health checks.

Issue #, if available: Resolves #3918

Description of changes: Allow users to override the default unhealthy machine condition timeout using an --unhealthy-machine-timeout flag

Testing (if applicable):

Created unit tests for the scenarios:

  1. No environmental variable sets the timeout to the default value
  2. Invalid environmental variable (e.g negatives, NaNs, large values which cause int overflows) sets the timeout to 5m
  3. Valid environmental variable (e.g 20), sets the timeout to 20m

I performed a manual test by running the command
eksctl anywhere create cluster --unhealthy-machine-timeout 20m -f mgmt_cluster.yaml
and confirmed the timeout was set correctly with
kubectl get -o yaml machinehealthcheck -A --kubeconfig sanjaqmo-healthcheck-test-mgmt/sanjaqmo-healthcheck-test-mgmt-eks-a-cluster.kubeconfig

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot
Copy link
Collaborator

Hi @msanjaq. Thanks for your PR.

I'm waiting for a aws member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eks-distro-bot eks-distro-bot added needs-ok-to-test size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 16, 2022
@vivek-koppuru
Copy link
Member

/ok-to-test

Copy link
Member

@vivek-koppuru vivek-koppuru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that we have a set of timeout flags here, can we add it there instead?

flagSet.StringVar(&t.externalEtcdWaitTimeout, externalEtcdWaitTimeoutFlag, clustermanager.DefaultEtcdWait.String(), "Override the default external etcd wait timeout (60m)")

@codecov
Copy link

codecov bot commented Nov 17, 2022

Codecov Report

Merging #4123 (ebef5e2) into main (ccb3624) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #4123      +/-   ##
==========================================
+ Coverage   67.55%   67.62%   +0.06%     
==========================================
  Files         396      398       +2     
  Lines       32005    32092      +87     
==========================================
+ Hits        21622    21702      +80     
- Misses       8960     8964       +4     
- Partials     1423     1426       +3     
Impacted Files Coverage Δ
pkg/clusterapi/machine_health_check.go 100.00% <100.00%> (ø)
pkg/clustermanager/cluster_manager.go 67.08% <100.00%> (+0.23%) ⬆️
controllers/vsphere_datacenter_controller.go 10.34% <0.00%> (-1.52%) ⬇️
pkg/api/v1alpha1/cluster.go 72.79% <0.00%> (-0.24%) ⬇️
controllers/snow_machineconfig_controller.go 89.09% <0.00%> (-0.20%) ⬇️
controllers/cluster_controller.go 75.51% <0.00%> (-0.17%) ⬇️
controllers/factory.go 94.94% <0.00%> (-0.12%) ⬇️
pkg/clusterapi/workers.go 100.00% <0.00%> (ø)
controllers/docker_datacenter_controller.go 100.00% <0.00%> (ø)
pkg/providers/docker/reconciler/reconciler.go 100.00% <0.00%> (ø)
... and 6 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@eks-distro-bot eks-distro-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 17, 2022
@msanjaq msanjaq changed the title Add environmental variable to set machine health check timeout (#3918) Add flag to set machine health check timeout (#3918) Nov 17, 2022
Copy link
Member

@vivek-koppuru vivek-koppuru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Before this change, an unhealthy machine health check would timeout
after five minutes. This leads to an endless loop for a RolingUpgrade
on a slow system. Disabling the health checks circumvents this issue,
but is not ideal. This change adds a --unhealthy-machine-timeout flag
which customers can use to make the upgrade process more reliable
without sacrificing health checks.
kind: MachineHealthCheck
metadata:
creationTimestamp: null
name: fluxTestCluster-worker-1-worker-unhealthy
namespace: eksa-system
spec:
clusterName: fluxTestCluster
maxUnhealthy: 40%
maxUnhealthy: 40%%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a mistake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is intentional. The '%%' is converted to '%' by sprintf.

https://stackoverflow.com/questions/35681595/escape-variables-with-printf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh so it wasn't validating before?

Copy link
Member

@vivek-koppuru vivek-koppuru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@eks-distro-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vivek-koppuru

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot merged commit fceefda into aws:main Nov 21, 2022
panktishah26 pushed a commit to panktishah26/eks-anywhere that referenced this pull request Nov 24, 2022
Before this change, an unhealthy machine health check would timeout
after five minutes. This leads to an endless loop for a RolingUpgrade
on a slow system. Disabling the health checks circumvents this issue,
but is not ideal. This change adds a --unhealthy-machine-timeout flag
which customers can use to make the upgrade process more reliable
without sacrificing health checks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm ok-to-test size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MachineHealthChecks cause endless loop during upgrade in slow environment
4 participants