Add flag to set machine health check timeout (#3918) #4123

msanjaq · 2022-11-16T22:26:13Z

Before this change, an unhealthy machine health check would time out after five minutes. This leads to an endless loop for a RollingUpgrade on a slow system. Disabling the health checks circumvents this issue but is not ideal. This change adds a flag, --unhealthy-machine-timeout, which customers can use to make the upgrade process more reliable without sacrificing health checks.

Issue #, if available: Resolves #3918

Description of changes: Allow users to override the default unhealthy machine condition timeout using an --unhealthy-machine-timeout flag

Testing (if applicable):

Created unit tests for the scenarios:

No environmental variable sets the timeout to the default value
Invalid environmental variable (e.g. negatives, NaNs, large values which cause int overflows) sets the timeout to 5m
Valid environmental variable (e.g. 20) sets the timeout to 20m

I performed a manual test by running the command
eksctl anywhere create cluster --unhealthy-machine-timeout 20m -f mgmt_cluster.yaml
and confirmed the timeout was set correctly with
kubectl get -o yaml machinehealthcheck -A --kubeconfig sanjaqmo-healthcheck-test-mgmt/sanjaqmo-healthcheck-test-mgmt-eks-a-cluster.kubeconfig

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

eks-distro-bot · 2022-11-16T22:26:24Z

Hi @msanjaq. Thanks for your PR.

I'm waiting for an aws member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vivek-koppuru · 2022-11-17T00:57:41Z

/ok-to-test

vivek-koppuru

I realized that we have a set of timeout flags here. Can we add it there instead?

eks-anywhere/cmd/eksctl-anywhere/cmd/options.go

Line 35 in 9b99141

    
           flagSet.StringVar(&t.externalEtcdWaitTimeout, externalEtcdWaitTimeoutFlag, clustermanager.DefaultEtcdWait.String(), "Override the default external etcd wait timeout (60m)")

codecov · 2022-11-17T01:06:21Z

Codecov Report

Merging #4123 (ebef5e2) into main (ccb3624) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #4123      +/-   ##
==========================================
+ Coverage   67.55%   67.62%   +0.06%     
==========================================
  Files         396      398       +2     
  Lines       32005    32092      +87     
==========================================
+ Hits        21622    21702      +80     
- Misses       8960     8964       +4     
- Partials     1423     1426       +3

Impacted Files	Coverage Δ
pkg/clusterapi/machine_health_check.go	`100.00% <100.00%> (ø)`
pkg/clustermanager/cluster_manager.go	`67.08% <100.00%> (+0.23%)`	⬆️
controllers/vsphere_datacenter_controller.go	`10.34% <0.00%> (-1.52%)`	⬇️
pkg/api/v1alpha1/cluster.go	`72.79% <0.00%> (-0.24%)`	⬇️
controllers/snow_machineconfig_controller.go	`89.09% <0.00%> (-0.20%)`	⬇️
controllers/cluster_controller.go	`75.51% <0.00%> (-0.17%)`	⬇️
controllers/factory.go	`94.94% <0.00%> (-0.12%)`	⬇️
pkg/clusterapi/workers.go	`100.00% <0.00%> (ø)`
controllers/docker_datacenter_controller.go	`100.00% <0.00%> (ø)`
pkg/providers/docker/reconciler/reconciler.go	`100.00% <0.00%> (ø)`
... and 6 more

vivek-koppuru

/approve

vivek-koppuru · 2022-11-19T00:31:27Z

pkg/clustermanager/cluster_manager_test.go

 kind: MachineHealthCheck
 metadata:
  creationTimestamp: null
  name: fluxTestCluster-worker-1-worker-unhealthy
  namespace: eksa-system
 spec:
  clusterName: fluxTestCluster
-  maxUnhealthy: 40%
+  maxUnhealthy: 40%%


Is this a mistake?

msanjaq · 2022-11-21

No, this is intentional. The '%%' is converted to '%' by sprintf.

https://stackoverflow.com/questions/35681595/escape-variables-with-printf

vivek-koppuru · 2022-11-21

Oh, so it wasn't validating before?

vivek-koppuru

/approve

eks-distro-bot · 2022-11-21T18:10:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vivek-koppuru

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vivek-koppuru]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

eks-distro-bot added needs-ok-to-test and size/M (a PR that changes 30-99 lines, ignoring generated files) labels Nov 16, 2022

eks-distro-bot added ok-to-test and removed needs-ok-to-test labels Nov 17, 2022

vivek-koppuru reviewed Nov 17, 2022

eks-distro-bot added size/L (a PR that changes 100-499 lines, ignoring generated files) and removed size/M labels Nov 17, 2022

maxdrib approved these changes Nov 17, 2022

eks-distro-bot assigned maxdrib Nov 17, 2022

eks-distro-bot added the lgtm label Nov 17, 2022

msanjaq changed the title ~~Add environmental variable to set machine health check timeout (#3918)~~ Add flag to set machine health check timeout (#3918) Nov 17, 2022

msanjaq force-pushed the main branch from f42dd26 to d9cc839 Compare November 17, 2022 22:10

eks-distro-bot removed the lgtm label Nov 17, 2022

vivek-koppuru approved these changes Nov 18, 2022

eks-distro-bot assigned vivek-koppuru Nov 18, 2022

eks-distro-bot added lgtm and approved labels Nov 18, 2022

msanjaq force-pushed the main branch from d9cc839 to ebef5e2 Compare November 18, 2022 23:48

eks-distro-bot removed the lgtm label Nov 18, 2022

vivek-koppuru reviewed Nov 19, 2022

vivek-koppuru approved these changes Nov 21, 2022

eks-distro-bot added the lgtm label Nov 21, 2022

eks-distro-bot merged commit fceefda into aws:main Nov 21, 2022
