Waiting for k8s nodes to reach count #94

Open
thorro opened this issue Feb 10, 2021 · 7 comments · May be fixed by #130

Comments

thorro commented Feb 10, 2021

We have a big, busy EKS cluster with nodes joining and leaving many times a day (spot instances failing or being replaced). We try to update each ASG separately using the ASG_NAMES setting. The problem is that eks-rolling-update always checks the node count of the whole cluster, so it frequently fails because the count doesn't match the expected value.

It should only monitor the selected ASG(s) for the expected instance count.

2021-02-10 16:26:57,425 INFO     Current k8s node count is 94
2021-02-10 16:26:57,426 INFO     Current k8s node count is 94
2021-02-10 16:26:57,426 INFO     Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:18,198 INFO     Getting k8s nodes...
2021-02-10 16:27:19,341 INFO     Current k8s node count is 94
2021-02-10 16:27:19,342 INFO     Current k8s node count is 94
2021-02-10 16:27:19,342 INFO     Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:40,119 INFO     Getting k8s nodes...
2021-02-10 16:27:41,470 INFO     Current k8s node count is 94
2021-02-10 16:27:41,471 INFO     Current k8s node count is 94
2021-02-10 16:27:41,471 INFO     Waiting for k8s nodes to reach count 92...
...
2021-02-10 16:28:01,472 INFO     Validation failed for cluster *****. Didn't reach expected node count 92.
2021-02-10 16:28:01,472 INFO     Exiting since ASG healthcheck failed after 2 attempts
2021-02-10 16:28:01,472 ERROR    ASG healthcheck failed
2021-02-10 16:28:01,472 ERROR    *** Rolling update of ASG has failed. Exiting ***
2021-02-10 16:28:01,472 ERROR    AWS Auto Scaling Group processes will need resuming manually
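A minimal sketch of that idea, counting only the nodes backed by instances in the selected ASG(s); this is an illustration that assumes boto3 and the Kubernetes Python client are available, and `count_nodes_in_asgs` is a hypothetical helper, not eks-rolling-update code:

```python
# Hedged sketch: count only the k8s nodes whose EC2 instances are registered
# in the given ASGs, by matching each node's spec.provider_id against the
# ASG instance IDs (pagination of the ASG API omitted for brevity).
import boto3
from kubernetes import client, config


def count_nodes_in_asgs(asg_names):
    asg_api = boto3.client('autoscaling')
    groups = asg_api.describe_auto_scaling_groups(AutoScalingGroupNames=asg_names)
    instance_ids = {
        instance['InstanceId']
        for group in groups['AutoScalingGroups']
        for instance in group['Instances']
    }

    # A node's providerID looks like "aws:///eu-west-1a/i-0123456789abcdef0".
    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    return sum(
        1 for node in nodes
        if node.spec.provider_id and node.spec.provider_id.split('/')[-1] in instance_ids
    )
```

The wait loop could then compare this ASG-scoped count against the expected count for those ASGs instead of against the cluster-wide total.
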
@dat-cao-tien-mox

By default CLUSTER_HEALTH_RETRY=1, so it fails quickly (see the README). You need to increase this value, e.g. `export CLUSTER_HEALTH_RETRY=10`. It will then check up to 10 times, which gives the cluster health check enough time to pass, instead of the single attempt you get by default.
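For reference, a rough sketch of how a retry budget like CLUSTER_HEALTH_RETRY typically behaves; the helper name and the sleep interval are made up for illustration, and this is not the project's actual implementation:

```python
import os
import time


def node_count_check_passes(get_node_count, desired_count, wait_seconds=30):
    # Re-check the node count up to CLUSTER_HEALTH_RETRY times before giving up.
    # wait_seconds is an arbitrary pause chosen only for this illustration.
    retries = int(os.environ.get('CLUSTER_HEALTH_RETRY', 1))
    for _ in range(retries):
        if get_node_count() == desired_count:
            return True
        time.sleep(wait_seconds)
    return False
```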

thorro commented Apr 29, 2021

It helps, but it doesn't necessarily solve the issue: sometimes another ASG in the cluster is resized in the meantime, so the count never reaches the expected value.

I've modified the code so it simply gives up after a few tries and applies the changes anyway. That fits our case, since the whole process is monitored by a human anyway.

@dat-cao-tien-mox

@thorro could you please share the code you modified? Maybe I can apply it to my case as well.

thorro commented Apr 29, 2021

Just a simple hack, hardcoded to 5 tries.

diff --git a/eksrollup/lib/k8s.py b/eksrollup/lib/k8s.py
index ad286cf..9cf73f7 100644
--- a/eksrollup/lib/k8s.py
+++ b/eksrollup/lib/k8s.py
@@ -226,6 +226,9 @@ def k8s_nodes_count(desired_node_count, max_retry=app_config['GLOBAL_MAX_RETRY']
     """
     logger.info('Checking k8s expected nodes are online after asg scaled up...')
     retry_count = 1
+    retry_count2 = 0
+    retry_count2_max = 5
+
     nodes_online = False
     while retry_count < max_retry:
         nodes_online = True
@@ -233,10 +236,16 @@ def k8s_nodes_count(desired_node_count, max_retry=app_config['GLOBAL_MAX_RETRY']
         nodes = get_k8s_nodes()
         logger.info('Current k8s node count is {}'.format(len(nodes)))
         if len(nodes) != desired_node_count:
+            retry_count2 += 1
+            if retry_count2 >= retry_count2_max:
+                logger.info('Not waiting for k8s nodes to reach count {} anymore, continuing anyway '.format(desired_node_count))
+                break
+

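Expressed outside the diff, the idea is just a bounded retry counter that eventually stops insisting on an exact match; a standalone sketch of the same logic with hypothetical function names (not the patched `eksrollup/lib/k8s.py`):

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_for_count_or_give_up(get_k8s_nodes, desired_node_count,
                              max_wait_retries=5, wait_seconds=20):
    # Poll the node count, but stop waiting after max_wait_retries attempts
    # and let the rolling update continue anyway.
    for _ in range(max_wait_retries):
        nodes = get_k8s_nodes()
        logger.info('Current k8s node count is %d', len(nodes))
        if len(nodes) == desired_node_count:
            return True
        logger.info('Waiting for k8s nodes to reach count %d...', desired_node_count)
        time.sleep(wait_seconds)
    logger.info('Not waiting for k8s nodes to reach count %d anymore, continuing anyway',
                desired_node_count)
    return False
```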
js-timbirkett commented Jan 25, 2022

@thorro - are you using any other flags or features? Currently, I'm using ASG_NAMES and seeing this issue. I'm also using EXCLUDE_NODE_LABEL_KEYS and labelling a couple of nodes before running eks-rolling-update. I'd be interested to know if you use EXCLUDE_NODE_LABEL_KEYS too.

EDIT: Looks like there is an issue with EXCLUDE_NODE_LABEL_KEYS that is fixed in #117.

thorro commented Jan 25, 2022

Not using it anymore as we switched to EKS managed node groups.

@js-timbirkett

Noticed that the problem here is caused by `if len(nodes) != desired_node_count:`; it should probably be `if len(nodes) < desired_node_count:`.
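A one-line sketch of that relaxed check (hypothetical helper, not the actual `k8s_nodes_count` code): only a shortfall keeps the update waiting, so surplus nodes from other ASGs no longer fail the validation.

```python
def node_count_reached(nodes, desired_node_count):
    # With `<` instead of `!=`, extra nodes (e.g. another ASG scaling up in
    # parallel) are treated as healthy; only too few nodes means keep waiting.
    return len(nodes) >= desired_node_count
```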

@dgem linked a pull request Jun 27, 2022 that will close this issue