Node becomes NotReady #79
Any ideas on this?
All those weird log messages about memory in your logs...
The issue is happening again. Here are the logs from when the node status became NotReady:
kubectl describe node
journalctl
ipamd.log
So we are getting no help from AWS, either via support or here. Issues:
Thus the AMI is not ready for production use! We will have to fork it like the 90 other people who are doing this, probably using LumoLabs or AdvMicrogrid as a base. Very poor support from AWS :( They told us that the image is set up for default workloads. We are not doing anything special... and we lose two nodes a day.
@agcooke thanks for this report, and sorry for the poor support :( We'll take a look into this.
@agcooke Can you email me your Support case ID and Account ID? I'll see what I can track down on my end. mhausler@amazon.com
Thanks @micahhausler, I will reach out to you in email and let people know when we figure out the solution. One thing I did not mention above is that we are using 64GB GP2 EBS volumes on the node. It's very hand-wavy, but it seems that the node that runs Prometheus (which uses a GP2 PVC) flaps to NotReady and back every few days. We will investigate using provisioned-IOPS volumes for the instance hosts and PVCs to see if that helps resolve the issue. LumoLabs also use an ext4 filesystem and, I think, a separate volume for Docker, which could help with the IO issues. But I do not believe that IO is the root cause... we should not be seeing kernel-level stack traces.
I have seen this on 2 clusters in the last few weeks.
We were seeing this frequently. Our current theory is it's related to running out of space on the root filesystem, which caused the kubelet to die silently.
@gflarity Great, we will see if that is the cause. We are now running the v25 AMI and it should rotate logs better, so the disks should not fill up too much. We were having these issues on t2.large and t3.medium nodes. We had a talk with SAs from AWS and they suggested that we try using c5.large or m4.large nodes.
Summary of the issues we are seeing:
Will keep you up to date with the progress.
@joberdick If you can add some logs to help triage the root cause it would be great!
@agcooke I don't think the rotation fix is in v25; it was just merged two days ago. I might be wrong, but when I looked at the dates yesterday, v25 was put out about 8 days ago. When you log into the nodes, is the kubelet silently crashing (i.e. disappearing)? We had this happen a couple of days ago and the root fs was full, so we think the kubelet tried to log before it crashed and couldn't. It was pretty clear from the sys logs that the FS was full.
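Not from the thread, but for anyone hitting this: a minimal sketch of the checks described above, assuming SSH access to the node and the standard systemd unit and log layout of the EKS-optimized AMI.

```bash
# Run on the node itself (requires SSH). Illustrative checks only.

# 1. Is the root filesystem full? Inode exhaustion can also stop the kubelet from logging.
df -h /
df -i /

# 2. Did the kubelet die silently? Check the unit state and its last log lines.
systemctl status kubelet --no-pager
journalctl -u kubelet --no-pager | tail -n 100

# 3. Biggest disk consumers, to confirm the "logs filled the disk" theory.
sudo du -sh /var/log/* /var/lib/docker 2>/dev/null | sort -rh | head
```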
What does
@max-rocket-internet
The pattern we see is that there is no logging indicating why the kubelet dies... not having space to log is a plausible explanation for that, and probably the root cause. Unfortunately our disk usage tracking wasn't set up properly, so we can't go back and confirm this.
I don't get any events when doing kubectl get events. kubectl describe node ip-192-168-145-119.ec2.internal shows:

OutOfDisk   Unknown   Tue, 13 Nov 2018 17:26:36 -0500   Tue, 13 Nov 2018 17:27:20 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
default   sleep-8677f78569-6gvt5   0 (0%)   0 (0%)   0 (0%)   0 (0%)
650m (32%)   2010m (100%)   140Mi (3%)   200Mi (5%)

Unfortunately, this environment does not have a key pair set up for EC2, so the info I will be able to gather is limited. Any suggestions on what else to check without being on the host? I am going to try autoscaling after adding a key pair, but was avoiding touching anything until I hear from AWS support.
/shrug there might be multiple causes of this... Can you ssh into the node? That's how I first diagnosed our issue.
We have been running m4.large and are not seeing the issues anymore. I think the symptoms are hidden by faster network throughput.
Running m4.xlarge and now nodes are dying like crazy... before it was one a week, now it's 3 today already. Of course there is no possibility to ssh, and the Amazon status checks are saying the node is OK (sic!)
@gacopl We are seeing similar behaviour on
Keep waiting. We started around September and we are still on EKS v1; they did not even upgrade us to v2. Choosing EKS was the worst decision. The worst part is that pods do not get rescheduled when a node is NotReady; we will have to set up special healthchecks for the ASG...
@gacopl is it helpful to add additional healthchecks even if the node is still unstable?
@radityasurya what else do you advise, since only terminating the node gets pods rescheduled and the basic ASG checks do not see that the node is unreachable? @agcooke which AMI specifically?
We are using amazon-eks-node-v25 (ami-00c3b2d35bddd4f5c) and I have started working on using the https://github.com/lumoslabs/amazon-eks-ami AMI. If anyone knows how to use a CoreOS AMI or an Ubuntu (conjure-up) AMI, that would be great...
@agcooke So v25 does not solve it, right? I was about to upgrade from v23.
No, just found https://cloud-images.ubuntu.com/docs/aws/eks/
We are working on an AMI release the first week of December with the log rotation patch in. That should fix the issues related to the disk filling up due to logs.
@eswarbala It's not only logs; the fd limits are also too low on the bigger instances.
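To check whether file descriptors are actually the bottleneck, something like the following can be run on a node; these are standard Linux interfaces, nothing here is specific to the EKS AMI.

```bash
# System-wide handles: allocated / unused / maximum
cat /proc/sys/fs/file-nr
# Per-shell soft limit
ulimit -n
# Limits of the running kubelet and dockerd processes
for p in $(pgrep -x kubelet; pgrep -x dockerd); do
  echo "pid $p:"; grep 'Max open files' "/proc/$p/limits"
done
```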
We have an Ubuntu 18.04 AMI forked from this one. Been using it for a prototype cluster, and haven't noticed this issue.
We have set up Cluster Autoscaler and have set resource limits on all the deployments in Kubernetes. Also use an HPA for any deployment that is consuming more resources. After applying these changes we have not faced this issue anymore.
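The commenter's manifests are not shown; as an illustration only, requests/limits and an HPA can be attached to an existing deployment with plain kubectl (the deployment name `my-app` and all values are placeholders).

```bash
# Placeholder deployment name and values; tune to the workload.
kubectl set resources deployment my-app \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi

# Scale between 2 and 10 replicas at roughly 80% CPU utilisation.
kubectl autoscale deployment my-app --cpu-percent=80 --min=2 --max=10
```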
@agcooke were the changes released? We are currently running
@tckb I've seen it suggested that some of these changes will be in the 1.11.8 AMIs, which haven't yet been released but should be out quite soon.
Our EKS cluster (1.11) with AMI ami-0f54a2f7d2e9c88b3 is facing the same issue randomly, and it kills my production services many times per day. I was wondering if upgrading the EKS cluster to 1.12 and using the latest AMI ami-0923e4b35a30a5f53 would solve this problem (following these steps: https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html).
Same issue.
Same issue here.
It seems to be caused by an "Out of Memory" error on the kubelet host. Here are my BootstrapArguments:
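The actual BootstrapArguments were not captured in this thread. As a hedged sketch, arguments of this kind usually pass extra kubelet flags through the EKS AMI's bootstrap script to reserve resources and set eviction thresholds; the flag names below are standard kubelet options, but the values are purely illustrative.

```bash
# Illustrative only; the original values are unknown. Passed as BootstrapArguments
# to /etc/eks/bootstrap.sh so the kubelet keeps headroom for itself and evicts pods
# before the node runs out of memory or disk.
--kubelet-extra-args '--kube-reserved=cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi --system-reserved=cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard=memory.available<200Mi,nodefs.available<10%'
```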
@benjamin658 / others, can you confirm this? I did not see any such errors in the logs.
I'm not 100 percent sure, but after I added the BootstrapArguments, our cluster is working well now.
Having the same issue. EKS
@dijeesh did you try the suggestions from @benjamin658?
I'm experiencing the same issues. The problems started when I installed
Nodes running
I have weavescope installed in my cluster and when looking at
edit: ran into a CNI issue as well (network addresses per host exhausted, reached pod limit aka "insufficient pods")
I thought reserving resources for the kubelet was default/built-in behavior of current k8s, but it sounds like it is optional and EKS doesn't do it 😢 Reserving resources for the kubelet is extremely important when you run overcommitted workloads (collections of spiky workloads), i.e. any time resource limits >= requests or you don't specify resource limits. Under node resource exhaustion you want some workloads to be rescheduled, not entire nodes to go down. If you are using small nodes, failures like this will be more common. Plus you have the low EKS pod limit caused by ENI limitations. I'd suggest reserving some system resources on each node, and using fewer, larger nodes.
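One quick way to see whether a node reserves anything for the kubelet/system is to compare its capacity with its allocatable resources; if the two are identical, nothing is reserved. A small sketch (the node name is a placeholder):

```bash
NODE=ip-192-168-145-119.ec2.internal   # placeholder node name
kubectl get node "$NODE" -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
# If cpu/memory match in both maps, no --kube-reserved/--system-reserved is in effect.
```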
This still happens on EKS 1.13. It started to happen when the cluster was running under some really high load.
Happening to me as well, looking at
I think this might be related? eksctl-io/eksctl#795
We are seeing similar behaviour; it appears to be almost random, or possibly coincides with a deployment. A node or two will suddenly appear to be NotReady, while resource graphs indicate utilisation is hardly over 50%, so OOM shouldn't be an issue. As mentioned by @AmazingTurtle, we are also on 4-5 . In line with @montanaflynn, the node suddenly has the following taints applied:
I'm going to try increasing node size and adding some resource limits to deployments that may not have them correctly configured. |
@JamesDowning Have you had a look at Fargate? Using the CDK you can bootstrap apps so easily without worrying about infrastructure. It just runs. I gave it a shot a couple of days ago and it's just sexy and works the way I want, not like EKS.
I'm getting this on a t3.small node
What is adding these taints and will they ever get removed? |
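Those taints (typically node.kubernetes.io/not-ready and node.kubernetes.io/unreachable) are added by the Kubernetes node lifecycle controller when the kubelet stops reporting, and they are removed automatically once the node reports Ready again. To inspect them (the node name is a placeholder):

```bash
# List the taints currently set on a node.
kubectl get node ip-10-0-0-1.ec2.internal \
  -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
```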
Seeing this on Amazon EKS 1.26. |
What was the resolution to this issue, as it still persists on EKS v1.23?
We are also facing this issue on a daily basis. Any resolution for this?
@dimittal this can happen for many reasons, please open a new issue with details of your environment and symptoms.
We are running EKS in Ireland and our nodes are going unhealthy regularly.
It is not possible to SSH to the host, and pods are not reachable. We have experienced this with t2.xlarge, t2.small and t3.medium instances.
We could ssh to another node in the cluster and ping the NotReady node, but are not able to ssh to it either.
Graphs show the memory goes high at about the same time as the journalctl logs below. The EBS IO also goes high. The exact time is hard to pinpoint. I added logs with interesting 'failures' around the time that we think the node disappeared.
We are using the cluster for running tests, so pods are getting created and destroyed often.
We have not done anything described in #51 for log rotation.
Cluster Information:
CNI: Latest daemonset with image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.2.1
Region: eu-west-1
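For reference, the CNI image actually in use can be confirmed with a standard kubectl query (the output for this cluster is not reproduced here):

```bash
# Confirm which amazon-k8s-cni image the aws-node daemonset is running.
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```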
LOGS
Node AMI
File system
kubectl describe node
journalctl logs around the time
plugin logs
ipamd.log
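The log excerpts themselves did not survive the export. As a rough guide, the following commands would collect the same information on an affected node; the log paths assume the usual layout of the EKS-optimized AMI and amazon-vpc-cni-k8s, so adjust if yours differs.

```bash
# Assumed paths and commands for the items listed above.
cat /etc/os-release                                    # Node AMI / OS details
df -h; df -i                                           # File system usage and inodes
kubectl describe node "$(hostname -f)"                 # node conditions and events
journalctl -u kubelet --since "1 hour ago" --no-pager  # journalctl logs around the time
sudo tail -n 200 /var/log/aws-routed-eni/plugin.log    # CNI plugin logs
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log     # ipamd.log
```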