Nodes become unresponsive and don't recover with soft lockup error #454
Ever since I moved to the new kernel for 1.14, I have been facing a lot of issues with the nodes, and nodes are going down quite frequently. Sometimes a node stays in Ready status but becomes unresponsive, and sometimes it just flips to NotReady. Is there a kernel bug that I am hitting? The following soft lockup messages are seen:
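As a sketch, soft lockup messages like these can be pulled from a node's kernel log as follows:

```bash
# Look for soft lockup traces in the kernel ring buffer and the journal.
dmesg -T | grep -i 'soft lockup'
journalctl -k --since "2 days ago" | grep -i 'soft lockup'
```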
Ideally, with the change in #367, kubelet should have plenty of resources, and if there are pods consuming a lot of resources, it should simply evict them rather than the node crashing. I dug into the customizations in detail; the following customizations were applied to the AMI used in this environment:
1b. docker daemon json
2a. 20-nproc.conf
2b. sysctl.conf
resolv.kubelet is the following: EDIT: Edited this post after removing some customizations. |
Same here, subscribing. |
We had the same problem upgrading our worker nodes to the Amazon EKS 1.15 AMI. We tried:
and both had the same problem with the pods we run. Syslog on the worker node reports:
As a workaround, we took the official Ubuntu EKS 1.15 image. Note: Using |
@colandre Do you use the stock AMI recommended by AWS, or do you bake your own AMI? If so, what customizations are you using? It looks like the issue might be arising due to high resource utilization. |
@cshivashankar , I used stock Amazon AMIs:
no customizations.
|
@colandre Did you reach out to AWS support? Any input from them? I guess this thread has little visibility as it was created only a few days ago. As per AWS, there should not be any issues if a stock AMI is being used, so it's definitely worth following up with them. I am also curious about the solution to this problem. |
@cshivashankar , unfortunately we do not have the Business support plan, so technical cases cannot be opened. |
@colandre I think the only way to get attention on this thread is to tag a collaborator or contributor of this repo :) |
We're cautiously optimistic that upgrading our CNI plugin to 1.6.1 has resolved the issue for us. HTH |
I work with @eeeschwartz. We no longer think that the CNI upgrade fixed this for us. What did fix it was switching to instances with local SSD drives, such as r5d.2xlarge, and using that SSD drive for /var/lib/docker as suggested in #349. Our actual experience with managing that filesystem was a bit different than suggested in that link. We found that dockerd needs to be stopped and started as part of the userdata script, even though we placed our code fragment in the "pre-userdata", i.e. as the very first steps. The following appears to be working for us:
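A minimal sketch of a pre-userdata fragment along the lines described above (not the original script; the NVMe device name and filesystem choice are assumptions):

```bash
# Sketch: put Docker's data root on the instance-store NVMe SSD.
# Assumes the ephemeral SSD appears as /dev/nvme1n1; adjust for your instance.
systemctl stop docker

mkfs.xfs /dev/nvme1n1                 # format the ephemeral SSD
mkdir -p /var/lib/docker
mount /dev/nvme1n1 /var/lib/docker    # Docker now writes to the local SSD

systemctl start docker
```

Note that instance-store data does not survive stopping the instance, which is acceptable here since /var/lib/docker only holds rebuildable image and container state.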
|
@jae-63 Moving the Docker daemon to NVMe should ideally provide better performance and I/O. Based on your experience with the issue, did you see any correlation between the issue and the IOPS of the volumes? Did you face any issues with EBS volumes with very high IOPS? |
@cshivashankar For cost reasons [or our perception of the costs], we didn't pursue IOPS-provisioned EBS volumes, so I don't have any data to provide; we were previously just using vanilla 'gp2' drives. I think one could use IOPS-provisioned (i.e. "io1") drives instead of following our present solution, which would be a bit simpler. Ephemeral SSDs attached to ephemeral instances, as in our use case, are a nice match. It seems likely that our (new) SSD drives are overprovisioned w.r.t. size, in light of the available CPU/RAM/SSD size ratios; other users might use more or less disk than we do. BTW, we down-sized the root EBS partition to 20GB, but even 10GB would probably be sufficient. |
Hi @jae-63, thanks for the info. Are there any easy ways/instructions for replicating this issue? Hi @Jeffwan, do you see any potential bug or limitation that could be causing these soft lockups? It looks like more people are facing these soft lockup issues, and some of them are seeing it on the plain stock AMI recommended by AWS. It would be great to know your opinion on what might be causing these issues. |
@cshivashankar I don't see an easy way to reproduce this. The two affected clusters are mostly used for code building and deployment (primarily the former) using self-hosted Gitlab. As you may know, in ".gitlab-ci.yml" files, one invokes lots of different containers for a brief period of time, e.g. corresponding to building a Docker container. We recently started using the caching capabilities of the 'kaniko-project' https://cloud.google.com/cloud-build/docs/kaniko-cache for many of our builds. I'm pretty sure these are stock AMIs we're using. One cluster is still running EKS 1.15 and the other is EKS 1.16. They are 'ami-065418523a44331e5' and 'ami-0809659d79ce80260' respectively, in us-west-2. We did open an AWS ticket for this lockup issue a few weeks ago, but not much was accomplished on it, and it's now closed. I don't know whether it's considered OK to post an AWS ticket number here, or perhaps there's a way for us to communicate privately. Perhaps you'd find some clues there. In principle I'd be willing to revert to using regular 'r5' instances for a few hours (AFTER HOURS) rather than the 'r5d' ones we're currently using. That should reproduce the problem although the load is likely to be much lighter than during the daytime, when our developers are actively using Gitlab. In our experience we continued to observe these lockups after hours. But you'd need to tell me in advance, in detail, what diagnostics to try to collect. |
Hi @jae-63, my initial analysis also pointed to high IOPS. I tried to simulate the error by subjecting a node to high CPU and IOPS, but somehow the soft lockup doesn't get reproduced in the dev environment, so it's becoming quite tricky to understand what's causing it. I had opened an AWS ticket and am working actively with an engineer, but the lack of reproduction steps has become a big hurdle. I am not from Amazon, so I don't know if it's OK to share the ticket number; I can check and confirm. The main diagnostics that can help are a kernel crash dump, logs collected from the log collector script, and system metrics, maybe using SAR or similar tools. |
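A hedged sketch of gathering the metrics and logs mentioned above on a node (sar comes from the sysstat package; the log collector script lives in the awslabs/amazon-eks-ami repo under log-collector-script/, whose exact layout may differ):

```bash
# System metrics: sample CPU, disk, and run-queue activity every 10s for ~1h.
sudo yum install -y sysstat
sar -u -d -q 10 360 > /tmp/sar-$(hostname).log

# Node logs: run the EKS log collector script downloaded from the repo above.
sudo bash eks-log-collector.sh
```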
Hi @cshivashankar . Sorry I had misunderstood ... I thought you were AWS staff and were seeking customer help to resolve this for the larger community. In light of the fact that you, like me, don't work for AWS I'm going to excuse myself from further activity here, because I don't think I have much to add beyond the workaround I have already provided. |
Can those of you experiencing this issue please try the following on your nodes, reboot, and see if you're still experiencing the issue?
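A hedged sketch of one way to move an Amazon Linux 2 node to a newer kernel; whether this matches the commands suggested here is an assumption, and the kernel-ng extras topic (which tracked 5.4 at the time) is used for illustration:

```bash
# Hypothetical reconstruction: install the newer kernel line from
# amazon-linux-extras, then reboot into it.
sudo amazon-linux-extras install -y kernel-ng
sudo reboot

# After the node comes back up, verify:
uname -r    # should report a 5.x kernel
```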
|
Alternatively, for those of you experiencing this issue, if you switch to c4/r4/m4 instances instead of c5/r5/m5, does it make a difference? |
Can you please provide more details on how this solves the issue? I am OK to try it if it addresses a known issue or bug. |
@cshivashankar There have been a number of cgroup-related bugs fixed in kernel 5.3 and later. This is the Linux kernel that comes with the |
Hi @otterley, thanks for your feedback. I will definitely give it a try and report back. Is there any publicly available list of bug fixes to get an idea of what changed, or any specific fix that could address the above issue? I am curious to know what exactly is causing this issue; any insights on that? |
Unfortunately I'm not aware of any identifiable bug fixes that directly relate to this issue. But with your assistance we can hopefully narrow this down to a root cause. |
Hi @otterley, I will be glad to help. Let me test these changes and report back. Due to the nature of the issue and the difficulty of reproducing it, it might take some time to say whether it really solved the issue or not. Meanwhile, is there any way I can check the kernel source code and try to debug, other than using yumdownloader for the source (https://aws.amazon.com/amazon-linux-2/faqs/)? |
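For reference, a minimal sketch of the yumdownloader route mentioned above (the source repository may need to be enabled, depending on the AMI):

```bash
# Pull and unpack the Amazon Linux 2 kernel source RPM for inspection.
sudo yum install -y yum-utils            # provides yumdownloader
yumdownloader --source kernel            # downloads kernel-*.src.rpm
rpm2cpio kernel-*.src.rpm | cpio -idmv   # unpack sources and applied patches
```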
I'm actually a Containers Specialist Solutions Architect and am not on the kernel team. I'm just volunteering my time to help out customers like you as best I can. :) Also, it would be good to know if this issue is reproducible on 4th generation instances. |
Thank you, that's great to hear :). |
Is the issue still occurring randomly, but you're unable to induce it yourself? Or has the issue resolved itself spontaneously? |
I am still getting issues, but they're random and less frequent. All of a sudden a node will be gone with a soft lockup in the prod environment. |
I've been running this for a bit and haven't yet seen any related errors in dmesg. |
@JacobHenner I was able to see issues initially when I created a ton of pods at once using that script, but when I tried scaling in a more controlled manner, I didn't see it. Still struggling to get a consistent repro on new AMIs that doesn't always break the older AMIs. |
That's good to hear :).
|
Was this based on the limited feedback in this ticket, or were there other customers who indicated this fixed the issue as well? I'm in the process of applying this workaround to my clusters, and I'd like to have a better understanding of the likelihood that it'll either succeed or fail @mmerkes. Also, I'm curious as to why this isn't impacting more EKS users. Seems like this would be a frequently encountered issue. |
@JacobHenner It's definitely possible that there are customers who don't notice or aren't reporting, but it's also unclear how common the scenario is, given that we don't fully understand it. Of course, there are no guarantees that upgrading the kernel will resolve the issue, but so far it has worked for every customer that has reported trying it, including some support tickets that aren't captured in this GitHub issue. We're also working with the AmazonLinux team to create a mechanism for customers to upgrade their kernel to 5.4 in a way that pins them to 5.4 rather than using the |
I have now been able to reproduce this issue consistently in my environment. The reproducer triggers the issue with the 4.14 kernel in the 1.14 and 1.16 images, but it hasn't yet triggered the issue with the 5.4 kernel in the 1.16 image. I will attempt to figure out what combination of conditions is causing the issue, as well as develop a reproducer that I can share (the current one includes data which I cannot share). |
I think it would be possible to easily reproduce this issue with a pod that has an init container which clones a git repo with a large number of files (possibly the Linux kernel source) into an emptyDir. The main container should then either move or delete these files. I believe this combination (creating and moving/deleting a large number of files in a short time) is key to recreating the issue, but I don't have any evidence besides the fact that this is what we were doing and we were hitting it very, very consistently. Updating to the 5.4 kernel completely fixed this. Hope this helps! |
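A hedged sketch of the kind of pod described above (not the commenter's actual manifest; the repo, images, and replica count are illustrative assumptions):

```bash
# Sketch: init container clones a large repo into an emptyDir; the main
# container then copies and deletes it in a loop to churn lots of files.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-churn-repro
spec:
  replicas: 8
  selector:
    matchLabels: {app: file-churn-repro}
  template:
    metadata:
      labels: {app: file-churn-repro}
    spec:
      volumes:
      - name: scratch
        emptyDir: {}
      initContainers:
      - name: clone
        image: alpine/git
        args: ["clone", "--depth=1", "https://github.com/torvalds/linux.git", "/scratch/linux"]
        volumeMounts: [{name: scratch, mountPath: /scratch}]
      containers:
      - name: churn
        image: busybox
        command: ["sh", "-c", "while true; do cp -r /scratch/linux /scratch/copy && rm -rf /scratch/copy; done"]
        volumeMounts: [{name: scratch, mountPath: /scratch}]
EOF
```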
@JacobHenner That's good news! In summary, what does your reproducer do? @brettplarson Thanks for the details. I can try that. I'm glad the kernel upgrade fixed it for you. |
The reproducer runs 8 replicas of a container which downloads and extracts a 22 MB (277 MB uncompressed) gzipped tarball in a tight loop. At first I figured this might be related to the use of an emptyDir, as most of the pods which seemed to trigger this issue used emptyDirs, but the reproducer was able to reproduce the node failures both with and without emptyDirs. I was also able to reproduce the lockup behavior on m5.2xlarge and m4.2xlarge instances, although the log messages were different between the two. |
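A rough bash sketch of that per-pod inner loop (the real tarball and image weren't shared, so the URL below is a placeholder):

```bash
#!/bin/sh
# Fetch a gzipped tarball and extract it repeatedly to generate heavy,
# bursty filesystem writes. TARBALL_URL is hypothetical.
TARBALL_URL="https://example.com/test-archive.tar.gz"
while true; do
  curl -sSfL -o /tmp/archive.tar.gz "$TARBALL_URL"
  rm -rf /tmp/extracted && mkdir -p /tmp/extracted
  tar -xzf /tmp/archive.tar.gz -C /tmp/extracted
done
```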
I wrote a script to git clone a repo and then compress and decompress it in a loop. I've run different variations of this on m5.xlarge nodes with 8-16 pods per node. Haven't seen the soft lockup yet, but still trying things. Primarily sharing as an update. |
Can confirm that upgrading to 5.4 has so far resolved the issue for us too. We had a pretty reliable repro (when our Portworx cluster kicked off scheduled full cloud snapshots of ~400 volumes at the same time every week) and have rolled out changes over the last few weekends:
|
I've shared my team's reproducer with the AWS team. I cannot share the gzipped tarball itself here, but it's essentially a few hundred MB of small files in many directories. Running 8 replicas of |
If it helps, I noticed in the logs of one of these tar-extraction pods:
However, if this was actually an OOM condition, wouldn't the container have just been killed? |
@JacobHenner Thanks for sharing your reproducer! I was able to reproduce the issue on 4.14, so that's great news. It seems to also cause the issue on the older AMI that others didn't seem to have a problem with, but now we've got something to work with. I will share this with the AmazonLinux team and dig further myself so that we can figure out what's going on and get it fixed. |
Hi @mmerkes |
@hrzbrg You're asking if we can trigger a release of the AMIs maintained by AmazonLinux rather than the EKS-maintained AMIs? If so, we can't. They have a process for releasing new AMIs with the latest kernels, and I don't know when that will happen, but it will happen eventually. For this particular issue, updating the kernel doesn't seem to help. As a quick status update on the soft lockup issue: the AmazonLinux team has the repro and is actively working on it. Once the issue is identified, they will hopefully be able to fix it, test it, and release a new patch for the kernel. Once we have a verified fix and it's available, EKS will release new AMIs for customers to use, or you can |
We've been able to root cause the issue with the AmazonLinux team. When containers have a write-heavy workload and run on IOPS-constrained EBS volumes, EBS starts to throttle IOPS. Because of this throttling, the kernel is unable to flush dirty pages to disk. The dirty-page limit is even more constrained for each cgroup/container and is directly proportional to the memory requested by the container. When the number of dirty pages for a container increases, the kernel tries to flush the pages to disk. In the 4.14 kernel, the code which flushes these pages to disk does wasteful work in building up the queue. We are working on backporting this patch to the 4.14 kernel so it can be released with an EKS optimized AMI.
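To observe the mechanism described above on a node, the node-wide and per-cgroup dirty/writeback counters can be inspected; a sketch assuming cgroup v1 (which these AMIs use), with an illustrative kubepods path:

```bash
# Node-wide dirty and writeback page counts.
grep -E 'nr_dirty |nr_writeback ' /proc/vmstat

# Per-container dirty/writeback bytes from the cgroup v1 memory controller;
# the glob below matches burstable/besteffort pods and is illustrative only.
for f in /sys/fs/cgroup/memory/kubepods/*/pod*/*/memory.stat; do
  echo "== $f"
  grep -E '^(dirty|writeback) ' "$f"
done
```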
@mmerkes That's great news. Is there a bug/PR on GitHub for this kernel issue? |
@cshivashankar I don't think there's an issue on the AmazonLinux side, though I will update here when it's available. My suspicion is that there's a relation to some of the PLEG issues, but without a repro specifically on the PLEG issues, I can't say for certain. Did upgrading the kernel to 5.4 make those go away? |
@mmerkes I was referring to a bug/PR in the upstream Linux repo, if any. |
@mmerkes Could you link us to the kernel patch? |
@rphillips @cshivashankar Here is a thread that discusses the issue and I believe this is the upstream patch that resolves the issue. |
The patch has been merged into the AmazonLinux kernel. It'll be available in the next kernel release.
|
@mmerkes I noticed that AMI version |
@njtman Yes, that is correct! The latest EKS optimized AMIs include the patch that fixes the soft lockup issue. Any AMI with Packer version |
Status Summary
This section was added by @mmerkes.
The AWS EKS team has a consistent repro and has engaged with the AmazonLinux team to root cause the issue and fix it. AmazonLinux has merged a patch that solves this issue in our repro and should work for customers, and it is now available via `yum update`. Once you've updated your kernel and rebooted your instance, you should be running `kernel.x86_64 0:4.14.203-156.332.amzn2` (or greater). All EKS optimized AMIs of Packer version `v20201112` or later include this patch. Users have 2 options for fixing this issue:

1. Run `yum update` on existing nodes and reboot to pick up the patched kernel, or
2. Launch nodes from an EKS optimized AMI of Packer version `v20201112` or later.
Here are the commands you need to patch your instances:
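A sketch consistent with the summary above (not necessarily the exact commands from the original post):

```bash
# On each affected node (e.g. via SSH or SSM):
sudo yum update -y kernel      # pulls in 4.14.203-156.332.amzn2 or newer
sudo reboot

# After the node comes back up, confirm the running kernel:
uname -r                       # expect 4.14.203-156.332.amzn2 or greater
```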
Original Issue
The original content below is from @cshivashankar.
What happened:
A node in the cluster becomes unresponsive, and the pods running on it also become unresponsive.
As per the analysis and logs provided in AWS Case 6940959821, we were informed that this occurs when high IOPS is observed and a soft lockup happens, which causes the node to become unresponsive. Further investigation might be required.
What you expected to happen:
The node should not crash or become unresponsive; if it does, the control plane should identify it and mark it NotReady. The state should be either that the node is Ready and working properly, or that it is unresponsive, marked NotReady, and eventually removed from the cluster.
How to reproduce it (as minimally and precisely as possible):
As per the analysis in AWS case 6940959821, the issue can be reproduced by driving higher IOPS than the EBS volume's capacity for a sustained amount of time.
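One way to generate that kind of sustained load, as a hedged sketch (the fio parameters are illustrative, not taken from the AWS case):

```bash
# Drive small random writes against a directory on the node's EBS root volume
# until its IOPS/burst capacity is exhausted; tune size, jobs, and runtime.
sudo yum install -y fio
sudo fio --name=ebs-saturate --directory=/var/lib/docker \
     --rw=randwrite --bs=4k --size=2G --numjobs=8 --iodepth=32 \
     --ioengine=libaio --direct=1 --time_based --runtime=1800 --group_reporting
```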
Anything else we need to know?:
This issue has been observed only recently, and I want to rule out whether it is due to using the 1.14 AMI, as we never observed it in 1.13. Is there a kernel bug that I am hitting? For building the AMI, I cloned the "amazon/aws-eks-ami" repo and made the following changes:
1. Installed the Zabbix agent
2. Ran the kubelet with the "--allow-privileged=true" flag, as I was getting issues with cAdvisor.
So the AMI being used is practically the same as the AWS EKS AMI.
Changes are mentioned in the following comment.
Logs can be accessed in the AWS Case mentioned above
Environment:
- Platform version (`aws eks describe-cluster --name <name> --query cluster.platformVersion`): "eks.9"
- Kubernetes version (`aws eks describe-cluster --name <name> --query cluster.version`): "1.14"
- Kernel (`uname -a`): Linux 4.14.165-133.209.amzn2.x86_64 #1 SMP Sun Feb 9 00:21:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Release information (`cat /etc/eks/release` on a node):