
Nodes become unresponsive and don't recover with soft lockup error #454

Closed
cshivashankar opened this issue Apr 23, 2020 · 83 comments

@cshivashankar

cshivashankar commented Apr 23, 2020

Status Summary

This section was added by @mmerkes.

The AWS EKS team has a consistent repro and has engaged with the AmazonLinux team to root cause the issue and fix it. AmazonLinux has merged a patch that solves this issue in our repro and should work for customers, and it is now available via yum update. Once you've updated your kernel and rebooted your instance, you should be running kernel.x86_64 0:4.14.203-156.332.amzn2 (or greater). All EKS optimized AMIs built with Packer version v20201112 or later include this patch. Users have two options for fixing this issue:

  1. Upgrade your nodes to use the latest EKS optimized AMI
  2. Patch your nodes with yum update

Here are the commands you need to patch your instances:

sudo yum update kernel
sudo reboot
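
After the reboot, you can confirm that the patched kernel is running, for example:

# Should print 4.14.203-156.332.amzn2.x86_64 or a newer version
uname -r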

Original Issue

This is the original content from @cshivashankar.

What happened:
A node in the cluster becomes unresponsive, and the pods running on it also become unresponsive.
As per the analysis and logs provided in AWS Case 6940959821, we were informed that this is observed when sustained high IOPS triggers a soft lockup, which causes the node to become unresponsive. Further investigation might be required.

What you expected to happen:
The node should not crash or become unresponsive. If it does, the control plane should identify it and mark it NotReady. The state should be either: the node is ready and working properly, or the node is unresponsive, marked NotReady, and eventually removed from the cluster.

How to reproduce it (as minimally and precisely as possible):
As per the analysis in AWS Case 6940959821, the issue can be reproduced by driving IOPS above the EBS volume's capacity for a sustained amount of time.

Anything else we need to know?:
This issue has only been observed recently, and I want to rule out whether it is due to using the 1.14 AMI, as we never observed it on 1.13. Is there a kernel bug that I am hitting? For building the AMI, I cloned the "amazon/aws-eks-ami" repo and made the following changes:
1. Installed the Zabbix agent
2. Ran the kubelet with the "--allow-privileged=true" flag, as I was getting issues with cAdvisor.
So the AMI being used is practically the same as the AWS EKS AMI.

Changes mentioned in the following comment

Logs can be accessed in the AWS Case mentioned above
Environment:

  • AWS Region: us-east-1
  • Instance Type(s): r5 , c5 types
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.9"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.14"
  • AMI Version: 1.14
  • Kernel (e.g. uname -a): Linux 4.14.165-133.209.amzn2.x86_64 #1 SMP Sun Feb 9 00:21:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
$ cat /etc/eks/release
BASE_AMI_ID="ami-08abb3d74e734d551"
BUILD_TIME="Mon Mar  2 17:21:42 UTC 2020"
BUILD_KERNEL="4.14.165-131.185.amzn2.x86_64"
ARCH="x86_64"
@cshivashankar changed the title from Nodes become unresponsive and doesnt recover to Nodes become unresponsive and doesnt recover with soft lockup error on May 3, 2020
@cshivashankar
Author

cshivashankar commented May 3, 2020

Ever since I moved to the new kernel for 1.14, I am facing a lot of issues with the nodes, and nodes are going down quite frequently. Sometimes a node becomes unresponsive while still showing Ready status, and sometimes it just flips to NotReady. Is there a kernel bug that I am hitting?
If any of the customizations are not relevant to 1.14, I can make the changes and test.
Any suggestions would be greatly welcome.

The following soft lockup messages are seen:

kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 787.
kernel: watchdog: BUG: soft lockup - CPU#58 stuck for 23s! [kworker/u146:7:707658]
kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xfrm_user xfrm_algo br_netfilter bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_recent ipt_REJECT nf_reject_ipv4 xt_addrtype xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_conntrack nf_nat nf_conntrack overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel psmouse evdev aes_x86_64 crypto_simd glue_helper pcc_cpufreq button cryptd ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
kernel: CPU: 58 PID: 707658 Comm: kworker/u146:7 Tainted: G             L  4.14.165-133.209.amzn2.x86_64 #1
kernel: Hardware name: Amazon EC2 c5.18xlarge/, BIOS 1.0 10/16/2017
kernel: Workqueue: writeback wb_workfn (flush-259:0)
kernel: task: ffff8893fefa0000 task.stack: ffffc9002daec000
kernel: RIP: 0010:__list_del_entry_valid+0x28/0x90
kernel: RSP: 0018:ffffc9002daefcc0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
kernel: RAX: ffff88916d19b470 RBX: ffffc9002daefce8 RCX: dead000000000200
kernel: RDX: ffff88804b1f36c8 RSI: ffff888013237e08 RDI: ffff888013237e08
kernel: RBP: ffff88916d19b470 R08: ffff889488d1eb48 R09: 0000000180400037
kernel: R10: ffffc9002daefe10 R11: 0000000000000000 R12: ffff88916d608800
kernel: R13: ffff888013237e08 R14: ffffc9002daefd78 R15: ffff889488d1eb48
kernel: FS:  0000000000000000(0000) GS:ffff88a371380000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000010923314f000 CR3: 000000000200a002 CR4: 00000000007606e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: move_expired_inodes+0x6a/0x230
kernel: queue_io+0x61/0xf0
kernel: wb_writeback+0x258/0x300
kernel: ? wb_workfn+0xdf/0x370
kernel: ? __local_bh_enable_ip+0x6c/0x70
kernel: wb_workfn+0xdf/0x370
kernel: ? __switch_to_asm+0x41/0x70
kernel: ? __switch_to_asm+0x35/0x70
kernel: process_one_work+0x17b/0x380
kernel: worker_thread+0x2e/0x390
kernel: ? process_one_work+0x380/0x380
kernel: kthread+0x11a/0x130
kernel: ? kthread_create_on_node+0x70/0x70
kernel: ret_from_fork+0x35/0x40

Ideally, with change #367, the kubelet should have plenty of resources, and if there are pods consuming a lot of resources it should just evict them rather than the node crashing.

I dug into the customizations in detail; the following are the customizations made to the AMI used in our environment:

Customizations:

  1. Docker daemon related

    1a. Docker env
    docker.env is used

	        # The max number of open files for the daemon itself, and all
			# running containers.  The default value of 1048576 mirrors the value
			# used by the systemd service unit.
			DAEMON_MAXFILES=1048576

			# Additional startup options for the Docker daemon, for example:
			# OPTIONS="--ip-forward=true --iptables=true"
			# By default we limit the number of open files per container
			# Options are moved to docker-daemon.json
			#OPTIONS="--default-ulimit nofile=262144:1048576 --max-concurrent-downloads=100 --max-concurrent-uploads=100 --iptables=false"

			# How many seconds the sysvinit script waits for the pidfile to appear
			# when starting the daemon.
			DAEMON_PIDFILE_TIMEOUT=10

1b. Docker daemon JSON
Default ulimits are changed and iptables is set to false:

			{
			  "bridge": "none",
			  "log-driver": "json-file",
			  "log-opts": {
				"max-size": "10m",
				"max-file": "10"
			  },
			  "live-restore": true,
			  "iptables": false,
			  "max-concurrent-downloads": 100,
			  "max-concurrent-uploads": 100,
			  "default-ulimits": {
				"nofile": {
				  "Name": "nofile",
				  "Hard": 1048576,
				  "Soft": 262144
				}
			  }
			}
  2. System customizations:

2a. 20-nproc.conf
Limits user processes to avoid fork bombs:


			# Default limit for number of user's processes to prevent
			# accidental fork bombs.
			# See rhbz #432903 for reasoning.

			*          soft    nproc     3934939
			root       soft    nproc     unlimited

2b. sysctl.conf

			# Filesystem

			# Maximum number of allowable concurrent requests
			fs.aio-max-nr = 1048576
			 
			# 1M max_user_watches takes extra 1Gb of kernel memory per real UID
			fs.inotify.max_user_watches = 1048576
			fs.inotify.max_user_instances = 2048


			# Kernel

			# Prevents Docker from crashing with:
			# > runtime/cgo: pthread_create failed: Resource temporarily unavailable
			kernel.pid_max = 1048576


			# Network

			# Max number of packets queued when the interface receives packets faster than
			# kernel can process them. Default is 1000
			net.core.netdev_max_backlog = 5000

			# Max rx/tx buffer sizes. Default 212k sounds too small for our 10g links, but
			# setting some very high numbers may have nasty consequences too
			net.core.rmem_max = 8388608
			net.core.wmem_max = 8388608

			# Threshold levels for ARP cache increased after neighbor table overflowed on DL1
			net.ipv4.neigh.default.gc_thresh1 = 4096
			net.ipv4.neigh.default.gc_thresh2 = 8192
			net.ipv4.neigh.default.gc_thresh3 = 8192

			# Number of memory-mapped areas per process used by elasticsearch and other
			# software that uses mmap
			vm.max_map_count = 262144
		

2c. systemd.conf
[Manager]
# Cgroup controllers here are joined to make system.slice appear in the common
# cgroup hierarchy. This way kubelet is able to use "system.slice" as
# --system-reserved to isolate system resources from pod resources
JoinControllers=cpu,cpuacct,cpuset,net_cls,net_prio,hugetlb,memory

3. Kubelet customizations

	The following is the kubelet command line from ps.
	The eviction and system-reserved options might be removed, as they are set by default.

/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --read-only-port=10255 --network-plugin cni --node-ip=10.77.94.123 --pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1 --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --kube-api-qps=30 --kube-api-burst=50 --event-qps=30 --event-burst=50 --registry-qps=0 --enforce-node-allocatable=pods --eviction-hard=imagefs.available<15%,memory.available<5%,nodefs.available<10%,nodefs.inodesFree<5% --serialize-image-pulls=false --v=3 --pod-max-pids=8192 --node-labels=node-role.kubernetes.io/spot-worker --resolv-conf=/etc/resolv.kubelet

/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --read-only-port=10255 --network-plugin cni --node-ip=10.77.94.123 --pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1 --eviction-hard=imagefs.available<15%,memory.available<5%,nodefs.available<10%,nodefs.inodesFree<5% --serialize-image-pulls=false --v=3 --pod-max-pids=8192 --resolv-conf=/etc/resolv.kubelet

resolv.kubelet is generated as follows:
/usr/bin/grep -v "^search " /etc/resolv.conf > /etc/resolv.kubelet

EDIT: Edited this post after removing some customizations.

@edubacco

same here, subscribe

@colandre

We had the same problem after upgrading our worker nodes to the Amazon EKS 1.15 AMI. We tried:

  • amazon-eks-node-1.15-v20200507
  • amazon-eks-node-1.15-v20200423

and both had the same problem.

We have pods with initContainers copying about 1 GB of small files (a WordPress install), and during the copy, in the Init phase, the worker nodes hang, becoming completely unresponsive.

Syslog on the worker node reports:

[  288.053638] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [kworker/u16:2:62]
[  288.059141] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_defrag_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel aes_x86_64 crypto_simd evdev glue_helper psmouse cryptd button ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
[  288.185290] CPU: 5 PID: 62 Comm: kworker/u16:2 Tainted: G             L  4.14.177-139.253.amzn2.x86_64 #1
[  288.191527] Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
[  288.195344] Workqueue: writeback wb_workfn (flush-259:0)
[  288.198708] task: ffff888184670000 task.stack: ffffc90003360000
[  288.202280] RIP: 0010:move_expired_inodes+0xff/0x230
[  288.205542] RSP: 0018:ffffc90003363cc8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[  288.211042] RAX: 00000000ffffa056 RBX: ffffc90003363ce8 RCX: dead000000000200
[  288.215031] RDX: 0000000000000000 RSI: ffffc90003363ce8 RDI: ffff8887068963c8
[  288.219040] RBP: ffff888802273c70 R08: ffff888706896008 R09: 0000000100400010
[  288.223047] R10: ffffc90003363e10 R11: 0000000000025400 R12: ffff8888227f6800
[  288.227062] R13: ffff888706896788 R14: ffffc90003363d78 R15: ffff888706896008
[  288.231071] FS:  0000000000000000(0000) GS:ffff888822740000(0000) knlGS:0000000000000000
[  288.236761] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  288.240282] CR2: 00007f5b703af570 CR3: 000000000200a005 CR4: 00000000007606e0
[  288.244306] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  288.248328] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  288.252351] PKRU: 55555554
[  288.254821] Call Trace:
[  288.257178]  queue_io+0x61/0xf0
[  288.259798]  wb_writeback+0x258/0x300
[  288.262600]  ? wb_workfn+0xdf/0x370
[  288.265323]  ? __local_bh_enable_ip+0x6c/0x70
[  288.268370]  wb_workfn+0xdf/0x370
[  288.271040]  ? __switch_to_asm+0x41/0x70
[  288.273944]  ? __switch_to_asm+0x35/0x70
[  288.276845]  process_one_work+0x17b/0x380
[  288.279728]  worker_thread+0x2e/0x390
[  288.282509]  ? process_one_work+0x380/0x380
[  288.285482]  kthread+0x11a/0x130
[  288.288134]  ? kthread_create_on_node+0x70/0x70
[  288.291201]  ret_from_fork+0x35/0x40
[  288.293964] Code: b9 01 00 00 00 0f 44 4c 24 04 89 4c 24 04 49 89 c4 48 8b 45 00 48 39 c5 74 1a 4d 85 f6 4c 8b 6d 08 0f 84 67 ff ff ff 49 8b 45 e0 <49> 39 06 0f 89 5a ff ff ff 8b 44 24 04 85 c0 75 51 48 8b 44 24 
[  293.673669] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 219.
[  293.679790] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 220.

As a workaround, we took the official Ubuntu EKS 1.15 image ami-0f54d80ab3d460266, added the nfs-common package to it (needed to manage EFS), and rebuilt a new custom AMI from it.

Note: using kiam, we had to change the CA certificate path, because the location in Ubuntu is different from the one in the AmazonLinux image.

@cshivashankar
Author

@colandre Do you use the stock AMI recommended by AWS, or do you bake your own AMI? If so, what customizations are you using?

It looks like the issue might be arising due to high resource utilization.
It's still not clear to me why this is happening even when the kubelet and components like the Docker daemon have resource restrictions set. How can they overload the system and cause the kernel to panic?
Maybe someone from Amazon can give a clearer picture.

@colandre

@cshivashankar , I used stock Amazon AMIs:

  • amazon-eks-node-1.15-v20200507 - ami-023736532608ff45e
  • amazon-eks-node-1.15-v20200423 - ami-02955144c3a2cb6a1

no customizations.
I also tried forcing kubeReserved and systemReserved in the eksctl configuration to higher limits than the ones auto-computed by eksctl (0.19 and above), with no luck.
The suggested reservations for my instance type are 300m CPU and 300Mi memory.
I tried 1 core and 1Gi.
E.g.

...
kubeletExtraConfig:
        kubeReserved:
            cpu: "1024m"
            memory: "1Gi"
            ephemeral-storage: "1Gi"
        kubeReservedCgroup: "/kube-reserved"
        systemReserved:
            cpu: "1024m"
            memory: "1Gi"
            ephemeral-storage: "1Gi"
        evictionHard:
            memory.available:  "200Mi"
            nodefs.available: "10%"
...

@cshivashankar
Author

cshivashankar commented May 24, 2020

@colandre Did you reach out to AWS support? Any input from them? I guess this thread has little visibility since it was created only a few days ago. As per AWS, there should not be any issues if the stock AMI is being used, so it's definitely worth following up with them. I am also curious about the solution to this problem.

@colandre

@cshivashankar, unfortunately we do not have the Business support plan, so technical support cases cannot be opened.

@cshivashankar
Author

@colandre I think the only way to get attention on this thread is to tag a collaborator or contributor of this repo :)

@eeeschwartz

We're cautiously optimistic that upgrading our CNI plugin to 1.6.1 has resolved the issue for us. HTH

@jae-63

jae-63 commented May 28, 2020

I work with @eeeschwartz. We no longer think that the CNI upgrade fixed this for us. What did fix it was switching to instances with local SSD drives, such as r5d.2xlarge, and using that SSD for /var/lib/docker as suggested in #349.

Our actual experience with managing that filesystem was a bit different from what that link suggests. We found that dockerd needs to be stopped and restarted as part of the userdata script, even though we placed our code fragment in the "pre-userdata", i.e. as the very first steps. The following appears to be working for us:

if ( lsblk | fgrep -q nvme1n1 ); then
   mkfs.ext4 /dev/nvme1n1
   systemctl stop docker
   mkdir -p /var/lib/docker
   mount /dev/nvme1n1 /var/lib/docker
   chmod 711 /var/lib/docker
   systemctl start docker
fi

@cshivashankar
Author

@jae-63 Moving the Docker daemon to NVMe should ideally provide better performance and I/O. Based on your experience of the issue, did you see any correlation between the issue and the IOPS of the volumes? Did you face any issues with EBS volumes with very high IOPS?
I'm trying to understand why the soft lockup issue is arising; ideally Docker running on an EBS volume should work well.

@jae-63

jae-63 commented May 28, 2020

@cshivashankar for cost reasons [or our perceptions of the costs], we didn't pursue IOPS-provisioned EBS volumes, so I don't have any data to provide. I.e. we were previously just using vanilla 'gp2' drives.

I think that one could use IOPS-provisioned (i.e. "io1") drives instead of following our present solution. That would be a bit simpler.

Ephemeral SSDs paired with ephemeral instances, as in our use case, are a nice match. It seems likely that our (new) SSD drives are overprovisioned with respect to size, given the available CPU/RAM/SSD size ratios. Other users might use more or less disk than we do.

BTW, we downsized the root EBS partition to 20 GB, but even 10 GB would probably be sufficient.

@cshivashankar
Author

cshivashankar commented May 29, 2020

Hi @jae-63, thanks for the info. Are there any easy ways/instructions for replicating this issue?

Hi @Jeffwan, do you see any potential bug or limitation which could be causing these soft lockup issues? It looks like more people are facing them, and some are hitting them on the plain stock AMI recommended by AWS. It would be great to know your opinion on what might be causing these issues.

@jae-63

jae-63 commented May 29, 2020

@cshivashankar I don't see an easy way to reproduce this. The two affected clusters are mostly used for code building and deployment (primarily the former) using self-hosted Gitlab. As you may know, in ".gitlab-ci.yml" files, one invokes lots of different containers for a brief period of time, e.g. corresponding to building a Docker container. We recently started using the caching capabilities of the 'kaniko-project' https://cloud.google.com/cloud-build/docs/kaniko-cache for many of our builds.
I wonder whether that caching could place a heavy disk I/O load on EKS.

I'm pretty sure these are stock AMIs we're using. One cluster is still running EKS 1.15 and the other is EKS 1.16. They are 'ami-065418523a44331e5' and 'ami-0809659d79ce80260' respectively, in us-west-2.

We did open an AWS ticket for this lockup issue a few weeks ago, but not much was accomplished on it, and it's now closed. I don't know whether it's considered OK to post an AWS ticket number here, or perhaps there's a way for us to communicate privately. Perhaps you'd find some clues there.

In principle I'd be willing to revert to using regular 'r5' instances for a few hours (AFTER HOURS) rather than the 'r5d' ones we're currently using. That should reproduce the problem although the load is likely to be much lighter than during the daytime, when our developers are actively using Gitlab. In our experience we continued to observe these lockups after hours. But you'd need to tell me in advance, in detail, what diagnostics to try to collect.

@cshivashankar
Author

Hi @jae-63, even my initial analysis pointed to high IOPS. I did try to simulate the error by subjecting a node to high CPU and IOPS, but somehow the soft lockup issue doesn't get reproduced in the dev environment. So it's becoming quite tricky to understand what's causing this soft lockup.
Earlier I was using the following QPS flags with the kubelet for a faster rate of communication with the control plane:
--kube-api-qps=30 --kube-api-burst=50 --event-qps=30 --event-burst=50 --registry-qps=0. After removing these flags the issues were reduced a lot, but not completely eliminated. We also use Jenkins, so there might be some correlation with those activities.

I had opened an AWS ticket and am working actively with an engineer; the lack of reproduction steps has become a big hurdle. I am not from Amazon, so I don't know if it's OK to share the ticket number; I can check and confirm.

The main diagnostics that can help are a kernel crash dump, the logs collected by the log collector script, and system metrics, e.g. from SAR or similar tools.
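
For example (a rough sketch; sar assumes the sysstat package is installed), the lockups themselves and basic system activity can be captured with:

# Look for soft lockup messages in the kernel log
dmesg -T | grep -i "soft lockup"
journalctl -k | grep -i "soft lockup"

# Sample CPU and disk activity every 10 seconds
sar -u -d 10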

@jae-63

jae-63 commented Jun 2, 2020

Hi @cshivashankar. Sorry, I had misunderstood: I thought you were AWS staff seeking customer help to resolve this for the larger community. Since you, like me, don't work for AWS, I'm going to excuse myself from further activity here, because I don't think I have much to add beyond the workaround I have already provided.

@otterley
Contributor

Can those of you experiencing this issue please try the following on your nodes, reboot, and see if you're still experiencing the issue?

sudo amazon-linux-extras install kernel-ng
sudo reboot

@otterley
Contributor

Alternatively, for those of you experiencing this issue, if you switch to c4/r4/m4 instances instead of c5/r5/m5, does it make a difference?

@cshivashankar
Author

Can those of you experiencing this issue please try the following on your nodes, reboot, and see if you're still experiencing the issue?

Can you please provide more details on how this solves the issue? I am OK to try this if it addresses a known issue or bug.

@otterley
Contributor

@cshivashankar There have been a number of cgroup-related bugs fixed in Kernel 5.3+. This is the Linux kernel that comes with the kernel-ng package, which is much later than the default kernel that comes with the EKS-optimized AMI.

@cshivashankar
Author

Hi @otterley, thanks for your feedback. I will definitely give it a try and report back. Is there a public list of bug fixes that would give an idea of what changed, or any specific fix that could address the above issue? I am curious what exactly is causing this issue. Any insights?

@otterley
Contributor

Unfortunately I'm not aware of any identifiable bug fixes that directly relate to this issue. But with your assistance we can hopefully narrow this down to a root cause.

@cshivashankar
Author

Hi @otterley, I will be glad to help. Let me test these changes and report back. Due to the nature of the issue and the difficulty in reproducing it, it might take some time to confirm whether it really solved the problem. Meanwhile, is there any way I can check the kernel source code and try to debug, other than using yumdownloader for the source (https://aws.amazon.com/amazon-linux-2/faqs/)?
I assume you are from the AWS kernel team.

@otterley
Contributor

I'm actually a Containers Specialist Solutions Architect and am not on the kernel team. I'm just volunteering my time to help out customers like you as best I can. :)

Also, it would be good to know if this issue is reproducible on 4th generation instances.

@cshivashankar
Author

Thank you, that's great to hear :).
I am open to trying out 4th generation instances in a dev environment. The biggest problem for me has been reproducing the issue. Do you have any ideas for reproducing it, so that it will be much easier for me to try out any changes?
So far I have tried everything from hammering CPU, memory, and IOPS, but nothing worked :(.

@otterley
Contributor

Is the issue still occurring randomly, but you're unable to induce it yourself? Or has the issue resolved itself spontaneously?

@cshivashankar
Author

I am still getting issues, but they are random and less frequent. All of a sudden a node goes down with a soft lockup in the prod environment.
Earlier I was using the following kubelet flags related to QPS and reserved resources:
--kube-api-qps=30 --kube-api-burst=50 --event-qps=30 --event-burst=50 --registry-qps=0
--system-reserved=cpu=1,memory=1Gi,ephemeral-storage=5Gi
which were causing nodes to go down once every few hours after the 1.14 upgrade (we didn't face any issues on 1.13). Removing the above flags definitely reduced a lot of those incidents but hasn't eliminated them completely.
I am unable to induce the error myself to debug the issue.

@JacobHenner

I've got a "kind of" repro now. I wrote this script to deploy to pods and do a bunch of copying of files. I've successfully hung Docker and created some PLEG issues that @cshivashankar has mentioned in this other issue. I haven't explicitly seen a soft lockup in the system logs yet, but I'm breaking my nodes, so that's progress! I'm going to compare it with a previous AMI and see if I can get to a state where I break one and not the other.

I've been running this for a bit and haven't yet seen any related errors in dmesg.

@mmerkes
Member

mmerkes commented Oct 7, 2020

@JacobHenner I was able to see issues initially when I created a ton of pods at once using that script, but when I tried scaling in a more controlled manner, I didn't see it. Still struggling to get a consistent repro on new AMIs that doesn't always break the older AMIs.

@cshivashankar
Author

cshivashankar commented Oct 7, 2020

I've got a "kind of" repro now. I wrote this script to deploy to pods and do a bunch of copying of files. I've successfully hung Docker and created some PLEG issues that @cshivashankar has mentioned in this other issue. I haven't explicitly seen a soft lockup in the system logs yet, but I'm breaking my nodes, so that's progress! I'm going to compare it with a previous AMI and see if I can get to a state where I break one and not the other.

That's good to hear :).
Some more observations: when I observed the soft lockup, CPU utilization and load were very high.
I guess you are running the file generator script inside the pod, not on the node.
A few more things which might help:

  1. Reduce the IOPS of the disk; it might be worth trying with SC/ST (sc1/st1) volumes as well.
  2. Wondering if creating 20 x 25 MB files in parallel is better than one 500 MB file, to increase the context switching between user and kernel mode; maybe this will increase the CPU load (a rough sketch follows this list).
  3. Run a CPU loader at high priority, or maybe in the system slice; the idea is to starve CPU cycles for the simulation.
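
A rough sketch of item 2 above, meant to be run inside a test pod (the /data path is an assumption for whatever writable volume, e.g. an emptyDir, the pod mounts; the sizes are just the numbers discussed above):

#!/usr/bin/env bash
# Repeatedly write 20 x 25 MB files in parallel using buffered I/O, so the
# kernel writeback path has to flush the dirty pages, then delete and repeat.
while true; do
  for i in $(seq 1 20); do
    dd if=/dev/zero of=/data/load-$i bs=1M count=25 2>/dev/null &
  done
  wait
  rm -f /data/load-*
done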

@JacobHenner

Upgrading the kernel from 4.14 to 5.4 seems to resolve the issue for customers.

Was this based on the limited feedback in this ticket, or were there other customers who indicated this fixed the issue as well? I'm in the process of applying this workaround to my clusters, and I'd like to have a better understanding of the likelihood that it'll either succeed or fail @mmerkes.

Also, I'm curious as to why this isn't impacting more EKS users. Seems like this would be a frequently encountered issue.

@mmerkes
Member

mmerkes commented Oct 7, 2020

@JacobHenner It's definitely possible that there are customers who don't notice or aren't reporting, but it's also unclear how common the scenario is given that we don't fully understand it.

Of course, there are no guarantees that upgrading the kernel will resolve the issue, but so far it has worked for every customer that has reported trying it, including some support tickets that aren't captured in this GitHub issue. We're also working with the AmazonLinux team to create a mechanism that lets customers upgrade their kernel to 5.4 in a way that pins them to 5.4, rather than using the ng repo, which will eventually move to 5.9.

@JacobHenner

I have now been able to reproduce this issue consistently in my environment. The reproducer triggers the issue with the 4.14 kernel in the 1.14 and 1.16 images, but it hasn't yet triggered the issue with the 5.4 kernel in the 1.16 image.

I will attempt to figure out what combination of conditions is causing the issue, as well as develop a reproducer that I can share (the current one includes data which I cannot share).

@brettplarson

brettplarson commented Oct 7, 2020

I think it should be possible to easily reproduce this issue with a pod that has an init container which clones a git repo with a large number of files (possibly the Linux kernel source) to an emptyDir.

The main container should then either move or delete these files. I believe this combination (creating and moving/deleting a large number of files in a short time) is key to recreating the issue, but I don't have any evidence besides the fact that this is what we were doing and we were hitting it very consistently.

Updating to the 5.4 kernel completely fixed this.

Hope this helps!

@mmerkes
Member

mmerkes commented Oct 7, 2020

@JacobHenner That's good news! In summary, what does your reproducer do? @brettplarson Thanks for the details. I can try that. I'm glad the kernel upgrade fixed it for you.

@JacobHenner

The reproducer runs 8 replicas of a container which downloads and extracts a 22 MB (277 MB uncompressed) gzipped tarball in a tight loop. At first I figured this might be related to the use of an emptyDir, as most of the pods which seemed to trigger this issue used emptyDirs, but the reproducer was able to reproduce the node failures both with and without emptyDirs. I was also able to reproduce the lockup behavior on m5.2xls and m4.2xls, although the log messages were different between the two.
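
For anyone trying to build a similar reproducer, here is a minimal sketch of that shape of workload (the image choice and the tarball URL are placeholders/assumptions; the tarball used above is not public, so substitute any archive containing a few hundred MB of small files):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: soft-lockup-repro
spec:
  replicas: 8
  selector:
    matchLabels:
      app: soft-lockup-repro
  template:
    metadata:
      labels:
        app: soft-lockup-repro
    spec:
      containers:
      - name: untar-loop
        image: public.ecr.aws/amazonlinux/amazonlinux:2
        command: ["/bin/sh", "-c"]
        args:
        - |
          yum install -y -q tar gzip curl
          mkdir -p /work
          while true; do
            curl -sSL -o /work/files.tar.gz https://example.com/many-small-files.tar.gz
            rm -rf /work/extracted && mkdir -p /work/extracted
            tar -xzf /work/files.tar.gz -C /work/extracted
          done
EOF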

@mmerkes
Member

mmerkes commented Oct 7, 2020

I wrote a script to git clone a repo and then compress and decompress it in a loop. I've run different variations of this on m5.xlarge nodes with 8-16 pods per node. Haven't seen the soft lockup yet, but still trying things. Primarily sharing as an update.

@robertgates55

Can confirm that upgrading to 5.4 has so far resolved the issue for us too.

We had a pretty reliable repro (when our portworx cluster kicked off scheduled full cloud snaps of ~400 volumes at the same time every week) and have rolled out changes over the last few weekends:

  • upgrading our nodes to m5d instances did NOT resolve the issue for us 👎
  • upgrading the kernel to 5.4 DID 👍

@JacobHenner

I've shared my team's reproducer with the AWS team. I cannot share the gzipped tarball itself here, but it's essentially a few hundred mb of small files, in many directories. Running 8 replicas of tar extraction in a tight loop reliably produces the lockup on m5.2xl instances running 4.14, but not 5.4.

@JacobHenner

JacobHenner commented Oct 8, 2020

If it helps, I noticed in the logs of one of these tar-extraction pods:

tar: <file_name>: Cannot write: Cannot allocate memory
(repeated)

However, if this was actually an OOM condition, wouldn't the container have just been killed?

@mmerkes
Member

mmerkes commented Oct 8, 2020

@JacobHenner Thanks for sharing your reproducer! I was able to reproduce the issue on 4.14, so that's great news. It seems to also cause the issue on the older AMI that others didn't seem to have a problem with, but now we've got something to work with. I will share this with the AmazonLinux team and dig further myself so that we can figure out what's going on and get it fixed.

@hrzbrg

hrzbrg commented Oct 12, 2020

Hi @mmerkes
We use the AMI on which the EKS AMIs are built (amzn2-ami-minimal-hvm-*). Would it be in your power to trigger a release of these AMIs with the 4.14.198 kernel?

@mmerkes
Member

mmerkes commented Oct 13, 2020

@hrzbrg You're asking if we can trigger a release of the AMIs maintained by AmazonLinux rather than the EKS maintained AMIs? If so, we can't. They have a process for releasing new AMIs with the latest kernels, and I don't know when that will happen, but it will happen eventually. For this particular issue, updating the kernel doesn't seem to help.

As a quick status update for the soft lockup issue, the AmazonLinux team has the repro and is actively working on it. Once the issue is identified, they will hopefully be able to fix it, test it and release a new patch for the kernel. Once we have a verified fix and it's available, EKS will release new AMIs for customers to use or you can yum update kernel.

@mmerkes
Member

mmerkes commented Oct 23, 2020

We've been able to root cause the issue with the AmazonLinux team.

When containers with a write-heavy workload run on IOPS-constrained EBS volumes, EBS starts to throttle IOPS. The kernel is then unable to flush dirty pages to disk because of this throttling. The dirty-page limit is even more constrained for each cgroup/container and is directly proportional to the memory requested by the container.

When the number of dirty pages for a container increases, the kernel tries to flush the pages to disk. In the 4.14 kernel, the code which flushes these pages does wasteful work building up the queued work items of pages to flush instead of actually flushing them. This causes the soft lockup errors and explains why we don't see any I/O going to disk during this event. We have found the patch, present in kernel 4.15.x onwards, which fixes this issue.

We are working on backporting this patch to the 4.14 kernel so that it can be released with an EKS optimized AMI.
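
For anyone who wants to watch this behavior on a 4.14 node while reproducing, a rough sketch (the cgroup path is a placeholder and assumes the cgroup v1 memory controller these AMIs use; iostat requires the sysstat package):

# System-wide dirty/writeback page counts; during the event they stay elevated
# while very little I/O actually reaches the EBS volume.
watch -n1 'grep -E "nr_dirty |nr_writeback " /proc/vmstat'

# Per-cgroup dirty-page accounting for a pod's container
grep -E "dirty|writeback" /sys/fs/cgroup/memory/kubepods/<pod-cgroup>/memory.stat

# I/O actually hitting the disk, for comparison
iostat -xm 5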

@cshivashankar
Author

cshivashankar commented Oct 23, 2020

@mmerkes That's great news. Is there a bug/PR on GitHub for this issue in the kernel?
By any chance, is there any correlation with the PLEG issues seen on the nodes?

@mmerkes
Member

mmerkes commented Oct 23, 2020

@cshivashankar I don't think there's an issue on the AmazonLinux side, though I will update here when it's available. My suspicion is that there's a relation to some of the PLEG issues, but without a repro specifically on the PLEG issues, I can't say for certain. Did upgrading the kernel to 5.4 make those go away?

@cshivashankar
Author

@mmerkes I was referring to a bug/PR from the upstream Linux repo, if any.
I did upgrade the kernel in the non-prod environment; however, soft lockup and PLEG issues are not observed on non-prod instances, so I can't confirm whether this resolved them. I am also having a tough time reproducing the issue in the non-prod environment, which makes it even more challenging to find the root cause.

@rphillips

@mmerkes Could you link us to the kernel patch?

@mmerkes
Member

mmerkes commented Oct 27, 2020

@rphillips @cshivashankar Here is a thread that discusses the issue and I believe this is the upstream patch that resolves the issue.

@mmerkes
Member

mmerkes commented Oct 28, 2020

Patch has been merged into AmazonLinux kernel. It'll be available in the next kernel release.

@mmerkes
Member

mmerkes commented Nov 12, 2020

kernel.x86_64 0:4.14.203-156.332.amzn2 is now available via yum update and has the patch that should resolve this soft lockup issue. We're working on building, testing and releasing new EKS optimized AMIs that will include this patch. Once that's available, I will post another update here and close this issue.

@njtman

njtman commented Nov 16, 2020

kernel.x86_64 0:4.14.203-156.332.amzn2 is now available via yum update and has the patch that should resolve this soft lockup issue. We're working on building, testing and releasing new EKS optimized AMIs that will include this patch. Once that's available, I will post another update here and close this issue.

@mmerkes I noticed that AMI version 1.18.9-20201112 has been released, which uses kernel version 4.14.203. Does this AMI include the patch? Source: https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html

@mmerkes
Member

mmerkes commented Nov 16, 2020

@njtman Yes, that is correct! The latest EKS optimized AMIs include the patch that fixes the soft lockup issue. Any AMI with Packer version v20201112 or higher will have the patch.
