PLEG errors with v20221027, kernel 5.4.217-126.408.amzn2 #1071
Comments
@ravisinha0506 since you're tracking this, we're seeing the same issues on AMIs built on https://alas.aws.amazon.com/AL2/ALASKERNEL-5.4-2022-036.html
Prior to this latest release, we attempted to upgrade the kernel on top of the AMI release v20220926 and experienced instability, so we decided to wait for this latest release with the kernel upgrades that patch the latest CVEs. Unfortunately, we experienced the same issues, and we are reasonably confident this is due to the kernel upgrade. The issues we are seeing now are mostly related to readiness probes using exec commands, which are failing at a much higher rate. It seems to be either a latency issue or some sort of race condition causing the failures. It is also noteworthy that we see these failures on high-volume pods/nodes.
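For anyone trying to confirm the same symptom, a hedged sketch of how one might surface these exec-probe failures and the related kubelet complaints (illustrative commands, not taken from the report above):

```bash
# Recent probe failures across the cluster; exec readiness probe failures
# show up as "Unhealthy" events with a "Readiness probe failed" message.
kubectl get events -A --field-selector reason=Unhealthy --sort-by=.lastTimestamp | tail -n 50

# On a suspect node, check the kubelet logs for probe timeouts and PLEG health.
journalctl -u kubelet --since "1 hour ago" | grep -Ei 'probe|PLEG' | tail -n 100
```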
@mschenk42 Seconding this: we experienced the same instability with v20220926 and the new kernel. We had to roll back.
We are also experiencing the same issue with v20221027.
We have been looking into this issue as well in our clusters and have been seeing singular containers on some nodes enter a state where no docker commands complete when interacting with the container. As mentioned, this was first seen in the
Node events show flapping between NodeReady and NodeNotReady.
That NodeNotReady status seems to come from the K8s PLEG failing.
All docker commands interacting with the affected pod hang, but others complete just fine.
The affected container seems to have an additional
The
We then see that after running
At this point I hopped on another broken node, since that one was cleaned up. Running strace on the containerd shim process
That then loops through the
Generating a stack trace of docker generates an extremely large file.
Inside the stack trace we see what seems like thousands of goroutines similar to
The oldest of these
The two
As far as I found, there was no way to display what file was actually being opened with that
amazon-eks-ami/scripts/upgrade_kernel.sh (Line 16 in 1e89a44)
In order to trigger this issue we've been executing a
Hopefully some of this may be useful in tracking down the issue.
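A hedged reconstruction of the kind of commands used in this sort of investigation (placeholders such as <container-id> are mine, not from the comment above):

```bash
# 1. Kubelet's PLEG health complaints behind the NodeReady/NodeNotReady flapping.
journalctl -u kubelet --since "1 hour ago" | grep -i 'PLEG is not healthy'

# 2. Docker commands against the affected container hang; bound them with a timeout.
timeout 10 docker inspect <container-id> || echo "docker inspect appears hung"

# 3. Attach strace to that container's containerd-shim to see which syscall it loops on.
SHIM_PID=$(pgrep -f "containerd-shim.*<container-id>" | head -n 1)
sudo strace -f -p "$SHIM_PID" -s 128

# 4. Ask dockerd for a goroutine dump (the very large stack-trace file mentioned above);
#    it is written under the daemon's exec root, e.g. /var/run/docker/goroutine-stacks-*.log.
sudo kill -SIGUSR1 "$(pidof dockerd)"
ls -lh /var/run/docker/goroutine-stacks-*.log
```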
@kevinsimons-wf In all cases though (maybe you can check it out), before we start seeing the PLEG error we see a bunch of log lines containing the string
Finally (I have not been able to verify it is the cause, but it could make sense), this error seems to happen within 10 minutes of a new network interface being added, see
It could happen that there was a blip on the network that caused kubelet to fail reading the disk and enter this bad state. At the same time, the same blip could be causing other processes to enter a bad state (I actually had a bunch of aws-node subprocesses in
Same as you, hopefully this might help solve the issue...
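To check the suggested correlation on a node, one rough (illustrative) approach is to compare ENI attach times in the kernel log with the kubelet's PLEG complaints:

```bash
# Timestamps of recently attached network interfaces (ena is the Nitro ENA driver).
dmesg -T | grep -iE 'ena|eth[0-9]' | tail -n 20

# Kubelet PLEG health complaints over the same window.
journalctl -u kubelet --since "30 min ago" | grep -i 'PLEG is not healthy'
```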
AmazonLinux has released kernel 5.4.219-126.411.amzn2.
Keeping open until we release an AMI with the patched kernel.
The problem seems so widespread that AWS should issue an official statement. Can we expect something like that? I learned about this issue only from a support ticket, after some production clusters had already gone down.
FYI, updated the mitigation recommendation above to call out that
Same here..... |
Hi, I see that the patched kernel mentioned is available. @mmerkes When will the new AMI release come out?
The new image is already out. See this comment: #1071 (comment)
The latest AMI doesn't have this kernel, 5.4.219-126.411.amzn2.
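Until an AMI with the fix ships, a minimal sketch of how the patched kernel could be pulled onto an existing AL2 node (this assumes the 5.4 kernel comes from the amazon-linux-extras kernel-5.4 topic, similar to the repo's scripts/upgrade_kernel.sh, and it requires a reboot):

```bash
# Enable the 5.4 kernel topic and install the newest 5.4 kernel build
# (5.4.219-126.411.amzn2 at the time of this thread, if already in the repos).
sudo amazon-linux-extras enable kernel-5.4
sudo yum clean metadata
sudo yum install -y kernel
sudo reboot
```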
Are there any release notes for kernel 5.4.219-126.411.amzn2?
Is it written anywhere what the root cause was and what the fix was? I'm interested.
Doing a proper post-incident review should be standard procedure. I am unpleasantly surprised that it is not taking place.
Sorry, I was OOTO, so I wasn't able to update here. v20221104 was released with the latest kernel, 5.4.219-126.411.amzn2. When issues like this happen, it is our standard practice to root-cause the issue, mitigate, and review our processes to avoid similar issues in the future. In this case, the process is ongoing. We have developed a test that reproduces this particular issue and have added it to our AMI release process. As for more details on the root cause, I'll post here if AmazonLinux releases anything with more details. In a nutshell, a commit on the kernel was causing some workloads to hang, and it was reverted in the latest kernel.
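For anyone confirming the rollout, a quick illustrative check of which kernel each node is actually running after replacing nodes with v20221104:

```bash
# Kernel version per node, as reported by the kubelet.
kubectl get nodes -o custom-columns='NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion'
```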
With a bit of guesswork, I suspect the issue referred to above was torvalds/linux@72d9ce5 from
I updated our clusters Tuesday night and things seemed fine, but today I'm seeing nodes flapping to NotReady again with PLEG errors, and pods left in a Terminating state. Kernel version: 5.4.219-126.411.amzn2.x86_64
You could diagnose the previous issue by running
Thanks for the quick reply. I didn't see those processes on my nodes. Today I installed an efs-csi driver update that was itself quickly superseded by a new release today; perhaps that was causing my issues. I spun up all new nodes in my clusters and things seem stable now.
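Not necessarily the check referred to above, but one illustrative way to tell a kernel-level hang apart from an unrelated problem is to look for processes stuck in uninterruptible sleep on the affected node:

```bash
# Processes in D (uninterruptible sleep) or Z (zombie) state; long-lived D-state
# entries usually point at a kernel or I/O hang rather than an application bug.
ps -eo pid,stat,wchan:32,etime,comm | awk 'NR==1 || $2 ~ /^(D|Z)/'

# Confirm which kernel the node is actually running.
uname -r
```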
What happened:
We noticed an increased rate of PLEG errors while running our workload with the v20221027 AMI release. The same issue doesn't appear with v20220926.
What you expected to happen:
PLEG errors shouldn't appear with the latest AMI.
How to reproduce it (as minimally and precisely as possible):
This issue can be easily reproduced with high CPU/memory workloads; a sketch of such a workload follows.
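Not the reporter's actual workload, just a minimal sketch of the kind of pod that exercises the same path: a CPU-heavy container with an exec readiness probe (the pod name, image, and resource numbers are placeholders):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pleg-repro            # hypothetical name
spec:
  containers:
  - name: stress
    image: public.ecr.aws/docker/library/alpine:3.16
    # Busy-loop to keep the CPU saturated.
    command: ["sh", "-c", "while true; do :; done"]
    resources:
      requests: {cpu: "1", memory: "256Mi"}
      limits: {cpu: "1", memory: "256Mi"}
    readinessProbe:
      exec:
        command: ["sh", "-c", "true"]   # exec probe, the failure mode reported above
      periodSeconds: 5
      timeoutSeconds: 2
EOF
```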
Anything else we need to know?:
So far it looks like an issue associated with the latest kernel version, 5.4.217-126.408.amzn2.

Environment:
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion):
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version):
- Kernel (e.g. uname -a):
- Release information (run cat /etc/eks/release on a node):