Kernel 5.13+ panics resulting in deadlock #940
Comments
Hey @wkruse. Thanks for the detailed report. My experience with kernel bugs like this is that (thankfully) there is usually a fix already landed or in the works upstream. Just to cover all bases here, do you mind trying with:
You can find the links in the unofficial builds browser: https://builds.coreos.fedoraproject.org/browser
Hey @wkruse - mind trying out the latest build?
We tested the following versions:
We could reproduce system freezes under moderate load in all of the versions above; it looks like it broke for us starting with kernel 5.13. We couldn't reproduce the issue running the older FCOS release. We will try the suggested builds.
So, let me summarize:
@dghubble - are other Typhoon users seeing this?
On bare metal: Dell Inc. PowerEdge R630.
(On VirtualBox there is no freeze. I was just trying to reproduce it in a virtual environment, but didn't succeed, so it has something to do with the real hardware.)
Maybe related: #957.
From the traces, this looks like it could be console printing-related. Wonder if e.g. dropping the serial console karg might help as a test? Anyway I think it's probably better at this point to track this in RHBZ where kernel SMEs can take a look. I filed https://bugzilla.redhat.com/show_bug.cgi?id=2003168. Feel free to add details there. In particular, there is one question I wasn't entirely sure about:
Can anyone confirm this either way in the RHBZ?
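For the serial-console test suggested above, a minimal sketch of what it could look like (the `console=` value shown is an assumption about a typical bare-metal setup; list the node's current kargs first and adjust):

```
# list the current kernel arguments to find the exact console= entry
sudo rpm-ostree kargs
# drop the serial console karg as a test (the value below is a placeholder)
sudo rpm-ostree kargs --delete=console=ttyS0,115200n8
# the change takes effect on the next boot
sudo systemctl reboot
```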
This seems to be a scheduler regression in the 5.13 kernel. The stack causing the issue is:
Looking at the listing file for that function:
That is the inlined call. There is a secondary bug in that we should not be calling it.
Just started a scratch build.
@wkruse can you try this dev build? Note, you'll have to disable Secure Boot if you have that enabled.
Alternatively you can just replace the kernel on an existing node:
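The exact command isn't shown above, but it would have roughly this shape (the URLs are placeholders for the actual scratch-build RPMs, not real links):

```
# replace the shipped kernel with the scratch-build kernel packages
sudo rpm-ostree override replace \
  https://example.org/scratch/kernel-5.13.16-200.fc34.dusty.x86_64.rpm \
  https://example.org/scratch/kernel-core-5.13.16-200.fc34.dusty.x86_64.rpm \
  https://example.org/scratch/kernel-modules-5.13.16-200.fc34.dusty.x86_64.rpm
# boot into the replaced kernel
sudo systemctl reboot
```

Reverting later is a matter of `rpm-ostree override reset` (or rolling back to the previous deployment).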
Running moderate load on FCOS 34.20210821.3.0 with the replaced kernel 5.13.16-200.fc34.dusty.x86_64 (via the rpm-ostree command above) also results in a deadlock, but without the circular dependency logging above. I've attached the kernel log to https://bugzilla.redhat.com/show_bug.cgi?id=2003168.
We added Dell Inc. PowerEdge R640 servers to the test cluster, just to make sure that the issue is not specific to the R630.
Same as above: deadlock without the circular dependency logging. The kernel log is similar to the previous one.
@wkruse - this code was touched again recently in merge commit 5d3c0db. Want to try with that?
@dustymabe We tried it.
@dustymabe I've also been seeing this on SuperMicro boxes and have pinned clusters to the last FCOS stable with 5.12 for now. Unfortunately I don't have much capacity to look into this for a bit 🤕
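For anyone wanting to do the same, a rough sketch of pinning a node to the last known-good stable release (the version string is the one reported earlier in this issue; disabling Zincati is just one way of holding back automatic updates):

```
# stop automatic updates while the node is pinned
sudo systemctl disable --now zincati.service
# deploy the last release that still shipped a 5.12 kernel
sudo rpm-ostree deploy 34.20210711.3.0
sudo systemctl reboot
```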
If we think this isn't limited to specific hardware (i.e. it affects lots of bare metal) and we can get a small reproducer (ideally a single machine), I can throw the reproducer at some hardware I've got here at home.
We were able to reproduce it on a single node (a single-node K8s cluster with one controller node provisioned with Typhoon) running a warm-up of our regular test, but we don't have a synthetic reproducer yet. It seems to be related to stuff running in Kubernetes: running busy loops directly on the node or in a container on the node didn't result in a deadlock. Also, it seems to happen more often on freshly provisioned nodes. On our single test node, the deadlock appeared after 2 warm-ups, then after 3 warm-ups; after that we were able to run 15 warm-ups in a row without deadlocks.
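For context, the plain CPU busy-loop tests that did not trigger the deadlock look roughly like this (our own sketch of such a test, not the exact commands that were run):

```
# busy loops directly on the node, one per CPU (stop them with `kill %1 %2 ...` or by closing the shell)
for i in $(seq "$(nproc)"); do ( while :; do :; done ) & done

# the same idea inside a container on the node
sudo podman run --rm registry.fedoraproject.org/fedora:34 \
  bash -c 'for i in $(seq "$(nproc)"); do ( while :; do :; done ) & done; sleep 600'
```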
Maybe also related: #965.
Is there a way in FCOS to collect more information to debug kernel crashes? |
That one is specific to running on a different architecture.
Nope, we are running x86_64. |
Running our test on the latest stable release.
Hmm. Anything in the logs at all that would indicate issues with kdump running? I do see our docs say:
So maybe start with that.
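A rough sketch of enabling kdump on FCOS along those lines (the `crashkernel` size is an assumption; pick a value that fits the machine's memory):

```
# reserve memory for the crash kernel; takes effect on the next boot
sudo rpm-ostree kargs --append='crashkernel=300M'
# have kdump capture a vmcore on panic
sudo systemctl enable kdump.service
sudo systemctl reboot
# after a crash, look for the vmcore under /var/crash
```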
The stack traces posted here look very similar to what I posted in the other bug: #957 (comment) (the attached archive has logs from a few machines). I did not see anything strictly related to CFS, but who knows.
We would be happy to run bisect kernels to help pinpoint the kernel commit which broke it for us.
@wkruse - any chance we ever got down to a small reproducer on a single machine? With that, we could pretty quickly bisect and find the culprit.
@dustymabe We can reproduce it quickly and reliably with our tests, but we weren't able to create a synthetic reproducer. For us it looks like running Kubernetes with some load triggers the issue. Looking at #957, that seems to match what the OKD folks observe.
@aneagoe We have a custom application running in Kubernetes and a test environment to test it. We don't have a synthetic reproducer. We were also running basic CPU load tests (#940 (comment)) and weren't able to reproduce the issue.
@wkruse Could you share some details about the app that triggers it? Was it the only workload on the node? Any details about the workload, e.g. mostly I/O, mostly network, compiled C++/Go code, Java, Python, etc.? Is it highly multithreaded?
@baryluk It is a distributed, highly multithreaded Java 16/17 app with heavy usage of multiple Redis instances for queueing and storage. Relatively low network bandwidth, but very sensitive to latency.
Maybe one additional hint: starting with F34 we had another problem with two of our services, which had huge lags in responses right after deployment. The root cause was an old Java 11 base image and cgroups v2 (introduced by F34) not being fully supported in that version of Java. The fix was to upgrade to the latest Java 11 base image.
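For anyone hitting a similar mismatch, a quick sanity check looks something like this (generic commands, not taken from the original report):

```
# which cgroup hierarchy the host uses: "cgroup2fs" means cgroups v2, "tmpfs" means v1
stat -fc %T /sys/fs/cgroup/
# run inside the base image: shows whether this JVM detects the container's cgroup limits
java -Xlog:os+container=info -version
```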
We cannot reproduce it anymore.
@wkruse That's great news! Maybe let's keep it open for a few more days to be sure, and then we can close this (and the filed RHBZ).
Thanks all for collaborating and helping us find when this issue was fixed. I wish we could narrow it down to a particular kernel commit that fixed the problem, but the fact that it's fixed is what matters. Note, though, that the F35 rebase is landing in the next release.
What kernel version is included in that release?
@graysky2 it's Fedora's latest kernel.
We had the same problem on OKD 4.8, but after updating to OKD 4.9.0-0.okd-2022-02-12-140851 (kernel 5.14.14-200.fc34.x86_64) the worker nodes now sometimes reboot, but there are no hangs.
We have a similar issue: after upgrading the kernel to 5.14.14, we're facing sudden reboots/crashes without any pattern. My hypothesis is that worker nodes with more crashlooping pods hit this issue more often, but I couldn't reproduce it. Kernel panic/crash logs:
Kernel version:
RPM OSTREE / Fedora Version:
OKD Version:
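For anyone else adding a report here, those details are usually collected like this (assuming `oc` access for the OKD version):

```
uname -r                # running kernel version
rpm-ostree status       # deployed Fedora CoreOS version and any overrides
oc version              # client/server versions
oc get clusterversion   # OKD cluster version
```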
Hi all, we also run an OKD 4.10 cluster and this problem is heavily affecting us right now, to the point where this morning 70% of our cluster (consisting of 7 workers, 3 masters, and 1 special-purpose node in a VM) was suddenly in 'NotReady' state. These are all HP ProLiant DL360 machines of G7/G8/G9 vintage, so a variety of different CPU generations. We have never observed this with the one node that is a VM; it runs on an oVirt cluster, which in turn also runs on HP ProLiant machines. But this node only hosts one pod (our GitLab runner); all the other nodes essentially act as our CI cluster, so they experience a huge variety of loads from software/firmware builds, simulations, Python linting, etc.
We managed to improve the situation by turning Hyper-Threading off in the BIOS, but now even this does not seem to help anymore. So even in the newest kernels there must be a fundamental issue, and this is turning into a very high-priority issue for us, as basically every day we now have multiple boot failures we need to attend to manually by restarting the nodes, and sometimes we even have to repair the file system by hand or delete the CI workspaces because they crashed in the middle of whatever operation. On the last node I checked today I also got the dreaded error.
Any advice on what this could be and how to work around or solve it? This is a huge problem.
@markusdd I had the same issue on that version.
The affected kernel version in my case was the same.
We followed what is documented here: #1249 (comment). We will observe how this behaves; if problems persist, we might think about trying the older one.
I'm going to lock this ticket as it is becoming an attractor for reports of misbehavior on old kernels.
Describe the bug
We are using Typhoon (https://typhoon.psdn.io/fedora-coreos/bare-metal/) to provision Fedora CoreOS and Kubernetes on bare metal. The last stable version without the issue was 34.20210711.3.0. Starting from 34.20210725.3.0 up to 34.20210808.3.0 we started to see system freezes; to force a reboot we have to use the power switch. This is the kernel crash log just before the hard reboot:
The last working testing-devel was 34.20210720.20.0; from 34.20210720.20.1 on it is broken for us. The diff:
We are provisioning with:
We were also provisioning different Kubernetes versions starting with 1.21.0, 1.22.0 and 1.22.1, which all showed the same behavior.
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
No panics, no deadlocks.
Actual behavior
Kernel panics resulting in deadlock.
System details