
OKD 4.10 cluster based on CoreOS 35 exhibits very frequent complete lockups of nodes with "smp_call_function_single multi_cpu_stop" #1249

Closed
markusdd opened this issue Jul 7, 2022 · 5 comments

markusdd commented Jul 7, 2022

Describe the bug
Since I just realized I had commented on a closed ticket, here is a new one.
This all relates to: #940

I'm copying my comment from there below; everything that was said in that ticket still applies:

Hi all, we also run an OKD 4.10 cluster, and this problem is hitting us hard right now, to the point where this morning 70% of our cluster (7 workers, 3 masters, and 1 special-purpose node in a VM) was suddenly in the 'NotReady' state.
In fact, the servers are then so stuck that only a reset or power-cycle helps. In the iLO console you sometimes still see the kernel prints, but no keyboard input works.

These are all HP ProLiant DL360 machines of G7/G8/G9 vintage, so a variety of different CPU generations.

We have never observed this on the one node that is a VM, which runs on an oVirt cluster that is in turn also running on HP ProLiant machines. But that node only hosts one pod (our GitLab runner); all the other nodes essentially act as our CI cluster, so they experience a huge variety of loads from software/firmware builds, simulations, Python linting, etc.

We managed to improve the situation by turning Hyperthreading off in the BIOS, but now even that no longer seems to help.
In the previous CentOS 7-based OKD 3 cluster none of this ever happened, and many of the nodes were migrated recently, so all of them having hardware issues is more than unlikely.
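As an aside, SMT can also be disabled at the OS level instead of in each machine's BIOS. A minimal sketch, assuming a plain Fedora CoreOS node (on an OKD-managed cluster, kernel arguments would normally be rolled out through a MachineConfig rather than set directly on the node):

```
# Append the standard `nosmt` kernel argument via rpm-ostree, then reboot
sudo rpm-ostree kargs --append=nosmt
sudo systemctl reboot

# After reboot, confirm SMT is off
cat /sys/devices/system/cpu/smt/control
```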

So even in the newest kernels there must be a fundamental issue. This is turning into a very high-priority problem for us: basically every day we now have multiple node failures that we have to attend to manually by restarting the nodes, and sometimes we even have to repair the file system by hand or delete the CI workspaces because they crashed in the middle of whatever operation was running.

On the last node I checked today I also saw the dreaded `smp_call_function_single`/`multi_cpu_stop` messages.

Any advice on what this could be and how to work around or solve it? This is a huge problem for us.

Expected behavior
No node lockups so bad that a power-cycle or reset is needed; at least a clean reboot would be helpful, and of course ultimately everything should be stable.

Actual behavior
See the description above for details. We have very frequent lockups (multiple nodes a day) that require hard resets of the nodes, so even interaction via the iLO console is no longer possible.

System details

  • Bare Metal
  • Latest OKD 4.10

Ignition config
Ignition is managed by the OKD 4 installer (the cluster was originally set up on a late Fedora CoreOS 34; by now it is 35; all upgrades were done through OKD 4 using the machine config operator, so this is all chosen automatically).

Additional information
This is increasing in priority, as the failure rate seems to be going up (this morning 70% of the nodes were in this state).

bgilbert (Contributor) commented Jul 7, 2022

Thanks for the report. Please provide the exact Fedora CoreOS version and the specific error message you're seeing. (Ideally as text, but a screenshot or photo is still helpful if that's all you can get.)
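For completeness, a minimal sketch of how that information might be collected on an affected node (standard `rpm-ostree`/`journalctl` tooling; the grep pattern is just an assumption based on the messages reported above):

```
# Exact Fedora CoreOS version and deployment state
rpm-ostree status

# Running kernel version
uname -r

# Kernel messages from the previous (crashed) boot, if the journal persisted
journalctl -k -b -1 | grep -iE 'smp_call_function_single|multi_cpu_stop'
```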

This is likely a kernel problem, and it's apparently one that's been around for a while. That has a couple of consequences. First, there's likely not a lot we can do about it directly; it's probably a matter for the Fedora kernel team or the upstream kernel developers.

The other issue is that you're running an old Fedora CoreOS release and therefore an old kernel. (Your kernel is no newer than 5.17, and we're now shipping 5.18.) I know you're only running an old release because OKD uses it, but since we don't support or maintain non-current releases of Fedora CoreOS, we're unfortunately limited in how much we can help. If you're willing to try updating some nodes to a current Fedora CoreOS release, it'd be useful to know whether the problem still occurs on 5.18. Otherwise I'll have to close this issue and refer you to OKD for further assistance.

markusdd (Author) commented Jul 8, 2022

OK, so currently the cluster runs (`uname -a`): `5.17.13-200.fc35.x86_64 #1 SMP PREEMPT Mon Jun 6 14:38:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`

We have now launched the last available update, which brings the nodes to `5.18.5-100.fc35.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 16 14:44:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`.

We will observe stability on that version and report back (also with screenshots/logs of the kernel messages if possible).
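A quick way to confirm the new kernel has rolled out to every node is the `KERNEL-VERSION` column of `oc get nodes -o wide`; a sketch, assuming standard `oc` usage:

```
# Kernel version per node (wide output includes KERNEL-VERSION)
oc get nodes -o wide

# Or just the fields of interest
oc get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```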

markusdd (Author) commented

Intermediate report: so far no such crashes have been seen again.
We will also go ahead and re-enable Hyperthreading on half of the nodes to see whether it has any impact (a quick way to verify the SMT state is sketched below).
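For verifying which state each node actually ended up in, the kernel exposes the SMT state under sysfs; a minimal sketch (standard kernel interface on any recent Fedora kernel):

```
# Prints 'on', 'off', 'forceoff', or 'notsupported'
cat /sys/devices/system/cpu/smt/control

# Threads per core as seen by userspace
lscpu | grep -i 'thread(s) per core'
```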

markusdd (Author) commented

So it seems that this is fixed. We have had Hyperthreading re-enabled on all nodes for a few days now, and no node has since stopped in this completely locked-up state.
We had one casualty this morning, but that was due to OVN networking core-dumping, so it would be a completely different issue and does not seem to be very frequent.

dustymabe (Member) commented

Thank you @markusdd for keeping us informed here.
