
OKD 4.10 cluster based on CoreOS 35 exhibits very frequent complete lockups of nodes with "smp_call_function_single multi_cpu_stop" #1249

Closed
markusdd opened this issue Jul 7, 2022 · 5 comments

markusdd commented Jul 7, 2022

Describe the bug
Since I just realized I had commented on a closed ticket, here is a new one.
This all relates to: #940

I'm copying my comment from there below; everything that was said in that ticket still applies:

Hi all, we also run an OKD 4.10 cluster, and this problem is hitting us hard right now, to the point where this morning 70% of our cluster (7 workers, 3 masters, and 1 special-purpose node in a VM) was suddenly in the 'NotReady' state.
In fact, the servers are then so stuck that only a reset or power-cycle helps. In the iLO console you sometimes still see the kernel prints, but no keyboard input works.

These are all HP ProLiant DL360 machines of G7/G8/G9 vintage, so a variety of different CPU generations.

We have never observed this on the one node that is a VM, which runs on an oVirt cluster that is in turn also running on HP ProLiant machines. But that node only hosts one pod (our GitLab runner); all the other nodes essentially act as our CI cluster, so they experience a huge variety of loads from software/firmware builds, simulations, Python linting, etc.

We managed to improve the situation by turning Hyperthreading off in the BIOS, but now even that no longer seems to help.
In the previous CentOS 7-based OKD 3 cluster none of this ever happened, and many of the nodes were migrated recently, so all of them having hardware issues is more than unlikely.
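As an aside, SMT can also be disabled at the OS level instead of in each machine's BIOS. A minimal sketch, assuming a plain Fedora CoreOS node (on an OKD-managed cluster, kernel arguments would normally be rolled out through a MachineConfig rather than set directly on the node):

```
# Append the standard `nosmt` kernel argument via rpm-ostree, then reboot
sudo rpm-ostree kargs --append=nosmt
sudo systemctl reboot

# After reboot, confirm SMT is off
cat /sys/devices/system/cpu/smt/control
```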

So even in the newest kernels there must be a fundamental issue. This is turning into a very high-priority problem for us: basically every day we now have multiple node failures that we have to attend to manually by restarting the nodes, and sometimes we even have to repair the file system by hand or delete the CI workspaces because they crashed in the middle of whatever operation was running.

On the last node I checked today I also saw the dreaded `smp_call_function_single`/`multi_cpu_stop` messages.

Any advice on what this could be and how to work around or solve it? This is a huge problem for us.

Expected behavior
No node lockups so bad that a power-cycle or reset is needed; at least a clean reboot would be helpful, and of course ultimately everything should be stable.

Actual behavior
See the description above for details. We have very frequent lockups (multiple nodes a day) that require hard resets of the nodes, so even interaction via the iLO console is no longer possible.

System details

  • Bare Metal
  • Latest OKD 4.10

Ignition config
Ignition is managed by the OKD 4 installer (the cluster was originally set up on a late Fedora CoreOS 34; by now it is 35; all upgrades were done through OKD 4 using the machine config operator, so this is all chosen automatically).

Additional information
This is increasing in priority, as the failure rate seems to be going up (this morning 70% of the nodes were in this state).

bgilbert (Contributor) commented Jul 7, 2022

Thanks for the report. Please provide the exact Fedora CoreOS version and the specific error message you're seeing. (Ideally as text, but a screenshot or photo is still helpful if that's all you can get.)
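For completeness, a minimal sketch of how that information might be collected on an affected node (standard `rpm-ostree`/`journalctl` tooling; the grep pattern is just an assumption based on the messages reported above):

```
# Exact Fedora CoreOS version and deployment state
rpm-ostree status

# Running kernel version
uname -r

# Kernel messages from the previous (crashed) boot, if the journal persisted
journalctl -k -b -1 | grep -iE 'smp_call_function_single|multi_cpu_stop'
```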

This is likely a kernel problem, and it's apparently one that's been around for a while. That has a couple of consequences. First, there's likely not a lot we can do about it directly; it's probably a matter for the Fedora kernel team or the upstream kernel developers.

The other issue is that you're running an old Fedora CoreOS release and therefore an old kernel. (Your kernel is no newer than 5.17, and we're now shipping 5.18.) I know you're only running an old release because OKD uses it, but since we don't support or maintain non-current releases of Fedora CoreOS, we're unfortunately limited in how much we can help. If you're willing to try updating some nodes to a current Fedora CoreOS release, it'd be useful to know whether the problem still occurs on 5.18. Otherwise I'll have to close this issue and refer you to OKD for further assistance.

markusdd (Author) commented Jul 8, 2022

OK, so currently the cluster runs (`uname -a`): `5.17.13-200.fc35.x86_64 #1 SMP PREEMPT Mon Jun 6 14:38:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`

We have now launched the last available update, which brings the nodes to `5.18.5-100.fc35.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 16 14:44:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`.

We will observe stability on that version and report back (also with screenshots/logs of the kernel messages if possible).
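A quick way to confirm the new kernel has rolled out to every node is the `KERNEL-VERSION` column of `oc get nodes -o wide`; a sketch, assuming standard `oc` usage:

```
# Kernel version per node (wide output includes KERNEL-VERSION)
oc get nodes -o wide

# Or just the fields of interest
oc get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```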

markusdd (Author) commented

Intermediate report: so far no such crashes have been seen again.
We will also go ahead and re-enable Hyperthreading on half of the nodes to see whether it has any impact (a quick way to verify the SMT state is sketched below).
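For verifying which state each node actually ended up in, the kernel exposes the SMT state under sysfs; a minimal sketch (standard kernel interface on any recent Fedora kernel):

```
# Prints 'on', 'off', 'forceoff', or 'notsupported'
cat /sys/devices/system/cpu/smt/control

# Threads per core as seen by userspace
lscpu | grep -i 'thread(s) per core'
```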

markusdd (Author) commented

So it seems that this is fixed. We have had Hyperthreading re-enabled on all nodes for a few days now, and no node has since stopped in this completely locked-up state.
We had one casualty this morning, but that was due to OVN networking core-dumping, so it would be a completely different issue and does not seem to be very frequent.

dustymabe (Member) commented

Thank you @markusdd for keeping us informed here.
