Trying to find the kernel commit which makes WSL non-responsive #1

Open
carlfriedrich opened this issue Feb 4, 2024 · 217 comments
@carlfriedrich
Owner

carlfriedrich commented Feb 4, 2024

We're trying to find the kernel commit which makes WSL non-responsive after hibernation, which is described in the issues microsoft/WSL#8696 and microsoft/WSL#6982.

Our starting point

  • @burk3 has described here how to build and use a custom kernel based on the linux-msft-5.4.72 tag. I have followed these steps and have not had the issue in over a year.
  • @onereal7 tried other versions and reported here that the first tag showing the issue is v5.5-rc1. I have built that version as well and can confirm this.
  • The common base of these two tags is v5.4. We assume that the issue does not appear in this version. This has to be confirmed, though.

Bisecting the kernel

We have about 13,000 commits between v5.4 and v5.5-rc1. Using git bisect we should be able to track down the commit introducing the issue within 14 rounds. As a start, I have built the start and end versions and one in between. I will update this table as soon as the versions are confirmed to be working or non-working and add new versions as I continue the bisection. The links in the table lead to the release page for the corresponding version where you can download the kernel image.
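
The estimate of 14 rounds follows from halving the commit range on every test. A minimal sketch of that arithmetic, assuming an idealized bisection over a linear range (the function name is made up for illustration, and git bisect on a history with merges may need a slightly different number of steps):

```python
import math

def max_bisect_rounds(num_commits: int) -> int:
    """Upper bound on the builds to test when each test halves the commit range."""
    return math.ceil(math.log2(num_commits))

# About 13,000 commits between v5.4 and v5.5-rc1
print(max_bisect_rounds(13_000))  # prints 14
```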

Kernel version            Good   Reports good / bad
v5.4                      ✅     6 / 0
v5.4-2622-g386403a115f    ✅     5 / 0
v5.4-2759-ga86f69d3349    ✅     8 / 0
v5.4-2809-ga25bbc2644f    ✅     4 / 0
v5.4-2816-gcd4771f7709    ✅     5 / 0
v5.4-2819-g64d6a12094f    ❌     0 / 3
v5.4-2824-g24ee25a6da8    ❌     1 / 4
v5.4-2841-gda42761df5c    ❌     1 / 4
v5.4-2929-g1d87200446f    ❌     0 / 4
v5.4-3127-g77a05940eee    ❌     0 / 2
v5.4-3434-g3f1b210a7f9    ❌     0 / 2
v5.4-4535-g9a3d7fd275b    ❌     0 / 2
v5.5-rc1                  ❌     0 / 2
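
To illustrate how the report counts in the table above point at one version, here is a small sketch. The counts are copied from the table; the rule that a version counts as bad once bad reports outnumber good ones is an assumption made for this example, not a rule stated in this issue.

```python
# (version, good reports, bad reports), in commit order, copied from the table
REPORTS = [
    ("v5.4", 6, 0),
    ("v5.4-2622-g386403a115f", 5, 0),
    ("v5.4-2759-ga86f69d3349", 8, 0),
    ("v5.4-2809-ga25bbc2644f", 4, 0),
    ("v5.4-2816-gcd4771f7709", 5, 0),
    ("v5.4-2819-g64d6a12094f", 0, 3),
    ("v5.4-2824-g24ee25a6da8", 1, 4),
    ("v5.4-2841-gda42761df5c", 1, 4),
    ("v5.4-2929-g1d87200446f", 0, 4),
    ("v5.4-3127-g77a05940eee", 0, 2),
    ("v5.4-3434-g3f1b210a7f9", 0, 2),
    ("v5.4-4535-g9a3d7fd275b", 0, 2),
    ("v5.5-rc1", 0, 2),
]

def is_bad(good: int, bad: int) -> bool:
    # Assumed rule for illustration: simple majority of reports
    return bad > good

def first_bad(reports):
    """Return the earliest version judged bad, or None if all are good."""
    for version, good, bad in reports:
        if is_bad(good, bad):
            return version
    return None

print(first_bad(REPORTS))  # prints v5.4-2819-g64d6a12094f
```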

How you can help

  • Subscribe to this issue to stay up to date about the bisection.
  • Download the current test version (the one with a ❔ in the above table) and set it up in your WSL instance as described in the README.
  • Leave a comment in this issue, either
    • when WSL becomes unresponsive using this kernel, or
    • if you don't encounter the issue within a week using this kernel.
  • In both cases, include the output of uname --kernel-release in your comment.
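
A report as requested above could be assembled with a few lines of Python. This is only a sketch: the function name and the wording of the message are invented for illustration. `platform.release()` returns the same string as `uname --kernel-release` when run inside the WSL instance.

```python
import platform

def format_report(hung: bool, days_used: int, kernel_release: str = "") -> str:
    """Build a one-line test report; falls back to the running kernel's release."""
    release = kernel_release or platform.release()
    outcome = "WSL became unresponsive" if hung else "no hang"
    return f"{outcome} after {days_used} day(s) on kernel {release}"

# Run inside WSL so that the running kernel is the one under test
print(format_report(hung=False, days_used=7))
```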

I will wait for a reasonable number of reports for each version, so even if somebody else reported a working or non-working version before, please do report your experience as well.

How you cannot help

We're not looking for any workarounds or environment information related to the issue here. I am not a Microsoft developer, so I am not debugging the issue or collecting any information to help solve it.
If you want to share any information of this kind, please do so in one of the upstream issues.

Thanks a lot for your help in advance. 💚


Update

We have found the kernel commit introducing the issue:

Merge commit:
microsoft/WSL2-Linux-Kernel@64d6a12094f3

Atomic commit:
microsoft/WSL2-Linux-Kernel@dce7cd62754b5

From here on I will try to build more recent kernel versions with the commit reverted. Feel free to use these and report your experience.

Kernel version               Good   Reports good / bad   Notes
v5.5-rc1-1-g0622e5f6a3       ✅     3 / 0                v5.5-rc1 with 64d6a12094f3 reverted
v5.5-rc1-2-g0265cf1764       ❌     0 / 3                v5.5-rc1-1-g0622e5f6a3 with dce7cd62754b5 cherry-picked
linux-msft-wsl-5.10.102.2    ✅     3 / 0                linux-msft-wsl-5.10.102.1 with dce7cd62754b5 reverted
linux-msft-wsl-5.15.153.2    ✅     6 / 0                linux-msft-wsl-5.15.153.1 with dce7cd62754b5 reverted

@carlfriedrich carlfriedrich changed the title from "Trying to find the kernel commit making WSL non-responsive" to "Trying to find the kernel commit which makes WSL non-responsive" Feb 4, 2024
@carlfriedrich carlfriedrich self-assigned this Feb 6, 2024
@unwiredben

unwiredben commented Feb 7, 2024

Installed v5.4-4535-g9a3d7fd275b on my laptop this morning and hibernated it while traveling to my office. After about an hour of use after coming out of hibernation, I hit the unresponsive/high-CPU-usage issue and needed to kill the WSL service to recover.

@carlfriedrich
Owner Author

@unwiredben Thanks a lot for testing it out, that is really helpful! Can you also check if v5.4 is working for you?

@unwiredben

I just switched over to 5.4 and will report back in a few days unless I see it hang first.

@onereal7

onereal7 commented Feb 9, 2024

@carlfriedrich, nice setup you have here! :)
I wish I could contribute more now, but a Win10 update a month ago broke hibernation on my machine entirely, so it now almost always acts as a regular shutdown.

@carlfriedrich
Owner Author

Well, then one might say the update fixed the issue for you. 😋

@unwiredben

So far, no hangs with 5.4 across three hibernate cycles.

@mannfuri

Which CPU do you guys use, AMD or INTEL?
The wsl kernel versions of the computers at my company and at home are the same, both are the latest official versions of WSL.
The computer at the company has not had a CPU 100% issue for a long time, but the computer at home still frequently encounters this problem.
The computer at the company uses an INTEL CPU, while the one at my home uses an AMD CPU.

@mannfuri

switched to 5.4 today
I will come back to provide feedback in a while.

@carlfriedrich
Owner Author

@mannfuri Thanks for your feedback. That's quite interesting, actually. I am on Intel on both my work and my home machine, and I get the issue on both. So AMD vs. Intel does not seem to determine whether the issue appears. I remember someone reporting in the upstream issue that they also get the issue on ARM.
There must be some component, though, which makes a difference. According to the comments from Microsoft in the upstream issue, they weren't able to reproduce the issue in any of their environments.
So that's why we, the affected users, are trying to find the bad kernel commit here. We hope that this gives Microsoft a hint about where to look, and maybe we will also find out why it happens only on some machines.
Hence I very much appreciate that you are joining our testing. Thanks a lot!

@tobyvinnell
Copy link

I've just had the usual hang with the current 5.15 kernel version today. I'm keen to help with this effort and have switched to 5.4.0 just now. I'll give that a few days before moving on to v5.4-4535

@carlfriedrich
Owner Author

@tobyvinnell Great, thanks a lot for your help!

@unwiredben

Still no freezing with 5.4. Just to add to the platform discussion, I'm using a Dell Latitude 7430 with an Intel i7-1270P.

@aquohn

aquohn commented Feb 15, 2024

For kernel v5.4-4535-g9a3d7fd275b, I get the following error message when trying to start WSL:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Error code: Wsl/Service/0x8007274c
Press any key to continue...

Anyone else facing the same issue?

@carlfriedrich
Owner Author

@aquohn I might have seen something similar, but in my case it worked the second time I tried to start WSL. Is this reproducible for you?

@carlfriedrich
Owner Author

@aquohn Just checked again: Yes, I get the same message, but calling wsl a second time works for me. Seems like WSL needs more time to boot with this kernel version.

@carlfriedrich
Owner Author

FYI: I have been running v5.4 for over a week now on both my work and home machine without any hangs, and since nobody else has reported a hang so far, I am marking it as "good" in the issue description. I also added a column with the number of good/bad reports for each kernel version, just to keep track of how much feedback each decision is based on. So please keep reporting your experiences, even if we have already marked a version as "good" or "bad".

I will switch to v5.4-4535-g9a3d7fd275b now.

@mungojam

This is like the higgs boson search 🙂

@carlfriedrich
Owner Author

@mungojam I am quite optimistic that we will need less than 40 years for this. :-)

@seebeen

seebeen commented Feb 16, 2024

Hallo. I've been tracking the Interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed complete systemd to have proper lvm mounting on boot I installed XanMod Kernel - 5 days+ no issues with hangs and CPU usage.

Would any of you be willing to give it a test run for a couple of days?
I think it would be relevant to see if I hit some weird perfect storm of settings which doesn't cause the issue, or if this kernel is stable :)

@carlfriedrich
Owner Author

@seebeen Interesting project, haven't heard of that before. We're trying to bisect to a certain commit here, though, so while trying some other kernel images might be interesting in general, it will not help with the progress of this work.

@onereal7

Hi @carlfriedrich, sometimes hibernate does work for me. Last time, after a successful return from hibernation, WSL with the v5.4-4535-g9a3d7fd275b kernel hung.

@mungojam

Somebody in the other thread observed that windows sometimes seems to start in an immune state, and other times not, so make sure you are restarting as well as hibernating when testing a given version.

Sorry I haven't got the space to help with this search.

@onereal7

That was me, haha :)
Actually, more thorough testing (with restarts) is needed when WSL is not hanging and seems to be working well. There were many reports (including mine) where it first seemed that some update, kernel version, or something else had solved the issue, and it later turned out that it had not.
But of course, overall - more testing is better

@aquohn

aquohn commented Feb 17, 2024

@carlfriedrich Unfortunately even with four consecutive wsl commands, the kernel is still not able to boot, and I get the same error message. My .wslconfig is empty except for the

[wsl2]
kernel=...

lines. However, with the 5.4 kernel, I boot on the second wsl command.
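
When comparing boot behaviour between kernels, it can help to double-check which kernel image .wslconfig actually points to. The following is only a sketch: the helper name is made up, and the path in the sample is purely hypothetical (the real value is elided above). It relies on .wslconfig being INI-style, which the standard `configparser` module can read.

```python
import configparser

def kernel_path(wslconfig_text: str):
    """Return the value of kernel= in the [wsl2] section, or None if it is not set."""
    # interpolation=None keeps characters such as % in Windows paths untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(wslconfig_text)
    return parser.get("wsl2", "kernel", fallback=None)

# Hypothetical example content; .wslconfig expects escaped backslashes in paths
sample = r"""[wsl2]
kernel=C:\\Users\\me\\wsl-kernels\\bzImage
"""
print(kernel_path(sample))
```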

@carlfriedrich
Owner Author

carlfriedrich commented Feb 17, 2024

@onereal7 Thanks for your feedback!

We have two reports who had the issue with v5.4-4535-g9a3d7fd275b now, so I marked it "bad" in the issue description and continued the bisection.

The next test candidate is v5.4-2622-g386403a115f. I just switched to this version and will test it over the next few days.

@aquohn Can you check if your boot issue also appears with this version? I encountered it again 2-3 times when trying to boot the new candidate, but on the next try it worked. I don't know why this happens, though.

@ everyone: please keep reporting your experiences with all previous versions as well. The more data we have, the better.

@unwiredben

I'll switch to that shortly. I never had any suspend issue with 5.4.0 but did have a problem using Docker Desktop with it because /proc/sys/vm/compaction_proactiveness was missing on that build. Will check to see when that setting was enabled for current WSL kernels.

@onereal7

onereal7 commented Jul 1, 2024

the Hyper-V folks read the Windows host dump you provided, and it was helpful.

Glad to hear that! Let me know if I can be useful in any way.

@mughees-wyne

I have also been using kernel version "5.10.102.2-microsoft-standard-WSL2" for several days now. Hibernated several times but so far no issues. I will now test the latest linux-msft-wsl-5.15.153.2 and see how it goes.

@fzimmermann89

So far no issues with linux-msft-wsl-5.15.153.2 after a week

@Trondster

Hi - been following your excellent efforts for some time, and decided to join in on the testing.
No issues with linux-msft-wsl-5.15.153.2 after about a week. :)

@mughees-wyne

Tested "linux-msft-wsl-5.15.153.2" for over a week now. So far no issues. Since we seem to have found the commit which was causing the problem, is there a plan to share it with the Microsoft team so that they can integrate the fix into their official releases? Or is more testing needed?

@kelleymh

I've been in contact with the Microsoft people on the Hyper-V and WSL teams about the issue. They are aware of the situation and the relationship between the Linux commit and the underlying root cause, which is in Hyper-V. I'm expecting an update from them on how they want to proceed. Many people extended the U.S. July 4th public holiday last week into a longer vacation, so I expect progress has been slowed by people being out.

@samba2

samba2 commented Jul 16, 2024

Had the same issue. Kernel linux-msft-wsl-5.15.153.2 has now been running flawlessly for a couple of days.

@Endemoniada

I'm trying 5.15.153.2 now, but I'm already seeing an improvement. On the old version (5.15.153.1) I was getting spammed by the warning below constantly, every 6 seconds. With the custom .2 version, I am no longer getting that error. I have no idea if it's related, but if it's not causation, at least it's correlation, which is almost as good ;)

Jul 16 16:09:01.763388 SEW75578 kernel: potentially unexpected fatal signal 6.
Jul 16 16:09:01.766189 SEW75578 kernel: CPU: 10 PID: 30056 Comm: wdavdaemon Not tainted 5.15.153.1-microsoft-standard-WSL2 #1
Jul 16 16:09:01.766262 SEW75578 kernel: RIP: 0033:0x7ff5e108be6c
Jul 16 16:09:01.769133 SEW75578 kernel: Code: ff ff 0f 46 ea eb 99 0f 1f 80 00 00 00 00 b8 ba 00 00 00 0f 05 89 c5 e8 32 d5 04 00 44 89 e2 89 ee 89 c7 b8 ea 00 00 00 0f 05 <89> c5 f7 dd 3d 00 f0 ff ff b8 00 00 00 00 0f 47 c5 48 83 ec 80 5b
Jul 16 16:09:01.770581 SEW75578 kernel: RSP: 002b:00007ff5caff9e10 EFLAGS: 00000246
Jul 16 16:09:01.772018 SEW75578 kernel: RAX: 0000000000000000 RBX: 00007ff5caffb640 RCX: 00007ff5e108be6c
Jul 16 16:09:01.773460 SEW75578 kernel: RDX: 0000000000000006 RSI: 0000000000007568 RDI: 000000000000750c
Jul 16 16:09:01.776543 SEW75578 kernel: RBP: 0000000000007568 R08: 00007ff5caff9ed8 R09: 0000000000000000
Jul 16 16:09:01.778036 SEW75578 kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000006
Jul 16 16:09:01.779459 SEW75578 kernel: R13: 000055f78665cd10 R14: 00007ff5ecf67034 R15: 000055f7867858c0
Jul 16 16:09:01.780890 SEW75578 kernel: FS:  00007ff5caffb640 GS:  0000000000000000
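
To compare such warnings across kernel builds, the process name and kernel release can be pulled out of the "CPU: ... Comm: ..." line. This is a sketch only: the regular expression is tailored to the exact line format shown above and is an assumption about how other entries look.

```python
import re

# Matches lines such as:
# "... kernel: CPU: 10 PID: 30056 Comm: wdavdaemon Not tainted 5.15.153.1-microsoft-standard-WSL2 #1"
CRASH_LINE = re.compile(
    r"PID: (?P<pid>\d+) Comm: (?P<comm>\S+) Not tainted (?P<release>\S+)"
)

def parse_crash_line(line: str):
    """Return pid, process name and kernel release from a crash header line, or None."""
    match = CRASH_LINE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "release": match.group("release"),
    }

line = ("Jul 16 16:09:01.766189 SEW75578 kernel: CPU: 10 PID: 30056 Comm: wdavdaemon "
        "Not tainted 5.15.153.1-microsoft-standard-WSL2 #1")
print(parse_crash_line(line))
```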

@onereal7

@carlfriedrich, I can also confirm that linux-msft-wsl-5.15.153.2 has been running smoothly for 3 weeks with multiple hibernations and restarts.

@Trondster

Have been using 5.15.153.2 for several more weeks now, through multiple hibernations. No issues. :)

@onereal7

onereal7 commented Aug 8, 2024

Hello @kelleymh,
do you have any news about this issue? :)

@joethesaint

joethesaint commented Aug 8, 2024 via email

@mhklinux

mhklinux commented Aug 9, 2024

No news. :-( On Monday, I ping'ed my former colleagues on the Hyper-V and WSL teams again because I hadn't heard anything from them for a while regarding this issue. But they are still looking at the best way to proceed. I'll be a little more proactive in following up.

@thomas-parikka-milliman

Is anyone else here running into issues using specific kernels? An issue was reported at docker/for-win#14240 that might impact the ability to use custom kernels to address the issue reported here.

neira-daniel added a commit to neira-daniel/config-files that referenced this issue Aug 17, 2024
Emacs has some problems with this configuration, but it works
reasonably well.

Known problems:

- The program's performance degrades over time. This shows, for
  example, in the delay before Emacs displays the characters typed on
  the keyboard.
- Emacs tends to stop responding after the computer has been
  hibernated.

The first problem can be partially solved by restarting the program.
But as time passes, the performance of Emacs will drop again.

Although the Emacs profiler has not helped to isolate the code that is
causing problems, it is reasonable to assume that we are using one or
more packages with bugs. The drawback of using straight.el to manage
packages is that we cannot select their stable versions automatically.
This has to be done manually, and we have not done that check.

Another cause of the performance drop could lie in the version of
Emacs that is available in openSUSE-Tumbleweed. This is the only Linux
distribution on WSL in which we have tested this configuration. Other
bugs that are no longer reproducible with it were fixed by updating the
program.

The second problem is due to a bug in the most recent WSL kernel. See
the following discussion to learn more about it:

carlfriedrich/wsl-kernel-build#1

Environment:

- openSUSE-Tumbleweed running on WSL version 2 (up-to-date Windows
  10).
- Kernel: Linux DESKTOP-NULNQSE 5.15.153.1-microsoft-standard-WSL2 #1 SMP
  Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux.
- GNU Emacs 29.4 (build 2, x86_64-suse-linux-gnu, GTK+ Version 3.24.43,
  cairo version 1.18.0)
- Org mode 9.7-pre (release_9.6.25-1345-gb45b39)

This version of Org mode is vulnerable to attacks: it allows arbitrary
execution of both Lisp code and shell commands. Org version 9.7.5 does
not have these security problems.

This commit includes the straight.el lockfile needed to reproduce the
state of each of the installed packages.

It also includes early-init.el, an auxiliary file that lets us
configure Emacs to use only straight.el as its package manager.
@jetvp

jetvp commented Sep 12, 2024

Do we know if this is still being worked on? Perhaps @kelleymh @mhklinux have some insight?

@mhklinux

Yes, there's now some specific action underway from the Microsoft folks to provide a resolution. I have suggested that they might want to update this thread and the WSL Issue #6982 thread when they are ready, which will hopefully be in the next few weeks.

@anodynos

anodynos commented Sep 24, 2024

I guess moving to Linux completely solves this

I never managed to get hibernate working on Linux (Kubuntu). Is it even a thing on other Linux distributions?

@Crypto-Spartan

just ran wsl --update and it put me on 5.15.153.1. I figured I'd get something newer than 5.15.153.2 since others commented here back in August that it was working well for them.

@elmuerte

I tried the kernel that was included with the latest WSL release (2.3.24), and I had the problem again so I reverted to linux-msft-wsl-5.15.153.2

@carlfriedrich
Owner Author

@Crypto-Spartan As noted in this issue's description, 5.15.153.2 is a custom build from here which we built as a community effort in order to verify the bad commit and provide a quick fix to the affected users. The version will not go upstream and hence not be available via wsl --update.

@Crypto-Spartan

Ah, I see that now. Thank you for the explanation/clarification. Apologies, I should have read more carefully.

@borjamunozf

Tested & deployed the patched kernel one week ago with the latest WSL version and a custom config for 3 users in my company who experienced the hibernate/unlock/lock problem with WSL2 almost every time.

No issues, I think we can confirm the root cause is found.
May God bless you all, fantastic work.

@Bilge

Bilge commented Nov 13, 2024

I can't wait for this to be deployed in Windows 12.

@borjamunozf

@carlfriedrich by the way, do you have a .patch file or something similar, so we can patch the kernel automatically in a CI/CD pipeline?

It seems to be less straightforward than I thought: conflicts & stuff.

thanks

@carlfriedrich
Owner Author

@borjamunozf You can generate the patch from the commit on v5.15. It did not, however, apply cleanly on every kernel version. For some versions I had to resolve conflicts manually. v5.15 is the latest version I applied the patch on. If you port it to newer versions, feel free to open a PR on my WSL kernel fork, then I can provide a release here as well.
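
For a CI/CD job, generating and applying the patch could be scripted roughly as follows. This is a sketch under assumptions: the helper names are made up, the hash of the revert commit on the v5.15 branch is not given in this thread and has to be supplied by the caller, and `git am --3way` can still stop on conflicts that need manual resolution, as described above.

```python
import subprocess

def export_patch_cmd(revert_commit: str, out_dir: str) -> list:
    """git command that writes the given commit as a .patch file into out_dir."""
    return ["git", "format-patch", "-1", "-o", out_dir, revert_commit]

def apply_patch_cmd(patch_file: str) -> list:
    """git command that applies a patch file, falling back to a 3-way merge."""
    return ["git", "am", "--3way", patch_file]

def run(cmd: list, repo_dir: str) -> None:
    # check=True makes the CI job fail if git reports an error or a conflict
    subprocess.run(cmd, cwd=repo_dir, check=True)

# Example (not executed here): the commit placeholder must be replaced by the caller
# run(export_patch_cmd("<revert-commit>", "patches"), "/path/to/WSL2-Linux-Kernel")
```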
