Work around pacman hangs on Windows/ARM64 #4583
This downloads the PR build artifact from msys2/MSYS2-packages#4583 and replaces the `pacman.exe` in `git-sdk-arm64`'s `sync` job, to verify that this actually prevents those pesky, pesky hangs. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1. It is of course to be seen when that particular fix will be available in stable Windows releases. |
Thank you for sharing this! And yes, I'd like to have the work-around in hand earlier than that. |
Let me know if you have any ideas on how I can help test this. I will think on it a bit, maybe just stress-test pacman on Win10 on Raspberry PI and Win11 on VM on dev kit, as I did last time. |
Yes, stress-testing would be good, after installing the package from the PR build. Thank you for all your help @jeremyd2019! |
I just did some testing as well. Created this PR with the package from this PR build installed.
This looks very promising. Thanks a lot for diving in and providing the patch, @dscho! 🎉 |
@dennisameling thank you so much for verifying that the work-around does what we'd hoped it would do! |
I built this PR for i686 msys2, and have been running it on raspberry pi 4/windows 10 for at least 24 hours. It has not hung yet, which is better than it had done before this change. I extracted the 'double-fork' logic from gpgme and am going to test with that next to see if I can reproduce a hang in something simpler than pacman-statically-linked-to-gpgme as a reproducer. (see msys2/msys2-autobuild#62 (comment)) Note that one of the incarnations of this was apparently in pacman calling an install-info hook, which should have nothing to do with gpgme, so I'm not sure this workaround would catch every possible hang. |
Right! It would only be a band-aid for the common case. FWIW I also experienced hangs when code-signing with So this here PR would really only be helping with some scenarios, but since I expect the real fix to still be a ways off, it would be better to have this work-around than not to have it. Unless. Unless that double-fork has a deeper purpose that escapes me (maybe it tries to prevent some attack vector? The commit message is quite mum about the intention of that change. BTW I've edited the PR description to make the information I hid in that |
2a4ab64 to e5f0ade
A couple of questions I want to clear up:
|
It looks like wget only uses gpgme for metalink, and both end up disabled in the build in the msys2 repo because |
Those zombie processes are caused by the way Unix/Linux works: To be able to read the exit code of the exited process, there is an entry in the process table that is kept until someone calls That's not how the Windows kernel works, so I imagine Cygwin emulates this by keeping a separate process running (which might be the very same hanging process that we're seeing). It's not technically a zombie process, though. @jeremyd2019 I will try to find time later this week to see whether this change does leave unwanted "eternally-running" processes behind, unless you beat me to it. |
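The POSIX behaviour described above can be sketched in a few lines of Python. This is only an illustration on a POSIX system (it says nothing about how Cygwin emulates it), and the helper names `reap_child` and `already_reaped` are made up for this example. The exited child's process-table entry stays around until the parent collects the exit status with a `wait()`-family call, after which the entry is gone.

```python
import os


def reap_child(exit_code):
    """Fork a child that exits at once, then collect its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running any Python cleanup.
        os._exit(exit_code)
    # Parent: until waitpid() is called, the exited child remains in the
    # process table as a zombie so that its exit status can still be read.
    _, status = os.waitpid(pid, 0)
    return pid, os.WEXITSTATUS(status)


def already_reaped(pid):
    """Return True if there is no longer a child `pid` to wait for."""
    try:
        os.waitpid(pid, 0)
    except ChildProcessError:
        # ECHILD: the entry was removed by the earlier waitpid() call.
        return True
    return False
```

Calling `reap_child(7)` returns the child's PID together with the exit code 7, and a following `already_reaped(pid)` reports that the entry has been released.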
I haven't had time to check what happens yet, will probably not get time until tomorrow (say, 20ish hours). What I kind of expect to see is that Cygwin's shared memory process table will have a 'zombie' entry in it (in order to properly emulate posix behavior wrt the |
I did a test this evening on x86_64, and I'm not seeing any evidence of zombie processes in |
Well, some "zombie" entries are in |
I wonder whether GPGME should call |
The hangs are still happening to me on 26120.670... maybe that fixed a different hang? |
It was said that an MSYS2 issue was resolved; since I was/am not aware of any other issues, I assumed it must be the pacman hang. I'll ask back for clarification. |
Only slightly related, #4605 reduces the number of gpgme spawns a bit, for example when starting pacman from 27 down to 19:
$ GPGME_DEBUG=7 pacman |& grep "_gpgme_io_spawn: leave" | wc -l
19
And one less spawn per package install, from 3 down to 2. |
6e58dae to 26a0877
When running `pacman` on Windows/ARM64, we frequently run into curious hangs (see msys2/msys2-autobuild#62 for more details). This commit aims to work around that by replacing the double-fork with a single-fork in `_gpgme_io_spawn()`. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
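To make the difference between the two spawn patterns concrete, here is a minimal Python sketch for a POSIX system. It is an illustration of the general technique only, not gpgme's C code, and the names `spawn_double_fork`, `spawn_single_fork` and `_exec_or_die` are hypothetical.

```python
import os


def _exec_or_die(argv):
    """Replace the current process with argv, or exit with 127 on failure."""
    try:
        os.execvp(argv[0], argv)
    except OSError:
        os._exit(127)


def spawn_double_fork(argv):
    """Double-fork: the program runs in a grandchild that nobody waits for."""
    pid = os.fork()
    if pid == 0:
        # Intermediate child: fork once more, then exit right away.
        try:
            grandchild = os.fork()
        except OSError:
            os._exit(1)
        if grandchild == 0:
            _exec_or_die(argv)
        os._exit(0)
    # The caller only reaps the short-lived intermediate child. The
    # grandchild is orphaned and gets reparented by the system.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def spawn_single_fork(argv):
    """Single fork: the program runs in a direct child that the caller reaps."""
    pid = os.fork()
    if pid == 0:
        _exec_or_die(argv)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

With the double-fork variant the caller sees only the intermediate child's exit status, so `spawn_double_fork(["sh", "-c", "exit 3"])` returns 0 while `spawn_single_fork(["sh", "-c", "exit 3"])` returns 3. The single-fork variant also means the caller is now responsible for reaping the real child.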
The `libgpgme` library was just modified to avoid those hangs. Since `pacman.exe` links to that library statically, it needs to be rebuilt to benefit from that work-around. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
26a0877 to 3d88d30
…ARM64 This downloads the PR build artifact of msys2/MSYS2-packages#4583 and installs it in `git-sdk-arm64`. The idea is that the subsequent `sync` job runs as well as the subsequent `build-and-deploy` runs in `git-for-windows-automation` won't hang but instead succeed. Signed-off-by: Dennis Ameling <dennis@dennisameling.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
I received a clarification about what was fixed, so indeed this was not the pacman hang:
|
In case anyone is waiting for me for some reason, my view is that this is not a viable workaround to merge here because:
I am hopeful that someone will help try to figure out what's going on. I am guessing Corinna is on vacation or otherwise offline. If nothing else, maybe I can try to hack around on the wait thread terminate call some more and see what happens. |
Was this issue ever resolved? This issue affects a large number of systems used in production today. |
No. I still have not gotten any reply on cygwin list on this issue, and I've not had any luck so far trying to figure something out on my own. Latest post (after a comment that the thread was previously incorrectly sent to cygwin-developers): https://cygwin.com/pipermail/cygwin/2024-July/256271.html |
The status update is much appreciated, thanks! Let's hope that this gets fixed on arm soon |
Hopefully this is now taken care of in msys2/msys2-runtime#234 |
Yes! I tested this and it worked! Thank you so much @jeremyd2019 for your persistence, time and patience! |
Both in the MSYS2 project as well as in the Git for Windows project, automated builds on Windows/ARM64 are plagued by semi-randomly hanging `pacman` processes.
@jeremyd2019 did a great job diagnosing these, and I also started to dig into this. After many, many months (with many, many breaks), I found this here work-around.
Here is a successful run of Git for Windows' `sync` job that tries to update, commit & push the `git-sdk-arm64` repository. Previously it consistently ran into those hangs, and replacing the `pacman.exe` with the version built in this here PR worked around those hangs.
Technical description
A common symptom is that the hanging process has a command-line that is identical to its parent process' command-line (indicating that it has been `fork()`ed), and anecdotally, the hang occurs when `_exit()` calls `proc_terminate()` which is then blocked by a call to `TerminateThread()` with an invalid thread handle (for more details, see msys2/msys2-autobuild#62 (comment)).
In my tests, I found that the hanging process is spawned from `_gpgme_io_spawn()` which lets the child process immediately spawn another child. That seems like a fantastic way to find timing-related bugs in the MSYS2/Cygwin runtime.
As a work-around, it does seem to help if we avoid that double-fork.
This partially reverts 61aa1947 (... Use a double-fork approach..., 2002-08-28).
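The symptom mentioned above (a process whose command-line is identical to its parent's) can be checked for mechanically. The following Python sketch assumes the process list has already been collected by some other means; the `Proc` record and the `forked_lookalikes` helper are hypothetical names, and it only flags candidates, since a matching command-line alone does not prove a hang.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    """One row of a process listing (hypothetical record for this sketch)."""
    pid: int
    ppid: int
    cmdline: str


def forked_lookalikes(procs):
    """Return processes whose command-line equals their parent's.

    Such a process has most likely been fork()ed without a following
    exec, which matches the symptom described for the hanging process.
    """
    by_pid = {p.pid: p for p in procs}
    return [
        p for p in procs
        if p.ppid in by_pid and by_pid[p.ppid].cmdline == p.cmdline
    ]
```

For example, given a parent and a child that both show the same command-line plus an unrelated child of the same parent, only the look-alike child is reported.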