Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around pacman hangs on Windows/ARM64 #4583

Closed

Conversation

dscho
Copy link
Contributor

@dscho dscho commented May 4, 2024

Both in the MSYS2 project as well as in the Git for Windows project, automated builds on Windows/ARM64 are plagued by semi-randomly hanging pacman processes.

@jeremyd2019 did a great job diagnosing these, and I also started to dig into this. After many, many months (with many, many breaks), I found this here work-around.

Here is a successful run of Git for Windows' sync job that tries to update, commit & push the git-sdk-arm64 repository. Previously it consistently ran into those hangs, and replacing the pacman.exe with the version built in this here PR worked around those hangs.

Technical description

A common symptom is that the hanging process has a command-line that is identical to its parent process' command-line (indicating that it has been fork()ed), and anecdotally, the hang occurs when _exit() calls proc_terminate() which is then blocked by a call to TerminateThread() with an invalid thread handle (for more details, see msys2/msys2-autobuild#62 (comment)).

In my tests, I found that the hanging process is spawned from _gpgme_io_spawn() which lets the child process immediately spawn another child. That seems like a fantastic way to find timing-related bugs in the MSYS2/Cygwin runtime.

As a work-around, it does seem to help if we avoid that double-fork.

This partially reverts 61aa1947 (... Use a double-fork approach..., 2002-08-28).

dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 4, 2024
This downloads the PR build artifact from
msys2/MSYS2-packages#4583 and replaces the
`pacman.exe` in `git-sdk-arm64`'s `sync` job, to verify that this
actually prevents those pesky, pesky hangs.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@Wormnest
Copy link

Wormnest commented May 4, 2024

In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1. It is of course to be seen when that particular fix will be available in stable Windows releases.

@dscho
Copy link
Contributor Author

dscho commented May 4, 2024

In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1. It is of course to be seen when that particular fix will be available in stable Windows releases.

Thank you for sharing this! And yes, I'd like to have the work-around in hand earlier than that.

@jeremyd2019
Copy link
Member

Let me know if you have any ideas on how I can help test this. I will think on it a bit, maybe just stress-test pacman on Win10 on Raspberry PI and Win11 on VM on dev kit, as I did last time.

@dscho
Copy link
Contributor Author

dscho commented May 6, 2024

Let me know if you have any ideas on how I can help test this. I will think on it a bit, maybe just stress-test pacman on Win10 on Raspberry PI and Win11 on VM on dev kit, as I did last time.

Yes, stress-testing would be good, after installing the package from the PR build. Thank you for all your help @jeremyd2019!

@dennisameling
Copy link

dennisameling commented May 6, 2024

I just did some testing as well. Created this PR with the package from this PR build installed.

  • I ran the sync workflow 17 times. We went from a (close to) 0% success rate before applying this patch, to a 100% success rate after applying it.
  • Did a bunch of pacman installs locally on my Surface Pro X, didn't experience hangs anymore.
  • Created an Azure ARM64 VM in my own Azure account and tested there as well. Also, no more hangs!

This looks very promising. Thanks a lot for diving in and providing the patch, @dscho! 🎉

@dscho
Copy link
Contributor Author

dscho commented May 6, 2024

@dennisameling thank you so much for verifying that the work-around does what we'd hoped it would do!

@jeremyd2019
Copy link
Member

jeremyd2019 commented May 7, 2024

I built this PR for i686 msys2, and have been running it on raspberry pi 4/windows 10 for at least 24 hours. It has not hung yet, which is better than it had done before this change.

I extracted the 'double-fork' logic from gpgme and am going to test with that next to see if I can reproduce a hang in something simpler than pacman-statically-linked-to-gpgme as a reproducer. (see msys2/msys2-autobuild#62 (comment))

Note that one of the incarnations of this was apparently in pacman calling an install-info hook, which should have nothing to do with gpgme, so I'm not sure this workaround would catch every possible hang.

@dscho
Copy link
Contributor Author

dscho commented May 8, 2024

Note that one of the incarnations of this was apparently in pacman calling an install-info hook, which should have nothing to do with gpgme, so I'm not sure this workaround would catch every possible hang.

Right! It would only be a band-aid for the common case.

FWIW I also experienced hangs when code-signing with osslsigncode, which would also not be covered.

So this here PR would really only be helping with some scenarios, but since I expect the real fix to still be a ways off, it would be better to have this work-around than not to have it.

Unless.

Unless that double-fork has a deeper purpose that escapes me (maybe it tries to prevent some attack vector? The commit message is quite mum about the intention of that change.

BTW I've edited the PR description to make the information I hid in that .patch message more discoverable.

@dscho dscho force-pushed the work-around-pacman-hangs-on-windows-arm64 branch from 2a4ab64 to e5f0ade Compare May 10, 2024 07:48
@jeremyd2019
Copy link
Member

A couple of questions I want to clear up:

  1. will this change result in zombie processes? I think on a real posix system, it would result in zombies for the lifetime of the process which invokes gpgme functions. I don't know if Cygwin replicates this particular behavior of posix though. I will try to test this and see.
  2. what else might be affected by the change to gpgme? Most seem to be pacman-related, but wget was unexpected:
    'wget', 'mutt', 'libgpgme-python', 'pacutils', 'pacman-contrib', 'python-pyalpm', 'pacman'

@jeremyd2019
Copy link
Member

jeremyd2019 commented May 10, 2024

It looks like wget only uses gpgme for metalink, and both end up disabled in the build in the msys2 repo because --without-metalink is specified to configure. It could therefore be removed from depends/makedepends of wget.

@dscho
Copy link
Contributor Author

dscho commented May 12, 2024

  1. will this change result in zombie processes?

Those zombie processes are caused by the way Unix/Linux works: To be able to read the exit code of the exited process, there is an entry in the process table that is kept until someone calls waitpid() with the corresponding pid.

That's not how the Windows kernel works, so I imagine Cygwin emulates this by keeping a separate process running (which might be the very same hanging process that we're seeing). It's not technically a zombie process, though.

@jeremyd2019 I will try to find time later this week to see whether this change does leave unwanted "eternally-running" processes behind, unless you beat me to it.

@jeremyd2019
Copy link
Member

I haven't had time to check what happens yet, will probably not get time until tomorrow (say, 20ish hours).

What I kind of expect to see is that Cygwin's shared memory process table will have a 'zombie' entry in it (in order to properly emulate posix behavior wrt the wait family of functions), but that the Windows process exits as expected. I do not know whether Cygwin would keep a Windows HANDLE open to the exited process (to call GetExitCodeProcess when somebody calls wait) or if they take care of that in a wait thread somewhere and thus do not leave 'zombie' Windows processes around (if the object still has outstanding HANDLE references it cannot be freed in the kernel). I thought about asking on cygwin@cygwin.com about what happens in this case, but it would be easy enough to test and find out.

@jeremyd2019
Copy link
Member

I did a test this evening on x86_64, and I'm not seeing any evidence of zombie processes in ps, /proc/*, or in task manager when removing inner fork and waitpid calls from my test program as you did in the patch on this PR. I'm sure cygwin must keep track of the exit code of the processes somehow because it cannot know nobody will ever call wait, but it doesn't seem to expose "zombie" entries in ps.

@jeremyd2019
Copy link
Member

jeremyd2019 commented May 13, 2024

Well, some "zombie" entries are in /proc/sys/BaseNamedObjects/msys-2.0*/cygpid.*. Also the entries in chld_procs. They look like they have valid hProcess members, so there are probably kernel process objects (EPROCESS? been a while since I've poked these things in kernel mode) still being kept around too

@dscho
Copy link
Contributor Author

dscho commented May 13, 2024

I wonder whether GPGME should call waitpid() somewhere, which is the recommended way to get rid of zombie processes as far as I can tell...

@reflectronic
Copy link

In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1

The hangs are still happening to me on 26120.670... maybe that fixed a different hang?

@Wormnest
Copy link

The hangs are still happening to me on 26120.670... maybe that fixed a different hang?

It was said that an MSYS2 issue was resolved, since I was/am not aware of any other issues, I assumed it must be the pacman hang. I'll ask back for clarification.

@lazka
Copy link
Member

lazka commented May 20, 2024

Only slightly related, #4605 reduces the amount of gpgme spawns a bit, for example when starting pacman from 27 down to 19

$ GPGME_DEBUG=7 pacman |& grep "_gpgme_io_spawn: leave" | wc -l
19

And one less spawn per package install, from 3 down to 2.

@dscho dscho force-pushed the work-around-pacman-hangs-on-windows-arm64 branch 2 times, most recently from 6e58dae to 26a0877 Compare May 22, 2024 17:50
When running `pacman` on Windows/ARM64, we frequently run into curious
hangs (see msys2/msys2-autobuild#62 for more
details).

This commit aims to work around that by replacing the double-fork with a
single-fork in `_gpgme_io_spawn()`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The `libgpgme` library was just modified to avoid those hangs. Since
`pacman.exe` links to that library statically, it needs to be rebuilt to
benefit from that work-around.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho force-pushed the work-around-pacman-hangs-on-windows-arm64 branch from 26a0877 to 3d88d30 Compare May 22, 2024 18:47
dscho pushed a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 22, 2024
…ARM64

This downloads the PR build artifact of
msys2/MSYS2-packages#4583 and installs it in
`git-sdk-arm64`.

The idea is that the subsequent `sync` job runs as well as the
subsequent `build-and-deploy` runs in `git-for-windows-automation` won't
hang but instead succeed.

Signed-off-by: Dennis Ameling <dennis@dennisameling.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho pushed a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 22, 2024
…ARM64

This downloads the PR build artifact of
msys2/MSYS2-packages#4583 and installs it in
`git-sdk-arm64`.

The idea is that the subsequent `sync` job runs as well as the
subsequent `build-and-deploy` runs in `git-for-windows-automation` won't
hang but instead succeed.

Signed-off-by: Dennis Ameling <dennis@dennisameling.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho pushed a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 22, 2024
…ARM64

This downloads the PR build artifact of
msys2/MSYS2-packages#4583 and installs it in
`git-sdk-arm64`.

The idea is that the subsequent `sync` job runs as well as the
subsequent `build-and-deploy` runs in `git-for-windows-automation` won't
hang but instead succeed.

Signed-off-by: Dennis Ameling <dennis@dennisameling.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho pushed a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 22, 2024
…ARM64

This downloads the PR build artifact of
msys2/MSYS2-packages#4583 and installs it in
`git-sdk-arm64`.

The idea is that the subsequent `sync` job runs as well as the
subsequent `build-and-deploy` runs in `git-for-windows-automation` won't
hang but instead succeed.

Signed-off-by: Dennis Ameling <dennis@dennisameling.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho pushed a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 22, 2024
…ARM64

This downloads the PR build artifact of
msys2/MSYS2-packages#4583 and installs it in
`git-sdk-arm64`.

The idea is that the subsequent `sync` job runs as well as the
subsequent `build-and-deploy` runs in `git-for-windows-automation` won't
hang but instead succeed.

Signed-off-by: Dennis Ameling <dennis@dennisameling.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho pushed a commit to git-for-windows/git-sdk-arm64 that referenced this pull request May 22, 2024
…ARM64

This downloads the PR build artifact of
msys2/MSYS2-packages#4583 and installs it in
`git-sdk-arm64`.

The idea is that the subsequent `sync` job runs as well as the
subsequent `build-and-deploy` runs in `git-for-windows-automation` won't
hang but instead succeed.

Signed-off-by: Dennis Ameling <dennis@dennisameling.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@Wormnest
Copy link

I received a clarification about what was fixed, so indeed this was not the pacman hang:

Below was issue seen (msys2 command hang), that was resolved in build 26080.

1.	Download https://repo.msys2.org/distrib/x86_64/msys2-base-x86_64-20221028.tar.xz
2.	Extract the archive
3.	"cd" to the extracted archive in a clean CMD window
4.	Execute "msys2_shell.cmd -defterm -no-start -clang64 -c exit"
5.	Program hangs, with bash taking up 10% cpu

@jeremyd2019
Copy link
Member

In case anyone is waiting for me for some reason, my view is that this is not a viable workaround to merge here because:

  1. it will result in leaked cygpids ("zombie" processes) for the lifetime of the process calling gpgme.
  2. this will happen for all consumers of gpgme, not just pacman (though there aren't many non-pacman-related consumers of gpgme in MSYS2-packages)
  3. this will happen for all x86_64 systems, which are not experiencing the issue, to work around an issue on arm64 systems, which are much fewer.

I am hopeful that someone will help try to figure out what's going on. I am guessing Corinna is on vacation or otherwise offline. If nothing else, maybe I can try to hack around on the wait thread terminate call some more and see what happens.

@tejasraman
Copy link

tejasraman commented Jul 31, 2024

Was this issue ever resolved? This issue affects a large number of systems used in production today

@jeremyd2019
Copy link
Member

No. I still have not gotten any reply on cygwin list on this issue, and I've not had any luck so far trying to figure something out on my own. Latest post (after a comment that the thread was previously incorrectly sent to cygwin-developers): https://cygwin.com/pipermail/cygwin/2024-July/256271.html

@tejasraman
Copy link

No. I still have not gotten any reply on cygwin list on this issue, and I've not had any luck so far trying to figure something out on my own. Latest post (after a comment that the thread was previously incorrectly sent to cygwin-developers): https://cygwin.com/pipermail/cygwin/2024-July/256271.html

The status update is much appreciated, thanks! Let's hope that this gets fixed on arm soon

@jeremyd2019
Copy link
Member

Hopefully this is now taken care of in msys2/msys2-runtime#234

@dscho dscho deleted the work-around-pacman-hangs-on-windows-arm64 branch November 14, 2024 13:30
@dscho
Copy link
Contributor Author

dscho commented Nov 14, 2024

Hopefully this is now taken care of in msys2/msys2-runtime#234

Yes! I tested this and it worked!

Thank you so much @jeremyd2019 for your persistence, time and patience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants