arm64 builds hangs on install-info a lot #62

Open
lazka opened this issue Aug 26, 2022 · 37 comments

@lazka
Member

lazka commented Aug 26, 2022

Just so we have an issue to link to and discuss maybe

lazka added a commit to lazka/msys2-autobuild that referenced this issue Aug 26, 2022
@jeremyd2019
Member

jeremyd2019 commented Aug 26, 2022

I believe this is a bug in cygwin/msys2-runtime, because it also happened with pacman frequently when it would verify sync db signatures. The workaround I had for that was to append DatabaseNever to SigLevel.
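
For reference, that workaround would look roughly like the fragment below in /etc/pacman.conf. This is only a sketch: the `Required` value is an assumption about what the existing setting is, and only `DatabaseNever` comes from the comment above. It turns off signature verification for the sync databases, while package signatures are still checked.

```ini
[options]
# Assumed existing value: Required.
# DatabaseNever skips signature verification for sync databases only;
# packages are still verified.
SigLevel = Required DatabaseNever
```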

jeremyd2019 added a commit to jeremyd2019/winautoconfig that referenced this issue Aug 27, 2022
Add attempted workaround for install-info.exe hanging, by disabling the
pacman hooks that call it.  msys2/msys2-autobuild#62
@dscho

dscho commented Feb 16, 2023

I wonder whether this is still the case for v3.4.*...

@jeremyd2019
Member

This hang seems to be happening much more often with the new 2023 dev kit machine compared to the qc710. It is now even happening when validating signatures on packages, whereas before it would usually only happen when validating database signatures.

@hmartinez82

hmartinez82 commented Feb 13, 2024

The GIMP project and I have also seen this. There's a comment about it lost in a merge request (GitLab) somewhere. They decided to stop running pacman updates as part of the builds, and the runners now do this in a daily scheduled task with timeout/retry overnight :(

@dscho

dscho commented Feb 13, 2024

I gathered some information that may be helpful for analyzing this issue, and wrote it down here.

@jeremyd2019
Member

jeremyd2019 commented Feb 15, 2024

I wonder if it might possibly be https://cygwin.com/pipermail/cygwin/2024-February/255431.html. Maybe we can try backporting that patch (msys2/msys2-runtime@4e77fa9b8bf4) and see if the issues go away?

@jeremyd2019
Member

jeremyd2019 commented Feb 15, 2024

If anyone else wants to try, I built msys2-runtime and msys2-runtime-3.3 with that patch applied in https://github.com/jeremyd2019/MSYS2-packages/actions/runs/7921543265. I am planning to try some things with it and see what happens.

UPDATE: that seems pretty broken. I'm guessing I didn't backport the fix correctly.

@jeremyd2019
Member

https://github.com/jeremyd2019/MSYS2-packages/actions/runs/7924206550 is at least not as immediately broken 😉. Will test that

@jeremyd2019
Member

I built both 3.4 and 3.3, and 3.3 for 32-bit (which took some doing because any binutils later than 2.40 resulted in a broken msys-2.0.dll). I then set both a Windows 11 VM on the Dev Kit and a Windows 10 install on a Raspberry Pi 4 in a loop running pacman (without disabling db signature checking). The Raspberry Pi did hang, but what I see in the debugger looks different from what I remember. The Dev Kit VM is still going at last check.

@jeremyd2019
Member

I think the 32-bit on the raspberry pi hung up in pinfo::release calling CloseHandle. Not entirely sure which handle. I'd probably try getting a debug build next, but not sure I care that much about 32-bit (just that I remembered the raspberry pi having hang issues more frequently so wanted to test it too)

@jeremyd2019
Member

Looking back at the cygwin thread, it seems that patch was introduced after a report of a hang with 100% CPU usage, rather than the hang with 0 CPU usage that we see, so I'm not sure it's the same issue. I guess I'll keep looking into the pinfo::release hang I see on the raspberry pi as I have time.

@jeremyd2019
Member

With debug build it hung up somewhere different, but doesn't make any more sense. This time it hung up apparently during process teardown, having called _exit(0), eventually getting to proc_terminate, and is hanging calling TerminateThread on what seems to be a valid handle to a thread. I don't see that thread currently running in the debugger though. I don't see any reason why TerminateThread should hang.

@jeremyd2019
Member

The 64-bit msys2 on Windows 11 did eventually hang too ☹️. Having looked at some of the other messages in the thread, I wasn't too positive about that being the issue we were seeing.

dscho added a commit to dscho/MSYS2-packages that referenced this issue May 4, 2024
When running `pacman` on Windows/ARM64, we frequently run into curious
hangs (see msys2/msys2-autobuild#62 for more
details).

This commit aims to work around that by replacing the double-fork with a
single-fork in `_gpgme_io_spawn()`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho

dscho commented May 4, 2024

I finally have some good news. While I am not even close to a fix, I have a work-around: msys2/MSYS2-packages#4583

Here is a run of Git for Windows' sync job that tries to update, commit & push the git-sdk-arm64 repository. Previously it consistently ran into those hangs, and replacing the pacman.exe built in that PR works around those hangs.

By manually observing the hangs (RDPing into those self-hosted runners) I figured out that there were typically around half a dozen hanging processes whose command-lines were identical to their respective parent processes' command-lines. I've tracked that down to libgpgme's _gpgme_io_spawn() function, which calls a fork() that is immediately followed by another fork() in its child process. GPGME's source code history calls this a "double-fork", and it is not really clear to me why this would be needed.

One thing that helped me tremendously while debugging this was the insight that calling that PowerShell script that runs pacman several times (to do both core and system update, followed by another pacman invocation to allow for Git for Windows' hack that is the post-install script of the git-extra package) works when run in a PowerShell window, but consistently hangs when being run in an SSH session. Or, for that matter, a VS Code terminal (which I ran, for convenience, via a VS Code tunnel).

So these are my thoughts how to proceed from here:

  1. Deploying msys2/MSYS2-packages#4583 ("Work around pacman hangs on Windows/ARM64") should work around those Windows/ARM64 hangs.
  2. We still need to dig into this a bit further, to actually understand what is going wrong.
  3. It might be a Windows bug (or the promised Windows update might be just another work-around), we will see...
  4. The work-around I propose points to the double-fork in GPGME as triggering the issue. It won't be trivial, and it will take time, but I am confident, at least, that this can be turned into a small and simple reproducer.
  5. My hunch is that it is not a Windows bug, but that the pseudo terminal/pseudo console code in the MSYS2 runtime that runs every time a new process is forked is still not race-free.
  6. There are other issues in this area of the MSYS2 runtime that might be related, and need to be addressed even if they are not related. For example, @tyan0 said that crashing sub-processes would likely lead to hangs, too.
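
To picture what a single-fork spawn (the idea behind item 1) involves, here is a minimal sketch in plain POSIX C. It is not the patch from msys2/MSYS2-packages#4583 and not GPGME's code; `spawn_single` and `reap_child` are hypothetical names used only for illustration. The point it shows is the trade-off: with one fork, the caller owns the child and has to reap it itself.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork once and exec `path`.
 * Returns the child's pid, or -1 if fork failed. */
pid_t spawn_single(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        execv(path, argv);
        _exit(127);              /* only reached if execv failed */
    }
    return pid;
}

/* With a single fork the caller must wait for the child, otherwise it
 * lingers as a zombie. Returns the exit code, or -1 on error or if the
 * child did not exit normally. */
int reap_child(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pair the two, for example `reap_child(spawn_single("/bin/true", args))`. How the actual work-around deals with reaping is not shown in this thread.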

@jeremyd2019
Member

jeremyd2019 commented May 5, 2024

In that case, I wonder if there's a race between starting up the wait thread and shutting it down during process exit. Assuming the second fork is followed by an exec in the (grand)child, that could further complicate things because I think that there is some magic that shuffles around the process tree to try to make it look as though exec actually replaced the process instead of starting a new one. (I think that may even be involved in the wait thread).

I never did get a good understanding of locking around this code, either. This is why I was trying Interlocked operations, to see if maybe there was a race going on, because I was seeing things in the debugger like handles that were NULL in the struct, but the stack showed a non-NULL handle passed to functions like CloseHandle or TerminateThread. I think I was satisfied that they were moving the handle into a temp variable and nulling it in the struct before closing it, but it felt like it was trying to avoid a race in a not-horribly-effective manner.

As for a Windows bug, I couldn't see any good reason for TerminateThread to block. I was a little concerned that maybe terminating a thread could leave the emulation in a bad state.

@jeremyd2019
Member

> I've tracked that down to libgpgme's _gpgme_io_spawn() function, which calls a fork() that is immediately followed by another fork() in its child process. GPGME's source code history calls this a "double-fork", and it is not really clear to me why this would be needed.

I read the code, and I think I understand what it is trying to do. It has this comment:

/* Intermediate child to prevent zombie processes.  */

As I recall, there is a "rule" on *nix that a parent must wait on a child process (or ignore SIGCHLD), or the child will be kept around in the process tables (as a "zombie"). This code seems to not want to wait for the child, and as a library doesn't want to mess with global state like signal handlers. The intermediate process exits as soon as it forks the child that will exec, so is absolved from having to wait for the child, and the parent process waits on the intermediate process allowing it to be reaped.
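
As a rough illustration of the idiom described above, here is a minimal sketch in plain POSIX C. It is not GPGME's implementation; `spawn_detached` is a hypothetical name, and the exit codes are arbitrary choices for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` detached via a double fork.
 * Returns 0 if the intermediate child reported success, -1 otherwise.
 * The caller never has to wait for the exec'ed process. */
int spawn_detached(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Intermediate child: fork the real worker, then exit at once. */
        pid_t worker = fork();
        if (worker == 0) {
            execv(path, argv);
            _exit(127);          /* only reached if execv failed */
        }
        _exit(worker == -1 ? 1 : 0);
    }

    /* Parent: reap the short-lived intermediate child. The worker is
     * now orphaned, so reaping it is no longer the caller's job. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Note that the return value only says whether the second fork succeeded; the caller learns nothing about the worker's own exit status, which is the price of not having to wait for it.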

@Alovchin91

Alovchin91 commented May 7, 2024

@jeremyd2019 I believe you’re 100% correct, and TIL: http://stackoverflow.com/questions/10932592/why-fork-twice/16655124#16655124

And I think it’s exactly the reason why it un-hangs if I manually kill the “right” pacman process (the intermediate child apparently). Which probably means that it’s actually the intermediate child that hangs (maybe because the grandchild exits too soon?).

Hope this information helps in diagnosing the root cause 🤞

@jeremyd2019
Member

Another wrinkle is that Windows doesn't have a concept of exec like on posix/unix, so Cygwin has to do some funky tricks to spawn a new process but make the posix APIs it provides act as though it is still the same process that had called exec. My personal inkling is that there is a race between that and the teardown of the intermediate child, which is what the execed process is trying to make its parent at the very same time that process is trying to exit.

@jeremyd2019
Member

Well that didn't take long:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef BINARY
#define BINARY "/bin/sleep"
#endif

#ifndef ARG
#define ARG "0.1"
#endif

int main(int argc, char ** argv)
{
    while (1)
    {
        int pid;
        printf("Starting group of 100x " BINARY " " ARG "\n");
        for (int i = 0; i < 100; ++i)
        {
            pid = fork();
            if (pid == -1)
            {
                perror("fork error");
                return 1;
            }
            else if (pid == 0)
            {
                if ((pid = fork()) == 0)
                {
                    char * const args[] = {BINARY, ARG, NULL};
                    execv(BINARY, args);
                    perror("execv failed");
                    _exit(5);
                }
                if (pid == -1)
                    _exit(1);
                else
                    _exit(0);
            }
            else
            {
                int status;
                if (waitpid(pid, &status, 0) == -1)
                {
                    perror("waitpid error");
                    return 2;
                }
                else if (status != 0)
                {
                    fprintf(stderr, "subprocess exited non-zero: %d\n", status);
                    return WEXITSTATUS(status);
                }
            }
        }
    }
    return 0;
}

built that with gcc -DBINARY=\"/bin/true\" -ggdb -o testfork testfork.c and it hung up after 11 lines of the "group of 100x" output.

@jeremyd2019
Member

jeremyd2019 commented May 7, 2024

On the raspberry pi/windows 10 at least, this seems to hang pretty reliably after 11 lines, with either /bin/true or /bin/sleep (leaving 0.1 as the arg). I still can't get gdb to do anything useful, I'm thinking set detach-on-fork off (or follow-fork-mode) doesn't work on cygwin.

@dscho

dscho commented May 8, 2024

@jeremyd2019! This is fantastic news! Excellent work. How about contributing this reproducer to the Cygwin (or cygwin-developers) mailing list?

@dscho

dscho commented May 8, 2024

To the surprise of probably nobody, in my experiments this reproducer does not reproduce on my x86_64 machine, neither with MSYS2 nor with Cygwin.

@jeremyd2019
Member

> @jeremyd2019! This is fantastic news! Excellent work. How about contributing this reproducer to the Cygwin (or cygwin-developers) mailing list?

https://cygwin.com/pipermail/cygwin-developers/2024-May/012694.html

dscho added a commit to dscho/MSYS2-packages that referenced this issue May 10, 2024
@jeremyd2019
Member

> Another wrinkle is that Windows doesn't have a concept of exec like on posix/unix, so Cygwin has to do some funky tricks to spawn a new process but make the posix APIs it provides act as though it is still the same process that had called exec. My personal inkling is that there is a race between that and the teardown of the intermediate child, which is what the execed process is trying to make its parent at the very same time that process is trying to exit.

Hmm, I was looking at the cygwin/msys2-runtime code for something else, and came across this comment:

FIXME: Is there a race here if we run this while another thread is attempting
to exec()?

Don't know, but LOL anyway

@chadlwilson

chadlwilson commented May 13, 2024

I'm really sorry to add a noobish comment, but I have been following this with interest - as I have observed that the same behaviour seems to be exhibited when working with msys2/pacman via an x64 Windows instance during a docker build (hangs with Pacman).

Given the amazing digging you've done here, and the nature of working with Windows containers/docker build it seems possible that there are similar issues with races on the process forks/zombie processes on arm64 that you folks identify here - but within a Docker environment instead. I have seen it during a docker build on Windows Server Core 2022.

The workaround in my case is just to tell msys2-runtime and pacman not to auto-update/run on an ridk install: https://github.com/gocd-contrib/gocd-oss-cookbooks/blob/00c499cfffe6ae3daf0e625a1a61140b18a11a6a/provision/provision-install-packages.ps1#L60-L69

Thought I might add that in case you're looking to find a case where even x64 can exhibit similar problems. I've no idea if msys2/MSYS2-packages#4583 would also address that, but that it might be worth adding to the mix. If it's unnecessary FUD, please ignore me :-)

@driver1998

@chadlwilson I guess one way to find out is see if the reproducer produces similar results in container?

@chadlwilson

> @chadlwilson I guess one way to find out is see if the reproducer produces similar results in container?

Yeah, fair comment. Unfortunately I don't have a super easy way to do so, as I don't have an x64 machine, so I'd have to do it via GHA or similar, which I haven't got around to yet.

dscho added a commit to dscho/MSYS2-packages that referenced this issue May 22, 2024
dscho added a commit to dscho/MSYS2-packages that referenced this issue May 22, 2024
@dennisameling

The new 24h2 version of Windows 11 introduces a new x64 emulation engine for arm64 called Prism. I wonder if that would solve any of the hangs we've been seeing on x64-emulated processes like pacman and other tools...

@Alovchin91

Alovchin91 commented Jun 24, 2024

Note though that Prism isn’t really “new” but it does drop some old ARMv8.0 support code, which might indeed have had an impact I guess. Also, @dscho has mentioned before that it might have been fixed in later builds of 24H2, though it might not be related to Prism (since what’s marketed now as “Prism” presumably appeared in much earlier builds of Germanium).

@hmartinez82

> The new 24h2 version of Windows 11 introduces a new x64 emulation engine for arm64 called Prism. I wonder if that would solve any of the hangs we've been seeing on x64-emulated processes like pacman and other tools...

I bring bad news 😞 . My Surface Laptop Copilot+ PC arrived yesterday, and that comes with Windows 11 24H2. Today I started local dev stuff with MSYS2 and I already saw the pacman freeze twice.

@Biswa96
Member

Biswa96 commented Jun 25, 2024

> I bring bad news 😞 . My Surface Laptop Copilot+ PC arrived yesterday,

Good news 🙂 you can now debug the issue yourself (if possible).

@hmartinez82

hmartinez82 commented Jun 25, 2024

> > I bring bad news 😞 . My Surface Laptop Copilot+ PC arrived yesterday,
>
> Good news 🙂 you can now debug the issue yourself (if possible).

I could before 😅. I have had Snapdragon 8cx laptops (Gen 1, then Gen 3) since 2019. The issue is that I don't know how. If it was Visual Studio I could probably help.

@jeremyd2019
Member

> > I bring bad news 😞 . My Surface Laptop Copilot+ PC arrived yesterday,
>
> Good news 🙂 you can now debug the issue yourself (if possible).

To bring my latest theory to this thread: I believe the issue is happening around the teardown of the wait thread, circa https://github.com/msys2/msys2-runtime/blob/28d69fba269dd4a9f4281f8af7c2775292241e8b/winsup/cygwin/sigproc.cc#L412-L413. My latest concern is that, because the chld_procs array gets rearranged when a child is removed, there is no way for the thread itself to keep a valid pointer with which to NULL out the wait_thread. It apparently thinks it's doing that on https://github.com/msys2/msys2-runtime/blob/28d69fba269dd4a9f4281f8af7c2775292241e8b/winsup/cygwin/pinfo.cc#L1330, but it is actually only NULLing out a stack-variable copy, which it made at the beginning of the thread because it was known that the original pointer may not remain valid. I don't know how to work around this. Have a function that locks and iterates all of chld_procs, comparing wait_thread pointers and NULLing out any that match the one shutting down? Might that deadlock somehow? Who knows, this code is complicated, and at this point I'm just blindly trying things and seeing what happens 😁. But I haven't had time to even do that lately.
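
The "iterate and compare" idea can be sketched in plain C as below. This is not the runtime's code: `struct proc_entry`, `remove_entry` and `clear_wait_thread` are hypothetical stand-ins, the swap-with-last removal is an assumption about how the array gets rearranged, and the locking that a real version would need is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for an entry in the child-process array. */
struct proc_entry {
    int   pid;
    void *wait_thread;   /* handle of the thread waiting on this child */
};

/* Remove entry `idx` by moving the last element into the hole
 * (assumed removal strategy). Entries change position, so a pointer
 * or index taken before the removal may now name a different child. */
size_t remove_entry(struct proc_entry *procs, size_t n, size_t idx)
{
    procs[idx] = procs[n - 1];
    return n - 1;
}

/* Clear every entry whose wait_thread matches `thread` and return how
 * many were cleared. A real version would have to hold the lock that
 * protects the array while scanning. */
size_t clear_wait_thread(struct proc_entry *procs, size_t n, void *thread)
{
    size_t cleared = 0;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].wait_thread == thread) {
            procs[i].wait_thread = NULL;
            cleared++;
        }
    }
    return cleared;
}
```

Searching by handle value avoids relying on a pointer that went stale, and setting the field on a private copy of an entry (as described above) leaves the array untouched. Whether taking the lock at that point can deadlock is exactly the open question raised in the comment.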

@pmsjt

pmsjt commented Sep 23, 2024

Hi folks. I am not well versed in the inner workings of Cygwin but I do know quite well how the emulator works, so I am here to help in as much capacity as I can.

I am told Cygwin does not use RtlCloneUserProcess, but I compiled and ran the test code (as well as pacman) and looked at the state of the hung child and I am seeing something that is not much different than what RtlCloneUserProcess does.

The child process contains a number of threads (one of which is named "sig") and they are all blocked on the emulator's code-cache lock. There is no thread in this process owning the lock. This is what creates the deadlock.

As I mentioned, I don't have any insight on how Cygwin implements fork, but the result suggests that the thread owning the emulator lock hasn't (yet?) been migrated over and, without it, nothing can make forward progress in emulation.

This also explains the time-sensitive nature of this problem. Closer to process startup there is more just-in-time translation going on. This lock is only held when code is being modified. If you fork later, you are less likely to have JIT happening on the source process than when you have fork happening on a fresh process. Just-in-time compilation will also be slower on slower machines, such as a Raspberry Pi, than on a Dev Kit, increasing the chance that fork happens while parts of JIT are still happening on other threads.
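
The general hazard has a well-known POSIX analogue, sketched below: a lock held by another thread at the moment of fork() stays locked in the child, where its owner no longer exists. This is only an analogy with pthreads, not the emulator's or Cygwin's code; `cache_lock`, `holder` and `child_sees_orphaned_lock` are hypothetical names, and calling pthread functions in the child of a multithreaded process is done here purely for demonstration.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;

/* Stand-in for a thread that is busy modifying translated code. */
static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&cache_lock);
    sem_post(&lock_taken);       /* tell the main thread the lock is held */
    sleep(1);                    /* keep holding it across the fork */
    pthread_mutex_unlock(&cache_lock);
    return NULL;
}

/* Returns 1 if the forked child found the lock held with no owner,
 * 0 if the lock was free, -1 on error. */
int child_sees_orphaned_lock(void)
{
    pthread_t t;
    sem_init(&lock_taken, 0, 0);
    if (pthread_create(&t, NULL, holder, NULL) != 0)
        return -1;
    sem_wait(&lock_taken);

    pid_t pid = fork();          /* only the calling thread is copied */
    if (pid == -1) {
        pthread_join(t, NULL);
        return -1;
    }
    if (pid == 0) {
        /* The holder thread does not exist in the child, so nobody
         * will ever unlock; a blocking lock here would hang forever. */
        _exit(pthread_mutex_trylock(&cache_lock) == EBUSY ? 1 : 0);
    }

    int status;
    waitpid(pid, &status, 0);
    pthread_join(t, NULL);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The sketch uses trylock so that it terminates; threads that block on such a lock, as described for the hung child above, would wait indefinitely.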

For the Cygwin/fork experts in this thread, does what I just said ring a bell?

@jeremyd2019
Member

> For the Cygwin/fork experts in this thread, does what I just said ring a bell?

You may need to try asking in cygwin@cygwin.com mailing list, I don't know if any cygwin fork experts are present here. https://cygwin.com/pipermail/cygwin/2024-July/256271.html is the latest in my thread there, IIRC

aiwantaozi pushed a commit to aiwantaozi/MSYS2-packages that referenced this issue Sep 24, 2024
@dscho

dscho commented Sep 25, 2024

> The child process contains a number of threads (one of which is named "sig") and they are all blocked on the emulator's code-cache lock. There is no thread in this process owning the lock. This is what creates the deadlock.

This sounds highly plausible to me.

> Closer to process startup there is more just-in-time translation going on. This lock is only held when code is being modified. If you fork later, you are less likely to have JIT happening on the source process than when you have fork happening on a fresh process.

That is quite an interesting aspect. Now the question is only: would this lock be lifted automatically if the thread lived just a little longer? And maybe the problem is that the Cygwin process exits before those threads could all be joined, assuming that it can rely on exit() to "clean up" the threads?

I wonder whether we can detect that situation in the Cygwin/MSYS2 runtime somewhere close to where the thread is terminating, or maybe this can be done in the emulator just before the thread ends?

Also: would you have any hints how this could be debugged by someone like me, @pmsjt? Maybe some WinDbg wizardry?

@dscho

dscho commented Sep 25, 2024

> I wonder whether we can detect that situation in the Cygwin/MSYS2 runtime somewhere close to where the thread is terminating

This might actually be precisely what the suggested CancelSynchronousIo() call in https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/ accomplishes. What do you think, @pmsjt?

dscho pushed a commit to dscho/msys2-runtime that referenced this issue Sep 26, 2024
On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up.  Usually pacman, when it is trying to validate
> signatures with gpgme.  When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > msys2/msys2-autobuild#62 (comment)).

As a follow-up to this, that was from a proposed workaround of just
commenting out the double-fork behavior in gpgme.  After reading a comment
in the code and doing some research online, it seems the double-fork is an
accepted idiom on posix to avoid having to wait for the (grand)child,
without creating zombie processes.  I was unable to see zombie processes
in ps or /proc/<pid>, but I did see extra cygpid.* entries in
/proc/sys/BaseNamedObjects/cygwin* which seem to be much the same thing.

Today, I was attempting to look at the TerminateThread situation.  The
call in question comes from the attempt to terminate the wait_thread of a
chld_procs entry.  I noticed elsewhere in cygwin code (flock.cc) that
CancelSynchronousIo was being called, and that stood out to me because
chances are that the wait thread (if running) is going to be blocked in
ReadFile.  I am testing with the following hack, and so far have not seen
a hang:

Applied-from: https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
10 participants