Unhandled exception causes docker container to hang on ARM64 #66707

Closed

blaskoa opened this issue Mar 16, 2022 · 14 comments

Comments
@blaskoa

blaskoa commented Mar 16, 2022

Description

Dotnet applications running in docker containers on arm64 stop responding when an unhandled exception occurs and keep a single CPU core at 100% utilization (as measured by docker stats).

I tested the issue on the following environments:

  • Jetson Xavier NX Developer Kit
  • M1 MacBook
  • Windows/WSL with qemu emulation of arm64 (default emulation setup by docker desktop)

The only difference between the environments is that with qemu an additional line is present in the output and the shell stops responding to Ctrl+C signals:

qemu: uncaught target signal 6 (Aborted) - core dumped

Issue is only present when the dotnet process is set as the docker entrypoint, e.g. ENTRYPOINT [ "dotnet", "arm-dotnet-repro.dll" ]
Issue is NOT present when running the dotnet binaries natively outside of docker containers, directly on the host.
Issue is NOT present when wrapping the entrypoint in a shell script (see the reproduction repo).
Issue is NOT present when the dotnet binary is executed via an interactive shell in the docker container, e.g. docker run --rm -it --entrypoint bash ex-repro:arm64-1 and then calling dotnet arm-dotnet-repro.dll inside the container shell.

The behavior is present on the dotnet 5 and dotnet 6 runtimes (these are the only ones I tested).
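For context, the reproducer is essentially just a console app that throws from Main; a minimal sketch (type and message are illustrative, the exact project is in the repo linked under Reproduction Steps) looks like this:

using System;

public class Program
{
    public static void Main(string[] args)
    {
        // Any unhandled exception takes the same path: the runtime reports
        // the crash and then calls abort(), which is where the ARM64
        // container hangs instead of exiting.
        throw new InvalidOperationException("Unhandled on purpose");
    }
}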

Reproduction Steps

See https://github.com/blaskoa/arm-dotnet-repro for a minimal reproducer, along with some workarounds I found.

Expected behavior

On an unhandled exception the container should exit with a non-zero exit code.

Basically the same behavior as on x64.
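For reference, on x64 this can be verified with something like docker run --rm <image>; echo $? (where <image> is a placeholder for the reproducer image), which prints a non-zero code such as 134 (128 + SIGABRT) after the runtime aborts.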

Actual behavior

On an unhandled exception the container keeps "running" and utilizes one CPU core at 100%.

Regression?

No response

Known Workarounds

See https://github.com/blaskoa/arm-dotnet-repro.
Some of the workarounds have side effects that might render them unsuitable.

Configuration

No response

Other information

No response

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

dotnet-issue-labeler bot added the untriaged label Mar 16, 2022
@janvorli
Member

I can confirm the bad behavior. I am investigating it.

janvorli self-assigned this Mar 16, 2022
janvorli removed the untriaged label Mar 16, 2022
@janvorli
Member

This is quite strange. The hang occurs in the abort implementation in glibc, here: https://code.woboq.org/userspace/glibc/stdlib/abort.c.html#107
It keeps looping on the ABORT_INSTRUCTION, invoking the SIGTRAP handler and then retrying the instruction again and again, forever. It also means that SIGABRT didn't abort the process as expected, as that is tried before the ABORT_INSTRUCTION is used.

@tmds
Member

tmds commented Mar 28, 2022

It keeps looping on the ABORT_INSTRUCTION, invoking the SIGTRAP handler and then retrying the instruction again and again forever. It also means that the SIGABRT didn't abort the process as expected, as that's tried before the ABORT_INSTRUCTION is used.

@janvorli how does SIGTRAP come into play here? Or did you mean SIGABRT?

Does it also happen with a simple C program that calls abort?

@janvorli
Member

The ABORT_INSTRUCTION is an invalid instruction that abort in the standard C library seems to use as a last resort to make the process exit if SIGABRT didn't have any effect, so it is handled by SIGTRAP. At least that's what I've figured out from the abort source code.
I don't think it would happen with a simple C test calling abort. I think it is somehow related to how we set up / clean up the SIGABRT (and maybe also SIGTRAP) handlers.

@bgever

bgever commented Jun 11, 2022

This seems to be related to #69923, which also causes other runtime issues on my M1/AArch64 MacBook Pro.

When the environment variable below is added to the Dockerfile, the hang no longer happens and the container exits properly upon an unhandled exception.

ENV COMPlus_ZapDisable=1

UPDATE: Apologies, I had a cached layer that still contained an explicit Environment.Exit(1); statement; the above does not actually work.

Thank you for your example repo @blaskoa; the AppDomain.CurrentDomain.UnhandledException handler with an explicit exit works great for me.
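For others landing here, a minimal sketch of that workaround (the handler body, exit code, and the deliberate throw are illustrative; the repo shows the exact variant) could look like:

using System;

public class Program
{
    public static void Main(string[] args)
    {
        // Exit explicitly on any unhandled exception instead of letting
        // the runtime fall through to the abort() path that hangs on ARM64.
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
        {
            Console.Error.WriteLine(e.ExceptionObject);
            Environment.Exit(1);
        };

        throw new InvalidOperationException("Unhandled on purpose");
    }
}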

janvorli added this to the 7.0.0 milestone Jun 23, 2022
@mangod9
Member

mangod9 commented Aug 11, 2022

Hi @AntonLapounov, since Jan is out, could you please check if this repros on .NET 6? If so, it might not block 7.

@AntonLapounov
Member

From the original post:

The behavior is present on dotnet 5 and dotnet 6 runtimes (I just tested these ones).

@mangod9
Member

mangod9 commented Aug 11, 2022

Ah ok, sorry, I missed that. We can leave it in 7 to investigate, but it probably would not make it into 7.

mangod9 modified the milestones: 7.0.0, 8.0.0 Aug 15, 2022
@turowicz

Any updates on this? Having difficulties running in production. We are trying to migrate to ARM64 servers.

@turowicz

turowicz commented Sep 26, 2022

Trying this:

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class ProgramHelper
{
    public static async Task<int> RunAsync(Func<IHost> hostFactory)
    {
        try
        {
            using (var host = hostFactory())
            {
                var provider = host.Services;
                // The non-generic ILogger is not registered by default,
                // so resolve a typed logger instead.
                var logger = provider.GetService<ILogger<ProgramHelper>>();

                try
                {
                    await host.RunAsync();
                }
                catch (Exception exception)
                {
                    logger?.LogError(exception, "Host terminated unexpectedly");
                    return 1;
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
            Console.WriteLine(ex.StackTrace);

            return 2;
        }

        return 0;
    }
}

Use:

public class Program
{
    public static async Task<int> Main(string[] args)
    {
        return await ProgramHelper.RunAsync(() => CreateHost(args));
    }

    public static IHost CreateHost(string[] args)
    {
        var host = Host.CreateDefaultBuilder(args)
            ...
        return host;
    }
}

@janvorli
Member

I have investigated this issue. The reason for the different behavior between running bash in the container and then starting the .NET binary manually vs. running the .NET binary as the entrypoint is interesting. The difference is that in the bash case there are two processes running in the container, the .NET one being the second. In the entrypoint case there is just the .NET process, and Linux considers it to be an init process, which handles signals like SIGABRT differently. For an init process the default is to ignore these signals, while in the other case the default disposition tears down the process. See https://ddanilov.me/how-signals-are-handled-in-a-docker-container for a very nice description.

In the case this issue is about, the abort C library function first sends SIGABRT to our process. That signal is ignored, so abort tries a second way: it executes a brk machine instruction, which results in SIGTRAP. We have a handler installed for that signal, but we don't handle SIGTRAP coming from an external location (libc in this case) and just return back to the original code. That triggers SIGTRAP again, and so it goes on forever.

I was going to fix it, but I've found that a recent change with a different purpose already fixed the problem. Before the change, we were unregistering only our SIGABRT handler, so the SIGTRAP handler stayed in place. After that change, all signal handlers are unregistered. The change is #80474 by @mikem8361.

It seems we want to backport this change to .NET 7.

@blaskoa as a workaround, you can add the --init command line option to the docker command that runs the app. That ensures that the .NET app runs with standard signal handling.
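For example, reusing the image name from the reproduction steps above:

docker run --init --rm ex-repro:arm64-1

The --init flag makes Docker run a small init process as PID 1 with the app as its child, so the .NET process no longer has the init-process signal semantics described above.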

@mikem8361
Member

@janvorli do you want to backport this or should I?

@janvorli
Member

@mikem8361 if you could do it, it would be great!

ghost locked as resolved and limited conversation to collaborators Feb 15, 2023