Unhandled exception causes docker container to hang on ARM64 #66707

Closed

blaskoa opened this issue Mar 16, 2022 · 14 comments

Comments
@blaskoa

blaskoa commented Mar 16, 2022

Description

Dotnet applications running in docker containers on arm64 stop responding when an unhandled exception occurs and keep a single CPU core at 100% utilization (as measured by docker stats).

I tested the issue on the following environments:

  • Jetson Xavier NX Developer Kit
  • M1 MacBook
  • Windows/WSL with qemu emulation of arm64 (default emulation setup by docker desktop)

The only difference between the environments is that with qemu an additional line is present in the output and the shell stops responding to Ctrl+C signals:

qemu: uncaught target signal 6 (Aborted) - core dumped

Issue is only present when the dotnet process is set as the docker entrypoint, e.g. ENTRYPOINT [ "dotnet", "arm-dotnet-repro.dll" ]
Issue is NOT present when running the dotnet binaries natively outside of docker containers, directly on the host.
Issue is NOT present when wrapping the entrypoint in a shell script (see the reproduction repo).
Issue is NOT present when the dotnet binary is executed via an interactive shell in the docker container, e.g. docker run --rm -it --entrypoint bash ex-repro:arm64-1 and then calling dotnet arm-dotnet-repro.dll inside the container shell.

The behavior is present on the dotnet 5 and dotnet 6 runtimes (these are the only ones I tested).
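For context, the reproducer is essentially just a console app that throws from Main; a minimal sketch (type and message are illustrative, the exact project is in the repo linked under Reproduction Steps) looks like this:

using System;

public class Program
{
    public static void Main(string[] args)
    {
        // Any unhandled exception takes the same path: the runtime reports
        // the crash and then calls abort(), which is where the ARM64
        // container hangs instead of exiting.
        throw new InvalidOperationException("Unhandled on purpose");
    }
}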

Reproduction Steps

See https://github.com/blaskoa/arm-dotnet-repro for a minimal reproducer, along with some workarounds I found.

Expected behavior

On an unhandled exception the container should exit with a non-zero exit code.

Basically the same behavior as on x64.
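For reference, on x64 this can be verified with something like docker run --rm <image>; echo $? (where <image> is a placeholder for the reproducer image), which prints a non-zero code such as 134 (128 + SIGABRT) after the runtime aborts.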

Actual behavior

On an unhandled exception the container keeps "running" and utilizes one CPU core at 100%.

Regression?

No response

Known Workarounds

See https://github.com/blaskoa/arm-dotnet-repro.
Some of the workarounds have side effects that might render them unsuitable.

Configuration

No response

Other information

No response

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

dotnet-issue-labeler bot added the untriaged label Mar 16, 2022
@janvorli
Member

I can confirm the bad behavior. I am investigating it.

janvorli self-assigned this Mar 16, 2022
janvorli removed the untriaged label Mar 16, 2022
@janvorli
Member

This is quite strange. The hang occurs in the abort implementation in glibc, here: https://code.woboq.org/userspace/glibc/stdlib/abort.c.html#107
It keeps looping on the ABORT_INSTRUCTION, invoking the SIGTRAP handler and then retrying the instruction again and again, forever. It also means that SIGABRT didn't abort the process as expected, as that is tried before the ABORT_INSTRUCTION is used.

@tmds
Member

tmds commented Mar 28, 2022

It keeps looping on the ABORT_INSTRUCTION, invoking the SIGTRAP handler and then retrying the instruction again and again forever. It also means that the SIGABRT didn't abort the process as expected, as that's tried before the ABORT_INSTRUCTION is used.

@janvorli how does SIGTRAP come into play here? Or did you mean SIGABRT?

Does it also happen with a simple C program that calls abort?

@janvorli
Member

The ABORT_INSTRUCTION is an invalid instruction that abort in the standard C library seems to use as a last resort to make the process exit if SIGABRT didn't have any effect, so it is handled by SIGTRAP. At least that's what I've figured out from the abort source code.
I don't think it would happen with a simple C test calling abort. I think it is somehow related to how we set up / clean up the SIGABRT (and maybe also SIGTRAP) handlers.

@bgever

bgever commented Jun 11, 2022

This seems to be related to #69923, which also causes other runtime issues on my M1/AArch64 MacBook Pro.

When the environment variable below is added to the Dockerfile, the hang no longer happens and the container exits properly upon an unhandled exception.

ENV COMPlus_ZapDisable=1

UPDATE: Apologies, I had a cached layer that still contained an explicit Environment.Exit(1); statement; the above does not actually work.

Thank you for your example repo @blaskoa; the AppDomain.CurrentDomain.UnhandledException handler with an explicit exit works great for me.
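For others landing here, a minimal sketch of that workaround (the handler body, exit code, and the deliberate throw are illustrative; the repo shows the exact variant) could look like:

using System;

public class Program
{
    public static void Main(string[] args)
    {
        // Exit explicitly on any unhandled exception instead of letting
        // the runtime fall through to the abort() path that hangs on ARM64.
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
        {
            Console.Error.WriteLine(e.ExceptionObject);
            Environment.Exit(1);
        };

        throw new InvalidOperationException("Unhandled on purpose");
    }
}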

janvorli added this to the 7.0.0 milestone Jun 23, 2022
@mangod9
Member

mangod9 commented Aug 11, 2022

Hi @AntonLapounov, since Jan is out, could you please check if this repros on .NET 6? If so, it might not block 7.

@AntonLapounov
Member

From the original post:

The behavior is present on dotnet 5 and dotnet 6 runtimes (I just tested these ones).

@mangod9
Member

mangod9 commented Aug 11, 2022

Ah ok, sorry, I missed that. We can leave it in 7 to investigate, but it probably would not make it into 7.

mangod9 modified the milestones: 7.0.0, 8.0.0 Aug 15, 2022
@turowicz

Any updates on this? Having difficulties running in production. We are trying to migrate to ARM64 servers.

@turowicz

turowicz commented Sep 26, 2022

Trying this:

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class ProgramHelper
{
    public static async Task<int> RunAsync(Func<IHost> hostFactory)
    {
        try
        {
            using (var host = hostFactory())
            {
                var provider = host.Services;
                // The non-generic ILogger is not registered by default,
                // so resolve a typed logger instead.
                var logger = provider.GetService<ILogger<ProgramHelper>>();

                try
                {
                    await host.RunAsync();
                }
                catch (Exception exception)
                {
                    logger?.LogError(exception, "Host terminated unexpectedly");
                    return 1;
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
            Console.WriteLine(ex.StackTrace);

            return 2;
        }

        return 0;
    }
}

Use:

public class Program
{
    public static async Task<int> Main(string[] args)
    {
        return await ProgramHelper.RunAsync(() => CreateHost(args));
    }

    public static IHost CreateHost(string[] args)
    {
        var host = Host.CreateDefaultBuilder(args)
            ...
        return host;
    }
}

@janvorli
Member

I have investigated this issue. The reason for the different behavior between running bash in the container and then starting the .NET binary manually vs. running the .NET binary as the entrypoint is interesting. The difference is that in the bash case there are two processes running in the container, the .NET one being the second. In the entrypoint case there is just the .NET process, and Linux considers it to be an init process, which handles signals like SIGABRT differently. For an init process the default is to ignore these signals, while in the other case the default disposition tears down the process. See https://ddanilov.me/how-signals-are-handled-in-a-docker-container for a very nice description.

In the case this issue is about, the abort C library function first sends SIGABRT to our process. That signal is ignored, so abort tries a second way: it executes a brk machine instruction, which results in SIGTRAP. We have a handler installed for that signal, but we don't handle SIGTRAP coming from an external location (libc in this case) and just return back to the original code. That triggers SIGTRAP again, and so it goes on forever.

I was going to fix it, but I've found that a recent change with a different purpose already fixed the problem. Before the change, we were unregistering only our SIGABRT handler, so the SIGTRAP handler stayed in place. After that change, all signal handlers are unregistered. The change is #80474 by @mikem8361.

It seems we want to backport this change to .NET 7.

@blaskoa as a workaround, you can add the --init command line option to the docker command that runs the app. That ensures that the .NET app runs with standard signal handling.
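For example, reusing the image name from the reproduction steps above:

docker run --init --rm ex-repro:arm64-1

The --init flag makes Docker run a small init process as PID 1 with the app as its child, so the .NET process no longer has the init-process signal semantics described above.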

@mikem8361
Member

@janvorli do you want to backport this or should I?

@janvorli
Member

@mikem8361 if you could do it, it would be great!

ghost locked as resolved and limited conversation to collaborators Feb 15, 2023