Launching dotnet using mono on macOS will hang if dotnet launches processes #55645

Closed
rolfbjarne opened this issue Jul 14, 2021 · 10 comments · Fixed by #55673

Comments

@rolfbjarne
Member

Description

Repro:

Here's a test case: signalbug-521a1b6.zip

  • Download & extract
  • Run make mono to repro: this will execute csharp which will execute dotnet test (this will hang)
$ make mono  
mono --version
Mono JIT compiler version 6.12.0.140 (2020-02/51d876a041e Thu Apr 29 10:44:55 EDT 2021)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
	TLS:           
	SIGSEGV:       altstack
	Notification:  kqueue
	Architecture:  amd64
	Disabled:      none
	Misc:          softdebug 
	Interpreter:   yes
	LLVM:          yes(610)
	Suspend:       hybrid
	GC:            sgen (concurrent by default)
csharp -e 'System.Diagnostics.Process.Start ("dotnet", "test tests.csproj").WaitForExit ();'
  Determining projects to restore...
  Restored /Users/rolf/test/dotnet/signalbug/tests.csproj (in 950 ms).
  You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
  tests -> /Users/rolf/test/dotnet/signalbug/bin/Debug/net6.0/tests.dll
Test run for /Users/rolf/test/dotnet/signalbug/bin/Debug/net6.0/tests.dll (.NETCoreApp,Version=v6.0)
Microsoft (R) Test Execution Command Line Tool Version 17.0.0-preview-20210712-03
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
No test is available in /Users/rolf/test/dotnet/signalbug/bin/Debug/net6.0/tests.dll. Make sure that test discoverer & executors are registered and platform & framework version settings are appropriate and try again.

Additionally, path to test adapters can be specified using /TestAdapterPath command. Example  /TestAdapterPath:<pathToCustomAdapters>.
[... and it hangs here, nothing else happens...]
  • Run make dotnet to run dotnet test directly (which works just fine)
$ make dotnet
dotnet --version
6.0.100-preview.7.21364.4
dotnet test
  Determining projects to restore...
  Restored /Users/rolf/test/dotnet/signalbug/tests.csproj (in 965 ms).
  You are using a preview version of .NET. See: https://aka.ms/dotnet-core-preview
  tests -> /Users/rolf/test/dotnet/signalbug/bin/Debug/net6.0/tests.dll
Test run for /Users/rolf/test/dotnet/signalbug/bin/Debug/net6.0/tests.dll (.NETCoreApp,Version=v6.0)
Microsoft (R) Test Execution Command Line Tool Version 17.0.0-preview-20210712-03
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
No test is available in /Users/rolf/test/dotnet/signalbug/bin/Debug/net6.0/tests.dll. Make sure that test discoverer & executors are registered and platform & framework version settings are appropriate and try again.

Additionally, path to test adapters can be specified using /TestAdapterPath command. Example  /TestAdapterPath:<pathToCustomAdapters>.
[this command completes successfully]

I did some debugging, and the difference is that the signal handler for SIGCHLD is different when dotnet is launched from a mono process.

Soon after launch, this is what I get when checking the signal handler for SIGCHLD:

# allocate some memory
(lldb) p (void *) malloc (40)
(void *) $0 = 0x00007fed89c04080
# set a dummy value for that memory
(lldb) expr ((void**)$0)[0] = (void*) 0xdeadf00ddeadf00d
(void *) $1 = 0xdeadf00ddeadf00d
# call sigaction to get the existing signal handler
(lldb) p (int) sigaction (20, 0, $0)
(int) $2 = 0
# inspect the result
(lldb) x/4wx $0
0x7fed89c04080: 0x00000000 0x00000000 0x00000000 0x00000042

The sigaction struct is 16 bytes, where the first 8 bytes are sa_handler, the next 4 bytes are sa_mask, and the final 4 bytes are sa_flags.

This means that:

sa_handler: 0x0 (SIG_DFL)
sa_mask: 0
sa_flags = 0x42 (SA_SIGINFO | SA_RESTART)

man sigaction says this about SA_SIGINFO: "This bit should not be set when assigning SIG_DFL or SIG_IGN.", so the behavior I'm seeing does not follow the spec. That said, if I attach to the mono process, the SIGCHLD handler is very different:

(lldb) p (void *) malloc (40)
(void *) $0 = 0x00007fd756468d50
(lldb) expr ((void**)$0)[0] = (void*) 0xdeadf00ddeadf00d
$1 = 0xdeadf00ddeadf00d
(lldb) p (int) sigaction (20, 0, $0)
(int) $2 = 0
(lldb) x/4wx $0
0x7fd756468d50: 0x06377930 0x00000001 0x00000000 0x0000004a

so I have no idea why the initial SIGCHLD handler is different in dotnet when dotnet is launched from mono.

The end result is that this will crash:

assert(origHandler->sa_sigaction);
origHandler->sa_sigaction(sig, siginfo, context);

and things will go badly from there:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x00000001067e17a1 libcoreclr.dylib`sigsegv_handler(int, __siginfo*, void*) + 49
    frame #2: 0x00007fff205d5d7d libsystem_platform.dylib`_sigtramp + 29
    frame #3: 0x0000000000000001
    frame #4: 0x000000010767c4d5 libSystem.Native.dylib`SignalHandler + 101
    frame #5: 0x00007fff205d5d7d libsystem_platform.dylib`_sigtramp + 29
    frame #6: 0x00007fff2055dcdf libsystem_kernel.dylib`__psynch_cvwait + 11
    frame #7: 0x00007fff20590e49 libsystem_pthread.dylib`_pthread_cond_wait + 1298
    frame #8: 0x000000010680e05b libcoreclr.dylib`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 315
    frame #9: 0x000000010680dd2a libcoreclr.dylib`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 458
    frame #10: 0x00000001068125aa libcoreclr.dylib`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1946
    frame #11: 0x0000000106812892 libcoreclr.dylib`WaitForMultipleObjectsEx + 82
    frame #12: 0x000000010691871e libcoreclr.dylib`Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 734
    frame #13: 0x00000001069139d0 libcoreclr.dylib`Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 48
    frame #14: 0x000000010697b013 libcoreclr.dylib`WaitHandleNative::CorWaitOneNative(void*, int) + 179
    frame #15: 0x000000010ce9242b
    frame #16: 0x000000010ce9308c
    frame #17: 0x000000010f374718
    frame #18: 0x000000010f373e24
    frame #19: 0x000000010f3727d3
    frame #20: 0x000000010f371d79
    frame #21: 0x000000010db2a873
    frame #22: 0x000000010d8455c8
    frame #23: 0x000000010d8256b7
    frame #24: 0x000000010d83ad71
    frame #25: 0x000000010d867cb5
    frame #26: 0x000000010d867191
    frame #27: 0x0000000106b02f29 libcoreclr.dylib`CallDescrWorkerInternal + 124
    frame #28: 0x0000000106954cc8 libcoreclr.dylib`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1496
    frame #29: 0x0000000106837ad8 libcoreclr.dylib`RunMain(MethodDesc*, short, int*, PtrArray**) + 776
    frame #30: 0x0000000106837dfb libcoreclr.dylib`Assembly::ExecuteMainMethod(PtrArray**, int) + 395
    frame #31: 0x000000010686aaec libcoreclr.dylib`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 508
    frame #32: 0x0000000106821164 libcoreclr.dylib`coreclr_execute_assembly + 196
    frame #33: 0x000000010677f5f1 libhostpolicy.dylib`run_app_for_context(hostpolicy_context_t const&, int, char const**) + 1313
    frame #34: 0x0000000106780511 libhostpolicy.dylib`corehost_main + 241
    frame #35: 0x000000010670d42e libhostfxr.dylib`fx_muxer_t::handle_exec_host_command(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, host_startup_info_t const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::unordered_map<known_options, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >, known_options_hash, std::__1::equal_to<known_options>, std::__1::allocator<std::__1::pair<known_options const, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > > > const&, int, char const**, int, host_mode_t, char*, int, int*) + 1550
    frame #36: 0x000000010670ca01 libhostfxr.dylib`fx_muxer_t::handle_cli(host_startup_info_t const&, int, char const**, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 1457
    frame #37: 0x000000010670c2c6 libhostfxr.dylib`fx_muxer_t::execute(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, int, char const**, host_startup_info_t const&, char*, int, int*) + 646
    frame #38: 0x0000000106708d38 libhostfxr.dylib`hostfxr_main_startupinfo + 152
    frame #39: 0x00000001066b9c17 dotnet`exe_start(int, char const**) + 1191
    frame #40: 0x00000001066b9ddf dotnet`main + 143
    frame #41: 0x00007fff205abf5d libdyld.dylib`start + 1
    frame #42: 0x00007fff205abf5d libdyld.dylib`start + 1
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler bot added the untriaged label ("New issue has not been triaged by the area owner") Jul 14, 2021
@rolfbjarne
Member Author

CC @tmds since I think this is the cause: a25bece

@rolfbjarne
Member Author

Also CC @stephentoub and @steveisok

@tmds
Member

tmds commented Jul 14, 2021

> Use mono to execute dotnet

What does this mean?

And, is this mono built from the dotnet/runtime repo?

> This means that:
>
> sa_handler: 0x0 (SIG_DFL)
> sa_mask: 0
> sa_flags = 0x42 (SA_SIGINFO | SA_RESTART)

and sa_sigaction is 0 also? Probably this is mapped to the same field as sa_handler (as on Linux)?

> man sigaction says this about SA_SIGINFO: "This bit should not be set when assigning SIG_DFL or SIG_IGN.", so the behavior I'm seeing does not follow the spec. That said, if I attach to the mono process, the SIGCHLD handler is very different:

Yes, this violates an assumption that is in the code (assert(origHandler->sa_sigaction); in a25bece).

> so I have no idea why the initial SIGCHLD handler is different in dotnet when dotnet is launched from mono.

That would be interesting to find out.

Can I run these steps on Linux?

@rolfbjarne
Member Author

rolfbjarne commented Jul 14, 2021

>> Use mono to execute dotnet
>
> What does this mean?

The test case does this:

csharp -e 'System.Diagnostics.Process.Start ("dotnet", "test tests.csproj").WaitForExit ();'

which is short for something like this:

echo 'class Program { static void Main () { System.Diagnostics.Process.Start ("dotnet", "test tests.csproj").WaitForExit (); } }' > test.cs
csc test.cs
mono test.exe

> And, is this mono built from the dotnet/runtime repo?

It's from the mono/mono repo (this hash in particular: mono/mono@51d876a041e)

> and sa_sigaction is 0 also? Probably this is mapped to the same field as sa_handler (as on Linux)?

Yes, sa_sigaction and sa_handler are the same field on macOS (from man sigaction):

struct  sigaction {
        union __sigaction_u __sigaction_u;  /* signal handler */
        sigset_t sa_mask;               /* signal mask to apply */
        int     sa_flags;               /* see signal options below */
};

union __sigaction_u {
        void    (*__sa_handler)(int);
        void    (*__sa_sigaction)(int, siginfo_t *,
                       void *);
};

#define sa_handler      __sigaction_u.__sa_handler
#define sa_sigaction    __sigaction_u.__sa_sigaction

> Can I run these steps on Linux?

I don't see why not, there's nothing macOS-specific in the test case.

@lambdageek
Member

lambdageek commented Jul 14, 2021

This is how "classic" mono sets up the SIGCHLD handler:

https://github.com/mono/mono/blob/8dba54da1a85a55a7063945dec0c891b0c31e810/mono/metadata/w32process-unix.c#L1152-L1155

Maybe we need to uninstall it in the child after fork before execve (here)? I'm surprised the handler is null but the flags aren't.

This is from Apple's sigaction(2):

After a fork(2) or vfork(2) all signals, the signal mask, the signal
stack, and the restart/interrupt flags are inherited by the child.

The execve(2) system call reinstates the default action for all signals
which were caught and resets all signals to be caught on the user stack.
Ignored signals remain ignored; the signal mask remains the same; signals
that restart pending system calls continue to do so.

If I read this literally, this means flags other than restart/interrupt aren't inherited by the child and aren't reinstated by execve. But apparently SA_SIGINFO makes it through.


It should be possible to reproduce the issue by writing a launcher C app that sets a SIGCHLD handler with SA_SIGINFO and then uses fork/execve to run dotnet test tests.csproj

@rolfbjarne
Member Author

> Maybe we need to uninstall it in the child after fork before execve (here)?

I don't think it's something that mono should change, because if it's something mono does, then somebody else might do it too. IMHO it's something dotnet will just have to cope with.

@steveisok
Member

Unless someone disagrees that the fix should come from dotnet, @tmds @stephentoub, can I assign the issue to one of you?

If we aren't sure, I would like to try to sort it out asap.

@lambdageek
Member

lambdageek commented Jul 14, 2021

Here's an experiment in pure C: build with make, run with ./foo: https://gist.github.com/lambdageek/9eb268d9aecd0f1c9bc020b91bed0c37

output on macOS (Big Sur 11.4):

$ ./foo
in parent, setting sa_flags to 0x4a
in parent
parent waiting
in child
bar executing
SIG_IGN is 0x1
SIG_DFL is 0x0
orig_chld.sa_flags = 0x42
orig_chld.sa_sigaction = 0x0
orig_chld.sa_handler = 0x0
bar 1 = hello
parent waiting
bar exiting

So yeah, it looks like SA_SIGINFO persists after a fork/execve on macOS, while the handler (sa_sigaction) gets reset to SIG_DFL


For completeness here's the output from Ubuntu 20.04:

in parent, setting sa_flags to 0x10000005
in parent
parent waiting
in child
bar executing
SIG_IGN is 0x1
SIG_DFL is (nil)
orig_chld.sa_flags = 0x0
orig_chld.sa_sigaction = (nil)
orig_chld.sa_handler = (nil)
bar 1 = hello
parent waiting
parent waiting
bar exiting

@tmds
Member

tmds commented Jul 14, 2021

You can assign it to me.

@lambdageek assigned lambdageek and tmds and unassigned lambdageek Jul 14, 2021
@ghost added the in-pr label ("There is an active PR which will close this issue when it is merged") Jul 14, 2021
@steveisok added the area-System.Runtime.InteropServices label and removed the untriaged label ("New issue has not been triaged by the area owner") Jul 15, 2021
@ghost removed the in-pr label ("There is an active PR which will close this issue when it is merged") Jul 15, 2021
@ghost locked as resolved and limited conversation to collaborators Aug 14, 2021