Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iOS: building with LLVM causes some exception catch clauses to not work #56100

Closed
rolfbjarne opened this issue Jul 21, 2021 · 36 comments · Fixed by #58491
Closed

iOS: building with LLVM causes some exception catch clauses to not work #56100

rolfbjarne opened this issue Jul 21, 2021 · 36 comments · Fixed by #58491

Comments

@rolfbjarne
Copy link
Member

Description

See the following test code: https://gist.github.com/rolfbjarne/f5c1a3697343c5e9bb6cb8e5b796d328#file-appdelegate-cs-L23-L73

When executed on an iOS device (AOT compiled), this produces the following output:

TCP connection failed when selecting 'hostname': fe80::6aa3:6295:30dd:b552 and 'port': 49681. System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (65): No route to host [fe80::6aa3:6295:30dd:b552]:49681
   at System.Net.Sockets.Socket.DoConnect(EndPoint , SocketAddress ) in System.Net.Sockets.dll:token 0x6000127+0x68
   at System.Net.Sockets.Socket.Connect(EndPoint ) in System.Net.Sockets.dll:token 0x600010a+0xca
   at System.Net.Sockets.Socket.Connect(IPAddress , Int32 ) in System.Net.Sockets.dll:token 0x600010b+0x62
   at System.Net.Sockets.TcpClient.Connect(String , Int32 ) in System.Net.Sockets.dll:token 0x60001b7+0xc7
--- End of stack trace from previous location ---
   at System.Net.Sockets.TcpClient.Connect(String , Int32 ) in System.Net.Sockets.dll:token 0x60001b7+0x158
   at System.Net.Sockets.TcpClient..ctor(String , Int32 ) in System.Net.Sockets.dll:token 0x60001b4+0x56
   at MySingleView.AppDelegate.<>c__DisplayClass3_2.<SelectHostName>b__0(Object v) in /Users/rolf/work/maccore/squashed-onedotnet/xamarin-macios/tests/dotnet/MySingleView/AppDelegate.cs:line 49
[...]
Selected host name: 

if I enable LLVM, the following happens:

Unhandled managed exception: No route to host [fe80::6aa3:6295:30dd:b552]:49681 (System.Net.Internals.SocketExceptionFactory+ExtendedSocketException)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
Unhandled managed exception: No route to host [fe80::14a6:1e61:2f4e:e2c5]:49681 (System.Net.Internals.SocketExceptionFactory+ExtendedSocketException)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()

=================================================================
	Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Native stacktrace:
=================================================================

It's a rather strange bug, because in my original test it doesn't crash this way, the SelectHostName code causes a lot of other test failures due to Assert.Throws<FooException> statements not actually catching the FooException, but then the general NUnit catch handler catches these exceptions, and turn them into test failures. The baffling effect was that running the tests outside of our test harness (without needing to call the SelectHostName method), the tests passed.

In any case, let me know if you can reproduce with this information, or I'll create a test case you can use (which will likely require building a custom branch of xamarin-macios).

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net untriaged New issue has not been triaged by the area owner labels Jul 21, 2021
@ghost
Copy link

ghost commented Jul 21, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

See the following test code: https://gist.github.com/rolfbjarne/f5c1a3697343c5e9bb6cb8e5b796d328#file-appdelegate-cs-L23-L73

When executed on an iOS device (AOT compiled), this produces the following output:

TCP connection failed when selecting 'hostname': fe80::6aa3:6295:30dd:b552 and 'port': 49681. System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (65): No route to host [fe80::6aa3:6295:30dd:b552]:49681
   at System.Net.Sockets.Socket.DoConnect(EndPoint , SocketAddress ) in System.Net.Sockets.dll:token 0x6000127+0x68
   at System.Net.Sockets.Socket.Connect(EndPoint ) in System.Net.Sockets.dll:token 0x600010a+0xca
   at System.Net.Sockets.Socket.Connect(IPAddress , Int32 ) in System.Net.Sockets.dll:token 0x600010b+0x62
   at System.Net.Sockets.TcpClient.Connect(String , Int32 ) in System.Net.Sockets.dll:token 0x60001b7+0xc7
--- End of stack trace from previous location ---
   at System.Net.Sockets.TcpClient.Connect(String , Int32 ) in System.Net.Sockets.dll:token 0x60001b7+0x158
   at System.Net.Sockets.TcpClient..ctor(String , Int32 ) in System.Net.Sockets.dll:token 0x60001b4+0x56
   at MySingleView.AppDelegate.<>c__DisplayClass3_2.<SelectHostName>b__0(Object v) in /Users/rolf/work/maccore/squashed-onedotnet/xamarin-macios/tests/dotnet/MySingleView/AppDelegate.cs:line 49
[...]
Selected host name: 

if I enable LLVM, the following happens:

Unhandled managed exception: No route to host [fe80::6aa3:6295:30dd:b552]:49681 (System.Net.Internals.SocketExceptionFactory+ExtendedSocketException)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
Unhandled managed exception: No route to host [fe80::14a6:1e61:2f4e:e2c5]:49681 (System.Net.Internals.SocketExceptionFactory+ExtendedSocketException)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()

=================================================================
	Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Native stacktrace:
=================================================================

It's a rather strange bug, because in my original test it doesn't crash this way, the SelectHostName code causes a lot of other test failures due to Assert.Throws<FooException> statements not actually catching the FooException, but then the general NUnit catch handler catches these exceptions, and turn them into test failures. The baffling effect was that running the tests outside of our test harness (without needing to call the SelectHostName method), the tests passed.

In any case, let me know if you can reproduce with this information, or I'll create a test case you can use (which will likely require building a custom branch of xamarin-macios).

Author: rolfbjarne
Assignees: -
Labels:

area-System.Net, untriaged

Milestone: -

@rolfbjarne
Copy link
Member Author

CC @SamMonoRT @lambdageek

@karelz karelz added the os-ios Apple iOS label Jul 21, 2021
@SamMonoRT SamMonoRT added area-Codegen-LLVM-mono and removed area-System.Net untriaged New issue has not been triaged by the area owner labels Jul 21, 2021
@SamMonoRT SamMonoRT added this to the 6.0.0 milestone Jul 21, 2021
@SamMonoRT
Copy link
Member

@imhameed should be back in office next week and can take a look then. @vargaz - any pointers ?

@vargaz
Copy link
Contributor

vargaz commented Jul 21, 2021

How can I reproduce this ?

@rolfbjarne
Copy link
Member Author

@vargaz

How can I reproduce this ?

  1. Download & install Xcode 13 beta 3.

  2. Clone my fork of xamarin-macios, and checkout the dotnet-llvm-exception-repro branch, and build it:

     git clone https://github.com/rolfbjarne/xamarin-macios
     git checkout dotnet-llvm-exception-repro
     cd xamarin-macios
     ./configure --enable-dotnet --enable-xamarin
     make reset
     make all -j8
     make install -j8
    
  3. Connect an iOS device

  4. Build & run the test project:

     make -C tests/dotnet/MySingleView run-llvm
    

@vargaz
Copy link
Contributor

vargaz commented Jul 23, 2021

I can reproduce, however the failure seems random, sometimes it works.

@vargaz
Copy link
Contributor

vargaz commented Jul 23, 2021

Did this ever work with .net core ?

@rolfbjarne
Copy link
Member Author

Did this ever work with .net core ?

I don't know, it's the first time we've had support for LLVM (I ran into this while implementing LLVM support).

@imhameed
Copy link
Contributor

imhameed commented Aug 2, 2021

@rolfbjarne I tried reproducing this locally but wasn't successful. Also make -C tests/dotnet/MySingleView run-llvm takes a very long time for me to run locally, because it seems to rebuild MySingleView with llvm from start to finish; @vargaz when you determined that the failure was intermittent, did you rerun make -C tests/dotnet/MySingleView run-llvm? Or did you reuse/relaunch an already-built test app?

EDIT: Nevermind, I can intermittently repro this failure

EDIT 2: I can't reproduce this using "normal" LLVM FullAOT (not statically linking everything into a single binary) on arm64 Linux

@imhameed
Copy link
Contributor

imhameed commented Aug 13, 2021

A note: if I build the repro with /p:Configuration=Release, this stops

EDIT: Nevermind, it doesn't. I can reproduce this in Release too.

@SamMonoRT
Copy link
Member

Doesn't seem like a consistent repro. @imhameed - if you determine consistent repro steps and have a low risk fix, we will consider backporting to 6.0 - For now moving to 7.0

@SamMonoRT SamMonoRT modified the milestones: 6.0.0, 7.0.0 Aug 14, 2021
@rolfbjarne
Copy link
Member Author

Doesn't seem like a consistent repro. @imhameed - if you determine consistent repro steps and have a low risk fix, we will consider backporting to 6.0 - For now moving to 7.0

This is blocking LLVM for us. It's 100% consistent on our main test suite, where it breaks a huge number of tests. There's no way we can release LLVM support with this bug.

@SamMonoRT
Copy link
Member

Got it @rolfbjarne ---- @imhameed has been working on narrowing down the issue, so we should be able to get in a fix. @imhameed please update this with your progress

@SamMonoRT SamMonoRT modified the milestones: 7.0.0, 6.0.0 Aug 16, 2021
@imhameed
Copy link
Contributor

A smaller repro: https://gist.github.com/imhameed/cdeba5b8bb9cd879b688b97292602e58

Still fails intermittently for me, even when re-launching an already-installed copy of this via mlaunch. Exception handling ought to be deterministic, but maybe the attempt to open a socket can fail in multiple ways that all raise the same exception, and one of these failing cases has bad codegen. Haven't yet looked at the IR we're generating.

Also bad: this code crashes with an "Unhandled managed exception" 100% of the time.

@imhameed
Copy link
Contributor

imhameed commented Aug 17, 2021

Nested try clauses ought to fall back to mini instead of LLVM:

// FIXME: Nested try clauses fail in some cases too, i.e. #37273
if (i != j && clause1->try_offset >= clause2->try_offset && clause1->handler_offset <= clause2->handler_offset) {
//(clause1->flags == MONO_EXCEPTION_CLAUSE_FINALLY || clause2->flags == MONO_EXCEPTION_CLAUSE_FINALLY)) {
cfg->exception_message = g_strdup ("nested clauses");
cfg->disable_llvm = TRUE;
break;
}

Working on getting a locally-built cross-compiler shimmed into xamarin-macios right now.

EDIT: Got a locally-built cross-compiler working with the existing toolchain. I see LLVM failed for 'Application.SelectHostName': nested clauses when compiling my shrunken repro with --aot=...,llvm,..., so it looks like we are indeed falling back to mini for nested clauses. Will compare the IR to the IR produced by a non-LLVM build of the repro app next.

@filipnavara
Copy link
Member

Ref: https://bugzilla.xamarin.com/37/37273/bug.html (since it's not obvious where the original issue was filed)

@imhameed
Copy link
Contributor

Also related (tracking a longer-term cleaner fix): #54176

@vargaz
Copy link
Contributor

vargaz commented Aug 18, 2021

So llvm used to work with previous xam.ios versions based on the mono/mono codebase, I don't think this problem is caused by bcl changes, it seems more likely to be a change in the build system, a missing flag etc.

@imhameed
Copy link
Contributor

imhameed commented Aug 18, 2021

EDIT: Nevermind, spoke too soon.

Observation: The generated assembly and the "JitInfo EH clause" list for SelectHostName in this look identical with or without LLVM, barring seemingly irrelevant differences in layout and addresses. Raise is different, and is actually compiled with LLVM when LLVM is enabled.

The flags I saw used when building with LLVM are:

--aot=mtriple=arm64-ios,data-outfile=obj/Release/net6.0-ios/ios-arm64/nativelibraries/aot-output/arm64/MySingleView.aotdata,static,asmonly,direct-icalls,full,nodebug,dwarfdebug,llvm-path=/Users/imhameed/ms/xammac-rolf/xamarin-macios/builds/downloads/dotnet-sdk-6.0.100-rc.1.21370.12-osx-x64/packs/Microsoft.NETCore.App.Runtime.AOT.osx-x64.Cross.ios-arm64/6.0.0-rc.1.21369.14/Sdk/../tools,outfile=obj/Release/net6.0-ios/ios-arm64/nativelibraries/aot-output/arm64/MySingleView.dll.s,llvm-outfile=obj/Release/net6.0-ios/ios-arm64/nativelibraries/aot-output/arm64/MySingleView.dll.llvm.o --debug --llvm -O=gsharedvt /Users/imhameed/ms/xammac-rolf/xamarin-macios/tests/dotnet/MySingleView/obj/Release/net6.0-ios/ios-arm64/linked/MySingleView.dll

and without:

--aot=mtriple=arm64-ios,data-outfile=obj/Release/net6.0-ios/ios-arm64/nativelibraries/aot-output/arm64/MySingleView.aotdata,static,asmonly,direct-icalls,full,nodebug,dwarfdebug,outfile=obj/Release/net6.0-ios/ios-arm64/nativelibraries/aot-output/arm64/MySingleView.dll.s --debug -O=gsharedvt /Users/imhameed/ms/xammac-rolf/xamarin-macios/tests/dotnet/MySingleView/obj/Release/net6.0-ios/ios-arm64/linked/MySingleView.dll

If I hack up mono_llvm_check_method_supported so that it disables LLVM when compiling Raise, then the resulting app doesn't crash.

@imhameed
Copy link
Contributor

imhameed commented Aug 20, 2021

Here's the LLVM IR along with the mono-aot-cross/opt/llc flags used for compilation for this repro: https://gist.github.com/imhameed/bd2564d0df45c580294888eb9f4770c4

I tried using a cross-compiler built against LLVM 9 and rebuilding the repro using LLVM 9 opt and llc. That made no difference.

And nested clauses might not be related; https://gist.github.com/imhameed/ccd542f151624c6399bd6d5c0f74b085 also crashes. Going to trace our EH code next.

@vargaz
Copy link
Contributor

vargaz commented Aug 20, 2021

If you change the testcase to:

static int foo;

		[MethodImpl(MethodImplOptions.NoInlining)]
		static string Raise() {
                         if (foo == 0)
			throw new InvalidOperationException("okay");
		}

does it still fail ?

@imhameed
Copy link
Contributor

@vargaz Yes:

		static int foo;
		[MethodImpl(MethodImplOptions.NoInlining)]
		static string Raise() {
			if (foo == 0)
			throw new InvalidOperationException("okay");
			return "";
		}

still causes:

2021-08-20 11:15:24.797 MySingleView[3141:8804269] --DEBUG: before
2021-08-20 11:15:24.809 MySingleView[3141:8804269] Unhandled managed exception: okay (System.InvalidOperationException)

=================================================================
        Native Crash Reporting
=================================================================
...

@imhameed
Copy link
Contributor

Got a locally-built copy of the runtime working and I can printf debug on-device now. When built with LLVM, the runtime fails to find any associated catch instruction ranges when unwinding. I'll keep looking into why.

@imhameed
Copy link
Contributor

When unwinding, the unwinder instruction pointer is set to an address that is not contained within any of the module ranges registered with mono_jit_info_add_aot_module. The initial IP is set to an in-range address, and then one step of mono_arch_unwind_frame (as part of unwinder_init in handle_exception_first_pass) restores the link register's address, which is out of bounds. Don't yet know why the LR has this address in it.

@filipnavara
Copy link
Member

filipnavara commented Aug 29, 2021

JFYI I was able to reproduce this on maccatalyst-arm64. It may be slightly easier to debug than the iOS version.

Curiously, when compiled as part of the FunctionalTests framework in dotnet/runtime it run just fine. When compiled using Xamarin SDK it failed 100% 50% of the time (unless it's run with MONO_LOG_LEVEL=debug when it suddenly stops failing). There could be a difference in the options used for the AOT, or the way the runtime is linked or initialized.

@imhameed
Copy link
Contributor

imhameed commented Aug 30, 2021

@filipnavara how did you build this with the dotnet/runtime functional test framework? And which repro .cs did you use?

(Currently I'm hacking up llc to dump mono unwinding information to stdout in a somewhat readable way.)

@filipnavara
Copy link
Member

filipnavara commented Aug 30, 2021

how did you build this with the dotnet/runtime functional test framework? And which repro .cs did you use?

So, I started with an M1 MacBook and build.sh -os maccatalyst -arch arm64.

Then I replaced Program.cs of the AOT-LLVM functional test with one of your codes above (doesn't matter which one but I used the first reduced repro from GIST).

Finally I compiled the sample with dotnet build /p:TargetOS=MacCatalyst /p:TargetArchitecture=arm64.

if you need more precise steps I can send them when I get back to my computer. It never crashed though.

I also copied the same source code into a dotnet new console template, changed TFM and RID to net6.0-maccatalyst/maccatalyst-arm64 and it crashes in the same way as the iOS one.

@imhameed
Copy link
Contributor

I'll give that a shot; thanks. I'll take a look at the AOT flags we use and why the functional test has different behavior.

@filipnavara
Copy link
Member

filipnavara commented Aug 31, 2021

If there's anything else I can test or help with, let me know. I am afraid I am still few steps behind you though so there's not much I can offer.

I compared the AOT flags in both MSBuild logs and they seemed identical. However, different tasks are used to compose the final set so I could have easily missed something.

@imhameed
Copy link
Contributor

Unwinding directives for LLVM-compiled Raise and SelectHostName in https://gist.github.com/imhameed/ccd542f151624c6399bd6d5c0f74b085 look reasonable: https://gist.github.com/imhameed/aef3e164aa20cffe5d7e1e0ac85c149e (just a prologue stack adjustment followed by DW_CFA_offsets that mark where the saved registers are located relative to the frame start).

@filipnavara
Copy link
Member

I can confirm that it fails very early on in the unwinding so it's likely not a problem in the actual unwinding information.

Working unwind:

2021-09-01 10:08:51.490 llvmbug[20432:4362454] --DEBUG: before
M: System.Net.Sockets.Socket:DoConnect (System.Net.EndPoint,System.Net.Internals.SocketAddress) 1.
M: System.Net.Sockets.Socket:Connect (System.Net.EndPoint) 2.
M: System.Net.Sockets.Socket:Connect (System.Net.IPAddress,int) 3.
M: System.Net.Sockets.TcpClient:Connect (string,int) 4.
M: System.Net.Sockets.TcpClient:Connect (string,int) 1.
M: System.Runtime.ExceptionServices.ExceptionDispatchInfo:Throw () 1.
M: System.Net.Sockets.TcpClient:Connect (string,int) 2.
M: System.Net.Sockets.TcpClient:.ctor (string,int) 3.
M: System.Net.Sockets.TcpClient:.ctor (string,int) 1.
M: MySingleView.Application:SelectHostName (string,int) 2.

Non-working unwind:

2021-09-01 10:08:53.519 llvmbug[20433:4362488] --DEBUG: before
M: (wrapper managed-to-native) Interop/Sys:GetProcessArchitecture () 1.
M: (wrapper managed-to-native) Interop/Sys:GetProcessArchitecture () 2.
M: (wrapper managed-to-native) Interop/Sys:StrErrorR (int,byte*,int) 3.

@filipnavara
Copy link
Member

filipnavara commented Sep 1, 2021

I've narrowed it down a bit. When looking up the very first framemono_jit_info_table_find_internal is called to find MonoJitInfo from the ip address. When it works correctly it goes through the try_aot branch, finds correct method and unwinds stack correctly. When it fails it hits the first branch with jit_info_table_find and returns early with an incorrect result [without even looking at the AOT modules].

The looked up entry looks like this:

AOT F: (wrapper managed-to-native) Interop/Sys:GetProcessArchitecture ()
AOT code 0x104fce470 code_len 3c00358

The code size looks incorrect.

@filipnavara
Copy link
Member

It seems that it's possible to catch the corruption early on by adding this:

	if (code >= amodule->jit_code_start && code < amodule->jit_code_end)
		g_assert(code + code_len <= amodule->jit_code_end);
	else
		g_assert(code + code_len <= amodule->llvm_code_end);

to mono_aot_find_jit_info just before the jinfo = decode_exception_debug_info (amodule, method, ex_info, code, code_len); line.

@filipnavara
Copy link
Member

filipnavara commented Sep 1, 2021

So, what happens is that one AOT module has both JIT code and LLVM code. When linked into final executable these sections are quite far apart and there is a different code in-between. When the last JIT method is added it (incorrectly) fills the whole gap. Later lookups for the code that is actually in the gap returns incorrect information.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 1, 2021
@imhameed
Copy link
Contributor

imhameed commented Sep 1, 2021

Thank you! And it looks like the link order xamarin-macios is using now is determined here: https://github.com/xamarin/xamarin-macios/blob/4380161309528e543107ede2ea5881299495d7ef/dotnet/targets/Xamarin.Shared.Sdk.targets#L856-L866

Contrast with: https://github.com/xamarin/xamarin-macios/blob/xamarin-ios-13.18.0.21/tools/mtouch/Target.cs#L1120-L1125

@filipnavara
Copy link
Member

Thanks for explaining the missing piece of what has changed! I don't necessarily care if my PR is accepted as the final solution but I am happy to see progress on this and explanation for the behaviour.

lambdageek pushed a commit that referenced this issue Sep 1, 2021
…58491)

Fixes #56100

Here's a little bit of background on the problem. When multiple AOT modules got linked into the same executable for iOS/MacCatalyst the LLVM and JIT code sections of different modules got interleaved. The resulting AOT module info looks something like this:

```
JIT start: 0x101002100 JIT end: 0x1010035e0
LLVM start: 0x104c027c8 LLVM end: 0x104c603e8
Sorted methods:
  0x101002250
  0x101002390
  0x101002470
  0x104c027c8
  0x104c027e0
  ...
```

The previous code incorrectly assumed that the third method had a code size `0x104c027c8 - 0x101002470`. When inserted into lookup table any of the AOT code inside the range `0x1010035e0 (JIT code end) - 0x104c027c8 (LLVM code start)` would incorrectly end up being mapped to the last JIT method.

(Note: The JIT name here is misleading.)
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 1, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Oct 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants