Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: stack walk failure in arm64 function fragments #91139

Closed
AndyAyersMS opened this issue Aug 25, 2023 · 17 comments · Fixed by #92678
Closed

JIT: stack walk failure in arm64 function fragments #91139

AndyAyersMS opened this issue Aug 25, 2023 · 17 comments · Fixed by #92678
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Milestone

Comments

@AndyAyersMS
Copy link
Member

Compile: TestClass156-test-assertion.cs.txt

Run on osx arm64 with

export COMPlus_JitAggressiveInlining=1
export COMPlus_JitSplitFunctionSize=100
export COMPlus_GCStress=0x4

Assert failure(PID 94626 [0x000171a2], Thread: 23328476 [0x163f6dc]): !"About to FailFast. set DOTNET_AssertOnFailFast=0 if this is expected"
    File: /Users/andya/repos/runtime0/src/coreclr/vm/jithelpers.cpp Line: 4616
    Image: /Users/andya/repos/runtime0/artifacts/tests/coreclr/osx.arm64.Checked/Tests/Core_Root/corerun

(may also repro on other arm64 hosts, did not try)

By increasing the split function size you can make the test "pass":

export COMPlus_JitAggressiveInlining=1
export COMPlus_JitSplitFunctionSize=138
export COMPlus_GCStress=0x4

Unhandled exception. System.OverflowException: Value was either too large or too small for a Decimal.
   at System.Decimal.op_Multiply(Decimal d1, Decimal d2)
   at TestClass.LeafMethod0() in /Users/andya/bugs/antigen/156.cs:line 66
   at TestClass.Method21(S2 p_s2_593, UInt16& p_ushort_594, Double& p_double_595, S2 p_s2_596, S1 p_s1_597, UInt32 p_uint_598, Decimal& p_decimal_599, S1& p_s1_600) in /Users/andya/bugs/antigen/156.cs:line 1309
   at TestClass.Method0() in /Users/andya/bugs/antigen/156.cs:line 5701
   at TestClass.Main(String[] args) in /Users/andya/bugs/antigen/156.cs:line 5798
@AndyAyersMS AndyAyersMS added bug arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 25, 2023
@AndyAyersMS AndyAyersMS added this to the Future milestone Aug 25, 2023
@ghost
Copy link

ghost commented Aug 25, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Compile: TestClass156-test-assertion.cs.txt

Run on osx arm64 with

export COMPlus_JitAggressiveInlining=1
export COMPlus_JitSplitFunctionSize=100
export COMPlus_GCStress=0x4

Assert failure(PID 94626 [0x000171a2], Thread: 23328476 [0x163f6dc]): !"About to FailFast. set DOTNET_AssertOnFailFast=0 if this is expected"
    File: /Users/andya/repos/runtime0/src/coreclr/vm/jithelpers.cpp Line: 4616
    Image: /Users/andya/repos/runtime0/artifacts/tests/coreclr/osx.arm64.Checked/Tests/Core_Root/corerun

(may also repro on other arm64 hosts, did not try)

By increasing the split function size you can make the test "pass":

export COMPlus_JitAggressiveInlining=1
export COMPlus_JitSplitFunctionSize=138
export COMPlus_GCStress=0x4

Unhandled exception. System.OverflowException: Value was either too large or too small for a Decimal.
   at System.Decimal.op_Multiply(Decimal d1, Decimal d2)
   at TestClass.LeafMethod0() in /Users/andya/bugs/antigen/156.cs:line 66
   at TestClass.Method21(S2 p_s2_593, UInt16& p_ushort_594, Double& p_double_595, S2 p_s2_596, S1 p_s1_597, UInt32 p_uint_598, Decimal& p_decimal_599, S1& p_s1_600) in /Users/andya/bugs/antigen/156.cs:line 1309
   at TestClass.Method0() in /Users/andya/bugs/antigen/156.cs:line 5701
   at TestClass.Main(String[] args) in /Users/andya/bugs/antigen/156.cs:line 5798
Author: AndyAyersMS
Assignees: -
Labels:

bug, arch-arm64, area-CodeGen-coreclr

Milestone: Future

@AndyAyersMS
Copy link
Member Author

@BruceForstall FYI

@BruceForstall
Copy link
Member

Looks like we're failing fast due to a GS cookie check failure.

I've seen it also fail with more "normal" GC hole asserts (e.g., CREATE_CHECK_STRING).

@BruceForstall
Copy link
Member

I've been repro'ing on win-arm64 with:

set COMPlus_JitAggressiveInlining=1
set COMPlus_JitSplitFunctionSize=100
set COMPlus_GCStress=F

If I use COMPlus_GCStress=4 the "FailFast" doesn't repro.

Stack trace:

KERNELBASE!DebugBreak
coreclr!DbgAssertDialog(char * szFile = 0x00007ffc`86a8ea48 "C:\gh\runtime\src\coreclr\vm\jithelpers.cpp", int iLine = 0n4691, char * szExpr = 0x00007ffc`86a8f2a0 "!"About to FailFast. set DOTNET_AssertOnFailFast=0 if this is expected"")+0x170
coreclr!DoJITFailFast(void)+0x54
coreclr!CrawlFrame::CheckGSCookies(void)+0x34
coreclr!CrawlFrame::SetCurGSCookie(void)+0x44
coreclr!StackFrameIterator::PreProcessingForManagedFrames(void)+0x290
coreclr!StackFrameIterator::ProcessCurrentFrame(void)+0x188
coreclr!StackFrameIterator::Init(class Thread * pThread = 0x000002c1`1ec66b30, class Frame * pFrame = <Value unavailable error>, struct REGDISPLAY * pRegDisp = 0x000000fe`ef0787d0, unsigned int flags = 0x388)+0x23c
coreclr!Thread::StackWalkFramesEx(struct REGDISPLAY * pRD = 0x000000fe`ef0787d0, <function> * pCallback = 0x00007ffc`8665c920, void * pData = 0x000000fe`ef078400, unsigned int flags = 0x388, class Frame * pStartFrame = 0x00000000`00000000)+0x1bc
coreclr!Thread::HandledJITCase(void)+0x274
coreclr!ThreadSuspend::SuspendRuntime(ThreadSuspend::SUSPEND_REASON reason = SUSPEND_FOR_GC (0n1))+0x7f4
coreclr!ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON reason = SUSPEND_FOR_GC (0n1))+0x244
coreclr!GCToEEInterface::SuspendEE(SUSPEND_REASON reason = SUSPEND_FOR_GC (0n1))+0x60
coreclr!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen = 2, gc_reason reason = reason_gcstress (0n8))+0x37c
coreclr!WKS::GCHeap::GarbageCollectTry(void)+0x78
coreclr!WKS::GCHeap::GarbageCollect(int generation = <Value unavailable error>, bool low_memory_p = <Value unavailable error>, int mode = 0n-2147483648)+0x1c8
coreclr!WKS::GCHeap::StressHeap(struct gc_alloc_context * context = <Value unavailable error>)+0x798
coreclr!Thread::PerformPreemptiveGC(void)+0x1f0
coreclr!Thread::RareEnablePreemptiveGC(void)+0xa8
coreclr!Thread::EnablePreemptiveGC(void)+0xc0
coreclr!GCHolderBase::PopInternal(void)+0x134
coreclr!GCCoop::~GCCoop(void)+0x15c
coreclr!CEEInfo::getFieldInfo(struct CORINFO_RESOLVED_TOKEN * pResolvedToken = 0x000000fe`ef079990, struct CORINFO_METHOD_STRUCT_ * callerHandle = 0x00007ffc`2772c8c0, CORINFO_ACCESS_FLAGS flags = CORINFO_ACCESS_GET (0n256), struct CORINFO_FIELD_INFO * pResult = 0x000000fe`ef07a4d0)+0x444
clrjit!Compiler::impImportBlockCode(struct BasicBlock * block = 0x00000000`00000000)+0x64c4
clrjit!Compiler::impImportBlock(struct BasicBlock * block = 0x00000301`334db058 BB28->BB30; BBJ_ALWAYS)+0x134
clrjit!Compiler::impImport(void)+0x51c
clrjit!Compiler::fgImport(void)+0x1c
clrjit!CompilerPhaseWithStatus::DoPhase(void)+0x28
clrjit!Phase::Run(void)+0x8c
clrjit!DoPhase(void)+0x34

The failure is unreliable. Using a Debug build doesn't fail. If I turn on JitDump it doesn't fail.

@BruceForstall
Copy link
Member

@janvorli Here's a case where we seem to cause stack walking or GC issues when we use a JIT stress mode that causes us to create multiple unwind info "fragments" more frequently than required (using DOTNET_JitSplitFunctionSize; see https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#function-fragments). The fragment creating code has been there for a long time, but it might not get exercised much since the fragments are normally 512K (arm32) or 1MB (arm64) and presumably you'd need a function larger than that plus a stack walk (e.g., GC) in the non-zero fragment to hit any related code.

If you have any suggestions for investigation, please let me know.

@janvorli
Copy link
Member

I'll take a look

@BruceForstall
Copy link
Member

@janvorli fyi, I created a test PR that enables fragment splitting aggressively (basically, create as many fragments as possible): #92203. I triggered the gcstress pipeline and there are lots of GCStress=3 failures: https://dev.azure.com/dnceng-public/public/_build/results?buildId=409281&view=results.

My current theory is that the VM doesn't know how to look up the GS cookie location correctly on the frame for the non-first fragment.

@janvorli
Copy link
Member

janvorli commented Sep 19, 2023

I have debugged the issue and I believe the problem is that when function is split into multiple fragments, it should not have epilog in the first fragment. The doc mentioned in the comment above is not 100% clear about it, but it seems it can be interpreted this way (it says at a place where it shows an example: Prolog only (region 1: all epilogs are in separated regions))
When I have stepped through the unwinding code for the failing test, the problem occurs when we are unwinding from an address in the second fragment. We always use the fragment #1 unwind info for unwinding, so we have an offset that is behind the end of the first fragment. Since the first fragment contains an epilog, the unwinder thinks we are in epilog and based on the location in the epilog decides how many unwind codes to skip. Since the address is after the epilog, it skips all of them, so the only thing the unwinding ends up doing is to set PC to LR. So, the SP and other registers end up not being updated. That results in the stack cookie check failure, as the SP is incorrect, and the cookie location is relative to it.

@BruceForstall
Copy link
Member

The first fragment is the only one where a prolog can exist (naturally, at the beginning of the function) and the only place where a stack pointer adjustment can be made. Any fragment can have any number of epilogs (and not just at the end of the code range).

We always use the fragment 1 unwind info for unwinding

This sounds like the bug; we need to find the fragment in which the PC exists to find the right unwind info applicable to that fragment.

@janvorli
Copy link
Member

The other fragments don't have any information on the prolog. So, the unwinder would not be able to restore registers pushed in prolog from those. @AndyAyersMS has made a fix to use the first fragment info for unwinding a long time ago when we have hit that in our tests (dotnet/coreclr#22202).
The unwind info for the fragment 2 and the rest start with unwind code end_c. The unwinder exits on encountering this code, which happens always when it is not in one of the epilogs of the fragment.

@BruceForstall
Copy link
Member

I think that change dotnet/coreclr#22202 is incorrect, as it causes the VM to ignore the non-first fragments in all cases. There may be some cases where the VM wants to look up the RUNTIME_FUNCTION corresponding to the beginning of the function (or funclet), but that first fragment shouldn't be used for unwinding: it has the incorrect prolog and, especially, epilog info. However, it appears we pass the "root" fragment RUNTIME_FUNCTION to RtlVirtualUnwind.

The end_c code is the first code in non-first fragments to indicate that there is no prolog in the fragment (the prolog is considered to be zero length). Thus, if we want to unwind from offset zero in a fragment, the unwinder can tell we're not within the prolog (which would require doing a "partial" unwind). The full set of prolog unwind codes then follow the end_c, terminated by a normal end code.

This is what is documented in https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#function-fragments:

Epilogs only (region 2: prolog is in host region)

It's assumed that by the time control jumps into this region, all prolog codes have been executed. Partial unwind can happen in epilogs the same way as in a normal function. ... it can be encoded with a "phantom" prolog, bracketed by an end_c and end unwind code pair. The leading end_c indicates the size of prolog is zero. Epilog start index of the single epilog points to set_fp.

Unwind code for region 2: end_c, set_fp, save_regp 0,240, save_fplr_x_256, end.

However, by reading the code (I haven't debugged it yet) the unwinder we have in the code base (I believe only used on Linux arm64) (src\coreclr\unwinder\arm64\unwinder.cpp, RtlpUnwindFunctionFull) doesn't seem to have this special handling for end_c. It seems to treat end_c and end equivalently in all cases. I would have expected it to skip the end_c when executing the codes. I'd be curious to know what the current Windows implementation does.

Again, reading the unwinder code, it appears that if you pass an offset that is far beyond the length of the function specified in the unwind codes, it ends up executing all the prolog unwind codes (which seems odd). I would expect to see that behavior since we're passing a ControlPC for an incorrect RUNTIME_FUNCTION, leading to the calculation of an incorrect (and very large) function offset.

@janvorli
Copy link
Member

I'd be curious to know what the current Windows implementation does.

I've checked and it actually has a fix - the handling of the end_c now just ignores the code and continues executing the codes following that. You are right that getting that fix should fix the problem.

it ends up executing all the prolog unwind codes

It actually ends up skipping all of them. Imagine if the location was in the middle of the epilog, it would want to skip part of them and if it was at the "ret", it would need to skip all of them. Since the unwinder doesn't expect getting offset that's past the end of the function, it ends up behaving as if it was at the "ret".

@janvorli
Copy link
Member

I'll update the unwinder to match the current state in Windows.

@BruceForstall
Copy link
Member

I'll update the unwinder to match the current state in Windows.

The CoreCLR built-in unwinder is only used on Linux, though. On Windows we use RtlUnwind, which I presume is already doing the right thing.

I think we still need to fix the bug introduced by dotnet/coreclr#22202 in LazyGetFunctionEntry that always finds the first fragment for a function/funclet and uses that in subsequent RtlVirtualUnwind calls. E.g., IsIPInEpilog, JIT_Patchpoint.

It's not clear to me which other code gets the RUNTIME_FUNCTION using GetFunctionEntry (which always returns the first fragment unwind info) and subsequently passes it to RtlVirtualUnwind. E.g., Thread::VirtualUnwindCallFrame on Windows uses RtlLookupFunctionEntry which should give the correct fragment. However, there are some cases where it uses GetFunctionEntry.

@janvorli
Copy link
Member

I think we still need to fix the bug introduced by dotnet/coreclr#22202

Right, that would be part of the fix.

It's not clear to me which other code gets the RUNTIME_FUNCTION using GetFunctionEntry (which always returns the first fragment unwind info) and subsequently passes it to RtlVirtualUnwind.

By looking at the current state and the old change that introduced the problem, I believe that in all cases, reverting the old change would get a correct value at all the callers of the EECodeInfo::GetFunctionEntry. I don't see a place where the first fragment function entry would be correct.

@BruceForstall
Copy link
Member

GetFuncletStartAddress() calls FindRootEntry to find the first fragment. That actually does make sense. (It currently seems unnecessary because it first calls GetFunctionEntry() which also calls FindRootEntry).

Maybe the issue (#10800) that was attempted to be fixed by dotnet/coreclr#22202 is actually fixed by the fixed end_c unwinder handling? I can see in that issue that the end_c and "phantom prolog" are emitted the same as they are today.

@BruceForstall BruceForstall modified the milestones: Future, 9.0.0 Sep 20, 2023
@janvorli
Copy link
Member

GetFuncletStartAddress() calls FindRootEntry to find the first fragment.

Right, before old fix in dotnet/coreclr#22202, the GetFuncletStartAddress actually contained the code that was extracted into the FindRootEntry. So fully reverting that change will fix this issue.

Maybe the issue (#10800) that was attempted to be fixed by dotnet/coreclr#22202 is actually fixed by the fixed end_c unwinder handling?

Yes, that's exactly it.

@BruceForstall BruceForstall changed the title JIT: antigen test case fails when unwind fragment size is set very low JIT: stack walk failure in arm64 function fragments Sep 21, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 26, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 27, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants