JIT: stack walk failure in arm64 function fragments #91139
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
Compile: TestClass156-test-assertion.cs.txt
Run on osx arm64 with
(may also repro on other arm64 hosts, did not try)
By increasing the split function size you can make the test "pass":
@BruceForstall FYI |
Looks like we're failing fast due to a GS cookie check failure. I've seen it also fail with more "normal" GC hole asserts (e.g., CREATE_CHECK_STRING). |
I've been repro'ing on win-arm64 with:
If I use Stack trace:
The failure is unreliable. Using a Debug build doesn't fail. If I turn on JitDump it doesn't fail. |
@janvorli Here's a case where we seem to cause stack walking or GC issues when we use a JIT stress mode that causes us to create multiple unwind info "fragments" more frequently than required (using If you have any suggestions for investigation, please let me know. |
I'll take a look |
@janvorli fyi, I created a test PR that enables fragment splitting aggressively (basically, create as many fragments as possible): #92203. I triggered the gcstress pipeline and there are lots of GCStress=3 failures: https://dev.azure.com/dnceng-public/public/_build/results?buildId=409281&view=results. My current theory is that the VM doesn't know how to look up the GS cookie location correctly on the frame for the non-first fragment. |
I have debugged the issue, and I believe the problem is that when a function is split into multiple fragments, it should not have an epilog in the first fragment. The doc mentioned in the comment above is not 100% clear about this, but it can be interpreted this way: where it shows an example, it says "Prolog only (region 1: all epilogs are in separated regions)". |
The first fragment is the only one where a prolog can exist (naturally, at the beginning of the function) and the only place where a stack pointer adjustment can be made. Any fragment can have any number of epilogs (and not just at the end of the code range).
This sounds like the bug; we need to find the fragment in which the PC exists to find the right unwind info applicable to that fragment. |
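To make the lookup described above concrete, here is a minimal sketch of selecting the fragment that contains a given PC. The `Fragment` struct and `findFragment` function are hypothetical names invented for illustration; they are not CoreCLR types, and the real function table layout differs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of one unwind-info fragment; not a CoreCLR type.
struct Fragment {
    std::uint32_t begin;  // start offset of the fragment's code range
    std::uint32_t end;    // one past the last byte of the range
    bool hasProlog;       // true only for the first fragment
};

// Returns the index of the fragment whose [begin, end) range contains pc,
// or std::nullopt if pc lies outside every fragment.
// Assumes `fragments` is sorted by `begin` and the ranges do not overlap.
std::optional<std::size_t> findFragment(const std::vector<Fragment>& fragments,
                                        std::uint32_t pc) {
    // Find the first fragment whose begin is strictly greater than pc.
    auto it = std::upper_bound(
        fragments.begin(), fragments.end(), pc,
        [](std::uint32_t value, const Fragment& f) { return value < f.begin; });
    if (it == fragments.begin()) {
        return std::nullopt;  // pc precedes the first fragment
    }
    --it;  // candidate: last fragment that starts at or before pc
    if (pc >= it->end) {
        return std::nullopt;  // pc falls in a gap after the candidate
    }
    return static_cast<std::size_t>(it - fragments.begin());
}
```

With this model, the unwind info applied to a PC comes from the fragment returned by `findFragment`, not unconditionally from fragment 0.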
The other fragments don't have any information on the prolog, so the unwinder would not be able to restore registers pushed in the prolog from them. @AndyAyersMS made a fix a long time ago to use the first fragment's info for unwinding, when we hit that in our tests (dotnet/coreclr#22202). |
I think that change dotnet/coreclr#22202 is incorrect, as it causes the VM to ignore the non-first fragments in all cases. There may be some cases where the VM wants to look up the The This is what is documented in https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#function-fragments:
However, by reading the code (I haven't debugged it yet) the unwinder we have in the code base (I believe only used on Linux arm64) (src\coreclr\unwinder\arm64\unwinder.cpp, RtlpUnwindFunctionFull) doesn't seem to have this special handling for Again, reading the unwinder code, it appears that if you pass an offset that is far beyond the length of the function specified in the unwind codes, it ends up executing all the prolog unwind codes (which seems odd). I would expect to see that behavior since we're passing a ControlPC for an incorrect |
I've checked and it actually has a fix - the handling of the
It actually ends up skipping all of them. If the location were in the middle of the epilog, it would want to skip some of them, and if it were at the "ret", it would need to skip all of them. Since the unwinder doesn't expect to get an offset that's past the end of the function, it ends up behaving as if it were at the "ret". |
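A rough model of that skipping behavior, as a sketch only: `epilogCodesToSkip` is a hypothetical helper, not code from unwinder.cpp, and it assumes a simplified epilog in which each 4-byte instruction before the "ret" corresponds to exactly one unwind code.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, not taken from unwinder.cpp.
// pcOffset and epilogStart are byte offsets from the start of the function.
// epilogCodeCount is the number of unwind codes that precede the final "ret".
std::uint32_t epilogCodesToSkip(std::uint32_t pcOffset,
                                std::uint32_t epilogStart,
                                std::uint32_t epilogCodeCount) {
    if (pcOffset < epilogStart) {
        return 0;  // PC is before the epilog: no epilog instruction has run yet
    }
    // Each ARM64 instruction is 4 bytes, so this counts the epilog
    // instructions that have already executed.
    std::uint32_t executed = (pcOffset - epilogStart) / 4;
    // An offset past the end of the function clamps to "skip everything",
    // which is the same answer as sitting on the final "ret".
    return std::min(executed, epilogCodeCount);
}
```

In this model an offset far past the function's end is indistinguishable from a PC at the "ret", which matches the behavior described above.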
I'll update the unwinder to match the current state in Windows. |
The CoreCLR built-in unwinder is only used on Linux, though. On Windows we use RtlUnwind, which I presume is already doing the right thing. I think we still need to fix the bug introduced by dotnet/coreclr#22202 in It's not clear to me which other code gets the |
Right, that would be part of the fix.
By looking at the current state and the old change that introduced the problem, I believe that in all cases, reverting the old change would get a correct value at all the callers of the |
Maybe the issue (#10800) that was attempted to be fixed by dotnet/coreclr#22202 is actually fixed by the fixed |
Right, before the old fix in dotnet/coreclr#22202, GetFuncletStartAddress actually contained the code that was extracted into FindRootEntry. So fully reverting that change will fix this issue.
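For readers unfamiliar with the root lookup, a minimal sketch of the idea follows, assuming it amounts to walking backwards through the function table until a non-fragment entry is reached. `Entry` and `findRootIndex` are hypothetical names for illustration and do not mirror the real RUNTIME_FUNCTION layout or the actual FindRootEntry signature.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a function-table entry; not the real RUNTIME_FUNCTION.
struct Entry {
    bool isFragment;  // true for a non-first fragment (no prolog of its own)
};

// Walks backwards from `index` until it reaches an entry that is not a fragment.
// Assumes entries are ordered by address and that every fragment is preceded
// (possibly through other fragments) by its root entry.
std::size_t findRootIndex(const std::vector<Entry>& table, std::size_t index) {
    while (index > 0 && table[index].isFragment) {
        --index;
    }
    return index;
}
```

Under this assumption, the funclet start address comes from the root entry, while the unwind info for a given PC still comes from the fragment that contains it.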
Yes, that's exactly it. |