Fix stack overflow handling issue in GC stress #56733
This change fixes a problem where, in GC stress mode 3, a GC started to run on the thread that hit the stack overflow. The GC was triggered by the GCX_PREEMP in DebuggerRCThread::DoFavor, which is called from EEPolicy::HandleFatalStackOverflow. This was causing failures in the CI. The issue is specific to GC stress; in regular cases the GCX_PREEMP would not start running a GC on the current thread. The fix is to inhibit GC stress in HandleFatalStackOverflow.
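For readers unfamiliar with the pattern, here is a minimal toy sketch of what "inhibiting GC stress" for the duration of a handler can look like, using a scope-based (RAII) holder. All names below (`toy::StressInhibitHolder`, `toy::SwitchToPreemptive`, `toy::HandleFatalStackOverflowToy`) are hypothetical stand-ins invented for illustration; they are not the runtime's actual types or the code in this PR.

```cpp
// Toy model only: every name here is hypothetical and is not the CoreCLR API.
namespace toy {

// Per-thread counter; GC stress is suppressed on this thread while it is non-zero.
thread_local int g_inhibitCount = 0;

// Counts how many stress-induced GCs the toy model has "run".
int g_stressGcRuns = 0;

// RAII holder: inhibits GC stress for the lifetime of the object.
class StressInhibitHolder {
public:
    StressInhibitHolder() { ++g_inhibitCount; }
    ~StressInhibitHolder() { --g_inhibitCount; }
    StressInhibitHolder(const StressInhibitHolder&) = delete;
    StressInhibitHolder& operator=(const StressInhibitHolder&) = delete;
};

// Models a transition to preemptive mode. Assumption of this toy model:
// with GC stress enabled, the transition triggers a GC on the current
// thread unless stress is inhibited.
void SwitchToPreemptive(bool gcStressEnabled) {
    if (gcStressEnabled && g_inhibitCount == 0) {
        ++g_stressGcRuns;
    }
}

// Models a fatal-error handler whose callee switches to preemptive mode.
void HandleFatalStackOverflowToy(bool gcStressEnabled, bool inhibitStress) {
    if (inhibitStress) {
        StressInhibitHolder holder;           // stress suppressed in this scope
        SwitchToPreemptive(gcStressEnabled);  // no stress GC is triggered
    } else {
        SwitchToPreemptive(gcStressEnabled);  // a stress GC may be triggered
    }
}

} // namespace toy
```

In this model, calling `toy::HandleFatalStackOverflowToy(true, false)` increments `toy::g_stressGcRuns`, while calling it with `inhibitStress` set to `true` leaves the counter unchanged. A scope-based holder is convenient here because the inhibition ends automatically on every exit path from the handler.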
This fixes the part of #46279 that was reported on x86 (and that reproduced there locally under GC stress). There are also failures on arm / x64 Windows that are not related.
@janvorli Do you know that this is unrelated to the arm32 issue? I've been trying to reproduce that one locally and I can't seem to get anything out of it.
The behavior on arm and x64 is different. It just prints "stack overflow." without any stack trace, but it returns the correct stack overflow return code. Also, it happens without GC stress, whereas the x86 failure was GC stress only.
@davidwrighton I was able to repro it locally on x64. I'm trying again with an instrumented runtime to see why it didn't print the call stack. I'm not sure how often it reproduces, though.
So, the problem is that the stackwalker cannot walk the stack in the failure case, so it doesn't call the callback to log any frames. In the case I've seen, the stack overflow happened in MethodTable::SanityCheck when the method attempted to make the first call. The prolog starts like this:
The rsp after this is 00000031`78646ff0, that is 16 bytes below the last valid stack page. Then a bit later, there is a call that triggers the stack overflow since its return address would go to the stack guard page. And this is the top of the stack trace that WinDbg shows:
WinDbg had some issue here too: the frame of the failing function is missing; it should be between frames 8 and 9 above. But I don't think that is related. When our stack walker tries to walk the stack, it starts at the FaultingExceptionFrame, which points to a native frame. I think this is the actual issue: the stack walker is not handling that case. I think EEPolicy::HandleFatalStackOverflow should call Thread::VirtualUnwindToFirstManagedCallFrame on the context instead of AdjustContextForJITHelpers. We don't care about native frames in the stack overflow case anyway (for the purpose of reporting).
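The stack-pointer arithmetic mentioned earlier in this comment can be checked with a small sketch. It assumes 4 KiB pages and an 8-byte return address pushed by an x64 `call`; the helper names are hypothetical. Under the reading that the last valid stack page starts at the next page boundary above the quoted rsp, the slot for the return address lands in the page below it, which would be the guard page.

```cpp
#include <cstdint>

// Assumption: 4 KiB pages, as is typical on x64 Windows.
constexpr std::uint64_t kPageSize = 0x1000;

// Start address of the page containing addr.
constexpr std::uint64_t PageBase(std::uint64_t addr) {
    return addr & ~(kPageSize - 1);
}

// Distance from addr up to the next page boundary.
constexpr std::uint64_t BytesBelowNextPage(std::uint64_t addr) {
    return PageBase(addr) + kPageSize - addr;
}

// Where an x64 `call` would store its 8-byte return address.
constexpr std::uint64_t ReturnAddressSlot(std::uint64_t rsp) {
    return rsp - 8;
}

// rsp value quoted in the comment (WinDbg prints it as 00000031`78646ff0).
constexpr std::uint64_t kRsp = 0x0000003178646ff0ULL;

// rsp sits 16 bytes below the page boundary at 0x3178647000.
static_assert(BytesBelowNextPage(kRsp) == 16, "rsp is 16 bytes below the boundary");

// The return address slot is in the page starting at 0x3178646000,
// that is, the page below that boundary.
static_assert(PageBase(ReturnAddressSlot(kRsp)) == 0x0000003178646000ULL,
              "return address lands in the lower page");
```

Adjusting rsp does not touch memory by itself, which is consistent with the overflow being reported only at the later call that writes the return address.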