Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stack walking for new EH #99624

Merged
merged 5 commits into from
Mar 15, 2024

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Mar 12, 2024

There are three issues caused by the new exception handling implementation causing GC holes. These were discovered by the CI GC stress tests and mostly occur only with GC stress 3 and tiered compilation disabled.

  • There is an edge case when the StackFrameIterator was incorrectly setting the
    fShouldParentFrameUseUnwindTargetPCforGCReporting, which resulted in a 100% reproducible GC hole in the
    baseservices\exceptions\unittests\ThrowInFinallyNestedInTry with GC stress C and tiered compilation off.
    The fix is to delete code that is already present in an "if" branch above the change and that should really be executed only when the funclet parent is the caller of the actual handler frame.

  • When GC occurs when a finally funclet is on the stack and then next GC happens when we are in the managed code of the second pass of the same exception handling, it introduces a GC hole. The latter GC still needs to report some slots in the parent of the finallys so that next GC scan when there is a finally again needs to find those alive. The current code works when the parent of the finally is the same as the parent of the actual exception handling catch funclet, but in other cases, the stack walk was not reporting the slots. It cannot just report all slots in the parent, as some of them would already be dead. Only slots that the funclets would report should be reported.
    The fix is to remember the PC, frame pointer and flags of the last GC reporting of a finally funclet and use that to report slots based off the frame pointer when no funclet of the same exception handling is found during a future GC.

  • Last issue is related to the usage of SEH to pass through native frames when stack walking hits CallDescrWorkerInternal frame. The SEH stack unwinding doesn't update and doesn't even pass out the context pointers, so if a non-volatile register stored in a native frame contains a reference to an object that got relocated, a GC hole occurs. The Interop/Interop test was hitting that problem on Windows.
    The fix is to use context pointers from a skipped explicit frame to update context pointers in the current REGDISPLAY. When the SEH hits a managed frame in the 2nd pass, we restart exception handling on the context of the managed frame, so the explicit frame that was on the transition boundary between that managed frame and the just unwound native frames contains the context pointers we need.

@janvorli janvorli added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it labels Mar 12, 2024
@janvorli janvorli self-assigned this Mar 12, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@janvorli
Copy link
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

There is an edge case when the StackFrameIterator was incorrectly
setting the fShouldParentFrameUseUnwindTargetPCforGCReporting, which
resulted in a 100% reproducible GC hole in the
baseservices\exceptions\unittests\ThrowInFinallyNestedInTry with GC
stress C and tiered compilation off.
The fix is to delete code that is already present in an "if" branch
above the change and that should really be executed only when the
funclet parent is the caller of the actual handler frame.
@janvorli janvorli force-pushed the fix-stack-walking-for-new-eh-4 branch from 975dedc to 0d71a2f Compare March 13, 2024 21:59
@janvorli janvorli requested a review from jkotas March 13, 2024 22:20
@janvorli janvorli added this to the 9.0.0 milestone Mar 13, 2024
@janvorli janvorli added area-ExceptionHandling-coreclr and removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it area-VM-coreclr labels Mar 13, 2024
@janvorli
Copy link
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@janvorli janvorli changed the title Fix stack walking for new eh 4 Fix stack walking for new EH Mar 14, 2024
@janvorli
Copy link
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@janvorli janvorli force-pushed the fix-stack-walking-for-new-eh-4 branch from 581cdcd to fcbfa2c Compare March 14, 2024 14:12
@janvorli
Copy link
Member Author

The fix for the last issue needs to be a bit involved, it seems - it is hitting an assert I've added there, so there seem to be a case when a skipped frame isn't one that needs register updates or it doesn't point to the managed frame. I need to check what is such case.

@janvorli
Copy link
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

InlinedCallFrame needs to be excluded from updating the context
pointers, as it restores PC / SP to a different location than the
callsite of the pinvoke and it doesn't update the context pointers
anyways.
@janvorli janvorli force-pushed the fix-stack-walking-for-new-eh-4 branch from 850a0c4 to 49a2b8b Compare March 15, 2024 15:16
@janvorli
Copy link
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@janvorli
Copy link
Member Author

/azp run runtime-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@janvorli
Copy link
Member Author

Only two GC stress legs have failures due to timeout in a test
arm32: JIT/Directed/tailcall/more_tailcalls/more_tailcalls.cmd
arm64: Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd

The arm32 happens even without the new EH enabled - see #99410
The arm64 seems to be an existing issue too - see #97735

@janvorli janvorli merged commit 4a69553 into dotnet:main Mar 15, 2024
88 of 91 checks passed
@janvorli janvorli deleted the fix-stack-walking-for-new-eh-4 branch March 15, 2024 22:19
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 26, 2024

Windows x64 improvements: dotnet/perf-autofiling-issues#31934

@github-actions github-actions bot locked and limited conversation to collaborators Apr 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants