Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RtlDllShutdownInProgress to detect process shutdown on Windows #103877

Merged
merged 2 commits into from
Jul 2, 2024

Conversation

jkotas
Copy link
Member

@jkotas jkotas commented Jun 24, 2024

Switching to cooperative mode is not safe during process shutdown on Windows. Process shutdown can terminate a thread in the middle of the GC. The shutdown thread deadlocks if it tries to switch to cooperative mode and wait for the GC to finish in this situation.

Use RtlDllShutdownInProgress Windows API to detect process shutdown to skip cleanup that has to be done in cooperative mode.

The existing g_fProcessDetach flag is set too late - using this flag to skip cooperative mode switch would lead to shutdown deadlocks, and the existing g_fEEShutDown flag is set too early - using this flag to skip cooperative mode switch would lead to shutdown crashes.

Fixes #103624

@jkotas
Copy link
Member Author

jkotas commented Jun 24, 2024

This is not 100% reliable fix, but it matches what we do in similar situations during shutdown currently.

@jkotas jkotas requested a review from jkoritzinsky June 24, 2024 04:44
@jkotas jkotas changed the title Disable of RuntimeThreadLocals during shutdown Disable cleanup of RuntimeThreadLocals during shutdown Jun 24, 2024
@jkotas jkotas added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-VM-coreclr and removed area-System.Runtime labels Jun 24, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@jkotas
Copy link
Member Author

jkotas commented Jun 26, 2024

Unfortunately, this fix does not work well. It replaces intermittent hang with intermittent crash during shutdown.

@jkoritzinsky
Copy link
Member

What if we do the alloc context fixing during the destruction of the RuntimeThreadLocals object? Might require us to move the PreemptiveGCDisabled bool over first though.

@jkotas
Copy link
Member Author

jkotas commented Jun 26, 2024

What if we do the alloc context fixing during the destruction of the RuntimeThreadLocals object?

I do not see how it would help.

This is the sequence that leads to the hang:

  • OS initiates process shutdown
  • There is a thread in the middle of the GC that gets killed by the process shutdown sequence. It means that the process state is indefinitely stuck in "GC in progress" state
  • OS sends thread detach notification to the last thread in the process that runs the thread-local destructors. We try to switch to cooperative mode and wait for GC to finish - but that's never going to happen.

The problem with the current fix is that it is starts skipping the cleanup too early while there are still multiple threads running in the process. It means that we can end up running a GC without cleaned up alloc context that the GC gets very unhappy about.

I have been looking for a more precise detection of the moment after which it is fine to start skipping the cleanup.

@jkotas jkotas removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jun 30, 2024
@jkotas jkotas force-pushed the fix-103624 branch 4 times, most recently from eb524c2 to 6b18f2a Compare June 30, 2024 04:55
@jkotas
Copy link
Member Author

jkotas commented Jun 30, 2024

Ok, I think I found something that works. I have also fixed other instances of thread cleanup switching to cooperative mode that I found during testing and review.

@jkotas jkotas changed the title Disable cleanup of RuntimeThreadLocals during shutdown Use RtlDllShutdownInProgress to detect process shutdown on Windows Jun 30, 2024
Switching to cooperative mode is not safe during process shutdown on
Windows. Process shutdown can terminate a thread in the middle of the
GC. The shutdown thread deadlocks if it tries to switch to cooperative
mode and wait for the GC to finish in this situation.

Use RtlDllShutdownInProgress Windows API to detect process
shutdown to skip cleanup that has to be done in cooperative mode.

The existing g_fProcessDetach flag is set too late - using this flag to
skip cooperative mode switch would lead to shutdown deadlocks, and the
existing g_fEEShutDown flag is set too early - using this flag to skip
cooperative mode switch would lead to shutdown crashes.

Fixes dotnet#103624
@jkotas
Copy link
Member Author

jkotas commented Jun 30, 2024

/azp run runtime-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jkotas
Copy link
Member Author

jkotas commented Jun 30, 2024

/azp run runtime-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is race condition city, and I think that there are still races that exist. Notably, I don't understand when the GC thread can be eliminated without running its cleanup logic, so I don't know how to analyze this for complete correctness, but based on the description you provided this should strictly reduce the set of deadlocks that are likely to occur.

@jkotas
Copy link
Member Author

jkotas commented Jul 2, 2024

I don't understand when the GC thread can be eliminated without running its cleanup logic

Any shutdown with ongoing background activity is prone to this problem. What happens is:

  • Some thread (typically the main app thread) decides to shutdown the process. It must switch to preemptive mode to do so.
  • Some background thread starts GC
  • Windows kernel suspends all threads (except the one thread running the shutdown) and kills them. The thread running in the middle of the GC can be one of those.
  • The one remaining shutdown thread is resumed to proceed with the cleanup. If it ever tries to switch to cooperative mode and wait for GC to finish, it is going to hang.

I think that there are still races that exist.

Yes, I agree. The code has a lot of checks that should be either unnecessary or that have race conditions.

@jkotas jkotas merged commit 5044e93 into dotnet:main Jul 2, 2024
89 checks passed
@jkotas jkotas deleted the fix-103624 branch July 2, 2024 00:13
return g_fProcessDetach;
#else
// RtlDllShutdownInProgress provides more accurace information about whether the process is shutting down.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accurace -> accurate ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - fix included in #104318

jkotas added a commit to jkotas/runtime that referenced this pull request Jul 2, 2024
shutdown

Port dotnet#103877 to Native AOT. This is fixing intermittent shutdown hangs
that can observed by running the tests attached to dotnet#103877.
jkotas added a commit that referenced this pull request Jul 2, 2024
…04318)

shutdown

Port #103877 to Native AOT. This is fixing intermittent shutdown hangs
that can observed by running the tests attached to #103877.
@github-actions github-actions bot locked and limited conversation to collaborators Aug 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure: CoreMangLib/system/span/RefStructWithSpan/RefStructWithSpan.cmd
5 participants