GC deadlock/livelock? #66759
Comments
@filipnavara I can see that there is managed code executed on a worker queue thread. Apple disallows sending async signals to such threads, so we cannot interrupt them and the GC has to rely on those threads to sync with the GC explicitly. If that doesn't happen, the runtime suspension hangs. We discussed this problem with the Apple threading team recently and they came up with a solution, but it will require us to change the way we suspend the runtime, and it will only work on new macOS versions after they make a specific change. Would there be a way for your app to not run managed code on a worker queue thread? Anyways, I'll investigate the dump you've shared later today to confirm that this is really what's causing the trouble here.
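The cooperative sync described above can be sketched roughly like this (illustrative C++ only; the names and shape are not CoreCLR's actual suspension code): since the thread cannot be interrupted with an async signal, the GC publishes a suspend request and the thread must notice it at a safe point, park itself, and wait to be resumed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative flags only, not the runtime's actual fields.
std::atomic<bool> g_suspendRequested{false};
std::atomic<bool> g_parked{false};
std::atomic<bool> g_stop{false};

void WorkerLoop() {
    while (!g_stop.load(std::memory_order_acquire)) {
        // A "safe point": the thread checks the flag between units of work.
        if (g_suspendRequested.load(std::memory_order_acquire)) {
            g_parked.store(true, std::memory_order_release);
            while (g_suspendRequested.load(std::memory_order_acquire))
                std::this_thread::yield();   // parked until the GC resumes us
            g_parked.store(false, std::memory_order_release);
        }
        std::this_thread::yield();           // simulate real work
    }
}
```

The failure mode in this issue corresponds to the thread never reaching the safe-point check: the GC then waits on `g_parked` forever.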
There are several things executed on the worker threads. One of them is some background indexing scheduled through |
Btw, I still have the process running, so if there's anything else I can collect, I'd be happy to do so.
Why is that? If the thread is not executing managed code anymore, it should be in preemptive mode and considered suspended. The thread suspension may take a little longer in this case, but it should not cause the process to freeze.
The preemptive mode doesn't work on dispatch/worker threads. Ref: #63393 That said, I would still expect it not to freeze on that condition. I am well aware that I need to depend on coop suspension on the dispatch threads [when managed code is on the stack] and that it may cause the GC pauses to take longer than usual. (Btw, in the state hit due to #63393 it ends up in a busy loop and takes 100% CPU time. That's quite suboptimal even in the case where it eventually unlocks.)
I don't know what triggered it (SDK update, workload update?) but now I am literally unable to run the app for more than a minute until I get this freeze. I disabled all code that knowingly called APIs on dispatcher threads (NSUrlSession, NSBackgroundActivityScheduler) and I still hit one thread like this eventually:
The problem is that until the signal is delivered, we cannot tell whether the thread is executing native or managed code. And since we cannot deliver the signal, all we know is that the thread is in our managed thread list, so we think we need to suspend it.
That sounds like a bug. Once we notice that the thread is preemptive, we can assume that it is suspended - we do not have to wait for the signal to be actually delivered.
Maybe we actually do that and I may be mistaken. I'll take a look to refresh my memory on what we do there.
Seems you are right: runtime/src/coreclr/vm/threadsuspend.cpp, lines 3372–3378 (at 048da75)
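The shortcut the quoted lines implement boils down to roughly this check (the struct and field names here are illustrative, not the actual `Thread` members): a thread already in preemptive mode cannot touch the managed heap, so the GC can count it as suspended without waiting for the signal to be delivered.

```cpp
#include <cassert>

// Illustrative model of per-thread state, not CoreCLR's Thread class.
struct ThreadModel {
    bool preemptiveGCDisabled;  // true => running managed (cooperative) code
    bool signalDelivered;       // suspension signal reached the thread
};

bool CanTreatAsSuspended(const ThreadModel& t) {
    // Preemptive threads do not need to be interrupted; only cooperative
    // threads must either receive the signal or self-suspend at a safe point.
    return !t.preemptiveGCDisabled || t.signalDelivered;
}
```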
I looked at the code a bit more in detail and realized that there's one important thing I omitted in the description. The process was started from VS Code with attached debugger. There could be something fishy in the code path with the attached debugger: runtime/src/coreclr/vm/threadsuspend.cpp Line 5725 in 048da75
I tried to run the process again with no debugger attached and it seemed to behave much better. I didn't have time to do any extensive testing, though.
@filipnavara when you say it behaves much better, do you actually mean that it still hangs sometimes? I've noticed that in the dump you've shared, many threads have
I had just a couple of minutes to try, so I cannot tell for sure one way or the other. It was not behaving jerky and I didn't hit a hang outside of the debugger yet. I've already been at this for a couple of hours, so I cannot tell for certain what I did before the freeze. The application was definitely freezing a lot under the debugger, so I may have tried to pause it after the livelock occurred.
From what I can see in the dump, it was actually not stuck waiting for the runtime to suspend. It was just starting the suspension here: runtime/src/coreclr/vm/threadsuspend.cpp Line 3308 in 048da75
So I wonder what's causing the hang. I can see that thread 110 also wants to suspend the runtime in ep_rt_coreclr_sample_profiler_write_sampling_event_for_threads. I wonder if there is some kind of suspension train happening between the GC suspension and the profiling suspension. As can be seen above, there are over 100 threads running, so maybe it takes the profiler a long time to walk the stacks of all the threads and then the time to take the next sample comes right after, or something along those lines. The sampling rate seems to be set to 1ms. I also wonder why the profiler sampling is running (sampling_thread). @filipnavara how did you get the dump? Have you used the createdump tool? It seems that the dump is missing some memory areas and SOS cannot find managed function names in most cases, which is usually due to the dump being taken by the OS (and maybe lldb).
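The suspension-train hypothesis is essentially arithmetic; a hypothetical helper makes the condition explicit (the per-thread walk time below is made up for illustration): if walking the stacks of all threads takes at least as long as the sampling interval, the next profiler suspension starts as soon as the previous one ends and application code never runs.

```cpp
#include <cassert>

// Back-of-the-envelope check for the "suspension train" condition.
// All figures are illustrative assumptions, not measured values.
bool SuspensionTrainPossible(int threadCount,
                             double walkMicrosPerThread,
                             double samplingIntervalMicros) {
    return threadCount * walkMicrosPerThread >= samplingIntervalMicros;
}
```

With the numbers mentioned above (over 100 threads, a 1ms sampling rate), even a 10µs stack walk per thread is enough to saturate the interval.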
The dump was created with |
Here's a new one without the sampling profiler (at least I didn't invoke it externally) and with I run it to the point where it appeared freezing. Then I hit "pause" in the debugger which never finished. After that I manually run |
Great, thank you! I'll take a look.
@filipnavara the dump looks similar to the previous one, except for the sampling profiler stuff not being there (as expected) and the managed method names being shown correctly via SOS clrstack now. It would also be great to take two dumps one after another so that I can see whether the thread that's suspending for GC is making any progress. There is one more thing that's the same between those two dumps - in both cases the thread that's suspending for GC is in the |
I will try but so far that has been unsuccessful.
Oh, I still have the process from the morning running, so I can do more dumps. One of the threads is stuck in a busy loop.
I tried to attach with LLDB and put a couple of breakpoints, but none of them seem to be hit 🤷🏻
(it could be LLDB misbehaving, so I would not necessarily draw conclusions from it)
Nevertheless, there is something interesting in the LLDB stack trace:
... and after putting a few more breakpoints I started getting this:
and the process died :/
The crash in
Anyways, the two extra dumps have revealed that the thread that tries to suspend the runtime is actually progressing. In one case it was running in SuspendRuntime, in the other in RestartEE.
case 1:
case 2:
So it seems it is spinning in the loop at runtime/src/coreclr/vm/threadsuspend.cpp, lines 5661–5768 (at 4d39501)
I keep investigating ...
The g_pDebugInterface->ThreadsAtUnsafePlaces() is returning true because the underlying |
runtime/src/coreclr/vm/callhelpers.cpp Line 182 in 8e571cd
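The bookkeeping being discussed can be sketched like this (function names loosely mirror the debugger interface; the counter and its helpers are illustrative, not the actual implementation): every thread that hits a debugger tracing breakpoint at a GC-unsafe spot bumps a global count, and the check that keeps the GC suspension spinning is simply "is that count non-zero".

```cpp
#include <atomic>
#include <cassert>

// Illustrative global count of threads stopped at GC-unsafe places.
std::atomic<int> g_threadsAtUnsafePlaces{0};

void EnterUnsafePlace() { g_threadsAtUnsafePlaces.fetch_add(1); }
void LeaveUnsafePlace() { g_threadsAtUnsafePlaces.fetch_sub(1); }

// Sketch of the predicate the suspension loop consults.
bool ThreadsAtUnsafePlaces() { return g_threadsAtUnsafePlaces.load() != 0; }
```

If a thread increments the count and then blocks before it can decrement it, the predicate stays true indefinitely, which is the situation described in this thread.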
Just-my-code is disabled, btw. It was one of the options I changed before it started happening.
The debugger does not place these tracing breakpoints with just-my-code enabled (or at least does not place them frequently). These tracing breakpoints are placed frequently with just-my-code disabled. That explains why disabling just-my-code would make this deadlock show up.
@jkotas I am trying to reason about what's not expected in this case:
I think it is the first one (is it unexpected that DebuggerController::DispatchPatchOrSingleStep calls Thread::RareDisablePreemptiveGC). The first method instruction is not a place where the GC can run, so it sounds right to me that the thread is marked as not being at a safe place.
@VSadov you have modified the code in runtime/src/coreclr/vm/threadsuspend.cpp Lines 5734 to 5736 in 798d52b
It seems that's what we are hitting here, but not in a rare case. The debugger puts a breakpoint at the first instruction of |
@VSadov ah, the comment was there before, it just got moved, I am sorry for the confusion.
I do not recall adding this comment. It looks like it may have been there before.
Note that this is not about setting a breakpoint. I think it refers to freeze/thaw functionality, so it should not be common
The strange thing is that nothing seems to have changed since .NET Core 1.0 in the
@filipnavara did it start happening for you recently and was it working ok say with an older .NET 6 version? Or maybe the debugger started to hook the
I first noticed it during work to upgrade to the Xamarin.Mac Preview 14 workload. I don't think the workload is at fault though, the runtime version didn't change. I suspect that disabling "just my code" could have been a trigger. It is something that I had to change to debug an unrelated issue and I just left it on. The particular data that caused this heavy GC and high thread count is something I got from our test team about a week ago. It could simply be that I never had data that stresses the GC/threading/debugger so heavily.
In my understanding, the logic is that the debugger has priority here. We revert the suspension, since we can't walk stacks anyways, and let the debugger do something. The debugger either proceeds with a full debugger suspension or releases the threads. Only in rare cases does the debugger not make progress and everything else ends up waiting for it. Maybe something changed in that logic.
When we say "freeze" here - does the app lock up completely or are there pauses?
@VSadov Complete lock-up. We are actually not letting the debugger do anything, as the thread that wants to send the breakpoint event to the debugger is waiting on the suspension event and it has incremented the count of threads at unsafe places. So the loop in the
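A bounded simulation of the cycle described above (purely illustrative, not the runtime's code): the suspending thread keeps retrying while some thread is at an unsafe place, but that thread is itself blocked waiting on the suspension event, so the condition never clears and the loop spins forever.

```cpp
#include <cassert>

// Retry loop modelled after the description above. The predicate stands
// in for g_pDebugInterface->ThreadsAtUnsafePlaces(); maxRetries bounds
// the simulation so the livelock is observable instead of hanging.
bool TrySuspendRuntime(int maxRetries, bool (*threadsAtUnsafePlaces)()) {
    for (int i = 0; i < maxRetries; ++i) {
        if (!threadsAtUnsafePlaces())
            return true;   // all threads synced; suspension completes
        // Otherwise revert the partial suspension and let the debugger
        // run - but the blocked thread never leaves its unsafe place.
    }
    return false;          // livelock: gave up after maxRetries attempts
}
```

With a predicate that never clears, the loop exhausts its retries; with one that clears, it completes immediately.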
But looking at the |
See the clrthreads output: the lowest nibble of the state is 8, which corresponds to TS_DebugSuspendPending:
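Decoding that state value is a simple bit test; per the comment above, TS_DebugSuspendPending occupies the 0x8 bit of the thread-state flags, which matches the "8" in the lowest nibble (a sketch, assuming only that flag value):

```cpp
#include <cassert>

// Flag value inferred from the comment above: the lowest nibble being 8
// corresponds to a pending debugger suspension.
const unsigned TS_DebugSuspendPending = 0x00000008;

bool HasDebugSuspendPending(unsigned state) {
    return (state & TS_DebugSuspendPending) != 0;
}
```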
I really appreciate all the effort and prompt responses. Thanks!
Closing since @kouvel's PR has merged.
Do we want to backport to .NET 6 (where I originally hit the issue and keep hitting it)?
@kouvel's fix was quite large and we felt it was too risky to apply it as a servicing patch for 6.0. Instead we made a much smaller change in 6.0 that we hope will avoid the majority of the problematic cases, but it isn't a total fix. The original issue that discussed the debugger deadlock is here if it helps fill out the story. @davmason - do you know which servicing release your mitigation fix went out in (or will go out in)?
Thanks, should be 6.0.6. I spent most of the time with 6.0.6/6.0.7 on Windows, so I didn't get to try the scenario again and I missed the targeted fix.
Description
Our process routinely seems to freeze. There are some dispatcher threads which may be interfering with the thread suspension logic, but in this particular case the only one seems to be running CoreCLR code.
Reproduction Steps
No idea so far; it happens randomly but consistently in the first few minutes of the process run.
Expected behavior
No process freeze.
Actual behavior
Process becomes locked and non-responsive.
Regression?
No response
Known Workarounds
No response
Configuration
.NET 6.0.3, Xamarin.Mac Preview 14 (happened on Preview 13 too)
Other information
Thread sample: https://gist.github.com/filipnavara/afdd426069dfa00f18efa5b8508dd34c