GC deadlock/livelock? #66759
Comments
@filipnavara I can see that there is managed code executed on a worker queue thread. Apple disallows sending async signals to such threads, so we cannot interrupt them and the GC has to rely on those threads to sync with the GC explicitly. If that doesn't happen, the runtime suspension hangs. We discussed this problem with the Apple threading team recently and they came up with a solution, but it will require us to change the way we suspend the runtime, and it will only work on new macOS versions after they make a specific change. Would there be a way for your app to not run managed code on a worker queue thread? Anyways, I'll investigate the dump you've shared later today to confirm that this is really what's causing the trouble here.
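The cooperative sync described above can be sketched roughly like this (illustrative C++ only; the names and shape are not CoreCLR's actual suspension code): since the thread cannot be interrupted with an async signal, the GC publishes a suspend request and the thread must notice it at a safe point, park itself, and wait to be resumed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative flags only, not the runtime's actual fields.
std::atomic<bool> g_suspendRequested{false};
std::atomic<bool> g_parked{false};
std::atomic<bool> g_stop{false};

void WorkerLoop() {
    while (!g_stop.load(std::memory_order_acquire)) {
        // A "safe point": the thread checks the flag between units of work.
        if (g_suspendRequested.load(std::memory_order_acquire)) {
            g_parked.store(true, std::memory_order_release);
            while (g_suspendRequested.load(std::memory_order_acquire))
                std::this_thread::yield();   // parked until the GC resumes us
            g_parked.store(false, std::memory_order_release);
        }
        std::this_thread::yield();           // simulate real work
    }
}
```

The failure mode in this issue corresponds to the thread never reaching the safe-point check: the GC then waits on `g_parked` forever.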
There are several things executed on the worker threads. One of them is some background indexing scheduled through |
Btw, I still have the process running, so if there's anything else I can collect, I'd be happy to do so.
Why is that? If the thread is not executing managed code anymore, it should be in preemptive mode and considered suspended. The thread suspension may take a little longer in this case, but it should not cause the process to freeze.
The preemptive mode doesn't work on dispatch/worker threads. Ref: #63393 That said, I would still expect it not to freeze on that condition. I am well aware that I need to depend on coop suspension on the dispatch threads [when managed code is on the stack] and that it may cause the GC pauses to take longer than usual. (Btw, in the state hit due to #63393 it ends up in a busy loop and takes 100% CPU time. That's quite suboptimal even in the case where it eventually unlocks.)
I don't know what triggered it (SDK update, workload update?) but now I am literally unable to run the app for more than a minute until I get this freeze. I disabled all code that knowingly called APIs on dispatcher threads (NSUrlSession, NSBackgroundActivityScheduler) and I still hit one thread like this eventually:
The problem is that until the signal is delivered, we cannot tell whether the thread is executing native or managed code. And since we cannot deliver the signal, all we know is that the thread is in our managed thread list, so we think we need to suspend it.
That sounds like a bug. Once we notice that the thread is preemptive, we can assume that it is suspended - we do not have to wait for the signal to be actually delivered.
Maybe we actually do that and I may be mistaken. I'll take a look to refresh my memory on what we do there.
Seems you are right: runtime/src/coreclr/vm/threadsuspend.cpp, lines 3372–3378 (at 048da75)
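The shortcut the quoted lines implement boils down to roughly this check (the struct and field names here are illustrative, not the actual `Thread` members): a thread already in preemptive mode cannot touch the managed heap, so the GC can count it as suspended without waiting for the signal to be delivered.

```cpp
#include <cassert>

// Illustrative model of per-thread state, not CoreCLR's Thread class.
struct ThreadModel {
    bool preemptiveGCDisabled;  // true => running managed (cooperative) code
    bool signalDelivered;       // suspension signal reached the thread
};

bool CanTreatAsSuspended(const ThreadModel& t) {
    // Preemptive threads do not need to be interrupted; only cooperative
    // threads must either receive the signal or self-suspend at a safe point.
    return !t.preemptiveGCDisabled || t.signalDelivered;
}
```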
I looked at the code a bit more in detail and realized that there's one important thing I omitted in the description. The process was started from VS Code with attached debugger. There could be something fishy in the code path with the attached debugger: runtime/src/coreclr/vm/threadsuspend.cpp Line 5725 in 048da75
I tried to run the process again with no debugger attached and it seemed to behave much better. I didn't have time to do any extensive testing, though.
@filipnavara when you say it behaves much better, do you actually mean that it still hangs sometimes? I've noticed that in the dump you've shared, many threads have
I had just a couple of minutes to try, so I cannot tell for sure one way or the other. It was not behaving jerky and I didn't hit a hang outside of the debugger yet. I've already been at this for a couple of hours, so I cannot tell for certain what I did before the freeze. The application was definitely freezing a lot under the debugger, so I may have tried to pause it after the livelock occurred.
From what I can see in the dump, it was actually not stuck waiting for the runtime to suspend. It was just starting the suspension here: runtime/src/coreclr/vm/threadsuspend.cpp Line 3308 in 048da75
So I wonder what's causing the hang. I can see that thread 110 also wants to suspend the runtime in ep_rt_coreclr_sample_profiler_write_sampling_event_for_threads. I wonder if there is some kind of suspension train happening between the GC suspension and the profiling suspension. As can be seen above, there are over 100 threads running, so maybe it takes the profiler a long time to walk the stacks of all the threads and then the time to take the next sample comes right after, or something along those lines. The sampling rate seems to be set to 1ms. I also wonder why the profiler sampling is running (sampling_thread). @filipnavara how did you get the dump? Have you used the createdump tool? It seems that the dump is missing some memory areas and SOS cannot find managed function names in most cases, which is usually due to the dump being taken by the OS (and maybe lldb).
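The suspension-train hypothesis is essentially arithmetic; a hypothetical helper makes the condition explicit (the per-thread walk time below is made up for illustration): if walking the stacks of all threads takes at least as long as the sampling interval, the next profiler suspension starts as soon as the previous one ends and application code never runs.

```cpp
#include <cassert>

// Back-of-the-envelope check for the "suspension train" condition.
// All figures are illustrative assumptions, not measured values.
bool SuspensionTrainPossible(int threadCount,
                             double walkMicrosPerThread,
                             double samplingIntervalMicros) {
    return threadCount * walkMicrosPerThread >= samplingIntervalMicros;
}
```

With the numbers mentioned above (over 100 threads, a 1ms sampling rate), even a 10µs stack walk per thread is enough to saturate the interval.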
The dump was created with |
Here's a new one without the sampling profiler (at least I didn't invoke it externally) and with I run it to the point where it appeared freezing. Then I hit "pause" in the debugger which never finished. After that I manually run |
Great, thank you! I'll take a look.
@filipnavara the dump looks similar to the previous one, except for the sampling profiler stuff not being there (as expected) and the managed method names being shown correctly via SOS clrstack now. It would also be great to take two dumps one after another so that I can see whether the thread that's suspending for GC is making any progress. There is one more thing that's the same between those two dumps - in both cases the thread that's suspending for GC is in the |
I will try but so far that has been unsuccessful.
Oh, I still have the process from the morning running, so I can do more dumps. One of the threads is stuck in a busy loop.
I tried to attach with LLDB and put a couple of breakpoints, but none of them seem to be hit 🤷🏻
(it could be LLDB misbehaving, so I would not necessarily draw conclusions from it)
Nevertheless, there is something interesting in the LLDB stack trace:
... and after putting a few more breakpoints I started getting this:
and the process died :/
The crash in
Anyways, the two extra dumps have revealed that the thread that tries to suspend the runtime is actually progressing. In one case it was running in SuspendRuntime, in the other in RestartEE.
case 1:
case 2:
So it seems it is spinning in the loop at runtime/src/coreclr/vm/threadsuspend.cpp, lines 5661–5768 (at 4d39501)
I keep investigating ...
The g_pDebugInterface->ThreadsAtUnsafePlaces() is returning true because the underlying |
runtime/src/coreclr/vm/callhelpers.cpp Line 182 in 8e571cd
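The bookkeeping being discussed can be sketched like this (function names loosely mirror the debugger interface; the counter and its helpers are illustrative, not the actual implementation): every thread that hits a debugger tracing breakpoint at a GC-unsafe spot bumps a global count, and the check that keeps the GC suspension spinning is simply "is that count non-zero".

```cpp
#include <atomic>
#include <cassert>

// Illustrative global count of threads stopped at GC-unsafe places.
std::atomic<int> g_threadsAtUnsafePlaces{0};

void EnterUnsafePlace() { g_threadsAtUnsafePlaces.fetch_add(1); }
void LeaveUnsafePlace() { g_threadsAtUnsafePlaces.fetch_sub(1); }

// Sketch of the predicate the suspension loop consults.
bool ThreadsAtUnsafePlaces() { return g_threadsAtUnsafePlaces.load() != 0; }
```

If a thread increments the count and then blocks before it can decrement it, the predicate stays true indefinitely, which is the situation described in this thread.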
Just-my-code is disabled, btw. It was one of the options I changed before it started happening.
The debugger does not place these tracing breakpoints with just-my-code enabled (or at least does not place them frequently). These tracing breakpoints are placed frequently with just-my-code disabled. That explains why disabling just-my-code would make this deadlock show up.
@jkotas I am trying to reason about what's not expected in this case:
I think it is the first one (is it unexpected that DebuggerController::DispatchPatchOrSingleStep calls Thread::RareDisablePreemptiveGC). The first method instruction is not a place where the GC can run, so it sounds right to me that the thread is marked as not being at a safe place.
@VSadov you have modified the code in runtime/src/coreclr/vm/threadsuspend.cpp Lines 5734 to 5736 in 798d52b
It seems that's what we are hitting here, but not in a rare case. The debugger puts a breakpoint at the first instruction of |
@VSadov ah, the comment was there before, it just got moved, I am sorry for the confusion.
I do not recall adding this comment. It looks like it may have been there before.
Note that this is not about setting a breakpoint. I think it refers to freeze/thaw functionality, so it should not be common
The strange thing is that nothing seems to have changed since .NET Core 1.0 in the
@filipnavara did it start happening for you recently and was it working ok say with an older .NET 6 version? Or maybe the debugger started to hook the
I first noticed it during work to upgrade to the Xamarin.Mac Preview 14 workload. I don't think the workload is at fault though, the runtime version didn't change. I suspect that disabling "just my code" could have been a trigger. It is something that I had to change to debug an unrelated issue and I just left it on. The particular data that caused this heavy GC and high thread count is something I got from our test team about a week ago. It could simply be that I never had data that stresses the GC/threading/debugger so heavily.
In my understanding, the logic is that the debugger has priority here. We revert the suspension, since we can't walk stacks anyways, and let the debugger do something. The debugger either proceeds with a full debugger suspension or releases the threads. Only in rare cases does the debugger not make progress and everything else ends up waiting for it. Maybe something changed in that logic.
When we say "freeze" here - does the app lock up completely or are there pauses?
@VSadov Complete lock-up. We are actually not letting the debugger do anything, as the thread that wants to send the breakpoint event to the debugger is waiting on the suspension event and it has incremented the count of threads at unsafe places. So the loop in the
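A bounded simulation of the cycle described above (purely illustrative, not the runtime's code): the suspending thread keeps retrying while some thread is at an unsafe place, but that thread is itself blocked waiting on the suspension event, so the condition never clears and the loop spins forever.

```cpp
#include <cassert>

// Retry loop modelled after the description above. The predicate stands
// in for g_pDebugInterface->ThreadsAtUnsafePlaces(); maxRetries bounds
// the simulation so the livelock is observable instead of hanging.
bool TrySuspendRuntime(int maxRetries, bool (*threadsAtUnsafePlaces)()) {
    for (int i = 0; i < maxRetries; ++i) {
        if (!threadsAtUnsafePlaces())
            return true;   // all threads synced; suspension completes
        // Otherwise revert the partial suspension and let the debugger
        // run - but the blocked thread never leaves its unsafe place.
    }
    return false;          // livelock: gave up after maxRetries attempts
}
```

With a predicate that never clears, the loop exhausts its retries; with one that clears, it completes immediately.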
But looking at the |
See the clrthreads output: the lowest nibble of the state is 8, which corresponds to TS_DebugSuspendPending:
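Decoding that state value is a simple bit test; per the comment above, TS_DebugSuspendPending occupies the 0x8 bit of the thread-state flags, which matches the "8" in the lowest nibble (a sketch, assuming only that flag value):

```cpp
#include <cassert>

// Flag value inferred from the comment above: the lowest nibble being 8
// corresponds to a pending debugger suspension.
const unsigned TS_DebugSuspendPending = 0x00000008;

bool HasDebugSuspendPending(unsigned state) {
    return (state & TS_DebugSuspendPending) != 0;
}
```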
I really appreciate all the effort and prompt responses. Thanks!
Closing since @kouvel's PR has merged.
Do we want to backport to .NET 6 (where I originally hit the issue and keep hitting it)?
@kouvel's fix was quite large and we felt it was too risky to apply it as a servicing patch for 6.0. Instead we made a much smaller change in 6.0 that we hope will avoid the majority of the problematic cases, but it isn't a total fix. The original issue that discussed the debugger deadlock is here if it helps fill out the story. @davmason - do you know which servicing release your mitigation fix went out in (or will go out in)?
Thanks, should be 6.0.6. I spent most of the time with 6.0.6/6.0.7 on Windows, so I didn't get to try the scenario again and I missed the targeted fix.
Description
Our process routinely seems to freeze. There are some dispatcher threads which may be interfering with the thread suspension logic, but in this particular case the only one seems to be running CoreCLR code.
Reproduction Steps
No idea so far; it happens randomly but consistently in the first few minutes of the process run.
Expected behavior
No process freeze.
Actual behavior
Process becomes locked and non-responsive.
Regression?
No response
Known Workarounds
No response
Configuration
.NET 6.0.3, Xamarin.Mac Preview 14 (happened on Preview 13 too)
Other information
Thread sample: https://gist.github.com/filipnavara/afdd426069dfa00f18efa5b8508dd34c