AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 #64980

adamsitnik · 2022-02-08T12:46:27Z

The AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 (.NET 7 Preview 1).

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12  --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*

I was not able to attach a VS debugger to the live process as VS seems to not support it on ARM64 yet.

I've captured a dump and uploaded it here.

PerfView shows me that the process is stuck in the following loop:

https://github.com/dotnet/performance/blob/3edf4c3149f5c903777f13af9b08b562b1672302/src/benchmarks/micro/libraries/System.Collections/Concurrent/AddRemoveFromDifferentThreads.cs#L94-L100

Since AddRemoveFromDifferentThreads<int>.ConcurrentStack and AddRemoveFromDifferentThreads<string>.ConcurrentBag work fine I guess it's a codegen issue.

Hardware: Surface Pro X, Win 10, arm64

Please let me know if there is any way I could help.

cc @kunalspathak @AndyAyersMS @kouvel

The text was updated successfully, but these errors were encountered:

ghost · 2022-02-08T12:46:31Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

The AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64 (.NET 7 Preview 1).

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12  --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*

I was not able to attach a VS debugger to the live process as VS seems to not support it on ARM64 yet.

I've captured a dump and uploaded it here.

PerfView shows me that the process is stuck in the following loop:

https://github.com/dotnet/performance/blob/3edf4c3149f5c903777f13af9b08b562b1672302/src/benchmarks/micro/libraries/System.Collections/Concurrent/AddRemoveFromDifferentThreads.cs#L94-L100

Since AddRemoveFromDifferentThreads<int>.ConcurrentStack and AddRemoveFromDifferentThreads<string>.ConcurrentBag work fine I guess it's a codegen issue.

Hardware: Surface Pro X, Win 10, arm64

Please let me know if there is any way I could help.

cc @kunalspathak @AndyAyersMS @kouvel

Author:	adamsitnik
Assignees:	-
Labels:	`arch-arm64`, `os-windows`, `area-CodeGen-coreclr`
Milestone:	-

kunalspathak · 2022-02-08T15:22:20Z

I will take a look at the code gen.

adamsitnik · 2022-02-09T14:51:22Z

@janvorli has hit this issue on Linux, I am going to rename the issue and change the labels

janvorli · 2022-02-09T15:37:03Z

What I can see on my Arm64 linux is that at the time of the hang, the stack is empty (the _head member of the stack is null), but we have read only 73033 elements (based on the count) while the Size is 200000.
It sounds like a concurrency issue between the ConcurrentStack.Push and ConcurrentStack.TryPop, might be due to some arm64 specific JIT optimization or something. Lock free code is always prone to various subtle details.

EgorBo · 2022-02-09T15:40:41Z

Does it reproduce every run? We can try to revert all the recent VOLATILE-related changes which I believe were correct but slightly changed things.

janvorli · 2022-02-09T15:49:42Z

I don't know, I just hit that while running the whole benchmark suite and I have attached lldb to the process to see what's wrong.
@adamsitnik might know details on the reproducibility.

janvorli · 2022-02-09T16:15:11Z

I've attempted to run the test (with the filter set as @adamsitnik described in the repro) five times and it hung in all of the cases. So it sounds like it reproduces with at least high probability.

EgorBo · 2022-02-09T16:29:48Z

Thanks! I'll check a couple of Memory-model related commits

EgorBo · 2022-02-09T16:30:13Z

Ah, I didn't notice @kunalspathak already self-assigned it so leaving it for him. I'd check these:
#64354
#62895
#62224
#62043
#60219
#63720 (JIC)

kunalspathak · 2022-02-09T18:39:36Z

Yes, I was looking into this yesterday. Unfortunately, the windbg on the machine is not working as expected and dotnet/diagnostics#2858 is in progress to fix it. Here are the threads that are hanging. I will debug it little more.

kunalspathak · 2022-02-09T22:59:25Z

I have verified that this issue doesn't reproduce when I set COmPlus_ReadyToRun=0 and most likely seems to be related to codegen. However, I see that this benchmark runs without problem in our perf lab and most likely because we don't use R2R images for those runs? Is that correct assumption @adamsitnik or @DrewScoggins ? Should we use R2R images because at the end of release, when we do the comparison of .NET X vs. .NET Y, we might see surprises that were not caught throught out the release.

adamsitnik · 2022-02-10T07:11:45Z

@kunalspathak afaik the perf lab is currently using .NET SDK to build the micro benchmarks project, but it uses CoreRun from local build of dotnet/runtime to execute it. So if R2R is not enabled by default in dotnet/runtime build output then it's not used.

IIRC In the past we had pure SDK runs from dotnet/performance but SDK wasn't updated as regularly as we wanted (they were often few days long delays because the process runtime->SDK->installer changes propagation wasn't very smooth). For a while we did both, but due to machine availability it was decided to run the benchmarks only for dotnet/runtime builds. @DrewScoggins please correct me if I am wrong.

The dotnet/performance CI validation uses SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add ARM64 leg?

janvorli · 2022-02-10T09:49:00Z

I am hitting the same hang with ready to run disabled as well.

adamsitnik · 2022-02-10T13:11:08Z

@janvorli how have you configured the env var?

The python script from the perf repo has one "hidden feature" : it by default dumps all the DOTNET* and ComPlus* variables:

[2022/02/04 17:19:03][INFO] ------------------------------------------
[2022/02/04 17:19:03][INFO] Running .NET micro benchmarks for 'net7.0'
[2022/02/04 17:19:03][INFO] ------------------------------------------
[2022/02/04 17:19:03][INFO] --------------------------------------------------
[2022/02/04 17:19:03][INFO] Dumping COMPlus/DOTNET environment:
[2022/02/04 17:19:03][INFO]   "DOTNET_CLI_TELEMETRY_OPTOUT=1"
[2022/02/04 17:19:03][INFO]   "DOTNET_MULTILEVEL_LOOKUP=0"
[2022/02/04 17:19:03][INFO]   "DOTNET_ROOT=D:\projects\performance\tools\dotnet\x86"
[2022/02/04 17:19:03][INFO] --------------------------------------------------

So if you want the script to actually use given env var, you need to do it in an explicit way:

--bdn-arguments "--envVars key:value"

janvorli · 2022-02-10T13:23:17Z

Ah, I've thought it would just use the ambient environment. Let me try again.

janvorli · 2022-02-10T14:13:20Z

I can confirm that the issue doesn't occur with R2R disabled.

kunalspathak · 2022-02-10T15:59:00Z

The dotnet/performance CI validation uses SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add ARM64 leg?

I am more inclined to just do the daily runs based on SDK because we ship R2R code and as you see, things change between R2R and JIT. Running with CoreRun, we are not measuring the performance of bits that we are shipping. @danmoseley - any thoughts?

AndyAyersMS · 2022-02-10T16:26:51Z

We generally don't expect BDN to be measuring R2R code unless we've somehow failed to tier up appropriately. So to first order it should not matter a whole lot whether the assemblies have R2R or not.

However Tier1 codegen could shift a bit because with R2R disabled I think we also lose any embedded PGO data.

kunalspathak · 2022-02-10T22:25:48Z

I confirm that the issue doesn't reproduce if I revert the changes in #62895. For that, I need the clrjit.dll as well as crossgened binary System.Collections.Concurrent.dll produced from the reverted clrjit changes in order for the hang to disappear. I will keep investigating.

kunalspathak · 2022-02-11T04:36:20Z

I confirm that the issue doesn't reproduce if I revert the changes in #62895.

Looks like I was wrong. The benchmark is very sensitive and slightest change affects the behavior. Basically, if the methods in System.Collections.Concurrent.dll gets JITted in-time, we don't see the hang. That makes it harder to pin down to the offending commit. Moreover, to understand which commit might have affected it, I tried to replace coreclr.dll, clrjit.dll, SPC.dll and SCC.dll but sometimes the methods are JITted and other times not.

The observation of @janvorli is correct that Pop thread (Consumer thread) waits in this while loop. I didn't notice until sometime back that the Push thread (Producer thread) is stuck while trying to allocate an object. Below is the call stack of the Producer thread where we can see that when it tries to allocate an object from inside Push, the GC kicks off and the thread is put for suspension, however that call never returns. The producer thread seems to wait for the suspension to complete forever and thus cannot add anything to the stack.

I could verify the same observation in the dump that @adamsitnik has provided.

00 ntdll!ZwWaitForSingleObject
01 KERNELBASE!WaitForSingleObjectEx
02 coreclr!CLREventWaitHelper2
03 coreclr!CLREventWaitHelper
04 coreclr!CLREventBase::WaitEx
05 coreclr!CLREventBase::Wait
06 coreclr!ThreadSuspend::SuspendRuntime
07 coreclr!ThreadSuspend::SuspendEE
08 coreclr!GCToEEInterface::SuspendEE
09 coreclr!WKS::GCHeap::GarbageCollectGeneration
0a coreclr!WKS::gc_heap::trigger_gc_for_alloc
0b coreclr!WKS::gc_heap::try_allocate_more_space
0c coreclr!WKS::gc_heap::allocate_more_space
0d coreclr!WKS::gc_heap::allocate
0e coreclr!WKS::GCHeap::Alloc
0f coreclr!Alloc
10 coreclr!AllocateObject
11 coreclr!JIT_New
12 System_Collections_Concurrent
13 0x0
14 0x0
15 System_Private_CoreLib
16 System_Private_CoreLib
17 coreclr!CallDescrWorkerInternal
18 coreclr!CallDescrWorkerWithHandler
19 coreclr!DispatchCallSimple
1a coreclr!ThreadNative::KickOffThread_Worker
1b coreclr!ManagedThreadBase_DispatchInner
1c coreclr!ManagedThreadBase_DispatchMiddle
1d coreclr!ManagedThreadBase_DispatchOuter
1e coreclr!ManagedThreadBase_FullTransition
1f coreclr!ManagedThreadBase::KickOff
20 coreclr!ThreadNative::KickOffThread
21 KERNEL32!BaseThreadInitThunk
22 ntdll!RtlUserThreadStart

When the Push method is waiting for the allocation to complete, it is still R2R code and isn't JITTed.

I tried looking for !locks in the dump but nothing stands out. At the same time, it is not straightforward to reason about why this repros only on R2R images and not JIT code. It may be something with memory barrier changes, but I am unsure about it until we understand for what lock is GC waiting.

@mangod9 or @Maoni0 - any pointers on how to move forward?

Maoni0 · 2022-02-11T04:55:20Z

GC is simply waiting for suspension to finish and it's not finishing. this means some thread is not getting suspended. what's the output for !threads? it should tell you which thread(s) are still in coop mode.

EgorBo · 2022-02-11T17:31:07Z

Here is the difference in the disasm. Left is with tail call (where issue repros) and right is no tail call.

Is there a comparison with what JIT produces here?

TryPop tier1 jit codegen

; Assembly listing for method System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPop(byref):bool:this
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; Tier-1 compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  5,  4   )     ref  ->  x19         this class-hnd single-def
;  V01 arg1         [V01,T01] (  5,  3.50)   byref  ->  x20         single-def
;  V02 loc0         [V02,T02] (  6,  4   )     ref  ->  x21         class-hnd single-def
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;* V04 tmp1         [V04,T06] (  0,  0   )    long  ->  zero-ref    "impRuntimeLookup slot"
;  V05 tmp2         [V05,T04] (  2,  2   )   byref  ->   x0         single-def "impAppendStmt"
;  V06 tmp3         [V06,T05] (  2,  2   )     ref  ->   x1         class-hnd single-def "impAppendStmt"
;* V07 tmp4         [V07    ] (  0,  0   )    long  ->  zero-ref    "spilling Runtime Lookup tree"
;  V08 cse0         [V08,T03] (  3,  2.50)   byref  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 8

G_M57257_IG01:              ;; offset=0000H
        00000000          stp     fp, lr, [sp,#-48]!
        00000000          stp     x19, x20, [sp,#24]
        00000000          str     x21, [sp,#40]
        00000000          mov     fp, sp
        00000000          mov     x19, x0
        00000000          mov     x20, x1
      ;; bbWeight=1    PerfScore 4.50
G_M57257_IG02:              ;; offset=0018H
        00000000          add     x0, x19, #8
        00000000          ldar    x21, [x0]
        00000000          cbnz    x21, G_M57257_IG05
      ;; bbWeight=1    PerfScore 4.50
G_M57257_IG03:              ;; offset=0024H
        00000000          str     xzr, [x20]
        00000000          mov     w0, wzr
      ;; bbWeight=0.50 PerfScore 0.75
G_M57257_IG04:              ;; offset=002CH
        00000000          ldr     x21, [sp,#40]
        00000000          ldp     x19, x20, [sp,#24]
        00000000          ldp     fp, lr, [sp],#48
        00000000          ret     lr
      ;; bbWeight=0.50 PerfScore 2.50
G_M57257_IG05:              ;; offset=003CH
        00000000          ldrsb   wzr, [x19]
        00000000          ldr     x1, [x21,#16]
        00000000          mov     x2, x21
        00000000          bl      System.Threading.Interlocked:CompareExchange(byref,System.Object,System.Object):System.Object
        00000000          cmp     x0, x21
        00000000          bne     G_M57257_IG07
        00000000          ldr     x15, [x21,#8]
        00000000          mov     x14, x20
        00000000          bl      CORINFO_HELP_CHECKED_ASSIGN_REF
        00000000          mov     w0, #1
      ;; bbWeight=0.50 PerfScore 7.00
G_M57257_IG06:              ;; offset=0064H
        00000000          ldr     x21, [sp,#40]
        00000000          ldp     x19, x20, [sp,#24]
        00000000          ldp     fp, lr, [sp],#48
        00000000          ret     lr
      ;; bbWeight=0.50 PerfScore 2.50
G_M57257_IG07:              ;; offset=0074H
        00000000          mov     x0, x19
        00000000          mov     x1, x20
      ;; bbWeight=0.50 PerfScore 0.50
G_M57257_IG08:              ;; offset=007CH
        00000000          ldr     x21, [sp,#40]
        00000000          ldp     x19, x20, [sp,#24]
        00000000          ldp     fp, lr, [sp],#48
        00000000          b       System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPopCore(byref):bool:this
      ;; bbWeight=0.50 PerfScore 2.50

tail calls TryPopCore

janvorli · 2022-02-11T17:32:43Z

I think that rather than the TryPop code, the caller's code is more relevant to see why it repros with R2R enabled and doesn't with R2R disabled.

jakobbotsch · 2022-02-11T17:33:20Z

We do try to ensure we do not form uninterruptible loops with tailcalls:

runtime/src/coreclr/jit/lower.cpp

Lines 1951 to 1966 in e04d750

    
           // Gc-interruptability in the following case: 
        
           //     foo(a, b, c, d, e) { bar(a, b, c, d, e); } 
        
           //     bar(a, b, c, d, e) { foo(a, b, d, d, e); } 
        
           // 
        
           // Since the instruction group starting from the instruction that sets up first 
        
           // stack arg to the end of the tail call is marked as non-gc interruptible, 
        
           // this will form a non-interruptible tight loop causing gc-starvation. To fix 
        
           // this we insert GT_NO_OP as embedded stmt before GT_START_NONGC, if the method 
        
           // has a single basic block and is not a GC-safe point.  The presence of a single 
        
           // nop outside non-gc interruptible region will prevent gc starvation. 
        
           if ((comp->fgBBcount == 1) && !(comp->compCurBB->bbFlags & BBF_GC_SAFE_POINT)) 
        
           { 
        
               assert(comp->fgFirstBB == comp->compCurBB); 
        
               GenTree* noOp = new (comp, GT_NO_OP) GenTree(GT_NO_OP, TYP_VOID); 
        
               BlockRange().InsertBefore(startNonGCNode, noOp); 
        
           }

This only runs in some cases so perhaps this needs to be expanded. I am not at my PC but I will take a closer look when I am.

kunalspathak · 2022-02-11T17:42:40Z

he caller's code is more relevant to see why it repros with R2R enabled and doesn't with R2R disabled.

Its same with or without R2R.

EgorBo · 2022-02-11T17:44:14Z

I think that rather than the TryPop code, the caller's code is more relevant to see why it repros with R2R enabled and doesn't with R2R disabled.

G_M42430_IG01:              ;; offset=0000H
        00000000          stp     fp, lr, [sp,#-48]!
        00000000          stp     x19, x20, [sp,#32]
        00000000          mov     fp, sp
        00000000          str     xzr, [fp,#24] // [V02 loc1]
        00000000          mov     x19, x0
      ;; bbWeight=1    PerfScore 4.00
G_M42430_IG02:              ;; offset=0014H
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     x0, [x0,#8]
        00000000          ldrsb   wzr, [x0]
        00000000          mov     x2, xzr
        00000000          movn    w1, #0
        00000000          bl      System.Threading.Barrier:SignalAndWait(int,System.Threading.CancellationToken):bool:this
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     x0, [x0,#8]
        00000000          ldrsb   wzr, [x0]
        00000000          mov     x2, xzr
        00000000          movn    w1, #0
        00000000          bl      System.Threading.Barrier:SignalAndWait(int,System.Threading.CancellationToken):bool:this
        00000000          mov     w20, wzr
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     w0, [x0,#32]
        00000000          cmp     w0, #0
        00000000          ble     G_M42430_IG06
      ;; bbWeight=1    PerfScore 30.00
G_M42430_IG03:              ;; offset=0058H
        00000000          ldr     x0, [x19,#16]
        00000000          add     x1, fp, #24 // [V02 loc1]
        00000000          ldr     wzr, [x0]
        00000000          bl      System.Collections.Concurrent.ConcurrentStack`1[__Canon][System.__Canon]:TryPop(byref):bool:this
        00000000          cbz     w0, G_M42430_IG05
      ;; bbWeight=4    PerfScore 34.00
G_M42430_IG04:              ;; offset=006CH
        00000000          add     w20, w20, #1
      ;; bbWeight=2    PerfScore 1.00
G_M42430_IG05:              ;; offset=0070H
        00000000          ldr     x0, [x19,#8]
        00000000          ldr     w0, [x0,#32]
        00000000          cmp     w20, w0
        00000000          blt     G_M42430_IG03
      ;; bbWeight=4    PerfScore 30.00
G_M42430_IG06:              ;; offset=0080H
        00000000          ldp     x19, x20, [sp,#32]
        00000000          ldp     fp, lr, [sp],#48
        00000000          ret     lr
      ;; bbWeight=1    PerfScore 3.00

; Total bytes of code 140, prolog size 16, PerfScore 116.00, instruction count 35, allocated bytes for code 140 (MethodHash=b47d5a41) for method <>c__DisplayClass6_0[__Canon][System.__Canon]:<SetupConcurrentStackIteration>b__1():this
; ============================================================

Codegen for that lambda

EgorBo · 2022-02-11T17:45:45Z

Its same with or without R2R.

Yeah I guess the only difference is the PRECODE for TryPop

janvorli · 2022-02-11T17:51:54Z

Ok, that makes sense

danmoseley · 2022-02-11T18:59:52Z

I am more inclined to just do the daily runs based on SDK because we ship R2R code and as you see, things change between R2R and JIT. Running with CoreRun, we are not measuring the performance of bits that we are shipping. @danmoseley - any thoughts?

@kunalspathak this has been discussed before, unless perhaps I'm thinking of the discussion about running unit tests on R2R bits. I suggest perhaps opening an issue to discuss in the perf repo.

DrewScoggins · 2022-02-11T19:08:38Z

@kunalspathak afaik the perf lab is currently using .NET SDK to build the micro benchmarks project, but it uses CoreRun from local build of dotnet/runtime to execute it. So if R2R is not enabled by default in dotnet/runtime build output then it's not used.

This is correct as far as I know.

IIRC In the past we had pure SDK runs from dotnet/performance but SDK wasn't updated as regularly as we wanted (they were often few days long delays because the process runtime->SDK->installer changes propagation wasn't very smooth). For a while we did both, but due to machine availability it was decided to run the benchmarks only for dotnet/runtime builds. @DrewScoggins please correct me if I am wrong.

No, this is the exact reason that we decided against running the microbenchmarks on the published SDK. We would find regressions and then have to sort through hundreds of possible commits between the two builds from the runtime repo to try and find the source of the regression. It was also extremely tedious to try and trace back to the correct hashes from the runtime repo.

The dotnet/performance CI validation uses SDK to run all the benchmarks just once, but it currently has only x86 and x64 CI legs. Perhaps we should re-add ARM64 leg?

Do we have some kind of Arm64 VMs that we could do this validation on, or would we need to use our current Arm64 performance hardware? We have very limited Arm64 hardware, and moving to doing this level of validation there would tax that limited infrastructure too greatly.

kunalspathak · 2022-02-12T00:20:10Z

This only runs in some cases so perhaps this needs to be expanded. I am not at my PC but I will take a closer look when I am.

Assigning to you @jakobbotsch

AndyAyersMS · 2022-02-12T01:49:30Z

My understanding is that the when the caller sees a definite (every iteration) call to a managed method in a loop, it's free to assume that the loop has a gc safepoint and so no polling is needed in the loop. There's no safe way for the caller to know that the callee can't actually be hijacked.

So seems like any fix here has to happen in the callee. In the somewhat similar #57219 there was just a tiny window where the callee could be suspended. Perhaps e we need to look at forcing small methods (or perhaps methods with very short paths from entry to exit) to be fully interruptible so we don't need to hijack them?

jakobbotsch · 2022-02-12T10:07:56Z

I remembered it right: dotnet/coreclr#16039

@janvorli If return address hijacking is completely disabled for any function containing tailcalls then surely all such methods have to be forced to be fully interruptible by the JIT, or it is very easy to create mutually tailcalling methods that can never be suspended. We can also have "irrelevant" tailcalls in these functions that are never executed, but then have the side effect of disabling return address hijacking globally for the method.

Is there no alternative to completely disabling return address hijacking in this case? For example, implementing support for hijacking by modifying lr, which would also help the JIT to optimize leaf methods?

EDIT: Hmm, we should already need to mark such functions fully interruptible anyway, so not sure if this can actually get us into trouble.

jakobbotsch · 2022-02-12T10:29:50Z

TryPop is only marked partially interruptible when crossgenned while it is marked fully interruptible in JIT, so that explains the difference between those. However, TryPop should have been marked fully interruptible by the following code, so there must be a JIT bug here.

runtime/src/coreclr/jit/flowgraph.cpp

Lines 4225 to 4232 in 55884ce

    
           if (block->endsWithTailCallOrJmp(this, true) && optReachWithoutCall(fgFirstBB, block)) 
        
           { 
        
               // This tail call might combine with other tail calls to form a 
        
               // loop.  Thus we need to either add a poll, or make the method 
        
               // fully interruptible.  I chose the later because that's what 
        
               // JIT64 does. 
        
               SetInterruptible(true); 
        
           }

jakobbotsch · 2022-02-12T10:49:48Z

The JIT believes that System.Interlocked.CompareExchange results in a GC safe point when being crossgenned while it does not believe so when jitting. The reason is that crossgen2 does not return this flag to the JIT:

runtime/src/coreclr/jit/importer.cpp

Lines 9047 to 9050 in 55884ce

    
           if (mflags & CORINFO_FLG_NOGCCHECK) 
        
           { 
        
               call->AsCall()->gtCallMoreFlags |= GTF_CALL_M_NOGCCHECK; 
        
           }

So looks like crossgen2 needs to report this flag back.

jkotas · 2022-02-12T17:06:28Z

implementing support for hijacking by modifying lr, which would also help the JIT to optimize leaf methods?

It would be useful to implement hijacking for lr. It is non-trivial. The problem is actually with unhijacking. The current implementation assumes that it is possible to unhijack the thread by just writing into memory without having the context that is not possible when the return address is in register.

So looks like crossgen2 needs to report this flag back.

It is easy to fix crossgen2 to report this flag back for FCalls. It violates the R2R versioning principles a bit. We do not consider changing method implementation from FCall to managed and vice versa to be a breaking change. Once crossgen2 starts returning this flag back for FCalls, changing an implementation from managed to FCall will become potential R2R breaking change.

jakobbotsch · 2022-02-12T17:49:26Z

It is easy to fix crossgen2 to report this flag back for FCalls. It violates the R2R versioning principles a bit. We do not consider changing method implementation from FCall to managed and vice versa to be a breaking change. Once crossgen2 starts returning this flag back for FCalls, changing an implementation from managed to FCall will become potential R2R breaking change.

Is there any alternative, other than marking practically all methods fully interruptible in R2R codegen?

From what I understand, even today changing a managed function to a fcall will cause GC starvation if that fcall is not hijackable. Changing crossgen to check for the attribute would only make us better off in this regard.

The current case is with a tailcall, but the same problem exists for any function that has a loop with a call in it that may change to a fcall then.

jkotas · 2022-02-12T19:33:45Z

I think it is ok to change crossgen to check for this flag as an imperfect workaround. As you have pointed out, the story for FCalls and GC suspension is far from perfect, even for the JIT case.

When there are loops or tailcalls in a function the JIT will check for a dominating call that is a GC safepoint to figure out if the function needs to be marked fully interruptible or not. Methods with the InternalCall flag may turn into fcalls that are not hijackable and thus not GC safepoints, so in this case JIT should still mark the function fully interruptible, but was not doing so because crossgen2 was not reporting this flag back. Fix dotnet#64980

When there are loops or tailcalls in a function the JIT will check for a dominating call that is a GC safepoint to figure out if the function needs to be marked fully interruptible or not. Methods with the InternalCall flag may turn into fcalls that are not hijackable and thus not GC safepoints, so in this case JIT should still mark the function fully interruptible, but was not doing so because crossgen2 was not reporting this flag back. Fix #64980

…ark on ARM64 as it hangs details: dotnet/runtime#64980

* define NET7_0_PREVIEW2_OR_GREATER const for .NET 7 Preview2+ builds * don't use Regex.Count API for older versions of .NET 7 (used as baseline for monthly perf runs) * don't try to run AddRemoveFromDifferentThreads.ConcurrentStack benchmark on ARM64 as it hangs details: dotnet/runtime#64980 * don't try to run Json_FromString.Jil_ benchmark on ARM64 as it throws AccessViolationException details: dotnet/runtime#64657

adamsitnik added arch-arm64 os-windows area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Feb 8, 2022

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 8, 2022

kunalspathak self-assigned this Feb 8, 2022

adamsitnik removed the os-windows label Feb 9, 2022

adamsitnik changed the title ~~AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on Windows ARM64~~ AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 Feb 9, 2022

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Feb 9, 2022

JulieLeeMSFT added this to the 7.0.0 milestone Feb 9, 2022

adamsitnik mentioned this issue Feb 11, 2022

.NET 7.0 Preview 1 Microbenchmarks Performance Study Report #65200

Closed

13 tasks

kunalspathak assigned jakobbotsch and unassigned kunalspathak Feb 12, 2022

jakobbotsch mentioned this issue Feb 14, 2022

crossgen2: Report CORINFO_FLG_NOGCCHECK back for internal calls #65300

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 14, 2022

jkotas closed this as completed in #65300 Feb 14, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 14, 2022

adamsitnik added a commit to adamsitnik/performance that referenced this issue Mar 1, 2022

don't try to run AddRemoveFromDifferentThreads.ConcurrentStack benchm…

15f5498

…ark on ARM64 as it hangs details: dotnet/runtime#64980

adamsitnik mentioned this issue Mar 1, 2022

Few workarounds for the next manual perf run dotnet/performance#2282

Merged

ghost locked as resolved and limited conversation to collaborators Mar 17, 2022

AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 #64980

AddRemoveFromDifferentThreads<string>.ConcurrentStack benchmark hangs on ARM64 #64980

Comments

adamsitnik commented Feb 8, 2022

ghost commented Feb 8, 2022

kunalspathak commented Feb 8, 2022

adamsitnik commented Feb 9, 2022

janvorli commented Feb 9, 2022

EgorBo commented Feb 9, 2022

janvorli commented Feb 9, 2022

janvorli commented Feb 9, 2022

EgorBo commented Feb 9, 2022

EgorBo commented Feb 9, 2022 • edited Loading

kunalspathak commented Feb 9, 2022

kunalspathak commented Feb 9, 2022

adamsitnik commented Feb 10, 2022

janvorli commented Feb 10, 2022

adamsitnik commented Feb 10, 2022

janvorli commented Feb 10, 2022

janvorli commented Feb 10, 2022

kunalspathak commented Feb 10, 2022

AndyAyersMS commented Feb 10, 2022

kunalspathak commented Feb 10, 2022

kunalspathak commented Feb 11, 2022 • edited Loading

Maoni0 commented Feb 11, 2022

EgorBo commented Feb 11, 2022 • edited Loading

janvorli commented Feb 11, 2022

jakobbotsch commented Feb 11, 2022 • edited Loading

kunalspathak commented Feb 11, 2022

EgorBo commented Feb 11, 2022

EgorBo commented Feb 11, 2022

janvorli commented Feb 11, 2022

danmoseley commented Feb 11, 2022

DrewScoggins commented Feb 11, 2022

kunalspathak commented Feb 12, 2022

AndyAyersMS commented Feb 12, 2022

jakobbotsch commented Feb 12, 2022 • edited Loading

jakobbotsch commented Feb 12, 2022

jakobbotsch commented Feb 12, 2022

jkotas commented Feb 12, 2022 • edited Loading

jakobbotsch commented Feb 12, 2022 • edited Loading

jkotas commented Feb 12, 2022 • edited Loading

EgorBo commented Feb 9, 2022 •

edited

Loading

kunalspathak commented Feb 11, 2022 •

edited

Loading

EgorBo commented Feb 11, 2022 •

edited

Loading

jakobbotsch commented Feb 11, 2022 •

edited

Loading

jakobbotsch commented Feb 12, 2022 •

edited

Loading

jkotas commented Feb 12, 2022 •

edited

Loading

jakobbotsch commented Feb 12, 2022 •

edited

Loading

jkotas commented Feb 12, 2022 •

edited

Loading