Thread.InitializeCurrentThread should be marked as cold code #49520

benaadams · 2021-03-12T01:49:19Z

Thread.InitializeCurrentThread is only called once for the Thread lifetime, however the method it is used in CurrentThread is called lots and every async method has its own call to it (to get ExecutionContext etc)

runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Thread.cs

Lines 328 to 335 in bdd6d35

 public static Thread CurrentThread 

 { 

 [Intrinsic] 

 get 

 { 

 return t_currentThread ?? InitializeCurrentThread(); 

 } 

 }

Currently the asm generated is a conditional jmp forward for it already being set e.g. from AsyncMethodBuilderCore:Start(byref) which is bad for static branch prediction.

G_M49992_IG03:
       call     [CORINFO_HELP_READYTORUN_STATIC_BASE]
       mov      rax, gword ptr [rax+24]
       test     rax, rax
       jne      SHORT G_M49992_IG05
						;; bbWeight=1    PerfScore 6.25
G_M49992_IG04:
       call     [Thread:InitializeCurrentThread():Thread]
						;; bbWeight=0.25 PerfScore 0.75
G_M49992_IG05:
       mov      ...

Reversing the condition in CurrentThread

Thread? currentThread = t_currentThread;
if (currentThread is not null)
{
    return currentThread;
}

return InitializeCurrentThread();

While it generates a conditional jmp forward for InitializeCurrentThread it also then adds a jmp for the regular path:

G_M49992_IG03:
       call     [CORINFO_HELP_READYTORUN_STATIC_BASE]
       mov      rax, gword ptr [rax+24]
       test     rax, rax
       je       SHORT G_M49992_IG05
						;; bbWeight=1    PerfScore 6.25
G_M49992_IG04:
       jmp      SHORT G_M49992_IG06
						;; bbWeight=0.50 PerfScore 1.00
G_M49992_IG05:
       call     [Thread:InitializeCurrentThread():Thread]
						;; bbWeight=0.50 PerfScore 1.50
G_M49992_IG06:
       mov      ...

Marking it as a cold method should resolve this by moving the call to InitializeCurrentThread to the end of the method so the regular path didn't have to jump over it?

/cc @EgorBo

category:performance
theme:block-layout

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-03-12T01:49:22Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

AndyAyersMS · 2021-03-12T02:09:54Z

My understanding is that static prediction is out of favor these days.

But reordering like you propose has other benefits:

not taken branch is generally cheaper than taken branch
hot and cold code should not intermix for locality reasons

benaadams · 2021-03-12T02:21:09Z

My understanding is that static prediction is out of favor these days

Yes as normally the branch predictor will be the dominant factor; however the peculiarity here is its included in every async method's kick off; as they all have different generic types for ref TStateMachine so it will be more common that these calls are "fresh" as far as the branch predictor history is aware as its in so many places? (The MoveNext methods less so since they go via ExecutionContext.Run which calls CurrentThread so is a common code point)

Being more generic if there are more examples could add another enum to MethodImplOptions? e.g. ColdCode

benaadams · 2021-03-12T02:25:07Z

PGO could obviously also address this; however it seems problematic to always start in the "wrong" state for all these methods?

jkotas · 2021-03-12T05:52:40Z

We should be able to start in a good state with static pre-generate PGO data.

EgorBo · 2021-03-12T07:25:13Z

Maybe it's time to introduce Likely/Unlikely APIs?

jkotas · 2021-03-12T07:46:10Z

Yes, these APIs would make these micro-optimizations to be done manually. I do not think we would welcome PRs to add likely/unlikely annotations in the core libraries. I think we would rather want to depend on the static PGO data instead since it scales a lot better and it does not require humans to guess what is more likely.

AndyAyersMS · 2021-03-13T02:25:05Z

Here you can see PGO doing this sort of thing -- I'll see if I can dig up an example with InitializeCurrentThread

;;; NO PGO

; Assembly listing for method Winsock:EnsureInitialized()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO == 1
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M27667_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
						;; bbWeight=1    PerfScore 0.00
G_M27667_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rax], 0
       jne      SHORT G_M27667_IG04
						;; bbWeight=1    PerfScore 3.25
G_M27667_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       jmp      Winsock:<EnsureInitialized>g__Initialize|55_0()
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.00
G_M27667_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ret      
						;; bbWeight=0.50 PerfScore 0.50

;;; PGO

; Assembly listing for method Winsock:EnsureInitialized()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 669
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M27667_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
						;; bbWeight=1    PerfScore 0.00
G_M27667_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rax], 0
       je       SHORT G_M27667_IG04
						;; bbWeight=1    PerfScore 3.25
G_M27667_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ret      
						;; bbWeight=1.00 PerfScore 1.00
G_M27667_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       jmp      Winsock:<EnsureInitialized>g__Initialize|55_0()
       ; gcr arg pop 0
						;; bbWeight=0.00 PerfScore 0.00

AndyAyersMS · 2021-03-13T02:36:37Z

Here it is -- Note SPMI hides the callee name.

This is with dynamic PGO, but static PGO will do similar things.

;;; NO PGO

; Assembly listing for method System.Threading.Thread:get_CurrentThread():System.Threading.Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO == 1
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  2,  4   )     ref  ->  rax         class-hnd "dup spill"
;  V02 tmp2         [V02,T01] (  3,  2.50)     ref  ->  rax        
;
; Lcl frame size = 40

G_M9749_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M9749_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rcx, 0xD1FFAB1E
       mov      edx, 639
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
       ; byrRegs +[rax]
       ; gcr arg pop 0
       mov      rax, gword ptr [rax+24]
       ; gcrRegs +[rax]
       ; byrRegs -[rax]
       test     rax, rax
       jne      SHORT G_M9749_IG04
						;; bbWeight=1    PerfScore 4.75
G_M9749_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ; gcrRegs -[rax]
       add      rsp, 40
       jmp      hackishModuleName:hackishMethodName()
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.12
G_M9749_IG04:        ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref, epilog, nogc
       ; gcrRegs +[rax]
       add      rsp, 40
       ret      
						;; bbWeight=0.50 PerfScore 0.62

;;; PGO

; Assembly listing for method System.Threading.Thread:get_CurrentThread():System.Threading.Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 240550
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  2,  4   )     ref  ->  rax         class-hnd "dup spill"
;  V02 tmp2         [V02,T01] (  3,  3.00)     ref  ->  rax        
;
; Lcl frame size = 40

G_M9749_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M9749_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rcx, 0xD1FFAB1E
       mov      edx, 639
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
       ; byrRegs +[rax]
       ; gcr arg pop 0
       mov      rax, gword ptr [rax+24]
       ; gcrRegs +[rax]
       ; byrRegs -[rax]
       test     rax, rax
       je       SHORT G_M9749_IG04
						;; bbWeight=1    PerfScore 4.75
G_M9749_IG03:        ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref, epilog, nogc
       add      rsp, 40
       ret      
						;; bbWeight=1.00 PerfScore 1.25
G_M9749_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ; gcrRegs -[rax]
       add      rsp, 40
       jmp      hackishModuleName:hackishMethodName()
       ; gcr arg pop 0
						;; bbWeight=0.00 PerfScore 0.00

Raw profile view showing branch bias:

benaadams · 2021-03-13T05:40:54Z

That's beautiful! 🤩

Should this issue be closed?

AndyAyersMS · 2021-03-15T19:27:47Z

We should be seeing profile data flowing into both prejitting and jitting soon, so it would be good to confirm once that happens that we get the above without any explicit enabling of PGO.

So let's keep this open as a reminder.

AndyAyersMS · 2021-03-19T22:29:00Z

Here's a listing from today's build. We now have PGO data flowing into SPC and on through to jitting.

This is happening without any COMPlus settings.

; Assembly listing for method Thread:get_CurrentThread():Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 157048
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  2,  4   )     ref  ->  rax         class-hnd "dup spill"
;  V02 tmp2         [V02,T01] (  3,  3.00)     ref  ->  rax
;
; Lcl frame size = 40

G_M63029_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
                                                ;; bbWeight=1    PerfScore 0.25
G_M63029_IG02:              ;; offset=0004H
       48B92000239AFD7F0000 mov      rcx, 0x7FFD9A230020
       BA7F020000           mov      edx, 639
       E8886AAB5F           call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
       488B4018             mov      rax, gword ptr [rax+24]
       4885C0               test     rax, rax
       7405                 je       SHORT G_M63029_IG04
                                                ;; bbWeight=1    PerfScore 4.75
G_M63029_IG03:              ;; offset=0021H
       4883C428             add      rsp, 40
       C3                   ret
                                                ;; bbWeight=1.00 PerfScore 1.25
G_M63029_IG04:              ;; offset=0026H
       4883C428             add      rsp, 40
       E9C955FFFF           jmp      Thread:InitializeCurrentThread():Thread
                                                ;; bbWeight=0.00 PerfScore 0.00

benaadams · 2021-03-19T22:45:42Z

Does it pick that up for async Method kick offs (as they will be new inlines); though presumably that will kick in at Tier1 anyway and at Tier0 the ordering of that if probably isn't a significant feature

AndyAyersMS · 2021-03-25T02:05:01Z

We still have some work to do here, in cases where the caller doesn't have PGO data.

For instance

****** START compiling AsyncMethodBuilderCore:Start(byref) (MethodHash=6f463cb7)

The inlinee Thread:get_CurrentThread():Thread has good looking profile data initially:

Expanding INLINE_CANDIDATE in statement STMT00001 in BB03:
STMT00001 (IL 0x014...0x019)
               [000008] I-C-G-------              *  CALL      ref    Thread.get_CurrentThread (exactContextHnd=0x00007FFF7268B729)
...
Computing inlinee profile scale:
   ... call site not profiled

*************** Inline @[000008] Finishing PHASE Profile incorporation
Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight      IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB14 [0012]  1                             157k 157048    [000..008)-> BB16 ( cond )                     IBC 
BB15 [0013]  1                           296       296    [008..00E)                                     IBC 
BB16 [0014]  2                             157k 157048    [00E..00F)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

but the caller does not, and so we don't bother to scale the inlinee counts, and after the resultant block fusion and unity weight descaling we end up seriously confused:

----------- Statements (and blocks) added due to the inlining of call [000008] -----------

Inlinee method body:New Basic Block BB17 [0015] created.

Convert bbJumpKind of BB16 to BBJ_NONE
fgInlineAppendStatements: no gc ref inline locals.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB14 [0012]  1                             1           [014..015)-> BB16 ( cond )                     i 
BB15 [0013]  1                             2.96 296    [014..015)                                     i IBC 
BB16 [0014]  2                             1           [014..015)                                     i 
--

Looks like we should always compute a scale factor if the callee has PGO.

In general the jit is not very smart yet about mixing PGO and non-PGO.

benaadams · 2021-03-25T02:15:03Z

Looks like we should always compute a scale factor if the callee has PGO.

Would this currently be an issue as if the caller has a loop as it wouldn't do a retiering Jit?

AndyAyersMS · 2021-03-25T03:11:40Z

With the advent of #49793 many framework methods will come equipped with profile data. And with the advent of #47558 the runtime will always tell the jit to look for profile data for optimized methods.

Thus for any method that inlines Thread:get_CurrentThread() and is optimized the jit will see the profile data and be able to tell that the branch is biased. You can see that in the profile incorporation above, before the jit decides to spoil things...

benaadams added the tenet-performance Performance related issue label Mar 12, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 12, 2021

jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 12, 2021

JulieLeeMSFT added this to the Future milestone Mar 12, 2021

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 12, 2021

AndyAyersMS mentioned this issue Mar 25, 2021

Dynamic PGO #43618

Closed

54 tasks

AndyAyersMS mentioned this issue Apr 5, 2023

[API Proposal]: MethodImplOptions.Cold for marking cold methods for the JIT #84333

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thread.InitializeCurrentThread should be marked as cold code #49520

Thread.InitializeCurrentThread should be marked as cold code #49520

benaadams commented Mar 12, 2021 •

edited by BruceForstall

Loading

dotnet-issue-labeler bot commented Mar 12, 2021

AndyAyersMS commented Mar 12, 2021

benaadams commented Mar 12, 2021

benaadams commented Mar 12, 2021

jkotas commented Mar 12, 2021

EgorBo commented Mar 12, 2021

jkotas commented Mar 12, 2021

AndyAyersMS commented Mar 13, 2021

AndyAyersMS commented Mar 13, 2021

benaadams commented Mar 13, 2021

AndyAyersMS commented Mar 15, 2021

AndyAyersMS commented Mar 19, 2021 •

edited by EgorBo

Loading

benaadams commented Mar 19, 2021

AndyAyersMS commented Mar 25, 2021

benaadams commented Mar 25, 2021 •

edited

Loading

AndyAyersMS commented Mar 25, 2021

Thread.InitializeCurrentThread should be marked as cold code #49520

Thread.InitializeCurrentThread should be marked as cold code #49520

Comments

benaadams commented Mar 12, 2021 • edited by BruceForstall Loading

dotnet-issue-labeler bot commented Mar 12, 2021

AndyAyersMS commented Mar 12, 2021

benaadams commented Mar 12, 2021

benaadams commented Mar 12, 2021

jkotas commented Mar 12, 2021

EgorBo commented Mar 12, 2021

jkotas commented Mar 12, 2021

AndyAyersMS commented Mar 13, 2021

AndyAyersMS commented Mar 13, 2021

benaadams commented Mar 13, 2021

AndyAyersMS commented Mar 15, 2021

AndyAyersMS commented Mar 19, 2021 • edited by EgorBo Loading

benaadams commented Mar 19, 2021

AndyAyersMS commented Mar 25, 2021

benaadams commented Mar 25, 2021 • edited Loading

AndyAyersMS commented Mar 25, 2021

benaadams commented Mar 12, 2021 •

edited by BruceForstall

Loading

AndyAyersMS commented Mar 19, 2021 •

edited by EgorBo

Loading

benaadams commented Mar 25, 2021 •

edited

Loading