Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread.InitializeCurrentThread should be marked as cold code #49520

Open
benaadams opened this issue Mar 12, 2021 · 16 comments
Open

Thread.InitializeCurrentThread should be marked as cold code #49520

benaadams opened this issue Mar 12, 2021 · 16 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@benaadams
Copy link
Member

benaadams commented Mar 12, 2021

Thread.InitializeCurrentThread is only called once for the Thread lifetime, however the method it is used in CurrentThread is called lots and every async method has its own call to it (to get ExecutionContext etc)

public static Thread CurrentThread
{
[Intrinsic]
get
{
return t_currentThread ?? InitializeCurrentThread();
}
}

Currently the asm generated is a conditional jmp forward for it already being set e.g. from AsyncMethodBuilderCore:Start(byref) which is bad for static branch prediction.

G_M49992_IG03:
       call     [CORINFO_HELP_READYTORUN_STATIC_BASE]
       mov      rax, gword ptr [rax+24]
       test     rax, rax
       jne      SHORT G_M49992_IG05
						;; bbWeight=1    PerfScore 6.25
G_M49992_IG04:
       call     [Thread:InitializeCurrentThread():Thread]
						;; bbWeight=0.25 PerfScore 0.75
G_M49992_IG05:
       mov      ...

Reversing the condition in CurrentThread

Thread? currentThread = t_currentThread;
if (currentThread is not null)
{
    return currentThread;
}

return InitializeCurrentThread();

While it generates a conditional jmp forward for InitializeCurrentThread it also then adds a jmp for the regular path:

G_M49992_IG03:
       call     [CORINFO_HELP_READYTORUN_STATIC_BASE]
       mov      rax, gword ptr [rax+24]
       test     rax, rax
       je       SHORT G_M49992_IG05
						;; bbWeight=1    PerfScore 6.25
G_M49992_IG04:
       jmp      SHORT G_M49992_IG06
						;; bbWeight=0.50 PerfScore 1.00
G_M49992_IG05:
       call     [Thread:InitializeCurrentThread():Thread]
						;; bbWeight=0.50 PerfScore 1.50
G_M49992_IG06:
       mov      ...

Marking it as a cold method should resolve this by moving the call to InitializeCurrentThread to the end of the method so the regular path didn't have to jump over it?

/cc @EgorBo

category:performance
theme:block-layout

@benaadams benaadams added the tenet-performance Performance related issue label Mar 12, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 12, 2021
@AndyAyersMS
Copy link
Member

My understanding is that static prediction is out of favor these days.

But reordering like you propose has other benefits:

  • not taken branch is generally cheaper than taken branch
  • hot and cold code should not intermix for locality reasons

@benaadams
Copy link
Member Author

My understanding is that static prediction is out of favor these days

Yes as normally the branch predictor will be the dominant factor; however the peculiarity here is its included in every async method's kick off; as they all have different generic types for ref TStateMachine so it will be more common that these calls are "fresh" as far as the branch predictor history is aware as its in so many places? (The MoveNext methods less so since they go via ExecutionContext.Run which calls CurrentThread so is a common code point)

Being more generic if there are more examples could add another enum to MethodImplOptions? e.g. ColdCode

@benaadams
Copy link
Member Author

PGO could obviously also address this; however it seems problematic to always start in the "wrong" state for all these methods?

@jkotas
Copy link
Member

jkotas commented Mar 12, 2021

We should be able to start in a good state with static pre-generate PGO data.

@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 12, 2021
@EgorBo
Copy link
Member

EgorBo commented Mar 12, 2021

Maybe it's time to introduce Likely/Unlikely APIs?

@jkotas
Copy link
Member

jkotas commented Mar 12, 2021

Yes, these APIs would make these micro-optimizations to be done manually. I do not think we would welcome PRs to add likely/unlikely annotations in the core libraries. I think we would rather want to depend on the static PGO data instead since it scales a lot better and it does not require humans to guess what is more likely.

@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Mar 12, 2021
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 12, 2021
@AndyAyersMS
Copy link
Member

Here you can see PGO doing this sort of thing -- I'll see if I can dig up an example with InitializeCurrentThread

;;; NO PGO

; Assembly listing for method Winsock:EnsureInitialized()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO == 1
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M27667_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
						;; bbWeight=1    PerfScore 0.00
G_M27667_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rax], 0
       jne      SHORT G_M27667_IG04
						;; bbWeight=1    PerfScore 3.25
G_M27667_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       jmp      Winsock:<EnsureInitialized>g__Initialize|55_0()
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.00
G_M27667_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ret      
						;; bbWeight=0.50 PerfScore 0.50

;;; PGO

; Assembly listing for method Winsock:EnsureInitialized()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 669
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M27667_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
						;; bbWeight=1    PerfScore 0.00
G_M27667_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rax], 0
       je       SHORT G_M27667_IG04
						;; bbWeight=1    PerfScore 3.25
G_M27667_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ret      
						;; bbWeight=1.00 PerfScore 1.00
G_M27667_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       jmp      Winsock:<EnsureInitialized>g__Initialize|55_0()
       ; gcr arg pop 0
						;; bbWeight=0.00 PerfScore 0.00

@AndyAyersMS
Copy link
Member

Here it is -- Note SPMI hides the callee name.

This is with dynamic PGO, but static PGO will do similar things.

;;; NO PGO

; Assembly listing for method System.Threading.Thread:get_CurrentThread():System.Threading.Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO == 1
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  2,  4   )     ref  ->  rax         class-hnd "dup spill"
;  V02 tmp2         [V02,T01] (  3,  2.50)     ref  ->  rax        
;
; Lcl frame size = 40

G_M9749_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M9749_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rcx, 0xD1FFAB1E
       mov      edx, 639
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
       ; byrRegs +[rax]
       ; gcr arg pop 0
       mov      rax, gword ptr [rax+24]
       ; gcrRegs +[rax]
       ; byrRegs -[rax]
       test     rax, rax
       jne      SHORT G_M9749_IG04
						;; bbWeight=1    PerfScore 4.75
G_M9749_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ; gcrRegs -[rax]
       add      rsp, 40
       jmp      hackishModuleName:hackishMethodName()
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.12
G_M9749_IG04:        ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref, epilog, nogc
       ; gcrRegs +[rax]
       add      rsp, 40
       ret      
						;; bbWeight=0.50 PerfScore 0.62

;;; PGO

; Assembly listing for method System.Threading.Thread:get_CurrentThread():System.Threading.Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 240550
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  2,  4   )     ref  ->  rax         class-hnd "dup spill"
;  V02 tmp2         [V02,T01] (  3,  3.00)     ref  ->  rax        
;
; Lcl frame size = 40

G_M9749_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M9749_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       mov      rcx, 0xD1FFAB1E
       mov      edx, 639
       call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
       ; byrRegs +[rax]
       ; gcr arg pop 0
       mov      rax, gword ptr [rax+24]
       ; gcrRegs +[rax]
       ; byrRegs -[rax]
       test     rax, rax
       je       SHORT G_M9749_IG04
						;; bbWeight=1    PerfScore 4.75
G_M9749_IG03:        ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref, epilog, nogc
       add      rsp, 40
       ret      
						;; bbWeight=1.00 PerfScore 1.25
G_M9749_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
       ; gcrRegs -[rax]
       add      rsp, 40
       jmp      hackishModuleName:hackishMethodName()
       ; gcr arg pop 0
						;; bbWeight=0.00 PerfScore 0.00

Raw profile view showing branch bias:

@benaadams
Copy link
Member Author

That's beautiful! 🤩

Should this issue be closed?

@AndyAyersMS
Copy link
Member

We should be seeing profile data flowing into both prejitting and jitting soon, so it would be good to confirm once that happens that we get the above without any explicit enabling of PGO.

So let's keep this open as a reminder.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 19, 2021

Here's a listing from today's build. We now have PGO data flowing into SPC and on through to jitting.

This is happening without any COMPlus settings.

; Assembly listing for method Thread:get_CurrentThread():Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 157048
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  2,  4   )     ref  ->  rax         class-hnd "dup spill"
;  V02 tmp2         [V02,T01] (  3,  3.00)     ref  ->  rax
;
; Lcl frame size = 40

G_M63029_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
                                                ;; bbWeight=1    PerfScore 0.25
G_M63029_IG02:              ;; offset=0004H
       48B92000239AFD7F0000 mov      rcx, 0x7FFD9A230020
       BA7F020000           mov      edx, 639
       E8886AAB5F           call     CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
       488B4018             mov      rax, gword ptr [rax+24]
       4885C0               test     rax, rax
       7405                 je       SHORT G_M63029_IG04
                                                ;; bbWeight=1    PerfScore 4.75
G_M63029_IG03:              ;; offset=0021H
       4883C428             add      rsp, 40
       C3                   ret
                                                ;; bbWeight=1.00 PerfScore 1.25
G_M63029_IG04:              ;; offset=0026H
       4883C428             add      rsp, 40
       E9C955FFFF           jmp      Thread:InitializeCurrentThread():Thread
                                                ;; bbWeight=0.00 PerfScore 0.00

@benaadams
Copy link
Member Author

Does it pick that up for async Method kick offs (as they will be new inlines); though presumably that will kick in at Tier1 anyway and at Tier0 the ordering of that if probably isn't a significant feature

@AndyAyersMS
Copy link
Member

We still have some work to do here, in cases where the caller doesn't have PGO data.

For instance

****** START compiling AsyncMethodBuilderCore:Start(byref) (MethodHash=6f463cb7)

The inlinee Thread:get_CurrentThread():Thread has good looking profile data initially:

Expanding INLINE_CANDIDATE in statement STMT00001 in BB03:
STMT00001 (IL 0x014...0x019)
               [000008] I-C-G-------              *  CALL      ref    Thread.get_CurrentThread (exactContextHnd=0x00007FFF7268B729)
...
Computing inlinee profile scale:
   ... call site not profiled

*************** Inline @[000008] Finishing PHASE Profile incorporation
Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight      IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB14 [0012]  1                             157k 157048    [000..008)-> BB16 ( cond )                     IBC 
BB15 [0013]  1                           296       296    [008..00E)                                     IBC 
BB16 [0014]  2                             157k 157048    [00E..00F)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

but the caller does not, and so we don't bother to scale the inlinee counts, and after the resultant block fusion and unity weight descaling we end up seriously confused:

----------- Statements (and blocks) added due to the inlining of call [000008] -----------

Inlinee method body:New Basic Block BB17 [0015] created.

Convert bbJumpKind of BB16 to BBJ_NONE
fgInlineAppendStatements: no gc ref inline locals.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB14 [0012]  1                             1           [014..015)-> BB16 ( cond )                     i 
BB15 [0013]  1                             2.96 296    [014..015)                                     i IBC 
BB16 [0014]  2                             1           [014..015)                                     i 
--

Looks like we should always compute a scale factor if the callee has PGO.

In general the jit is not very smart yet about mixing PGO and non-PGO.

@AndyAyersMS AndyAyersMS mentioned this issue Mar 25, 2021
54 tasks
@benaadams
Copy link
Member Author

benaadams commented Mar 25, 2021

Looks like we should always compute a scale factor if the callee has PGO.

Would this currently be an issue as if the caller has a loop as it wouldn't do a retiering Jit?

@AndyAyersMS
Copy link
Member

With the advent of #49793 many framework methods will come equipped with profile data. And with the advent of #47558 the runtime will always tell the jit to look for profile data for optimized methods.

Thus for any method that inlines Thread:get_CurrentThread() and is optimized the jit will see the profile data and be able to tell that the branch is biased. You can see that in the profile incorporation above, before the jit decides to spoil things...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
Development

No branches or pull requests

5 participants