RyuJIT: Fasta benchmark: hot method random() is not in-lined by legacy policy into SelectRandom() #7311
The method in question is:

    const int IM = 139968;
    const int IA = 3877;
    const int IC = 29573;
    static int seed = 42;

    static double random (double max)
    {
        return max * ((seed = (seed * IA + IC) % IM) * (1.0 / IM));
    }

This ends up being 22 bytes of IL because CSC turns
The code noted above still appears in the fasta-2 variant. The caller is
Also note that profile data suggests the subsequent loop is not very hot: with weight 1.73, it tends not to iterate much. (It is not yet clear whether this profile data is accurate late in the JIT pipeline, so I will drill into that and the inlining heuristics.)
; Assembly listing for method Fasta_2:SelectRandom(ref):ubyte
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 4297046
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 4.73) ref -> rsi class-hnd
; V01 loc0 [V01,T04] ( 2, 2.73) double -> mm0
; V02 loc1 [V02,T01] ( 5, 4.92) int -> rax
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V04 cse0 [V04,T02] ( 3, 4.46) ref -> rcx "CSE - aggressive"
; V05 cse1 [V05,T03] ( 6, 2.73) int -> rdx "CSE - aggressive"
;
; Lcl frame size = 32
G_M9962_IG01: ;; offset=0000H
56 push rsi
4883EC20 sub rsp, 32
C5F877 vzeroupper
488BF1 mov rsi, rcx
;; bbWeight=1 PerfScore 2.50
G_M9962_IG02: ;; offset=000BH
C5FB100555000000 vmovsd xmm0, qword ptr [reloc @RWD00]
E8206BD9FF call Fasta_2:random(double):double
33C0 xor eax, eax
8B5608 mov edx, dword ptr [rsi+8]
85D2 test edx, edx
7E1C jle SHORT G_M9962_IG05
;; bbWeight=1 PerfScore 6.50
G_M9962_IG03: ;; offset=0021H
4863C8 movsxd rcx, eax
488B4CCE10 mov rcx, gword ptr [rsi+8*rcx+16]
C5FB104908 vmovsd xmm1, qword ptr [rcx+8]
C5F92EC8 vucomisd xmm1, xmm0
7723 ja SHORT G_M9962_IG07
;; bbWeight=1.73 PerfScore 10.81
G_M9962_IG04: ;; offset=0034H
FFC0 inc eax
3BD0 cmp edx, eax
7FE7 jg SHORT G_M9962_IG03
;; bbWeight=0.73 PerfScore 1.09
G_M9962_IG05: ;; offset=003AH
8D42FF lea eax, [rdx-1]
3BC2 cmp eax, edx
731E jae SHORT G_M9962_IG09
FFCA dec edx
4863C2 movsxd rax, edx
488B44C610 mov rax, gword ptr [rsi+8*rax+16]
0FB64010 movzx rax, byte ptr [rax+16]
;; bbWeight=0 PerfScore 0.00
G_M9962_IG06: ;; offset=004FH
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; bbWeight=0 PerfScore 0.00
G_M9962_IG07: ;; offset=0055H
0FB64110 movzx rax, byte ptr [rcx+16]
;; bbWeight=1 PerfScore 2.00
G_M9962_IG08: ;; offset=0059H
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; bbWeight=1 PerfScore 1.75
G_M9962_IG09: ;; offset=005FH
E84C3FA45F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
RWD00 dq 3FF0000000000000h ; 1
; Total bytes of code 101, prolog size 11, PerfScore 35.06, instruction count 34 (MethodHash=a6f4d915) for method Fasta_2:SelectRandom(ref):ubyte
[edit: clarified the call site is not in a loop so no extra boost expected with PGO]
Importer profile data shows the loop is indeed not very hot, and 1.73 is the right weight.
Default heuristic (no PGO)
Default heuristic (Tiered PGO). Here "warm" just means nonzero profile count. Note this requires QJFL=1 or
New PGO heuristic (note call site frequency is 1, so no extra boost; also we predict this is a size decreasing inline):
Assembly for the PGO case is below; note the size is definitely larger.
; Assembly listing for method Fasta_2:SelectRandom(ref):ubyte
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 7033212
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 4.72) ref -> rcx class-hnd
; V01 loc0 [V01,T07] ( 2, 2.72) double -> mm0
; V02 loc1 [V02,T04] ( 5, 4.88) int -> rax
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 3, 6 ) int -> r8 "dup spill"
; V05 tmp2 [V05,T02] ( 3, 6 ) int -> r8 "fgInsertCommaFormTemp is creating a new local variable"
; V06 cse0 [V06,T05] ( 3, 4.44) ref -> r8 "CSE - aggressive"
; V07 cse1 [V07,T06] ( 6, 2.72) int -> rdx "CSE - aggressive"
; V08 rat0 [V08,T03] ( 3, 6 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 40
G_M9962_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
C5F877 vzeroupper
;; bbWeight=1 PerfScore 1.25
G_M9962_IG02: ;; offset=0007H
48B82410FFC1F87F0000 mov rax, 0x7FF8C1FF1024
446900250F0000 imul r8d, dword ptr [rax], 0xF25
4181C085730000 add r8d, 0x7385
BA8156F71D mov edx, 0x1DF75681
8BC2 mov eax, edx
41F7E8 imul edx:eax, r8d
8BC2 mov eax, edx
C1E81F shr eax, 31
C1FA0E sar edx, 14
03C2 add eax, edx
69C0C0220200 imul eax, eax, 0x222C0
442BC0 sub r8d, eax
48B82410FFC1F87F0000 mov rax, 0x7FF8C1FF1024
448900 mov dword ptr [rax], r8d
C5F857C0 vxorps xmm0, xmm0
C4C17B2AC0 vcvtsi2sd xmm0, r8d
C5FB590556000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD00]
33C0 xor eax, eax
8B5108 mov edx, dword ptr [rcx+8]
85D2 test edx, edx
7E1D jle SHORT G_M9962_IG05
;; bbWeight=1 PerfScore 26.83
G_M9962_IG03: ;; offset=0063H
4C63C0 movsxd r8, eax
4E8B44C110 mov r8, gword ptr [rcx+8*r8+16]
C4C17B104808 vmovsd xmm1, qword ptr [r8+8]
C5F92EC8 vucomisd xmm1, xmm0
7721 ja SHORT G_M9962_IG07
;; bbWeight=1.72 PerfScore 10.75
G_M9962_IG04: ;; offset=0077H
FFC0 inc eax
3BD0 cmp edx, eax
7FE6 jg SHORT G_M9962_IG03
;; bbWeight=0.72 PerfScore 1.08
G_M9962_IG05: ;; offset=007DH
8D42FF lea eax, [rdx-1]
3BC2 cmp eax, edx
731D jae SHORT G_M9962_IG09
FFCA dec edx
4863C2 movsxd rax, edx
488B44C110 mov rax, gword ptr [rcx+8*rax+16]
0FB64010 movzx rax, byte ptr [rax+16]
;; bbWeight=0 PerfScore 0.00
G_M9962_IG06: ;; offset=0092H
4883C428 add rsp, 40
C3 ret
;; bbWeight=0 PerfScore 0.00
G_M9962_IG07: ;; offset=0097H
410FB64010 movzx rax, byte ptr [r8+16]
;; bbWeight=1 PerfScore 2.00
G_M9962_IG08: ;; offset=009CH
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M9962_IG09: ;; offset=00A1H
E8EA3EA15F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
RWD00 dq 3EDDF75680FEB65Fh ; 7.14449017e-06
; Total bytes of code 167, prolog size 7, PerfScore 60.16, instruction count 45 (MethodHash=a6f4d915) for method Fasta_2:SelectRandom(ref):ubyte
Profile data shows this is indeed where all the time is spent:
For default:
and with PGO/inlining
Per BDN, the PGO inline version is not consistently faster in cycles. Should revisit once #44370 is merged.
At any rate, PGO inlining does inline the hot methods here.
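To make the inlined listing above easier to read: the `% IM` in `random()` shows up not as a division but as a multiply by `0x1DF75681`, a shift, and a multiply-subtract by `0x222C0` (which is `IM`). The constants `0xF25` and `0x7385` are `IA` and `IC`, and `RWD00` is `1.0 / IM` with the `max = 1.0` argument folded in. The following is a minimal sketch, not JIT code, that checks the multiply-and-shift form against `%` for every value the seed can take; the class and method names are made up for illustration, and it only models non-negative inputs (where the sign-correction `shr eax, 31` / `add` pair contributes zero).

```csharp
using System;

public static class MagicModuloSketch
{
    const int IM = 139968;          // 0x222C0 in the listing
    const int IA = 3877;            // 0xF25 in the listing
    const int IC = 29573;           // 0x7385 in the listing
    const long Magic = 0x1DF75681;  // ceil(2^46 / IM)

    // Mirrors the inlined sequence for a non-negative input:
    // take the high 32 bits of x * Magic, shift right by a further 14
    // (46 bits in total) to get the quotient, then compute x - quotient * IM.
    public static int ModViaMultiply(int x)
    {
        int quotient = (int)((x * Magic) >> 46);
        return x - quotient * IM;
    }

    // Checks every reachable seed value: after the first call the seed
    // is always in [0, IM), so seed * IA + IC stays positive and fits in an int.
    public static bool MatchesForAllSeeds()
    {
        for (int seed = 0; seed < IM; seed++)
        {
            int x = seed * IA + IC;
            if (ModViaMultiply(x) != x % IM)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(MatchesForAllSeeds());
    }
}
```

This is why the inlined version grows the method: the call is replaced by the full strength-reduced modulo sequence plus the int-to-double conversion.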
Looking at instructions retired (per BDN)
So instructions retired drop to about 2.43/(2.73 + 0.06) = 0.87 of the default (roughly a 13% reduction), and (in this run) cycles drop by a similar amount. So we'd expect about a 10% improvement overall.
CqPerf version of this benchmark is very sensitive to in-lining.
Forcing inline of random() would cause RyuJIT to beat Legacy Jit64 by 7.4% in execution perf.
Model policy in-lines random() method into SelectRandom().
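For experimenting with the "forcing inline" case, one option is to mark the callee with `MethodImplOptions.AggressiveInlining`, which asks the JIT to inline if possible rather than leaving the decision to the size-based heuristic. The sketch below is an assumption-laden stand-in, not the benchmark source: the `Frequency` type, its field names, and the body of `SelectRandom` are reconstructed from the shape of the disassembly (a linear scan comparing a double field against the random value and returning a byte field), and `Random` is renamed only to follow C# conventions.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class FastaInlineSketch
{
    const int IM = 139968;
    const int IA = 3877;
    const int IC = 29573;
    static int seed = 42;

    // Hypothetical stand-in for the benchmark's frequency table entry.
    public sealed class Frequency
    {
        public double p;   // cumulative probability
        public byte c;     // symbol to emit
        public Frequency(byte c, double p) { this.c = c; this.p = p; }
    }

    // Same body as the method in the issue; the attribute requests inlining
    // but does not guarantee it.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Random(double max)
    {
        return max * ((seed = (seed * IA + IC) % IM) * (1.0 / IM));
    }

    // Assumed caller shape: return the first entry whose cumulative
    // probability exceeds the random value, else the last entry.
    public static byte SelectRandom(Frequency[] a)
    {
        double r = Random(1.0);
        for (int i = 0; i < a.Length; i++)
            if (r < a[i].p)
                return a[i].c;
        return a[a.Length - 1].c;
    }

    public static void Main()
    {
        var table = new[]
        {
            new Frequency((byte)'a', 0.25),
            new Frequency((byte)'c', 0.50),
            new Frequency((byte)'g', 0.75),
            new Frequency((byte)'t', 1.00),
        };
        for (int i = 0; i < 8; i++)
            Console.Write((char)SelectRandom(table));
        Console.WriteLine();
    }
}
```

Comparing the JIT disassembly of `SelectRandom` with and without the attribute should show whether the call to `Random` remains.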
category:cq
theme:inlining
skill-level:expert
cost:large