JIT: Generalized struct promotion #76928
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Description

Struct promotion is an optimization that replaces structs with their constituent fields, allowing those fields to be optimized as if they were normal local variables. This is a very important optimization for low-level performance oriented code that makes heavy use of structs, so it is important that it is supported well by the JIT.

Limitations

The JIT supports promotion but with the following limitations today:
This issue is about removing (some of) these limitations.

Plan

The preliminary idea is to introduce a new pass that replaces struct fields by new local variables and the "whole struct value" by the reassembling of the promoted fields and the residual fields. The pass will need the proper heuristics to figure out which fields to promote (depending on the contexts in which they are used), and potentially in which parts of the function (e.g. due to being address exposed on some paths). It is likely that some form of struct liveness will be needed by this pass and the hope is that the liveness pass from #76069 will be beneficial here as well.

One difficulty is in the representation of multi-reg args and returns at the ABI boundaries. Today they more or less "fall out" from the whole-promotion representation by using the parent struct local as the use/def. A new representation will likely be needed if structs no longer need to be entirely promoted. Initially I expect we can piggyback on the existing mechanism to get to a working prototype. As a long-term goal, however, it would be nice to replace the existing mechanism entirely.
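As a rough source-level analogy for the plan above, the following C++ sketch contrasts an unpromoted struct local with a hand-promoted version in which each field is its own local and the whole struct value is reassembled only where it is used as a whole. The names `Pair`, `Consume`, `SumUnpromoted` and `SumPromoted` are hypothetical and exist only for this illustration. The JIT performs the transformation on its IR, not on source code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-field struct used only for this illustration.
struct Pair
{
    std::int32_t a;
    std::int32_t b;
};

// Stands in for a use of the whole struct value (e.g. a call argument).
inline std::int32_t Consume(Pair p)
{
    return p.a * 10 + p.b;
}

// Unpromoted form: every field access goes through the struct local.
inline std::int32_t SumUnpromoted(std::int32_t n)
{
    assert(n >= 0);
    Pair s{0, 1};
    for (std::int32_t i = 0; i < n; ++i)
    {
        s.a += s.b;
        s.b += 1;
    }
    return Consume(s);
}

// Hand-promoted form: the fields live in separate locals and the struct
// is reassembled only at the whole-struct use.
inline std::int32_t SumPromoted(std::int32_t n)
{
    assert(n >= 0);
    std::int32_t s_a = 0; // replacement local for Pair::a
    std::int32_t s_b = 1; // replacement local for Pair::b
    for (std::int32_t i = 0; i < n; ++i)
    {
        s_a += s_b;
        s_b += 1;
    }
    Pair s{s_a, s_b}; // write the promoted fields back into the struct
    return Consume(s);
}
```

Both functions compute the same result. The promoted form keeps the loop free of struct accesses and pays for a single reassembly at the boundary, which is the trade-off the heuristics mentioned above would have to weigh.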
|
cc @dotnet/jit-contrib |
The kinds of struct locals we can promote appear only as operands of After thinking some more I've come back to another idea which I think I will investigate. Instead of introducing a new node in HIR that persists until lowering, we introduce the equivalent of For the generalized struct promotion pass we would then only introduce new assignments: to pass one of these struct locals to a call, or to return one of them, it will just write all the constituent fields back into the struct local and leave the local in the call argument/ A priori this would just amount to a lot of unnecessary copying in the generated code. To get to an acceptable state, we would introduce an optimization in lowering to handle these patterns without copying. For example, after generalized struct promotion + rationalization, we would have LIR like:
we would do some analysis to figure out whether
Similarly for call results, e.g. we would get LIR like the following for generalized promotion after rationalization:
and, as an optimization, lower it into:
Some questions to investigate:
|
What about for Same for cases like |
What particular intrinsics take arbitrary struct arguments? Can they be decomposed early like |
There is #80297, where we are handling an intrinsic that essentially has a "multi-reg arg" via early decomposition into a |
Single beat me to it, that was the example I was going to give 😄 |
I don't see why the existing approach there wouldn't continue to work. The representation for call args here could also be |
A similar node would be needed for parameters. Generalized promotion would create IR in the start of functions to load (parts of) parameters into the promoted field locals and lowering would optimize these into some |
Introduce a "physical" promotion pass that generalizes the existing promotion. More specifically, it does not have restrictions on field count and it can handle arbitrary recursive promotion. The pass is physical in the sense that it does not rely on any field metadata for structs. Instead, it works in two separate passes over the IR:

1. In the first pass we find and analyze how unpromoted struct locals are accessed. For example, for a simple program like:

```
public static void Main()
{
    S s = default;
    Call(s, s.C);
    Console.WriteLine(s.B + s.C);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Call(S s, byte b)
{
}

private struct S
{
    public byte A, B, C, D, E;
}
```

we see IR like:

```
***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
[000003] IA--------- ▌ ASG struct (init)
[000001] D------N--- ├──▌ LCL_VAR struct<Program+S, 5> V00 loc0
[000002] ----------- └──▌ CNS_INT int 0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
[000008] --C-G------ ▌ CALL void Program:Call(Program+S,ubyte)
[000004] ----------- arg0 ├──▌ LCL_VAR struct<Program+S, 5> V00 loc0
[000007] ----------- arg1 └──▌ LCL_FLD ubyte V00 loc0 [+2]

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
[000016] --C-G------ ▌ CALL void System.Console:WriteLine(int)
[000015] ----------- arg0 └──▌ ADD int
[000011] -----------         ├──▌ LCL_FLD ubyte V00 loc0 [+1]
[000014] -----------         └──▌ LCL_FLD ubyte V00 loc0 [+2]
```

and the analysis produces

```
Accesses for V00
  [000..005)
    #: (2, 200)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (1, 100)
    # as implicit by-ref call arg: (1, 100)
    # as on-stack call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)

  ubyte @ 001
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (0, 0)
    # as call arg: (0, 0)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)

  ubyte @ 002
    #: (2, 200)
    # assigned from: (0, 0)
    # assigned to: (0, 0)
    # as call arg: (1, 100)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
```

Here the pairs are (#ref counts, wtd ref counts). Based on this accounting, the analysis estimates the profitability of replacing some of the accessed parts of the struct with a local. This may be costly because overlapping struct accesses (e.g. passing the whole struct as an argument) may require more expensive codegen after promotion. And of course, creating new locals introduces more register pressure. Currently the profitability analysis is very crude. In this case the logic decides that promotion is not worth it:

```
Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Disqualifying replacement
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Disqualifying replacement
```

2. In the second pass the field accesses are replaced with new locals for the profitable cases. For overlapping accesses that currently involves writing back replacements to the struct local first. For arguments/OSR locals, it involves reading them back from the struct first. In the above case we can override the profitability analysis with stress mode STRESS_PHYSICAL_PROMOTION_COST and we get:

```
Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Promoting replacement due to stress

lvaGrabTemp returning 2 (V02 tmp1) (a long lifetime temp) called for V00.[001..002).
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Promoting replacement due to stress

lvaGrabTemp returning 3 (V03 tmp2) (a long lifetime temp) called for V00.[002..003).
V00 promoted with 2 replacements
  [001..002) promoted as ubyte V02
  [002..003) promoted as ubyte V03

...

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
[000003] IA--------- ▌ ASG struct (init)
[000001] D------N--- ├──▌ LCL_VAR struct<Program+S, 5> V00 loc0
[000002] ----------- └──▌ CNS_INT int 0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
[000008] -ACXG------ ▌ CALL void Program:Call(Program+S,ubyte)
[000004] ----------- arg0 ├──▌ LCL_VAR struct<Program+S, 5> V00 loc0
[000022] -A--------- arg1 └──▌ COMMA ubyte
[000021] -A---------         ├──▌ ASG ubyte
[000019] D------N---         │  ├──▌ LCL_VAR ubyte V03 tmp2
[000020] -----------         │  └──▌ LCL_FLD ubyte V00 loc0 [+2]
[000018] -----------         └──▌ LCL_VAR ubyte V03 tmp2

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
[000016] -ACXG------ ▌ CALL void System.Console:WriteLine(int)
[000015] -A--------- arg0 └──▌ ADD int
[000027] -A---------         ├──▌ COMMA ubyte
[000026] -A---------         │  ├──▌ ASG ubyte
[000024] D------N---         │  │  ├──▌ LCL_VAR ubyte V02 tmp1
[000025] -----------         │  │  └──▌ LCL_FLD ubyte V00 loc0 [+1]
[000023] -----------         │  └──▌ LCL_VAR ubyte V02 tmp1
[000028] -----------         └──▌ LCL_VAR ubyte V03 tmp2
```

The pass still only has rudimentary support and is missing many basic CQ optimizations. For example, it does not make use of any liveness yet and it does not have any decomposition support for assignments. Yet, it already shows good potential in user benchmarks. I have listed some follow-up improvements in #76928.

This PR is adding the pass but it is disabled by default. It can be enabled by setting DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION. There are two new scenarios added to jit-experimental that enable it, to be used for testing purposes.
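The first-pass accounting described above can be pictured with a small C++ sketch. This is a simplified model and not the JIT's data structure: `AccessCounts`, `AccessTable`, `Record`, `Find`, `OverlappingCount` and `BuildExampleTable` are hypothetical names, and only the total count, the weighted count and the call-argument count from the dump are modeled. `BuildExampleTable` records the five accesses of the sample program, each with an assumed block weight of 100.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical per-range counters, a subset of what the dump above shows.
struct AccessCounts
{
    unsigned count         = 0; // "#" (first element of the pair)
    unsigned weightedCount = 0; // "#" (second element, weighted by block)
    unsigned asCallArg     = 0; // "# as call arg" (unweighted)
};

// Hypothetical table keyed by the accessed byte range [start, end).
class AccessTable
{
    using Range = std::pair<unsigned, unsigned>;

public:
    void Record(unsigned start, unsigned end, unsigned blockWeight, bool isCallArg)
    {
        assert(start < end);
        AccessCounts& counts = m_accesses[Range(start, end)];
        counts.count++;
        counts.weightedCount += blockWeight;
        if (isCallArg)
        {
            counts.asCallArg++;
        }
    }

    // Returns nullptr when the exact range was never accessed.
    const AccessCounts* Find(unsigned start, unsigned end) const
    {
        auto it = m_accesses.find(Range(start, end));
        return (it == m_accesses.end()) ? nullptr : &it->second;
    }

    // Counts accesses of *other* ranges that overlap [start, end). These are
    // the accesses that would need write-backs or read-backs after promotion.
    unsigned OverlappingCount(unsigned start, unsigned end) const
    {
        unsigned total = 0;
        for (const auto& entry : m_accesses)
        {
            const Range& r        = entry.first;
            const bool   same     = (r.first == start) && (r.second == end);
            const bool   overlaps = (r.first < end) && (start < r.second);
            if (!same && overlaps)
            {
                total += entry.second.count;
            }
        }
        return total;
    }

private:
    std::map<Range, AccessCounts> m_accesses;
};

// Records the accesses of the sample program (V00 is the local `s`).
inline AccessTable BuildExampleTable()
{
    AccessTable table;
    table.Record(0, 5, 100, false); // STMT00000: S s = default
    table.Record(0, 5, 100, true);  // STMT00001: s passed to Call
    table.Record(2, 3, 100, true);  // STMT00001: s.C passed to Call
    table.Record(1, 2, 100, false); // STMT00002: s.B
    table.Record(2, 3, 100, false); // STMT00002: s.C
    return table;
}
```

With this table, the entry for `[2, 3)` has a count of 2 and a weighted count of 200, matching the `ubyte @ 002` line of the dump, and `OverlappingCount(2, 3)` reports the two whole-struct accesses that make promoting that field more expensive. The actual cost formula behind "Cost with" and "Cost without" is not modeled here.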
Some measurements over asp.net for block copies/inits and whether they involve promoted structs:

Copies physical -> physical: 3
Copies physical -> old: 283
Copies old -> physical: 250
Copies physical -> : 65
Copies -> physical: 268
Inits -> physical: 37

("old" means structs that are promoted by the normal mechanism)

It would be great to reuse block morphing to do the decomposition, but I'm not sure how simple that would be -- the decomposition for copies involving physically promoted structs is quite a bit more complicated.

Same measurements with old promotion disabled:

Copies physical -> physical: 162
Copies physical -> old: 0
Copies old -> physical: 0
Copies physical -> : 1332
Copies -> physical: 6034
Inits -> physical: 99
We frequently see promotion opportunities for standard C# code iterating lists via

Where the codegen for the loop ends up with the following diff:

@@ -1,35 +1,33 @@
G_M45156_IG09:
- mov rax, gword ptr [rbp-28H]
- mov rdx, gword ptr [rbp-20H]
+ mov rax, gword ptr [rbp-38H]
+ mov rdx, gword ptr [rbp-30H]
mov rcx, gword ptr [rax+08H]
call [rax+18H]System.Action`1[System.__Canon]:Invoke(System.__Canon):this
;; size=15 bbWeight=1 PerfScore 7.00
G_M45156_IG10:
- mov rcx, gword ptr [rbp-38H]
- mov esi, dword ptr [rbp-2CH]
- mov edi, dword ptr [rcx+14H]
- cmp esi, edi
+ mov rcx, rsi
+ mov r14d, dword ptr [rcx+14H]
+ cmp edi, r14d
jne SHORT G_M45156_IG14
- mov edx, dword ptr [rbp-30H]
- cmp edx, dword ptr [rcx+10H]
+ cmp ebx, dword ptr [rsi+10H]
jae SHORT G_M45156_IG15
- ;; size=22 bbWeight=2 PerfScore 20.50
+ ;; size=17 bbWeight=2 PerfScore 15.00
G_M45156_IG11:
- mov rcx, gword ptr [rcx+08H]
- mov eax, edx
- cmp eax, dword ptr [rcx+08H]
+ mov rcx, gword ptr [rsi+08H]
+ cmp ebx, dword ptr [rcx+08H]
jae SHORT G_M45156_IG08
- shl rax, 4
+ mov edx, ebx
+ shl rdx, 4
;; size=15 bbWeight=1 PerfScore 6.75
G_M45156_IG12:
- vmovdqu xmm0, xmmword ptr [rcx+rax+10H]
- vmovdqu xmmword ptr [rbp-28H], xmm0
+ vmovdqu xmm0, xmmword ptr [rcx+rdx+10H]
+ vmovdqu xmmword ptr [rbp-38H], xmm0
;; size=11 bbWeight=1 PerfScore 5.00
G_M45156_IG13:
- inc edx
- mov dword ptr [rbp-30H], edx
+ inc ebx
jmp SHORT G_M45156_IG09
- ;; size=7 bbWeight=1 PerfScore 3.25
+ ;; size=4 bbWeight=1 PerfScore 2.25
G_M45156_IG14:
- cmp esi, edi
+ cmp edi, r14d
jne SHORT G_M45156_IG07
Investigating some current causes of regressions when enabling physical promotion by default.

(edit: handled by #87265)

aspnet.run.windows.x64.checked.mch:

+37 (+14.57%) : 18820.dasm - System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this

@@ -14,40 +14,44 @@
;* V04 loc3 [V04 ] ( 0, 0 ) int -> zero-ref
; V05 loc4 [V05,T04] ( 3, 6 ) int -> rcx
; V06 OutArgs [V06 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V07 tmp1 [V07,T06] ( 5, 5 ) struct (32) [rsp+28H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+; V07 tmp1 [V07,T06] ( 4, 4 ) struct (32) [rsp+48H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
;* V08 tmp2 [V08 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
-; V09 tmp3 [V09,T10] ( 2, 2 ) ref -> rax class-hnd single-def "Inlining Arg"
+; V09 tmp3 [V09,T11] ( 2, 2 ) ref -> rax class-hnd single-def "Inlining Arg"
;* V10 tmp4 [V10 ] ( 0, 0 ) struct (24) zero-ref "Inlining Arg"
-;* V11 tmp5 [V11 ] ( 0, 0 ) struct (32) zero-ref do-not-enreg[S] "Inlining Arg"
-; V12 tmp6 [V12,T11] ( 2, 1 ) ref -> rdx single-def V10.<TokenSource>k__BackingField(offs=0x00) P-INDEP "field V10.<TokenSource>k__BackingField (fldOffset=0x0)"
-; V13 tmp7 [V13,T12] ( 2, 1 ) ref -> rcx single-def V10.<ChangeToken>k__BackingField(offs=0x08) P-INDEP "field V10.<ChangeToken>k__BackingField (fldOffset=0x8)"
-; V14 tmp8 [V14,T13] ( 2, 1 ) ref -> r8 single-def V10.<Matcher>k__BackingField(offs=0x10) P-INDEP "field V10.<Matcher>k__BackingField (fldOffset=0x10)"
-; V15 cse0 [V15,T07] ( 2, 4 ) int -> rax "CSE - aggressive"
-;* V16 rat0 [V16,T09] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
-;* V17 rat1 [V17,T14] ( 0, 0 ) long -> zero-ref "runtime lookup"
-;* V18 rat2 [V18,T08] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
-; V19 rat3 [V19,T05] ( 3, 6 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
+; V11 tmp5 [V11,T08] ( 3, 3 ) struct (32) [rsp+28H] do-not-enreg[S] must-init "Inlining Arg"
+; V12 tmp6 [V12,T12] ( 2, 1 ) ref -> rdx single-def V10.<TokenSource>k__BackingField(offs=0x00) P-INDEP "field V10.<TokenSource>k__BackingField (fldOffset=0x0)"
+; V13 tmp7 [V13,T13] ( 2, 1 ) ref -> rcx single-def V10.<ChangeToken>k__BackingField(offs=0x08) P-INDEP "field V10.<ChangeToken>k__BackingField (fldOffset=0x8)"
+; V14 tmp8 [V14,T14] ( 2, 1 ) ref -> r8 single-def V10.<Matcher>k__BackingField(offs=0x10) P-INDEP "field V10.<Matcher>k__BackingField (fldOffset=0x10)"
+;* V15 tmp9 [V15 ] ( 0, 0 ) ref -> zero-ref single-def "V07.[000..008)"
+; V16 cse0 [V16,T07] ( 2, 4 ) int -> rax "CSE - aggressive"
+;* V17 rat0 [V17,T10] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
+;* V18 rat1 [V18,T15] ( 0, 0 ) long -> zero-ref "runtime lookup"
+;* V19 rat2 [V19,T09] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
+; V20 rat3 [V20,T05] ( 3, 6 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
;
-; Lcl frame size = 72
+; Lcl frame size = 104
G_M47209_IG01: ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
push rdi
push rsi
push rbp
push rbx
- sub rsp, 72
+ sub rsp, 104
+ vzeroupper
xor eax, eax
mov qword ptr [rsp+28H], rax
vxorps xmm4, xmm4, xmm4
vmovdqa xmmword ptr [rsp+30H], xmm4
- mov qword ptr [rsp+40H], rax
+ vmovdqa xmmword ptr [rsp+40H], xmm4
+ vmovdqa xmmword ptr [rsp+50H], xmm4
+ mov qword ptr [rsp+60H], rax
mov rbx, rcx
; gcrRegs +[rbx]
- ;; size=33 bbWeight=1 PerfScore 9.08
+ ;; size=48 bbWeight=1 PerfScore 14.08
G_M47209_IG02: ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
mov edx, dword ptr [rbx+24H]
cmp edx, 2
- ja G_M47209_IG08
+ ja G_M47209_IG10
lea rcx, [reloc @RWD00]
mov ecx, dword ptr [rcx+4*rdx]
lea rax, G_M47209_IG02
@@ -66,7 +70,7 @@ G_M47209_IG03: ; bbWeight=0.50, gcrefRegs=0008 {rbx}, byrefRegs=0000 {},
; byrRegs -[rcx]
mov dword ptr [rbx+20H], -1
;; size=28 bbWeight=0.50 PerfScore 4.25
-G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
mov rdx, gword ptr [rbx+10H]
; gcrRegs +[rdx]
mov ecx, dword ptr [rbx+20H]
@@ -74,7 +78,7 @@ G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byr
mov dword ptr [rbx+20H], ecx
mov eax, dword ptr [rdx+08H]
cmp eax, ecx
- jbe SHORT G_M47209_IG08
+ jbe G_M47209_IG10
mov rdx, gword ptr [rdx+8*rcx+10H]
lea rcx, bword ptr [rbx+18H]
; byrRegs +[rcx]
@@ -82,7 +86,7 @@ G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byr
; gcrRegs -[rdx]
; byrRegs -[rcx]
mov dword ptr [rbx+24H], 2
- ;; size=40 bbWeight=2 PerfScore 26.00
+ ;; size=44 bbWeight=2 PerfScore 26.00
G_M47209_IG05: ; bbWeight=4, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
mov rbp, gword ptr [rbx+18H]
; gcrRegs +[rbp]
@@ -98,10 +102,16 @@ G_M47209_IG06: ; bbWeight=0.50, gcrefRegs=0028 {rbx rbp}, byrefRegs=0000
; gcrRegs +[rcx]
mov r8, gword ptr [rbp+30H]
; gcrRegs +[r8]
+ mov gword ptr [rsp+50H], rdx
+ mov gword ptr [rsp+58H], rcx
+ mov gword ptr [rsp+60H], r8
+ ;; size=31 bbWeight=0.50 PerfScore 5.50
+G_M47209_IG07: ; bbWeight=0.50, nogc, extend
+ vmovdqu ymm0, ymmword ptr [rsp+48H]
+ vmovdqu ymmword ptr [rsp+28H], ymm0
+ ;; size=12 bbWeight=0.50 PerfScore 2.50
+G_M47209_IG08: ; bbWeight=0.50, extend
mov gword ptr [rsp+28H], rax
- mov gword ptr [rsp+30H], rdx
- mov gword ptr [rsp+38H], rcx
- mov gword ptr [rsp+40H], r8
lea rdi, bword ptr [rbx+28H]
; byrRegs +[rdi]
lea rsi, bword ptr [rsp+28H]
@@ -119,33 +129,35 @@ G_M47209_IG06: ; bbWeight=0.50, gcrefRegs=0028 {rbx rbp}, byrefRegs=0000
; gcrRegs -[rdx rbp]
; byrRegs -[rcx rsi rdi]
mov eax, 1
- ;; size=83 bbWeight=0.50 PerfScore 10.38
-G_M47209_IG07: ; bbWeight=0.50, epilog, nogc, extend
- add rsp, 72
+ ;; size=52 bbWeight=0.50 PerfScore 4.88
+G_M47209_IG09: ; bbWeight=0.50, epilog, nogc, extend
+ vzeroupper
+ add rsp, 104
pop rbx
pop rbp
pop rsi
pop rdi
ret
- ;; size=9 bbWeight=0.50 PerfScore 1.62
-G_M47209_IG08: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
+ ;; size=12 bbWeight=0.50 PerfScore 2.12
+G_M47209_IG10: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
mov dword ptr [rbx+24H], 3
xor eax, eax
;; size=9 bbWeight=0.50 PerfScore 0.62
-G_M47209_IG09: ; bbWeight=0.50, epilog, nogc, extend
- add rsp, 72
+G_M47209_IG11: ; bbWeight=0.50, epilog, nogc, extend
+ vzeroupper
+ add rsp, 104
pop rbx
pop rbp
pop rsi
pop rdi
ret
- ;; size=9 bbWeight=0.50 PerfScore 1.62
+ ;; size=12 bbWeight=0.50 PerfScore 2.12
RWD00 dd G_M47209_IG03 - G_M47209_IG02
dd G_M47209_IG04 - G_M47209_IG02
dd G_M47209_IG05 - G_M47209_IG02
-; Total bytes of code 254, prolog size 33, PerfScore 100.98, instruction count 71, allocated bytes for code 254 (MethodHash=bed44796) for method System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
+; Total bytes of code 291, prolog size 48, PerfScore 113.18, instruction count 78, allocated bytes for code 291 (MethodHash=bed44796) for method System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
; ============================================================

Promotions:

Accesses for V07
ref @ 000
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
[000..032) as System.Collections.Generic.KeyValuePair`2[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]
#: (1, 100)
# assigned from: (1, 100)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
[008..032) as Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
Accesses for V11
[000..032) as System.Collections.Generic.KeyValuePair`2[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]
#: (2, 200)
# assigned from: (1, 100)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
Picking promotions for V07
Evaluating access ref @ 000
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 15 (V15 tmp9) (a long lifetime temp) called for V07.[000..008).
V07 promoted with 1 replacements
[000..008) promoted as ref V15
Computing unpromoted remainder for V07
Remainder: [008..032)

We end up with the following decomposition:

STMT00025 ( 0x096[--] ... ??? )
[000107] DA--------- ▌ STORE_LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V11 tmp5
[000036] ----------- └──▌ LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1 (last use)
Processing block operation [000107] that involves replacements
dst+000 <- V15 (V07.[000..008)) (last use)
Remainder: [008..032)
=> Remainder strategy: retain a full block op
Local V11 should not be enregistered because: was accessed as a local field
New statement:
STMT00025 ( 0x096[--] ... ??? )
[000112] -A--------- ▌ COMMA void
[000107] DA--------- ├──▌ STORE_LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V11 tmp5
[000036] ----------- │ └──▌ LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1
[000111] UA--------- └──▌ STORE_LCL_FLD ref V11 tmp5 [+0]
[000110] ----------- └──▌ LCL_VAR ref V15 tmp9 (last use)

However, after

- [000107]: [000104] is last use of [000107] (V11) -- fwd subbing [000036]; new next stmt is
-STMT00024 ( INL02 @ 0x000[E-] ... ??? ) <- INLRT @ 0x096[--]
- [000106] nA-XG------ ▌ STORE_BLK struct<System.Collections.Generic.KeyValuePair`2, 32> (copy)
- [000105] ---X------- ├──▌ FIELD_ADDR byref <unknown class>:<unknown field>
- [000020] ----------- │ └──▌ LCL_VAR ref V00 this
- [000036] ----------- └──▌ LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1 (last use)
-
-removing useless STMT00025 ( 0x096[--] ... ??? )
- [000107] DA--------- ▌ STORE_LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V11 tmp5
- [000036] ----------- └──▌ LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1 (last use)
- from BB07

It would be possible to look ahead to try to predict this situation and then handle the store by writing back to V07 ahead of it instead. Alternatively we could also run forward sub before physical promotion.
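The "Computing unpromoted remainder" step in the dump above amounts to an interval subtraction: the bytes of the struct that carry data, minus the ranges covered by replacements. The following C++ sketch shows one way to compute it. It is an assumption-laden illustration and not the JIT's implementation: `Segment` and `ComputeRemainder` are hypothetical names, and the caller is assumed to supply the non-padding ("significant") byte ranges of the struct, sorted and non-overlapping, along with sorted, non-overlapping replacement ranges.

```cpp
#include <cassert>
#include <vector>

// Hypothetical half-open byte range [start, end) within a struct local.
struct Segment
{
    unsigned start;
    unsigned end;
};

// Subtracts the replacement ranges from the significant segments.
// Both inputs are assumed to be sorted by start and non-overlapping.
inline std::vector<Segment> ComputeRemainder(const std::vector<Segment>& significant,
                                             const std::vector<Segment>& replacements)
{
    std::vector<Segment> remainder;
    for (const Segment& seg : significant)
    {
        assert(seg.start < seg.end);
        unsigned cur = seg.start; // first byte of seg not yet accounted for
        for (const Segment& rep : replacements)
        {
            if (rep.end <= cur)
            {
                continue; // replacement lies entirely before the open part
            }
            if (rep.start >= seg.end)
            {
                break; // replacement lies entirely after this segment
            }
            if (rep.start > cur)
            {
                remainder.push_back(Segment{cur, rep.start});
            }
            cur = rep.end;
            if (cur >= seg.end)
            {
                break; // segment fully covered
            }
        }
        if (cur < seg.end)
        {
            remainder.push_back(Segment{cur, seg.end});
        }
    }
    return remainder;
}
```

For V07 above, a 32-byte struct with the replacement `[000..008)` yields the remainder `[008..032)`, as in the dump. For an 8-byte struct whose data bytes are `[000..001)` and `[004..008)` (a `bool` followed by padding and an `int`), a replacement for `[000..001)` leaves `[004..008)`.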
(edit: partially handled by #87217, rest will be handled by #87410)

+21 (+18.10%) : 90439.dasm - Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int

@@ -7,79 +7,88 @@
; 0 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-; V00 arg0 [V00,T00] ( 9, 27 ) struct ( 8) [rsp+30H] do-not-enreg[SF] ld-addr-op single-def
-; V01 loc0 [V01,T02] ( 4, 9 ) int -> rcx
-;* V02 loc1 [V02 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
+; V00 arg0 [V00,T01] ( 5, 11 ) struct ( 8) [rsp+40H] do-not-enreg[SF] ld-addr-op single-def
+; V01 loc0 [V01,T06] ( 4, 9 ) int -> r8
+; V02 loc1 [V02 ] ( 4, 14 ) struct ( 8) [rsp+30H] do-not-enreg[SF] must-init ld-addr-op
;* V03 loc2 [V03 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "NewObj constructor temp"
-;* V06 tmp2 [V06 ] ( 0, 0 ) struct ( 8) zero-ref
+; V06 tmp2 [V06 ] ( 5, 14 ) struct ( 8) [rsp+28H] do-not-enreg[SF] must-init
;* V07 tmp3 [V07 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
-; V08 tmp4 [V08,T05] ( 2, 8 ) bool -> rax V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)"
-; V09 tmp5 [V09,T06] ( 2, 6 ) int -> rdx V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)"
+; V08 tmp4 [V08,T03] ( 3, 12 ) bool -> [rsp+30H] do-not-enreg[] V02.hasValue(offs=0x00) P-DEP "field V02.hasValue (fldOffset=0x0)"
+; V09 tmp5 [V09,T08] ( 2, 6 ) int -> [rsp+34H] do-not-enreg[] V02.value(offs=0x04) P-DEP "field V02.value (fldOffset=0x4)"
;* V10 tmp6 [V10 ] ( 0, 0 ) bool -> zero-ref V03.hasValue(offs=0x00) P-INDEP "field V03.hasValue (fldOffset=0x0)"
;* V11 tmp7 [V11 ] ( 0, 0 ) int -> zero-ref V03.value(offs=0x04) P-INDEP "field V03.value (fldOffset=0x4)"
-;* V12 tmp8 [V12,T08] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
-; V13 tmp9 [V13,T07] ( 2, 4 ) int -> r9 V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
-; V14 tmp10 [V14,T03] ( 3, 8 ) bool -> r8 V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)"
-; V15 tmp11 [V15,T04] ( 3, 8 ) int -> r9 V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)"
-; V16 rat0 [V16,T01] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
+;* V12 tmp8 [V12,T10] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
+; V13 tmp9 [V13,T09] ( 2, 4 ) int -> rdx V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
+; V14 tmp10 [V14,T02] ( 4, 12 ) bool -> [rsp+28H] do-not-enreg[] V06.hasValue(offs=0x00) P-DEP "field V06.hasValue (fldOffset=0x0)"
+; V15 tmp11 [V15,T07] ( 3, 8 ) int -> [rsp+2CH] do-not-enreg[] V06.value(offs=0x04) P-DEP "field V06.value (fldOffset=0x4)"
+; V16 tmp12 [V16,T00] ( 5, 14 ) bool -> rcx "V00.[000..001)"
+; V17 cse0 [V17,T04] ( 3, 12 ) int -> rax "CSE - aggressive"
+; V18 rat0 [V18,T05] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
;
-; Lcl frame size = 40
+; Lcl frame size = 56
G_M24602_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
- sub rsp, 40
- mov qword ptr [rsp+30H], rcx
- ;; size=9 bbWeight=1 PerfScore 1.25
+ sub rsp, 56
+ xor eax, eax
+ mov qword ptr [rsp+30H], rax
+ mov qword ptr [rsp+28H], rax
+ mov qword ptr [rsp+40H], rcx
+ ;; size=21 bbWeight=1 PerfScore 3.50
G_M24602_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- cmp byte ptr [rsp+30H], 0
+ movzx rcx, byte ptr [rsp+40H]
+ test ecx, ecx
jne SHORT G_M24602_IG05
- ;; size=7 bbWeight=1 PerfScore 3.00
+ ;; size=9 bbWeight=1 PerfScore 2.25
G_M24602_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
G_M24602_IG04: ; bbWeight=0.50, epilog, nogc, extend
- add rsp, 40
+ add rsp, 56
ret
;; size=5 bbWeight=0.50 PerfScore 0.62
G_M24602_IG05: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
- xor ecx, ecx
- ;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- movzx rax, byte ptr [rsp+30H]
- mov edx, dword ptr [rsp+34H]
- test al, al
- jne SHORT G_M24602_IG08
- ;; size=13 bbWeight=4 PerfScore 13.00
-G_M24602_IG07: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
xor r8d, r8d
- xor r9d, r9d
- jmp SHORT G_M24602_IG09
- ;; size=8 bbWeight=2 PerfScore 5.00
-G_M24602_IG08: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
- mov r8d, 0xD1FFAB1E
- mov eax, r8d
- imul edx:eax, edx
- mov r9d, edx
- shr r9d, 31
- sar edx, 2
- add r9d, edx
- mov r8d, 1
- ;; size=30 bbWeight=2 PerfScore 10.50
-G_M24602_IG09: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- mov byte ptr [rsp+30H], r8b
- mov dword ptr [rsp+34H], r9d
- inc ecx
+ ;; size=3 bbWeight=0.50 PerfScore 0.12
+G_M24602_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+ mov byte ptr [rsp+30H], cl
+ mov eax, dword ptr [rsp+44H]
+ mov dword ptr [rsp+34H], eax
cmp byte ptr [rsp+30H], 0
+ jne SHORT G_M24602_IG08
+ ;; size=19 bbWeight=4 PerfScore 24.00
+G_M24602_IG07: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+ xor eax, eax
+ mov qword ptr [rsp+28H], rax
+ jmp SHORT G_M24602_IG09
+ ;; size=9 bbWeight=2 PerfScore 6.50
+G_M24602_IG08: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+ mov edx, 0xD1FFAB1E
+ mov eax, edx
+ imul edx:eax, dword ptr [rsp+34H]
+ mov ecx, edx
+ shr ecx, 31
+ sar edx, 2
+ add edx, ecx
+ mov byte ptr [rsp+28H], 1
+ mov dword ptr [rsp+2CH], edx
+ ;; size=30 bbWeight=2 PerfScore 18.00
+G_M24602_IG09: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+ movzx rcx, byte ptr [rsp+28H]
+ mov eax, dword ptr [rsp+2CH]
+ mov dword ptr [rsp+44H], eax
+ inc r8d
+ test ecx, ecx
je SHORT G_M24602_IG12
- cmp dword ptr [rsp+34H], 0
+ test eax, eax
jg SHORT G_M24602_IG06
- ;; size=26 bbWeight=4 PerfScore 33.00
+ ;; size=24 bbWeight=4 PerfScore 23.00
G_M24602_IG10: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
- mov eax, ecx
- ;; size=2 bbWeight=0.50 PerfScore 0.12
+ mov eax, r8d
+ ;; size=3 bbWeight=0.50 PerfScore 0.12
G_M24602_IG11: ; bbWeight=0.50, epilog, nogc, extend
- add rsp, 40
+ add rsp, 56
ret
;; size=5 bbWeight=0.50 PerfScore 0.62
G_M24602_IG12: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
@@ -88,7 +97,7 @@ G_M24602_IG12: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {
int3
;; size=7 bbWeight=0 PerfScore 0.00
-; Total bytes of code 116, prolog size 9, PerfScore 78.98, instruction count 35, allocated bytes for code 116 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
+; Total bytes of code 137, prolog size 21, PerfScore 92.58, instruction count 42, allocated bytes for code 137 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
; ============================================================

Promotions:

Accesses for V00
bool @ 000
#: (2, 200)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
[000..008) as System.Nullable`1[int]
#: (2, 200)
# assigned from: (1, 100)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
int @ 004
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
Picking promotions for V00
Evaluating access bool @ 000
Single write-back cost: 3
Write backs: 0
Read backs: 100
Cost with: 400
Cost without: 600
Promoting replacement
lvaGrabTemp returning 16 (V16 tmp12) (a long lifetime temp) called for V00.[000..001).
Evaluating access int @ 004
Single write-back cost: 3
Write backs: 0
Read backs: 100
Cost with: 350
Cost without: 300
Disqualifying replacement
V00 promoted with 1 replacements
[000..001) promoted as bool V16
Computing unpromoted remainder for V00
Remainder: [004..008)

Two problems and a comment here:
STMT00003 ( 0x00D[E-] ... 0x00E )
[000009] DA--------- ▌ STORE_LCL_VAR struct<System.Nullable`1, 8>(P) V02 loc1
▌ bool V02.<unknown class>:hasValue (offs=0x00) -> V08 tmp4
▌ int V02.<unknown class>:value (offs=0x04) -> V09 tmp5
[000008] ----------- └──▌ LCL_VAR struct<System.Nullable`1, 8> V00 arg0 (last use)

The extra promotion would allow much cleaner decomposition. We could do something simple and assume that overlapping struct assignments would have their cost decreased a bit by promoting fields; we could also do something smarter and track all assigned locals in a union-find data structure, which will allow us to query the sets of structs for which it would be smart to promote fields together.

Processing block operation [000009] that involves replacements
V08 (field V02.hasValue (fldOffset=0x0)) <- V16 (V00.[000..001)) (last use)
Remainder: [004..008)
=> Remainder strategy: int at +004
Local V00 should not be enregistered because: was accessed as a local field
Local V02 should not be enregistered because: was accessed as a local field
New statement:
STMT00003 ( 0x00D[E-] ... 0x00E )
[000090] -A--------- ▌ COMMA void
[000087] DA--------- ├──▌ STORE_LCL_VAR bool V08 tmp4
[000086] ----------- │ └──▌ LCL_VAR bool V16 tmp12 (last use)
[000089] UA--------- └──▌ STORE_LCL_FLD int (P) V02 loc1 [+4]
▌ bool V02.<unknown class>:hasValue (offs=0x00) -> V08 tmp4
▌ int V02.<unknown class>:value (offs=0x04) -> V09 tmp5
[000088] ----------- └──▌ LCL_FLD int V00 arg0 [+4]

The expected decomposition should be to

If we do force physical promotion to promote

@@ -7,98 +7,93 @@
; 0 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-; V00 arg0 [V00,T00] ( 9, 27 ) struct ( 8) [rsp+30H] do-not-enreg[SF] ld-addr-op single-def
-; V01 loc0 [V01,T02] ( 4, 9 ) int -> rcx
+; V00 arg0 [V00,T06] ( 4, 4 ) struct ( 8) [rsp+30H] do-not-enreg[SF] ld-addr-op single-def
+; V01 loc0 [V01,T03] ( 4, 9 ) int -> r8
;* V02 loc1 [V02 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
;* V03 loc2 [V03 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "NewObj constructor temp"
;* V06 tmp2 [V06 ] ( 0, 0 ) struct ( 8) zero-ref
;* V07 tmp3 [V07 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
-; V08 tmp4 [V08,T05] ( 2, 8 ) bool -> rax V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)"
-; V09 tmp5 [V09,T06] ( 2, 6 ) int -> rdx V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)"
+;* V08 tmp4 [V08 ] ( 0, 0 ) bool -> zero-ref V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)"
+; V09 tmp5 [V09,T07] ( 2, 6 ) int -> rdx V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)"
;* V10 tmp6 [V10 ] ( 0, 0 ) bool -> zero-ref V03.hasValue(offs=0x00) P-INDEP "field V03.hasValue (fldOffset=0x0)"
;* V11 tmp7 [V11 ] ( 0, 0 ) int -> zero-ref V03.value(offs=0x04) P-INDEP "field V03.value (fldOffset=0x4)"
-;* V12 tmp8 [V12,T08] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
-; V13 tmp9 [V13,T07] ( 2, 4 ) int -> r9 V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
-; V14 tmp10 [V14,T03] ( 3, 8 ) bool -> r8 V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)"
-; V15 tmp11 [V15,T04] ( 3, 8 ) int -> r9 V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)"
-; V16 rat0 [V16,T01] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
+;* V12 tmp8 [V12,T09] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
+; V13 tmp9 [V13,T08] ( 2, 4 ) int -> rdx V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
+; V14 tmp10 [V14,T04] ( 3, 8 ) bool -> rcx V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)"
+; V15 tmp11 [V15,T05] ( 3, 8 ) int -> rdx V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)"
+; V16 tmp12 [V16,T00] ( 5, 14 ) bool -> rcx "V00.[000..001)"
+; V17 tmp13 [V17,T01] ( 4, 13 ) int -> rdx "V00.[004..008)"
+; V18 rat0 [V18,T02] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 40
-G_M24602_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+G_M24602_IG01: ;; offset=0000H
sub rsp, 40
mov qword ptr [rsp+30H], rcx
;; size=9 bbWeight=1 PerfScore 1.25
-G_M24602_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- cmp byte ptr [rsp+30H], 0
+G_M24602_IG02: ;; offset=0009H
+ movzx rcx, byte ptr [rsp+30H]
+ mov edx, dword ptr [rsp+34H]
+ test ecx, ecx
jne SHORT G_M24602_IG05
- ;; size=7 bbWeight=1 PerfScore 3.00
-G_M24602_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+ ;; size=13 bbWeight=1 PerfScore 3.25
+G_M24602_IG03: ;; offset=0016H
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG04: ; bbWeight=0.50, epilog, nogc, extend
+G_M24602_IG04: ;; offset=0018H
add rsp, 40
ret
;; size=5 bbWeight=0.50 PerfScore 0.62
-G_M24602_IG05: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
- xor ecx, ecx
- ;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- movzx rax, byte ptr [rsp+30H]
- mov edx, dword ptr [rsp+34H]
- test al, al
- jne SHORT G_M24602_IG08
- ;; size=13 bbWeight=4 PerfScore 13.00
-G_M24602_IG07: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+G_M24602_IG05: ;; offset=001DH
xor r8d, r8d
- xor r9d, r9d
+ align [0 bytes for IG06]
+ ;; size=3 bbWeight=0.50 PerfScore 0.12
+G_M24602_IG06: ;; offset=0020H
+ test ecx, ecx
+ jne SHORT G_M24602_IG08
+ ;; size=4 bbWeight=4 PerfScore 5.00
+G_M24602_IG07: ;; offset=0024H
+ xor ecx, ecx
+ xor edx, edx
jmp SHORT G_M24602_IG09
- ;; size=8 bbWeight=2 PerfScore 5.00
-G_M24602_IG08: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
- mov r8d, 0xD1FFAB1E
- mov eax, r8d
- imul edx:eax, edx
- mov r9d, edx
- shr r9d, 31
- sar edx, 2
- add r9d, edx
- mov r8d, 1
- ;; size=30 bbWeight=2 PerfScore 10.50
-G_M24602_IG09: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- mov byte ptr [rsp+30H], r8b
- mov dword ptr [rsp+34H], r9d
- inc ecx
- cmp byte ptr [rsp+30H], 0
- je SHORT G_M24602_IG12
- cmp dword ptr [rsp+34H], 0
- jg SHORT G_M24602_IG06
- ;; size=26 bbWeight=4 PerfScore 33.00
-G_M24602_IG10: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+ ;; size=6 bbWeight=2 PerfScore 5.00
+G_M24602_IG08: ;; offset=002AH
+ mov ecx, 0x66666667
mov eax, ecx
- ;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG11: ; bbWeight=0.50, epilog, nogc, extend
+ imul edx:eax, edx
+ mov eax, edx
+ shr eax, 31
+ sar edx, 2
+ add edx, eax
+ mov ecx, 1
+ ;; size=24 bbWeight=2 PerfScore 10.50
+G_M24602_IG09: ;; offset=0042H
+ movzx rcx, cl
+ inc r8d
+ test ecx, ecx
+ je SHORT G_M24602_IG12
+ test edx, edx
+ jg SHORT G_M24602_IG06
+ ;; size=14 bbWeight=4 PerfScore 12.00
+G_M24602_IG10: ;; offset=0050H
+ mov eax, r8d
+ ;; size=3 bbWeight=0.50 PerfScore 0.12
+G_M24602_IG11: ;; offset=0053H
add rsp, 40
ret
;; size=5 bbWeight=0.50 PerfScore 0.62
-G_M24602_IG12: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
+G_M24602_IG12: ;; offset=0058H
call [System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_NoValue()]
- ; gcr arg pop 0
int3
;; size=7 bbWeight=0 PerfScore 0.00
-; Total bytes of code 116, prolog size 9, PerfScore 78.98, instruction count 35, allocated bytes for code 116 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
+; Total bytes of code 95, prolog size 9, PerfScore 48.13, instruction count 35, allocated bytes for code 95 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int

which is smaller code and a much better perf score.
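The union-find idea mentioned earlier in this comment could look roughly like the sketch below. It is only an illustration, not the JIT's implementation: the `UnionFind` class, the use of plain integers for local numbers, and the second assignment pair are all hypothetical. Only the `V02 = V00` pair comes from STMT00003 in the dump above.

```python
class UnionFind:
    """Disjoint sets over struct local numbers (e.g. 0 for V00, 2 for V02)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen locals lazily as singleton sets.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def groups(self):
        # Collect the sets of locals connected by struct assignments.
        out = {}
        for x in list(self.parent):
            out.setdefault(self.find(x), set()).add(x)
        return list(out.values())


# (dst, src) pairs of struct-to-struct assignments.
# (2, 0) is V02 = V00 from the dump; (3, 6) is a made-up extra pair.
struct_assignments = [(2, 0), (3, 6)]

uf = UnionFind()
for dst, src in struct_assignments:
    uf.union(dst, src)

related = uf.groups()
```

Each resulting group is a candidate set of structs whose field promotions a heuristic might want to decide together, so that an assignment such as `V02 = V00` decomposes into matching field-to-field copies.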
(edit: not expected to be handled) +16 (+21.33%) : 21081.dasm - System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this

@@ -8,7 +8,7 @@
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
-;* V01 loc0 [V01,T01] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op
+;* V01 loc0 [V01 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
@@ -16,11 +16,14 @@
;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
;* V08 tmp6 [V08 ] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
-;* V09 tmp7 [V09,T04] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V10 tmp8 [V10,T05] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
-;* V11 tmp9 [V11,T06] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
-; V12 cse0 [V12,T02] ( 3, 3 ) simd8 -> mm0 "CSE - aggressive"
-; V13 cse1 [V13,T03] ( 3, 2 ) simd8 -> mm1 "CSE - aggressive"
+;* V09 tmp7 [V09,T03] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
+;* V10 tmp8 [V10,T04] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
+;* V11 tmp9 [V11,T05] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
+;* V12 tmp10 [V12 ] ( 0, 0 ) simd8 -> zero-ref single-def "V01.[000..008)"
+;* V13 tmp11 [V13,T06] ( 0, 0 ) simd8 -> zero-ref single-def "V01.[008..016)"
+;* V14 tmp12 [V14,T07] ( 0, 0 ) simd8 -> zero-ref single-def "V01.[016..024)"
+; V15 cse0 [V15,T01] ( 2, 2 ) simd8 -> mm0 "CSE - aggressive"
+; V16 cse1 [V16,T02] ( 2, 1.50) simd8 -> mm1 "CSE - aggressive"
;
; Lcl frame size = 0
@@ -30,12 +33,14 @@ G_M64376_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
G_M64376_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
vmovsd xmm0, qword ptr [reloc @RWD00]
vmovsd xmm1, qword ptr [reloc @RWD08]
- vcmpps k1, xmm0, xmm0, 4
+ vmovsd xmm2, qword ptr [reloc @RWD00]
+ vcmpps k1, xmm0, xmm2, 4
kortestb k1, k1
jne SHORT G_M64376_IG04
- ;; size=29 bbWeight=1 PerfScore 11.00
+ ;; size=37 bbWeight=1 PerfScore 14.00
G_M64376_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- vcmpps k1, xmm1, xmm1, 4
+ vmovsd xmm0, qword ptr [reloc @RWD08]
+ vcmpps k1, xmm1, xmm0, 4
kortestb k1, k1
jne SHORT G_M64376_IG04
vxorps xmm0, xmm0, xmm0
@@ -45,7 +50,7 @@ G_M64376_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byr
sete al
movzx rax, al
jmp SHORT G_M64376_IG05
- ;; size=40 bbWeight=0.50 PerfScore 6.46
+ ;; size=48 bbWeight=0.50 PerfScore 7.96
G_M64376_IG04: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
@@ -56,7 +61,7 @@ RWD00 dq 000000003F800000h
RWD08 dq 3F80000000000000h
-; Total bytes of code 75, prolog size 3, PerfScore 27.68, instruction count 18, allocated bytes for code 81 (MethodHash=42890487) for method System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this
+; Total bytes of code 91, prolog size 3, PerfScore 33.78, instruction count 20, allocated bytes for code 97 (MethodHash=42890487) for method System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this

Replacements:

Accesses for V01
[000..024) as System.Numerics.Matrix3x2
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
simd8 @ 000
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
simd8 @ 008
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
simd8 @ 016
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
Picking promotions for V01
Evaluating access simd8 @ 000
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 12 (V12 tmp10) (a long lifetime temp) called for V01.[000..008).
Evaluating access simd8 @ 008
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 13 (V13 tmp11) (a long lifetime temp) called for V01.[008..016).
Evaluating access simd8 @ 016
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 14 (V14 tmp12) (a long lifetime temp) called for V01.[016..024).
V01 promoted with 3 replacements
[000..008) promoted as simd8 V12
[008..016) promoted as simd8 V13
[016..024) promoted as simd8 V14
Computing unpromoted remainder for V01
Remainder: <empty>

Physical promotion means we replace a
(edit: tracked by #87554) +18 (+21.95%) : 17866.dasm - System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this

@@ -8,16 +8,20 @@
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
-; V01 loc0 [V01,T00] ( 6, 6 ) struct (24) [rsp+00H] do-not-enreg[SF] must-init ld-addr-op
+;* V01 loc0 [V01 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
-;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
+; V04 tmp2 [V04 ] ( 7, 7 ) struct (24) [rsp+00H] do-not-enreg[SF] must-init ld-addr-op "Inline ldloca(s) first use temp"
;* V05 tmp3 [V05 ] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
;* V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
-;* V08 tmp6 [V08,T01] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V09 tmp7 [V09,T02] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
-;* V10 tmp8 [V10,T03] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
+; V08 tmp6 [V08,T00] ( 5, 5 ) simd8 -> [rsp+00H] do-not-enreg[S] single-def V04.X(offs=0x00) P-DEP "field V04.X (fldOffset=0x0)"
+; V09 tmp7 [V09,T01] ( 5, 5 ) simd8 -> [rsp+08H] do-not-enreg[S] single-def V04.Y(offs=0x08) P-DEP "field V04.Y (fldOffset=0x8)"
+; V10 tmp8 [V10,T02] ( 5, 5 ) simd8 -> [rsp+10H] do-not-enreg[S] single-def V04.Z(offs=0x10) P-DEP "field V04.Z (fldOffset=0x10)"
+; V11 tmp9 [V11,T03] ( 2, 2 ) float -> mm0 single-def "V01.[000..004)"
+; V12 tmp10 [V12,T04] ( 2, 2 ) float -> mm1 single-def "V01.[004..008)"
+; V13 tmp11 [V13,T05] ( 2, 2 ) float -> mm2 single-def "V01.[008..012)"
+; V14 tmp12 [V14,T06] ( 2, 2 ) float -> mm3 single-def "V01.[012..016)"
;
; Lcl frame size = 24
@@ -34,12 +38,16 @@ G_M33935_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
vmovsd qword ptr [rsp], xmm0
vmovsd xmm0, qword ptr [reloc @RWD08]
vmovsd qword ptr [rsp+08H], xmm0
+ vxorps xmm0, xmm0, xmm0
+ vmovsd qword ptr [rsp+10H], xmm0
vmovss xmm0, dword ptr [rsp]
- vmulss xmm0, xmm0, dword ptr [rsp+0CH]
- vmovss xmm1, dword ptr [rsp+08H]
- vmulss xmm1, xmm1, dword ptr [rsp+04H]
+ vmovss xmm1, dword ptr [rsp+04H]
+ vmovss xmm2, dword ptr [rsp+08H]
+ vmovss xmm3, dword ptr [rsp+0CH]
+ vmulss xmm0, xmm0, xmm3
+ vmulss xmm1, xmm2, xmm1
vsubss xmm0, xmm0, xmm1
- ;; size=54 bbWeight=1 PerfScore 27.00
+ ;; size=72 bbWeight=1 PerfScore 30.33
G_M33935_IG03: ; bbWeight=1, epilog, nogc, extend
add rsp, 24
ret
@@ -48,7 +56,7 @@ RWD00 dq 000000003F800000h
RWD08 dq 3F80000000000000h
-; Total bytes of code 82, prolog size 23, PerfScore 41.28, instruction count 17, allocated bytes for code 82 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
+; Total bytes of code 100, prolog size 23, PerfScore 46.42, instruction count 21, allocated bytes for code 100 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this

Replacements:

Accesses for V01
[000..024) as System.Numerics.Matrix3x2
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
float @ 000
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
float @ 004
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
float @ 008
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
float @ 012
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (0, 0)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
Picking promotions for V01
Evaluating access float @ 000
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 11 (V11 tmp9) (a long lifetime temp) called for V01.[000..004).
Evaluating access float @ 004
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 12 (V12 tmp10) (a long lifetime temp) called for V01.[004..008).
Evaluating access float @ 008
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 13 (V13 tmp11) (a long lifetime temp) called for V01.[008..012).
Evaluating access float @ 012
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 14 (V14 tmp12) (a long lifetime temp) called for V01.[012..016).
V01 promoted with 4 replacements
[000..004) promoted as float V11
[004..008) promoted as float V12
[008..012) promoted as float V13
[012..016) promoted as float V14
Computing unpromoted remainder for V01
Remainder: [016..024)

We end up creating IR that DNERs V04 (marks it do-not-enregister):

STMT00001 ( 0x000[E-] ... ??? )
[000017] DA--G------ ▌ STORE_LCL_VAR struct<System.Numerics.Matrix3x2, 24> V01 loc0
[000031] ----------- └──▌ LCL_VAR struct<System.Numerics.Matrix3x2+Impl, 24>(P) V04 tmp2
▌ simd8 V04.<unknown class>:X (offs=0x00) -> V08 tmp6 (last use)
▌ simd8 V04.<unknown class>:Y (offs=0x08) -> V09 tmp7 (last use)
▌ simd8 V04.<unknown class>:Z (offs=0x10) -> V10 tmp8 (last use)
Processing block operation [000017] that involves replacements
V11 (V01.[000..004)) <- src+000
V12 (V01.[004..008)) <- src+004
V13 (V01.[008..012)) <- src+008
V14 (V01.[012..016)) <- src+012
=> Remainder strategy: do nothing (remainder dying)
Local V04 should not be enregistered because: was accessed as a local field
Local V04 should not be enregistered because: was accessed as a local field
Local V04 should not be enregistered because: was accessed as a local field
Local V04 should not be enregistered because: was accessed as a local field
New statement:
STMT00001 ( 0x000[E-] ... ??? )
[000075] -A--------- ▌ COMMA void
[000066] DA--------- ├──▌ STORE_LCL_VAR float V11 tmp9
[000065] ----------- │ └──▌ LCL_FLD float (P) V04 tmp2 [+0]
│ ▌ simd8 V04.<unknown class>:X (offs=0x00) -> V08 tmp6
│ ▌ simd8 V04.<unknown class>:Y (offs=0x08) -> V09 tmp7
│ ▌ simd8 V04.<unknown class>:Z (offs=0x10) -> V10 tmp8
[000074] -A--------- └──▌ COMMA void
[000068] DA--------- ├──▌ STORE_LCL_VAR float V12 tmp10
[000067] ----------- │ └──▌ LCL_FLD float (P) V04 tmp2 [+4]
│ ▌ simd8 V04.<unknown class>:X (offs=0x00) -> V08 tmp6
│ ▌ simd8 V04.<unknown class>:Y (offs=0x08) -> V09 tmp7
│ ▌ simd8 V04.<unknown class>:Z (offs=0x10) -> V10 tmp8
[000073] -A--------- └──▌ COMMA void
[000070] DA--------- ├──▌ STORE_LCL_VAR float V13 tmp11
[000069] ----------- │ └──▌ LCL_FLD float (P) V04 tmp2 [+8]
│ ▌ simd8 V04.<unknown class>:X (offs=0x00) -> V08 tmp6
│ ▌ simd8 V04.<unknown class>:Y (offs=0x08) -> V09 tmp7
│ ▌ simd8 V04.<unknown class>:Z (offs=0x10) -> V10 tmp8
[000072] DA--------- └──▌ STORE_LCL_VAR float V14 tmp12
[000071] ----------- └──▌ LCL_FLD float (P) V04 tmp2 [+12]
▌ simd8 V04.<unknown class>:X (offs=0x00) -> V08 tmp6
▌ simd8 V04.<unknown class>:Y (offs=0x08) -> V09 tmp7
▌ simd8 V04.<unknown class>:Z (offs=0x10) -> V10 tmp8

This is missing

[000083] -A--------- ▌ COMMA void
[000068] DA--------- ├──▌ STORE_LCL_VAR float V11 tmp9
[000067] ----------- │ └──▌ HWINTRINSIC float float ToScalar
[000066] ----------- │ └──▌ LCL_VAR simd8 <System.Numerics.Vector2> V08 tmp6
[000082] -A--------- └──▌ COMMA void
[000072] DA--------- ├──▌ STORE_LCL_VAR float V12 tmp10
[000071] ----------- │ └──▌ HWINTRINSIC float float GetElement
[000070] ----------- │ ├──▌ LCL_VAR simd8 <System.Numerics.Vector2> V08 tmp6
[000069] ----------- │ └──▌ CNS_INT int 1
[000081] -A--------- └──▌ COMMA void
[000076] DA--------- ├──▌ STORE_LCL_VAR float V13 tmp11
[000075] ----------- │ └──▌ HWINTRINSIC float float ToScalar
[000074] ----------- │ └──▌ LCL_VAR simd8 <System.Numerics.Vector2> V09 tmp7
[000080] DA--------- └──▌ STORE_LCL_VAR float V14 tmp12
[000079] ----------- └──▌ HWINTRINSIC float float GetElement
[000078] ----------- ├──▌ LCL_VAR simd8 <System.Numerics.Vector2> V09 tmp7
[000077] ----------- └──▌ CNS_INT int 1

Hacking this in, we end up folding the entire benchmark to a constant:

; Assembly listing for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
;* V01 loc0 [V01 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
;* V05 tmp3 [V05 ] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
;* V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
;* V08 tmp6 [V08,T00] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V09 tmp7 [V09,T01] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
;* V10 tmp8 [V10 ] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
;* V11 tmp9 [V11,T02] ( 0, 0 ) float -> zero-ref single-def "V01.[000..004)"
;* V12 tmp10 [V12,T03] ( 0, 0 ) float -> zero-ref single-def "V01.[004..008)"
;* V13 tmp11 [V13,T04] ( 0, 0 ) float -> zero-ref single-def "V01.[008..012)"
;* V14 tmp12 [V14,T05] ( 0, 0 ) float -> zero-ref single-def "V01.[012..016)"
;
; Lcl frame size = 0
G_M33935_IG01: ;; offset=0000H
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M33935_IG02: ;; offset=0003H
vmovss xmm0, dword ptr [reloc @RWD00]
;; size=8 bbWeight=1 PerfScore 3.00
G_M33935_IG03: ;; offset=000BH
ret
;; size=1 bbWeight=1 PerfScore 1.00
RWD00 dd 3F800000h ; 1
; Total bytes of code 12, prolog size 3, PerfScore 6.20, instruction count 3, allocated bytes for code 12 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
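The improved decomposition above reads each float replacement out of the already-promoted `simd8` field that covers it, using `ToScalar` for lane 0 and `GetElement` for the other lane. A minimal sketch of that offset-to-lane mapping follows. The `SimdField` type and `scalar_source` function are hypothetical names for illustration, not JIT APIs; the field layout (V08, V09, V10 at offsets 0, 8, 16) is taken from the dump.

```python
from typing import NamedTuple, Optional, Tuple


class SimdField(NamedTuple):
    local: str   # promoted field local, e.g. "V08"
    offset: int  # byte offset of the field in the parent struct
    size: int    # byte size of the field (8 for simd8)


def scalar_source(fields, offset, elem_size=4) -> Optional[Tuple[str, str, int]]:
    """Find the SIMD field and lane that cover [offset, offset + elem_size)."""
    for f in fields:
        inside = f.offset <= offset and offset + elem_size <= f.offset + f.size
        if not inside:
            continue
        rel = offset - f.offset
        if rel % elem_size != 0:
            return None  # straddles two lanes; no single-lane extraction
        lane = rel // elem_size
        op = "ToScalar" if lane == 0 else "GetElement"
        return (op, f.local, lane)
    return None  # not covered by any promoted SIMD field


# Promoted fields of V04 (Matrix3x2+Impl) as listed in the dump.
v04_fields = [
    SimdField("V08", 0, 8),
    SimdField("V09", 8, 8),
    SimdField("V10", 16, 8),
]

# Float replacements of V01 live at offsets 0, 4, 8 and 12.
sources = [scalar_source(v04_fields, off) for off in (0, 4, 8, 12)]
```

With this mapping the source fields stay independently promoted, so nothing forces V04 onto the stack the way the `LCL_FLD` accesses do.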
(edit: not expected to be handled) +44 (+27.67%) : 1550.dasm - System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this

@@ -8,55 +8,63 @@
; Final local variable assignments
;
; V00 this [V00,T00] ( 12, 12 ) byref -> rcx this single-def
-; V01 RetBuf [V01,T02] ( 4, 4 ) byref -> rbx single-def
-; V02 loc0 [V02,T01] ( 12, 12 ) struct (56) [rsp+08H] do-not-enreg[SF] ld-addr-op
+; V01 RetBuf [V01,T01] ( 12, 12 ) byref -> rbx single-def
+; V02 loc0 [V02,T02] ( 4, 4 ) struct (56) [rsp+10H] do-not-enreg[SF] ld-addr-op
;# V03 OutArgs [V03 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
+; V04 tmp1 [V04,T03] ( 2, 2 ) long -> rbp "V02.[000..008)"
+; V05 tmp2 [V05,T04] ( 2, 2 ) long -> r14 "V02.[008..016)"
+; V06 tmp3 [V06,T05] ( 2, 2 ) bool -> r15 "V02.[016..017)"
+; V07 tmp4 [V07,T06] ( 2, 2 ) bool -> r12 "V02.[017..018)"
+; V08 tmp5 [V08,T07] ( 2, 2 ) bool -> r13 "V02.[018..019)"
+; V09 tmp6 [V09,T08] ( 2, 2 ) bool -> [rsp+0CH] spill-single-def "V02.[019..020)"
+; V10 tmp7 [V10,T09] ( 2, 2 ) ubyte -> [rsp+08H] spill-single-def "V02.[020..021)"
+; V11 tmp8 [V11,T10] ( 2, 2 ) ubyte -> [rsp+04H] spill-single-def "V02.[021..022)"
;
-; Lcl frame size = 64
+; Lcl frame size = 72
G_M2776_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+ push r15
+ push r14
+ push r13
+ push r12
push rdi
push rsi
+ push rbp
push rbx
- sub rsp, 64
+ sub rsp, 72
vzeroupper
mov rbx, rdx
; byrRegs +[rbx]
- ;; size=13 bbWeight=1 PerfScore 4.50
+ ;; size=22 bbWeight=1 PerfScore 9.50
G_M2776_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=000A {rcx rbx}, byref
; byrRegs +[rcx]
vxorps ymm0, ymm0, ymm0
- vmovdqu ymmword ptr [rsp+08H], ymm0
- vmovdqu ymmword ptr [rsp+20H], ymm0
- mov rax, qword ptr [rcx]
- mov qword ptr [rsp+08H], rax
- mov rax, qword ptr [rcx+08H]
- mov qword ptr [rsp+10H], rax
- movzx rax, byte ptr [rcx+26H]
- mov byte ptr [rsp+18H], al
- movzx rax, byte ptr [rcx+27H]
- mov byte ptr [rsp+19H], al
- movzx rax, byte ptr [rcx+2EH]
- mov byte ptr [rsp+1AH], al
+ vmovdqu ymmword ptr [rsp+10H], ymm0
+ vmovdqu ymmword ptr [rsp+28H], ymm0
+ mov rbp, qword ptr [rcx]
+ mov r14, qword ptr [rcx+08H]
+ movzx r15, byte ptr [rcx+26H]
+ movzx r12, byte ptr [rcx+27H]
+ movzx r13, byte ptr [rcx+2EH]
movzx rax, byte ptr [rcx+2CH]
- mov byte ptr [rsp+1BH], al
- movzx rax, byte ptr [rcx+28H]
- mov byte ptr [rsp+1CH], al
- movzx rax, byte ptr [rcx+29H]
- mov byte ptr [rsp+1DH], al
- mov rax, qword ptr [rcx+40H]
- mov qword ptr [rsp+20H], rax
- ;; size=90 bbWeight=1 PerfScore 29.33
+ mov dword ptr [rsp+0CH], eax
+ movzx rdx, byte ptr [rcx+28H]
+ mov dword ptr [rsp+08H], edx
+ movzx r8, byte ptr [rcx+29H]
+ mov dword ptr [rsp+04H], r8d
+ mov r9, qword ptr [rcx+40H]
+ mov qword ptr [rsp+28H], r9
+ ;; size=73 bbWeight=1 PerfScore 24.33
G_M2776_IG03: ; bbWeight=1, nogc, extend
vmovdqu xmm0, xmmword ptr [rcx+48H]
- vmovdqu xmmword ptr [rsp+28H], xmm0
- mov rax, qword ptr [rcx+58H]
- mov qword ptr [rsp+38H], rax
+ vmovdqu xmmword ptr [rsp+30H], xmm0
+ mov r9, qword ptr [rcx+58H]
+ mov qword ptr [rsp+40H], r9
;; size=20 bbWeight=1 PerfScore 8.00
G_M2776_IG04: ; bbWeight=1, extend
mov rdi, rbx
; byrRegs +[rdi]
- lea rsi, bword ptr [rsp+08H]
+ lea rsi, bword ptr [rsp+10H]
; byrRegs +[rsi]
mov ecx, 4
; byrRegs -[rcx]
@@ -64,18 +72,34 @@ G_M2776_IG04: ; bbWeight=1, extend
call CORINFO_HELP_ASSIGN_BYREF
movsq
movsq
+ mov qword ptr [rbx], rbp
+ mov qword ptr [rbx+08H], r14
+ mov byte ptr [rbx+10H], r15b
+ mov byte ptr [rbx+11H], r12b
+ mov byte ptr [rbx+12H], r13b
+ mov ebp, dword ptr [rsp+0CH]
+ mov byte ptr [rbx+13H], bpl
+ mov ebp, dword ptr [rsp+08H]
+ mov byte ptr [rbx+14H], bpl
+ mov ebp, dword ptr [rsp+04H]
+ mov byte ptr [rbx+15H], bpl
mov rax, rbx
; byrRegs +[rax]
- ;; size=28 bbWeight=1 PerfScore 29.25
+ ;; size=71 bbWeight=1 PerfScore 40.25
G_M2776_IG05: ; bbWeight=1, epilog, nogc, extend
- add rsp, 64
+ add rsp, 72
pop rbx
+ pop rbp
pop rsi
pop rdi
+ pop r12
+ pop r13
+ pop r14
+ pop r15
ret
- ;; size=8 bbWeight=1 PerfScore 2.75
+ ;; size=17 bbWeight=1 PerfScore 5.25
-; Total bytes of code 159, prolog size 10, PerfScore 89.73, instruction count 44, allocated bytes for code 159 (MethodHash=d49af527) for method System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this
+; Total bytes of code 203, prolog size 19, PerfScore 107.63, instruction count 60, allocated bytes for code 203 (MethodHash=d49af527) for method System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this

Replacements:

Accesses for V02
[000..056) as System.Text.Json.JsonReaderState
#: (2, 200)
# assigned from: (1, 100)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
long @ 000
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
long @ 008
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
bool @ 016
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
bool @ 017
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
bool @ 018
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
bool @ 019
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
ubyte @ 020
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
ubyte @ 021
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
[024..032) as System.Text.Json.JsonReaderOptions
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
[032..056) as System.Text.Json.BitStack
#: (1, 100)
# assigned from: (0, 0)
# assigned to: (1, 100)
# as call arg: (0, 0)
# as retbuf: (0, 0)
# as returned value: (0, 0)
Picking promotions for V02
Evaluating access long @ 000
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 4 (V04 tmp1) (a long lifetime temp) called for V02.[000..008).
Evaluating access long @ 008
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 5 (V05 tmp2) (a long lifetime temp) called for V02.[008..016).
Evaluating access bool @ 016
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 6 (V06 tmp3) (a long lifetime temp) called for V02.[016..017).
Evaluating access bool @ 017
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 7 (V07 tmp4) (a long lifetime temp) called for V02.[017..018).
Evaluating access bool @ 018
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 8 (V08 tmp5) (a long lifetime temp) called for V02.[018..019).
Evaluating access bool @ 019
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 9 (V09 tmp6) (a long lifetime temp) called for V02.[019..020).
Evaluating access ubyte @ 020
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 10 (V10 tmp7) (a long lifetime temp) called for V02.[020..021).
Evaluating access ubyte @ 021
Single write-back cost: 3
Write backs: 0
Read backs: 0
Cost with: 50
Cost without: 300
Promoting replacement
lvaGrabTemp returning 11 (V11 tmp8) (a long lifetime temp) called for V02.[021..022).
V02 promoted with 8 replacements
[000..008) promoted as long V04
[008..016) promoted as long V05
[016..017) promoted as bool V06
[017..018) promoted as bool V07
[018..019) promoted as bool V08
[019..020) promoted as bool V09
[020..021) promoted as ubyte V10
[021..022) promoted as ubyte V11
Computing unpromoted remainder for V02
Remainder: [024..056)

This is one of the cases where the heuristic does not take into account that decomposed assignments can be more expensive with many fields, especially considering that we end up spilling some of the fields. We end up with:

Processing block operation [000065] that involves replacements
dst+000 <- V04 (V02.[000..008)) (last use)
dst+008 <- V05 (V02.[008..016)) (last use)
dst+016 <- V06 (V02.[016..017)) (last use)
dst+017 <- V07 (V02.[017..018)) (last use)
dst+018 <- V08 (V02.[018..019)) (last use)
dst+019 <- V09 (V02.[019..020)) (last use)
dst+020 <- V10 (V02.[020..021)) (last use)
dst+021 <- V11 (V02.[021..022)) (last use)
Remainder: [024..056)
=> Remainder strategy: retain a full block op
New statement:
STMT00012 ( 0x08A[E-] ... 0x08B )
[000124] -A-XG------ ▌ COMMA void
[000065] -A-XG------ ├──▌ STORE_BLK struct<System.Text.Json.JsonReaderState, 56> (copy)
[000079] ----------- │ ├──▌ LCL_VAR byref V01 RetBuf
[000063] ----------- │ └──▌ LCL_VAR struct<System.Text.Json.JsonReaderState, 56> V02 loc0
[000123] -A-XG------ └──▌ COMMA void
[000082] -A-XG------ ├──▌ STOREIND long
[000081] ----------- │ ├──▌ LCL_VAR byref V01 RetBuf
[000080] ----------- │ └──▌ LCL_VAR long V04 tmp1 (last use)
[000122] -A-XG------ └──▌ COMMA void
[000087] -A-XG------ ├──▌ STOREIND long
[000086] ----------- │ ├──▌ ADD byref
[000084] ----------- │ │ ├──▌ LCL_VAR byref V01 RetBuf
[000085] ----------- │ │ └──▌ CNS_INT long 8
[000083] ----------- │ └──▌ LCL_VAR long V05 tmp2 (last use)
[000121] -A-XG------ └──▌ COMMA void
[000092] -A-XG------ ├──▌ STOREIND bool
[000091] ----------- │ ├──▌ ADD byref
[000089] ----------- │ │ ├──▌ LCL_VAR byref V01 RetBuf
[000090] ----------- │ │ └──▌ CNS_INT long 16
[000088] ----------- │ └──▌ LCL_VAR bool V06 tmp3 (last use)
[000120] -A-XG------ └──▌ COMMA void
[000097] -A-XG------ ├──▌ STOREIND bool
[000096] ----------- │ ├──▌ ADD byref
[000094] ----------- │ │ ├──▌ LCL_VAR byref V01 RetBuf
[000095] ----------- │ │ └──▌ CNS_INT long 17
[000093] ----------- │ └──▌ LCL_VAR bool V07 tmp4 (last use)
[000119] -A-XG------ └──▌ COMMA void
[000102] -A-XG------ ├──▌ STOREIND bool
[000101] ----------- │ ├──▌ ADD byref
[000099] ----------- │ │ ├──▌ LCL_VAR byref V01 RetBuf
[000100] ----------- │ │ └──▌ CNS_INT long 18
[000098] ----------- │ └──▌ LCL_VAR bool V08 tmp5 (last use)
[000118] -A-XG------ └──▌ COMMA void
[000107] -A-XG------ ├──▌ STOREIND bool
[000106] ----------- │ ├──▌ ADD byref
[000104] ----------- │ │ ├──▌ LCL_VAR byref V01 RetBuf
[000105] ----------- │ │ └──▌ CNS_INT long 19
[000103] ----------- │ └──▌ LCL_VAR bool V09 tmp6 (last use)
[000117] -A-XG------ └──▌ COMMA void
[000112] -A-XG------ ├──▌ STOREIND ubyte
[000111] ----------- │ ├──▌ ADD byref
[000109] ----------- │ │ ├──▌ LCL_VAR byref V01 RetBuf
[000110] ----------- │ │ └──▌ CNS_INT long 20
[000108] ----------- │ └──▌ LCL_VAR ubyte V10 tmp7 (last use)
[000116] -A-XG------ └──▌ STOREIND ubyte
[000115] ----------- ├──▌ ADD byref
[000064] ----------- │ ├──▌ LCL_VAR byref V01 RetBuf
[000114] ----------- │ └──▌ CNS_INT long 21
[000113] ----------- └──▌ LCL_VAR ubyte V11 tmp8 (last use)

to handle the assignment into the ret buffer. We do see some signs of why the promotion could be beneficial, as we are able to keep a bunch of the fields in registers instead of on the stack, but we just don't have enough registers on x64 to do that for all of them. |
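To make the "unpromoted remainder" in the dump above concrete, here is a minimal Python sketch of the interval arithmetic. It is not the JIT's implementation. The function name `subtract_intervals` is hypothetical, and the padding gap at `[022..024)` is an assumption inferred from the dump, where the last replacement ends at 022 but the remainder starts at 024.

```python
def subtract_intervals(significant, promoted):
    """Return the parts of `significant` not covered by `promoted`.

    Both arguments are lists of half-open (start, end) byte ranges.
    """
    remainder = []
    for seg_start, seg_end in sorted(significant):
        cursor = seg_start
        for p_start, p_end in sorted(promoted):
            # Skip replacements that do not overlap the uncovered part.
            if p_end <= cursor or p_start >= seg_end:
                continue
            if p_start > cursor:
                remainder.append((cursor, p_start))
            cursor = max(cursor, p_end)
        if cursor < seg_end:
            remainder.append((cursor, seg_end))
    return remainder


# Assumed layout: a 56-byte struct whose bytes [22..24) are padding.
significant = [(0, 22), (24, 56)]
# The eight replacements listed in the dump for V02.
promoted = [(0, 8), (8, 16), (16, 17), (17, 18),
            (18, 19), (19, 20), (20, 21), (21, 22)]

remainder = subtract_intervals(significant, promoted)
print(remainder)
```

Under these assumptions the result is the single range `[24..56)`, matching the `Remainder: [024..056)` line. A non-empty remainder is what leads to the "retain a full block op" strategy shown in the dump, on top of the eight per-field stores.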
With the perflab runs @cincuranet set up and a query from @AndyAyersMS I can start looking at micro benchmark regressions. The following lists all benchmarks with a ratio below 0.95, indicating that they regress by more than 5%. There are 56 entries in this list (for comparison, the query for benchmarks that improve by more than 5% returns 267 results, but take it with a grain of salt as many of these are noisy). The quality columns are computed as median divided by standard deviation, so larger numbers indicate more stable benchmarks.
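As a rough illustration of the filtering described above, the following Python sketch flags benchmarks whose ratio falls below 0.95 and computes the quality measure as median divided by standard deviation. It is not the actual perflab query. It assumes the ratio is baseline time divided by diff time and that the standard deviation is the sample standard deviation; the helper names and the `Hypothetical.UnchangedBenchmark` entry are invented for illustration. The two real timing pairs are the ones quoted for those benchmarks later in this thread.

```python
from statistics import median, stdev


def ratio(base_ms, diff_ms):
    # Assumption: ratio = baseline / diff, so values below 1 mean slower.
    return base_ms / diff_ms


def regressions(results, threshold=0.95):
    """Names of benchmarks whose ratio is below the threshold."""
    return sorted(name for name, (base, diff) in results.items()
                  if ratio(base, diff) < threshold)


def quality(samples):
    # Larger values indicate a more stable benchmark.
    return median(samples) / stdev(samples)


results = {
    "Perf_Matrix4x4.IsIdentityBenchmark": (225.8, 267.6),
    "Perf_Matrix3x2.MultiplyByMatrixBenchmark": (499.7, 555.9),
    "Hypothetical.UnchangedBenchmark": (100.0, 101.0),  # made-up entry
}

flagged = regressions(results)
print(flagged)
```

With these inputs the two Matrix benchmarks are flagged and the made-up unchanged one is not.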
|
System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark

Same as #76928 (comment):

@@ -92,19 +92,23 @@ G_M3814_IG03: ;; offset=003BH
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
-;* V01 loc0 [V01,T00] ( 0, 0 ) struct (64) zero-ref do-not-enreg[SF] ld-addr-op
+;* V01 loc0 [V01 ] ( 0, 0 ) struct (64) zero-ref do-not-enreg[SF] ld-addr-op
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (64) zero-ref do-not-enreg[S] ld-addr-op "Inline stloc first use temp"
;* V04 tmp2 [V04 ] ( 0, 0 ) struct (64) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V05 tmp3 [V05,T01] ( 3, 2 ) bool -> rax "Inline return value spill temp"
-;* V06 tmp4 [V06,T06] ( 0, 0 ) simd16 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V07 tmp5 [V07,T07] ( 0, 0 ) simd16 -> zero-ref single-def V04.Y(offs=0x10) P-INDEP "field V04.Y (fldOffset=0x10)"
-;* V08 tmp6 [V08,T08] ( 0, 0 ) simd16 -> zero-ref single-def V04.Z(offs=0x20) P-INDEP "field V04.Z (fldOffset=0x20)"
-;* V09 tmp7 [V09,T09] ( 0, 0 ) simd16 -> zero-ref single-def V04.W(offs=0x30) P-INDEP "field V04.W (fldOffset=0x30)"
-; V10 cse0 [V10,T02] ( 3, 3 ) simd16 -> mm0 "CSE - aggressive"
-; V11 cse1 [V11,T03] ( 3, 2 ) simd16 -> mm1 "CSE - aggressive"
-; V12 cse2 [V12,T04] ( 3, 2 ) simd16 -> mm2 "CSE - aggressive"
-; V13 cse3 [V13,T05] ( 3, 2 ) simd16 -> mm3 "CSE - aggressive"
+; V05 tmp3 [V05,T00] ( 3, 2 ) bool -> rax "Inline return value spill temp"
+;* V06 tmp4 [V06,T05] ( 0, 0 ) simd16 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
+;* V07 tmp5 [V07,T06] ( 0, 0 ) simd16 -> zero-ref single-def V04.Y(offs=0x10) P-INDEP "field V04.Y (fldOffset=0x10)"
+;* V08 tmp6 [V08,T07] ( 0, 0 ) simd16 -> zero-ref single-def V04.Z(offs=0x20) P-INDEP "field V04.Z (fldOffset=0x20)"
+;* V09 tmp7 [V09,T08] ( 0, 0 ) simd16 -> zero-ref single-def V04.W(offs=0x30) P-INDEP "field V04.W (fldOffset=0x30)"
+;* V10 tmp8 [V10 ] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[000..016)"
+;* V11 tmp9 [V11,T09] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[016..032)"
+;* V12 tmp10 [V12,T10] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[032..048)"
+;* V13 tmp11 [V13,T11] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[048..064)"
+; V14 cse0 [V14,T01] ( 2, 2 ) simd16 -> mm0 "CSE - aggressive"
+; V15 cse1 [V15,T02] ( 2, 1.50) simd16 -> mm1 "CSE - aggressive"
+; V16 cse2 [V16,T03] ( 2, 1.50) simd16 -> mm2 "CSE - aggressive"
+; V17 cse3 [V17,T04] ( 2, 1.50) simd16 -> mm3 "CSE - aggressive"
;
; Lcl frame size = 0
@@ -116,31 +120,31 @@ G_M3814_IG02: ;; offset=0003H
vmovups xmm1, xmmword ptr [reloc @RWD16]
vmovups xmm2, xmmword ptr [reloc @RWD32]
vmovups xmm3, xmmword ptr [reloc @RWD48]
- vcmpps xmm0, xmm0, xmm0, 0
+ vcmpps xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
vmovmskps rax, xmm0
cmp eax, 15
jne SHORT G_M3814_IG04
- ;; size=46 bbWeight=1 PerfScore 18.25
-G_M3814_IG03: ;; offset=0031H
- vcmpps xmm0, xmm1, xmm1, 0
+ ;; size=50 bbWeight=1 PerfScore 18.25
+G_M3814_IG03: ;; offset=0035H
+ vcmpps xmm0, xmm1, xmmword ptr [reloc @RWD16], 0
vmovmskps rax, xmm0
cmp eax, 15
jne SHORT G_M3814_IG04
- vcmpps xmm0, xmm2, xmm2, 0
+ vcmpps xmm0, xmm2, xmmword ptr [reloc @RWD32], 0
vmovmskps rax, xmm0
cmp eax, 15
jne SHORT G_M3814_IG04
- vcmpps xmm0, xmm3, xmm3, 0
+ vcmpps xmm0, xmm3, xmmword ptr [reloc @RWD48], 0
vmovmskps rax, xmm0
cmp eax, 15
sete al
movzx rax, al
jmp SHORT G_M3814_IG05
- ;; size=48 bbWeight=0.50 PerfScore 10.50
-G_M3814_IG04: ;; offset=0061H
+ ;; size=60 bbWeight=0.50 PerfScore 10.50
+G_M3814_IG04: ;; offset=0071H
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M3814_IG05: ;; offset=0063H
+G_M3814_IG05: ;; offset=0073H
ret
;; size=1 bbWeight=1 PerfScore 1.00
RWD00 dq 000000003F800000h, 0000000000000000h
@@ -149,7 +153,7 @@ RWD32 dq 0000000000000000h, 000000003F800000h
RWD48 dq 0000000000000000h, 3F80000000000000h
-; Total bytes of code 100, prolog size 3, PerfScore 40.88, instruction count 25, allocated bytes for code 100 (MethodHash=8a71f119) for method Program:IsIdentityBenchmark():bool:this
+; Total bytes of code 116, prolog size 3, PerfScore 42.48, instruction count 25, allocated bytes for code 116 (MethodHash=8a71f119) for method Program:IsIdentityBenchmark():bool:this
; ============================================================
-225.8 ms
+267.6 ms |
System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark

@@ -131,13 +131,13 @@ G_M5743_IG03: ;; offset=0077H
; V01 RetBuf [V01,T00] ( 6, 6 ) byref -> rdx single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "impAppendStmt"
-;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "spilled call-like call argument"
+;* V04 tmp2 [V04,T02] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "spilled call-like call argument"
;* V05 tmp3 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V06 tmp4 [V06 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
;* V07 tmp5 [V07 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V08 tmp6 [V08 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
; V09 tmp7 [V09 ] ( 4, 8 ) struct (24) [rsp+00H] do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
-;* V10 tmp8 [V10,T02] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
+;* V10 tmp8 [V10 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
; V11 tmp9 [V11,T01] ( 4, 8 ) byref -> rax single-def "impAppendStmt"
;* V12 tmp10 [V12 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V13 tmp11 [V13 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
@@ -159,8 +159,11 @@ G_M5743_IG03: ;; offset=0077H
; V29 tmp27 [V29,T05] ( 2, 2 ) simd8 -> mm0 V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
; V30 tmp28 [V30,T06] ( 2, 2 ) simd8 -> mm1 V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
; V31 tmp29 [V31,T07] ( 2, 2 ) simd8 -> mm2 V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
-; V32 cse0 [V32,T03] ( 2, 2 ) simd8 -> mm0 "CSE - aggressive"
-; V33 cse1 [V33,T04] ( 2, 2 ) simd8 -> mm1 "CSE - aggressive"
+;* V32 tmp30 [V32,T14] ( 0, 0 ) simd8 -> zero-ref "V10.[000..008)"
+;* V33 tmp31 [V33,T15] ( 0, 0 ) simd8 -> zero-ref "V10.[008..016)"
+;* V34 tmp32 [V34,T16] ( 0, 0 ) simd8 -> zero-ref "V10.[016..024)"
+;* V35 cse0 [V35,T03] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive"
+;* V36 cse1 [V36,T04] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive"
;
; Lcl frame size = 24
@@ -170,18 +173,18 @@ G_M5743_IG01: ;; offset=0000H
;; size=7 bbWeight=1 PerfScore 1.25
G_M5743_IG02: ;; offset=0007H
vmovsd xmm0, qword ptr [reloc @RWD00]
- vmovsd xmm1, qword ptr [reloc @RWD08]
- vmovsd xmm2, qword ptr [reloc @RWD00]
- vmovsd qword ptr [rsp], xmm2
- vmovsd xmm2, qword ptr [reloc @RWD08]
- vmovsd qword ptr [rsp+08H], xmm2
- vxorps xmm2, xmm2, xmm2
- vmovsd qword ptr [rsp+10H], xmm2
+ vmovsd qword ptr [rsp], xmm0
+ vmovsd xmm0, qword ptr [reloc @RWD08]
+ vmovsd qword ptr [rsp+08H], xmm0
+ vxorps xmm0, xmm0, xmm0
+ vmovsd qword ptr [rsp+10H], xmm0
lea rax, bword ptr [rsp]
- vmovsd xmm2, qword ptr [rax]
- vsubps xmm0, xmm2, xmm0
- vmovsd xmm2, qword ptr [rax+08H]
- vsubps xmm1, xmm2, xmm1
+ vmovsd xmm0, qword ptr [rax]
+ vmovsd xmm1, qword ptr [reloc @RWD00]
+ vsubps xmm0, xmm0, xmm1
+ vmovsd xmm1, qword ptr [rax+08H]
+ vmovsd xmm2, qword ptr [reloc @RWD08]
+ vsubps xmm1, xmm1, xmm2
vmovsd xmm2, qword ptr [rax+10H]
vxorps xmm3, xmm3, xmm3
vsubps xmm2, xmm2, xmm3
@@ -201,4 +204,4 @@ RWD08 dq 3F80000000000000h
; Total bytes of code 116, prolog size 7, PerfScore 57.52, instruction count 24, allocated bytes for code 116 (MethodHash=9699e990) for method Program:SubtractBenchmark():System.Numerics.Matrix3x2:this

It looks like physical promotion ends up with slightly different pipelining, which seems worse in the lab (although on my laptop's Intel CPU it is sometimes faster than the original). The codegen for this benchmark is terrible both with and without physical promotion. The problem is around
|
System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixBenchmark

@@ -128,126 +128,140 @@ G_M38613_IG03: ;; offset=0077H
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
-; V01 RetBuf [V01,T02] ( 6, 6 ) byref -> rdx single-def
+; V01 RetBuf [V01,T01] ( 6, 6 ) byref -> rdx single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "impAppendStmt"
;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "spilled call-like call argument"
;* V05 tmp3 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V06 tmp4 [V06 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
;* V07 tmp5 [V07 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
-;* V08 tmp6 [V08 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V09 tmp7 [V09 ] ( 4, 8 ) struct (24) [rsp+18H] do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
-; V10 tmp8 [V10,T00] ( 9, 18 ) struct (24) [rsp+00H] do-not-enreg[SF] ld-addr-op "Inlining Arg"
-; V11 tmp9 [V11,T01] ( 7, 14 ) byref -> rax single-def "impAppendStmt"
+; V08 tmp6 [V08 ] ( 9, 9 ) struct (24) [rsp+18H] do-not-enreg[SF] ld-addr-op "Inline ldloca(s) first use temp"
+; V09 tmp7 [V09 ] ( 4, 8 ) struct (24) [rsp+00H] do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
+;* V10 tmp8 [V10 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
+; V11 tmp9 [V11,T00] ( 7, 14 ) byref -> rax single-def "impAppendStmt"
;* V12 tmp10 [V12 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
;* V13 tmp11 [V13 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V14 tmp12 [V14,T07] ( 2, 4 ) simd8 -> mm0 ld-addr-op "NewObj constructor temp"
-; V15 tmp13 [V15,T08] ( 2, 4 ) simd8 -> mm2 ld-addr-op "NewObj constructor temp"
-; V16 tmp14 [V16,T09] ( 2, 4 ) simd8 -> mm1 ld-addr-op "NewObj constructor temp"
+; V14 tmp12 [V14,T09] ( 2, 4 ) simd8 -> mm6 ld-addr-op "NewObj constructor temp"
+; V15 tmp13 [V15,T10] ( 2, 4 ) simd8 -> mm7 ld-addr-op "NewObj constructor temp"
+; V16 tmp14 [V16,T11] ( 2, 4 ) simd8 -> mm0 ld-addr-op "NewObj constructor temp"
;* V17 tmp15 [V17 ] ( 0, 0 ) simd8 -> zero-ref V05.X(offs=0x00) P-INDEP "field V05.X (fldOffset=0x0)"
;* V18 tmp16 [V18 ] ( 0, 0 ) simd8 -> zero-ref V05.Y(offs=0x08) P-INDEP "field V05.Y (fldOffset=0x8)"
;* V19 tmp17 [V19 ] ( 0, 0 ) simd8 -> zero-ref V05.Z(offs=0x10) P-INDEP "field V05.Z (fldOffset=0x10)"
-;* V20 tmp18 [V20,T21] ( 0, 0 ) simd8 -> zero-ref V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
-;* V21 tmp19 [V21,T22] ( 0, 0 ) simd8 -> zero-ref V06.Y(offs=0x08) P-INDEP "field V06.Y (fldOffset=0x8)"
-;* V22 tmp20 [V22,T23] ( 0, 0 ) simd8 -> zero-ref V06.Z(offs=0x10) P-INDEP "field V06.Z (fldOffset=0x10)"
+;* V20 tmp18 [V20,T25] ( 0, 0 ) simd8 -> zero-ref V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
+;* V21 tmp19 [V21,T26] ( 0, 0 ) simd8 -> zero-ref V06.Y(offs=0x08) P-INDEP "field V06.Y (fldOffset=0x8)"
+;* V22 tmp20 [V22,T27] ( 0, 0 ) simd8 -> zero-ref V06.Z(offs=0x10) P-INDEP "field V06.Z (fldOffset=0x10)"
;* V23 tmp21 [V23 ] ( 0, 0 ) simd8 -> zero-ref V07.X(offs=0x00) P-INDEP "field V07.X (fldOffset=0x0)"
;* V24 tmp22 [V24 ] ( 0, 0 ) simd8 -> zero-ref V07.Y(offs=0x08) P-INDEP "field V07.Y (fldOffset=0x8)"
;* V25 tmp23 [V25 ] ( 0, 0 ) simd8 -> zero-ref V07.Z(offs=0x10) P-INDEP "field V07.Z (fldOffset=0x10)"
-;* V26 tmp24 [V26,T24] ( 0, 0 ) simd8 -> zero-ref V08.X(offs=0x00) P-INDEP "field V08.X (fldOffset=0x0)"
-;* V27 tmp25 [V27,T25] ( 0, 0 ) simd8 -> zero-ref V08.Y(offs=0x08) P-INDEP "field V08.Y (fldOffset=0x8)"
-;* V28 tmp26 [V28,T26] ( 0, 0 ) simd8 -> zero-ref V08.Z(offs=0x10) P-INDEP "field V08.Z (fldOffset=0x10)"
+; V26 tmp24 [V26,T02] ( 7, 7 ) simd8 -> [rsp+18H] do-not-enreg[S] V08.X(offs=0x00) P-DEP "field V08.X (fldOffset=0x0)"
+; V27 tmp25 [V27,T03] ( 7, 7 ) simd8 -> [rsp+20H] do-not-enreg[S] V08.Y(offs=0x08) P-DEP "field V08.Y (fldOffset=0x8)"
+; V28 tmp26 [V28,T04] ( 7, 7 ) simd8 -> [rsp+28H] do-not-enreg[S] V08.Z(offs=0x10) P-DEP "field V08.Z (fldOffset=0x10)"
;* V29 tmp27 [V29 ] ( 0, 0 ) simd8 -> zero-ref V12.X(offs=0x00) P-INDEP "field V12.X (fldOffset=0x0)"
;* V30 tmp28 [V30 ] ( 0, 0 ) simd8 -> zero-ref V12.Y(offs=0x08) P-INDEP "field V12.Y (fldOffset=0x8)"
;* V31 tmp29 [V31 ] ( 0, 0 ) simd8 -> zero-ref V12.Z(offs=0x10) P-INDEP "field V12.Z (fldOffset=0x10)"
-; V32 tmp30 [V32,T18] ( 2, 2 ) simd8 -> mm0 V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
-; V33 tmp31 [V33,T19] ( 2, 2 ) simd8 -> mm2 V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
-; V34 tmp32 [V34,T20] ( 2, 2 ) simd8 -> mm1 V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
-; V35 cse0 [V35,T10] ( 3, 3 ) float -> mm3 "CSE - aggressive"
-; V36 cse1 [V36,T11] ( 3, 3 ) float -> mm2 "CSE - aggressive"
-; V37 cse2 [V37,T12] ( 3, 3 ) float -> mm7 "CSE - aggressive"
-; V38 cse3 [V38,T13] ( 3, 3 ) float -> mm3 "CSE - aggressive"
-; V39 cse4 [V39,T14] ( 3, 3 ) float -> mm7 "CSE - aggressive"
-; V40 cse5 [V40,T03] ( 4, 4 ) float -> mm1 "CSE - aggressive"
-; V41 cse6 [V41,T04] ( 4, 4 ) float -> mm4 "CSE - aggressive"
-; V42 cse7 [V42,T05] ( 4, 4 ) float -> mm5 "CSE - aggressive"
-; V43 cse8 [V43,T06] ( 4, 4 ) float -> mm6 "CSE - aggressive"
-;* V44 cse9 [V44,T15] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive"
-;* V45 cse10 [V45,T16] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive"
-; V46 cse11 [V46,T17] ( 3, 3 ) float -> mm0 "CSE - aggressive"
+; V32 tmp30 [V32,T20] ( 2, 2 ) simd8 -> mm6 V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
+; V33 tmp31 [V33,T21] ( 2, 2 ) simd8 -> mm7 V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
+; V34 tmp32 [V34,T22] ( 2, 2 ) simd8 -> mm0 V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
+; V35 tmp33 [V35,T05] ( 4, 4 ) float -> mm0 "V04.[000..004)"
+; V36 tmp34 [V36,T06] ( 4, 4 ) float -> mm1 "V04.[004..008)"
+; V37 tmp35 [V37,T07] ( 4, 4 ) float -> mm2 "V04.[008..012)"
+; V38 tmp36 [V38,T08] ( 4, 4 ) float -> mm3 "V04.[012..016)"
+; V39 tmp37 [V39,T23] ( 2, 2 ) float -> mm4 "V04.[016..020)"
+; V40 tmp38 [V40,T24] ( 2, 2 ) float -> mm5 "V04.[020..024)"
+;* V41 tmp39 [V41 ] ( 0, 0 ) float -> zero-ref "V10.[000..004)"
+;* V42 tmp40 [V42 ] ( 0, 0 ) float -> zero-ref "V10.[004..008)"
+;* V43 tmp41 [V43 ] ( 0, 0 ) float -> zero-ref "V10.[008..012)"
+;* V44 tmp42 [V44 ] ( 0, 0 ) float -> zero-ref "V10.[012..016)"
+;* V45 tmp43 [V45 ] ( 0, 0 ) float -> zero-ref "V10.[016..020)"
+;* V46 tmp44 [V46 ] ( 0, 0 ) float -> zero-ref "V10.[020..024)"
+; V47 cse0 [V47,T12] ( 3, 3 ) float -> mm8 "CSE - aggressive"
+; V48 cse1 [V48,T13] ( 3, 3 ) float -> mm7 "CSE - aggressive"
+; V49 cse2 [V49,T14] ( 3, 3 ) float -> mm9 "CSE - aggressive"
+; V50 cse3 [V50,T15] ( 3, 3 ) float -> mm8 "CSE - aggressive"
+; V51 cse4 [V51,T16] ( 3, 3 ) float -> mm9 "CSE - aggressive"
+; V52 cse5 [V52,T17] ( 2, 2 ) simd8 -> mm0 "CSE - aggressive"
+; V53 cse6 [V53,T18] ( 2, 2 ) simd8 -> mm1 "CSE - aggressive"
+; V54 cse7 [V54,T19] ( 3, 3 ) float -> mm6 "CSE - aggressive"
;
-; Lcl frame size = 104
+; Lcl frame size = 136
G_M38613_IG01: ;; offset=0000H
- sub rsp, 104
+ sub rsp, 136
vzeroupper
- vmovaps xmmword ptr [rsp+50H], xmm6
- vmovaps xmmword ptr [rsp+40H], xmm7
- vmovaps xmmword ptr [rsp+30H], xmm8
- ;; size=25 bbWeight=1 PerfScore 7.25
-G_M38613_IG02: ;; offset=0019H
+ vmovaps xmmword ptr [rsp+70H], xmm6
+ vmovaps xmmword ptr [rsp+60H], xmm7
+ vmovaps xmmword ptr [rsp+50H], xmm8
+ vmovaps xmmword ptr [rsp+40H], xmm9
+ vmovaps xmmword ptr [rsp+30H], xmm10
+ ;; size=40 bbWeight=1 PerfScore 11.25
+G_M38613_IG02: ;; offset=0028H
vmovsd xmm0, qword ptr [reloc @RWD00]
+ vmovsd xmm1, qword ptr [reloc @RWD08]
vmovsd qword ptr [rsp+18H], xmm0
- vmovsd xmm0, qword ptr [reloc @RWD08]
- vmovsd qword ptr [rsp+20H], xmm0
+ vmovsd qword ptr [rsp+20H], xmm1
vxorps xmm0, xmm0, xmm0
vmovsd qword ptr [rsp+28H], xmm0
- vmovsd xmm0, qword ptr [reloc @RWD00]
- vmovsd qword ptr [rsp], xmm0
- vmovsd xmm0, qword ptr [reloc @RWD08]
- vmovsd qword ptr [rsp+08H], xmm0
- vxorps xmm0, xmm0, xmm0
- vmovsd qword ptr [rsp+10H], xmm0
- lea rax, bword ptr [rsp+18H]
- vmovss xmm0, dword ptr [rax]
- vmovss xmm1, dword ptr [rsp]
- vmulss xmm2, xmm0, xmm1
- vmovss xmm3, dword ptr [rax+04H]
- vmovss xmm4, dword ptr [rsp+08H]
- vmulss xmm5, xmm3, xmm4
- vaddss xmm2, xmm2, xmm5
- vmovss xmm5, dword ptr [rsp+04H]
- vmulss xmm0, xmm0, xmm5
- vmovss xmm6, dword ptr [rsp+0CH]
- vmulss xmm3, xmm3, xmm6
- vaddss xmm0, xmm0, xmm3
- vinsertps xmm0, xmm2, xmm0, 28
- vmovss xmm2, dword ptr [rax+08H]
- vmulss xmm3, xmm2, xmm1
- vmovss xmm7, dword ptr [rax+0CH]
- vmulss xmm8, xmm7, xmm4
- vaddss xmm3, xmm3, xmm8
- vmulss xmm2, xmm2, xmm5
- vmulss xmm7, xmm7, xmm6
- vaddss xmm2, xmm2, xmm7
- vinsertps xmm2, xmm3, xmm2, 28
- vmovss xmm3, dword ptr [rax+10H]
- vmulss xmm1, xmm3, xmm1
- vmovss xmm7, dword ptr [rax+14H]
- vmulss xmm4, xmm7, xmm4
- vaddss xmm1, xmm1, xmm4
- vaddss xmm1, xmm1, dword ptr [rsp+10H]
- vmulss xmm3, xmm3, xmm5
- vmulss xmm4, xmm7, xmm6
- vaddss xmm3, xmm3, xmm4
- vaddss xmm3, xmm3, dword ptr [rsp+14H]
- vinsertps xmm1, xmm1, xmm3, 28
- vmovsd qword ptr [rdx], xmm0
- vmovsd qword ptr [rdx+08H], xmm2
- vmovsd qword ptr [rdx+10H], xmm1
+ vmovss xmm0, dword ptr [rsp+18H]
+ vmovss xmm1, dword ptr [rsp+1CH]
+ vmovss xmm2, dword ptr [rsp+20H]
+ vmovss xmm3, dword ptr [rsp+24H]
+ vmovss xmm4, dword ptr [rsp+28H]
+ vmovss xmm5, dword ptr [rsp+2CH]
+ vmovsd xmm6, qword ptr [reloc @RWD00]
+ vmovsd qword ptr [rsp], xmm6
+ vmovsd xmm6, qword ptr [reloc @RWD08]
+ vmovsd qword ptr [rsp+08H], xmm6
+ vxorps xmm6, xmm6, xmm6
+ vmovsd qword ptr [rsp+10H], xmm6
+ lea rax, bword ptr [rsp]
+ vmovss xmm6, dword ptr [rax]
+ vmulss xmm7, xmm6, xmm0
+ vmovss xmm8, dword ptr [rax+04H]
+ vmulss xmm9, xmm8, xmm2
+ vaddss xmm7, xmm7, xmm9
+ vmulss xmm6, xmm6, xmm1
+ vmulss xmm8, xmm8, xmm3
+ vaddss xmm6, xmm6, xmm8
+ vinsertps xmm6, xmm7, xmm6, 28
+ vmovss xmm7, dword ptr [rax+08H]
+ vmulss xmm8, xmm7, xmm0
+ vmovss xmm9, dword ptr [rax+0CH]
+ vmulss xmm10, xmm9, xmm2
+ vaddss xmm8, xmm8, xmm10
+ vmulss xmm7, xmm7, xmm1
+ vmulss xmm9, xmm9, xmm3
+ vaddss xmm7, xmm7, xmm9
+ vinsertps xmm7, xmm8, xmm7, 28
+ vmovss xmm8, dword ptr [rax+10H]
+ vmulss xmm0, xmm8, xmm0
+ vmovss xmm9, dword ptr [rax+14H]
+ vmulss xmm2, xmm9, xmm2
+ vaddss xmm0, xmm0, xmm2
+ vaddss xmm0, xmm0, xmm4
+ vmulss xmm1, xmm8, xmm1
+ vmulss xmm2, xmm9, xmm3
+ vaddss xmm1, xmm1, xmm2
+ vaddss xmm1, xmm1, xmm5
+ vinsertps xmm0, xmm0, xmm1, 28
+ vmovsd qword ptr [rdx], xmm6
+ vmovsd qword ptr [rdx+08H], xmm7
+ vmovsd qword ptr [rdx+10H], xmm0
mov rax, rdx
- ;; size=252 bbWeight=1 PerfScore 128.42
-G_M38613_IG03: ;; offset=0115H
- vmovaps xmm6, xmmword ptr [rsp+50H]
- vmovaps xmm7, xmmword ptr [rsp+40H]
- vmovaps xmm8, xmmword ptr [rsp+30H]
- add rsp, 104
+ ;; size=263 bbWeight=1 PerfScore 130.42
+G_M38613_IG03: ;; offset=012FH
+ vmovaps xmm6, xmmword ptr [rsp+70H]
+ vmovaps xmm7, xmmword ptr [rsp+60H]
+ vmovaps xmm8, xmmword ptr [rsp+50H]
+ vmovaps xmm9, xmmword ptr [rsp+40H]
+ vmovaps xmm10, xmmword ptr [rsp+30H]
+ add rsp, 136
ret
- ;; size=23 bbWeight=1 PerfScore 13.25
+ ;; size=38 bbWeight=1 PerfScore 21.25
RWD00 dq 000000003F800000h
RWD08 dq 3F80000000000000h
-; Total bytes of code 300, prolog size 25, PerfScore 178.92, instruction count 60, allocated bytes for code 300 (MethodHash=f176692a) for method Program:MultiplyByMatrixOperatorBenchmark():Program+Matrix3x2:this
+; Total bytes of code 341, prolog size 40, PerfScore 197.02, instruction count 66, allocated bytes for code 341 (MethodHash=f176692a) for method Program:MultiplyByMatrixOperatorBenchmark():Program+Matrix3x2:this
-499.7 ms
+555.9 ms

We need some more registers and also see the same kind of pipelining change as in the previous comment, but in addition we also DNER
|
Promoting The pass does not currently have the necessary information to take this into account, so I need to think about what to do here. |
Looking at this perfscore regression:

306966.04 ( 0.29% of base) : 53160.dasm - Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this

@@ -19,7 +19,7 @@
;* V07 loc4 [V07 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op
; V08 OutArgs [V08 ] ( 1, 1 ) struct (40) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V09 tmp1 [V09 ] ( 0, 0 ) struct (16) zero-ref "impAppendStmt"
-; V10 tmp2 [V10,T03] ( 4,1668919.51) struct (24) [rsp+90H] do-not-enreg[S] ld-addr-op "NewObj constructor temp"
+;* V10 tmp2 [V10 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] ld-addr-op "NewObj constructor temp"
;* V11 tmp3 [V11 ] ( 0, 0 ) struct (16) zero-ref "spilled call-like call argument"
;* V12 tmp4 [V12 ] ( 0, 0 ) int -> zero-ref "Strict ordering of exceptions for Array store"
;* V13 tmp5 [V13 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
@@ -31,37 +31,37 @@
;* V19 tmp11 [V19 ] ( 0, 0 ) struct (16) zero-ref "spilled call-like call argument"
;* V20 tmp12 [V20 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
;* V21 tmp13 [V21 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V22 tmp14 [V22,T36] ( 2, 834459.76) double -> mm1 "Inlining Arg"
+; V22 tmp14 [V22,T38] ( 2, 834459.76) double -> mm1 "Inlining Arg"
;* V23 tmp15 [V23 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V24 tmp16 [V24 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
;* V25 tmp17 [V25 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V26 tmp18 [V26,T37] ( 2, 834459.76) double -> mm3 "Inlining Arg"
+; V26 tmp18 [V26,T39] ( 2, 834459.76) double -> mm2 "Inlining Arg"
;* V27 tmp19 [V27 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V28 tmp20 [V28 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
;* V29 tmp21 [V29 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V30 tmp22 [V30,T46] ( 3, 625844.82) float -> mm1 "Inline stloc first use temp"
-; V31 tmp23 [V31,T50] ( 3, 417229.88) float -> mm9
+; V30 tmp22 [V30,T44] ( 3, 625844.82) float -> mm1 "Inline stloc first use temp"
+; V31 tmp23 [V31,T48] ( 3, 417229.88) float -> mm9
;* V32 tmp24 [V32 ] ( 0, 0 ) float -> zero-ref "Inline stloc first use temp"
;* V33 tmp25 [V33 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
;* V34 tmp26 [V34 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
-; V35 tmp27 [V35 ] ( 3, 417229.88) struct (16) [rsp+80H] do-not-enreg[XS] must-init addr-exposed "Inline return value spill temp"
-; V36 tmp28 [V36,T04] ( 3,1661092.31) struct (24) [rsp+68H] do-not-enreg[S] "Inlining Arg"
+; V35 tmp27 [V35 ] ( 3, 417229.88) struct (16) [rsp+88H] do-not-enreg[XS] must-init addr-exposed "Inline return value spill temp"
+; V36 tmp28 [V36,T07] ( 3,1418003.17) struct (24) [rsp+70H] do-not-enreg[S] "Inlining Arg"
; V37 tmp29 [V37,T21] ( 3, 546589.36) ref -> r8 class-hnd "Inline stloc first use temp"
;* V38 tmp30 [V38 ] ( 0, 0 ) ref -> zero-ref class-hnd "Inline return value spill temp"
; V39 tmp31 [V39,T20] ( 5, 577728.02) ref -> r13 class-hnd "Inline stloc first use temp"
-; V40 tmp32 [V40,T10] ( 3,1039161.09) ref -> [rsp+38H] class-hnd spill-single-def "Inline stloc first use temp"
-; V41 tmp33 [V41,T01] ( 5,2696339.80) int -> [rsp+64H] "Inline stloc first use temp"
-; V42 tmp34 [V42,T02] ( 4,1865793.64) ref -> r8 class-hnd "Inline stloc first use temp"
+; V40 tmp32 [V40,T10] ( 3,1039161.09) ref -> [rsp+40H] class-hnd spill-single-def "Inline stloc first use temp"
+; V41 tmp33 [V41,T01] ( 5,2696339.80) int -> [rsp+6CH] "Inline stloc first use temp"
+; V42 tmp34 [V42,T03] ( 4,1865793.64) ref -> r8 class-hnd "Inline stloc first use temp"
; V43 tmp35 [V43,T08] ( 4,1324112.02) ref -> r8 class-hnd "Inline stloc first use temp"
-; V44 tmp36 [V44,T05] ( 4,1531185.21) ref -> r8 "guarded devirt return temp"
-; V45 tmp37 [V45,T06] ( 4,1484307.13) ref -> [rsp+30H] class-hnd exact spill-single-def "guarded devirt this exact temp"
+; V44 tmp36 [V44,T04] ( 4,1531185.21) ref -> r8 "guarded devirt return temp"
+; V45 tmp37 [V45,T05] ( 4,1484307.13) ref -> [rsp+38H] class-hnd exact spill-single-def "guarded devirt this exact temp"
;* V46 tmp38 [V46 ] ( 0, 0 ) struct (16) zero-ref "Inline stloc first use temp"
-; V47 tmp39 [V47,T32] ( 4,1208697.08) float -> mm11 "Inline stloc first use temp"
+; V47 tmp39 [V47,T34] ( 4,1208697.08) float -> mm11 "Inline stloc first use temp"
;* V48 tmp40 [V48 ] ( 0, 0 ) double -> zero-ref "impAppendStmt"
-; V49 tmp41 [V49,T44] ( 3, 765017.45) double -> mm0 "Inline stloc first use temp"
-; V50 tmp42 [V50,T45] ( 3, 713112.43) float -> mm10
-; V51 tmp43 [V51,T31] ( 3,1156792.06) float -> mm10 "Inline stloc first use temp"
-; V52 tmp44 [V52,T00] ( 5,3855973.53) ref -> [rsp+28H] class-hnd exact spill-single-def "NewObj constructor temp"
+; V49 tmp41 [V49,T42] ( 3, 765017.45) double -> mm0 "Inline stloc first use temp"
+; V50 tmp42 [V50,T43] ( 3, 713112.43) float -> mm10
+; V51 tmp43 [V51,T33] ( 3,1156792.06) float -> mm10 "Inline stloc first use temp"
+; V52 tmp44 [V52,T00] ( 5,3855973.53) ref -> [rsp+30H] class-hnd exact spill-single-def "NewObj constructor temp"
;* V53 tmp45 [V53 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
;* V54 tmp46 [V54 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V55 tmp47 [V55 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
@@ -78,44 +78,47 @@
; V66 tmp58 [V66,T24] ( 3, 417229.88) int -> rax "Inline return value spill temp"
;* V67 tmp59 [V67 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V68 tmp60 [V68,T18] ( 3, 625844.82) int -> rax "Inline stloc first use temp"
-; V69 tmp61 [V69,T33] ( 4, 834459.76) simd12 -> mm0 V07._simdVector(offs=0x00) P-INDEP "field V07._simdVector (fldOffset=0x0)"
-; V70 tmp62 [V70,T51] ( 2, 417229.88) simd12 -> mm2 V09._simdVector(offs=0x00) P-INDEP "field V09._simdVector (fldOffset=0x0)"
+; V69 tmp61 [V69,T35] ( 4, 834459.76) simd12 -> mm0 V07._simdVector(offs=0x00) P-INDEP "field V07._simdVector (fldOffset=0x0)"
+; V70 tmp62 [V70,T49] ( 2, 417229.88) simd12 -> mm7 V09._simdVector(offs=0x00) P-INDEP "field V09._simdVector (fldOffset=0x0)"
;* V71 tmp63 [V71 ] ( 0, 0 ) simd12 -> zero-ref V11._simdVector(offs=0x00) P-INDEP "field V11._simdVector (fldOffset=0x0)"
-; V72 tmp64 [V72,T52] ( 2, 417229.88) simd12 -> mm0 V14._simdVector(offs=0x00) P-INDEP "field V14._simdVector (fldOffset=0x0)"
+; V72 tmp64 [V72,T50] ( 2, 417229.88) simd12 -> mm0 V14._simdVector(offs=0x00) P-INDEP "field V14._simdVector (fldOffset=0x0)"
;* V73 tmp65 [V73 ] ( 0, 0 ) simd12 -> zero-ref V16._simdVector(offs=0x00) P-INDEP "field V16._simdVector (fldOffset=0x0)"
;* V74 tmp66 [V74 ] ( 0, 0 ) simd12 -> zero-ref V17._simdVector(offs=0x00) P-INDEP "field V17._simdVector (fldOffset=0x0)"
;* V75 tmp67 [V75 ] ( 0, 0 ) simd12 -> zero-ref V18._simdVector(offs=0x00) P-INDEP "field V18._simdVector (fldOffset=0x0)"
-; V76 tmp68 [V76,T34] ( 4, 834459.76) simd12 -> mm0 V19._simdVector(offs=0x00) P-INDEP "field V19._simdVector (fldOffset=0x0)"
-; V77 tmp69 [V77,T53] ( 2, 417229.88) simd12 -> mm1 V21._simdVector(offs=0x00) P-INDEP "field V21._simdVector (fldOffset=0x0)"
-; V78 tmp70 [V78,T54] ( 2, 417229.88) simd12 -> mm4 V23._simdVector(offs=0x00) P-INDEP "field V23._simdVector (fldOffset=0x0)"
-; V79 tmp71 [V79,T55] ( 2, 417229.88) simd12 -> mm3 V25._simdVector(offs=0x00) P-INDEP "field V25._simdVector (fldOffset=0x0)"
-; V80 tmp72 [V80,T56] ( 2, 417229.88) simd12 -> mm4 V27._simdVector(offs=0x00) P-INDEP "field V27._simdVector (fldOffset=0x0)"
-; V81 tmp73 [V81,T57] ( 2, 417229.88) simd12 -> mm1 V28._simdVector(offs=0x00) P-INDEP "field V28._simdVector (fldOffset=0x0)"
-; V82 tmp74 [V82,T58] ( 2, 417229.88) simd12 -> mm0 V29._simdVector(offs=0x00) P-INDEP "field V29._simdVector (fldOffset=0x0)"
-; V83 tmp75 [V83,T59] ( 2, 417229.88) simd12 -> mm0 V33._simdVector(offs=0x00) P-INDEP "field V33._simdVector (fldOffset=0x0)"
-; V84 tmp76 [V84 ] ( 3, 417229.88) simd12 -> [rsp+80H] do-not-enreg[XS] addr-exposed V35._simdVector(offs=0x00) P-DEP "field V35._simdVector (fldOffset=0x0)"
+; V76 tmp68 [V76,T36] ( 4, 834459.76) simd12 -> mm0 V19._simdVector(offs=0x00) P-INDEP "field V19._simdVector (fldOffset=0x0)"
+; V77 tmp69 [V77,T51] ( 2, 417229.88) simd12 -> mm1 V21._simdVector(offs=0x00) P-INDEP "field V21._simdVector (fldOffset=0x0)"
+; V78 tmp70 [V78,T52] ( 2, 417229.88) simd12 -> mm3 V23._simdVector(offs=0x00) P-INDEP "field V23._simdVector (fldOffset=0x0)"
+; V79 tmp71 [V79,T53] ( 2, 417229.88) simd12 -> mm2 V25._simdVector(offs=0x00) P-INDEP "field V25._simdVector (fldOffset=0x0)"
+; V80 tmp72 [V80,T54] ( 2, 417229.88) simd12 -> mm3 V27._simdVector(offs=0x00) P-INDEP "field V27._simdVector (fldOffset=0x0)"
+; V81 tmp73 [V81,T55] ( 2, 417229.88) simd12 -> mm1 V28._simdVector(offs=0x00) P-INDEP "field V28._simdVector (fldOffset=0x0)"
+; V82 tmp74 [V82,T56] ( 2, 417229.88) simd12 -> mm0 V29._simdVector(offs=0x00) P-INDEP "field V29._simdVector (fldOffset=0x0)"
+; V83 tmp75 [V83,T57] ( 2, 417229.88) simd12 -> mm9 V33._simdVector(offs=0x00) P-INDEP "field V33._simdVector (fldOffset=0x0)"
+; V84 tmp76 [V84 ] ( 3, 417229.88) simd12 -> [rsp+88H] do-not-enreg[XS] addr-exposed V35._simdVector(offs=0x00) P-DEP "field V35._simdVector (fldOffset=0x0)"
; V85 tmp77 [V85,T29] ( 4,1426224.85) simd12 -> mm10 V46._simdVector(offs=0x00) P-INDEP "field V46._simdVector (fldOffset=0x0)"
; V86 tmp78 [V86,T40] ( 2, 771194.71) simd12 -> mm10 V53._simdVector(offs=0x00) P-INDEP "field V53._simdVector (fldOffset=0x0)"
; V87 tmp79 [V87,T41] ( 2, 771194.71) simd12 -> mm0 V54._simdVector(offs=0x00) P-INDEP "field V54._simdVector (fldOffset=0x0)"
-; V88 tmp80 [V88,T42] ( 2, 771194.71) simd12 -> mm1 V55._simdVector(offs=0x00) P-INDEP "field V55._simdVector (fldOffset=0x0)"
-; V89 tmp81 [V89,T43] ( 2, 771194.71) simd12 -> mm0 V56._simdVector(offs=0x00) P-INDEP "field V56._simdVector (fldOffset=0x0)"
-;* V90 tmp82 [V90,T61] ( 0, 0 ) simd12 -> zero-ref V58._simdVector(offs=0x00) P-INDEP "field V58._simdVector (fldOffset=0x0)"
-; V91 tmp83 [V91 ] ( 2, 945335.45) struct (24) [rsp+48H] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V92 tmp84 [V92,T07] ( 3,1418003.17) ref -> r8 "argument with side effect"
-; V93 cse0 [V93,T38] ( 3, 802827.23) simd12 -> mm9 "CSE - aggressive"
-; V94 cse1 [V94,T47] ( 3, 625844.82) double -> mm1 "CSE - moderate"
-; V95 cse2 [V95,T48] ( 3, 625844.82) double -> mm4 "CSE - moderate"
-; V96 cse3 [V96,T60] ( 2, 209609.18) double -> mm6 "CSE - conservative"
-; V97 cse4 [V97,T39] ( 3, 802827.23) simd12 -> mm7 "CSE - aggressive"
-; V98 cse5 [V98,T28] ( 3, 2997.00) int -> rax "CSE - conservative"
-; V99 cse6 [V99,T30] ( 5,1280874.96) double -> mm8 "CSE - aggressive"
-; V100 cse7 [V100,T11] ( 3,1039161.09) int -> [rsp+44H] spill-single-def "CSE - aggressive"
-; V101 cse8 [V101,T35] ( 4, 834459.76) float -> mm2 "CSE - aggressive"
-; V102 cse9 [V102,T49] ( 3, 625844.82) double -> mm3 "CSE - moderate"
-; V103 cse10 [V103,T19] ( 3, 625844.82) int -> rcx "CSE - moderate"
-; TEMP_01 double -> [rsp+0xA8]
+;* V88 tmp80 [V88 ] ( 0, 0 ) simd12 -> zero-ref V55._simdVector(offs=0x00) P-INDEP "field V55._simdVector (fldOffset=0x0)"
+;* V89 tmp81 [V89 ] ( 0, 0 ) simd12 -> zero-ref V56._simdVector(offs=0x00) P-INDEP "field V56._simdVector (fldOffset=0x0)"
+;* V90 tmp82 [V90,T59] ( 0, 0 ) simd12 -> zero-ref V58._simdVector(offs=0x00) P-INDEP "field V58._simdVector (fldOffset=0x0)"
+;* V91 tmp83 [V91 ] ( 0, 0 ) simd12 -> zero-ref "V10.[000..012)"
+;* V92 tmp84 [V92 ] ( 0, 0 ) simd12 -> zero-ref "V10.[012..024)"
+; V93 tmp85 [V93,T31] ( 4,1216143.51) simd12 -> mm7 "V36.[000..012)"
+; V94 tmp86 [V94,T32] ( 4,1216143.51) simd12 -> mm9 "V36.[012..024)"
+; V95 tmp87 [V95,T02] ( 3,2313584.12) byref -> rcx "Spilling address for field-by-field copy"
+; V96 tmp88 [V96 ] ( 2, 945335.45) struct (24) [rsp+50H] do-not-enreg[XS] addr-exposed "by-value struct argument"
+; V97 tmp89 [V97,T06] ( 3,1418003.17) ref -> r8 "argument with side effect"
+; V98 cse0 [V98,T45] ( 3, 625844.82) double -> mm1 "CSE - moderate"
+; V99 cse1 [V99,T46] ( 3, 625844.82) double -> mm3 "CSE - moderate"
+; V100 cse2 [V100,T58] ( 2, 209609.18) double -> mm6 "CSE - conservative"
+; V101 cse3 [V101,T28] ( 3, 2997.00) int -> rax "CSE - conservative"
+; V102 cse4 [V102,T30] ( 5,1280874.96) double -> mm8 "CSE - aggressive"
+; V103 cse5 [V103,T11] ( 3,1039161.09) int -> [rsp+4CH] spill-single-def "CSE - aggressive"
+; V104 cse6 [V104,T37] ( 4, 834459.76) float -> mm2 "CSE - moderate"
+; V105 cse7 [V105,T47] ( 3, 625844.82) double -> mm2 "CSE - moderate"
+; V106 cse8 [V106,T19] ( 3, 625844.82) int -> rcx "CSE - moderate"
+; TEMP_01 double -> [rsp+0x98]
;
-; Lcl frame size = 280
+; Lcl frame size = 264
G_M31648_IG01: ;; offset=0000H
push r15
@@ -126,17 +129,17 @@ G_M31648_IG01: ;; offset=0000H
push rsi
push rbp
push rbx
- sub rsp, 280
+ sub rsp, 264
vzeroupper
- vmovaps xmmword ptr [rsp+100H], xmm6
- vmovaps xmmword ptr [rsp+F0H], xmm7
- vmovaps xmmword ptr [rsp+E0H], xmm8
- vmovaps xmmword ptr [rsp+D0H], xmm9
- vmovaps xmmword ptr [rsp+C0H], xmm10
- vmovaps xmmword ptr [rsp+B0H], xmm11
+ vmovaps xmmword ptr [rsp+F0H], xmm6
+ vmovaps xmmword ptr [rsp+E0H], xmm7
+ vmovaps xmmword ptr [rsp+D0H], xmm8
+ vmovaps xmmword ptr [rsp+C0H], xmm9
+ vmovaps xmmword ptr [rsp+B0H], xmm10
+ vmovaps xmmword ptr [rsp+A0H], xmm11
xor eax, eax
- mov qword ptr [rsp+80H], rax
mov qword ptr [rsp+88H], rax
+ mov qword ptr [rsp+90H], rax
mov rsi, rcx
mov rbx, rdx
mov rdi, r8
@@ -162,213 +165,203 @@ G_M31648_IG04: ;; offset=008CH
G_M31648_IG05: ;; offset=0094H
vmovsd xmm7, qword ptr [r15+08H]
vinsertps xmm7, xmm7, dword ptr [r15+10H], 40
- vmovaps xmm2, xmm7
vmovsd xmm0, qword ptr [r15+14H]
vinsertps xmm0, xmm0, dword ptr [r15+1CH], 40
vxorps xmm1, xmm1, xmm1
vcvtsi2sd xmm1, xmm1, dword ptr [rsi+20H]
- vmovsd xmm3, qword ptr [reloc @RWD00]
- vmulsd xmm4, xmm1, xmm3
- vxorps xmm5, xmm5, xmm5
- vcvtsi2sd xmm5, xmm5, r12d
- vsubsd xmm4, xmm5, xmm4
+ vmovsd xmm2, qword ptr [reloc @RWD00]
+ vmulsd xmm3, xmm1, xmm2
+ vxorps xmm4, xmm4, xmm4
+ vcvtsi2sd xmm4, xmm4, r12d
+ vsubsd xmm3, xmm4, xmm3
vmovsd xmm8, qword ptr [reloc @RWD08]
vmulsd xmm1, xmm1, xmm8
- vdivsd xmm1, xmm4, xmm1
- vmovsd xmm4, qword ptr [r15+2CH]
- vinsertps xmm4, xmm4, dword ptr [r15+34H], 40
+ vdivsd xmm1, xmm3, xmm1
+ vmovsd xmm3, qword ptr [r15+2CH]
+ vinsertps xmm3, xmm3, dword ptr [r15+34H], 40
vcvtsd2ss xmm1, xmm1, xmm1
vbroadcastss xmm1, xmm1
- vmulps xmm1, xmm1, xmm4
- vxorps xmm4, xmm4, xmm4
- vcvtsi2sd xmm4, xmm4, dword ptr [rsi+24H]
- vmulsd xmm3, xmm4, xmm3
- vsubsd xmm3, xmm6, xmm3
- vxorps xmm3, xmm3, xmmword ptr [reloc @RWD16]
- vmulsd xmm4, xmm4, xmm8
- vdivsd xmm3, xmm3, xmm4
- vmovsd xmm4, qword ptr [r15+20H]
- vinsertps xmm4, xmm4, dword ptr [r15+28H], 40
- vcvtsd2ss xmm3, xmm3, xmm3
- vbroadcastss xmm3, xmm3
- vmulps xmm3, xmm3, xmm4
- vaddps xmm1, xmm1, xmm3
+ vmulps xmm1, xmm1, xmm3
+ vxorps xmm3, xmm3, xmm3
+ vcvtsi2sd xmm3, xmm3, dword ptr [rsi+24H]
+ vmulsd xmm2, xmm3, xmm2
+ vsubsd xmm2, xmm6, xmm2
+ vxorps xmm2, xmm2, xmmword ptr [reloc @RWD16]
+ vmulsd xmm3, xmm3, xmm8
+ vdivsd xmm2, xmm2, xmm3
+ vmovsd xmm3, qword ptr [r15+20H]
+ vinsertps xmm3, xmm3, dword ptr [r15+28H], 40
+ vcvtsd2ss xmm2, xmm2, xmm2
+ vbroadcastss xmm2, xmm2
+ vmulps xmm2, xmm2, xmm3
+ vaddps xmm1, xmm1, xmm2
vaddps xmm0, xmm0, xmm1
vdpps xmm1, xmm0, xmm0, 127
vcvtss2sd xmm1, xmm1, xmm1
vsqrtsd xmm1, xmm1, xmm1
vcvtsd2ss xmm1, xmm1, xmm1
- vxorps xmm3, xmm3, xmm3
- vucomiss xmm1, xmm3
+ vxorps xmm2, xmm2, xmm2
+ vucomiss xmm1, xmm2
jp SHORT G_M31648_IG06
je G_M31648_IG33
- ;; size=209 bbWeight=208614.94 PerfScore 33708697.29
-G_M31648_IG06: ;; offset=0165H
- vmovss xmm3, dword ptr [reloc @RWD32]
- vdivss xmm9, xmm3, xmm1
+ ;; size=205 bbWeight=208614.94 PerfScore 33656543.55
+G_M31648_IG06: ;; offset=0161H
+ vmovss xmm2, dword ptr [reloc @RWD32]
+ vdivss xmm9, xmm2, xmm1
;; size=12 bbWeight=208614.94 PerfScore 2711994.21
-G_M31648_IG07: ;; offset=0171H
+G_M31648_IG07: ;; offset=016DH
vcvtss2sd xmm1, xmm1, xmm9
vcvtsd2ss xmm1, xmm1, xmm1
vbroadcastss xmm1, xmm1
vmulps xmm9, xmm1, xmm0
- vmovaps xmm0, xmm9
- vxorps xmm1, xmm1, xmm1
- vmovdqu xmmword ptr [rsp+90H], xmm1
- vmovdqu xmmword ptr [rsp+98H], xmm1
- vmovsd qword ptr [rsp+90H], xmm2
- vextractps dword ptr [rsp+98H], xmm2, 2
- vmovsd qword ptr [rsp+9CH], xmm0
- vextractps dword ptr [rsp+A4H], xmm0, 2
- vmovdqu xmm0, xmmword ptr [rsp+90H]
- vmovdqu xmmword ptr [rsp+68H], xmm0
- mov rax, qword ptr [rsp+A0H]
- mov qword ptr [rsp+78H], rax
xor r13, r13
mov rax, gword ptr [rbx+08H]
- mov gword ptr [rsp+38H], rax
+ mov gword ptr [rsp+40H], rax
xor edx, edx
mov ecx, dword ptr [rax+08H]
- mov dword ptr [rsp+44H], ecx
+ mov dword ptr [rsp+4CH], ecx
test ecx, ecx
jle G_M31648_IG22
- ;; size=142 bbWeight=208614.94 PerfScore 7579676.13
-G_M31648_IG08: ;; offset=01FFH
- mov dword ptr [rsp+64H], edx
+ ;; size=47 bbWeight=208614.94 PerfScore 4120145.05
+G_M31648_IG08: ;; offset=019CH
+ mov dword ptr [rsp+6CH], edx
mov r8d, edx
mov r8, gword ptr [rax+8*r8+10H]
mov r9, 0x7FF8687322D8 ; Benchmarks.SIMD.RayTracer.Sphere
cmp qword ptr [r8], r9
jne G_M31648_IG15
;; size=31 bbWeight=621931.21 PerfScore 4664484.11
-G_M31648_IG09: ;; offset=021EH
- mov gword ptr [rsp+30H], r8
+G_M31648_IG09: ;; offset=01BBH
+ mov gword ptr [rsp+38H], r8
vmovsd xmm0, qword ptr [r8+14H]
vinsertps xmm0, xmm0, dword ptr [r8+1CH], 40
- vmovaps xmm1, xmm7
- vsubps xmm10, xmm0, xmm1
- vmovaps xmm0, xmm9
- vdpps xmm11, xmm10, xmm0, 127
+ vsubps xmm10, xmm0, xmm7
+ vdpps xmm11, xmm10, xmm9, 127
vxorps xmm0, xmm0, xmm0
vucomiss xmm0, xmm11
ja SHORT G_M31648_IG14
- ;; size=48 bbWeight=385597.35 PerfScore 10346862.32
-G_M31648_IG10: ;; offset=024EH
+ ;; size=39 bbWeight=385597.35 PerfScore 10154063.64
+G_M31648_IG10: ;; offset=01E2H
vcvtss2sd xmm0, xmm0, dword ptr [r8+10H]
vmovaps xmm1, xmm8
call <unknown method>
- vmovsd qword ptr [rsp+A8H], xmm0
+ vmovsd qword ptr [rsp+98H], xmm0
vcvtss2sd xmm0, xmm0, xmm11
vmovaps xmm1, xmm8
call <unknown method>
vdpps xmm1, xmm10, xmm10, 127
vcvtss2sd xmm1, xmm1, xmm1
vsubsd xmm0, xmm1, xmm0
- vmovsd xmm1, qword ptr [rsp+A8H]
+ vmovsd xmm1, qword ptr [rsp+98H]
vsubsd xmm0, xmm1, xmm0
vxorps xmm1, xmm1, xmm1
vucomisd xmm1, xmm0
ja SHORT G_M31648_IG12
;; size=77 bbWeight=327515.07 PerfScore 14028562.26
-G_M31648_IG11: ;; offset=029BH
+G_M31648_IG11: ;; offset=022FH
vsqrtsd xmm0, xmm0, xmm0
vcvtsd2ss xmm0, xmm0, xmm0
vsubss xmm10, xmm11, xmm0
jmp SHORT G_M31648_IG13
;; size=14 bbWeight=109987.30 PerfScore 2309733.36
-G_M31648_IG12: ;; offset=02A9H
+G_M31648_IG12: ;; offset=023DH
vxorps xmm10, xmm10, xmm10
;; size=5 bbWeight=217527.77 PerfScore 72509.26
-G_M31648_IG13: ;; offset=02AEH
+G_M31648_IG13: ;; offset=0242H
vxorps xmm0, xmm0, xmm0
vucomiss xmm10, xmm0
jp SHORT G_M31648_IG18
jne SHORT G_M31648_IG18
;; size=12 bbWeight=385597.35 PerfScore 1670921.87
-G_M31648_IG14: ;; offset=02BAH
+G_M31648_IG14: ;; offset=024EH
xor r8, r8
jmp SHORT G_M31648_IG16
;; size=5 bbWeight=287322.78 PerfScore 646476.25
-G_M31648_IG15: ;; offset=02BFH
- vmovdqu xmm0, xmmword ptr [rsp+68H]
- vmovdqu xmmword ptr [rsp+48H], xmm0
- mov r9, qword ptr [rsp+78H]
- mov qword ptr [rsp+58H], r9
+G_M31648_IG15: ;; offset=0253H
+ vmovsd qword ptr [rsp+70H], xmm7
+ vextractps dword ptr [rsp+78H], xmm7, 2
+ vmovsd qword ptr [rsp+7CH], xmm9
+ vextractps dword ptr [rsp+84H], xmm9, 2
+ vmovdqu xmm0, xmmword ptr [rsp+70H]
+ vmovdqu xmmword ptr [rsp+50H], xmm0
+ mov r9, qword ptr [rsp+80H]
+ mov qword ptr [rsp+60H], r9
mov rcx, r8
- lea rdx, [rsp+48H]
+ lea rdx, [rsp+50H]
mov r8, qword ptr [r8]
mov r8, qword ptr [r8+48H]
call [r8+20H]<unknown method>
mov r8, rax
- ;; size=44 bbWeight=236333.86 PerfScore 3308674.06
-G_M31648_IG16: ;; offset=02EBH
+ ;; size=78 bbWeight=236333.86 PerfScore 5199344.96
+G_M31648_IG16: ;; offset=02A1H
test r8, r8
je SHORT G_M31648_IG21
;; size=5 bbWeight=621931.21 PerfScore 777414.02
-G_M31648_IG17: ;; offset=02F0H
+G_M31648_IG17: ;; offset=02A6H
jmp SHORT G_M31648_IG19
;; size=2 bbWeight=80248.55 PerfScore 160497.11
-G_M31648_IG18: ;; offset=02F2H
+G_M31648_IG18: ;; offset=02A8H
mov rcx, 0x7FF86873C618 ; Benchmarks.SIMD.RayTracer.ISect
call CORINFO_HELP_NEWSFAST
mov r8, rax
- mov gword ptr [rsp+28H], r8
+ mov gword ptr [rsp+30H], r8
lea rcx, bword ptr [r8+08H]
- mov rdx, gword ptr [rsp+30H]
+ mov rdx, gword ptr [rsp+38H]
call CORINFO_HELP_ASSIGN_REF
- mov r8, gword ptr [rsp+28H]
- vmovdqu xmm0, xmmword ptr [rsp+68H]
- vmovdqu xmmword ptr [r8+18H], xmm0
- mov rcx, qword ptr [rsp+78H]
- mov qword ptr [r8+28H], rcx
+ mov r8, gword ptr [rsp+30H]
+ lea rcx, bword ptr [r8+18H]
+ vmovsd qword ptr [rcx], xmm7
+ vextractps dword ptr [rcx+08H], xmm7, 2
+ vmovsd qword ptr [rcx+0CH], xmm9
+ vextractps dword ptr [rcx+14H], xmm9, 2
vcvtss2sd xmm0, xmm0, xmm10
vmovsd qword ptr [r8+10H], xmm0
jmp SHORT G_M31648_IG16
- ;; size=76 bbWeight=385597.35 PerfScore 8097544.42
-G_M31648_IG19: ;; offset=033EH
+ ;; size=82 bbWeight=385597.35 PerfScore 10218329.86
+G_M31648_IG19: ;; offset=02FAH
test r13, r13
jne SHORT G_M31648_IG25
;; size=5 bbWeight=80248.55 PerfScore 100310.69
-G_M31648_IG20: ;; offset=0343H
+G_M31648_IG20: ;; offset=02FFH
mov r13, r8
;; size=3 bbWeight=78703.22 PerfScore 19675.80
-G_M31648_IG21: ;; offset=0346H
- mov edx, dword ptr [rsp+64H]
+G_M31648_IG21: ;; offset=0302H
+ mov edx, dword ptr [rsp+6CH]
inc edx
- mov eax, dword ptr [rsp+44H]
+ mov eax, dword ptr [rsp+4CH]
cmp eax, edx
jg SHORT G_M31648_IG24
;; size=14 bbWeight=621931.21 PerfScore 2176759.25
-G_M31648_IG22: ;; offset=0354H
+G_M31648_IG22: ;; offset=0310H
mov r8, r13
test r8, r8
jne SHORT G_M31648_IG26
;; size=8 bbWeight=208614.94 PerfScore 312922.41
-G_M31648_IG23: ;; offset=035CH
+G_M31648_IG23: ;; offset=0318H
vxorps xmm0, xmm0, xmm0
- vmovaps xmmword ptr [rsp+80H], xmm0
+ vmovups xmmword ptr [rsp+88H], xmm0
jmp SHORT G_M31648_IG27
;; size=15 bbWeight=79255.46 PerfScore 264184.86
-G_M31648_IG24: ;; offset=036BH
- mov rax, gword ptr [rsp+38H]
+G_M31648_IG24: ;; offset=0327H
+ mov rax, gword ptr [rsp+40H]
jmp G_M31648_IG08
;; size=10 bbWeight=310965.61 PerfScore 932896.82
-G_M31648_IG25: ;; offset=0375H
+G_M31648_IG25: ;; offset=0331H
vmovsd xmm0, qword ptr [r13+10H]
vucomisd xmm0, qword ptr [r8+10H]
jbe SHORT G_M31648_IG21
jmp SHORT G_M31648_IG20
;; size=16 bbWeight=1546.37 PerfScore 18556.45
-G_M31648_IG26: ;; offset=0385H
+G_M31648_IG26: ;; offset=0341H
xor edx, edx
mov dword ptr [rsp+20H], edx
- lea rdx, [rsp+80H]
+ lea rdx, [rsp+88H]
mov rcx, rsi
mov r9, rbx
call [<unknown method>]
;; size=26 bbWeight=129359.48 PerfScore 679137.28
-G_M31648_IG27: ;; offset=039FH
- vmovaps xmm0, xmmword ptr [rsp+80H]
+G_M31648_IG27: ;; offset=035BH
+ vmovups xmm0, xmmword ptr [rsp+88H]
vunpckhps xmm1, xmm0, xmm0
vmovss xmm2, dword ptr [reloc @RWD36]
vmulss xmm1, xmm1, xmm2
@@ -376,14 +369,14 @@ G_M31648_IG27: ;; offset=039FH
cmp eax, 255
jg G_M31648_IG34
;; size=40 bbWeight=208614.94 PerfScore 3598607.70
-G_M31648_IG28: ;; offset=03C7H
+G_M31648_IG28: ;; offset=0383H
vmovshdup xmm1, xmm0
vmulss xmm1, xmm1, xmm2
vcvttss2si edx, xmm1
cmp edx, 255
jg G_M31648_IG35
;; size=24 bbWeight=208614.94 PerfScore 2346918.07
-G_M31648_IG29: ;; offset=03DFH
+G_M31648_IG29: ;; offset=039BH
shl edx, 8
or edx, eax
vmulss xmm2, xmm0, xmm2
@@ -391,7 +384,7 @@ G_M31648_IG29: ;; offset=03DFH
cmp eax, 255
jg G_M31648_IG36
;; size=24 bbWeight=208614.94 PerfScore 2294764.33
-G_M31648_IG30: ;; offset=03F7H
+G_M31648_IG30: ;; offset=03B3H
lea ecx, [r12+r14]
cmp ecx, dword ptr [rdi+08H]
jae G_M31648_IG37
@@ -403,19 +396,19 @@ G_M31648_IG30: ;; offset=03F7H
cmp r12d, dword ptr [rsi+20H]
jl G_M31648_IG05
;; size=41 bbWeight=208614.94 PerfScore 2242610.60
-G_M31648_IG31: ;; offset=0420H
+G_M31648_IG31: ;; offset=03DCH
inc ebp
cmp ebp, dword ptr [rsi+24H]
jl G_M31648_IG03
;; size=11 bbWeight=999.00 PerfScore 4245.75
-G_M31648_IG32: ;; offset=042BH
- vmovaps xmm6, xmmword ptr [rsp+100H]
- vmovaps xmm7, xmmword ptr [rsp+F0H]
- vmovaps xmm8, xmmword ptr [rsp+E0H]
- vmovaps xmm9, xmmword ptr [rsp+D0H]
- vmovaps xmm10, xmmword ptr [rsp+C0H]
- vmovaps xmm11, xmmword ptr [rsp+B0H]
- add rsp, 280
+G_M31648_IG32: ;; offset=03E7H
+ vmovaps xmm6, xmmword ptr [rsp+F0H]
+ vmovaps xmm7, xmmword ptr [rsp+E0H]
+ vmovaps xmm8, xmmword ptr [rsp+D0H]
+ vmovaps xmm9, xmmword ptr [rsp+C0H]
+ vmovaps xmm10, xmmword ptr [rsp+B0H]
+ vmovaps xmm11, xmmword ptr [rsp+A0H]
+ add rsp, 264
pop rbx
pop rbp
pop rsi
@@ -426,23 +419,23 @@ G_M31648_IG32: ;; offset=042BH
pop r15
ret
;; size=74 bbWeight=1.00 PerfScore 29.25
-G_M31648_IG33: ;; offset=0475H
+G_M31648_IG33: ;; offset=0431H
vmovss xmm9, dword ptr [reloc @RWD40]
jmp G_M31648_IG07
;; size=13 bbWeight=0 PerfScore 0.00
-G_M31648_IG34: ;; offset=0482H
+G_M31648_IG34: ;; offset=043EH
mov eax, 255
jmp G_M31648_IG28
;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG35: ;; offset=048CH
+G_M31648_IG35: ;; offset=0448H
mov edx, 255
jmp G_M31648_IG29
;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG36: ;; offset=0496H
+G_M31648_IG36: ;; offset=0452H
mov eax, 255
jmp G_M31648_IG30
;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG37: ;; offset=04A0H
+G_M31648_IG37: ;; offset=045CH
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
@@ -454,11 +447,11 @@ RWD36 dd 437F0000h ; 255
RWD40 dd 7F800000h ; inf
-; Total bytes of code 1190, prolog size 103, PerfScore 105089852.53, instruction count 255, allocated bytes for code 1190 (MethodHash=adae845f) for method Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this
+; Total bytes of code 1122, prolog size 103, PerfScore 105396818.57, instruction count 245, allocated bytes for code 1122 (MethodHash=adae845f) for method Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this

This is a case where our lack of handling for call args shows up. We end up with an extra struct copy in
One simple fix in physical promotion for the implicit byref case would be to create a new local to ensure that it is a last use; we can then handle it by our smarter decomposition. That might be a good short-term solution with large benefit.
There is also a redundant
Description
Struct promotion (a.k.a. scalar replacement of aggregates) is an optimization that replaces structs with their constituent fields, allowing those fields to be optimized as if they were normal local variables. It matters a great deal for low-level, performance-oriented code that makes heavy use of structs, so the JIT needs to support it well.
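To make the description above concrete, here is a small hand-written C++ sketch of the transformation. It is an illustration only, not JIT code: the `Point` type and both function names are hypothetical, and the "promoted" version shows by hand what scalar replacement conceptually produces.

```cpp
// Hypothetical two-field struct used only for this illustration.
struct Point {
    int x;
    int y;
};

// Unpromoted form: `p` is one struct local, and every access goes
// through the aggregate (p.x, p.y).
int manhattan_unpromoted(int a, int b) {
    Point p{a, b};
    p.x = p.x < 0 ? -p.x : p.x;
    p.y = p.y < 0 ? -p.y : p.y;
    return p.x + p.y;
}

// Promoted form: each field is replaced by an independent scalar local,
// so each one can be enregistered and optimized like any other local.
int manhattan_promoted(int a, int b) {
    int p_x = a;
    int p_y = b;
    p_x = p_x < 0 ? -p_x : p_x;
    p_y = p_y < 0 ? -p_y : p_y;
    return p_x + p_y;
}
```

Both functions compute the same value; the difference is only in how the struct's storage is represented to the optimizer.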
Limitations
The JIT supports promotion but with the following limitations today:
This issue is about removing (some of) these limitations.
Q1 work items
Q2 work items
Future work items
CQ
`GetElement`/`WithElement` for SIMDs
`GetElement`/`WithElement` (JIT: Generalized struct promotion #76928 (comment))
Throughput
Related issues