JIT: Generalized struct promotion #76928

Closed
26 of 40 tasks
jakobbotsch opened this issue Oct 12, 2022 · 22 comments · Fixed by #88090
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), Priority:2 (Work that is important, but not critical for the release)

@jakobbotsch
Member

jakobbotsch commented Oct 12, 2022

Description

Struct promotion (a.k.a. scalar replacement of aggregates) is an optimization that replaces structs with their constituent fields, allowing those fields to be optimized as if they were normal local variables. This is a very important optimization for low-level, performance-oriented code that makes heavy use of structs, so it is important that the JIT supports it well.

Limitations

The JIT supports promotion but with the following limitations today:

  • Only whole structs with at most 4 fields can be promoted
  • Nested structs are not supported, except when the nested struct is a wrapper around a primitive type
  • A struct must be promoted for the full duration of the function or not at all
  • Structs with overlapping fields are not supported

This issue is about removing (some of) these limitations.
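
The limitations above can be read as an eligibility check. The following is a minimal sketch of such a check, not the JIT's actual logic: the `Field` type, the function names, and the reading of "wrapper around a primitive type" as a nested struct with exactly one primitive field are assumptions made for illustration. The "full duration of the function" limitation is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical field description: byte offset, byte size, nested fields."""
    offset: int
    size: int
    nested: List["Field"] = field(default_factory=list)  # non-empty => nested struct


def is_primitive_wrapper(f: Field) -> bool:
    # Assumption: a "wrapper" is a nested struct with exactly one primitive field.
    return len(f.nested) == 1 and not f.nested[0].nested


def old_promotion_allows(fields: List[Field]) -> bool:
    # Limitation: at most 4 fields.
    if len(fields) > 4:
        return False
    # Limitation: nested structs only when they wrap a primitive.
    for f in fields:
        if f.nested and not is_primitive_wrapper(f):
            return False
    # Limitation: no overlapping fields.
    ordered = sorted(fields, key=lambda f: f.offset)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.offset + prev.size > cur.offset:
            return False
    return True
```

Under this model a struct with five byte-sized fields fails on field count alone, which is the kind of case the generalized pass is meant to cover.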

Q1 work items

Q2 work items

Future work items

CQ

Throughput

Related issues

@ghost

ghost commented Oct 12, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Plan

The preliminary idea is to introduce a new pass that replaces struct fields by new local variables and the "whole struct value" by the reassembling of the promoted fields and the residual fields. The pass will need the proper heuristics to figure out which fields to promote (depending on in which contexts they are used), and potentially in which parts of the function (e.g. due to being address exposed on some paths).

It is likely that some form of struct liveness will be needed by this pass and the hope is that the liveness pass from #76069 will be beneficial here as well.

One difficulty is the representation of multi-reg args and returns at the ABI boundaries. Today they more or less "fall out" from the whole-promotion representation by using the parent struct local as the use/def. A new representation will likely be needed if structs no longer need to be entirely promoted. Initially I expect we can piggyback on the existing mechanism to get to a working prototype. As a long-term goal, however, it would be nice to replace the existing mechanism entirely.

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@jakobbotsch
Member Author

cc @dotnet/jit-contrib

@jakobbotsch
Member Author

The kinds of struct locals we can promote appear only as operands of GT_ASG, GT_CALL and GT_RETURN. For GT_ASG we should be able to decompose directly as part of the generalized promotion pass. That leaves GT_CALL and GT_RETURN.
My thinking for a while was to introduce a new node that represented the assembling of a local and its constituent fields, similar to GT_FIELD_LIST. However, it would need an equivalent to appear on the LHS of assignments from calls, and of course downstream passes would need to be taught to handle this. Also, they would only be necessary on platforms with multi-reg args/returns.

After thinking some more I've come back to another idea which I think I will investigate. Instead of introducing a new node in HIR that persists until lowering, we introduce the equivalent of GT_PUTARG_REG for GT_CALL results and GT_RETURN. That is, we would add a node GT_GETRES_REG that represents using one of the register results of a GT_CALL node, and a node GT_PUTRET_REG that represents placing a value into one of the return registers. LSRA would handle these nodes much as it handles GT_PUTARG_REG today: the registers stay busy until they are used (for calls) or until a GT_RETURN is encountered.

For the generalized struct promotion pass we would then only introduce new assignments: to pass one of these struct locals to a call, or to return one of them, it will just write all the constituent fields back into the struct local and leave the local in the call argument/GT_RETURN.

A priori this would just amount to a lot of unnecessary copying in the generated code. To get to an acceptable state, we would introduce an optimization in lowering to handle these patterns without copying. For example, after generalized struct promotion + rationalization, we would have LIR like:

STORE_LCL_FLD V00 [0..8), <some operand, usually a GT_LCL_VAR>
STORE_LCL_FLD V00 [8..16), <some operand 2>
t0 = LCL_VAR V00
CALL t0

We would then do some analysis to figure out whether V00 is dead after this call (potentially even more precisely, whether V00 [0..8)/V00 [8..16) are dead). If so, we would transform this into

t0 = PUTARG_REG <some operand>
t1 = PUTARG_REG <some operand 2>
CALL t0, t1 // with FIELD_LIST or however the representation is today
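
As a rough illustration of that lowering step, here is a minimal sketch over a toy LIR made of tuples. The tuple encoding, the function name and the `dead_after_call` set (standing in for the liveness analysis mentioned above) are all hypothetical; real LIR nodes and ABI classification are far more involved.

```python
def lower_struct_arg(lir, dead_after_call):
    """Rewrite field stores + LCL_VAR + CALL into PUTARG_REG nodes (toy model).

    Node shapes (assumed for this sketch):
      ("STORE_LCL_FLD", local, (start, end), operand)
      ("LCL_VAR", temp, local)
      ("CALL", [temps])
    """
    out = list(lir)
    for i, node in enumerate(out):
        if node[0] != "CALL" or i == 0:
            continue
        use = out[i - 1]
        # The call must consume exactly the struct local read just before it.
        if use[0] != "LCL_VAR" or node[1] != [use[1]]:
            continue
        local = use[2]
        # Only legal when the struct local is dead after the call.
        if local not in dead_after_call:
            continue
        # Collect the run of field stores to `local` directly before the use.
        start = i - 1
        while (start > 0 and out[start - 1][0] == "STORE_LCL_FLD"
               and out[start - 1][1] == local):
            start -= 1
        stores = out[start:i - 1]
        if not stores:
            continue
        putargs = [("PUTARG_REG", f"t{n}", s[3]) for n, s in enumerate(stores)]
        call = ("CALL", [p[1] for p in putargs])
        return out[:start] + putargs + [call] + out[i + 1:]
    return out
```

When the struct local is still live after the call, the sketch leaves the stores in place, which corresponds to the unoptimized copying described above.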

Similarly for call results, e.g. we would get LIR like the following for generalized promotion after rationalization:

STORE_LCL_VAR V00 (CALL abc)
STORE_LCL_VAR V01 (LCL_FLD V00 [0..8))
STORE_LCL_VAR V02 (LCL_FLD V00 [8..16))

and, as an optimization, lower it into:

CALL abc
STORE_LCL_VAR V01 (GETRES_REG rax)
STORE_LCL_VAR V02 (GETRES_REG rdx)
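
The call-result direction can be sketched the same way. This is only a toy model: the tuple encoding, the function name and the `RETURN_REGS` table (which simply mirrors the rax/rdx example above) are assumptions, and `dead_after_reads` stands in for whatever liveness information would prove the struct local is not needed afterwards.

```python
# Assumed mapping from struct byte ranges to return registers, as in the example.
RETURN_REGS = {(0, 8): "rax", (8, 16): "rdx"}


def lower_call_result(lir, dead_after_reads):
    """Turn a call stored to a struct local plus field reads into GETRES_REG reads."""
    unchanged = list(lir)
    if not lir:
        return unchanged
    head = lir[0]
    is_call_store = (head[0] == "STORE_LCL_VAR"
                     and isinstance(head[2], tuple)
                     and head[2][0] == "CALL")
    if not is_call_store or head[1] not in dead_after_reads:
        return unchanged
    struct_local = head[1]
    lowered = [head[2]]  # the bare CALL node
    for node in lir[1:]:
        src = node[2] if len(node) == 3 else None
        reads_field = (node[0] == "STORE_LCL_VAR"
                       and isinstance(src, tuple)
                       and src[0] == "LCL_FLD"
                       and src[1] == struct_local
                       and src[2] in RETURN_REGS)
        if not reads_field:
            return unchanged  # bail out on anything outside the simple pattern
        lowered.append(("STORE_LCL_VAR", node[1], ("GETRES_REG", RETURN_REGS[src[2]])))
    return lowered
```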

Some questions to investigate:

  • How bad would these IR patterns be for throughput? Certainly introducing these assignments is bloating the IR somewhat (although this seems no different than the standard FIELD_LIST transformation we do for our normal promotion)
  • Can the middle-end optimization passes cope with these IR patterns? They would see tracked locals being stored into some arbitrary struct local, and then that struct local being passed to a call/returned. I'm not sure if this would lose us opportunities we have with whole struct promotion today.
  • How do we do the analysis in lowering? Presumably it would need to be some general kind of struct liveness. For prototyping we can instead create new struct locals for every call site and utilize ref counts to do this.

@tannergooding
Member

tannergooding commented Feb 22, 2023

> The kinds of struct locals we can promote appear only as operands of GT_ASG, GT_CALL and GT_RETURN

What about for GT_HWINTRINSIC, particularly in the case of things like struct M2x4 { Vector4 X; Vector4 Y; } or similar?

Same for cases like GT_INTRINSIC which were originally calls and which may become calls again in some cases (e.g. Math.Pow if constant folding can't happen).

@jakobbotsch
Member Author

> What about for GT_HWINTRINSIC, particularly in the case of things like struct M2x4 { Vector4 X; Vector4 Y; } or similar?

What particular intrinsics take arbitrary struct arguments? Can they be decomposed early like ASG would be?

@SingleAccretion
Contributor

There is #80297, where we are handling an intrinsic that essentially has a "multi-reg arg" via early decomposition into a FIELD_LIST.

@tannergooding
Member

Single beat me to it, that was the example I was going to give 😄

@jakobbotsch
Member Author

jakobbotsch commented Feb 22, 2023

I don't see why the existing approach there wouldn't continue to work. The representation for call args here could also be GT_FIELD_LIST but it would require very early ABI handling for the struct args that I am not a fan of.

@jakobbotsch
Member Author

A similar node would be needed for parameters. Generalized promotion would create IR at the start of functions to load (parts of) parameters into the promoted field locals, and lowering would optimize these into some GT_GETPARAM_REG node when possible. We would then likely use the same source of liveness when homing parameters to figure out whether we can avoid homing some of the struct parameter.

jakobbotsch added a commit that referenced this issue Apr 11, 2023

Introduce a "physical" promotion pass that generalizes the existing promotion.
More specifically, it does not have restrictions on field count and it can
handle arbitrary recursive promotion.

The pass is physical in the sense that it does not rely on any field metadata
for structs. Instead, it works in two separate passes over the IR:

1. In the first pass we find and analyze how unpromoted struct locals are
accessed. For example, for a simple program like:

```
public static void Main()
{
    S s = default;
    Call(s, s.C);
    Console.WriteLine(s.B + s.C);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Call(S s, byte b)
{
}

private struct S
{
    public byte A, B, C, D, E;
}
```

we see IR like:

```
***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------                         ▌  ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] --C-G------                         ▌  CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000007] ----------- arg1                    └──▌  LCL_FLD   ubyte  V00 loc0         [+2]

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] --C-G------                         ▌  CALL      void   System.Console:WriteLine(int)
               [000015] ----------- arg0                    └──▌  ADD       int   
               [000011] -----------                            ├──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000014] -----------                            └──▌  LCL_FLD   ubyte  V00 loc0         [+2]
```

and the analysis produces

```
Accesses for V00
  [000..005)
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (1, 100)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 001
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 002
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)
```

Here each pair is (ref count, weighted ref count).
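
To make the accounting concrete, here is a minimal sketch that reproduces the non-zero entries of the table above from a hand-written list of accesses. The function, the tuple layout and the block weight of 100 (inferred from the weighted counts in the dump) are assumptions for illustration, not the JIT's data structures.

```python
from collections import defaultdict


def tally_accesses(accesses):
    """Accumulate (ref count, weighted ref count) per accessed range and kind.

    Each access is (offset, size, kinds, block_weight); "#" is the total row.
    """
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for offset, size, kinds, block_weight in accesses:
        for kind in ("#", *kinds):
            counts = table[(offset, size)][kind]
            counts[0] += 1
            counts[1] += block_weight
    return {rng: {kind: tuple(c) for kind, c in per_kind.items()}
            for rng, per_kind in table.items()}


# The five accesses to V00 in the IR above, all in BB01 (assumed weight 100).
V00_ACCESSES = [
    (0, 5, ("assigned to",), 100),                          # ASG struct (init)
    (0, 5, ("call arg", "implicit by-ref call arg"), 100),  # arg0 of Call
    (2, 1, ("call arg",), 100),                             # arg1 of Call (s.C)
    (1, 1, (), 100),                                        # s.B in the ADD
    (2, 1, (), 100),                                        # s.C in the ADD
]
```

Calling `tally_accesses(V00_ACCESSES)` yields totals of (2, 200) for the whole-struct range, (1, 100) for the byte at offset 1 and (2, 200) for the byte at offset 2.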

Based on this accounting, the analysis estimates the profitability of replacing
some of the accessed parts of the struct with a local. This may be costly
because overlapping struct accesses (e.g. passing the whole struct as an
argument) may require more expensive codegen after promotion. And of course,
creating new locals introduces more register pressure. Currently the
profitability analysis is very crude.

In this case the logic decides that promotion is not worth it:

```
Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Disqualifying replacement
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Disqualifying replacement
```
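
The decision visible in these logs can be summarized by a small sketch. How the two costs are computed is not shown here, so the function only models the final comparison; its name, the strict `<` and the `stress` parameter (standing in for the STRESS_PHYSICAL_PROMOTION_COST stress mode) are assumptions.

```python
def should_promote(cost_with, cost_without, stress=False):
    """Return True if a candidate replacement should be promoted (toy model)."""
    # Stress mode forces promotion regardless of the estimated costs.
    if stress:
        return True
    # Assumption: strict comparison; the tie-breaking rule is not stated.
    return cost_with < cost_without
```

With the numbers from the log, `should_promote(1350, 650)` and `should_promote(1700, 1300)` are both false, matching the "Disqualifying replacement" lines.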

2. In the second pass the field accesses are replaced with new locals for the
profitable cases. For overlapping accesses that currently involves writing back
replacements to the struct local first. For arguments/OSR locals, it involves
reading them back from the struct first.

In the above case we can override the profitability analysis with stress mode
STRESS_PHYSICAL_PROMOTION_COST and we get:

```
Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Promoting replacement due to stress

lvaGrabTemp returning 2 (V02 tmp1) (a long lifetime temp) called for V00.[001..002).
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Promoting replacement due to stress

lvaGrabTemp returning 3 (V03 tmp2) (a long lifetime temp) called for V00.[002..003).
V00 promoted with 2 replacements
  [001..002) promoted as ubyte V02
  [002..003) promoted as ubyte V03

...

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------                         ▌  ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] -ACXG------                         ▌  CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000022] -A--------- arg1                    └──▌  COMMA     ubyte 
               [000021] -A---------                            ├──▌  ASG       ubyte 
               [000019] D------N---                            │  ├──▌  LCL_VAR   ubyte  V03 tmp2         
               [000020] -----------                            │  └──▌  LCL_FLD   ubyte  V00 loc0         [+2]
               [000018] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] -ACXG------                         ▌  CALL      void   System.Console:WriteLine(int)
               [000015] -A--------- arg0                    └──▌  ADD       int   
               [000027] -A---------                            ├──▌  COMMA     ubyte 
               [000026] -A---------                            │  ├──▌  ASG       ubyte 
               [000024] D------N---                            │  │  ├──▌  LCL_VAR   ubyte  V02 tmp1         
               [000025] -----------                            │  │  └──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000023] -----------                            │  └──▌  LCL_VAR   ubyte  V02 tmp1         
               [000028] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         
```
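
The shape of the rewritten IR (a read-back on the first use of each replacement, then plain local uses) can be mimicked with a small sketch. This is an illustration inferred from the dump, not the pass itself: the operation names, the function and the two per-replacement flags are hypothetical.

```python
def replace_field_accesses(ops, replacements):
    """Rewrite field accesses to use replacement locals (toy model).

    ops: ("store_struct",), ("use_struct",), ("load_field", offset),
         ("store_field", offset)
    replacements: dict mapping field offset -> replacement local name
    """
    needs_read_back = dict.fromkeys(replacements, False)
    needs_write_back = dict.fromkeys(replacements, False)
    out = []
    for op in ops:
        kind = op[0]
        if kind == "store_struct":
            # The struct now holds the freshest values; locals are stale.
            out.append(op)
            for offset in replacements:
                needs_read_back[offset] = True
                needs_write_back[offset] = False
        elif kind == "use_struct":
            # Overlapping use: flush modified replacements to the struct first.
            for offset, local in replacements.items():
                if needs_write_back[offset]:
                    out.append(("write_back", local, offset))
                    needs_write_back[offset] = False
            out.append(op)
        elif kind == "load_field" and op[1] in replacements:
            offset = op[1]
            if needs_read_back[offset]:
                out.append(("read_back", replacements[offset], offset))
                needs_read_back[offset] = False
            out.append(("load_local", replacements[offset]))
        elif kind == "store_field" and op[1] in replacements:
            offset = op[1]
            out.append(("store_local", replacements[offset]))
            needs_read_back[offset] = False
            needs_write_back[offset] = True
        else:
            out.append(op)
    return out
```

Feeding it the sequence from the example (struct init, whole-struct call argument, then loads at offsets 2, 1 and 2) with replacements `{1: "V02", 2: "V03"}` gives a read-back of V03 before its first use, a read-back of V02 before its use, and a plain use of V03 at the end, in the same order as the COMMA nodes above.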

The pass still only has rudimentary support and is missing many basic CQ
optimizations. For example, it does not make use of any liveness
yet and it does not have any decomposition support for assignments. Yet, it
already shows good potential in user benchmarks. I have listed some follow-up
improvements in #76928.

This PR is adding the pass but it is disabled by default. It can be enabled by
setting DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION. There are two new
scenarios added to jit-experimental that enable it, to be used for testing
purposes.
@jakobbotsch
Member Author

jakobbotsch commented Apr 14, 2023

Some measurements over asp.net for block copies/inits and whether they involve promoted structs:

Copies physical -> physical: 3
Copies physical -> old:      283
Copies old      -> physical: 250
Copies physical ->         : 65
Copies          -> physical: 268
Inits           -> physical: 37

("old" means structs that are promoted by the normal mechanism)

It would be great to reuse block morphing to do the decomposition, but I'm not sure how simple that would be -- the decomposition for copies involving physically promoted structs is quite a bit more complicated.

Same measurements with old promotion disabled:

Copies physical -> physical: 162
Copies physical -> old:      0
Copies old      -> physical: 0
Copies physical ->         : 1332
Copies          -> physical: 6034
Inits           -> physical: 99

@jakobbotsch
Member Author

We frequently see promotion opportunities for standard C# code iterating lists via List<T>.Enumerator, e.g.:
https://github.com/dotnet/aspnetcore/blob/8968058c9e5fdfdd1242426a03dc80609997edab/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelConnection.cs#L51-L54

The codegen for the loop ends up with the following diff:

@@ -1,35 +1,33 @@
 G_M45156_IG09:
-       mov      rax, gword ptr [rbp-28H]
-       mov      rdx, gword ptr [rbp-20H]
+       mov      rax, gword ptr [rbp-38H]
+       mov      rdx, gword ptr [rbp-30H]
        mov      rcx, gword ptr [rax+08H]
        call     [rax+18H]System.Action`1[System.__Canon]:Invoke(System.__Canon):this
 						;; size=15 bbWeight=1 PerfScore 7.00
 G_M45156_IG10:
-       mov      rcx, gword ptr [rbp-38H]
-       mov      esi, dword ptr [rbp-2CH]
-       mov      edi, dword ptr [rcx+14H]
-       cmp      esi, edi
+       mov      rcx, rsi
+       mov      r14d, dword ptr [rcx+14H]
+       cmp      edi, r14d
        jne      SHORT G_M45156_IG14
-       mov      edx, dword ptr [rbp-30H]
-       cmp      edx, dword ptr [rcx+10H]
+       cmp      ebx, dword ptr [rsi+10H]
        jae      SHORT G_M45156_IG15
-						;; size=22 bbWeight=2 PerfScore 20.50
+						;; size=17 bbWeight=2 PerfScore 15.00
 G_M45156_IG11:
-       mov      rcx, gword ptr [rcx+08H]
-       mov      eax, edx
-       cmp      eax, dword ptr [rcx+08H]
+       mov      rcx, gword ptr [rsi+08H]
+       cmp      ebx, dword ptr [rcx+08H]
        jae      SHORT G_M45156_IG08
-       shl      rax, 4
+       mov      edx, ebx
+       shl      rdx, 4
 						;; size=15 bbWeight=1 PerfScore 6.75
 G_M45156_IG12:
-       vmovdqu  xmm0, xmmword ptr [rcx+rax+10H]
-       vmovdqu  xmmword ptr [rbp-28H], xmm0
+       vmovdqu  xmm0, xmmword ptr [rcx+rdx+10H]
+       vmovdqu  xmmword ptr [rbp-38H], xmm0
 						;; size=11 bbWeight=1 PerfScore 5.00
 G_M45156_IG13:
-       inc      edx
-       mov      dword ptr [rbp-30H], edx
+       inc      ebx
        jmp      SHORT G_M45156_IG09
-						;; size=7 bbWeight=1 PerfScore 3.25
+						;; size=4 bbWeight=1 PerfScore 2.25
 G_M45156_IG14:
-       cmp      esi, edi
+       cmp      edi, r14d
        jne      SHORT G_M45156_IG07

@jakobbotsch
Member Author

jakobbotsch commented Jun 7, 2023

Investigating some current causes of regressions when enabling physical promotion by default.

(edit: handled by #87265)

aspnet.run.windows.x64.checked.mch:

+37 (+14.57%) : 18820.dasm - System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
@@ -14,40 +14,44 @@
 ;* V04 loc3         [V04    ] (  0,  0   )     int  ->  zero-ref   
 ;  V05 loc4         [V05,T04] (  3,  6   )     int  ->  rcx        
 ;  V06 OutArgs      [V06    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V07 tmp1         [V07,T06] (  5,  5   )  struct (32) [rsp+28H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+;  V07 tmp1         [V07,T06] (  4,  4   )  struct (32) [rsp+48H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
 ;* V08 tmp2         [V08    ] (  0,  0   )    long  ->  zero-ref    "spilling helperCall"
-;  V09 tmp3         [V09,T10] (  2,  2   )     ref  ->  rax         class-hnd single-def "Inlining Arg"
+;  V09 tmp3         [V09,T11] (  2,  2   )     ref  ->  rax         class-hnd single-def "Inlining Arg"
 ;* V10 tmp4         [V10    ] (  0,  0   )  struct (24) zero-ref    "Inlining Arg"
-;* V11 tmp5         [V11    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[S] "Inlining Arg"
-;  V12 tmp6         [V12,T11] (  2,  1   )     ref  ->  rdx         single-def V10.<TokenSource>k__BackingField(offs=0x00) P-INDEP "field V10.<TokenSource>k__BackingField (fldOffset=0x0)"
-;  V13 tmp7         [V13,T12] (  2,  1   )     ref  ->  rcx         single-def V10.<ChangeToken>k__BackingField(offs=0x08) P-INDEP "field V10.<ChangeToken>k__BackingField (fldOffset=0x8)"
-;  V14 tmp8         [V14,T13] (  2,  1   )     ref  ->   r8         single-def V10.<Matcher>k__BackingField(offs=0x10) P-INDEP "field V10.<Matcher>k__BackingField (fldOffset=0x10)"
-;  V15 cse0         [V15,T07] (  2,  4   )     int  ->  rax         "CSE - aggressive"
-;* V16 rat0         [V16,T09] (  0,  0   )    long  ->  zero-ref    "Spilling to split statement for tree"
-;* V17 rat1         [V17,T14] (  0,  0   )    long  ->  zero-ref    "runtime lookup"
-;* V18 rat2         [V18,T08] (  0,  0   )    long  ->  zero-ref    "fgMakeTemp is creating a new local variable"
-;  V19 rat3         [V19,T05] (  3,  6   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
+;  V11 tmp5         [V11,T08] (  3,  3   )  struct (32) [rsp+28H]   do-not-enreg[S] must-init "Inlining Arg"
+;  V12 tmp6         [V12,T12] (  2,  1   )     ref  ->  rdx         single-def V10.<TokenSource>k__BackingField(offs=0x00) P-INDEP "field V10.<TokenSource>k__BackingField (fldOffset=0x0)"
+;  V13 tmp7         [V13,T13] (  2,  1   )     ref  ->  rcx         single-def V10.<ChangeToken>k__BackingField(offs=0x08) P-INDEP "field V10.<ChangeToken>k__BackingField (fldOffset=0x8)"
+;  V14 tmp8         [V14,T14] (  2,  1   )     ref  ->   r8         single-def V10.<Matcher>k__BackingField(offs=0x10) P-INDEP "field V10.<Matcher>k__BackingField (fldOffset=0x10)"
+;* V15 tmp9         [V15    ] (  0,  0   )     ref  ->  zero-ref    single-def "V07.[000..008)"
+;  V16 cse0         [V16,T07] (  2,  4   )     int  ->  rax         "CSE - aggressive"
+;* V17 rat0         [V17,T10] (  0,  0   )    long  ->  zero-ref    "Spilling to split statement for tree"
+;* V18 rat1         [V18,T15] (  0,  0   )    long  ->  zero-ref    "runtime lookup"
+;* V19 rat2         [V19,T09] (  0,  0   )    long  ->  zero-ref    "fgMakeTemp is creating a new local variable"
+;  V20 rat3         [V20,T05] (  3,  6   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
 ;
-; Lcl frame size = 72
+; Lcl frame size = 104
 
 G_M47209_IG01:        ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
        push     rdi
        push     rsi
        push     rbp
        push     rbx
-       sub      rsp, 72
+       sub      rsp, 104
+       vzeroupper 
        xor      eax, eax
        mov      qword ptr [rsp+28H], rax
        vxorps   xmm4, xmm4, xmm4
        vmovdqa  xmmword ptr [rsp+30H], xmm4
-       mov      qword ptr [rsp+40H], rax
+       vmovdqa  xmmword ptr [rsp+40H], xmm4
+       vmovdqa  xmmword ptr [rsp+50H], xmm4
+       mov      qword ptr [rsp+60H], rax
        mov      rbx, rcx
        ; gcrRegs +[rbx]
-						;; size=33 bbWeight=1 PerfScore 9.08
+						;; size=48 bbWeight=1 PerfScore 14.08
 G_M47209_IG02:        ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        mov      edx, dword ptr [rbx+24H]
        cmp      edx, 2
-       ja       G_M47209_IG08
+       ja       G_M47209_IG10
        lea      rcx, [reloc @RWD00]
        mov      ecx, dword ptr [rcx+4*rdx]
        lea      rax, G_M47209_IG02
@@ -66,7 +70,7 @@ G_M47209_IG03:        ; bbWeight=0.50, gcrefRegs=0008 {rbx}, byrefRegs=0000 {},
        ; byrRegs -[rcx]
        mov      dword ptr [rbx+20H], -1
 						;; size=28 bbWeight=0.50 PerfScore 4.25
-G_M47209_IG04:        ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+G_M47209_IG04:        ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        mov      rdx, gword ptr [rbx+10H]
        ; gcrRegs +[rdx]
        mov      ecx, dword ptr [rbx+20H]
@@ -74,7 +78,7 @@ G_M47209_IG04:        ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byr
        mov      dword ptr [rbx+20H], ecx
        mov      eax, dword ptr [rdx+08H]
        cmp      eax, ecx
-       jbe      SHORT G_M47209_IG08
+       jbe      G_M47209_IG10
        mov      rdx, gword ptr [rdx+8*rcx+10H]
        lea      rcx, bword ptr [rbx+18H]
        ; byrRegs +[rcx]
@@ -82,7 +86,7 @@ G_M47209_IG04:        ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byr
        ; gcrRegs -[rdx]
        ; byrRegs -[rcx]
        mov      dword ptr [rbx+24H], 2
-						;; size=40 bbWeight=2 PerfScore 26.00
+						;; size=44 bbWeight=2 PerfScore 26.00
 G_M47209_IG05:        ; bbWeight=4, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
        mov      rbp, gword ptr [rbx+18H]
        ; gcrRegs +[rbp]
@@ -98,10 +102,16 @@ G_M47209_IG06:        ; bbWeight=0.50, gcrefRegs=0028 {rbx rbp}, byrefRegs=0000
        ; gcrRegs +[rcx]
        mov      r8, gword ptr [rbp+30H]
        ; gcrRegs +[r8]
+       mov      gword ptr [rsp+50H], rdx
+       mov      gword ptr [rsp+58H], rcx
+       mov      gword ptr [rsp+60H], r8
+						;; size=31 bbWeight=0.50 PerfScore 5.50
+G_M47209_IG07:        ; bbWeight=0.50, nogc, extend
+       vmovdqu  ymm0, ymmword ptr [rsp+48H]
+       vmovdqu  ymmword ptr [rsp+28H], ymm0
+						;; size=12 bbWeight=0.50 PerfScore 2.50
+G_M47209_IG08:        ; bbWeight=0.50, extend
        mov      gword ptr [rsp+28H], rax
-       mov      gword ptr [rsp+30H], rdx
-       mov      gword ptr [rsp+38H], rcx
-       mov      gword ptr [rsp+40H], r8
        lea      rdi, bword ptr [rbx+28H]
        ; byrRegs +[rdi]
        lea      rsi, bword ptr [rsp+28H]
@@ -119,33 +129,35 @@ G_M47209_IG06:        ; bbWeight=0.50, gcrefRegs=0028 {rbx rbp}, byrefRegs=0000
        ; gcrRegs -[rdx rbp]
        ; byrRegs -[rcx rsi rdi]
        mov      eax, 1
-						;; size=83 bbWeight=0.50 PerfScore 10.38
-G_M47209_IG07:        ; bbWeight=0.50, epilog, nogc, extend
-       add      rsp, 72
+						;; size=52 bbWeight=0.50 PerfScore 4.88
+G_M47209_IG09:        ; bbWeight=0.50, epilog, nogc, extend
+       vzeroupper 
+       add      rsp, 104
        pop      rbx
        pop      rbp
        pop      rsi
        pop      rdi
        ret      
-						;; size=9 bbWeight=0.50 PerfScore 1.62
-G_M47209_IG08:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
+						;; size=12 bbWeight=0.50 PerfScore 2.12
+G_M47209_IG10:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
        mov      dword ptr [rbx+24H], 3
        xor      eax, eax
 						;; size=9 bbWeight=0.50 PerfScore 0.62
-G_M47209_IG09:        ; bbWeight=0.50, epilog, nogc, extend
-       add      rsp, 72
+G_M47209_IG11:        ; bbWeight=0.50, epilog, nogc, extend
+       vzeroupper 
+       add      rsp, 104
        pop      rbx
        pop      rbp
        pop      rsi
        pop      rdi
        ret      
-						;; size=9 bbWeight=0.50 PerfScore 1.62
+						;; size=12 bbWeight=0.50 PerfScore 2.12
 RWD00  	dd	G_M47209_IG03 - G_M47209_IG02
        	dd	G_M47209_IG04 - G_M47209_IG02
        	dd	G_M47209_IG05 - G_M47209_IG02
 
 
-; Total bytes of code 254, prolog size 33, PerfScore 100.98, instruction count 71, allocated bytes for code 254 (MethodHash=bed44796) for method System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
+; Total bytes of code 291, prolog size 48, PerfScore 113.18, instruction count 78, allocated bytes for code 291 (MethodHash=bed44796) for method System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
 ; ============================================================

Promotions:

Accesses for V07
  ref @ 000
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  [000..032) as System.Collections.Generic.KeyValuePair`2[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]
    #:                             (1, 100)
    # assigned from:               (1, 100)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  [008..032) as Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Accesses for V11
  [000..032) as System.Collections.Generic.KeyValuePair`2[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]
    #:                             (2, 200)
    # assigned from:               (1, 100)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Picking promotions for V07
  Evaluating access ref @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement

lvaGrabTemp returning 15 (V15 tmp9) (a long lifetime temp) called for V07.[000..008).

V07 promoted with 1 replacements
  [000..008) promoted as ref V15
Computing unpromoted remainder for V07
  Remainder: [008..032)
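
The remainder in this log is the part of the struct not covered by any replacement. A minimal sketch of that computation follows; the function name is hypothetical and the sketch treats every uncovered byte as remainder, ignoring padding.

```python
def unpromoted_remainder(struct_size, promoted):
    """Return the byte ranges [start, end) not covered by promoted ranges."""
    remainder = []
    cursor = 0
    for start, end in sorted(promoted):
        if start > cursor:
            remainder.append((cursor, start))  # gap before this replacement
        cursor = max(cursor, end)
    if cursor < struct_size:
        remainder.append((cursor, struct_size))  # tail after the last replacement
    return remainder
```

For V07 (32 bytes, with [000..008) promoted) this gives the single range (8, 32), matching the log.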

We end up with the following decomposition:

STMT00025 ( 0x096[--] ... ??? )
               [000107] DA---------                           STORE_LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V11 tmp5         
               [000036] -----------                         └──▌  LCL_VAR   struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1          (last use)
Processing block operation [000107] that involves replacements
  dst+000 <- V15 (V07.[000..008)) (last use)
  Remainder: [008..032)
  => Remainder strategy: retain a full block op

Local V11 should not be enregistered because: was accessed as a local field
New statement:
STMT00025 ( 0x096[--] ... ??? )
               [000112] -A---------                           COMMA     void  
               [000107] DA---------                         ├──▌  STORE_LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V11 tmp5         
               [000036] -----------                           └──▌  LCL_VAR   struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1         
               [000111] UA---------                         └──▌  STORE_LCL_FLD ref    V11 tmp5         [+0]
               [000110] -----------                            └──▌  LCL_VAR   ref    V15 tmp9          (last use)

However, the last use of V11 came right after STMT00025, and with the decomposed store we are no longer able to forward sub into it:

-    [000107]:  [000104] is last use of [000107] (V11)  -- fwd subbing [000036]; new next stmt is
-STMT00024 ( INL02 @ 0x000[E-] ... ??? ) <- INLRT @ 0x096[--]
-               [000106] nA-XG------                         ▌  STORE_BLK struct<System.Collections.Generic.KeyValuePair`2, 32> (copy)
-               [000105] ---X-------                         ├──▌  FIELD_ADDR byref  <unknown class>:<unknown field>
-               [000020] -----------                         │  └──▌  LCL_VAR   ref    V00 this         
-               [000036] -----------                         └──▌  LCL_VAR   struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1          (last use)
-
-removing useless STMT00025 ( 0x096[--] ... ??? )
-               [000107] DA---------                         ▌  STORE_LCL_VAR struct<System.Collections.Generic.KeyValuePair`2, 32> V11 tmp5         
-               [000036] -----------                         └──▌  LCL_VAR   struct<System.Collections.Generic.KeyValuePair`2, 32> V07 tmp1          (last use)
- from BB07

It would be possible to look ahead to predict this situation and then handle the store by writing V15 back to V07 ahead of it instead. Alternatively, we could run forward sub before physical promotion.
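
For reference, the "Computing unpromoted remainder" step in the dump above amounts to interval subtraction: take the byte ranges of the struct that hold data and remove the ranges covered by replacements. Below is a minimal C++ sketch of that idea. `Segment` and `ComputeRemainder` are hypothetical names used for illustration, not the JIT's actual data structures, and both inputs are assumed to be sorted and non-overlapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open byte range [start, end) within a struct local.
struct Segment
{
    unsigned start;
    unsigned end;
};

// Hypothetical helper: subtracts the promoted replacement ranges from the
// ranges of the struct that hold data ("significant" ranges, i.e. no padding).
// Both inputs must be sorted by start offset and non-overlapping.
// Returns the ranges that still live only in the struct local (the remainder).
std::vector<Segment> ComputeRemainder(const std::vector<Segment>& significant,
                                      const std::vector<Segment>& replacements)
{
    std::vector<Segment> remainder;
    std::size_t          repIndex = 0;

    for (const Segment& seg : significant)
    {
        assert(seg.start < seg.end);
        unsigned cursor = seg.start;

        // Skip replacements that end before this segment begins.
        while ((repIndex < replacements.size()) && (replacements[repIndex].end <= cursor))
        {
            repIndex++;
        }

        // Walk the replacements that overlap this segment and emit the gaps.
        for (std::size_t i = repIndex; (i < replacements.size()) && (replacements[i].start < seg.end); i++)
        {
            if (replacements[i].start > cursor)
            {
                remainder.push_back({cursor, replacements[i].start});
            }
            cursor = std::max(cursor, replacements[i].end);
        }

        // Whatever is left after the last overlapping replacement is remainder.
        if (cursor < seg.end)
        {
            remainder.push_back({cursor, seg.end});
        }
    }

    return remainder;
}
```

With the range [000..032) and the single replacement [000..008) from V07, this yields [008..032), matching the dump. Treating padding as not significant is an assumption of the sketch.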

@jakobbotsch
Member Author

jakobbotsch commented Jun 7, 2023

(edit: partially handled by #87217, rest will be handled by #87410)

+21 (+18.10%) : 90439.dasm - Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
@@ -7,79 +7,88 @@
 ; 0 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T00] (  9, 27   )  struct ( 8) [rsp+30H]   do-not-enreg[SF] ld-addr-op single-def
-;  V01 loc0         [V01,T02] (  4,  9   )     int  ->  rcx        
-;* V02 loc1         [V02    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
+;  V00 arg0         [V00,T01] (  5, 11   )  struct ( 8) [rsp+40H]   do-not-enreg[SF] ld-addr-op single-def
+;  V01 loc0         [V01,T06] (  4,  9   )     int  ->   r8        
+;  V02 loc1         [V02    ] (  4, 14   )  struct ( 8) [rsp+30H]   do-not-enreg[SF] must-init ld-addr-op
 ;* V03 loc2         [V03    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
 ;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V05 tmp1         [V05    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "NewObj constructor temp"
-;* V06 tmp2         [V06    ] (  0,  0   )  struct ( 8) zero-ref   
+;  V06 tmp2         [V06    ] (  5, 14   )  struct ( 8) [rsp+28H]   do-not-enreg[SF] must-init
 ;* V07 tmp3         [V07    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
-;  V08 tmp4         [V08,T05] (  2,  8   )    bool  ->  rax         V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)"
-;  V09 tmp5         [V09,T06] (  2,  6   )     int  ->  rdx         V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)"
+;  V08 tmp4         [V08,T03] (  3, 12   )    bool  ->  [rsp+30H]   do-not-enreg[] V02.hasValue(offs=0x00) P-DEP "field V02.hasValue (fldOffset=0x0)"
+;  V09 tmp5         [V09,T08] (  2,  6   )     int  ->  [rsp+34H]   do-not-enreg[] V02.value(offs=0x04) P-DEP "field V02.value (fldOffset=0x4)"
 ;* V10 tmp6         [V10    ] (  0,  0   )    bool  ->  zero-ref    V03.hasValue(offs=0x00) P-INDEP "field V03.hasValue (fldOffset=0x0)"
 ;* V11 tmp7         [V11    ] (  0,  0   )     int  ->  zero-ref    V03.value(offs=0x04) P-INDEP "field V03.value (fldOffset=0x4)"
-;* V12 tmp8         [V12,T08] (  0,  0   )    bool  ->  zero-ref    V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
-;  V13 tmp9         [V13,T07] (  2,  4   )     int  ->   r9         V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
-;  V14 tmp10        [V14,T03] (  3,  8   )    bool  ->   r8         V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)"
-;  V15 tmp11        [V15,T04] (  3,  8   )     int  ->   r9         V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)"
-;  V16 rat0         [V16,T01] (  3, 12   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
+;* V12 tmp8         [V12,T10] (  0,  0   )    bool  ->  zero-ref    V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
+;  V13 tmp9         [V13,T09] (  2,  4   )     int  ->  rdx         V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
+;  V14 tmp10        [V14,T02] (  4, 12   )    bool  ->  [rsp+28H]   do-not-enreg[] V06.hasValue(offs=0x00) P-DEP "field V06.hasValue (fldOffset=0x0)"
+;  V15 tmp11        [V15,T07] (  3,  8   )     int  ->  [rsp+2CH]   do-not-enreg[] V06.value(offs=0x04) P-DEP "field V06.value (fldOffset=0x4)"
+;  V16 tmp12        [V16,T00] (  5, 14   )    bool  ->  rcx         "V00.[000..001)"
+;  V17 cse0         [V17,T04] (  3, 12   )     int  ->  rax         "CSE - aggressive"
+;  V18 rat0         [V18,T05] (  3, 12   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
 ;
-; Lcl frame size = 40
+; Lcl frame size = 56
 
 G_M24602_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-       sub      rsp, 40
-       mov      qword ptr [rsp+30H], rcx
-						;; size=9 bbWeight=1 PerfScore 1.25
+       sub      rsp, 56
+       xor      eax, eax
+       mov      qword ptr [rsp+30H], rax
+       mov      qword ptr [rsp+28H], rax
+       mov      qword ptr [rsp+40H], rcx
+						;; size=21 bbWeight=1 PerfScore 3.50
 G_M24602_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       cmp      byte  ptr [rsp+30H], 0
+       movzx    rcx, byte  ptr [rsp+40H]
+       test     ecx, ecx
        jne      SHORT G_M24602_IG05
-						;; size=7 bbWeight=1 PerfScore 3.00
+						;; size=9 bbWeight=1 PerfScore 2.25
 G_M24602_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
 G_M24602_IG04:        ; bbWeight=0.50, epilog, nogc, extend
-       add      rsp, 40
+       add      rsp, 56
        ret      
 						;; size=5 bbWeight=0.50 PerfScore 0.62
 G_M24602_IG05:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
-       xor      ecx, ecx
-						;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG06:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       movzx    rax, byte  ptr [rsp+30H]
-       mov      edx, dword ptr [rsp+34H]
-       test     al, al
-       jne      SHORT G_M24602_IG08
-						;; size=13 bbWeight=4 PerfScore 13.00
-G_M24602_IG07:        ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
        xor      r8d, r8d
-       xor      r9d, r9d
-       jmp      SHORT G_M24602_IG09
-						;; size=8 bbWeight=2 PerfScore 5.00
-G_M24602_IG08:        ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       mov      r8d, 0xD1FFAB1E
-       mov      eax, r8d
-       imul     edx:eax, edx
-       mov      r9d, edx
-       shr      r9d, 31
-       sar      edx, 2
-       add      r9d, edx
-       mov      r8d, 1
-						;; size=30 bbWeight=2 PerfScore 10.50
-G_M24602_IG09:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       mov      byte  ptr [rsp+30H], r8b
-       mov      dword ptr [rsp+34H], r9d
-       inc      ecx
+						;; size=3 bbWeight=0.50 PerfScore 0.12
+G_M24602_IG06:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+       mov      byte  ptr [rsp+30H], cl
+       mov      eax, dword ptr [rsp+44H]
+       mov      dword ptr [rsp+34H], eax
        cmp      byte  ptr [rsp+30H], 0
+       jne      SHORT G_M24602_IG08
+						;; size=19 bbWeight=4 PerfScore 24.00
+G_M24602_IG07:        ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+       xor      eax, eax
+       mov      qword ptr [rsp+28H], rax
+       jmp      SHORT G_M24602_IG09
+						;; size=9 bbWeight=2 PerfScore 6.50
+G_M24602_IG08:        ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+       mov      edx, 0xD1FFAB1E
+       mov      eax, edx
+       imul     edx:eax, dword ptr [rsp+34H]
+       mov      ecx, edx
+       shr      ecx, 31
+       sar      edx, 2
+       add      edx, ecx
+       mov      byte  ptr [rsp+28H], 1
+       mov      dword ptr [rsp+2CH], edx
+						;; size=30 bbWeight=2 PerfScore 18.00
+G_M24602_IG09:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+       movzx    rcx, byte  ptr [rsp+28H]
+       mov      eax, dword ptr [rsp+2CH]
+       mov      dword ptr [rsp+44H], eax
+       inc      r8d
+       test     ecx, ecx
        je       SHORT G_M24602_IG12
-       cmp      dword ptr [rsp+34H], 0
+       test     eax, eax
        jg       SHORT G_M24602_IG06
-						;; size=26 bbWeight=4 PerfScore 33.00
+						;; size=24 bbWeight=4 PerfScore 23.00
 G_M24602_IG10:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       mov      eax, ecx
-						;; size=2 bbWeight=0.50 PerfScore 0.12
+       mov      eax, r8d
+						;; size=3 bbWeight=0.50 PerfScore 0.12
 G_M24602_IG11:        ; bbWeight=0.50, epilog, nogc, extend
-       add      rsp, 40
+       add      rsp, 56
        ret      
 						;; size=5 bbWeight=0.50 PerfScore 0.62
 G_M24602_IG12:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
@@ -88,7 +97,7 @@ G_M24602_IG12:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {
        int3     
 						;; size=7 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 116, prolog size 9, PerfScore 78.98, instruction count 35, allocated bytes for code 116 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
+; Total bytes of code 137, prolog size 21, PerfScore 92.58, instruction count 42, allocated bytes for code 137 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
 ; ============================================================

Promotions:

Accesses for V00
  bool @ 000
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  [000..008) as System.Nullable`1[int]
    #:                             (2, 200)
    # assigned from:               (1, 100)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  int @ 004
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Picking promotions for V00
  Evaluating access bool @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 100
    Cost with: 400
    Cost without: 600
  Promoting replacement


lvaGrabTemp returning 16 (V16 tmp12) (a long lifetime temp) called for V00.[000..001).
  Evaluating access int @ 004
    Single write-back cost: 3
    Write backs: 0
    Read backs: 100
    Cost with: 350
    Cost without: 300
  Disqualifying replacement


V00 promoted with 1 replacements
  [000..001) promoted as bool V16
Computing unpromoted remainder for V00
  Remainder: [004..008)

Two problems and a comment here:

1. Promoting V00.[004..008) would be very beneficial because the assignments that V00 is used in are to another struct with a correspondingly promoted field, e.g.:
STMT00003 ( 0x00D[E-] ... 0x00E )
               [000009] DA---------                           STORE_LCL_VAR struct<System.Nullable`1, 8>(P) V02 loc1         
                                                                bool   V02.<unknown class>:hasValue (offs=0x00) -> V08 tmp4         
                                                                int    V02.<unknown class>:value (offs=0x04) -> V09 tmp5         
               [000008] -----------                         └──▌  LCL_VAR   struct<System.Nullable`1, 8> V00 arg0          (last use)

The extra promotion would allow a much cleaner decomposition. A simple option is to assume that promoting a field decreases the cost of overlapping struct assignments a bit. A smarter option is to track all locals that are assigned to one another in a union-find data structure, which would let us query the sets of structs whose fields it would be smart to promote together.
2. We are missing handling in decomposition when copying between the remainder of a physically promoted struct and a field of a regularly promoted struct:

Processing block operation [000009] that involves replacements
  V08 (field V02.hasValue (fldOffset=0x0)) <- V16 (V00.[000..001)) (last use)
  Remainder: [004..008)
  => Remainder strategy: int at +004

Local V00 should not be enregistered because: was accessed as a local field

Local V02 should not be enregistered because: was accessed as a local field
New statement:
STMT00003 ( 0x00D[E-] ... 0x00E )
               [000090] -A---------                           COMMA     void  
               [000087] DA---------                         ├──▌  STORE_LCL_VAR bool   V08 tmp4         
               [000086] -----------                           └──▌  LCL_VAR   bool   V16 tmp12         (last use)
               [000089] UA---------                         └──▌  STORE_LCL_FLD int   (P) V02 loc1         [+4]
                                                                   bool   V02.<unknown class>:hasValue (offs=0x00) -> V08 tmp4         
                                                                   int    V02.<unknown class>:value (offs=0x04) -> V09 tmp5         
               [000088] -----------                            └──▌  LCL_FLD   int    V00 arg0         [+4]

The expected decomposition is a store to V09 directly; this should be an easy fix. (edit: handled by #87217)
3. We do not regularly promote V00 because it is a parameter whose field does not fit cleanly into the register it is passed in.

If we do force physical promotion to promote V00.[004..008) then we end up with:

@@ -7,98 +7,93 @@
 ; 0 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T00] (  9, 27   )  struct ( 8) [rsp+30H]   do-not-enreg[SF] ld-addr-op single-def
-;  V01 loc0         [V01,T02] (  4,  9   )     int  ->  rcx        
+;  V00 arg0         [V00,T06] (  4,  4   )  struct ( 8) [rsp+30H]   do-not-enreg[SF] ld-addr-op single-def
+;  V01 loc0         [V01,T03] (  4,  9   )     int  ->   r8        
 ;* V02 loc1         [V02    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
 ;* V03 loc2         [V03    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
 ;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V05 tmp1         [V05    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "NewObj constructor temp"
 ;* V06 tmp2         [V06    ] (  0,  0   )  struct ( 8) zero-ref   
 ;* V07 tmp3         [V07    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
-;  V08 tmp4         [V08,T05] (  2,  8   )    bool  ->  rax         V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)"
-;  V09 tmp5         [V09,T06] (  2,  6   )     int  ->  rdx         V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)"
+;* V08 tmp4         [V08    ] (  0,  0   )    bool  ->  zero-ref    V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)"
+;  V09 tmp5         [V09,T07] (  2,  6   )     int  ->  rdx         V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)"
 ;* V10 tmp6         [V10    ] (  0,  0   )    bool  ->  zero-ref    V03.hasValue(offs=0x00) P-INDEP "field V03.hasValue (fldOffset=0x0)"
 ;* V11 tmp7         [V11    ] (  0,  0   )     int  ->  zero-ref    V03.value(offs=0x04) P-INDEP "field V03.value (fldOffset=0x4)"
-;* V12 tmp8         [V12,T08] (  0,  0   )    bool  ->  zero-ref    V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
-;  V13 tmp9         [V13,T07] (  2,  4   )     int  ->   r9         V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
-;  V14 tmp10        [V14,T03] (  3,  8   )    bool  ->   r8         V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)"
-;  V15 tmp11        [V15,T04] (  3,  8   )     int  ->   r9         V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)"
-;  V16 rat0         [V16,T01] (  3, 12   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
+;* V12 tmp8         [V12,T09] (  0,  0   )    bool  ->  zero-ref    V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)"
+;  V13 tmp9         [V13,T08] (  2,  4   )     int  ->  rdx         V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
+;  V14 tmp10        [V14,T04] (  3,  8   )    bool  ->  rcx         V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)"
+;  V15 tmp11        [V15,T05] (  3,  8   )     int  ->  rdx         V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)"
+;  V16 tmp12        [V16,T00] (  5, 14   )    bool  ->  rcx         "V00.[000..001)"
+;  V17 tmp13        [V17,T01] (  4, 13   )     int  ->  rdx         "V00.[004..008)"
+;  V18 rat0         [V18,T02] (  3, 12   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
 ;
 ; Lcl frame size = 40
 
-G_M24602_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+G_M24602_IG01:  ;; offset=0000H
        sub      rsp, 40
        mov      qword ptr [rsp+30H], rcx
 						;; size=9 bbWeight=1 PerfScore 1.25
-G_M24602_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       cmp      byte  ptr [rsp+30H], 0
+G_M24602_IG02:  ;; offset=0009H
+       movzx    rcx, byte  ptr [rsp+30H]
+       mov      edx, dword ptr [rsp+34H]
+       test     ecx, ecx
        jne      SHORT G_M24602_IG05
-						;; size=7 bbWeight=1 PerfScore 3.00
-G_M24602_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+						;; size=13 bbWeight=1 PerfScore 3.25
+G_M24602_IG03:  ;; offset=0016H
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG04:        ; bbWeight=0.50, epilog, nogc, extend
+G_M24602_IG04:  ;; offset=0018H
        add      rsp, 40
        ret      
 						;; size=5 bbWeight=0.50 PerfScore 0.62
-G_M24602_IG05:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
-       xor      ecx, ecx
-						;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG06:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       movzx    rax, byte  ptr [rsp+30H]
-       mov      edx, dword ptr [rsp+34H]
-       test     al, al
-       jne      SHORT G_M24602_IG08
-						;; size=13 bbWeight=4 PerfScore 13.00
-G_M24602_IG07:        ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+G_M24602_IG05:  ;; offset=001DH
        xor      r8d, r8d
-       xor      r9d, r9d
+       align    [0 bytes for IG06]
+						;; size=3 bbWeight=0.50 PerfScore 0.12
+G_M24602_IG06:  ;; offset=0020H
+       test     ecx, ecx
+       jne      SHORT G_M24602_IG08
+						;; size=4 bbWeight=4 PerfScore 5.00
+G_M24602_IG07:  ;; offset=0024H
+       xor      ecx, ecx
+       xor      edx, edx
        jmp      SHORT G_M24602_IG09
-						;; size=8 bbWeight=2 PerfScore 5.00
-G_M24602_IG08:        ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       mov      r8d, 0xD1FFAB1E
-       mov      eax, r8d
-       imul     edx:eax, edx
-       mov      r9d, edx
-       shr      r9d, 31
-       sar      edx, 2
-       add      r9d, edx
-       mov      r8d, 1
-						;; size=30 bbWeight=2 PerfScore 10.50
-G_M24602_IG09:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       mov      byte  ptr [rsp+30H], r8b
-       mov      dword ptr [rsp+34H], r9d
-       inc      ecx
-       cmp      byte  ptr [rsp+30H], 0
-       je       SHORT G_M24602_IG12
-       cmp      dword ptr [rsp+34H], 0
-       jg       SHORT G_M24602_IG06
-						;; size=26 bbWeight=4 PerfScore 33.00
-G_M24602_IG10:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+						;; size=6 bbWeight=2 PerfScore 5.00
+G_M24602_IG08:  ;; offset=002AH
+       mov      ecx, 0x66666667
        mov      eax, ecx
-						;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M24602_IG11:        ; bbWeight=0.50, epilog, nogc, extend
+       imul     edx:eax, edx
+       mov      eax, edx
+       shr      eax, 31
+       sar      edx, 2
+       add      edx, eax
+       mov      ecx, 1
+						;; size=24 bbWeight=2 PerfScore 10.50
+G_M24602_IG09:  ;; offset=0042H
+       movzx    rcx, cl
+       inc      r8d
+       test     ecx, ecx
+       je       SHORT G_M24602_IG12
+       test     edx, edx
+       jg       SHORT G_M24602_IG06
+						;; size=14 bbWeight=4 PerfScore 12.00
+G_M24602_IG10:  ;; offset=0050H
+       mov      eax, r8d
+						;; size=3 bbWeight=0.50 PerfScore 0.12
+G_M24602_IG11:  ;; offset=0053H
        add      rsp, 40
        ret      
 						;; size=5 bbWeight=0.50 PerfScore 0.62
-G_M24602_IG12:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
+G_M24602_IG12:  ;; offset=0058H
        call     [System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_NoValue()]
-       ; gcr arg pop 0
        int3     
 						;; size=7 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 116, prolog size 9, PerfScore 78.98, instruction count 35, allocated bytes for code 116 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int
+; Total bytes of code 95, prolog size 9, PerfScore 48.13, instruction count 35, allocated bytes for code 95 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int

which is smaller code (95 vs. 116 bytes) and a much better perf score (48.13 vs. 78.98).
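
The union-find idea from point 1 could look roughly like the following. This is a minimal sketch and not the JIT's implementation: `LocalAssignmentSets` and its methods are hypothetical names, locals are identified by plain indices, and every struct assignment is assumed to merge the sets of its source and destination.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Disjoint-set (union-find) over local numbers. Two locals end up in the same
// set when a struct assignment copies one into the other, directly or
// transitively. Hypothetical illustration only.
class LocalAssignmentSets
{
    std::vector<unsigned> m_parent;
    std::vector<unsigned> m_size;

public:
    explicit LocalAssignmentSets(unsigned numLocals) : m_parent(numLocals), m_size(numLocals, 1)
    {
        // Initially every local is the representative of its own set.
        std::iota(m_parent.begin(), m_parent.end(), 0u);
    }

    // Returns the representative of the set containing lclNum (path halving).
    unsigned Find(unsigned lclNum)
    {
        assert(lclNum < m_parent.size());
        while (m_parent[lclNum] != lclNum)
        {
            m_parent[lclNum] = m_parent[m_parent[lclNum]];
            lclNum           = m_parent[lclNum];
        }
        return lclNum;
    }

    // Records a struct assignment "dst = src" by merging the two sets
    // (union by size keeps the trees shallow).
    void RecordStructAssignment(unsigned dstLcl, unsigned srcLcl)
    {
        unsigned a = Find(dstLcl);
        unsigned b = Find(srcLcl);
        if (a == b)
        {
            return;
        }
        if (m_size[a] < m_size[b])
        {
            std::swap(a, b);
        }
        m_parent[b] = a;
        m_size[a] += m_size[b];
    }

    // True if the two locals are connected through struct assignments.
    bool InSameSet(unsigned lclA, unsigned lclB)
    {
        return Find(lclA) == Find(lclB);
    }
};
```

After recording `V02 = V00` from STMT00003, `InSameSet(0, 2)` holds, so a heuristic could evaluate promoting V00.[004..008) together with the `value` field of V02 rather than costing each local in isolation.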

@jakobbotsch
Member Author

jakobbotsch commented Jun 8, 2023

(edit: not expected to be handled)

+16 (+21.33%) : 21081.dasm - System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this
@@ -8,7 +8,7 @@
 ; Final local variable assignments
 ;
 ;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
-;* V01 loc0         [V01,T01] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op
+;* V01 loc0         [V01    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op
 ;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
 ;* V04 tmp2         [V04    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
@@ -16,11 +16,14 @@
 ;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
 ;* V07 tmp5         [V07    ] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
 ;* V08 tmp6         [V08    ] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
-;* V09 tmp7         [V09,T04] (  0,  0   )   simd8  ->  zero-ref    single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V10 tmp8         [V10,T05] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
-;* V11 tmp9         [V11,T06] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
-;  V12 cse0         [V12,T02] (  3,  3   )   simd8  ->  mm0         "CSE - aggressive"
-;  V13 cse1         [V13,T03] (  3,  2   )   simd8  ->  mm1         "CSE - aggressive"
+;* V09 tmp7         [V09,T03] (  0,  0   )   simd8  ->  zero-ref    single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
+;* V10 tmp8         [V10,T04] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
+;* V11 tmp9         [V11,T05] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
+;* V12 tmp10        [V12    ] (  0,  0   )   simd8  ->  zero-ref    single-def "V01.[000..008)"
+;* V13 tmp11        [V13,T06] (  0,  0   )   simd8  ->  zero-ref    single-def "V01.[008..016)"
+;* V14 tmp12        [V14,T07] (  0,  0   )   simd8  ->  zero-ref    single-def "V01.[016..024)"
+;  V15 cse0         [V15,T01] (  2,  2   )   simd8  ->  mm0         "CSE - aggressive"
+;  V16 cse1         [V16,T02] (  2,  1.50)   simd8  ->  mm1         "CSE - aggressive"
 ;
 ; Lcl frame size = 0
 
@@ -30,12 +33,14 @@ G_M64376_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 G_M64376_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
        vmovsd   xmm0, qword ptr [reloc @RWD00]
        vmovsd   xmm1, qword ptr [reloc @RWD08]
-       vcmpps   k1, xmm0, xmm0, 4
+       vmovsd   xmm2, qword ptr [reloc @RWD00]
+       vcmpps   k1, xmm0, xmm2, 4
        kortestb k1, k1
        jne      SHORT G_M64376_IG04
-						;; size=29 bbWeight=1 PerfScore 11.00
+						;; size=37 bbWeight=1 PerfScore 14.00
 G_M64376_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       vcmpps   k1, xmm1, xmm1, 4
+       vmovsd   xmm0, qword ptr [reloc @RWD08]
+       vcmpps   k1, xmm1, xmm0, 4
        kortestb k1, k1
        jne      SHORT G_M64376_IG04
        vxorps   xmm0, xmm0, xmm0
@@ -45,7 +50,7 @@ G_M64376_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byr
        sete     al
        movzx    rax, al
        jmp      SHORT G_M64376_IG05
-						;; size=40 bbWeight=0.50 PerfScore 6.46
+						;; size=48 bbWeight=0.50 PerfScore 7.96
 G_M64376_IG04:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
@@ -56,7 +61,7 @@ RWD00  	dq	000000003F800000h
 RWD08  	dq	3F80000000000000h
 
 
-; Total bytes of code 75, prolog size 3, PerfScore 27.68, instruction count 18, allocated bytes for code 81 (MethodHash=42890487) for method System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this
+; Total bytes of code 91, prolog size 3, PerfScore 33.78, instruction count 20, allocated bytes for code 97 (MethodHash=42890487) for method System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this

Replacements:

Accesses for V01
  [000..024) as System.Numerics.Matrix3x2
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  simd8 @ 000
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  simd8 @ 008
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  simd8 @ 016
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Picking promotions for V01
  Evaluating access simd8 @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 12 (V12 tmp10) (a long lifetime temp) called for V01.[000..008).
  Evaluating access simd8 @ 008
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 13 (V13 tmp11) (a long lifetime temp) called for V01.[008..016).
  Evaluating access simd8 @ 016
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 14 (V14 tmp12) (a long lifetime temp) called for V01.[016..024).

V01 promoted with 3 replacements
  [000..008) promoted as simd8 V12
  [008..016) promoted as simd8 V13
  [016..024) promoted as simd8 V14
Computing unpromoted remainder for V01
  Remainder: <empty>

Physical promotion means we replace the LCL_FLD nodes with LCL_VAR nodes. VN proves these to be vector constants, but CSE no longer kicks in because of the LCL_VAR, and constant prop then ends up creating additional copies of the vector constant. I think the issue is essentially #70182, as LSRA could probably recognize and reuse the register that already contains the constant.

@jakobbotsch
Member Author

jakobbotsch commented Jun 8, 2023

(edit: tracked by #87554)

+18 (+21.95%) : 17866.dasm - System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
@@ -8,16 +8,20 @@
 ; Final local variable assignments
 ;
 ;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
-;  V01 loc0         [V01,T00] (  6,  6   )  struct (24) [rsp+00H]   do-not-enreg[SF] must-init ld-addr-op
+;* V01 loc0         [V01    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op
 ;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
-;* V04 tmp2         [V04    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
+;  V04 tmp2         [V04    ] (  7,  7   )  struct (24) [rsp+00H]   do-not-enreg[SF] must-init ld-addr-op "Inline ldloca(s) first use temp"
 ;* V05 tmp3         [V05    ] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
 ;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
 ;* V07 tmp5         [V07    ] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
-;* V08 tmp6         [V08,T01] (  0,  0   )   simd8  ->  zero-ref    single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V09 tmp7         [V09,T02] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
-;* V10 tmp8         [V10,T03] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
+;  V08 tmp6         [V08,T00] (  5,  5   )   simd8  ->  [rsp+00H]   do-not-enreg[S] single-def V04.X(offs=0x00) P-DEP "field V04.X (fldOffset=0x0)"
+;  V09 tmp7         [V09,T01] (  5,  5   )   simd8  ->  [rsp+08H]   do-not-enreg[S] single-def V04.Y(offs=0x08) P-DEP "field V04.Y (fldOffset=0x8)"
+;  V10 tmp8         [V10,T02] (  5,  5   )   simd8  ->  [rsp+10H]   do-not-enreg[S] single-def V04.Z(offs=0x10) P-DEP "field V04.Z (fldOffset=0x10)"
+;  V11 tmp9         [V11,T03] (  2,  2   )   float  ->  mm0         single-def "V01.[000..004)"
+;  V12 tmp10        [V12,T04] (  2,  2   )   float  ->  mm1         single-def "V01.[004..008)"
+;  V13 tmp11        [V13,T05] (  2,  2   )   float  ->  mm2         single-def "V01.[008..012)"
+;  V14 tmp12        [V14,T06] (  2,  2   )   float  ->  mm3         single-def "V01.[012..016)"
 ;
 ; Lcl frame size = 24
 
@@ -34,12 +38,16 @@ G_M33935_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        vmovsd   qword ptr [rsp], xmm0
        vmovsd   xmm0, qword ptr [reloc @RWD08]
        vmovsd   qword ptr [rsp+08H], xmm0
+       vxorps   xmm0, xmm0, xmm0
+       vmovsd   qword ptr [rsp+10H], xmm0
        vmovss   xmm0, dword ptr [rsp]
-       vmulss   xmm0, xmm0, dword ptr [rsp+0CH]
-       vmovss   xmm1, dword ptr [rsp+08H]
-       vmulss   xmm1, xmm1, dword ptr [rsp+04H]
+       vmovss   xmm1, dword ptr [rsp+04H]
+       vmovss   xmm2, dword ptr [rsp+08H]
+       vmovss   xmm3, dword ptr [rsp+0CH]
+       vmulss   xmm0, xmm0, xmm3
+       vmulss   xmm1, xmm2, xmm1
        vsubss   xmm0, xmm0, xmm1
-						;; size=54 bbWeight=1 PerfScore 27.00
+						;; size=72 bbWeight=1 PerfScore 30.33
 G_M33935_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 24
        ret      
@@ -48,7 +56,7 @@ RWD00  	dq	000000003F800000h
 RWD08  	dq	3F80000000000000h
 
 
-; Total bytes of code 82, prolog size 23, PerfScore 41.28, instruction count 17, allocated bytes for code 82 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
+; Total bytes of code 100, prolog size 23, PerfScore 46.42, instruction count 21, allocated bytes for code 100 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this

Replacements:

Accesses for V01
  [000..024) as System.Numerics.Matrix3x2
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  float @ 000
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  float @ 004
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  float @ 008
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  float @ 012
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Picking promotions for V01
  Evaluating access float @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 11 (V11 tmp9) (a long lifetime temp) called for V01.[000..004).
  Evaluating access float @ 004
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 12 (V12 tmp10) (a long lifetime temp) called for V01.[004..008).
  Evaluating access float @ 008
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 13 (V13 tmp11) (a long lifetime temp) called for V01.[008..012).
  Evaluating access float @ 012
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 14 (V14 tmp12) (a long lifetime temp) called for V01.[012..016).

V01 promoted with 4 replacements
  [000..004) promoted as float V11
  [004..008) promoted as float V12
  [008..012) promoted as float V13
  [012..016) promoted as float V14
Computing unpromoted remainder for V01
  Remainder: [016..024)
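
The remainder computation in the dump above can be pictured as a range subtraction. This is only a minimal sketch and not the JIT's implementation: the function name and the idea of passing the struct's "significant" byte ranges as a list are assumptions made for illustration.

```python
def unpromoted_remainder(significant, replacements):
    """Subtract promoted byte ranges from the struct's significant ranges.

    Both arguments are lists of half-open (start, end) byte ranges.
    Hypothetical helper; the names are not taken from the JIT sources.
    """
    remainder = []
    for seg_start, seg_end in sorted(significant):
        cursor = seg_start
        for rep_start, rep_end in sorted(replacements):
            if rep_end <= cursor or rep_start >= seg_end:
                continue  # no overlap with what is left of this segment
            if rep_start > cursor:
                remainder.append((cursor, rep_start))
            cursor = max(cursor, rep_end)
        if cursor < seg_end:
            remainder.append((cursor, seg_end))

    # Merge ranges that touch so the result is compact.
    merged = []
    for start, end in remainder:
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# V01 is a 24-byte Matrix3x2 with four promoted floats.
v01_significant = [(0, 24)]
v01_replacements = [(0, 4), (4, 8), (8, 12), (12, 16)]
print(unpromoted_remainder(v01_significant, v01_replacements))  # prints [(16, 24)]
```

The result matches the `[016..024)` remainder reported for V01.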

We end up creating IR that DNERs V04:

STMT00001 ( 0x000[E-] ... ??? )
               [000017] DA--G------                           STORE_LCL_VAR struct<System.Numerics.Matrix3x2, 24> V01 loc0         
               [000031] -----------                         └──▌  LCL_VAR   struct<System.Numerics.Matrix3x2+Impl, 24>(P) V04 tmp2         
                                                                   simd8  V04.<unknown class>:X (offs=0x00) -> V08 tmp6          (last use)
                                                                   simd8  V04.<unknown class>:Y (offs=0x08) -> V09 tmp7          (last use)
                                                                   simd8  V04.<unknown class>:Z (offs=0x10) -> V10 tmp8          (last use)
Processing block operation [000017] that involves replacements
  V11 (V01.[000..004)) <- src+000
  V12 (V01.[004..008)) <- src+004
  V13 (V01.[008..012)) <- src+008
  V14 (V01.[012..016)) <- src+012
  => Remainder strategy: do nothing (remainder dying)

Local V04 should not be enregistered because: was accessed as a local field

Local V04 should not be enregistered because: was accessed as a local field

Local V04 should not be enregistered because: was accessed as a local field

Local V04 should not be enregistered because: was accessed as a local field
New statement:
STMT00001 ( 0x000[E-] ... ??? )
               [000075] -A---------                           COMMA     void  
               [000066] DA---------                         ├──▌  STORE_LCL_VAR float  V11 tmp9         
               [000065] -----------                           └──▌  LCL_FLD   float (P) V04 tmp2         [+0]
                                                                     simd8  V04.<unknown class>:X (offs=0x00) -> V08 tmp6         
                                                                     simd8  V04.<unknown class>:Y (offs=0x08) -> V09 tmp7         
                                                                     simd8  V04.<unknown class>:Z (offs=0x10) -> V10 tmp8         
               [000074] -A---------                         └──▌  COMMA     void  
               [000068] DA---------                            ├──▌  STORE_LCL_VAR float  V12 tmp10        
               [000067] -----------                              └──▌  LCL_FLD   float (P) V04 tmp2         [+4]
                                                                        simd8  V04.<unknown class>:X (offs=0x00) -> V08 tmp6         
                                                                        simd8  V04.<unknown class>:Y (offs=0x08) -> V09 tmp7         
                                                                        simd8  V04.<unknown class>:Z (offs=0x10) -> V10 tmp8         
               [000073] -A---------                            └──▌  COMMA     void  
               [000070] DA---------                               ├──▌  STORE_LCL_VAR float  V13 tmp11        
               [000069] -----------                                 └──▌  LCL_FLD   float (P) V04 tmp2         [+8]
                                                                           simd8  V04.<unknown class>:X (offs=0x00) -> V08 tmp6         
                                                                           simd8  V04.<unknown class>:Y (offs=0x08) -> V09 tmp7         
                                                                           simd8  V04.<unknown class>:Z (offs=0x10) -> V10 tmp8         
               [000072] DA---------                               └──▌  STORE_LCL_VAR float  V14 tmp12        
               [000071] -----------                                  └──▌  LCL_FLD   float (P) V04 tmp2         [+12]
                                                                            simd8  V04.<unknown class>:X (offs=0x00) -> V08 tmp6         
                                                                            simd8  V04.<unknown class>:Y (offs=0x08) -> V09 tmp7         
                                                                            simd8  V04.<unknown class>:Z (offs=0x10) -> V10 tmp8         

This is missing GetElement/WithElement handling that local morph has. With that handling I think we would end up with:

               [000083] -A---------                           COMMA     void  
               [000068] DA---------                         ├──▌  STORE_LCL_VAR float  V11 tmp9         
               [000067] -----------                           └──▌  HWINTRINSIC float  float ToScalar
               [000066] -----------                              └──▌  LCL_VAR   simd8 <System.Numerics.Vector2> V08 tmp6         
               [000082] -A---------                         └──▌  COMMA     void  
               [000072] DA---------                            ├──▌  STORE_LCL_VAR float  V12 tmp10        
               [000071] -----------                              └──▌  HWINTRINSIC float  float GetElement
               [000070] -----------                                 ├──▌  LCL_VAR   simd8 <System.Numerics.Vector2> V08 tmp6         
               [000069] -----------                                 └──▌  CNS_INT   int    1
               [000081] -A---------                            └──▌  COMMA     void  
               [000076] DA---------                               ├──▌  STORE_LCL_VAR float  V13 tmp11        
               [000075] -----------                                 └──▌  HWINTRINSIC float  float ToScalar
               [000074] -----------                                    └──▌  LCL_VAR   simd8 <System.Numerics.Vector2> V09 tmp7         
               [000080] DA---------                               └──▌  STORE_LCL_VAR float  V14 tmp12        
               [000079] -----------                                  └──▌  HWINTRINSIC float  float GetElement
               [000078] -----------                                     ├──▌  LCL_VAR   simd8 <System.Numerics.Vector2> V09 tmp7         
               [000077] -----------                                     └──▌  CNS_INT   int    1
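
The mapping behind that IR is simple offset arithmetic: a float read at some offset of V04 falls inside one of the promoted simd8 fields, and the element index is the distance from the field's start divided by the float size. A rough sketch of that mapping follows. The helper and its output format are hypothetical, and this is not the local morph code itself.

```python
FLOAT_SIZE = 4  # bytes per float element


def map_access_to_field(offset, fields):
    """Map a float access at `offset` to (field local, operation).

    `fields` is a list of (local_name, field_offset, field_size) tuples.
    Returns None when the access is unaligned or straddles fields.
    Hypothetical helper for illustration only.
    """
    for name, field_offset, field_size in fields:
        inside = field_offset <= offset and offset + FLOAT_SIZE <= field_offset + field_size
        if inside and (offset - field_offset) % FLOAT_SIZE == 0:
            index = (offset - field_offset) // FLOAT_SIZE
            # Element 0 can be read with ToScalar, others need GetElement.
            op = "ToScalar" if index == 0 else f"GetElement({index})"
            return name, op
    return None


# Promoted fields of V04 as listed in the dump: X, Y and Z, each simd8.
v04_fields = [("V08", 0x00, 8), ("V09", 0x08, 8), ("V10", 0x10, 8)]
for offset in (0, 4, 8, 12):
    print(offset, map_access_to_field(offset, v04_fields))
```

For offsets 0, 4, 8 and 12 this yields V08/ToScalar, V08/GetElement(1), V09/ToScalar and V09/GetElement(1), which lines up with the four stores in the tree above.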

Hacking this in we end up folding the entire benchmark to a constant:

; Assembly listing for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;* V01 loc0         [V01    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
;* V04 tmp2         [V04    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
;* V05 tmp3         [V05    ] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
;* V07 tmp5         [V07    ] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
;* V08 tmp6         [V08,T00] (  0,  0   )   simd8  ->  zero-ref    single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V09 tmp7         [V09,T01] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)"
;* V10 tmp8         [V10    ] (  0,  0   )   simd8  ->  zero-ref    single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)"
;* V11 tmp9         [V11,T02] (  0,  0   )   float  ->  zero-ref    single-def "V01.[000..004)"
;* V12 tmp10        [V12,T03] (  0,  0   )   float  ->  zero-ref    single-def "V01.[004..008)"
;* V13 tmp11        [V13,T04] (  0,  0   )   float  ->  zero-ref    single-def "V01.[008..012)"
;* V14 tmp12        [V14,T05] (  0,  0   )   float  ->  zero-ref    single-def "V01.[012..016)"
;
; Lcl frame size = 0

G_M33935_IG01:  ;; offset=0000H
       vzeroupper
                                                ;; size=3 bbWeight=1 PerfScore 1.00
G_M33935_IG02:  ;; offset=0003H
       vmovss   xmm0, dword ptr [reloc @RWD00]
                                                ;; size=8 bbWeight=1 PerfScore 3.00
G_M33935_IG03:  ;; offset=000BH
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00
RWD00   dd      3F800000h               ;         1


; Total bytes of code 12, prolog size 3, PerfScore 6.20, instruction count 3, allocated bytes for code 12 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this
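
As a sanity check on the folded constant, here is a small sketch that decodes the data section constants and redoes the arithmetic from the regressed assembly. It assumes, based on the `RWD00`/`RWD08` values shown in the earlier diff, that the first two rows of the matrix are stored from those constants; the load of `RWD00` is not visible in the diff context, so that part is an inference.

```python
import struct


def floats_from_qword(qword):
    """Split a little-endian 64-bit constant into two 32-bit floats."""
    return struct.unpack("<2f", struct.pack("<Q", qword))


def float_bits(value):
    """Return the IEEE 754 single-precision bit pattern of `value`."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


# Assumed contents of [rsp+00H] and [rsp+08H], taken from RWD00 and RWD08.
row_x = floats_from_qword(0x000000003F800000)  # floats at offsets 0 and 4
row_y = floats_from_qword(0x3F80000000000000)  # floats at offsets 8 and 12

# The assembly computes [rsp] * [rsp+0CH] - [rsp+08H] * [rsp+04H].
determinant = row_x[0] * row_y[1] - row_y[0] * row_x[1]
print(determinant, hex(float_bits(determinant)))
```

Under that assumption the determinant is 1.0, whose bit pattern is `3F800000h`, the same value as the single `RWD00` constant in the folded listing.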

@jakobbotsch
Member Author

jakobbotsch commented Jun 8, 2023

(edit: not expected to be handled)

+44 (+27.67%) : 1550.dasm - System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this
@@ -8,55 +8,63 @@
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T00] ( 12, 12   )   byref  ->  rcx         this single-def
-;  V01 RetBuf       [V01,T02] (  4,  4   )   byref  ->  rbx         single-def
-;  V02 loc0         [V02,T01] ( 12, 12   )  struct (56) [rsp+08H]   do-not-enreg[SF] ld-addr-op
+;  V01 RetBuf       [V01,T01] ( 12, 12   )   byref  ->  rbx         single-def
+;  V02 loc0         [V02,T02] (  4,  4   )  struct (56) [rsp+10H]   do-not-enreg[SF] ld-addr-op
 ;# V03 OutArgs      [V03    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
+;  V04 tmp1         [V04,T03] (  2,  2   )    long  ->  rbp         "V02.[000..008)"
+;  V05 tmp2         [V05,T04] (  2,  2   )    long  ->  r14         "V02.[008..016)"
+;  V06 tmp3         [V06,T05] (  2,  2   )    bool  ->  r15         "V02.[016..017)"
+;  V07 tmp4         [V07,T06] (  2,  2   )    bool  ->  r12         "V02.[017..018)"
+;  V08 tmp5         [V08,T07] (  2,  2   )    bool  ->  r13         "V02.[018..019)"
+;  V09 tmp6         [V09,T08] (  2,  2   )    bool  ->  [rsp+0CH]   spill-single-def "V02.[019..020)"
+;  V10 tmp7         [V10,T09] (  2,  2   )   ubyte  ->  [rsp+08H]   spill-single-def "V02.[020..021)"
+;  V11 tmp8         [V11,T10] (  2,  2   )   ubyte  ->  [rsp+04H]   spill-single-def "V02.[021..022)"
 ;
-; Lcl frame size = 64
+; Lcl frame size = 72
 
 G_M2776_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+       push     r15
+       push     r14
+       push     r13
+       push     r12
        push     rdi
        push     rsi
+       push     rbp
        push     rbx
-       sub      rsp, 64
+       sub      rsp, 72
        vzeroupper 
        mov      rbx, rdx
        ; byrRegs +[rbx]
-						;; size=13 bbWeight=1 PerfScore 4.50
+						;; size=22 bbWeight=1 PerfScore 9.50
 G_M2776_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=000A {rcx rbx}, byref
        ; byrRegs +[rcx]
        vxorps   ymm0, ymm0, ymm0
-       vmovdqu  ymmword ptr [rsp+08H], ymm0
-       vmovdqu  ymmword ptr [rsp+20H], ymm0
-       mov      rax, qword ptr [rcx]
-       mov      qword ptr [rsp+08H], rax
-       mov      rax, qword ptr [rcx+08H]
-       mov      qword ptr [rsp+10H], rax
-       movzx    rax, byte  ptr [rcx+26H]
-       mov      byte  ptr [rsp+18H], al
-       movzx    rax, byte  ptr [rcx+27H]
-       mov      byte  ptr [rsp+19H], al
-       movzx    rax, byte  ptr [rcx+2EH]
-       mov      byte  ptr [rsp+1AH], al
+       vmovdqu  ymmword ptr [rsp+10H], ymm0
+       vmovdqu  ymmword ptr [rsp+28H], ymm0
+       mov      rbp, qword ptr [rcx]
+       mov      r14, qword ptr [rcx+08H]
+       movzx    r15, byte  ptr [rcx+26H]
+       movzx    r12, byte  ptr [rcx+27H]
+       movzx    r13, byte  ptr [rcx+2EH]
        movzx    rax, byte  ptr [rcx+2CH]
-       mov      byte  ptr [rsp+1BH], al
-       movzx    rax, byte  ptr [rcx+28H]
-       mov      byte  ptr [rsp+1CH], al
-       movzx    rax, byte  ptr [rcx+29H]
-       mov      byte  ptr [rsp+1DH], al
-       mov      rax, qword ptr [rcx+40H]
-       mov      qword ptr [rsp+20H], rax
-						;; size=90 bbWeight=1 PerfScore 29.33
+       mov      dword ptr [rsp+0CH], eax
+       movzx    rdx, byte  ptr [rcx+28H]
+       mov      dword ptr [rsp+08H], edx
+       movzx    r8, byte  ptr [rcx+29H]
+       mov      dword ptr [rsp+04H], r8d
+       mov      r9, qword ptr [rcx+40H]
+       mov      qword ptr [rsp+28H], r9
+						;; size=73 bbWeight=1 PerfScore 24.33
 G_M2776_IG03:        ; bbWeight=1, nogc, extend
        vmovdqu  xmm0, xmmword ptr [rcx+48H]
-       vmovdqu  xmmword ptr [rsp+28H], xmm0
-       mov      rax, qword ptr [rcx+58H]
-       mov      qword ptr [rsp+38H], rax
+       vmovdqu  xmmword ptr [rsp+30H], xmm0
+       mov      r9, qword ptr [rcx+58H]
+       mov      qword ptr [rsp+40H], r9
 						;; size=20 bbWeight=1 PerfScore 8.00
 G_M2776_IG04:        ; bbWeight=1, extend
        mov      rdi, rbx
        ; byrRegs +[rdi]
-       lea      rsi, bword ptr [rsp+08H]
+       lea      rsi, bword ptr [rsp+10H]
        ; byrRegs +[rsi]
        mov      ecx, 4
        ; byrRegs -[rcx]
@@ -64,18 +72,34 @@ G_M2776_IG04:        ; bbWeight=1, extend
        call     CORINFO_HELP_ASSIGN_BYREF
        movsq    
        movsq    
+       mov      qword ptr [rbx], rbp
+       mov      qword ptr [rbx+08H], r14
+       mov      byte  ptr [rbx+10H], r15b
+       mov      byte  ptr [rbx+11H], r12b
+       mov      byte  ptr [rbx+12H], r13b
+       mov      ebp, dword ptr [rsp+0CH]
+       mov      byte  ptr [rbx+13H], bpl
+       mov      ebp, dword ptr [rsp+08H]
+       mov      byte  ptr [rbx+14H], bpl
+       mov      ebp, dword ptr [rsp+04H]
+       mov      byte  ptr [rbx+15H], bpl
        mov      rax, rbx
        ; byrRegs +[rax]
-						;; size=28 bbWeight=1 PerfScore 29.25
+						;; size=71 bbWeight=1 PerfScore 40.25
 G_M2776_IG05:        ; bbWeight=1, epilog, nogc, extend
-       add      rsp, 64
+       add      rsp, 72
        pop      rbx
+       pop      rbp
        pop      rsi
        pop      rdi
+       pop      r12
+       pop      r13
+       pop      r14
+       pop      r15
        ret      
-						;; size=8 bbWeight=1 PerfScore 2.75
+						;; size=17 bbWeight=1 PerfScore 5.25
 
-; Total bytes of code 159, prolog size 10, PerfScore 89.73, instruction count 44, allocated bytes for code 159 (MethodHash=d49af527) for method System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this
+; Total bytes of code 203, prolog size 19, PerfScore 107.63, instruction count 60, allocated bytes for code 203 (MethodHash=d49af527) for method System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this

Replacements:

Accesses for V02
  [000..056) as System.Text.Json.JsonReaderState
    #:                             (2, 200)
    # assigned from:               (1, 100)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  long @ 000
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  long @ 008
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  bool @ 016
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  bool @ 017
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  bool @ 018
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  bool @ 019
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 020
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 021
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  [024..032) as System.Text.Json.JsonReaderOptions
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  [032..056) as System.Text.Json.BitStack
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Picking promotions for V02
  Evaluating access long @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 4 (V04 tmp1) (a long lifetime temp) called for V02.[000..008).
  Evaluating access long @ 008
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 5 (V05 tmp2) (a long lifetime temp) called for V02.[008..016).
  Evaluating access bool @ 016
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 6 (V06 tmp3) (a long lifetime temp) called for V02.[016..017).
  Evaluating access bool @ 017
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 7 (V07 tmp4) (a long lifetime temp) called for V02.[017..018).
  Evaluating access bool @ 018
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 8 (V08 tmp5) (a long lifetime temp) called for V02.[018..019).
  Evaluating access bool @ 019
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 9 (V09 tmp6) (a long lifetime temp) called for V02.[019..020).
  Evaluating access ubyte @ 020
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 10 (V10 tmp7) (a long lifetime temp) called for V02.[020..021).
  Evaluating access ubyte @ 021
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement


lvaGrabTemp returning 11 (V11 tmp8) (a long lifetime temp) called for V02.[021..022).

V02 promoted with 8 replacements
  [000..008) promoted as long V04
  [008..016) promoted as long V05
  [016..017) promoted as bool V06
  [017..018) promoted as bool V07
  [018..019) promoted as bool V08
  [019..020) promoted as bool V09
  [020..021) promoted as ubyte V10
  [021..022) promoted as ubyte V11
Computing unpromoted remainder for V02
  Remainder: [024..056)

This is one of the cases where the heuristic does not take into account that decomposed assignments can be more expensive with many fields, especially considering that we end up spilling some of the fields. We end up with

Processing block operation [000065] that involves replacements
  dst+000 <- V04 (V02.[000..008)) (last use)
  dst+008 <- V05 (V02.[008..016)) (last use)
  dst+016 <- V06 (V02.[016..017)) (last use)
  dst+017 <- V07 (V02.[017..018)) (last use)
  dst+018 <- V08 (V02.[018..019)) (last use)
  dst+019 <- V09 (V02.[019..020)) (last use)
  dst+020 <- V10 (V02.[020..021)) (last use)
  dst+021 <- V11 (V02.[021..022)) (last use)
  Remainder: [024..056)
  => Remainder strategy: retain a full block op
New statement:
STMT00012 ( 0x08A[E-] ... 0x08B )
               [000124] -A-XG------                           COMMA     void  
               [000065] -A-XG------                         ├──▌  STORE_BLK struct<System.Text.Json.JsonReaderState, 56> (copy)
               [000079] -----------                           ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000063] -----------                           └──▌  LCL_VAR   struct<System.Text.Json.JsonReaderState, 56> V02 loc0         
               [000123] -A-XG------                         └──▌  COMMA     void  
               [000082] -A-XG------                            ├──▌  STOREIND  long  
               [000081] -----------                              ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000080] -----------                              └──▌  LCL_VAR   long   V04 tmp1          (last use)
               [000122] -A-XG------                            └──▌  COMMA     void  
               [000087] -A-XG------                               ├──▌  STOREIND  long  
               [000086] -----------                                 ├──▌  ADD       byref 
               [000084] -----------                                   ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000085] -----------                                   └──▌  CNS_INT   long   8
               [000083] -----------                                 └──▌  LCL_VAR   long   V05 tmp2          (last use)
               [000121] -A-XG------                               └──▌  COMMA     void  
               [000092] -A-XG------                                  ├──▌  STOREIND  bool  
               [000091] -----------                                    ├──▌  ADD       byref 
               [000089] -----------                                      ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000090] -----------                                      └──▌  CNS_INT   long   16
               [000088] -----------                                    └──▌  LCL_VAR   bool   V06 tmp3          (last use)
               [000120] -A-XG------                                  └──▌  COMMA     void  
               [000097] -A-XG------                                     ├──▌  STOREIND  bool  
               [000096] -----------                                       ├──▌  ADD       byref 
               [000094] -----------                                         ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000095] -----------                                         └──▌  CNS_INT   long   17
               [000093] -----------                                       └──▌  LCL_VAR   bool   V07 tmp4          (last use)
               [000119] -A-XG------                                     └──▌  COMMA     void  
               [000102] -A-XG------                                        ├──▌  STOREIND  bool  
               [000101] -----------                                          ├──▌  ADD       byref 
               [000099] -----------                                            ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000100] -----------                                            └──▌  CNS_INT   long   18
               [000098] -----------                                          └──▌  LCL_VAR   bool   V08 tmp5          (last use)
               [000118] -A-XG------                                        └──▌  COMMA     void  
               [000107] -A-XG------                                           ├──▌  STOREIND  bool  
               [000106] -----------                                             ├──▌  ADD       byref 
               [000104] -----------                                               ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000105] -----------                                               └──▌  CNS_INT   long   19
               [000103] -----------                                             └──▌  LCL_VAR   bool   V09 tmp6          (last use)
               [000117] -A-XG------                                           └──▌  COMMA     void  
               [000112] -A-XG------                                              ├──▌  STOREIND  ubyte 
               [000111] -----------                                                ├──▌  ADD       byref 
               [000109] -----------                                                  ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000110] -----------                                                  └──▌  CNS_INT   long   20
               [000108] -----------                                                └──▌  LCL_VAR   ubyte  V10 tmp7          (last use)
               [000116] -A-XG------                                              └──▌  STOREIND  ubyte 
               [000115] -----------                                                 ├──▌  ADD       byref 
               [000064] -----------                                                   ├──▌  LCL_VAR   byref  V01 RetBuf       
               [000114] -----------                                                   └──▌  CNS_INT   long   21
               [000113] -----------                                                 └──▌  LCL_VAR   ubyte  V11 tmp8          (last use)

to handle the assignment into the ret buffer. We do see some signs of why the promotion could be beneficial, as we are able to keep a bunch of the fields in registers instead of on the stack, but we just don't have enough registers on x64 to do that for all of them.

@jakobbotsch
Member Author

jakobbotsch commented Jun 15, 2023

With the perflab runs @cincuranet set up and a query from @AndyAyersMS I can start looking at micro benchmark regressions. The following lists all benchmarks with a ratio below 0.95, indicating that they regress by more than 5%. There are 56 entries in this list (for comparison, the query for benchmarks that improve by more than 5% returns 267 results, but take it with a grain of salt as many of these are noisy). The quality columns are computed as median divided by standard deviation, so larger numbers indicate more stable benchmarks.
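
For reference, the two derived columns can be reproduced roughly as follows. This is a sketch under assumptions: the ratio direction (default median divided by promotion median) is inferred from the numbers in the list, the function names are made up, and whether the perflab query uses the sample or the population standard deviation is not stated here, so the sample version is just a guess.

```python
import statistics


def ratio(default_median, promotion_median):
    """Ratio below 1.0 means the promotion run was slower (assumed direction)."""
    return default_median / promotion_median


def quality(samples):
    """Median divided by standard deviation; larger means more stable.

    Uses the sample standard deviation, which is an assumption.
    """
    return statistics.median(samples) / statistics.stdev(samples)


def is_regression(default_median, promotion_median, threshold=0.95):
    """True when the benchmark regresses by more than 5% under this ratio."""
    return ratio(default_median, promotion_median) < threshold


# Medians of the first entry in the list (CheckObjIsInterfaceNo).
print(ratio(31312.125918503676, 62373.178950863221))
print(is_regression(31312.125918503676, 62373.178950863221))
```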

| Notes | Benchmark | Ratio | Promotion median | Default median | Promotion quality | Default quality |
| --- | --- | --- | --- | --- | --- | --- |
| Bimodal | `PerfLabTests.CastingPerf.CheckObjIsInterfaceNo` | 0.50201266706593484 | 62373.178950863221 | 31312.125918503676 | 4.1721551887971193 | 2.0068501317441818 |
| Bimodal | `PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceYes` | 0.5020356039475935 | 62373.964854866252 | 31313.95111651875 | 4.2768060071352947 | 2.0060349494210894 |
| Bimodal | `PerfLabTests.CastingPerf.CheckObjIsInterfaceYes` | 0.50204105771866514 | 62373.932840068293 | 31314.275217100869 | 4.2853027310745686 | 2.0074065296253232 |
| Bimodal | `PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo` | 0.50216288733931325 | 62373.730079681263 | 31321.772390935719 | 4.0927570892698251 | 2.0097825475492144 |
| Bimodal | `PerfLabTests.LowLevelPerf.EmptyStaticFunction` | 0.7374686274234068 | 2604944.2708333335 | 1921064.6759259258 | 7.1867237699720317 | 3.0824059321076045 |
| Noisy | `MicroBenchmarks.Serializers.Xml_ToStream<MyEventsListerViewModel>.XmlSerializer_` | 0.74302975696458551 | 660707.35294117662 | 490925.2238805971 | 4.9061924677629705 | 3.7168701617547995 |
| Maybe? Need more data | `System.Memory.Constructors<Byte>.ArrayAsMemory` | 0.79356832092520146 | 2.3856637796570128 | 1.8931871999144854 | 6.3931791008772434 | 4.7934943297225 |
| Bimodal | `System.Memory.Constructors<String>.MemoryMarshalCreateSpan` | 0.80252756077275078 | 1.5784818512235037 | 1.2667751897864545 | 5.5628662599349337 | 5.0711645321790915 |
| Bimodal | `System.Memory.Constructors<String>.MemoryMarshalCreateReadOnlySpan` | 0.80496260867919744 | 1.5785205328904912 | 1.2706500060092067 | 5.7538606845191165 | 4.9616066679124549 |
| Bimodal | `PerfLabTests.CastingPerf2.CastingPerf.IntObj` | 0.8285766871531669 | 226939.75845410625 | 188036.99324324325 | 12.390298163607353 | 9.8413406239725 |
| Regression (vec cns reuse) | `System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark` | 0.83234334609059291 | 1.2211270243560317 | 1.0163969534541484 | 10.778735229862832 | 7.8101018258309587 |
| Multimodal | `System.Collections.ContainsFalse<Int32>.Span(Size: 512)` | 0.845401219708253 | 29126.453232893906 | 24623.539088863898 | 11.021424747595237 | 13.417315107098991 |
| Maybe? Need more data | `System.Collections.TryGetValueTrue<Int32, Int32>.Dictionary(Size: 512)` | 0.84571433947519314 | 3795.9668312075878 | 3210.3035813244669 | 7.9431221125054137 | 6.7635448167816792 |
| Bimodal | `System.Tests.Perf_Enum.IsDefined_Generic_NonFlags` | 0.85163797170392563 | 3.1193090478744949 | 2.6565220306495383 | 8.8673428674969017 | 8.5158225518187 |
| Maybe? Need more data | `Devirtualization.EqualityComparer.ValueTupleCompareNoOpt` | 0.85240213125413522 | 5.8012179366666068 | 4.9449705330843328 | 9.3287988994501845 | 7.4879186079626407 |
| Regression (pipelining) | `System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark` | 0.85729075729610349 | 1.62796924888388 | 1.3956429902304301 | 44.749404427588338 | 12.091485745364375 |
| Regression (pipelining) | `System.Numerics.Tests.Perf_Matrix3x2.AddOperatorBenchmark` | 0.85737159681747388 | 1.6278395951425713 | 1.3956634330500965 | 63.074207379712796 | 10.42100793946469 |
| Regression (pipelining) | `System.Numerics.Tests.Perf_Matrix3x2.SubtractOperatorBenchmark` | 0.85739598139557893 | 1.627907761982518 | 1.3957615732064814 | 38.580582580347361 | 13.237060244441501 |
| Regression (pipelining) | `System.Numerics.Tests.Perf_Matrix3x2.AddBenchmark` | 0.85795657072580145 | 1.6266881427332425 | 1.3956277805797357 | 26.052276152907275 | 9.560646773106166 |
| Bimodal | `System.MathBenchmarks.MathTests.DivRemInt32` | 0.85968833841252423 | 1.5068739322565785 | 1.2954419470188046 | 12.849580314598676 | 7.9300084772432688 |
| Modal, but check again (only spiked with phys prom) | `PerfLabTests.CastingPerf.ObjObjrefValueType` | 0.8646586924544396 | 361425.11420265783 | 312509.36666666664 | 14.971126929257249 | 15.218101244318584 |
| Like above | `PerfLabTests.CastingPerf.FooObjIsNull` | 0.86607371151458068 | 361202.81007751933 | 312828.25833333336 | 14.968123248057182 | 15.452934201966883 |
| Bimodal | `PerfLabTests.LowLevelPerf.GenericGenericMethod` | 0.86612762496753892 | 187101.49305555556 | 162053.77180808882 | 15.107030942049 | 11.493496309257749 |
| Modal, but check again (only spiked with phys prom) | `PerfLabTests.CastingPerf.ObjInt` | 0.86704330919561978 | 360542.60349025979 | 312606.05203619908 | 14.650520670993343 | 14.998381025621656 |
| Like above | `PerfLabTests.CastingPerf.ObjFooIsObj` | 0.86731406352131957 | 360339.93371212116 | 312527.89215686271 | 14.967143545401138 | 15.27578920376382 |
| Like above | `PerfLabTests.CastingPerf.ObjScalarValueType` | 0.86780419724880409 | 360123.47537878784 | 312516.66346153844 | 14.864352722727245 | 15.405796220386524 |
| Maybe? Need more data | `Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.ForEach_Array` | 0.86930747689206089 | 5.2903566301283576 | 4.5989465739960682 | 30.026761512990653 | 10.35044745664271 |
| Bimodal regression and improvement | `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock\|Holmes\|Watson\|Irene\|Adler\|John\|Baker", Options: NonBacktracking)` | 0.870338288956494 | 1567827.0673076923 | 1364539.9271402548 | 11.60696233951157 | 14.090632090500138 |
| Regression (pipelining, #87554) | `System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark` | 0.87078330292883266 | 4.5609575925251837 | 3.9716057169374164 | 100.14896735833135 | 10.364535617674489 |
| Regression (pipelining, #87554) | `System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixBenchmark` | 0.87095941903031426 | 4.5610356119008992 | 3.9724769267177811 | 129.83253482155473 | 9.5704310251011382 |
| Noisy | `System.Tests.Perf_String.Trim(s: "Test")` | 0.87443817495914056 | 2.5474217375187354 | 2.2275628150071256 | 8.6368738268479444 | 10.77430293074069 |
| Bimodal | `Benchstone.BenchI.BubbleSort.Test` | 0.88167317428377268 | 13708.670704845816 | 12086.567215552373 | 17.710410860982513 | 14.8836516447514 |
| Bimodal | `PerfLabTests.CastingPerf.FooObjIsFoo2` | 0.888850848188488 | 434724.572368421 | 386405.30487804877 | 16.696742581172664 | 14.025034933825713 |
| Multimodal | `System.Collections.ContainsKeyTrue<Int32, Int32>.Dictionary(Size: 512)` | 0.90009475716937359 | 3514.2079858763432 | 3163.1201836900404 | 12.024653108378379 | 8.0678159153909181 |
| Maybe? Need more data | `PerfLabTests.CastingPerf.CheckArrayIsArrayByVariance` | 0.90067954767068859 | 2.7450413012046524 | 2.4724025575063648 | 2.7009895041042808 | 1.7932475048783776 |
| Maybe? Need more data | `System.Memory.ReadOnlySequence.Slice_Start(Segment: Multiple)` | 0.90964038072783471 | 3.4493846916202595 | 3.1376996041622176 | 8.9030611892736591 | 7.7368085557620461 |
| Maybe? Need more data | `System.Buffers.Tests.ReadOnlySequenceTests.FirstTenSegments` | 0.91450624664500813 | 5.0942189368202619 | 4.6586950394994213 | 13.250302455948326 | 5.93807832331802 |
| Maybe? Need more data | `System.Numerics.Tests.Perf_Matrix3x2.EqualityOperatorBenchmark` | 0.91505574135403611 | 1.7038860172829366 | 1.5591506827276136 | 18.225174899407335 | 16.6845459834044 |
| Maybe? Need more data | `System.Tests.Perf_Boolean.TryParse(value: "0")` | 0.91676396690683648 | 3.2874644520224661 | 3.0138289521013255 | 12.304077459929706 | 9.4382205739902538 |
| Noisy | `System.Text.Perf_Utf8Encoding.GetBytes(Input: Chinese)` | 0.9205757115569656 | 161762.16635338342 | 148914.32139376219 | 15.828556551677019 | 15.141829429651327 |
| Noisy | `System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 40000, unique: 7, cacheSize: 0)` | 0.9227115889387062 | 59276610 | 54695215 | 10.351467756318263 | 8.71913930818749 |
| Noisy | `BenchmarksGame.BinaryTrees_5.RunBench` | 0.92322071325312827 | 179044422.5 | 165297519.44444445 | 22.266548353098084 | 20.890177523956297 |
| Noisy | `System.Memory.Span<Byte>.Clear(Size: 512)` | 0.92331343747602457 | 6.4115322856679935 | 5.9198539141686277 | 2.4348688680033095 | 2.7869962552546239 |
| Noisy | `System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch_Multithreading(total: 40000, unique: 7, cacheSize: 0)` | 0.9236300501927136 | 18645578.75 | 17221616.836734693 | 13.580364545045281 | 12.069896150198117 |
| Maybe? Need more data | `Benchstone.BenchI.XposMatrix.Test` | 0.93003313539222943 | 18078.169075144509 | 16813.296267107489 | 69.731706322545591 | 25.949469557869556 |
| Noisy | `Benchstone.BenchI.AddArray.Test` | 0.93120132537564571 | 20210.202752976191 | 18819.767589681953 | 14.050293215847208 | 15.155225360820852 |
| Noisy | `System.Buffers.Tests.ReadOnlySequenceTests.FirstTenSegments` | 0.93132417841360293 | 4.6180075165850623 | 4.3008620562914261 | 8.1717473277435051 | 6.1868419993236987 |
| Bimodal regression and improvement | `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sherlock\|Holmes\|Watson", Options: NonBacktracking)` | 0.93138653274623273 | 2780731.38576779 | 2589935.763888889 | 14.820176511579046 | 20.266737041566497 |
| Noisy | `System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000)` | 0.93154070667252664 | 43590.89447463768 | 40606.692643391521 | 17.251682799323238 | 14.914235278894841 |
| Noisy | `System.Collections.CtorDefaultSize<String>.Stack` | 0.93671812574135571 | 17.861669480703526 | 16.731349558576181 | 16.985916186371369 | 14.140773457045354 |
| Bimodal | `Benchstone.BenchI.Fib.Test` | 0.93955184884835363 | 159694.67229199369 | 150041.42460317462 | 20.749516161852206 | 22.520675763619309 |
| Maybe? Need more data | `Benchstone.BenchI.IniArray.Test` | 0.940509770186889 | 67189541.25 | 63192420 | 5.0790460095421039 | 5.8211886020145069 |
| Bimodal | `System.Memory.Span.IndexOfAnyFourValues(Size: 33)` | 0.94573448370538127 | 57.653966580812074 | 54.525344317871614 | 23.807900467166885 | 21.941204410312682 |
| Maybe? Need more data | `System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GetHashCodeBenchmark` | 0.94589448437699652 | 12.750723643616887 | 12.060839166312574 | 18.816588477582815 | 17.400192778480125 |
| Maybe? Need more data | `System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>.HeapSort(Size: 1000)` | 0.94921192013119116 | 75208.203125 | 71388.5228978979 | 38.440681648109951 | 23.375784409521643 |
| Bimodal | `System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000)` | 0.94928848058446713 | 4249593.2471264368 | 4034089.916666667 | 22.034137694330397 | 18.666116210466772 |
| Noisy (no asm diffs) | `System.Buffers.Tests.ReadOnlySequenceTests<Byte>.FirstSingleSegment` | | | | | |
| Noisy (no asm diffs) | `System.Memory.ReadOnlySequence.Slice_Repeat_StartPosition_And_EndPosition(Segment: Multiple)` | | | | | |
| Bimodal (no asm diffs) | `System.Tests.Perf_Type.op_Equality` | | | | | |
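
For readers who want to recompute the ratio and quality figures listed above, here is a minimal Python sketch. It is not the actual perflab query: the function names are hypothetical, and the use of the sample standard deviation is an assumption (the comment only says "median divided by standard deviation"). The ratio direction (default median over promotion median) is inferred from the first row of the list.

```python
import statistics

def ratio(default_median, promotion_median):
    """Default median divided by promotion median.
    Values below 1.0 mean the promotion run was slower."""
    return default_median / promotion_median

def quality(samples):
    """Median divided by standard deviation; larger means more stable.
    Assumption: sample standard deviation (statistics.stdev)."""
    return statistics.median(samples) / statistics.stdev(samples)

def is_regression(r, threshold=0.95):
    """True when the benchmark regresses by more than 5% (ratio below 0.95)."""
    return r < threshold

# Medians taken from the CheckObjIsInterfaceNo row.
r = ratio(31312.125918503676, 62373.178950863221)
print(f"ratio = {r:.6f}, regression = {is_regression(r)}")
```

A low quality value on either side (for example the casting benchmarks with quality around 2 to 4) suggests that the ratio for that row should be treated with caution.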

@jakobbotsch
Member Author

System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark

Same as #76928 (comment):

@@ -92,19 +92,23 @@ G_M3814_IG03:  ;; offset=003BH
 ; Final local variable assignments
 ;
 ;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
-;* V01 loc0         [V01,T00] (  0,  0   )  struct (64) zero-ref    do-not-enreg[SF] ld-addr-op
+;* V01 loc0         [V01    ] (  0,  0   )  struct (64) zero-ref    do-not-enreg[SF] ld-addr-op
 ;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (64) zero-ref    do-not-enreg[S] ld-addr-op "Inline stloc first use temp"
 ;* V04 tmp2         [V04    ] (  0,  0   )  struct (64) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
-;  V05 tmp3         [V05,T01] (  3,  2   )    bool  ->  rax         "Inline return value spill temp"
-;* V06 tmp4         [V06,T06] (  0,  0   )  simd16  ->  zero-ref    single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V07 tmp5         [V07,T07] (  0,  0   )  simd16  ->  zero-ref    single-def V04.Y(offs=0x10) P-INDEP "field V04.Y (fldOffset=0x10)"
-;* V08 tmp6         [V08,T08] (  0,  0   )  simd16  ->  zero-ref    single-def V04.Z(offs=0x20) P-INDEP "field V04.Z (fldOffset=0x20)"
-;* V09 tmp7         [V09,T09] (  0,  0   )  simd16  ->  zero-ref    single-def V04.W(offs=0x30) P-INDEP "field V04.W (fldOffset=0x30)"
-;  V10 cse0         [V10,T02] (  3,  3   )  simd16  ->  mm0         "CSE - aggressive"
-;  V11 cse1         [V11,T03] (  3,  2   )  simd16  ->  mm1         "CSE - aggressive"
-;  V12 cse2         [V12,T04] (  3,  2   )  simd16  ->  mm2         "CSE - aggressive"
-;  V13 cse3         [V13,T05] (  3,  2   )  simd16  ->  mm3         "CSE - aggressive"
+;  V05 tmp3         [V05,T00] (  3,  2   )    bool  ->  rax         "Inline return value spill temp"
+;* V06 tmp4         [V06,T05] (  0,  0   )  simd16  ->  zero-ref    single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
+;* V07 tmp5         [V07,T06] (  0,  0   )  simd16  ->  zero-ref    single-def V04.Y(offs=0x10) P-INDEP "field V04.Y (fldOffset=0x10)"
+;* V08 tmp6         [V08,T07] (  0,  0   )  simd16  ->  zero-ref    single-def V04.Z(offs=0x20) P-INDEP "field V04.Z (fldOffset=0x20)"
+;* V09 tmp7         [V09,T08] (  0,  0   )  simd16  ->  zero-ref    single-def V04.W(offs=0x30) P-INDEP "field V04.W (fldOffset=0x30)"
+;* V10 tmp8         [V10    ] (  0,  0   )  simd16  ->  zero-ref    single-def "V01.[000..016)"
+;* V11 tmp9         [V11,T09] (  0,  0   )  simd16  ->  zero-ref    single-def "V01.[016..032)"
+;* V12 tmp10        [V12,T10] (  0,  0   )  simd16  ->  zero-ref    single-def "V01.[032..048)"
+;* V13 tmp11        [V13,T11] (  0,  0   )  simd16  ->  zero-ref    single-def "V01.[048..064)"
+;  V14 cse0         [V14,T01] (  2,  2   )  simd16  ->  mm0         "CSE - aggressive"
+;  V15 cse1         [V15,T02] (  2,  1.50)  simd16  ->  mm1         "CSE - aggressive"
+;  V16 cse2         [V16,T03] (  2,  1.50)  simd16  ->  mm2         "CSE - aggressive"
+;  V17 cse3         [V17,T04] (  2,  1.50)  simd16  ->  mm3         "CSE - aggressive"
 ;
 ; Lcl frame size = 0
 
@@ -116,31 +120,31 @@ G_M3814_IG02:  ;; offset=0003H
        vmovups  xmm1, xmmword ptr [reloc @RWD16]
        vmovups  xmm2, xmmword ptr [reloc @RWD32]
        vmovups  xmm3, xmmword ptr [reloc @RWD48]
-       vcmpps   xmm0, xmm0, xmm0, 0
+       vcmpps   xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
        vmovmskps rax, xmm0
        cmp      eax, 15
        jne      SHORT G_M3814_IG04
-						;; size=46 bbWeight=1 PerfScore 18.25
-G_M3814_IG03:  ;; offset=0031H
-       vcmpps   xmm0, xmm1, xmm1, 0
+						;; size=50 bbWeight=1 PerfScore 18.25
+G_M3814_IG03:  ;; offset=0035H
+       vcmpps   xmm0, xmm1, xmmword ptr [reloc @RWD16], 0
        vmovmskps rax, xmm0
        cmp      eax, 15
        jne      SHORT G_M3814_IG04
-       vcmpps   xmm0, xmm2, xmm2, 0
+       vcmpps   xmm0, xmm2, xmmword ptr [reloc @RWD32], 0
        vmovmskps rax, xmm0
        cmp      eax, 15
        jne      SHORT G_M3814_IG04
-       vcmpps   xmm0, xmm3, xmm3, 0
+       vcmpps   xmm0, xmm3, xmmword ptr [reloc @RWD48], 0
        vmovmskps rax, xmm0
        cmp      eax, 15
        sete     al
        movzx    rax, al
        jmp      SHORT G_M3814_IG05
-						;; size=48 bbWeight=0.50 PerfScore 10.50
-G_M3814_IG04:  ;; offset=0061H
+						;; size=60 bbWeight=0.50 PerfScore 10.50
+G_M3814_IG04:  ;; offset=0071H
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M3814_IG05:  ;; offset=0063H
+G_M3814_IG05:  ;; offset=0073H
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
 RWD00  	dq	000000003F800000h, 0000000000000000h
@@ -149,7 +153,7 @@ RWD32  	dq	0000000000000000h, 000000003F800000h
 RWD48  	dq	0000000000000000h, 3F80000000000000h
 
 
-; Total bytes of code 100, prolog size 3, PerfScore 40.88, instruction count 25, allocated bytes for code 100 (MethodHash=8a71f119) for method Program:IsIdentityBenchmark():bool:this
+; Total bytes of code 116, prolog size 3, PerfScore 42.48, instruction count 25, allocated bytes for code 116 (MethodHash=8a71f119) for method Program:IsIdentityBenchmark():bool:this
 ; ============================================================
 
-225.8 ms
+267.6 ms

@jakobbotsch
Member Author

jakobbotsch commented Jun 16, 2023

System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark
@@ -131,13 +131,13 @@ G_M5743_IG03:  ;; offset=0077H
 ;  V01 RetBuf       [V01,T00] (  6,  6   )   byref  ->  rdx         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[S] "impAppendStmt"
-;* V04 tmp2         [V04    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[S] "spilled call-like call argument"
+;* V04 tmp2         [V04,T02] (  0,  0   )  struct (24) zero-ref    do-not-enreg[S] "spilled call-like call argument"
 ;* V05 tmp3         [V05    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
 ;* V06 tmp4         [V06    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
 ;* V07 tmp5         [V07    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
 ;* V08 tmp6         [V08    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
 ;  V09 tmp7         [V09    ] (  4,  8   )  struct (24) [rsp+00H]   do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
-;* V10 tmp8         [V10,T02] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
+;* V10 tmp8         [V10    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
 ;  V11 tmp9         [V11,T01] (  4,  8   )   byref  ->  rax         single-def "impAppendStmt"
 ;* V12 tmp10        [V12    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
 ;* V13 tmp11        [V13    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
@@ -159,8 +159,11 @@ G_M5743_IG03:  ;; offset=0077H
 ;  V29 tmp27        [V29,T05] (  2,  2   )   simd8  ->  mm0         V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
 ;  V30 tmp28        [V30,T06] (  2,  2   )   simd8  ->  mm1         V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
 ;  V31 tmp29        [V31,T07] (  2,  2   )   simd8  ->  mm2         V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
-;  V32 cse0         [V32,T03] (  2,  2   )   simd8  ->  mm0         "CSE - aggressive"
-;  V33 cse1         [V33,T04] (  2,  2   )   simd8  ->  mm1         "CSE - aggressive"
+;* V32 tmp30        [V32,T14] (  0,  0   )   simd8  ->  zero-ref    "V10.[000..008)"
+;* V33 tmp31        [V33,T15] (  0,  0   )   simd8  ->  zero-ref    "V10.[008..016)"
+;* V34 tmp32        [V34,T16] (  0,  0   )   simd8  ->  zero-ref    "V10.[016..024)"
+;* V35 cse0         [V35,T03] (  0,  0   )   simd8  ->  zero-ref    "CSE - aggressive"
+;* V36 cse1         [V36,T04] (  0,  0   )   simd8  ->  zero-ref    "CSE - aggressive"
 ;
 ; Lcl frame size = 24
 
@@ -170,18 +173,18 @@ G_M5743_IG01:  ;; offset=0000H
 						;; size=7 bbWeight=1 PerfScore 1.25
 G_M5743_IG02:  ;; offset=0007H
        vmovsd   xmm0, qword ptr [reloc @RWD00]
-       vmovsd   xmm1, qword ptr [reloc @RWD08]
-       vmovsd   xmm2, qword ptr [reloc @RWD00]
-       vmovsd   qword ptr [rsp], xmm2
-       vmovsd   xmm2, qword ptr [reloc @RWD08]
-       vmovsd   qword ptr [rsp+08H], xmm2
-       vxorps   xmm2, xmm2, xmm2
-       vmovsd   qword ptr [rsp+10H], xmm2
+       vmovsd   qword ptr [rsp], xmm0
+       vmovsd   xmm0, qword ptr [reloc @RWD08]
+       vmovsd   qword ptr [rsp+08H], xmm0
+       vxorps   xmm0, xmm0, xmm0
+       vmovsd   qword ptr [rsp+10H], xmm0
        lea      rax, bword ptr [rsp]
-       vmovsd   xmm2, qword ptr [rax]
-       vsubps   xmm0, xmm2, xmm0
-       vmovsd   xmm2, qword ptr [rax+08H]
-       vsubps   xmm1, xmm2, xmm1
+       vmovsd   xmm0, qword ptr [rax]
+       vmovsd   xmm1, qword ptr [reloc @RWD00]
+       vsubps   xmm0, xmm0, xmm1
+       vmovsd   xmm1, qword ptr [rax+08H]
+       vmovsd   xmm2, qword ptr [reloc @RWD08]
+       vsubps   xmm1, xmm1, xmm2
        vmovsd   xmm2, qword ptr [rax+10H]
        vxorps   xmm3, xmm3, xmm3
        vsubps   xmm2, xmm2, xmm3
@@ -201,4 +204,4 @@ RWD08  	dq	3F80000000000000h
 ; Total bytes of code 116, prolog size 7, PerfScore 57.52, instruction count 24, allocated bytes for code 116 (MethodHash=9699e990) for method Program:SubtractBenchmark():System.Numerics.Matrix3x2:this

It looks like physical promotion ends up with slightly different pipelining, which seems worse in the lab (although on my laptop's Intel CPU it is sometimes faster than the original).

The codegen for this benchmark is terrible with and without physical promotion. The problem is V09, which we end up address exposing: the JIT is not able to see through the AsImpl() calls with full fidelity. If we change AsImpl to return by value instead of by ref, the problem is solved and the benchmark reduces to a vector constant. At the same time we can switch to Unsafe.BitCast. Does that seem reasonable, @tannergooding?

System.Numerics.Tests.Perf_Matrix3x2.AddOperatorBenchmark, System.Numerics.Tests.Perf_Matrix3x2.SubtractOperatorBenchmark and System.Numerics.Tests.Perf_Matrix3x2.AddBenchmark are all affected in the same way as System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark.

@jakobbotsch
Member Author

jakobbotsch commented Jun 16, 2023

System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixBenchmark
@@ -128,126 +128,140 @@ G_M38613_IG03:  ;; offset=0077H
 ; Final local variable assignments
 ;
 ;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
-;  V01 RetBuf       [V01,T02] (  6,  6   )   byref  ->  rdx         single-def
+;  V01 RetBuf       [V01,T01] (  6,  6   )   byref  ->  rdx         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[S] "impAppendStmt"
 ;* V04 tmp2         [V04    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[S] "spilled call-like call argument"
 ;* V05 tmp3         [V05    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
 ;* V06 tmp4         [V06    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
 ;* V07 tmp5         [V07    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
-;* V08 tmp6         [V08    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
-;  V09 tmp7         [V09    ] (  4,  8   )  struct (24) [rsp+18H]   do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
-;  V10 tmp8         [V10,T00] (  9, 18   )  struct (24) [rsp+00H]   do-not-enreg[SF] ld-addr-op "Inlining Arg"
-;  V11 tmp9         [V11,T01] (  7, 14   )   byref  ->  rax         single-def "impAppendStmt"
+;  V08 tmp6         [V08    ] (  9,  9   )  struct (24) [rsp+18H]   do-not-enreg[SF] ld-addr-op "Inline ldloca(s) first use temp"
+;  V09 tmp7         [V09    ] (  4,  8   )  struct (24) [rsp+00H]   do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
+;* V10 tmp8         [V10    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
+;  V11 tmp9         [V11,T00] (  7, 14   )   byref  ->  rax         single-def "impAppendStmt"
 ;* V12 tmp10        [V12    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline stloc first use temp"
 ;* V13 tmp11        [V13    ] (  0,  0   )  struct (24) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
-;  V14 tmp12        [V14,T07] (  2,  4   )   simd8  ->  mm0         ld-addr-op "NewObj constructor temp"
-;  V15 tmp13        [V15,T08] (  2,  4   )   simd8  ->  mm2         ld-addr-op "NewObj constructor temp"
-;  V16 tmp14        [V16,T09] (  2,  4   )   simd8  ->  mm1         ld-addr-op "NewObj constructor temp"
+;  V14 tmp12        [V14,T09] (  2,  4   )   simd8  ->  mm6         ld-addr-op "NewObj constructor temp"
+;  V15 tmp13        [V15,T10] (  2,  4   )   simd8  ->  mm7         ld-addr-op "NewObj constructor temp"
+;  V16 tmp14        [V16,T11] (  2,  4   )   simd8  ->  mm0         ld-addr-op "NewObj constructor temp"
 ;* V17 tmp15        [V17    ] (  0,  0   )   simd8  ->  zero-ref    V05.X(offs=0x00) P-INDEP "field V05.X (fldOffset=0x0)"
 ;* V18 tmp16        [V18    ] (  0,  0   )   simd8  ->  zero-ref    V05.Y(offs=0x08) P-INDEP "field V05.Y (fldOffset=0x8)"
 ;* V19 tmp17        [V19    ] (  0,  0   )   simd8  ->  zero-ref    V05.Z(offs=0x10) P-INDEP "field V05.Z (fldOffset=0x10)"
-;* V20 tmp18        [V20,T21] (  0,  0   )   simd8  ->  zero-ref    V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
-;* V21 tmp19        [V21,T22] (  0,  0   )   simd8  ->  zero-ref    V06.Y(offs=0x08) P-INDEP "field V06.Y (fldOffset=0x8)"
-;* V22 tmp20        [V22,T23] (  0,  0   )   simd8  ->  zero-ref    V06.Z(offs=0x10) P-INDEP "field V06.Z (fldOffset=0x10)"
+;* V20 tmp18        [V20,T25] (  0,  0   )   simd8  ->  zero-ref    V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
+;* V21 tmp19        [V21,T26] (  0,  0   )   simd8  ->  zero-ref    V06.Y(offs=0x08) P-INDEP "field V06.Y (fldOffset=0x8)"
+;* V22 tmp20        [V22,T27] (  0,  0   )   simd8  ->  zero-ref    V06.Z(offs=0x10) P-INDEP "field V06.Z (fldOffset=0x10)"
 ;* V23 tmp21        [V23    ] (  0,  0   )   simd8  ->  zero-ref    V07.X(offs=0x00) P-INDEP "field V07.X (fldOffset=0x0)"
 ;* V24 tmp22        [V24    ] (  0,  0   )   simd8  ->  zero-ref    V07.Y(offs=0x08) P-INDEP "field V07.Y (fldOffset=0x8)"
 ;* V25 tmp23        [V25    ] (  0,  0   )   simd8  ->  zero-ref    V07.Z(offs=0x10) P-INDEP "field V07.Z (fldOffset=0x10)"
-;* V26 tmp24        [V26,T24] (  0,  0   )   simd8  ->  zero-ref    V08.X(offs=0x00) P-INDEP "field V08.X (fldOffset=0x0)"
-;* V27 tmp25        [V27,T25] (  0,  0   )   simd8  ->  zero-ref    V08.Y(offs=0x08) P-INDEP "field V08.Y (fldOffset=0x8)"
-;* V28 tmp26        [V28,T26] (  0,  0   )   simd8  ->  zero-ref    V08.Z(offs=0x10) P-INDEP "field V08.Z (fldOffset=0x10)"
+;  V26 tmp24        [V26,T02] (  7,  7   )   simd8  ->  [rsp+18H]   do-not-enreg[S] V08.X(offs=0x00) P-DEP "field V08.X (fldOffset=0x0)"
+;  V27 tmp25        [V27,T03] (  7,  7   )   simd8  ->  [rsp+20H]   do-not-enreg[S] V08.Y(offs=0x08) P-DEP "field V08.Y (fldOffset=0x8)"
+;  V28 tmp26        [V28,T04] (  7,  7   )   simd8  ->  [rsp+28H]   do-not-enreg[S] V08.Z(offs=0x10) P-DEP "field V08.Z (fldOffset=0x10)"
 ;* V29 tmp27        [V29    ] (  0,  0   )   simd8  ->  zero-ref    V12.X(offs=0x00) P-INDEP "field V12.X (fldOffset=0x0)"
 ;* V30 tmp28        [V30    ] (  0,  0   )   simd8  ->  zero-ref    V12.Y(offs=0x08) P-INDEP "field V12.Y (fldOffset=0x8)"
 ;* V31 tmp29        [V31    ] (  0,  0   )   simd8  ->  zero-ref    V12.Z(offs=0x10) P-INDEP "field V12.Z (fldOffset=0x10)"
-;  V32 tmp30        [V32,T18] (  2,  2   )   simd8  ->  mm0         V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
-;  V33 tmp31        [V33,T19] (  2,  2   )   simd8  ->  mm2         V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
-;  V34 tmp32        [V34,T20] (  2,  2   )   simd8  ->  mm1         V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
-;  V35 cse0         [V35,T10] (  3,  3   )   float  ->  mm3         "CSE - aggressive"
-;  V36 cse1         [V36,T11] (  3,  3   )   float  ->  mm2         "CSE - aggressive"
-;  V37 cse2         [V37,T12] (  3,  3   )   float  ->  mm7         "CSE - aggressive"
-;  V38 cse3         [V38,T13] (  3,  3   )   float  ->  mm3         "CSE - aggressive"
-;  V39 cse4         [V39,T14] (  3,  3   )   float  ->  mm7         "CSE - aggressive"
-;  V40 cse5         [V40,T03] (  4,  4   )   float  ->  mm1         "CSE - aggressive"
-;  V41 cse6         [V41,T04] (  4,  4   )   float  ->  mm4         "CSE - aggressive"
-;  V42 cse7         [V42,T05] (  4,  4   )   float  ->  mm5         "CSE - aggressive"
-;  V43 cse8         [V43,T06] (  4,  4   )   float  ->  mm6         "CSE - aggressive"
-;* V44 cse9         [V44,T15] (  0,  0   )   simd8  ->  zero-ref    "CSE - aggressive"
-;* V45 cse10        [V45,T16] (  0,  0   )   simd8  ->  zero-ref    "CSE - aggressive"
-;  V46 cse11        [V46,T17] (  3,  3   )   float  ->  mm0         "CSE - aggressive"
+;  V32 tmp30        [V32,T20] (  2,  2   )   simd8  ->  mm6         V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
+;  V33 tmp31        [V33,T21] (  2,  2   )   simd8  ->  mm7         V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
+;  V34 tmp32        [V34,T22] (  2,  2   )   simd8  ->  mm0         V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
+;  V35 tmp33        [V35,T05] (  4,  4   )   float  ->  mm0         "V04.[000..004)"
+;  V36 tmp34        [V36,T06] (  4,  4   )   float  ->  mm1         "V04.[004..008)"
+;  V37 tmp35        [V37,T07] (  4,  4   )   float  ->  mm2         "V04.[008..012)"
+;  V38 tmp36        [V38,T08] (  4,  4   )   float  ->  mm3         "V04.[012..016)"
+;  V39 tmp37        [V39,T23] (  2,  2   )   float  ->  mm4         "V04.[016..020)"
+;  V40 tmp38        [V40,T24] (  2,  2   )   float  ->  mm5         "V04.[020..024)"
+;* V41 tmp39        [V41    ] (  0,  0   )   float  ->  zero-ref    "V10.[000..004)"
+;* V42 tmp40        [V42    ] (  0,  0   )   float  ->  zero-ref    "V10.[004..008)"
+;* V43 tmp41        [V43    ] (  0,  0   )   float  ->  zero-ref    "V10.[008..012)"
+;* V44 tmp42        [V44    ] (  0,  0   )   float  ->  zero-ref    "V10.[012..016)"
+;* V45 tmp43        [V45    ] (  0,  0   )   float  ->  zero-ref    "V10.[016..020)"
+;* V46 tmp44        [V46    ] (  0,  0   )   float  ->  zero-ref    "V10.[020..024)"
+;  V47 cse0         [V47,T12] (  3,  3   )   float  ->  mm8         "CSE - aggressive"
+;  V48 cse1         [V48,T13] (  3,  3   )   float  ->  mm7         "CSE - aggressive"
+;  V49 cse2         [V49,T14] (  3,  3   )   float  ->  mm9         "CSE - aggressive"
+;  V50 cse3         [V50,T15] (  3,  3   )   float  ->  mm8         "CSE - aggressive"
+;  V51 cse4         [V51,T16] (  3,  3   )   float  ->  mm9         "CSE - aggressive"
+;  V52 cse5         [V52,T17] (  2,  2   )   simd8  ->  mm0         "CSE - aggressive"
+;  V53 cse6         [V53,T18] (  2,  2   )   simd8  ->  mm1         "CSE - aggressive"
+;  V54 cse7         [V54,T19] (  3,  3   )   float  ->  mm6         "CSE - aggressive"
 ;
-; Lcl frame size = 104
+; Lcl frame size = 136
 
 G_M38613_IG01:  ;; offset=0000H
-       sub      rsp, 104
+       sub      rsp, 136
        vzeroupper 
-       vmovaps  xmmword ptr [rsp+50H], xmm6
-       vmovaps  xmmword ptr [rsp+40H], xmm7
-       vmovaps  xmmword ptr [rsp+30H], xmm8
-						;; size=25 bbWeight=1 PerfScore 7.25
-G_M38613_IG02:  ;; offset=0019H
+       vmovaps  xmmword ptr [rsp+70H], xmm6
+       vmovaps  xmmword ptr [rsp+60H], xmm7
+       vmovaps  xmmword ptr [rsp+50H], xmm8
+       vmovaps  xmmword ptr [rsp+40H], xmm9
+       vmovaps  xmmword ptr [rsp+30H], xmm10
+						;; size=40 bbWeight=1 PerfScore 11.25
+G_M38613_IG02:  ;; offset=0028H
        vmovsd   xmm0, qword ptr [reloc @RWD00]
+       vmovsd   xmm1, qword ptr [reloc @RWD08]
        vmovsd   qword ptr [rsp+18H], xmm0
-       vmovsd   xmm0, qword ptr [reloc @RWD08]
-       vmovsd   qword ptr [rsp+20H], xmm0
+       vmovsd   qword ptr [rsp+20H], xmm1
        vxorps   xmm0, xmm0, xmm0
        vmovsd   qword ptr [rsp+28H], xmm0
-       vmovsd   xmm0, qword ptr [reloc @RWD00]
-       vmovsd   qword ptr [rsp], xmm0
-       vmovsd   xmm0, qword ptr [reloc @RWD08]
-       vmovsd   qword ptr [rsp+08H], xmm0
-       vxorps   xmm0, xmm0, xmm0
-       vmovsd   qword ptr [rsp+10H], xmm0
-       lea      rax, bword ptr [rsp+18H]
-       vmovss   xmm0, dword ptr [rax]
-       vmovss   xmm1, dword ptr [rsp]
-       vmulss   xmm2, xmm0, xmm1
-       vmovss   xmm3, dword ptr [rax+04H]
-       vmovss   xmm4, dword ptr [rsp+08H]
-       vmulss   xmm5, xmm3, xmm4
-       vaddss   xmm2, xmm2, xmm5
-       vmovss   xmm5, dword ptr [rsp+04H]
-       vmulss   xmm0, xmm0, xmm5
-       vmovss   xmm6, dword ptr [rsp+0CH]
-       vmulss   xmm3, xmm3, xmm6
-       vaddss   xmm0, xmm0, xmm3
-       vinsertps xmm0, xmm2, xmm0, 28
-       vmovss   xmm2, dword ptr [rax+08H]
-       vmulss   xmm3, xmm2, xmm1
-       vmovss   xmm7, dword ptr [rax+0CH]
-       vmulss   xmm8, xmm7, xmm4
-       vaddss   xmm3, xmm3, xmm8
-       vmulss   xmm2, xmm2, xmm5
-       vmulss   xmm7, xmm7, xmm6
-       vaddss   xmm2, xmm2, xmm7
-       vinsertps xmm2, xmm3, xmm2, 28
-       vmovss   xmm3, dword ptr [rax+10H]
-       vmulss   xmm1, xmm3, xmm1
-       vmovss   xmm7, dword ptr [rax+14H]
-       vmulss   xmm4, xmm7, xmm4
-       vaddss   xmm1, xmm1, xmm4
-       vaddss   xmm1, xmm1, dword ptr [rsp+10H]
-       vmulss   xmm3, xmm3, xmm5
-       vmulss   xmm4, xmm7, xmm6
-       vaddss   xmm3, xmm3, xmm4
-       vaddss   xmm3, xmm3, dword ptr [rsp+14H]
-       vinsertps xmm1, xmm1, xmm3, 28
-       vmovsd   qword ptr [rdx], xmm0
-       vmovsd   qword ptr [rdx+08H], xmm2
-       vmovsd   qword ptr [rdx+10H], xmm1
+       vmovss   xmm0, dword ptr [rsp+18H]
+       vmovss   xmm1, dword ptr [rsp+1CH]
+       vmovss   xmm2, dword ptr [rsp+20H]
+       vmovss   xmm3, dword ptr [rsp+24H]
+       vmovss   xmm4, dword ptr [rsp+28H]
+       vmovss   xmm5, dword ptr [rsp+2CH]
+       vmovsd   xmm6, qword ptr [reloc @RWD00]
+       vmovsd   qword ptr [rsp], xmm6
+       vmovsd   xmm6, qword ptr [reloc @RWD08]
+       vmovsd   qword ptr [rsp+08H], xmm6
+       vxorps   xmm6, xmm6, xmm6
+       vmovsd   qword ptr [rsp+10H], xmm6
+       lea      rax, bword ptr [rsp]
+       vmovss   xmm6, dword ptr [rax]
+       vmulss   xmm7, xmm6, xmm0
+       vmovss   xmm8, dword ptr [rax+04H]
+       vmulss   xmm9, xmm8, xmm2
+       vaddss   xmm7, xmm7, xmm9
+       vmulss   xmm6, xmm6, xmm1
+       vmulss   xmm8, xmm8, xmm3
+       vaddss   xmm6, xmm6, xmm8
+       vinsertps xmm6, xmm7, xmm6, 28
+       vmovss   xmm7, dword ptr [rax+08H]
+       vmulss   xmm8, xmm7, xmm0
+       vmovss   xmm9, dword ptr [rax+0CH]
+       vmulss   xmm10, xmm9, xmm2
+       vaddss   xmm8, xmm8, xmm10
+       vmulss   xmm7, xmm7, xmm1
+       vmulss   xmm9, xmm9, xmm3
+       vaddss   xmm7, xmm7, xmm9
+       vinsertps xmm7, xmm8, xmm7, 28
+       vmovss   xmm8, dword ptr [rax+10H]
+       vmulss   xmm0, xmm8, xmm0
+       vmovss   xmm9, dword ptr [rax+14H]
+       vmulss   xmm2, xmm9, xmm2
+       vaddss   xmm0, xmm0, xmm2
+       vaddss   xmm0, xmm0, xmm4
+       vmulss   xmm1, xmm8, xmm1
+       vmulss   xmm2, xmm9, xmm3
+       vaddss   xmm1, xmm1, xmm2
+       vaddss   xmm1, xmm1, xmm5
+       vinsertps xmm0, xmm0, xmm1, 28
+       vmovsd   qword ptr [rdx], xmm6
+       vmovsd   qword ptr [rdx+08H], xmm7
+       vmovsd   qword ptr [rdx+10H], xmm0
        mov      rax, rdx
-						;; size=252 bbWeight=1 PerfScore 128.42
-G_M38613_IG03:  ;; offset=0115H
-       vmovaps  xmm6, xmmword ptr [rsp+50H]
-       vmovaps  xmm7, xmmword ptr [rsp+40H]
-       vmovaps  xmm8, xmmword ptr [rsp+30H]
-       add      rsp, 104
+						;; size=263 bbWeight=1 PerfScore 130.42
+G_M38613_IG03:  ;; offset=012FH
+       vmovaps  xmm6, xmmword ptr [rsp+70H]
+       vmovaps  xmm7, xmmword ptr [rsp+60H]
+       vmovaps  xmm8, xmmword ptr [rsp+50H]
+       vmovaps  xmm9, xmmword ptr [rsp+40H]
+       vmovaps  xmm10, xmmword ptr [rsp+30H]
+       add      rsp, 136
        ret      
-						;; size=23 bbWeight=1 PerfScore 13.25
+						;; size=38 bbWeight=1 PerfScore 21.25
 RWD00  	dq	000000003F800000h
 RWD08  	dq	3F80000000000000h
 
 
-; Total bytes of code 300, prolog size 25, PerfScore 178.92, instruction count 60, allocated bytes for code 300 (MethodHash=f176692a) for method Program:MultiplyByMatrixOperatorBenchmark():Program+Matrix3x2:this
+; Total bytes of code 341, prolog size 40, PerfScore 197.02, instruction count 66, allocated bytes for code 341 (MethodHash=f176692a) for method Program:MultiplyByMatrixOperatorBenchmark():Program+Matrix3x2:this
 
-499.7 ms
+555.9 ms

We need more registers (xmm9 and xmm10 are now saved and restored in the prolog and epilog) and see the same kind of pipelining change as in the previous comment. In addition, V08 is now marked do-not-enregister (DNER) due to #87554.

System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark is similarly affected.
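
For readers following the diff above, the scalar `vmulss`/`vaddss` sequence in IG02 matches the usual 3x2 affine matrix product, and the constants `RWD00`/`RWD08` decode to the float pairs (1, 0) and (0, 1), i.e. the rows of an identity matrix. The following is a minimal C++ sketch of that arithmetic. The struct and the `Multiply` function are stand-ins whose field names mirror `System.Numerics.Matrix3x2`; this is not the C# source of the benchmark.

```cpp
// Stand-in for a 3x2 affine matrix; field names mirror System.Numerics.Matrix3x2.
struct Matrix3x2
{
    float M11, M12;
    float M21, M22;
    float M31, M32;
};

// Product of two 3x2 affine matrices. The third row also picks up the
// translation of the right-hand operand, which corresponds to the two extra
// vaddss instructions near the end of IG02.
Matrix3x2 Multiply(const Matrix3x2& a, const Matrix3x2& b)
{
    Matrix3x2 r;
    r.M11 = a.M11 * b.M11 + a.M12 * b.M21;
    r.M12 = a.M11 * b.M12 + a.M12 * b.M22;
    r.M21 = a.M21 * b.M11 + a.M22 * b.M21;
    r.M22 = a.M21 * b.M12 + a.M22 * b.M22;
    r.M31 = a.M31 * b.M11 + a.M32 * b.M21 + b.M31;
    r.M32 = a.M31 * b.M12 + a.M32 * b.M22 + b.M32;
    return r;
}
```

Each of the six result fields needs values from both operands, so once one operand's fields are held in registers for the whole computation, register pressure goes up, which is consistent with the extra callee-saved registers in the new prolog.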

@jakobbotsch
Member Author

Promoting TYP_SIMD32 and TYP_SIMD64 fields can be very expensive if we end up creating long lifetimes that span across calls where the upper halves need to be saved/restored. For example: https://gist.github.com/jakobbotsch/e09b0e75ecfac6934ae51c8902748491

The pass does not currently have the information it would need to take this into account, so we need to think about what to do here.
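
To make the cost concrete, here is a rough C++ sketch that counts how many upper-half save/restore pairs a single promoted wide SIMD field would incur. It assumes a simplified model in which a lifetime is a half-open range of instruction positions and every call inside that range needs one save and one restore. `LiveRange`, `countCallsSpanned` and `estimateUpperHalfSaveRestores` are hypothetical names for illustration only; none of this is the JIT's actual data structure or heuristic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: a lifetime is the half-open interval [begin, end) of
// instruction positions. This is not the JIT's real representation.
struct LiveRange
{
    std::size_t begin;
    std::size_t end;
};

// Counts the calls whose position lies inside [begin, end).
std::size_t countCallsSpanned(const LiveRange& range, const std::vector<std::size_t>& callPositions)
{
    std::size_t count = 0;
    for (std::size_t pos : callPositions)
    {
        if (pos >= range.begin && pos < range.end)
        {
            count++;
        }
    }
    return count;
}

// Assumption: each spanned call costs one save plus one restore of the upper half.
std::size_t estimateUpperHalfSaveRestores(const LiveRange& range, const std::vector<std::size_t>& callPositions)
{
    return 2 * countCallsSpanned(range, callPositions);
}
```

Under this model a field that is live across three calls pays for six extra instructions that the unpromoted struct access would not need, which is the kind of trade-off the pass cannot currently see.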

@jakobbotsch jakobbotsch added the Priority:2 Work that is important, but not critical for the release label Jun 20, 2023
@jakobbotsch
Member Author

Looking at this perfscore regression:

306966.04 ( 0.29% of base) : 53160.dasm - Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this
@@ -19,7 +19,7 @@
 ;* V07 loc4         [V07    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op
 ;  V08 OutArgs      [V08    ] (  1,      1   )  struct (40) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V09 tmp1         [V09    ] (  0,      0   )  struct (16) zero-ref    "impAppendStmt"
-;  V10 tmp2         [V10,T03] (  4,1668919.51)  struct (24) [rsp+90H]   do-not-enreg[S] ld-addr-op "NewObj constructor temp"
+;* V10 tmp2         [V10    ] (  0,      0   )  struct (24) zero-ref    do-not-enreg[S] ld-addr-op "NewObj constructor temp"
 ;* V11 tmp3         [V11    ] (  0,      0   )  struct (16) zero-ref    "spilled call-like call argument"
 ;* V12 tmp4         [V12    ] (  0,      0   )     int  ->  zero-ref    "Strict ordering of exceptions for Array store"
 ;* V13 tmp5         [V13    ] (  0,      0   )  double  ->  zero-ref    "Inlining Arg"
@@ -31,37 +31,37 @@
 ;* V19 tmp11        [V19    ] (  0,      0   )  struct (16) zero-ref    "spilled call-like call argument"
 ;* V20 tmp12        [V20    ] (  0,      0   )  double  ->  zero-ref    "Inlining Arg"
 ;* V21 tmp13        [V21    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
-;  V22 tmp14        [V22,T36] (  2, 834459.76)  double  ->  mm1         "Inlining Arg"
+;  V22 tmp14        [V22,T38] (  2, 834459.76)  double  ->  mm1         "Inlining Arg"
 ;* V23 tmp15        [V23    ] (  0,      0   )  struct (16) zero-ref    "Inlining Arg"
 ;* V24 tmp16        [V24    ] (  0,      0   )  double  ->  zero-ref    "Inlining Arg"
 ;* V25 tmp17        [V25    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
-;  V26 tmp18        [V26,T37] (  2, 834459.76)  double  ->  mm3         "Inlining Arg"
+;  V26 tmp18        [V26,T39] (  2, 834459.76)  double  ->  mm2         "Inlining Arg"
 ;* V27 tmp19        [V27    ] (  0,      0   )  struct (16) zero-ref    "Inlining Arg"
 ;* V28 tmp20        [V28    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
 ;* V29 tmp21        [V29    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
-;  V30 tmp22        [V30,T46] (  3, 625844.82)   float  ->  mm1         "Inline stloc first use temp"
-;  V31 tmp23        [V31,T50] (  3, 417229.88)   float  ->  mm9        
+;  V30 tmp22        [V30,T44] (  3, 625844.82)   float  ->  mm1         "Inline stloc first use temp"
+;  V31 tmp23        [V31,T48] (  3, 417229.88)   float  ->  mm9        
 ;* V32 tmp24        [V32    ] (  0,      0   )   float  ->  zero-ref    "Inline stloc first use temp"
 ;* V33 tmp25        [V33    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
 ;* V34 tmp26        [V34    ] (  0,      0   )  double  ->  zero-ref    "Inlining Arg"
-;  V35 tmp27        [V35    ] (  3, 417229.88)  struct (16) [rsp+80H]   do-not-enreg[XS] must-init addr-exposed "Inline return value spill temp"
-;  V36 tmp28        [V36,T04] (  3,1661092.31)  struct (24) [rsp+68H]   do-not-enreg[S] "Inlining Arg"
+;  V35 tmp27        [V35    ] (  3, 417229.88)  struct (16) [rsp+88H]   do-not-enreg[XS] must-init addr-exposed "Inline return value spill temp"
+;  V36 tmp28        [V36,T07] (  3,1418003.17)  struct (24) [rsp+70H]   do-not-enreg[S] "Inlining Arg"
 ;  V37 tmp29        [V37,T21] (  3, 546589.36)     ref  ->   r8         class-hnd "Inline stloc first use temp"
 ;* V38 tmp30        [V38    ] (  0,      0   )     ref  ->  zero-ref    class-hnd "Inline return value spill temp"
 ;  V39 tmp31        [V39,T20] (  5, 577728.02)     ref  ->  r13         class-hnd "Inline stloc first use temp"
-;  V40 tmp32        [V40,T10] (  3,1039161.09)     ref  ->  [rsp+38H]   class-hnd spill-single-def "Inline stloc first use temp"
-;  V41 tmp33        [V41,T01] (  5,2696339.80)     int  ->  [rsp+64H]   "Inline stloc first use temp"
-;  V42 tmp34        [V42,T02] (  4,1865793.64)     ref  ->   r8         class-hnd "Inline stloc first use temp"
+;  V40 tmp32        [V40,T10] (  3,1039161.09)     ref  ->  [rsp+40H]   class-hnd spill-single-def "Inline stloc first use temp"
+;  V41 tmp33        [V41,T01] (  5,2696339.80)     int  ->  [rsp+6CH]   "Inline stloc first use temp"
+;  V42 tmp34        [V42,T03] (  4,1865793.64)     ref  ->   r8         class-hnd "Inline stloc first use temp"
 ;  V43 tmp35        [V43,T08] (  4,1324112.02)     ref  ->   r8         class-hnd "Inline stloc first use temp"
-;  V44 tmp36        [V44,T05] (  4,1531185.21)     ref  ->   r8         "guarded devirt return temp"
-;  V45 tmp37        [V45,T06] (  4,1484307.13)     ref  ->  [rsp+30H]   class-hnd exact spill-single-def "guarded devirt this exact temp"
+;  V44 tmp36        [V44,T04] (  4,1531185.21)     ref  ->   r8         "guarded devirt return temp"
+;  V45 tmp37        [V45,T05] (  4,1484307.13)     ref  ->  [rsp+38H]   class-hnd exact spill-single-def "guarded devirt this exact temp"
 ;* V46 tmp38        [V46    ] (  0,      0   )  struct (16) zero-ref    "Inline stloc first use temp"
-;  V47 tmp39        [V47,T32] (  4,1208697.08)   float  ->  mm11         "Inline stloc first use temp"
+;  V47 tmp39        [V47,T34] (  4,1208697.08)   float  ->  mm11         "Inline stloc first use temp"
 ;* V48 tmp40        [V48    ] (  0,      0   )  double  ->  zero-ref    "impAppendStmt"
-;  V49 tmp41        [V49,T44] (  3, 765017.45)  double  ->  mm0         "Inline stloc first use temp"
-;  V50 tmp42        [V50,T45] (  3, 713112.43)   float  ->  mm10        
-;  V51 tmp43        [V51,T31] (  3,1156792.06)   float  ->  mm10         "Inline stloc first use temp"
-;  V52 tmp44        [V52,T00] (  5,3855973.53)     ref  ->  [rsp+28H]   class-hnd exact spill-single-def "NewObj constructor temp"
+;  V49 tmp41        [V49,T42] (  3, 765017.45)  double  ->  mm0         "Inline stloc first use temp"
+;  V50 tmp42        [V50,T43] (  3, 713112.43)   float  ->  mm10        
+;  V51 tmp43        [V51,T33] (  3,1156792.06)   float  ->  mm10         "Inline stloc first use temp"
+;  V52 tmp44        [V52,T00] (  5,3855973.53)     ref  ->  [rsp+30H]   class-hnd exact spill-single-def "NewObj constructor temp"
 ;* V53 tmp45        [V53    ] (  0,      0   )  struct (16) zero-ref    ld-addr-op "Inline ldloca(s) first use temp"
 ;* V54 tmp46        [V54    ] (  0,      0   )  struct (16) zero-ref    "Inlining Arg"
 ;* V55 tmp47        [V55    ] (  0,      0   )  struct (16) zero-ref    "Inlining Arg"
@@ -78,44 +78,47 @@
 ;  V66 tmp58        [V66,T24] (  3, 417229.88)     int  ->  rax         "Inline return value spill temp"
 ;* V67 tmp59        [V67    ] (  0,      0   )   float  ->  zero-ref    "Inlining Arg"
 ;  V68 tmp60        [V68,T18] (  3, 625844.82)     int  ->  rax         "Inline stloc first use temp"
-;  V69 tmp61        [V69,T33] (  4, 834459.76)  simd12  ->  mm0         V07._simdVector(offs=0x00) P-INDEP "field V07._simdVector (fldOffset=0x0)"
-;  V70 tmp62        [V70,T51] (  2, 417229.88)  simd12  ->  mm2         V09._simdVector(offs=0x00) P-INDEP "field V09._simdVector (fldOffset=0x0)"
+;  V69 tmp61        [V69,T35] (  4, 834459.76)  simd12  ->  mm0         V07._simdVector(offs=0x00) P-INDEP "field V07._simdVector (fldOffset=0x0)"
+;  V70 tmp62        [V70,T49] (  2, 417229.88)  simd12  ->  mm7         V09._simdVector(offs=0x00) P-INDEP "field V09._simdVector (fldOffset=0x0)"
 ;* V71 tmp63        [V71    ] (  0,      0   )  simd12  ->  zero-ref    V11._simdVector(offs=0x00) P-INDEP "field V11._simdVector (fldOffset=0x0)"
-;  V72 tmp64        [V72,T52] (  2, 417229.88)  simd12  ->  mm0         V14._simdVector(offs=0x00) P-INDEP "field V14._simdVector (fldOffset=0x0)"
+;  V72 tmp64        [V72,T50] (  2, 417229.88)  simd12  ->  mm0         V14._simdVector(offs=0x00) P-INDEP "field V14._simdVector (fldOffset=0x0)"
 ;* V73 tmp65        [V73    ] (  0,      0   )  simd12  ->  zero-ref    V16._simdVector(offs=0x00) P-INDEP "field V16._simdVector (fldOffset=0x0)"
 ;* V74 tmp66        [V74    ] (  0,      0   )  simd12  ->  zero-ref    V17._simdVector(offs=0x00) P-INDEP "field V17._simdVector (fldOffset=0x0)"
 ;* V75 tmp67        [V75    ] (  0,      0   )  simd12  ->  zero-ref    V18._simdVector(offs=0x00) P-INDEP "field V18._simdVector (fldOffset=0x0)"
-;  V76 tmp68        [V76,T34] (  4, 834459.76)  simd12  ->  mm0         V19._simdVector(offs=0x00) P-INDEP "field V19._simdVector (fldOffset=0x0)"
-;  V77 tmp69        [V77,T53] (  2, 417229.88)  simd12  ->  mm1         V21._simdVector(offs=0x00) P-INDEP "field V21._simdVector (fldOffset=0x0)"
-;  V78 tmp70        [V78,T54] (  2, 417229.88)  simd12  ->  mm4         V23._simdVector(offs=0x00) P-INDEP "field V23._simdVector (fldOffset=0x0)"
-;  V79 tmp71        [V79,T55] (  2, 417229.88)  simd12  ->  mm3         V25._simdVector(offs=0x00) P-INDEP "field V25._simdVector (fldOffset=0x0)"
-;  V80 tmp72        [V80,T56] (  2, 417229.88)  simd12  ->  mm4         V27._simdVector(offs=0x00) P-INDEP "field V27._simdVector (fldOffset=0x0)"
-;  V81 tmp73        [V81,T57] (  2, 417229.88)  simd12  ->  mm1         V28._simdVector(offs=0x00) P-INDEP "field V28._simdVector (fldOffset=0x0)"
-;  V82 tmp74        [V82,T58] (  2, 417229.88)  simd12  ->  mm0         V29._simdVector(offs=0x00) P-INDEP "field V29._simdVector (fldOffset=0x0)"
-;  V83 tmp75        [V83,T59] (  2, 417229.88)  simd12  ->  mm0         V33._simdVector(offs=0x00) P-INDEP "field V33._simdVector (fldOffset=0x0)"
-;  V84 tmp76        [V84    ] (  3, 417229.88)  simd12  ->  [rsp+80H]   do-not-enreg[XS] addr-exposed V35._simdVector(offs=0x00) P-DEP "field V35._simdVector (fldOffset=0x0)"
+;  V76 tmp68        [V76,T36] (  4, 834459.76)  simd12  ->  mm0         V19._simdVector(offs=0x00) P-INDEP "field V19._simdVector (fldOffset=0x0)"
+;  V77 tmp69        [V77,T51] (  2, 417229.88)  simd12  ->  mm1         V21._simdVector(offs=0x00) P-INDEP "field V21._simdVector (fldOffset=0x0)"
+;  V78 tmp70        [V78,T52] (  2, 417229.88)  simd12  ->  mm3         V23._simdVector(offs=0x00) P-INDEP "field V23._simdVector (fldOffset=0x0)"
+;  V79 tmp71        [V79,T53] (  2, 417229.88)  simd12  ->  mm2         V25._simdVector(offs=0x00) P-INDEP "field V25._simdVector (fldOffset=0x0)"
+;  V80 tmp72        [V80,T54] (  2, 417229.88)  simd12  ->  mm3         V27._simdVector(offs=0x00) P-INDEP "field V27._simdVector (fldOffset=0x0)"
+;  V81 tmp73        [V81,T55] (  2, 417229.88)  simd12  ->  mm1         V28._simdVector(offs=0x00) P-INDEP "field V28._simdVector (fldOffset=0x0)"
+;  V82 tmp74        [V82,T56] (  2, 417229.88)  simd12  ->  mm0         V29._simdVector(offs=0x00) P-INDEP "field V29._simdVector (fldOffset=0x0)"
+;  V83 tmp75        [V83,T57] (  2, 417229.88)  simd12  ->  mm9         V33._simdVector(offs=0x00) P-INDEP "field V33._simdVector (fldOffset=0x0)"
+;  V84 tmp76        [V84    ] (  3, 417229.88)  simd12  ->  [rsp+88H]   do-not-enreg[XS] addr-exposed V35._simdVector(offs=0x00) P-DEP "field V35._simdVector (fldOffset=0x0)"
 ;  V85 tmp77        [V85,T29] (  4,1426224.85)  simd12  ->  mm10         V46._simdVector(offs=0x00) P-INDEP "field V46._simdVector (fldOffset=0x0)"
 ;  V86 tmp78        [V86,T40] (  2, 771194.71)  simd12  ->  mm10         V53._simdVector(offs=0x00) P-INDEP "field V53._simdVector (fldOffset=0x0)"
 ;  V87 tmp79        [V87,T41] (  2, 771194.71)  simd12  ->  mm0         V54._simdVector(offs=0x00) P-INDEP "field V54._simdVector (fldOffset=0x0)"
-;  V88 tmp80        [V88,T42] (  2, 771194.71)  simd12  ->  mm1         V55._simdVector(offs=0x00) P-INDEP "field V55._simdVector (fldOffset=0x0)"
-;  V89 tmp81        [V89,T43] (  2, 771194.71)  simd12  ->  mm0         V56._simdVector(offs=0x00) P-INDEP "field V56._simdVector (fldOffset=0x0)"
-;* V90 tmp82        [V90,T61] (  0,      0   )  simd12  ->  zero-ref    V58._simdVector(offs=0x00) P-INDEP "field V58._simdVector (fldOffset=0x0)"
-;  V91 tmp83        [V91    ] (  2, 945335.45)  struct (24) [rsp+48H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
-;  V92 tmp84        [V92,T07] (  3,1418003.17)     ref  ->   r8         "argument with side effect"
-;  V93 cse0         [V93,T38] (  3, 802827.23)  simd12  ->  mm9         "CSE - aggressive"
-;  V94 cse1         [V94,T47] (  3, 625844.82)  double  ->  mm1         "CSE - moderate"
-;  V95 cse2         [V95,T48] (  3, 625844.82)  double  ->  mm4         "CSE - moderate"
-;  V96 cse3         [V96,T60] (  2, 209609.18)  double  ->  mm6         "CSE - conservative"
-;  V97 cse4         [V97,T39] (  3, 802827.23)  simd12  ->  mm7         "CSE - aggressive"
-;  V98 cse5         [V98,T28] (  3,   2997.00)     int  ->  rax         "CSE - conservative"
-;  V99 cse6         [V99,T30] (  5,1280874.96)  double  ->  mm8         "CSE - aggressive"
-;  V100 cse7        [V100,T11] (  3,1039161.09)     int  ->  [rsp+44H]   spill-single-def "CSE - aggressive"
-;  V101 cse8        [V101,T35] (  4, 834459.76)   float  ->  mm2         "CSE - aggressive"
-;  V102 cse9        [V102,T49] (  3, 625844.82)  double  ->  mm3         "CSE - moderate"
-;  V103 cse10       [V103,T19] (  3, 625844.82)     int  ->  rcx         "CSE - moderate"
-;  TEMP_01                                      double  ->  [rsp+0xA8]
+;* V88 tmp80        [V88    ] (  0,      0   )  simd12  ->  zero-ref    V55._simdVector(offs=0x00) P-INDEP "field V55._simdVector (fldOffset=0x0)"
+;* V89 tmp81        [V89    ] (  0,      0   )  simd12  ->  zero-ref    V56._simdVector(offs=0x00) P-INDEP "field V56._simdVector (fldOffset=0x0)"
+;* V90 tmp82        [V90,T59] (  0,      0   )  simd12  ->  zero-ref    V58._simdVector(offs=0x00) P-INDEP "field V58._simdVector (fldOffset=0x0)"
+;* V91 tmp83        [V91    ] (  0,      0   )  simd12  ->  zero-ref    "V10.[000..012)"
+;* V92 tmp84        [V92    ] (  0,      0   )  simd12  ->  zero-ref    "V10.[012..024)"
+;  V93 tmp85        [V93,T31] (  4,1216143.51)  simd12  ->  mm7         "V36.[000..012)"
+;  V94 tmp86        [V94,T32] (  4,1216143.51)  simd12  ->  mm9         "V36.[012..024)"
+;  V95 tmp87        [V95,T02] (  3,2313584.12)   byref  ->  rcx         "Spilling address for field-by-field copy"
+;  V96 tmp88        [V96    ] (  2, 945335.45)  struct (24) [rsp+50H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
+;  V97 tmp89        [V97,T06] (  3,1418003.17)     ref  ->   r8         "argument with side effect"
+;  V98 cse0         [V98,T45] (  3, 625844.82)  double  ->  mm1         "CSE - moderate"
+;  V99 cse1         [V99,T46] (  3, 625844.82)  double  ->  mm3         "CSE - moderate"
+;  V100 cse2        [V100,T58] (  2, 209609.18)  double  ->  mm6         "CSE - conservative"
+;  V101 cse3        [V101,T28] (  3,   2997.00)     int  ->  rax         "CSE - conservative"
+;  V102 cse4        [V102,T30] (  5,1280874.96)  double  ->  mm8         "CSE - aggressive"
+;  V103 cse5        [V103,T11] (  3,1039161.09)     int  ->  [rsp+4CH]   spill-single-def "CSE - aggressive"
+;  V104 cse6        [V104,T37] (  4, 834459.76)   float  ->  mm2         "CSE - moderate"
+;  V105 cse7        [V105,T47] (  3, 625844.82)  double  ->  mm2         "CSE - moderate"
+;  V106 cse8        [V106,T19] (  3, 625844.82)     int  ->  rcx         "CSE - moderate"
+;  TEMP_01                                      double  ->  [rsp+0x98]
 ;
-; Lcl frame size = 280
+; Lcl frame size = 264
 
 G_M31648_IG01:  ;; offset=0000H
        push     r15
@@ -126,17 +129,17 @@ G_M31648_IG01:  ;; offset=0000H
        push     rsi
        push     rbp
        push     rbx
-       sub      rsp, 280
+       sub      rsp, 264
        vzeroupper 
-       vmovaps  xmmword ptr [rsp+100H], xmm6
-       vmovaps  xmmword ptr [rsp+F0H], xmm7
-       vmovaps  xmmword ptr [rsp+E0H], xmm8
-       vmovaps  xmmword ptr [rsp+D0H], xmm9
-       vmovaps  xmmword ptr [rsp+C0H], xmm10
-       vmovaps  xmmword ptr [rsp+B0H], xmm11
+       vmovaps  xmmword ptr [rsp+F0H], xmm6
+       vmovaps  xmmword ptr [rsp+E0H], xmm7
+       vmovaps  xmmword ptr [rsp+D0H], xmm8
+       vmovaps  xmmword ptr [rsp+C0H], xmm9
+       vmovaps  xmmword ptr [rsp+B0H], xmm10
+       vmovaps  xmmword ptr [rsp+A0H], xmm11
        xor      eax, eax
-       mov      qword ptr [rsp+80H], rax
        mov      qword ptr [rsp+88H], rax
+       mov      qword ptr [rsp+90H], rax
        mov      rsi, rcx
        mov      rbx, rdx
        mov      rdi, r8
@@ -162,213 +165,203 @@ G_M31648_IG04:  ;; offset=008CH
 G_M31648_IG05:  ;; offset=0094H
        vmovsd   xmm7, qword ptr [r15+08H]
        vinsertps xmm7, xmm7, dword ptr [r15+10H], 40
-       vmovaps  xmm2, xmm7
        vmovsd   xmm0, qword ptr [r15+14H]
        vinsertps xmm0, xmm0, dword ptr [r15+1CH], 40
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, dword ptr [rsi+20H]
-       vmovsd   xmm3, qword ptr [reloc @RWD00]
-       vmulsd   xmm4, xmm1, xmm3
-       vxorps   xmm5, xmm5, xmm5
-       vcvtsi2sd xmm5, xmm5, r12d
-       vsubsd   xmm4, xmm5, xmm4
+       vmovsd   xmm2, qword ptr [reloc @RWD00]
+       vmulsd   xmm3, xmm1, xmm2
+       vxorps   xmm4, xmm4, xmm4
+       vcvtsi2sd xmm4, xmm4, r12d
+       vsubsd   xmm3, xmm4, xmm3
        vmovsd   xmm8, qword ptr [reloc @RWD08]
        vmulsd   xmm1, xmm1, xmm8
-       vdivsd   xmm1, xmm4, xmm1
-       vmovsd   xmm4, qword ptr [r15+2CH]
-       vinsertps xmm4, xmm4, dword ptr [r15+34H], 40
+       vdivsd   xmm1, xmm3, xmm1
+       vmovsd   xmm3, qword ptr [r15+2CH]
+       vinsertps xmm3, xmm3, dword ptr [r15+34H], 40
        vcvtsd2ss xmm1, xmm1, xmm1
        vbroadcastss xmm1, xmm1
-       vmulps   xmm1, xmm1, xmm4
-       vxorps   xmm4, xmm4, xmm4
-       vcvtsi2sd xmm4, xmm4, dword ptr [rsi+24H]
-       vmulsd   xmm3, xmm4, xmm3
-       vsubsd   xmm3, xmm6, xmm3
-       vxorps   xmm3, xmm3, xmmword ptr [reloc @RWD16]
-       vmulsd   xmm4, xmm4, xmm8
-       vdivsd   xmm3, xmm3, xmm4
-       vmovsd   xmm4, qword ptr [r15+20H]
-       vinsertps xmm4, xmm4, dword ptr [r15+28H], 40
-       vcvtsd2ss xmm3, xmm3, xmm3
-       vbroadcastss xmm3, xmm3
-       vmulps   xmm3, xmm3, xmm4
-       vaddps   xmm1, xmm1, xmm3
+       vmulps   xmm1, xmm1, xmm3
+       vxorps   xmm3, xmm3, xmm3
+       vcvtsi2sd xmm3, xmm3, dword ptr [rsi+24H]
+       vmulsd   xmm2, xmm3, xmm2
+       vsubsd   xmm2, xmm6, xmm2
+       vxorps   xmm2, xmm2, xmmword ptr [reloc @RWD16]
+       vmulsd   xmm3, xmm3, xmm8
+       vdivsd   xmm2, xmm2, xmm3
+       vmovsd   xmm3, qword ptr [r15+20H]
+       vinsertps xmm3, xmm3, dword ptr [r15+28H], 40
+       vcvtsd2ss xmm2, xmm2, xmm2
+       vbroadcastss xmm2, xmm2
+       vmulps   xmm2, xmm2, xmm3
+       vaddps   xmm1, xmm1, xmm2
        vaddps   xmm0, xmm0, xmm1
        vdpps    xmm1, xmm0, xmm0, 127
        vcvtss2sd xmm1, xmm1, xmm1
        vsqrtsd  xmm1, xmm1, xmm1
        vcvtsd2ss xmm1, xmm1, xmm1
-       vxorps   xmm3, xmm3, xmm3
-       vucomiss xmm1, xmm3
+       vxorps   xmm2, xmm2, xmm2
+       vucomiss xmm1, xmm2
        jp       SHORT G_M31648_IG06
        je       G_M31648_IG33
-						;; size=209 bbWeight=208614.94 PerfScore 33708697.29
-G_M31648_IG06:  ;; offset=0165H
-       vmovss   xmm3, dword ptr [reloc @RWD32]
-       vdivss   xmm9, xmm3, xmm1
+						;; size=205 bbWeight=208614.94 PerfScore 33656543.55
+G_M31648_IG06:  ;; offset=0161H
+       vmovss   xmm2, dword ptr [reloc @RWD32]
+       vdivss   xmm9, xmm2, xmm1
 						;; size=12 bbWeight=208614.94 PerfScore 2711994.21
-G_M31648_IG07:  ;; offset=0171H
+G_M31648_IG07:  ;; offset=016DH
        vcvtss2sd xmm1, xmm1, xmm9
        vcvtsd2ss xmm1, xmm1, xmm1
        vbroadcastss xmm1, xmm1
        vmulps   xmm9, xmm1, xmm0
-       vmovaps  xmm0, xmm9
-       vxorps   xmm1, xmm1, xmm1
-       vmovdqu  xmmword ptr [rsp+90H], xmm1
-       vmovdqu  xmmword ptr [rsp+98H], xmm1
-       vmovsd   qword ptr [rsp+90H], xmm2
-       vextractps dword ptr [rsp+98H], xmm2, 2
-       vmovsd   qword ptr [rsp+9CH], xmm0
-       vextractps dword ptr [rsp+A4H], xmm0, 2
-       vmovdqu  xmm0, xmmword ptr [rsp+90H]
-       vmovdqu  xmmword ptr [rsp+68H], xmm0
-       mov      rax, qword ptr [rsp+A0H]
-       mov      qword ptr [rsp+78H], rax
        xor      r13, r13
        mov      rax, gword ptr [rbx+08H]
-       mov      gword ptr [rsp+38H], rax
+       mov      gword ptr [rsp+40H], rax
        xor      edx, edx
        mov      ecx, dword ptr [rax+08H]
-       mov      dword ptr [rsp+44H], ecx
+       mov      dword ptr [rsp+4CH], ecx
        test     ecx, ecx
        jle      G_M31648_IG22
-						;; size=142 bbWeight=208614.94 PerfScore 7579676.13
-G_M31648_IG08:  ;; offset=01FFH
-       mov      dword ptr [rsp+64H], edx
+						;; size=47 bbWeight=208614.94 PerfScore 4120145.05
+G_M31648_IG08:  ;; offset=019CH
+       mov      dword ptr [rsp+6CH], edx
        mov      r8d, edx
        mov      r8, gword ptr [rax+8*r8+10H]
        mov      r9, 0x7FF8687322D8      ; Benchmarks.SIMD.RayTracer.Sphere
        cmp      qword ptr [r8], r9
        jne      G_M31648_IG15
 						;; size=31 bbWeight=621931.21 PerfScore 4664484.11
-G_M31648_IG09:  ;; offset=021EH
-       mov      gword ptr [rsp+30H], r8
+G_M31648_IG09:  ;; offset=01BBH
+       mov      gword ptr [rsp+38H], r8
        vmovsd   xmm0, qword ptr [r8+14H]
        vinsertps xmm0, xmm0, dword ptr [r8+1CH], 40
-       vmovaps  xmm1, xmm7
-       vsubps   xmm10, xmm0, xmm1
-       vmovaps  xmm0, xmm9
-       vdpps    xmm11, xmm10, xmm0, 127
+       vsubps   xmm10, xmm0, xmm7
+       vdpps    xmm11, xmm10, xmm9, 127
        vxorps   xmm0, xmm0, xmm0
        vucomiss xmm0, xmm11
        ja       SHORT G_M31648_IG14
-						;; size=48 bbWeight=385597.35 PerfScore 10346862.32
-G_M31648_IG10:  ;; offset=024EH
+						;; size=39 bbWeight=385597.35 PerfScore 10154063.64
+G_M31648_IG10:  ;; offset=01E2H
        vcvtss2sd xmm0, xmm0, dword ptr [r8+10H]
        vmovaps  xmm1, xmm8
        call     <unknown method>
-       vmovsd   qword ptr [rsp+A8H], xmm0
+       vmovsd   qword ptr [rsp+98H], xmm0
        vcvtss2sd xmm0, xmm0, xmm11
        vmovaps  xmm1, xmm8
        call     <unknown method>
        vdpps    xmm1, xmm10, xmm10, 127
        vcvtss2sd xmm1, xmm1, xmm1
        vsubsd   xmm0, xmm1, xmm0
-       vmovsd   xmm1, qword ptr [rsp+A8H]
+       vmovsd   xmm1, qword ptr [rsp+98H]
        vsubsd   xmm0, xmm1, xmm0
        vxorps   xmm1, xmm1, xmm1
        vucomisd xmm1, xmm0
        ja       SHORT G_M31648_IG12
 						;; size=77 bbWeight=327515.07 PerfScore 14028562.26
-G_M31648_IG11:  ;; offset=029BH
+G_M31648_IG11:  ;; offset=022FH
        vsqrtsd  xmm0, xmm0, xmm0
        vcvtsd2ss xmm0, xmm0, xmm0
        vsubss   xmm10, xmm11, xmm0
        jmp      SHORT G_M31648_IG13
 						;; size=14 bbWeight=109987.30 PerfScore 2309733.36
-G_M31648_IG12:  ;; offset=02A9H
+G_M31648_IG12:  ;; offset=023DH
        vxorps   xmm10, xmm10, xmm10
 						;; size=5 bbWeight=217527.77 PerfScore 72509.26
-G_M31648_IG13:  ;; offset=02AEH
+G_M31648_IG13:  ;; offset=0242H
        vxorps   xmm0, xmm0, xmm0
        vucomiss xmm10, xmm0
        jp       SHORT G_M31648_IG18
        jne      SHORT G_M31648_IG18
 						;; size=12 bbWeight=385597.35 PerfScore 1670921.87
-G_M31648_IG14:  ;; offset=02BAH
+G_M31648_IG14:  ;; offset=024EH
        xor      r8, r8
        jmp      SHORT G_M31648_IG16
 						;; size=5 bbWeight=287322.78 PerfScore 646476.25
-G_M31648_IG15:  ;; offset=02BFH
-       vmovdqu  xmm0, xmmword ptr [rsp+68H]
-       vmovdqu  xmmword ptr [rsp+48H], xmm0
-       mov      r9, qword ptr [rsp+78H]
-       mov      qword ptr [rsp+58H], r9
+G_M31648_IG15:  ;; offset=0253H
+       vmovsd   qword ptr [rsp+70H], xmm7
+       vextractps dword ptr [rsp+78H], xmm7, 2
+       vmovsd   qword ptr [rsp+7CH], xmm9
+       vextractps dword ptr [rsp+84H], xmm9, 2
+       vmovdqu  xmm0, xmmword ptr [rsp+70H]
+       vmovdqu  xmmword ptr [rsp+50H], xmm0
+       mov      r9, qword ptr [rsp+80H]
+       mov      qword ptr [rsp+60H], r9
        mov      rcx, r8
-       lea      rdx, [rsp+48H]
+       lea      rdx, [rsp+50H]
        mov      r8, qword ptr [r8]
        mov      r8, qword ptr [r8+48H]
        call     [r8+20H]<unknown method>
        mov      r8, rax
-						;; size=44 bbWeight=236333.86 PerfScore 3308674.06
-G_M31648_IG16:  ;; offset=02EBH
+						;; size=78 bbWeight=236333.86 PerfScore 5199344.96
+G_M31648_IG16:  ;; offset=02A1H
        test     r8, r8
        je       SHORT G_M31648_IG21
 						;; size=5 bbWeight=621931.21 PerfScore 777414.02
-G_M31648_IG17:  ;; offset=02F0H
+G_M31648_IG17:  ;; offset=02A6H
        jmp      SHORT G_M31648_IG19
 						;; size=2 bbWeight=80248.55 PerfScore 160497.11
-G_M31648_IG18:  ;; offset=02F2H
+G_M31648_IG18:  ;; offset=02A8H
        mov      rcx, 0x7FF86873C618      ; Benchmarks.SIMD.RayTracer.ISect
        call     CORINFO_HELP_NEWSFAST
        mov      r8, rax
-       mov      gword ptr [rsp+28H], r8
+       mov      gword ptr [rsp+30H], r8
        lea      rcx, bword ptr [r8+08H]
-       mov      rdx, gword ptr [rsp+30H]
+       mov      rdx, gword ptr [rsp+38H]
        call     CORINFO_HELP_ASSIGN_REF
-       mov      r8, gword ptr [rsp+28H]
-       vmovdqu  xmm0, xmmword ptr [rsp+68H]
-       vmovdqu  xmmword ptr [r8+18H], xmm0
-       mov      rcx, qword ptr [rsp+78H]
-       mov      qword ptr [r8+28H], rcx
+       mov      r8, gword ptr [rsp+30H]
+       lea      rcx, bword ptr [r8+18H]
+       vmovsd   qword ptr [rcx], xmm7
+       vextractps dword ptr [rcx+08H], xmm7, 2
+       vmovsd   qword ptr [rcx+0CH], xmm9
+       vextractps dword ptr [rcx+14H], xmm9, 2
        vcvtss2sd xmm0, xmm0, xmm10
        vmovsd   qword ptr [r8+10H], xmm0
        jmp      SHORT G_M31648_IG16
-						;; size=76 bbWeight=385597.35 PerfScore 8097544.42
-G_M31648_IG19:  ;; offset=033EH
+						;; size=82 bbWeight=385597.35 PerfScore 10218329.86
+G_M31648_IG19:  ;; offset=02FAH
        test     r13, r13
        jne      SHORT G_M31648_IG25
 						;; size=5 bbWeight=80248.55 PerfScore 100310.69
-G_M31648_IG20:  ;; offset=0343H
+G_M31648_IG20:  ;; offset=02FFH
        mov      r13, r8
 						;; size=3 bbWeight=78703.22 PerfScore 19675.80
-G_M31648_IG21:  ;; offset=0346H
-       mov      edx, dword ptr [rsp+64H]
+G_M31648_IG21:  ;; offset=0302H
+       mov      edx, dword ptr [rsp+6CH]
        inc      edx
-       mov      eax, dword ptr [rsp+44H]
+       mov      eax, dword ptr [rsp+4CH]
        cmp      eax, edx
        jg       SHORT G_M31648_IG24
 						;; size=14 bbWeight=621931.21 PerfScore 2176759.25
-G_M31648_IG22:  ;; offset=0354H
+G_M31648_IG22:  ;; offset=0310H
        mov      r8, r13
        test     r8, r8
        jne      SHORT G_M31648_IG26
 						;; size=8 bbWeight=208614.94 PerfScore 312922.41
-G_M31648_IG23:  ;; offset=035CH
+G_M31648_IG23:  ;; offset=0318H
        vxorps   xmm0, xmm0, xmm0
-       vmovaps  xmmword ptr [rsp+80H], xmm0
+       vmovups  xmmword ptr [rsp+88H], xmm0
        jmp      SHORT G_M31648_IG27
 						;; size=15 bbWeight=79255.46 PerfScore 264184.86
-G_M31648_IG24:  ;; offset=036BH
-       mov      rax, gword ptr [rsp+38H]
+G_M31648_IG24:  ;; offset=0327H
+       mov      rax, gword ptr [rsp+40H]
        jmp      G_M31648_IG08
 						;; size=10 bbWeight=310965.61 PerfScore 932896.82
-G_M31648_IG25:  ;; offset=0375H
+G_M31648_IG25:  ;; offset=0331H
        vmovsd   xmm0, qword ptr [r13+10H]
        vucomisd xmm0, qword ptr [r8+10H]
        jbe      SHORT G_M31648_IG21
        jmp      SHORT G_M31648_IG20
 						;; size=16 bbWeight=1546.37 PerfScore 18556.45
-G_M31648_IG26:  ;; offset=0385H
+G_M31648_IG26:  ;; offset=0341H
        xor      edx, edx
        mov      dword ptr [rsp+20H], edx
-       lea      rdx, [rsp+80H]
+       lea      rdx, [rsp+88H]
        mov      rcx, rsi
        mov      r9, rbx
        call     [<unknown method>]
 						;; size=26 bbWeight=129359.48 PerfScore 679137.28
-G_M31648_IG27:  ;; offset=039FH
-       vmovaps  xmm0, xmmword ptr [rsp+80H]
+G_M31648_IG27:  ;; offset=035BH
+       vmovups  xmm0, xmmword ptr [rsp+88H]
        vunpckhps xmm1, xmm0, xmm0
        vmovss   xmm2, dword ptr [reloc @RWD36]
        vmulss   xmm1, xmm1, xmm2
@@ -376,14 +369,14 @@ G_M31648_IG27:  ;; offset=039FH
        cmp      eax, 255
        jg       G_M31648_IG34
 						;; size=40 bbWeight=208614.94 PerfScore 3598607.70
-G_M31648_IG28:  ;; offset=03C7H
+G_M31648_IG28:  ;; offset=0383H
        vmovshdup xmm1, xmm0
        vmulss   xmm1, xmm1, xmm2
        vcvttss2si  edx, xmm1
        cmp      edx, 255
        jg       G_M31648_IG35
 						;; size=24 bbWeight=208614.94 PerfScore 2346918.07
-G_M31648_IG29:  ;; offset=03DFH
+G_M31648_IG29:  ;; offset=039BH
        shl      edx, 8
        or       edx, eax
        vmulss   xmm2, xmm0, xmm2
@@ -391,7 +384,7 @@ G_M31648_IG29:  ;; offset=03DFH
        cmp      eax, 255
        jg       G_M31648_IG36
 						;; size=24 bbWeight=208614.94 PerfScore 2294764.33
-G_M31648_IG30:  ;; offset=03F7H
+G_M31648_IG30:  ;; offset=03B3H
        lea      ecx, [r12+r14]
        cmp      ecx, dword ptr [rdi+08H]
        jae      G_M31648_IG37
@@ -403,19 +396,19 @@ G_M31648_IG30:  ;; offset=03F7H
        cmp      r12d, dword ptr [rsi+20H]
        jl       G_M31648_IG05
 						;; size=41 bbWeight=208614.94 PerfScore 2242610.60
-G_M31648_IG31:  ;; offset=0420H
+G_M31648_IG31:  ;; offset=03DCH
        inc      ebp
        cmp      ebp, dword ptr [rsi+24H]
        jl       G_M31648_IG03
 						;; size=11 bbWeight=999.00 PerfScore 4245.75
-G_M31648_IG32:  ;; offset=042BH
-       vmovaps  xmm6, xmmword ptr [rsp+100H]
-       vmovaps  xmm7, xmmword ptr [rsp+F0H]
-       vmovaps  xmm8, xmmword ptr [rsp+E0H]
-       vmovaps  xmm9, xmmword ptr [rsp+D0H]
-       vmovaps  xmm10, xmmword ptr [rsp+C0H]
-       vmovaps  xmm11, xmmword ptr [rsp+B0H]
-       add      rsp, 280
+G_M31648_IG32:  ;; offset=03E7H
+       vmovaps  xmm6, xmmword ptr [rsp+F0H]
+       vmovaps  xmm7, xmmword ptr [rsp+E0H]
+       vmovaps  xmm8, xmmword ptr [rsp+D0H]
+       vmovaps  xmm9, xmmword ptr [rsp+C0H]
+       vmovaps  xmm10, xmmword ptr [rsp+B0H]
+       vmovaps  xmm11, xmmword ptr [rsp+A0H]
+       add      rsp, 264
        pop      rbx
        pop      rbp
        pop      rsi
@@ -426,23 +419,23 @@ G_M31648_IG32:  ;; offset=042BH
        pop      r15
        ret      
 						;; size=74 bbWeight=1.00 PerfScore 29.25
-G_M31648_IG33:  ;; offset=0475H
+G_M31648_IG33:  ;; offset=0431H
        vmovss   xmm9, dword ptr [reloc @RWD40]
        jmp      G_M31648_IG07
 						;; size=13 bbWeight=0 PerfScore 0.00
-G_M31648_IG34:  ;; offset=0482H
+G_M31648_IG34:  ;; offset=043EH
        mov      eax, 255
        jmp      G_M31648_IG28
 						;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG35:  ;; offset=048CH
+G_M31648_IG35:  ;; offset=0448H
        mov      edx, 255
        jmp      G_M31648_IG29
 						;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG36:  ;; offset=0496H
+G_M31648_IG36:  ;; offset=0452H
        mov      eax, 255
        jmp      G_M31648_IG30
 						;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG37:  ;; offset=04A0H
+G_M31648_IG37:  ;; offset=045CH
        call     CORINFO_HELP_RNGCHKFAIL
        int3     
 						;; size=6 bbWeight=0 PerfScore 0.00
@@ -454,11 +447,11 @@ RWD36  	dd	437F0000h		;       255
 RWD40  	dd	7F800000h		;       inf
 
 
-; Total bytes of code 1190, prolog size 103, PerfScore 105089852.53, instruction count 255, allocated bytes for code 1190 (MethodHash=adae845f) for method Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this
+; Total bytes of code 1122, prolog size 103, PerfScore 105396818.57, instruction count 245, allocated bytes for code 1122 (MethodHash=adae845f) for method Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this

This is a case where our lack of handling for call args shows up. We end up with an extra struct copy in G_M31648_IG15 because physical promotion inserts a writeback into the struct local, and then call args morphing creates a copy of it since it isn't a last use.

One simple fix in physical promotion for the implicit byref case would be to create a new local to ensure that it is a last use; we can then handle it by our smarter decomposition. That might be a good short-term solution with large benefit.
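
As a rough illustration of why the last-use property matters here, the following is a minimal C++ model of the implicit byref convention. It is not JIT code. `Vec3`, `callee`, `call_with_copy` and `call_last_use` are hypothetical names invented for this sketch, and the hidden pointer is written out explicitly as an assumption about how the by-value struct argument is passed.

```cpp
// Hypothetical 12-byte struct, standing in for a TYP_SIMD12-sized value.
struct Vec3 { float x, y, z; };

// Models a callee that receives its by-value struct through a hidden pointer
// (implicit byref). The callee owns that storage and may overwrite it.
float callee(Vec3* arg) {
    arg->x = 0.0f;            // clobbers the callee's own parameter
    return arg->y + arg->z;
}

// Not a last use: the caller reads v after the call, so it has to pass the
// address of a copy. This is the extra struct copy described above.
float call_with_copy(Vec3 v) {
    Vec3 tmp = v;             // defensive copy made for the argument
    float r = callee(&tmp);
    return r + v.x;           // v.x is still the caller's original value
}

// Last use: nothing reads v after the call, so its address can be passed
// directly and no copy is required.
float call_last_use(Vec3 v) {
    return callee(&v);
}
```

Under this model, introducing a fresh local for the argument makes the call the only (and therefore last) use of that local, which is the property the proposed fix relies on.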

There is also a redundant lea instruction in G_M31648_IG18; we are not able to peel the address because it is a FIELD_ADDR node. Also, the copy itself needs to use vextractps since it is TYP_SIMD12.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 27, 2023
jakobbotsch added a commit that referenced this issue Jun 29, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 29, 2023