Vector<T> operations don't take advantage of memory operands #13798

GrabYourPitchforks · 2019-11-15T06:48:07Z

With dotnet/coreclr#22944, the raw hardware intrinsics are able to take advantage of folding the memory load operation into the SIMD instruction itself.

However, this same optimization was not applied to Vector and Vector<T> more generally, even though they're using nearly identical codegen under the covers.

public static Vector<byte> M(Vector<byte> a, ref Vector<byte> b)
{
    return Vector.Equals(a, b);
}

public static Vector256<byte> N(Vector256<byte> a, ref Vector256<byte> b)
{
    return Avx2.CompareEqual(a, b);
}

; C.M(System.Numerics.Vector`1<Byte>, System.Numerics.Vector`1<Byte> ByRef)
    L0000: vzeroupper
    L0003: vmovupd ymm0, [rdx]
    L0007: vmovupd ymm1, [r8]   ; note the allocation of register ymm1
    L000c: vpcmpeqb xmm0, xmm0, xmm1
    L0010: vmovupd [rcx], ymm0
    L0014: mov rax, rcx
    L0017: vzeroupper
    L001a: ret

; C.N(System.Runtime.Intrinsics.Vector256`1<Byte>, System.Runtime.Intrinsics.Vector256`1<Byte> ByRef)
    L0000: vzeroupper
    L0003: vmovupd ymm0, [rdx]
    L0007: vpcmpeqb xmm0, xmm0, [r8]   ; operation doesn't touch register ymm1
    L000c: vmovupd [rcx], ymm0
    L0010: mov rax, rcx
    L0013: vzeroupper
    L0016: ret

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

The text was updated successfully, but these errors were encountered:

GrabYourPitchforks · 2019-11-15T06:54:42Z

See dotnet/coreclr#27798 for an example of code gen that would be slightly improved by this.

BruceForstall · 2019-11-20T20:52:57Z

@CarolEidt @tannergooding

mikedn · 2019-11-20T21:24:45Z

SIMD intrinsic that have a 1:1 mapping to HW intrinsics should be easy to import as HW intrinsics:

mikedn/coreclr@384934c generates:

G_M319_IG01:
       C5F877               vzeroupper
G_M319_IG02:
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C4C17D7400           vpcmpeqb ymm0, ymm0, ymmword ptr[r8]
       C5FD1101             vmovupd  ymmword ptr[rcx], ymm0
       488BC1               mov      rax, rcx
G_M319_IG03:
       C5F877               vzeroupper
       C3                   ret

CarolEidt · 2019-11-20T21:38:08Z

I think it might make more sense to do that in simd.cpp.
At one point I believe we had some pessimization that happened when mixing vectors with different class handles, but perhaps that has been eliminated.

mikedn · 2019-11-20T22:34:56Z

I think it might make more sense to do that in simd.cpp.

Well, I would hope that in time simd.cpp can go away :). Or maybe keep it only for those SIMD intrinsics that do not have 1:1 mapping and require some sort of special handling (not sure how that would look like - either make the HW intrinsic import code more flexible so that it can produce trees, not just nodes, or introduce an abstract "SIMD" ISA that has special codegen handling).

At one point I believe we had some pessimization that happened when mixing vectors with different class handles, but perhaps that has been eliminated.

The most likely issue we'll run into is again call args - gtGetStructHandleForSIMD and gtGetStructHandleForHWSIMD return different handles so if we want to continue to pass Vector4 as a standard struct but pass Vector128<float> as __m128 we'll have problems. Looks like another case for adding class layout to GenTreeCall::Use.

mikedn · 2019-11-20T22:44:07Z

The most likely issue we'll run into is again call args

This is readily visible if you try to handle Vector2 and Vector3 as HW intrinsics. Sink(v1 + v2) produces:

[000009] --CXG-------              *  RETURN    long  
[000006] --CXG-------              \--*  CALL      simd8  Program.Sink
[000008] ---XG------- arg0            \--*  OBJ       simd8 <System.Numerics.Vector2>
[000007] ---XG-------                    \--*  ADDR      byref 
[000005] ---XG-------                       \--*  HWINTRINSIC simd8  float Add
[000004] n-----------                          +--*  OBJ       simd8 <System.Numerics.Vector2>
[000003] ------------                          |  \--*  ADDR      byref 
[000000] ------------                          |     \--*  LCL_VAR   simd8 <System.Numerics.Vector2> V00 arg0         
[000002] ---XG-------                          \--*  OBJ       simd8 <System.Numerics.Vector2>
[000001] ------------                             \--*  LCL_VAR   byref  V01 arg1

which is already dubious because the HWINTRINSIC node has type TYP_SIMD8 and then it gets morphed to

[000009] --CXG+------              *  RETURN    long  
[000006] --CXG+------              \--*  CALL      long   Program.Sink
[000005] ---XG+------ arg0 in rcx     \--*  HWINTRINSIC long   float Add
[000004] n----+------                    +--*  OBJ       simd8 <System.Numerics.Vector2>
[000003] -----+------                    |  \--*  ADDR      byref 
[000000] -----+-N----                    |     \--*  LCL_VAR   simd8 <System.Numerics.Vector2> V00 arg0         
[000002] ---XG+------                    \--*  OBJ       simd8 <System.Numerics.Vector2>
[000001] -----+------                       \--*  LCL_VAR   byref  V01 arg1

which is even more broken. Oh well, looks like one more reason to add class layout to GenTreeCall::Use and stop relying on gtGetStructHandle.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vector<T> operations don't take advantage of memory operands #13798

Vector<T> operations don't take advantage of memory operands #13798

GrabYourPitchforks commented Nov 15, 2019

GrabYourPitchforks commented Nov 15, 2019

BruceForstall commented Nov 20, 2019

mikedn commented Nov 20, 2019

CarolEidt commented Nov 20, 2019

mikedn commented Nov 20, 2019

mikedn commented Nov 20, 2019

Vector<T> operations don't take advantage of memory operands #13798

Vector<T> operations don't take advantage of memory operands #13798

Comments

GrabYourPitchforks commented Nov 15, 2019

GrabYourPitchforks commented Nov 15, 2019

BruceForstall commented Nov 20, 2019

mikedn commented Nov 20, 2019

CarolEidt commented Nov 20, 2019

mikedn commented Nov 20, 2019

mikedn commented Nov 20, 2019