Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector<T> operations don't take advantage of memory operands #13798

Open
GrabYourPitchforks opened this issue Nov 15, 2019 · 6 comments
Open
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

With dotnet/coreclr#22944, the raw hardware intrinsics are able to take advantage of folding the memory load operation into the SIMD instruction itself.

However, this same optimization was not applied to Vector and Vector<T> more generally, even though they're using nearly identical codegen under the covers.

public static Vector<byte> M(Vector<byte> a, ref Vector<byte> b)
{
    return Vector.Equals(a, b);
}

public static Vector256<byte> N(Vector256<byte> a, ref Vector256<byte> b)
{
    return Avx2.CompareEqual(a, b);
}
; C.M(System.Numerics.Vector`1<Byte>, System.Numerics.Vector`1<Byte> ByRef)
    L0000: vzeroupper
    L0003: vmovupd ymm0, [rdx]
    L0007: vmovupd ymm1, [r8]   ; note the allocation of register ymm1
    L000c: vpcmpeqb xmm0, xmm0, xmm1
    L0010: vmovupd [rcx], ymm0
    L0014: mov rax, rcx
    L0017: vzeroupper
    L001a: ret

; C.N(System.Runtime.Intrinsics.Vector256`1<Byte>, System.Runtime.Intrinsics.Vector256`1<Byte> ByRef)
    L0000: vzeroupper
    L0003: vmovupd ymm0, [rdx]
    L0007: vpcmpeqb xmm0, xmm0, [r8]   ; operation doesn't touch register ymm1
    L000c: vmovupd [rcx], ymm0
    L0010: mov rax, rcx
    L0013: vzeroupper
    L0016: ret

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

@GrabYourPitchforks
Copy link
Member Author

See dotnet/coreclr#27798 for an example of code gen that would be slightly improved by this.

@BruceForstall
Copy link
Member

@CarolEidt @tannergooding

@mikedn
Copy link
Contributor

mikedn commented Nov 20, 2019

SIMD intrinsic that have a 1:1 mapping to HW intrinsics should be easy to import as HW intrinsics:

mikedn/coreclr@384934c generates:

G_M319_IG01:
       C5F877               vzeroupper
G_M319_IG02:
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C4C17D7400           vpcmpeqb ymm0, ymm0, ymmword ptr[r8]
       C5FD1101             vmovupd  ymmword ptr[rcx], ymm0
       488BC1               mov      rax, rcx
G_M319_IG03:
       C5F877               vzeroupper
       C3                   ret

@CarolEidt
Copy link
Contributor

I think it might make more sense to do that in simd.cpp.
At one point I believe we had some pessimization that happened when mixing vectors with different class handles, but perhaps that has been eliminated.

@mikedn
Copy link
Contributor

mikedn commented Nov 20, 2019

I think it might make more sense to do that in simd.cpp.

Well, I would hope that in time simd.cpp can go away :). Or maybe keep it only for those SIMD intrinsics that do not have 1:1 mapping and require some sort of special handling (not sure how that would look like - either make the HW intrinsic import code more flexible so that it can produce trees, not just nodes, or introduce an abstract "SIMD" ISA that has special codegen handling).

At one point I believe we had some pessimization that happened when mixing vectors with different class handles, but perhaps that has been eliminated.

The most likely issue we'll run into is again call args - gtGetStructHandleForSIMD and gtGetStructHandleForHWSIMD return different handles so if we want to continue to pass Vector4 as a standard struct but pass Vector128<float> as __m128 we'll have problems. Looks like another case for adding class layout to GenTreeCall::Use.

@mikedn
Copy link
Contributor

mikedn commented Nov 20, 2019

The most likely issue we'll run into is again call args

This is readily visible if you try to handle Vector2 and Vector3 as HW intrinsics. Sink(v1 + v2) produces:

[000009] --CXG-------              *  RETURN    long  
[000006] --CXG-------              \--*  CALL      simd8  Program.Sink
[000008] ---XG------- arg0            \--*  OBJ       simd8 <System.Numerics.Vector2>
[000007] ---XG-------                    \--*  ADDR      byref 
[000005] ---XG-------                       \--*  HWINTRINSIC simd8  float Add
[000004] n-----------                          +--*  OBJ       simd8 <System.Numerics.Vector2>
[000003] ------------                          |  \--*  ADDR      byref 
[000000] ------------                          |     \--*  LCL_VAR   simd8 <System.Numerics.Vector2> V00 arg0         
[000002] ---XG-------                          \--*  OBJ       simd8 <System.Numerics.Vector2>
[000001] ------------                             \--*  LCL_VAR   byref  V01 arg1         

which is already dubious because the HWINTRINSIC node has type TYP_SIMD8 and then it gets morphed to

[000009] --CXG+------              *  RETURN    long  
[000006] --CXG+------              \--*  CALL      long   Program.Sink
[000005] ---XG+------ arg0 in rcx     \--*  HWINTRINSIC long   float Add
[000004] n----+------                    +--*  OBJ       simd8 <System.Numerics.Vector2>
[000003] -----+------                    |  \--*  ADDR      byref 
[000000] -----+-N----                    |     \--*  LCL_VAR   simd8 <System.Numerics.Vector2> V00 arg0         
[000002] ---XG+------                    \--*  OBJ       simd8 <System.Numerics.Vector2>
[000001] -----+------                       \--*  LCL_VAR   byref  V01 arg1         

which is even more broken. Oh well, looks like one more reason to add class layout to GenTreeCall::Use and stop relying on gtGetStructHandle.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

5 participants