This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Move Vector4 to be implemented using HWIntrinsics on x86 #27483

Closed

wants to merge 4 commits into from

Conversation

@tannergooding (Member) commented Oct 27, 2019

This prototypes: https://github.com/dotnet/corefx/issues/31425

There are a few benefits to switching these to being implemented using HWIntrinsics:

  • It makes it more immediately obvious to people browsing the source how these functions are compiled down
  • It makes it easier for external contributors to change the implementation and improve it
  • It will eventually allow simplifying the logic in the JIT
    • Namely, the majority of the simdcodegenxarch.cpp and simd.cpp logic could be removed.
      The only things really needed are the better codegen for accessing the public fields and the support for mapping these types to TYP_SIMD8/12/16 in the JIT (so conversions are done appropriately)
  • It will allow optimizations (including things like R2R/Crossgen support) to be centered around HWIntrinsics, rather than needing to handle both HWIntrinsics and SIMDIntrinsics
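
To make the pattern concrete, a rough sketch of how such an operation might be written (illustrative only, not the exact code in this PR): the IsSupported check is a JIT-time constant, so the untaken branch is dropped during importation and a managed fallback remains for other platforms.

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector4Sketch
{
    // Hypothetical stand-in for Vector4's addition operator; the actual PR edits Vector4 itself.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4 Add(Vector4 left, Vector4 right)
    {
        if (Sse.IsSupported)
        {
            // AsVector128/AsVector4 reinterpret the value; the JIT imports them as no-ops.
            return Sse.Add(left.AsVector128(), right.AsVector128()).AsVector4();
        }

        // Scalar fallback for platforms without SSE.
        return new Vector4(left.X + right.X, left.Y + right.Y, left.Z + right.Z, left.W + right.W);
    }
}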

@tannergooding (Member Author)

It's not quite perfect yet, namely because HWIntrinsic vectors don't support method constants but SIMD vectors do (this is tracked by https://github.com/dotnet/coreclr/issues/17225 and isn't too hard to add).

jit-analyze.bat --base .\base --diff .\diff --recursive
Found 1 files with textual diffs.

Summary:
(Lower is better)

Total bytes of diff: -936 (-0.02% of base)
    diff is an improvement.

Top file improvements by size (bytes):
        -936 : System.Private.CoreLib.dasm (-0.02% of base)

1 total files with size differences (1 improved, 0 regressed), 0 unchanged.

Top method regressions by size (bytes):
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitX():struct
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitY():struct
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitZ():struct
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitW():struct
           5 (26.32% of base) : System.Private.CoreLib.dasm - Vector4:get_One():struct

Top method improvements by size (bytes):
        -150 (-88.24% of base) : System.Private.CoreLib.dasm - Vector4:Min(struct,struct):struct
        -150 (-88.24% of base) : System.Private.CoreLib.dasm - Vector4:Max(struct,struct):struct
        -128 (-6.61% of base) : System.Private.CoreLib.dasm - Vector4:Transform(struct,struct):struct (6 methods)
         -88 (-75.21% of base) : System.Private.CoreLib.dasm - Vector4:Abs(struct):struct
         -72 (-78.26% of base) : System.Private.CoreLib.dasm - Vector4:op_Addition(struct,struct):struct

Top method regressions by size (percentage):
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitX():struct
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitY():struct
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitZ():struct
          30 (157.89% of base) : System.Private.CoreLib.dasm - Vector4:get_UnitW():struct
           5 (26.32% of base) : System.Private.CoreLib.dasm - Vector4:get_One():struct

Top method improvements by size (percentage):
        -150 (-88.24% of base) : System.Private.CoreLib.dasm - Vector4:Min(struct,struct):struct
        -150 (-88.24% of base) : System.Private.CoreLib.dasm - Vector4:Max(struct,struct):struct
         -54 (-78.26% of base) : System.Private.CoreLib.dasm - Vector4:SquareRoot(struct):struct
         -72 (-78.26% of base) : System.Private.CoreLib.dasm - Vector4:op_Addition(struct,struct):struct
         -72 (-78.26% of base) : System.Private.CoreLib.dasm - Vector4:op_Subtraction(struct,struct):struct

@tannergooding (Member Author)

The HWIntrinsics are fully VEX aware, while other parts of the JIT (SIMD intrinsics, scalar ops like float + float, etc.) are not.

This means that the allocated registers and the emitted disassembly are better (we should fix this regardless and make the register allocator and codegen fully VEX aware for any SIMD instruction):

Before

       vxorps   xmm0, xmm0
       vmovupd  xmmword ptr [rcx], xmm0
       mov      rax, rcx

After:

       vxorps   xmm0, xmm0, xmm0
       vmovupd  xmmword ptr [rcx], xmm0
       mov      rax, rcx

@tannergooding (Member Author) commented Oct 27, 2019

HWIntrinsics currently handle duplicated inputs better:

Example 1

Before:

       vmovupd  xmm0, xmmword ptr [rcx]
       vmovupd  xmm1, xmmword ptr [rcx]
       vdpps    xmm0, xmm1, -15
       vsqrtss  xmm0, xmm0

After:

       vmovupd  xmm0, xmmword ptr [rcx]
       vmovaps  xmm1, xmm0
       vdpps    xmm0, xmm1, xmm0, -1
       vsqrtss  xmm0, xmm0

Example 2

Before:

       vmovupd  xmm0, xmmword ptr [rdx]
       vmovupd  xmm1, xmmword ptr [rdx]
       vdpps    xmm0, xmm1, -15
       vsqrtss  xmm0, xmm0
       vmovupd  xmm1, xmmword ptr [rdx]
       vbroadcastss xmm0, xmm0
       vdivps   xmm1, xmm0
       vmovupd  xmmword ptr [rcx], xmm1
       mov      rax, rcx

After:

       vmovupd  xmm0, xmmword ptr [rdx]
       vmovaps  xmm1, xmm0
       vmovaps  xmm2, xmm0
       vdpps    xmm1, xmm1, xmm2, -1
       vsqrtss  xmm1, xmm1
       vbroadcastss xmm1, xmm1
       vdivps   xmm0, xmm0, xmm1
       vmovupd  xmmword ptr [rcx], xmm0
       mov      rax, rcx
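
For reference, the dpps/sqrtss shape in Example 1 is what a length computation lowers to; a rough sketch of such a method (illustrative only, not necessarily the code in this change):

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class LengthSketch
{
    // Hypothetical Length built on the SSE4.1 dot-product instruction; the 0xFF control byte
    // multiplies all four lanes and broadcasts the sum, matching the "vdpps ..., -1" above.
    public static float Length(Vector4 value)
    {
        if (Sse41.IsSupported)
        {
            Vector128<float> v = value.AsVector128();
            Vector128<float> dp = Sse41.DotProduct(v, v, 0xFF);
            return Sse.SqrtScalar(dp).ToScalar();
        }

        return MathF.Sqrt(Vector4.Dot(value, value));
    }
}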

@tannergooding (Member Author) commented Oct 27, 2019

The built-in codegen for Vector128.Create is just better than the codegen for new Vector4() (this is the bulk of the diff)

Before:

       vxorps   xmm4, xmm4
       vmovss   xmm4, xmm4, xmm3
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm2
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm1
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm0
       vmovaps  xmm0, xmm4

After:

       vinsertps xmm0, xmm0, xmm1, 16
       vinsertps xmm0, xmm0, xmm2, 32
       vinsertps xmm0, xmm0, xmm3, 48
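
For comparison, a rough sketch of the Vector128.Create-based construction being measured here (illustrative only; the actual constructor hookup may differ):

using System.Numerics;
using System.Runtime.Intrinsics;

static class CreateSketch
{
    // new Vector4(x, y, z, w) expressed through Vector128.Create; on SSE4.1-capable hardware
    // the JIT lowers this to the vinsertps sequence shown in the "After" disassembly.
    public static Vector4 Create(float x, float y, float z, float w)
        => Vector128.Create(x, y, z, w).AsVector4();
}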

@tannergooding (Member Author)

For indirect invocation cases, SIMDIntrinsics were always executing the software fallback code:

G_M45532_IG02:
       vmovss   xmm0, dword ptr [rcx]
       vucomiss xmm0, dword ptr [rdx]
       jp       SHORT G_M45532_IG06
       jne      SHORT G_M45532_IG06

G_M45532_IG03:		;; bbWeight=0.50
       vmovss   xmm0, dword ptr [rcx+4]
       vucomiss xmm0, dword ptr [rdx+4]
       jp       SHORT G_M45532_IG06
       jne      SHORT G_M45532_IG06
       vmovss   xmm0, dword ptr [rcx+8]
       vucomiss xmm0, dword ptr [rdx+8]
       jp       SHORT G_M45532_IG06
       jne      SHORT G_M45532_IG06
       vmovss   xmm0, dword ptr [rcx+12]
       vucomiss xmm0, dword ptr [rdx+12]
       setnp    al
       jp       SHORT G_M45532_IG04
       sete     al

G_M45532_IG04:		;; bbWeight=0.50
       movzx    rax, al

G_M45532_IG05:		;; bbWeight=0.50
       ret      

G_M45532_IG06:		;; bbWeight=0.50
       xor      eax, eax

G_M45532_IG07:		;; bbWeight=0.50
       ret      

With HWIntrinsics, they can be faster (which should also improve Tier 0 speed):

       vmovupd  xmm0, xmmword ptr [rcx]
       vcmpps   xmm0, xmm0, xmmword ptr [rdx], 0
       vmovmskps xrax, xmm0
       cmp      eax, 15
       sete     al
       movzx    rax, al
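
A rough sketch of an equality check that produces this shape, with CompareEqual feeding MoveMask and a compare against 0b1111 (hence the cmp eax, 15); illustrative only, not necessarily the exact code in this change:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class EqualitySketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool Vector4Equals(Vector4 left, Vector4 right)
    {
        if (Sse.IsSupported)
        {
            // All four comparison lanes must be set for the vectors to be equal.
            return Sse.MoveMask(Sse.CompareEqual(left.AsVector128(), right.AsVector128())) == 0b1111;
        }

        return left.X == right.X && left.Y == right.Y && left.Z == right.Z && left.W == right.W;
    }
}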

@tannergooding (Member Author)

CC. @danmosemsft, @CarolEidt, @jkotas

I've gotten an initial (rough) prototype done and it looks very promising. I'd like to get some more thoughts/opinions first and if this is something we'd like to push forward, this can be the first step.

@tannergooding (Member Author)

Also CC. @benaadams, @saucecontrol, @EgorBo who I believe have expressed prior interest in this.

{
    Vector128<float> tmp2 = Sse.Shuffle(vector2.AsVector128(), tmp, 0x40);
    tmp2 = Sse.Add(tmp2, tmp);
    tmp = Sse.Shuffle(tmp, tmp2, 0x30);

Member

In the absence of the _MM_SHUFFLE macro, I find binary literals are almost as easy to read, and certainly easier than hex, e.g. 0b_00_11_00_00 here

Member Author

We should really just add a helper intrinsic for this; it helps readability immensely and is going to be used in most native code people might try to port.

I don't think we have an API proposal yet, however.
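
For illustration, a rough sketch of what such a helper could look like (hypothetical name and shape, mirroring the native _MM_SHUFFLE(z, y, x, w) macro; no approved API exists for this yet):

static class ShuffleControl
{
    // Packs four 2-bit lane selectors into the immediate byte expected by Sse.Shuffle,
    // in the same order as the native _MM_SHUFFLE(z, y, x, w) macro.
    public static byte Make(byte z, byte y, byte x, byte w)
        => (byte)((z << 6) | (y << 4) | (x << 2) | w);
}

// Example: ShuffleControl.Make(0, 3, 0, 0) == 0x30 == 0b_00_11_00_00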

Member

I've been waiting for the _MM_SHUFFLE API since the S.R.I beginning 🙂 I guess it should somehow only allow constant values as an input.

@4creators commented Oct 27, 2019

I guess it should somehow only allow constant values as an input.

There is currently no C# language-specific way to enforce it and the only way is via analyzers.

There is a C# proposal for runtime const unmanaged, but despite a preliminary OK from the language team it seems that interest has faded away (see dotnet/csharplang#744).

@jkotas (Member) commented Oct 27, 2019

This will generate a lot more complex trees. How likely is it to hit JIT complexity limits with this implementation and regress real-world code?

@tannergooding (Member Author) commented Oct 27, 2019

This will generate a lot more complex trees.

The trees (at least for Tier 1) are actually not that much more complex and I think Tier 0 could be improved greatly by (if nothing else) just allowing dead code path elimination for the if (Isa.IsSupported) case.

Conversion between Vector4/Vector128 is a nop and is done entirely in importation so it doesn't generate any nodes (same as the Vector128.As<T, U> methods).

All but a single code path are also dropped, which helps keep the trees smaller. Many operations were already 1-to-1, so with the conversion being elided, they still stay as a single node with the same 1 or 2 child nodes as before.

There are some notable exceptions, such as Vector4.Equals now being:

GT_CMP
  MoveMask
    CompareEqual
      op1
      op2
  GT_CNS

Where it used to be:

SimdIntrinsicEqual
  op1
  op2

But, that should also allow other optimizations to start applying (where they couldn't easily have applied before).

@tannergooding (Member Author)

I'll get some JIT dumps to help share examples of how the trees changed.

@saucecontrol (Member)

Really nice to see how easy it is to add ARM implementations side by side. That's a good motivating factor for getting this done.

Side note: Is there an issue tracking containment not working with constant floating point values? I see extra mov's with Vector128.Create(1.0f) and the like.

@tannergooding (Member Author)

Side note: Is there an issue tracking containment not working with constant floating point values? I see extra mov's with Vector128.Create(1.0f) and the like.

Most of the issues are currently tracked here: https://github.com/dotnet/coreclr/projects/7. We have a few issues tracking areas where codegen is not optimal (including some around the constructors). I'm not sure if we have one specifically for broadcast.

@tannergooding (Member Author)

For reference, here are the JITDumps for probably the simplest example:

[MethodImpl(MethodImplOptions.AggressiveOptimization | MethodImplOptions.NoInlining)]
static Vector4 Test(Vector4 x, Vector4 y)
{
    return x + y;
}

JitDump_base.txt
JitDump_diff.txt

The most notable difference is that base can directly import as SIMD simd16 float +, whereas diff has to first take CALL simd16 System.Numerics.Vector4.op_Addition and inline it to get HWIntrinsic simd16 float Add. However, the latter is actually smaller after the inlining happens, as it is:

               [000016] ------------              *  HWIntrinsic simd16 float Add
               [000014] ------------              +--*  LCL_VAR   simd16 V05 tmp2         
               [000015] ------------              \--*  LCL_VAR   simd16 V06 tmp3

Whereas base is the following for a good bit of time (with the ADDR node eventually being removed):

               [000006] ------------              \--*  SIMD      simd16 float +
               [000005] n-----------                 +--*  OBJ(16)   simd16
               [000004] ------------                 |  \--*  ADDR      byref 
               [000000] ------------                 |     \--*  LCL_VAR   simd16 V01 arg0         
               [000003] n-----------                 \--*  OBJ(16)   simd16
               [000002] ------------                    \--*  ADDR      byref 
               [000001] ------------                       \--*  LCL_VAR   simd16 V02 arg1

Another notable difference is that base gets emitted as SIMD simd16 float + even under Tier 0, while diff stays as CALL simd16 System.Numerics.Vector4.op_Addition. I think, however, that this would be more impactful for Vector<T> than for Vector2/3/4.

@4creators commented Oct 27, 2019

At first glimpse, it looks like Vector3 could be handled that way quite easily, provided any transfers to/from memory are handled as the SIMD12 type. Essentially it would abstract Vector3 into Vector128 while in registers and return to the SIMD12 type once back in memory (the stack for an x64 target could be handled as a Vector128 due to 16-byte alignment, I guess).

However, what is even more important, this approach immediately opens up the possibility of implementing the double-precision VectorX variants with Vector128/Vector256 HW intrinsics, which has been a big ask from the community.

Furthermore, there is a clear path to some excellent further "vectorization" opportunities, particularly for matrix math with AVX2 and AVX-512 intrinsics.

@tannergooding

Very encouraging work, indeed.

@tannergooding (Member Author)

At first glimpse, it looks like Vector3 could be handled that way quite easily, provided any transfers to/from memory are handled as the SIMD12 type.

Right. There are conversion operators for Vector2/3/4 to and from Vector128/256 now. This PR does the JIT hookup for Vector4 and showcases it working (and I'll be doing the other hookups, without rewriting things, separately to help get things in).

It should work very similarly for Vector2/3 and provide an easy way to perform specialization on existing System.Numerics code without needing to rewrite everything to HWIntrinsics (instead you only need to specialize where System.Numerics doesn't support the functionality today).
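
As a rough example of that kind of point specialization (hypothetical helper; the reciprocal approximation is just one operation System.Numerics doesn't expose):

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SpecializationSketch
{
    // Only the operation System.Numerics doesn't provide needs hardware-specific code;
    // the surrounding code can stay written against Vector4.
    public static Vector4 FastReciprocal(Vector4 value)
    {
        if (Sse.IsSupported)
        {
            return Sse.Reciprocal(value.AsVector128()).AsVector4();
        }

        return Vector4.One / value;
    }
}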

@mikedn commented Oct 28, 2019

JitDump_base.txt
JitDump_diff.txt

Let me see if I understand this correctly. You are saying that everything is fine and dandy and the simple fact that the JIT diff dump is more than 50% larger than the base doesn't sound any alarms.

Basically it goes from simply recognizing the op_Addition intrinsic and generating a tree from it on the spot to inlining this monster:

Invoking compiler for the inlinee method System.Numerics.Vector4:op_Addition(struct,struct):struct :
IL to import:
IL_0000  28 b6 30 00 06    call         0x60030B6
IL_0005  2c 17             brfalse.s    23 (IL_001e)
IL_0007  02                ldarg.0     
IL_0008  28 2a 2c 00 06    call         0x6002C2A
IL_000d  03                ldarg.1     
IL_000e  28 2a 2c 00 06    call         0x6002C2A
IL_0013  28 b7 30 00 06    call         0x60030B7
IL_0018  28 2e 2c 00 06    call         0x6002C2E
IL_001d  2a                ret         
IL_001e  28 22 2d 00 06    call         0x6002D22
IL_0023  2c 17             brfalse.s    23 (IL_003c)
IL_0025  02                ldarg.0     
IL_0026  28 2a 2c 00 06    call         0x6002C2A
IL_002b  03                ldarg.1     
IL_002c  28 2a 2c 00 06    call         0x6002C2A
IL_0031  28 38 2d 00 06    call         0x6002D38
IL_0036  28 2e 2c 00 06    call         0x6002C2E
IL_003b  2a                ret         
IL_003c  02                ldarg.0     
IL_003d  7b 33 06 00 04    ldfld        0x4000633
IL_0042  03                ldarg.1     
IL_0043  7b 33 06 00 04    ldfld        0x4000633
IL_0048  58                add         
IL_0049  02                ldarg.0     
IL_004a  7b 34 06 00 04    ldfld        0x4000634
IL_004f  03                ldarg.1     
IL_0050  7b 34 06 00 04    ldfld        0x4000634
IL_0055  58                add         
IL_0056  02                ldarg.0     
IL_0057  7b 35 06 00 04    ldfld        0x4000635
IL_005c  03                ldarg.1     
IL_005d  7b 35 06 00 04    ldfld        0x4000635
IL_0062  58                add         
IL_0063  02                ldarg.0     
IL_0064  7b 36 06 00 04    ldfld        0x4000636
IL_0069  03                ldarg.1     
IL_006a  7b 36 06 00 04    ldfld        0x4000636
IL_006f  58                add         
IL_0070  73 71 1c 00 06    newobj       0x6001C71
IL_0075  2a                ret         

The only reason this isn't an unmitigated disaster is that the importer manages to ignore the dead IL. Though it still does extra work, such as creating a bunch more useless basic blocks.

And this tree that you claim to be smaller:

However, the latter is actually smaller after the inlining happens as it is:

[000016] ------------              *  HWIntrinsic simd16 float Add
[000014] ------------              +--*  LCL_VAR   simd16 V05 tmp2         
[000015] ------------              \--*  LCL_VAR   simd16 V06 tmp3

Isn't that tree cute? But it's a freakin' forest in reality:

*************** In fgMarkAddressExposedLocals()
LocalAddressVisitor visiting statement:
STMT00004 (IL 0x000...  ???)
               [000026] -A----------              *  ASG       simd16 (copy)
               [000024] D-----------              +--*  LCL_VAR   simd16 V05 tmp2         
               [000006] n-----------              \--*  OBJ(16)   simd16
               [000005] ------------                 \--*  ADDR      byref 
               [000000] ------------                    \--*  LCL_VAR   simd16 V01 arg0         
LocalAddressVisitor incrementing ref count from 0 to 1 for V01

LocalAddressVisitor visiting statement:
STMT00005 (IL 0x000...  ???)
               [000029] -A----------              *  ASG       simd16 (copy)
               [000027] D-----------              +--*  LCL_VAR   simd16 V06 tmp3         
               [000004] n-----------              \--*  OBJ(16)   simd16
               [000003] ------------                 \--*  ADDR      byref 
               [000001] ------------                    \--*  LCL_VAR   simd16 V02 arg1         
LocalAddressVisitor incrementing ref count from 0 to 1 for V02

LocalAddressVisitor visiting statement:
STMT00003 (IL 0x000...  ???)
               [000019] -A----------              *  ASG       simd16 (copy)
               [000017] D-----------              +--*  LCL_VAR   simd16 V04 tmp1         
               [000016] ------------              \--*  HWIntrinsic simd16 float Add
               [000014] ------------                 +--*  LCL_VAR   simd16 V05 tmp2         
               [000015] ------------                 \--*  LCL_VAR   simd16 V06 tmp3         

LocalAddressVisitor visiting statement:
STMT00001 (IL   ???...  ???)
               [000023] -A----------              *  ASG       simd16 (copy)
               [000022] ------------              +--*  IND       simd16
               [000020] ------------              |  \--*  LCL_VAR   byref  V00 RetBuf       
               [000021] ------------              \--*  LCL_VAR   simd16 V04 tmp1         

LocalAddressVisitor visiting statement:
STMT00002 (IL   ???...  ???)
               [000010] ------------              *  RETURN    byref 
               [000009] ------------              \--*  LCL_VAR   byref  V00 RetBuf       

instead of

*************** In fgMarkAddressExposedLocals()
LocalAddressVisitor visiting statement:
STMT00000 (IL 0x000...0x007)
               [000009] -A----------              *  ASG       simd16 (copy)
               [000008] ------------              +--*  IND       simd16
               [000007] ------------              |  \--*  LCL_VAR   byref  V00 RetBuf       
               [000006] ------------              \--*  SIMD      simd16 float +
               [000005] n-----------                 +--*  OBJ(16)   simd16
               [000004] ------------                 |  \--*  ADDR      byref 
               [000000] ------------                 |     \--*  LCL_VAR   simd16 V01 arg0         
               [000003] n-----------                 \--*  OBJ(16)   simd16
               [000002] ------------                    \--*  ADDR      byref 
               [000001] ------------                       \--*  LCL_VAR   simd16 V02 arg1         
LocalAddressVisitor incrementing ref count from 0 to 1 for V01
LocalAddressVisitor incrementing ref count from 0 to 1 for V02

LocalAddressVisitor visiting statement:
STMT00001 (IL   ???...  ???)
               [000011] ------------              *  RETURN    byref 
               [000010] ------------              \--*  LCL_VAR   byref  V00 RetBuf       

And the current JIT implementation is able to maintain some more complex vector expressions as a single tree:

; return v1 + v2 * v3;
STMT00004 (IL   ???...0x02B)
N008 ( 13, 10) [000037] -A-X----R---              *  ASG       simd16 (copy) $VN.Void
N007 (  3,  2) [000036] D--X---N----              +--*  IND       simd16
N006 (  1,  1) [000035] ------------              |  \--*  LCL_VAR   byref  V00 RetBuf       u:1 $80
N005 (  9,  7) [000034] --------R---              \--*  SIMD      simd16 float + $3c1
N004 (  1,  1) [000016] ------------                 +--*  LCL_VAR   simd16 V05 tmp1         u:2 (last use) <l:$2c0, c:$303>
N003 (  7,  5) [000033] ------------                 \--*  SIMD      simd16 float * $3c0
N001 (  3,  2) [000031] ------------                    +--*  LCL_VAR   simd16 V02 loc0         u:2 (last use) <l:$2c2, c:$307>
N002 (  3,  2) [000032] ------------                    \--*  LCL_VAR   simd16 V03 loc1         u:2 (last use) <l:$2c4, c:$30b>

I don't know what this will generate with the new implementation. If the simplest example generates a forest this must be generating a jungle or something.

So yeah, it would be good to somehow implement SIMD intrinsics on top of HWIntrinsic but this approach is likely to come with a few elephants, tigers and monkeys in the baggage.

@4creators

it would be good to somehow implement SIMD intrinsics on top of HWIntrinsic but this approach is likely to come with a few elephants, tigers and monkeys in the baggage.

It's a good point to keep checks on what the JIT is generating at each phase; however, since none of this is nature's creation but rather a developer's creation, it should be manageable. I would rather argue that the current implementation of all the Vector SIMD functionality is a bit odd, given that it can be abstracted much better on top of HW intrinsics.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool operator !=(Vector4 left, Vector4 right)
{
    if (Sse.IsSupported)
    {
        return Sse.MoveMask(Sse.CompareNotEqual(left.AsVector128(), right.AsVector128())) != 0;
    }

    // Software fallback (the !Sse.IsSupported path discussed below).
    return !(left == right);
}

Member

When you just call !(left == right) (as in the !Sse.IsSupported path) will the codegen be affected?

Member Author

It shouldn't be. The == operator is also aggressively inlined and the Sse.IsSupported path will be dropped from there as well, so it should be identical to the code it was previously generating.

@tannergooding (Member Author)

Let me see if I understand this correctly. You are saying that everything is fine and dandy and the simple fact that the JIT diff dump is more than 50% larger than the base doesn't sound any alarms.

Yes, and I had taken a look at the diff. A very large chunk of the size difference is the inliner spewing information about Vector4.op_Addition, which is then efficiently inlined due to all but one path being treated as dead code.
The remaining size difference is largely the JIT needing to keep some trees around until later in the pipeline, because it wants to preserve the fact that there was a method call so as not to break observable side effects.

Basically it goes from simply recognizing the op_Addition intrinsic and generating a tree from it on the spot to inlining this monster:

This is a general problem with intrinsics and inlining. Having the IsSupported checks introduces a large amount of IL, which the JIT then needs to be told it can import anyway because all but one path will be eliminated.
Some of this could be mitigated with slightly different coding patterns (such as keeping the checks in the main method and calling inlined helper methods per ISA to reduce its IL size), but when that is beneficial can vary.
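
A rough sketch of that coding pattern (hypothetical method names): the public method carries only the IsSupported dispatch and each ISA-specific body lives in a small aggressively-inlined helper, keeping the main method's IL compact:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class DispatchSketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4 Add(Vector4 left, Vector4 right)
    {
        // The IsSupported check is a JIT-time constant, so only one helper call survives importation.
        if (Sse.IsSupported)
        {
            return AddSse(left, right);
        }

        return AddSoftware(left, right);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector4 AddSse(Vector4 left, Vector4 right)
        => Sse.Add(left.AsVector128(), right.AsVector128()).AsVector4();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector4 AddSoftware(Vector4 left, Vector4 right)
        => new Vector4(left.X + right.X, left.Y + right.Y, left.Z + right.Z, left.W + right.W);
}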

But it's a freakin' forest in reality

I believe this is, again, a general problem with intrinsics and inlining. Due to there being an explicit method call, the JIT initially tracks the two input arguments as "address taken" and requiring a copy. The return value is also explicitly tracked separately.
Ideally, this could just be folded away in importation or have the JIT explicitly told "there aren't any side effects here, make better trees".

@tannergooding (Member Author)

but this approach is likely to come with a few elephants, tigers and monkeys in the baggage.

Right. There will likely be a few gotchas and some improvements needed in the JIT (but this was already true for code using intrinsics, whether SIMD or HW).
If we had implemented the features in reverse (HWIntrinsics and then the more general SIMD), I believe we likely would have gone this route (or maybe just handled it in importation creating the relevant HWIntrinsic trees instead).

What I'm trying to solve is that we currently have some more general TYP_SIMD logic in the JIT and then the S.Numerics and S.R.Intrinsics support. The support for the latter two is very similar, but not identical.
So we have a lot of duplicated code and logic for these features, which makes them harder to support more generally, as you need separate handling in each phase.

Now, we could instead try to share some of this infrastructure in the JIT, but I think there would still be some disparity there, which would make this less than ideal.
I had opted to go this route as I felt it would simplify the JIT the most, would have the end code implemented similarly to how a third party would need to write it, and would help identify some areas that might need improvement.

@CarolEidt

When you just call !(left == right) (as in the !Sse.IsSupported path) will the codegen be affected?

It shouldn't be. The == operator is also aggressively inlined and the Sse.IsSupported path will be dropped from there as well, so it should be identical to the code it was previously generating.

This is not quite true. As seen in #24912, even though these checks are elided in the importer, the additional basic blocks aren't coalesced until later in the JIT. This means that the local assertion prop doesn't operate across the blocks, and some optimizations are missed, since global assertion prop is not identical. I attempted two different approaches to fixing this (https://github.com/CarolEidt/coreclr/tree/CopyProp1 and https://github.com/CarolEidt/coreclr/tree/CopyProp2), but both had mixed results.

Regarding the discussion above wrt the "general problem" of inlining intrinsic methods: this problem needs to be solved (or at least sufficiently mitigated) before we accept it as a given and make changes that incur these costs where they were previously less of an issue.

I would reiterate what I've said before: I think this is the right direction long-term, but we need to be driven by specific requirements and priorities, and I think it's important that we don't go down this path prematurely.

@AndyAyersMS (Member)

Ideally, this could just be folded away in importation or have the JIT explicitly told "there aren't any side effects here, make better trees"

There might be point fixes we can make, but in general the importer doesn't have enough context to safely do this. I don't see annotations or similar as the right solution. I think a better plan involves some sort of post-inline cleanup pass to coalesce blocks and rebuild trees, and a more detailed assessment of how the optimizer scales with large numbers of temps, so we can relax limits or back off opts gradually rather than the current all or nothing approach.

In most cases where we see HW intrinsics or vectors used we can safely assume the method is important for performance. We already have a bit of this hinting in the inliner, but it's pretty simplistic.

@tannergooding (Member Author)

Will go ahead and close this for now. It was mostly just a prototype to validate that this was indeed possible and produced good codegen 😄.

There are some obvious problems that still exist and it might be good to ensure that explicit bugs exist to track those, but we can also revisit this after some more fixes go through.
