JIT: Unblock Vector###<long> intrinsics on x86 #112728

Open

saucecontrol wants to merge 16 commits into main

Conversation

saucecontrol (Member) commented on Feb 20, 2025:

This resolves a large number of TODOs around HWIntrinsic expansion involving scalar longs on x86.

The most significant change here is promoting CreateScalar and ToScalar to be code-generating intrinsics instead of converting them to other intrinsics at lowering. This was necessary in order to emit movq for scalar long loads/stores, but it also unlocks several other optimizations: CreateScalar and ToScalar can now be contained, and codegen can be specialized depending on whether they end up loading from or storing to memory. Some example improvements on x64:

Vector128.CreateScalar(ref float):

-       vinsertps xmm0, xmm0, dword ptr [rbp+0x10], 14
+       vmovss   xmm0, dword ptr [rbp+0x10]

Vector128.CreateScalar(ref double):

-       vxorps   xmm0, xmm0, xmm0
-       vmovsd   xmm1, qword ptr [rbp-0x08]
-       vmovsd   xmm0, xmm0, xmm1
+       vmovsd   xmm0, qword ptr [rbp-0x08]

ref byte = Vector128<byte>.ToScalar():

-       vmovd    r9d, xmm3
-       mov      byte  ptr [r10], r9b
+       vpextrb  byte  ptr [r10], xmm3, 0

Vector<byte>.ToScalar():

-       vmovups  ymm0, ymmword ptr [esp+0x04]
-       vmovd    eax, xmm0
-       movzx    eax, al
+       movzx    eax, byte  ptr [esp+0x04]

And the less realistic, but still interesting
Sse.AddScalar(Vector128.CreateScalar(ref float), Vector128.CreateScalar(ref float)).ToScalar():

-       xorps    xmm0, xmm0
-       movss    xmm1, dword ptr [rcx]
-       movss    xmm0, xmm1
-       xorps    xmm1, xmm1
-       movss    xmm2, dword ptr [rdx]
-       movss    xmm1, xmm2
-       addss    xmm0, xmm1
+       movss    xmm0, dword ptr [rcx]
+       addss    xmm0, dword ptr [rdx]
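
For the 32-bit x86 target this PR is primarily about, the unblocked expansion means scalar long loads and stores can be emitted as a single movq. A minimal C# sketch of the affected pattern (the wrapper methods and names below are illustrative only, not taken from the PR):

```csharp
using System.Runtime.Intrinsics;

static class ScalarLongSketch
{
    // On 32-bit x86 a long occupies two 32-bit registers (or two stack slots),
    // which previously left these Vector128<long> intrinsics blocked behind TODOs;
    // with scalar long loads/stores handled via movq they can now be expanded.
    static Vector128<long> FromMemory(ref long value) => Vector128.CreateScalar(value);

    static long Extract(Vector128<long> v) => v.ToScalar();
}
```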

This also removes some redundant casts for CreateScalar of small types. Previously, a zero-extending cast was inserted unconditionally; it was sometimes removed by the peephole optimizer on x64, but often it wasn't.

Vector128.CreateScalar(short):

-       movsx    rax, dx
-       movzx    rax, ax
-       movd     xmm0, rax
+       movzx    rax, dx
+       movd     xmm0, eax

Vector128.CreateScalar(checked((byte)val)):

        cmp      edx, 255
        ja       SHORT G_M000_IG04
        mov      eax, edx
-       movzx    rax, al
-       vmovd    xmm0, rax
+       vmovd    xmm0, eax

Vector128.CreateScalar(ref sbyte):

-       movsx    rax, byte  ptr [rdx]
-       movzx    rax, al
-       vmovd    xmm0, rax
+       movzx    rax, byte  ptr [rdx]
+       vmovd    xmm0, eax

The x86 diffs are much more significant because of the newly enabled intrinsic expansion:

Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs
--- | --- | --- | ---
benchmarks.run.windows.x86.checked.mch | 7,149,204 | -1,892 | -2.17%
benchmarks.run_pgo.windows.x86.checked.mch | 46,986,713 | -738 | +0.03%
benchmarks.run_tiered.windows.x86.checked.mch | 9,470,045 | -976 | +0.11%
coreclr_tests.run.windows.x86.checked.mch | 320,065,247 | -205,564 | -6.41%
libraries.crossgen2.windows.x86.checked.mch | 31,314,339 | -15,854 | -4.11%
libraries.pmi.windows.x86.checked.mch | 34,326,245 | -14,416 | -2.19%
libraries_tests.run.windows.x86.Release.mch | 215,517,600 | -55,366 | -2.41%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch | 115,783,488 | -80,576 | -3.65%
realworld.run.windows.x86.checked.mch | 9,587,950 | -467 | -0.45%

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Feb 20, 2025
dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) on Feb 20, 2025
Contributor:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

saucecontrol (Member, Author) left a comment:

This is ready for review.
cc @tannergooding

Comment on lines -489 to -493
// Keep casts with operands usable from memory.
if (castOp->isContained() || castOp->IsRegOptional())
{
return op;
}
saucecontrol (Member, Author):

This condition, added in #72719, made this method effectively useless. Removing it was a zero-diff change. I can look in future at containing the casts rather than removing them.

@@ -4677,19 +4539,16 @@ GenTree* Lowering::LowerHWIntrinsicCreate(GenTreeHWIntrinsic* node)
return LowerNode(node);
}

GenTree* op2 = node->Op(2);

// TODO-XArch-AVX512 : Merge the NI_Vector512_Create and NI_Vector256_Create paths below.
saucecontrol (Member, Author):

The churn in this section is just taking care of this TODO.


assert(comp->compIsaSupportedDebugOnly(InstructionSet_SSE2));

tmp2 = InsertNewSimdCreateScalarUnsafeNode(TYP_SIMD16, op2, simdBaseJitType, 16);
LowerNode(tmp2);

node->ResetHWIntrinsicId(NI_SSE_MoveLowToHigh, tmp1, tmp2);
saucecontrol (Member, Author) commented on Feb 22, 2025:

Changing this to UnpackLow shows up as a regression in a few places because movlhps is one byte smaller, but it enables other optimizations, since unpcklpd accepts a memory operand plus masking and embedded broadcast.

Vector128.Create(double, 1.0):

-       vmovups  xmm0, xmmword ptr [reloc @RWD00]
-       vmovlhps xmm0, xmm1, xmm0
+       vunpcklpd xmm0, xmm1, qword ptr [reloc @RWD00] {1to2}

Member:

This should probably be peepholed back to vmovlhps if both operands come from registers.

saucecontrol (Member, Author):

I was thinking the same but would rather save that for a follow-up. LLVM has a replacement list of equivalent instructions that have different sizes, and unpcklpd is on it, as are things like vpermilps, which is replaced by pshufd.

It's worth having a discussion about whether we'd also want to do replacements that switch between the float and integer domains. I'll open an issue.

Comment on lines +2391 to +2395
if (varDsc->lvIsParam)
{
// Promotion blocks combined read optimizations for SIMD loads of long params
return;
}
saucecontrol (Member, Author):

In isolation, this change produced a small number of diffs and was mostly an improvement. A few regressions show up in the SPMI reports, but the overall impact is good, especially considering the places where we can now load a long into a vector register with movq.
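
As a hedged illustration of the case this guards (the method below is hypothetical, not from the PR): for a long parameter on 32-bit x86, leaving the parameter unpromoted lets the SIMD load read all 8 bytes from the parameter's home location in one combined read, whereas promoting it into two 32-bit locals would block that.

```csharp
using System.Runtime.Intrinsics;

static class LongParamSketch
{
    // If 'x' were promoted into two INT locals, the JIT could not fold the
    // CreateScalar into a single 8-byte (movq) load from the parameter's stack home.
    static Vector128<long> Widen(long x) => Vector128.CreateScalar(x);
}
```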

saucecontrol marked this pull request as ready for review on February 22, 2025, 00:18.

saucecontrol (Member, Author):
It occurred to me that the optimization to emit pinsrb/w for CreateScalarUnsafe was a bad idea because it creates a false dependency on the upper bits of the target register. Removed that.

assert(m_compiler->compIsaSupportedDebugOnly(InstructionSet_SSE2));

GenTree* thirtyTwo = m_compiler->gtNewIconNode(32);
GenTree* shift = m_compiler->gtNewSimdBinOpNode(GT_RSZ, op1->TypeGet(), simdTmpVar, thirtyTwo,
Member:

Isn't this missing ToScalar() to get the 32-bit integer out of the simd result?

saucecontrol (Member, Author):

ToScalar is the original intrinsicId, and it's built on the next line down. Full SSE2 codegen for Vector128<ulong>.ToScalar():

; Method Program:ToScalar(System.Runtime.Intrinsics.Vector128`1[ulong]):ulong (FullOpts)
G_M34649_IG01:  ;; offset=0x0000
       push     ebp
       mov      ebp, esp
       movups   xmm0, xmmword ptr [ebp+0x08]
						;; size=7 bbWeight=1 PerfScore 4.25

G_M34649_IG02:  ;; offset=0x0007
       movd     eax, xmm0
       psrlq    xmm0, 32
       movd     edx, xmm0
						;; size=13 bbWeight=1 PerfScore 4.50

G_M34649_IG03:  ;; offset=0x0014
       pop      ebp
       ret      16
						;; size=4 bbWeight=1 PerfScore 2.50
; Total bytes of code: 24
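
For reference, the decomposed form is logically equivalent to the following C# (a conceptual sketch with a hypothetical helper name, not the actual IR the JIT builds); the comments map each step onto the instructions above:

```csharp
using System.Runtime.Intrinsics;

static class ToScalarSketch
{
    static ulong ToScalarDecomposed(Vector128<ulong> v)
    {
        uint lo = v.AsUInt32().ToScalar();                                   // movd eax, xmm0
        uint hi = Vector128.ShiftRightLogical(v, 32).AsUInt32().ToScalar();  // psrlq xmm0, 32 ; movd edx, xmm0
        return ((ulong)hi << 32) | lo;                                       // 64-bit result returned in edx:eax
    }
}
```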

  }
- else
+ else if (op1->OperIs(GT_IND))
Member:

Why IND in particular and not other types of containable memory ops?

saucecontrol (Member, Author):

Good point, will fix.

saucecontrol (Member, Author) commented on Feb 24, 2025:

Added LCL_FLD as well. I believe LCL_VAR will have INT type, so even if marked DNE (do not enregister), it can't be handled with this case.

// * STOREIND long

GenTree* next = tree->gtNext;
if ((user != next) && !m_compiler->gtTreeHasSideEffects(next, GTF_SIDE_EFFECT))
jakobbotsch (Member) commented on Feb 24, 2025:

gtTreeHasSideEffects is for HIR, not LIR. For LIR you should use OperEffects. But also, relying on the execution order in this way is an anti-pattern. It means optimizations will subtly break from unrelated changes that people may make in the future. Can the whole thing be changed to use an appropriate IsInvariantInRange check?

saucecontrol (Member, Author):

Ok, yeah, I wasn't happy with this but wasn't sure of the best way to handle it. I started down the path of using IsInvariantInRange, but that's private to Lowering, so I'd have to move it to the public surface and then have lowering pass itself to DecomposeLongs. If that change is ok, I'll go ahead and do it.

Member:

That sounds ok to me. Alternatively this could be a static method on SideEffectSet or LIR and then each of Lower and DecomposeLongs would have a cached SideEffectSet to use for it.

saucecontrol (Member, Author):

Done. I used IsSafeToContainMem instead of IsInvariantInRange because the name and arg order make more sense (to me, at least) in this context.
