Enable AVX512 embedded masking for most other intrinsics #101886

tannergooding · 2024-05-05T17:47:17Z

This is a continuation of #97675 and almost finishes out #87097.

In particular, it enables the embedded masking support for all intrinsics except for the various load, store, move, and broadcast intrinsics that explicitly deal with memory operations.

As part of this, the PR explicitly marks intrinsics which should never appear as the intrinsicId of a node to help ensure the relevant intrinsics are being properly handled.

dotnet-policy-service · 2024-05-05T17:47:54Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding · 2024-05-06T23:22:49Z

src/coreclr/jit/compiler.h

+    bool canUseEmbeddedBroadcast() const
+    {
+        return JitConfig.EnableEmbeddedBroadcast();
+    }
+
+    bool canUseEmbeddedMasking() const
+    {
+        return JitConfig.EnableEmbeddedMasking();
+    }


The embedded broadcast/masking support for both AVX512 and SVE is fairly complex in parts. Having a knob to disable it helps validate the perf/size wins of the feature and lets users work around any issues that turn up.
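As a rough illustration of how such a knob gates the optimization, here is a minimal standalone sketch. It is not the JIT's code: `ConfigSketch`, `CompilerSketch`, and `chooseMaskedAddForm` are hypothetical names, and only the two `canUse...` accessors mirror the diff above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the JIT configuration object (assumption, not the real JitConfig).
struct ConfigSketch
{
    bool enableEmbeddedBroadcast = true;
    bool enableEmbeddedMasking   = true;
};

// Hypothetical stand-in for the compiler instance; the accessors mirror the ones added in the diff.
struct CompilerSketch
{
    ConfigSketch config;

    bool canUseEmbeddedBroadcast() const { return config.enableEmbeddedBroadcast; }
    bool canUseEmbeddedMasking() const { return config.enableEmbeddedMasking; }
};

// Decide whether a masked add is folded into one instruction or kept as add + blend.
// The returned strings are labels for illustration only.
inline std::string chooseMaskedAddForm(const CompilerSketch& comp, bool isaSupportsMasking)
{
    if (isaSupportsMasking && comp.canUseEmbeddedMasking())
    {
        return "add {k1}";
    }
    return "add + blend";
}
```

With the knob off, the same input falls back to the unfused form, which is what makes before/after size and perf comparisons possible.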

kunalspathak

I skimmed through the changes in hwintrinsiclistxarch.h and instrsxarch.h and they looked OK to me. If there are any specific changes, other than adding HW_Flag_InvalidNodeId or INS_Flags_EmbeddedBroadcastSupported, let me know.

Overall the changes look good. In multiple places, giving insOpts a default value would save us from passing INS_OPTS_NONE around.

Waiting for superpmi-diff results.

tannergooding · 2024-05-07T18:22:49Z

> Overall the changes look good. In multiple places, giving insOpts a default value would save us from passing INS_OPTS_NONE around.

I opted to give it a default for the more generally used emitIns_* APIs but to require it for emitIns_SIMD_*. The latter needs to be more explicit: I wanted to ensure that every call site passes insOpts through, and that wherever INS_OPTS_NONE is passed explicitly, it has been considered whether broadcasting, masking, or rounding needs to be handled.
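The trade-off can be sketched in a few lines of standalone C++. Everything here is hypothetical (`InsOptsSketch`, `emitInsSketch`, `emitInsSimdSketch`, and the suffix strings); the real emitter signatures are not shown in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical option set; the real JIT has its own insOpts enumeration.
enum class InsOptsSketch { None, EmbeddedBroadcast, EmbeddedMask };

// Render the option as a disassembly-style suffix (illustrative only).
inline std::string suffixFor(InsOptsSketch opts)
{
    switch (opts)
    {
        case InsOptsSketch::EmbeddedBroadcast: return " {1to16}";
        case InsOptsSketch::EmbeddedMask:      return " {k1}";
        case InsOptsSketch::None:              return "";
    }
    return "";
}

// General-purpose helper: most callers do not care, so the options default to None.
inline std::string emitInsSketch(const std::string& ins, InsOptsSketch opts = InsOptsSketch::None)
{
    return ins + suffixFor(opts);
}

// SIMD helper: no default, so every call site must state its choice, even when it is None.
inline std::string emitInsSimdSketch(const std::string& ins, InsOptsSketch opts)
{
    return ins + suffixFor(opts);
}
```

Calling `emitInsSimdSketch("vaddps")` without the second argument fails to compile, which is the point: a forgotten option becomes a build error rather than a silently dropped mask or broadcast.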

kunalspathak · 2024-05-08T19:58:10Z

I didn't quite understand the changes made in f83162d. Can you elaborate? Other than that, things look good.

tannergooding · 2024-05-08T20:07:49Z

> I didn't quite understand the changes made in https://github.com/dotnet/runtime/commit/f83162d11baa9b139d6497cbdf61f50779e7d5bd ... can you elaborate? Other than that, things look good.

For SSE-SSE41 there isn't actually an instruction to do floating-point CompareGreaterThan or CompareGreaterThanOrEqual. Conversely, for SSE-AVX2 there isn't actually an instruction to do integer CompareLessThan. Instead, these were emulated by swapping the operands in an early phase (import or lowering). For example, given float CGT x, y, lowering would change it to CGT y, x and codegen would simply emit it as CLT y, x. This was originally done to simplify various other bits, and it worked because we never needed to make observations about these intrinsics from that point on.

With AVX512 and the ability to do embedded masking, we want to emit CompareGreaterThanMask instead, but only when it's part of a ConditionalSelect (and we know the mask register will be used directly). Because we were swapping from CGT x, y to CGT y, x, we had no way to know that the node now actually meant CLT y, x and thus should become CompareLessThanMask instead.

So the change just ensures that we stop lying about the operation being done when the operands are swapped. CGT x, y now becomes CLT y, x, and later operations can correctly introspect the operation and do the right thing.
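A minimal sketch of the "don't lie about the operation" idea, assuming a toy node type rather than the JIT's real IR. `CmpOp`, `CmpNode`, `mirrored`, and `swapOperandsHonestly` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical comparison kinds; the JIT uses intrinsic ids instead.
enum class CmpOp { GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual };

// Toy comparison node holding concrete operand values so it can be evaluated.
struct CmpNode
{
    CmpOp  op;
    double lhs;
    double rhs;
};

// The comparison that gives the same result once the operands trade places.
inline CmpOp mirrored(CmpOp op)
{
    switch (op)
    {
        case CmpOp::GreaterThan:        return CmpOp::LessThan;
        case CmpOp::GreaterThanOrEqual: return CmpOp::LessThanOrEqual;
        case CmpOp::LessThan:           return CmpOp::GreaterThan;
        case CmpOp::LessThanOrEqual:    return CmpOp::GreaterThanOrEqual;
    }
    return op;
}

// Swap the operands AND rename the operation, so the node keeps describing
// what will actually be computed (CGT x, y becomes CLT y, x).
inline CmpNode swapOperandsHonestly(const CmpNode& n)
{
    return CmpNode{mirrored(n.op), n.rhs, n.lhs};
}

inline bool evaluate(const CmpNode& n)
{
    switch (n.op)
    {
        case CmpOp::GreaterThan:        return n.lhs > n.rhs;
        case CmpOp::GreaterThanOrEqual: return n.lhs >= n.rhs;
        case CmpOp::LessThan:           return n.lhs < n.rhs;
        case CmpOp::LessThanOrEqual:    return n.lhs <= n.rhs;
    }
    return false;
}
```

Because the node's `op` field stays truthful after the swap, a later phase can pick the matching mask form (CompareLessThanMask for the swapped node) by looking at the node alone.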

kunalspathak

LGTM

tannergooding · 2024-05-08T22:16:25Z

linux x64

Diffs are based on 2,304,731 contexts (997,292 MinOpts, 1,307,439 FullOpts).

MISSED contexts: 7 (0.00%)

Overall (-21,300 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
benchmarks.run.linux.x64.checked.mch	15,697,916	-360	-1.24%
benchmarks.run_pgo.linux.x64.checked.mch	70,177,184	-1,995	-1.17%
benchmarks.run_tiered.linux.x64.checked.mch	15,151,406	-491	-1.16%
coreclr_tests.run.linux.x64.checked.mch	416,332,755	-10,909	-1.87%
libraries.pmi.linux.x64.checked.mch	61,114,110	-87	+1.66%
libraries_tests.run.linux.x64.Release.mch	354,860,299	-5,100	-0.50%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	133,696,745	-1,982	-1.12%
realworld.run.linux.x64.checked.mch	13,650,157	-255	-0.37%
smoke_tests.nativeaot.linux.x64.checked.mch	4,210,219	-121	-1.45%

MinOpts (-8,946 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
benchmarks.run_pgo.linux.x64.checked.mch	23,161,983	-522	-0.91%
benchmarks.run_tiered.linux.x64.checked.mch	11,562,115	-393	-0.81%
coreclr_tests.run.linux.x64.checked.mch	289,839,696	-6,365	-1.89%
libraries_tests.run.linux.x64.Release.mch	193,849,452	-1,666	-0.66%

FullOpts (-12,354 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
benchmarks.run.linux.x64.checked.mch	15,323,854	-360	-1.24%
benchmarks.run_pgo.linux.x64.checked.mch	47,015,201	-1,473	-1.28%
benchmarks.run_tiered.linux.x64.checked.mch	3,589,291	-98	-2.11%
coreclr_tests.run.linux.x64.checked.mch	126,493,059	-4,544	-1.83%
libraries.pmi.linux.x64.checked.mch	61,000,849	-87	+1.66%
libraries_tests.run.linux.x64.Release.mch	161,010,847	-3,434	-0.43%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	123,001,625	-1,982	-1.12%
realworld.run.linux.x64.checked.mch	13,242,834	-255	-0.37%
smoke_tests.nativeaot.linux.x64.checked.mch	4,209,172	-121	-1.45%

windows x64

Diffs are based on 2,615,190 contexts (1,040,939 MinOpts, 1,574,251 FullOpts).

MISSED contexts: 4 (0.00%)

Overall (-23,037 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	63,103,761	-2,979	-1.08%
benchmarks.run.windows.x64.checked.mch	8,728,529	-395	-1.21%
benchmarks.run_pgo.windows.x64.checked.mch	35,373,663	-1,881	-1.13%
benchmarks.run_tiered.windows.x64.checked.mch	13,034,809	-441	-0.99%
coreclr_tests.run.windows.x64.checked.mch	404,903,405	-10,624	-1.92%
libraries.crossgen2.windows.x64.checked.mch	45,226,392	-6	-0.29%
libraries.pmi.windows.x64.checked.mch	62,293,489	-465	+0.46%
libraries_tests.run.windows.x64.Release.mch	301,873,279	-4,482	-1.55%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	138,326,612	-1,729	-1.14%
realworld.run.windows.x64.checked.mch	13,553,568	+91	-0.43%
smoke_tests.nativeaot.windows.x64.checked.mch	5,016,989	-126	-1.32%

MinOpts (-9,043 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	29,168,185	-445	-1.16%
benchmarks.run_pgo.windows.x64.checked.mch	14,311,838	-507	-0.94%
benchmarks.run_tiered.windows.x64.checked.mch	9,747,978	-393	-0.84%
coreclr_tests.run.windows.x64.checked.mch	282,314,302	-6,281	-1.90%
libraries_tests.run.windows.x64.Release.mch	186,524,539	-1,417	-0.46%

FullOpts (-13,994 bytes)

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
aspnet.run.windows.x64.checked.mch	33,935,576	-2,534	-1.06%
benchmarks.run.windows.x64.checked.mch	8,728,100	-395	-1.21%
benchmarks.run_pgo.windows.x64.checked.mch	21,061,825	-1,374	-1.23%
benchmarks.run_tiered.windows.x64.checked.mch	3,286,831	-48	-1.40%
coreclr_tests.run.windows.x64.checked.mch	122,589,103	-4,343	-1.96%
libraries.crossgen2.windows.x64.checked.mch	45,224,679	-6	-0.29%
libraries.pmi.windows.x64.checked.mch	62,179,538	-465	+0.46%
libraries_tests.run.windows.x64.Release.mch	115,348,740	-3,065	-2.12%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	127,486,318	-1,729	-1.14%
realworld.run.windows.x64.checked.mch	13,147,847	+91	-0.43%
smoke_tests.nativeaot.windows.x64.checked.mch	5,015,942	-126	-1.32%

tannergooding · 2024-05-08T22:20:33Z

Diffs generally look similar to the following:

-9 (-23.08%) : 69294.dasm - System.Buffers.IndexOfAnyAsciiSearcher+Ssse3AndWasmHandleZeroInNeedle:PackSources(System.Runtime.Intrinsics.Vector128`1[ushort],System.Runtime.Intrinsics.Vector128`1[ushort]):System.Runtime.Intrinsics.Vector128`1[ubyte] (Tier1)

@@ -15,7 +15,7 @@
 ;* V03 loc0         [V03    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[short]>
 ;* V04 loc1         [V04    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[short]>
 ;# V05 OutArgs      [V05    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V06 cse0         [V06,T03] (  3,  3   )  simd16  ->  mm1         "CSE #01: aggressive"
+;  V06 cse0         [V06,T03] (  3,  3   )  simd16  ->  mm0         "CSE #01: aggressive"
 ;
 ; Lcl frame size = 0
 
@@ -23,23 +23,21 @@ G_M3343_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=0 bbWeight=1 PerfScore 0.00
 G_M3343_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0106 {rcx rdx r8}, byref
        ; byrRegs +[rcx rdx r8]
-       vmovups  xmm0, xmmword ptr [rdx]
-       vmovups  xmm1, xmmword ptr [reloc @RWD00]
-       vpminuw  xmm0, xmm0, xmm1
-       vmovups  xmm2, xmmword ptr [r8]
-       vpminuw  xmm1, xmm2, xmm1
-       vpackuswb xmm0, xmm0, xmm1
+       vmovups  xmm0, xmmword ptr [reloc @RWD00]
+       vpminuw  xmm1, xmm0, xmmword ptr [rdx]
+       vpminuw  xmm0, xmm0, xmmword ptr [r8]
+       vpackuswb xmm0, xmm1, xmm0
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
        ; byrRegs +[rax]
-						;; size=38 bbWeight=1 PerfScore 15.25
+						;; size=29 bbWeight=1 PerfScore 12.25
 G_M3343_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
 RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh
 
 
-; Total bytes of code 39, prolog size 0, PerfScore 16.25, instruction count 9, allocated bytes for code 39 (MethodHash=940ff2f0) for method System.Buffers.IndexOfAnyAsciiSearcher+Ssse3AndWasmHandleZeroInNeedle:PackSources(System.Runtime.Intrinsics.Vector128`1[ushort],System.Runtime.Intrinsics.Vector128`1[ushort]):System.Runtime.Intrinsics.Vector128`1[ubyte] (Tier1)
+; Total bytes of code 30, prolog size 0, PerfScore 13.25, instruction count 7, allocated bytes for code 30 (MethodHash=940ff2f0) for method System.Buffers.IndexOfAnyAsciiSearcher+Ssse3AndWasmHandleZeroInNeedle:PackSources(System.Runtime.Intrinsics.Vector128`1[ushort],System.Runtime.Intrinsics.Vector128`1[ushort]):System.Runtime.Intrinsics.Vector128`1[ubyte] (Tier1)
 ; ============================================================
 
 Unwind Info:

-3 (-4.69%) : 4801.dasm - System.Diagnostics.Stopwatch:GetElapsedTime(long,long):System.TimeSpan (Tier1)

@@ -26,22 +26,21 @@ G_M44428_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        vxorps   xmm0, xmm0, xmm0
        vcvtsi2sd xmm0, xmm0, rdx
        vfixupimmsd xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
-       vmovups  xmm1, xmmword ptr [reloc @RWD16]
-       vcmppd   xmm2, xmm0, xmmword ptr [reloc @RWD32], 13
-       vcvttsd2si  rax, xmm0
-       vpbroadcastq  xmm0, rax
-       vpternlogq xmm2, xmm1, xmm0, -54
-       vmovd    rax, xmm2
-						;; size=63 bbWeight=1 PerfScore 27.08
+       vcmppd   k1, xmm0, xmmword ptr [reloc @RWD16], 13
+       vcvttsd2si rax, xmm0
+       vpbroadcastq xmm0, rax
+       vpblendmq xmm0 {k1}, xmm0, xmmword ptr [reloc @RWD32]
+       vmovd    rax, xmm0
+						;; size=60 bbWeight=1 PerfScore 25.58
 G_M44428_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
 RWD00  	dq	0000000000000088h, 0000000000000000h
-RWD16  	dq	7FFFFFFFFFFFFFFFh, 7FFFFFFFFFFFFFFFh
-RWD32  	dq	43E0000000000000h, 43E0000000000000h
+RWD16  	dq	43E0000000000000h, 43E0000000000000h
+RWD32  	dq	7FFFFFFFFFFFFFFFh, 7FFFFFFFFFFFFFFFh
 
 
-; Total bytes of code 64, prolog size 0, PerfScore 28.08, instruction count 11, allocated bytes for code 64 (MethodHash=d8da5273) for method System.Diagnostics.Stopwatch:GetElapsedTime(long,long):System.TimeSpan (Tier1)
+; Total bytes of code 61, prolog size 0, PerfScore 26.58, instruction count 10, allocated bytes for code 62 (MethodHash=d8da5273) for method System.Diagnostics.Stopwatch:GetElapsedTime(long,long):System.TimeSpan (Tier1)
 ; ============================================================
 
 Unwind Info:

tannergooding · 2024-05-08T22:24:56Z

In the optimal case, like seen in some of the tests, we can convert something like:

Vector512.ConditionalSelect(mask, x + Vector512.Create(cns), Vector512<T>.Zero)

into

vaddps zmm0 {k1}{z}, zmm0, dword ptr [rax] {1to16}

The few regressions that do show up tend to come from using the EVEX encoding, but that's to be expected as we're using larger instructions that have lower cost. There are still some improvements that could be made around containment and commutativity (such as inverting the mask or special-casing more kinds of blending), but those are longer-term goals.
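To make the semantics of that single instruction concrete, here is a scalar emulation written as a sketch. It assumes 16 float lanes and a per-lane mask bit (as a `k` register provides); `Vec16f`, `addThenSelect`, and `maskedAddBroadcastZero` are hypothetical names and not part of any .NET or JIT API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;
using Vec16f = std::array<float, kLanes>;

// Unfused reference: broadcast the constant, add, then select against zero per lane.
inline Vec16f addThenSelect(std::uint16_t mask, const Vec16f& x, float cns)
{
    Vec16f broadcast{};
    broadcast.fill(cns); // the separate broadcast step

    Vec16f sum{};
    for (std::size_t i = 0; i < kLanes; ++i)
    {
        sum[i] = x[i] + broadcast[i]; // the separate add step
    }

    Vec16f result{};
    for (std::size_t i = 0; i < kLanes; ++i)
    {
        result[i] = ((mask >> i) & 1u) ? sum[i] : 0.0f; // the separate select step
    }
    return result;
}

// Fused form: what one add with {k1}{z} and a {1to16} memory operand computes.
inline Vec16f maskedAddBroadcastZero(std::uint16_t mask, const Vec16f& x, float cns)
{
    Vec16f result{}; // value-initialized, so unselected lanes stay zero ({z})
    for (std::size_t i = 0; i < kLanes; ++i)
    {
        if ((mask >> i) & 1u) // {k1}: only selected lanes are written
        {
            result[i] = x[i] + cns; // {1to16}: the same scalar is used for every lane
        }
    }
    return result;
}
```

Both functions return the same vector for any mask; the fused version simply never materializes the broadcast or the intermediate sum.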

tannergooding · 2024-05-08T22:52:41Z

Some other prominent diffs look like:

-       vcmppd   xmm4, xmm0, xmm0, 0
-       vcmppd   xmm5, xmm3, xmm3, 0
-       vcmppd   xmm6, xmm1, xmm2, 0
-       vxorps   xmm7, xmm7, xmm7
-       vpcmpgtq xmm7, xmm7, xmm0
-       vpternlogq xmm7, xmm0, xmm3, -54
-       vcmppd   xmm1, xmm2, xmm1, 1
-       vpternlogq xmm1, xmm0, xmm3, -54
-       vpternlogq xmm6, xmm7, xmm1, -54
-       vpternlogq xmm5, xmm6, xmm3, -54
-       vpternlogq xmm4, xmm5, xmm0, -54
-       vmovups  xmmword ptr [rcx], xmm4
+       vcmppd   k1, xmm0, xmm0, 0
+       vcmppd   k2, xmm3, xmm3, 0
+       vcmppd   k3, xmm1, xmm2, 0
+       vxorps   xmm4, xmm4, xmm4
+       vpcmpgtq k4, xmm4, xmm0
+       vblendmpd xmm4 {k4}, xmm3, xmm0
+       vcmppd   k4, xmm2, xmm1, 1
+       vblendmpd xmm1 {k4}, xmm3, xmm0
+       vblendmpd xmm1 {k3}, xmm1, xmm4
+       vblendmpd xmm1 {k2}, xmm3, xmm1
+       vblendmpd xmm0 {k1}, xmm0, xmm1
+       vmovups  xmmword ptr [rcx], xmm0

and

-       vpcmpgtd ymm1, ymm1, ymm0
-       vxorps   ymm2, ymm2, ymm2
-       vpsubd   ymm2, ymm2, ymm0
-       vpternlogd ymm1, ymm2, ymm0, -54
-       vxorps   ymm2, ymm2, ymm2
-       vpcmpgtd ymm1, ymm2, ymm1
+       vpcmpgtd k1, ymm1, ymm0
+       vmovaps  ymm2, ymm0
+       vpsubd   ymm2 {k1}, ymm1, ymm0
+       vpcmpgtd ymm1, ymm1, ymm2
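For readers decoding the removed `vpternlog` lines: the immediate `-54` is the byte `0xCA` printed as a signed value, and `0xCA` is the truth table for a bitwise select (first operand picks between the second and third). The sketch below models that truth-table lookup on 64-bit integers; `ternaryLogic` and `bitwiseSelect` are hypothetical helper names, not JIT functions.

```cpp
#include <cassert>
#include <cstdint>

// Evaluate a three-input truth table bit by bit, the way vpternlog does:
// the bits of a, b, and c at each position form a 3-bit index (a is the high bit),
// and the result bit is the bit of imm8 at that index.
inline std::uint64_t ternaryLogic(std::uint64_t a, std::uint64_t b, std::uint64_t c, std::uint8_t imm8)
{
    std::uint64_t result = 0;
    for (int bit = 0; bit < 64; ++bit)
    {
        const unsigned index = static_cast<unsigned>((((a >> bit) & 1u) << 2) |
                                                     (((b >> bit) & 1u) << 1) |
                                                     ((c >> bit) & 1u));
        result |= static_cast<std::uint64_t>((imm8 >> index) & 1u) << bit;
    }
    return result;
}

// Plain bitwise select: take bits from ifSet where mask is 1, from ifClear where it is 0.
inline std::uint64_t bitwiseSelect(std::uint64_t mask, std::uint64_t ifSet, std::uint64_t ifClear)
{
    return (mask & ifSet) | (~mask & ifClear);
}
```

With a per-lane `k` mask available, the same select can be expressed as a blend or folded into the producing instruction, which is what the added lines in the diffs do.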

The TP regression peaks at around +0.06% in minopts. I had actually tried to avoid doing this in minopts in one of the early PRs and that actually turned out to be closer to a +1.1% regression due to the register allocator having to do overall more work in the typical scenario. So this ends up being an overall good balance across the entirety of the code.

tannergooding · 2024-05-09T01:21:41Z

CC. @fanyang-mono, seems there's a Mono LLVMAOT failure in the form of:

  /__w/1/s/artifacts/bin/mono/linux.x64.Release/opt: mono_aot_dFNFkp/temp.bc: error: Invalid record (Producer: 'LLVM16.0.5' Reader: 'LLVM 16.0.5')
  AOT of image /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_X86_r/X86_Sse2_r.dll failed.
  Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_X86_r/X86_Sse2_r.dll

I'm guessing this has something to do with the Mono V128 acceleration for x64, but it's not clear what in the tests would be causing it. From what I can tell it's only failing for Sse2_r and Sse2_ro, while the additional tests exist more broadly and are replicated across other projects too. The actual test additions are also fairly simple: just 4 new methods that cover embedded broadcast and embedded masking patterns using the xplat APIs, which should already be well supported or fall back to the software implementation for Mono.

tannergooding · 2024-05-09T01:44:16Z

I've logged #102037 to track the general issue

* Remove HW_Flag_MultiIns in favor of using HW_Flag_SpecialCodeGen
* Add a new flag HW_Flag_InvalidNodeId
* Change HW_Flag_EmbMaskingIncompatible to be HW_Flag_EmbMaskingCompatible
* Mark various compare intrinsics with HW_Flag_NoEvexSemantics
* Marking various intrinsics as EmbBroadcastCompatible, EmbMaskingCompatible, or Commutative
* Applying formatting patch
* Ensure WithLower/WithUpper are not marked as InvalidNodeId
* Ensure that instOptions are being passed down all relevant hwintrinsic code paths
* Ensure the insOpts are plumbed through for EVEX instructions
* Ensure EVEX instructions are properly annotated with EmbeddedBroadcastSupported
* Ensure that embedded broadcast/masking is displayed in the disassembly
* Applying formatting patch
* Updating the hwintrinsic tests to cover embedded broadcast/masking
* Fix some handling in the JIT related to embedded broadcast/masking
* Fixup some tests where validating embedded masking is non-trivial
* Cleanup some cases found by SPMI
* Ensure that CompareLessThan has its operands swapped back if its being converted to the AVX512 form
* Don't regress a scenario around op_Equality and TYP_MASK
* Adjusting hardware intrinsic tests to test non-zero masks
* Avoid some messiness around operand swapping
* Ensure embedded masks mark TYP_SIMD16 and TYP_SIMD32 instructions as needing EVEX
* Mark Sse2_r/Sse2_ro as AotIncompatible due to runtime/102037

tannergooding added 5 commits May 4, 2024 07:59

Remove HW_Flag_MultiIns in favor of using HW_Flag_SpecialCodeGen

d341230

Add a new flag HW_Flag_InvalidNodeId

134a43b

Change HW_Flag_EmbMaskingIncompatible to be HW_Flag_EmbMaskingCompatible

f0cc9dc

Mark various compare intrinsics with HW_Flag_NoEvexSemantics

ea8ab95

Marking various intrinsics as EmbBroadcastCompatible, EmbMaskingCompatible, or Commutative

089ee48

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 5, 2024

dotnet-policy-service bot assigned tannergooding May 5, 2024

tannergooding added 9 commits May 5, 2024 11:17

Applying formatting patch

2252608

Ensure WithLower/WithUpper are not marked as InvalidNodeId

067633e

Ensure that instOptions are being passed down all relevant hwintrinsic code paths

1cefd82

Ensure the insOpts are plumbed through for EVEX instructions

08039ed

Ensure EVEX instructions are properly annotated with EmbeddedBroadcastSupported

6014628

Ensure that embedded broadcast/masking is displayed in the disassembly

0c9b804

Applying formatting patch

ed3cd21

Updating the hwintrinsic tests to cover embedded broadcast/masking

57efa0a

Fix some handling in the JIT related to embedded broadcast/masking

5e6e1af

tannergooding commented May 6, 2024

tannergooding added 2 commits May 7, 2024 00:03

Fixup some tests where validating embedded masking is non-trivial

47d01d0

Cleanup some cases found by SPMI

40ffaa3

tannergooding force-pushed the avx512-embed-mask branch from 65da3bf to 40ffaa3 Compare May 7, 2024 17:54

kunalspathak reviewed May 7, 2024

Ensure that CompareLessThan has its operands swapped back if its being converted to the AVX512 form

efe1127

This was referenced May 8, 2024

System.Net.Tests.HttpWebRequestTest_Async.GetResponseAsync_ParametersAreNotCachable_CreateNewClient test fails #100912

Closed

Test failure in System.Numerics.Tensors.Tests.SingleGenericTensorPrimitives.SpanDestinationFunctions_SpecialValues #101731

Closed

tannergooding added 3 commits May 7, 2024 19:53

Don't regress a scenario around op_Equality and TYP_MASK

40c0ce4

Adjusting hardware intrinsic tests to test non-zero masks

0db9b67

Avoid some messiness around operand swapping

f83162d

tannergooding marked this pull request as ready for review May 8, 2024 17:13

Ensure embedded masks mark TYP_SIMD16 and TYP_SIMD32 instructions as needing EVEX

37b9fb1

build-analysis bot mentioned this pull request May 8, 2024

arm32 fails in CI with "/lib/arm-linux-gnueabihf/libc.so.6: version `GLIBC_2.34' not found" #102030

Closed

kunalspathak approved these changes May 8, 2024

tannergooding mentioned this pull request May 9, 2024

Mono LLVMAOT fails for Sse2_r and Sse2_ro with Invalid Record #102037

Open

Mark Sse2_r/Sse2_ro as AotIncompatible due to runtime/102037

e6e6272

tannergooding merged commit 5fdb133 into dotnet:main May 9, 2024
117 of 120 checks passed

tannergooding deleted the avx512-embed-mask branch May 9, 2024 17:06

tannergooding mentioned this pull request May 9, 2024

Add EVEX encoding opmask (k) register masking to xarch emitter #80821

Closed

mkhamoyan mentioned this pull request May 13, 2024

[mono] [aot] Mono mini and LLVM fullAOT CI jobs are failing due to Failed to load AOT module X86_Sse2_r/ro.dll in aot-only mode #102150

Open

github-actions bot locked and limited conversation to collaborators Jun 9, 2024
