
[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

Draft · wants to merge 1 commit into base: main from perf/broadcast-avx512
Conversation

@RKSimon (Collaborator) commented Nov 27, 2023

On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.
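The tradeoff can be sketched in x86 asm (illustrative only; labels and constants are made up, not taken from the patch). Keeping the constant full width lets the load fold straight into the consuming instruction, while lowering it to a broadcast early materializes it in a register first:

```asm
# Full-width constant: the memory operand folds into the op, no extra register.
vpaddd .LCPI0_0(%rip), %ymm0, %ymm0

# Constant lowered to a broadcast early in the DAG: the value is materialized
# in ymm1 first, which raises register pressure (and is what MachineLICM then
# hoists and potentially spills in hot loops).
vpbroadcastd .LCPI0_1(%rip), %ymm1
vpaddd %ymm1, %ymm0, %ymm0
```

If the full-width constant load survives to X86FixupVectorConstantsPass, that pass can still rewrite it as a broadcast when profitable.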

This is still a WIP patch - as can be seen from the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether to just continue manually extending these tables or to wait for #66360 to be resolved.

Non-VLX AVX512 targets are still seeing some regressions due to many instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now let's keep them as they are (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is a major problem in larger vectorized hot loops. I'm currently thinking we should add an x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?

My goal is to improve AVX1/AVX2 vector constant handling, but getting AVX512 out of the way appears to be an easier first step.

github-actions bot commented Nov 27, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@RKSimon force-pushed the perf/broadcast-avx512 branch 8 times, most recently from 21fad09 to dae6506 on November 30, 2023 13:37
RKSimon added a commit that referenced this pull request Nov 30, 2023
@RKSimon force-pushed the perf/broadcast-avx512 branch 5 times, most recently from fc410b2 to c5db884 on December 7, 2023 14:22
@RKSimon (Collaborator, Author) commented Dec 7, 2023

Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors

@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
; AVX512F-NEXT: vpshufb %ymm0, %ymm1, %ymm2
; AVX512F-NEXT: vmovdqa 64(%rdi), %ymm3
; AVX512F-NEXT: vpshufb %ymm0, %ymm3, %ymm0
; AVX512F-NEXT: vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
; AVX512F-NEXT: vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]
Contributor

New broadcast?

Collaborator Author

Yes - still addressing regressions; that's why it's still a draft :)

@goldsteinn (Contributor) commented:
Can't memory ops microfuse on all targets? Why is this avx512vl only?

@RKSimon force-pushed the perf/broadcast-avx512 branch from c5db884 to 4fedfe0 on December 8, 2023 11:17
RKSimon added a commit that referenced this pull request Dec 8, 2023
We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary

Noticed in #73509
@RKSimon (Collaborator, Author) commented Dec 8, 2023

> Can't memory ops microfuse on all targets? Why is this avx512vl only?

I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-folding. By prematurely converting to constant broadcasts in the DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast, which then often spills the full-width broadcasted vector). By keeping the full vector width until X86FixupVectorConstants we avoid a lot of this.

I will eventually be disabling constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets, but there are a lot of regressions still to deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step.

@RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from adc89f0 to 3af0810 on December 8, 2023 13:21
@RKSimon force-pushed the perf/broadcast-avx512 branch 4 times, most recently from 0c80ea8 to 6b2809b on December 20, 2023 15:46
@RKSimon force-pushed the perf/broadcast-avx512 branch from 6b2809b to 5eff513 on January 2, 2024 13:42
RKSimon added a commit that referenced this pull request Jan 3, 2024
…ndi(z,w,c1)) to AVX512BW mask select

Yet another yak shaving regression fix for #73509
@RKSimon force-pushed the perf/broadcast-avx512 branch from 5eff513 to f738150 on January 3, 2024 13:03
@RKSimon force-pushed the perf/broadcast-avx512 branch from f738150 to 6d1519e on February 5, 2024 12:58
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
@RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 86ed907 to 925a8d0 on February 5, 2024 18:09
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
@RKSimon force-pushed the perf/broadcast-avx512 branch from 925a8d0 to 77629d5 on February 28, 2024 10:58
@RKSimon force-pushed the perf/broadcast-avx512 branch from 77629d5 to e8d60f1 on April 8, 2024 11:10
@RKSimon force-pushed the perf/broadcast-avx512 branch from e8d60f1 to 27c0a8a on April 18, 2024 21:14
@RKSimon force-pushed the perf/broadcast-avx512 branch from 27c0a8a to c6e33bf on June 12, 2024 10:46
@RKSimon force-pushed the perf/broadcast-avx512 branch from c6e33bf to 7f72657 on June 19, 2024 12:20
… targets

On AVX512 targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure.

If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now.
@RKSimon force-pushed the perf/broadcast-avx512 branch from 7f72657 to 0a147dc on June 19, 2024 13:16
// TODO: Add support for RegBitWidth, but currently rebuildSplatCst
// doesn't require it (defaults to Constant::getPrimitiveSizeInBits).
if (FixupConstant(Fixups, 0, OpNoBcst64))
return true;
Contributor

Maybe turn the inside of the condition into a lambda to avoid the 3x duplication?
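A minimal sketch of what the suggestion could look like, assuming the pass body repeats `if (FixupConstant(Fixups, 0, OpNoBcstN)) return true;` once per broadcast width. `FixupConstant` here is a stand-in, not the real X86FixupVectorConstants.cpp implementation; all names and the table lookup are illustrative:

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for the real FixupConstant: succeed when the fixup
// table contains a matching non-broadcast opcode.
static bool FixupConstant(const std::vector<int> &Fixups,
                          unsigned RegBitWidth, int OpNoBcst) {
  // TODO from the patch: support RegBitWidth (currently unused, as
  // rebuildSplatCst defaults to Constant::getPrimitiveSizeInBits).
  (void)RegBitWidth;
  for (int Op : Fixups)
    if (Op == OpNoBcst)
      return true;
  return false;
}

bool tryFixups(const std::vector<int> &Fixups, int OpNoBcst64, int OpNoBcst32,
               int OpNoBcst16) {
  // One local lambda replaces the three duplicated if-blocks, so the call
  // sites only differ in the opcode they pass.
  auto TryFixup = [&](int OpNoBcst) {
    return FixupConstant(Fixups, /*RegBitWidth=*/0, OpNoBcst);
  };
  return TryFixup(OpNoBcst64) || TryFixup(OpNoBcst32) || TryFixup(OpNoBcst16);
}
```

The capture-by-reference lambda keeps the shared `Fixups` and `RegBitWidth` arguments in one place, which is presumably the deduplication the reviewer is after.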

@RKSimon (Collaborator, Author) commented Jun 21, 2024

Long-term WIP patch while I (slowly) address the various regressions it exposes - don't bother reviewing for now :)
