
[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

Draft · wants to merge 1 commit into base: main from perf/broadcast-avx512
Conversation

@RKSimon (Collaborator) commented Nov 27, 2023

On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.
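The tradeoff can be sketched in x86 asm (illustrative only; labels and constants are made up, not taken from the patch). Keeping the constant full width lets the load fold straight into the consuming instruction, while lowering it to a broadcast early materializes it in a register first:

```asm
# Full-width constant: the memory operand folds into the op, no extra register.
vpaddd .LCPI0_0(%rip), %ymm0, %ymm0

# Constant lowered to a broadcast early in the DAG: the value is materialized
# in ymm1 first, which raises register pressure (and is what MachineLICM then
# hoists and potentially spills in hot loops).
vpbroadcastd .LCPI0_1(%rip), %ymm1
vpaddd %ymm1, %ymm0, %ymm0
```

If the full-width constant load survives to X86FixupVectorConstantsPass, that pass can still rewrite it as a broadcast when profitable.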

This is still a WIP patch - as can be seen from the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether to just continue manually extending these tables or to wait for #66360 to be resolved.

Non-VLX AVX512 targets are still seeing some regressions due to many instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now let's keep them as they are (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is a major problem in larger vectorized hot loops. I'm currently thinking we should add an x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?

My goal is to improve AVX1/AVX2 vector constant handling, but getting AVX512 out of the way appears to be an easier first step.

github-actions bot commented Nov 27, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@RKSimon force-pushed the perf/broadcast-avx512 branch 8 times, most recently from 21fad09 to dae6506 on November 30, 2023 13:37
RKSimon added a commit that referenced this pull request Nov 30, 2023
@RKSimon force-pushed the perf/broadcast-avx512 branch 5 times, most recently from fc410b2 to c5db884 on December 7, 2023 14:22
@RKSimon (Collaborator, Author) commented Dec 7, 2023

Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors

@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
; AVX512F-NEXT: vpshufb %ymm0, %ymm1, %ymm2
; AVX512F-NEXT: vmovdqa 64(%rdi), %ymm3
; AVX512F-NEXT: vpshufb %ymm0, %ymm3, %ymm0
; AVX512F-NEXT: vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
; AVX512F-NEXT: vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]
Contributor

New broadcast?

Collaborator Author

Yes - still addressing regressions; that's why it's still a draft :)

@goldsteinn (Contributor) commented:
Can't memory ops microfuse on all targets? Why is this avx512vl only?

@RKSimon force-pushed the perf/broadcast-avx512 branch from c5db884 to 4fedfe0 on December 8, 2023 11:17
RKSimon added a commit that referenced this pull request Dec 8, 2023
We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary

Noticed in #73509
@RKSimon (Collaborator, Author) commented Dec 8, 2023

> Can't memory ops microfuse on all targets? Why is this avx512vl only?

I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-folding. By prematurely converting to constant broadcasts in the DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast, which then often spills the full-width broadcasted vector). By keeping the full vector width until X86FixupVectorConstants we avoid a lot of this.

I will eventually be disabling constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets, but there are a lot of regressions still to deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step.

@RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from adc89f0 to 3af0810 on December 8, 2023 13:21
@RKSimon force-pushed the perf/broadcast-avx512 branch 4 times, most recently from 0c80ea8 to 6b2809b on December 20, 2023 15:46
@RKSimon force-pushed the perf/broadcast-avx512 branch from 6b2809b to 5eff513 on January 2, 2024 13:42
RKSimon added a commit that referenced this pull request Jan 3, 2024
…ndi(z,w,c1)) to AVX512BW mask select

Yet another yak shaving regression fix for #73509
@RKSimon force-pushed the perf/broadcast-avx512 branch from 5eff513 to f738150 on January 3, 2024 13:03
@RKSimon force-pushed the perf/broadcast-avx512 branch from f738150 to 6d1519e on February 5, 2024 12:58
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
@RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 86ed907 to 925a8d0 on February 5, 2024 18:09
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
@RKSimon force-pushed the perf/broadcast-avx512 branch from 925a8d0 to 77629d5 on February 28, 2024 10:58
@RKSimon force-pushed the perf/broadcast-avx512 branch from 77629d5 to e8d60f1 on April 8, 2024 11:10
@RKSimon force-pushed the perf/broadcast-avx512 branch from e8d60f1 to 27c0a8a on April 18, 2024 21:14
@RKSimon force-pushed the perf/broadcast-avx512 branch from 27c0a8a to c6e33bf on June 12, 2024 10:46
@RKSimon force-pushed the perf/broadcast-avx512 branch from c6e33bf to 7f72657 on June 19, 2024 12:20
… targets

On AVX512 targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure.

If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now.
@RKSimon force-pushed the perf/broadcast-avx512 branch from 7f72657 to 0a147dc on June 19, 2024 13:16
// TODO: Add support for RegBitWidth, but currently rebuildSplatCst
// doesn't require it (defaults to Constant::getPrimitiveSizeInBits).
if (FixupConstant(Fixups, 0, OpNoBcst64))
return true;
Contributor

Maybe turn the inside of the condition into a lambda to avoid the 3x duplication?
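A minimal sketch of what the suggestion could look like, assuming the pass body repeats `if (FixupConstant(Fixups, 0, OpNoBcstN)) return true;` once per broadcast width. `FixupConstant` here is a stand-in, not the real X86FixupVectorConstants.cpp implementation; all names and the table lookup are illustrative:

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for the real FixupConstant: succeed when the fixup
// table contains a matching non-broadcast opcode.
static bool FixupConstant(const std::vector<int> &Fixups,
                          unsigned RegBitWidth, int OpNoBcst) {
  // TODO from the patch: support RegBitWidth (currently unused, as
  // rebuildSplatCst defaults to Constant::getPrimitiveSizeInBits).
  (void)RegBitWidth;
  for (int Op : Fixups)
    if (Op == OpNoBcst)
      return true;
  return false;
}

bool tryFixups(const std::vector<int> &Fixups, int OpNoBcst64, int OpNoBcst32,
               int OpNoBcst16) {
  // One local lambda replaces the three duplicated if-blocks, so the call
  // sites only differ in the opcode they pass.
  auto TryFixup = [&](int OpNoBcst) {
    return FixupConstant(Fixups, /*RegBitWidth=*/0, OpNoBcst);
  };
  return TryFixup(OpNoBcst64) || TryFixup(OpNoBcst32) || TryFixup(OpNoBcst16);
}
```

The capture-by-reference lambda keeps the shared `Fixups` and `RegBitWidth` arguments in one place, which is presumably the deduplication the reviewer is after.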

@RKSimon (Collaborator, Author) commented Jun 21, 2024

Long-term WIP patch while I (slowly) address the various regressions it exposes - don't bother reviewing for now :)
