[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509
✅ With the latest revision this PR passed the C/C++ code formatter.
Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors.
@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
; AVX512F-NEXT: vpshufb %ymm0, %ymm1, %ymm2
; AVX512F-NEXT: vmovdqa 64(%rdi), %ymm3
; AVX512F-NEXT: vpshufb %ymm0, %ymm3, %ymm0
; AVX512F-NEXT: vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
; AVX512F-NEXT: vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]
New broadcast?
Yes - still addressing regressions, that's why it's still a draft :)
Can't memory ops microfuse on all targets? Why is this AVX512VL only?
We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary. Noticed in #73509.
I don't understand what you're asking: we already load-fold for all targets. This patch is about improving broadcast-load-folding. By prematurely converting to constant broadcasts in the DAG we're hindering later optimizations. MachineLICM is a good example: we end up hoisting the broadcast, which then often spills the full-width broadcasted vector. By keeping constants at full vector width until X86FixupVectorConstants we avoid a lot of this. I will eventually disable constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets, but there are still a lot of regressions to deal with; AVX512VL (and AVX512F for 512-bit vectors) is the first step.
…ndi(z,w,c1)) to AVX512BW mask select. Yet another yak-shaving regression fix for #73509.
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register. This will more significantly help improve 'fixup constant' comments from #73509.
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register. This will more significantly help improve 'fixup constant' comments from #73509.
… targets. On AVX512 targets we're better off keeping constant vectors at full width to ensure that they can be load-folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert it to a broadcast instruction for us. Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now.
// TODO: Add support for RegBitWidth, but currently rebuildSplatCst
// doesn't require it (defaults to Constant::getPrimitiveSizeInBits).
if (FixupConstant(Fixups, 0, OpNoBcst64))
  return true;
Maybe make the inside of the condition a lambda to avoid the 3x duplication?
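The kind of refactor being suggested could look roughly like the sketch below. It is only an illustration: `fixupConstant`, `tryFoldedBroadcasts`, and the `Op16`/`Op32`/`Op64` parameters are hypothetical stand-ins, not the helpers or opcodes used in the patch. The point is that the shared check lives in one lambda that is called once per candidate instead of being copied three times.

```cpp
#include <cassert>

// Hypothetical stand-in for the pass's fixup helper: pretend a fixup succeeds
// only when the candidate opcode is non-zero and the operand index is valid.
static bool fixupConstant(unsigned OperandNo, unsigned NumOperands,
                          unsigned CandidateOpcode) {
  assert(NumOperands > 0 && "instruction must have operands");
  return CandidateOpcode != 0 && OperandNo < NumOperands;
}

// Tries each candidate opcode in order. The shared logic is written once in
// a lambda rather than duplicated per candidate.
bool tryFoldedBroadcasts(unsigned OperandNo, unsigned NumOperands,
                         unsigned Op16, unsigned Op32, unsigned Op64) {
  auto TryFixup = [&](unsigned Opcode) {
    // Opcode 0 is used here to mean "no broadcast variant for this width".
    return Opcode != 0 && fixupConstant(OperandNo, NumOperands, Opcode);
  };
  return TryFixup(Op16) || TryFixup(Op32) || TryFixup(Op64);
}
```

Because `||` short-circuits, the first candidate that succeeds ends the search, which matches the early `return true` shape of the quoted snippet.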
Long term WIP patch while I (slowly) address the various regressions it exposes - don't bother reviewing for now :)
On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.
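To make the "convert a full-width load to a broadcast" idea concrete, here is a minimal standalone sketch of the underlying check: given the raw bytes of a constant vector, find the narrowest power-of-two width at which the bytes repeat. This is not the LLVM implementation (it does not reproduce rebuildSplatCst or any X86FixupVectorConstants logic), and `smallestSplatBits` is a name made up for the example. A real pass would also have to check that a broadcast instruction exists for that width on the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the narrowest power-of-two element width, in bits, such that the
// whole constant is that element repeated. If no narrower repeat exists,
// returns the full width of the constant in bits.
unsigned smallestSplatBits(const std::vector<std::uint8_t> &Bytes) {
  assert(!Bytes.empty() && "constant must have at least one byte");
  const std::size_t N = Bytes.size();
  for (std::size_t Width = 1; Width < N; Width *= 2) {
    if (N % Width != 0)
      continue;
    // The constant is periodic with period Width iff every byte equals the
    // byte Width positions before it.
    bool Repeats = true;
    for (std::size_t I = Width; I < N && Repeats; ++I)
      Repeats = Bytes[I] == Bytes[I - Width];
    if (Repeats)
      return static_cast<unsigned>(Width * 8);
  }
  return static_cast<unsigned>(N * 8);
}
```

For the 256-bit constant `[0,4,8,12,0,4,8,12]` of 32-bit elements from the test diff above, the bytes only repeat at a 16-byte period, so the sketch reports 128 bits, which lines up with the `vbroadcasti128` seen in the updated check line.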
This is still a WIP patch - as can be seen by the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether to just continue manually extending these tables or to wait for #66360 to be done.
Non-VLX AVX512 targets are still seeing some regressions due to main instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now let's keep them as they are (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is a major problem in larger vectorized hot loops. I'm currently thinking we should add an X86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely. Any thoughts?
My goal is to improve AVX1/AVX2 vector constant handling, but getting AVX512 out of the way appears to be an easier first step.