[VPlan] Support VPWidenIntOrFpInductionRecipes with EVL tail folding #144666
Conversation
Following on from llvm#118638, this handles widened induction variables with EVL tail folding by setting the VF operand to be EVL, calculated in the vector body. We need to do this for correctness since with EVL tail folding the number of elements processed in the penultimate iteration may not be VF, but the runtime EVL, and we need to increment induction variables accordingly.

- Because the VF may now not be a live-in, we need to move the builder to just after its definition.
- We also need to avoid truncating it when it's the same size as the step type; previously this wasn't a problem for live-ins.
- Because the VF may be smaller than the IV type, since the EVL is always i32, we may need to zext it.

On -march=rva23u64 -O3 we get 87.1% more loops vectorized on TSVC, and 42.8% more loops vectorized on SPEC CPU 2017.
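To make the correctness issue concrete, here is a minimal scalar model of an EVL tail-folded loop. It is an illustration of the semantics only, not code from this patch; the names `vf`, `evl`, and `evl_folded` are invented for the example, and the `std::min` stands in for `@llvm.experimental.get.vector.length`, which on real hardware may return even fewer elements while more than VF remain.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of EVL tail folding: each iteration processes `evl` elements,
// which may be fewer than VF, so the widened induction variable must be
// advanced by the runtime EVL rather than by a fixed VF.
void evl_folded(int64_t *a, int64_t n, int64_t vf) {
  int64_t iv = 0;
  while (iv < n) {
    int64_t evl = std::min(vf, n - iv); // runtime element count this iteration
    for (int64_t lane = 0; lane < evl; ++lane)
      a[iv + lane] = iv + lane;         // per-lane value of the widened IV
    iv += evl;                          // increment by EVL, not VF
  }
}
```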
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

Following on from #118638, this handles widened induction variables with EVL tail folding by setting the VF operand to be EVL, calculated in the vector body. We need to do this for correctness since with EVL tail folding the number of elements processed in the penultimate iteration may not be VF, but the runtime EVL, and we need to increment induction variables accordingly.
On -march=rva23u64 -O3 we get 87.1% more loops vectorized on TSVC, and 42.8% more loops vectorized on SPEC CPU 2017.

Patch is 74.00 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/144666.diff

11 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index f3306ad7cb8ec..d3c96dbb0cc60 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2006,6 +2006,7 @@ class VPWidenIntOrFpInductionRecipe : public VPWidenInductionRecipe {
VPValue *getVFValue() { return getOperand(2); }
const VPValue *getVFValue() const { return getOperand(2); }
+ void setVFValue(VPValue *New) { return setOperand(2, New); }
VPValue *getSplatVFValue() {
// If the recipe has been unrolled return the VPValue for the induction
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 11f0f2a930329..edffbdd6ed1fd 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2196,6 +2196,8 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
for (VPUser *U : to_vector(Plan.getVF().users())) {
if (auto *R = dyn_cast<VPVectorEndPointerRecipe>(U))
R->setOperand(1, &EVL);
+ if (auto *R = dyn_cast<VPWidenIntOrFpInductionRecipe>(U))
+ R->setVFValue(&EVL);
}
SmallVector<VPRecipeBase *> ToErase;
@@ -2277,11 +2279,10 @@ bool VPlanTransforms::tryAddExplicitVectorLength(
VPlan &Plan, const std::optional<unsigned> &MaxSafeElements) {
VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock();
// The transform updates all users of inductions to work based on EVL, instead
- // of the VF directly. At the moment, widened inductions cannot be updated, so
- // bail out if the plan contains any.
- bool ContainsWidenInductions = any_of(
- Header->phis(),
- IsaPred<VPWidenIntOrFpInductionRecipe, VPWidenPointerInductionRecipe>);
+ // of the VF directly. At the moment, widened pointer inductions cannot be
+ // updated, so bail out if the plan contains any.
+ bool ContainsWidenInductions =
+ any_of(Header->phis(), IsaPred<VPWidenPointerInductionRecipe>);
if (ContainsWidenInductions)
return false;
@@ -2604,14 +2605,19 @@ expandVPWidenIntOrFpInduction(VPWidenIntOrFpInductionRecipe *WidenIVR,
Inc = SplatVF;
Prev = WidenIVR->getLastUnrolledPartOperand();
} else {
+ if (VPRecipeBase *R = VF->getDefiningRecipe())
+ Builder.setInsertPoint(R->getParent(), std::next(R->getIterator()));
+ Type *VFTy = TypeInfo.inferScalarType(VF);
// Multiply the vectorization factor by the step using integer or
// floating-point arithmetic as appropriate.
if (StepTy->isFloatingPointTy())
VF = Builder.createScalarCast(Instruction::CastOps::UIToFP, VF, StepTy,
DL);
- else
+ else if (VFTy->getScalarSizeInBits() > StepTy->getScalarSizeInBits())
VF =
Builder.createScalarCast(Instruction::CastOps::Trunc, VF, StepTy, DL);
+ else if (VFTy->getScalarSizeInBits() < StepTy->getScalarSizeInBits())
+ VF = Builder.createScalarCast(Instruction::CastOps::ZExt, VF, StepTy, DL);
Inc = Builder.createNaryOp(MulOp, {Step, VF}, Flags);
Inc = Builder.createNaryOp(VPInstruction::Broadcast, Inc);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index fba4a68f4a27b..2342c540970e2 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -156,7 +156,8 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const {
.Case<VPWidenIntrinsicRecipe>([&](const VPWidenIntrinsicRecipe *S) {
return VerifyEVLUse(*S, S->getNumOperands() - 1);
})
- .Case<VPWidenStoreEVLRecipe, VPReductionEVLRecipe>(
+ .Case<VPWidenStoreEVLRecipe, VPReductionEVLRecipe,
+ VPWidenIntOrFpInductionRecipe>(
[&](const VPRecipeBase *S) { return VerifyEVLUse(*S, 2); })
.Case<VPWidenLoadEVLRecipe, VPVectorEndPointerRecipe>(
[&](const VPRecipeBase *R) { return VerifyEVLUse(*R, 1); })
@@ -165,18 +166,30 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const {
.Case<VPInstruction>([&](const VPInstruction *I) {
if (I->getOpcode() == Instruction::PHI)
return VerifyEVLUse(*I, 1);
- if (I->getOpcode() != Instruction::Add) {
- errs() << "EVL is used as an operand in non-VPInstruction::Add\n";
+ switch (I->getOpcode()) {
+ case Instruction::Add:
+ break;
+ case Instruction::UIToFP:
+ case Instruction::Trunc:
+ case Instruction::ZExt:
+ case Instruction::Mul:
+ case Instruction::FMul:
+ if (!VerifyLate) {
+ errs() << "EVL used by unexpected VPInstruction\n";
+ return false;
+ }
+ break;
+ default:
+ errs() << "EVL used by unexpected VPInstruction\n";
return false;
}
if (I->getNumUsers() != 1) {
- errs() << "EVL is used in VPInstruction:Add with multiple "
- "users\n";
+ errs() << "EVL is used in VPInstruction with multiple users\n";
return false;
}
if (!VerifyLate && !isa<VPEVLBasedIVPHIRecipe>(*I->users().begin())) {
- errs() << "Result of VPInstruction::Add with EVL operand is "
- "not used by VPEVLBasedIVPHIRecipe\n";
+ errs() << "Result of VPInstruction with EVL operand is not used by "
+ "VPEVLBasedIVPHIRecipe\n";
return false;
}
return true;
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll b/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll
index e40f51fd7bd70..3281c7f04fa7a 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/evl-compatible-loops.ll
@@ -8,14 +8,55 @@ define void @test_wide_integer_induction(ptr noalias %a, i64 %N) {
; CHECK-LABEL: define void @test_wide_integer_induction(
; CHECK-SAME: ptr noalias [[A:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 2
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
+; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[ENTRY:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 2
+; CHECK-NEXT: [[TMP6:%.*]] = sub i64 [[TMP5]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP6]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 2
+; CHECK-NEXT: [[TMP9:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT: [[TMP10:%.*]] = mul <vscale x 2 x i64> [[TMP9]], splat (i64 1)
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP10]]
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[INDEX_EVL_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 2 x i64> [ [[INDUCTION]], [[ENTRY]] ], [ [[VEC_IND_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[AVL:%.*]] = sub i64 [[N]], [[EVL_BASED_IV]]
+; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 2, i1 true)
+; CHECK-NEXT: [[TMP12:%.*]] = zext i32 [[TMP11]] to i64
+; CHECK-NEXT: [[TMP13:%.*]] = mul i64 1, [[TMP12]]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP13]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[EVL_BASED_IV]]
+; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i64, ptr [[TMP14]], i32 0
+; CHECK-NEXT: call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> [[VEC_IND]], ptr align 8 [[TMP15]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP11]])
+; CHECK-NEXT: [[TMP16:%.*]] = zext i32 [[TMP11]] to i64
+; CHECK-NEXT: [[INDEX_EVL_NEXT]] = add i64 [[TMP16]], [[EVL_BASED_IV]]
+; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], [[TMP8]]
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 2 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[IV_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[FOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 0, [[ENTRY1:%.*]] ]
+; CHECK-NEXT: br label [[FOR_BODY1:%.*]]
; CHECK: for.body:
-; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]
-; CHECK-NEXT: store i64 [[IV]], ptr [[ARRAYIDX]], align 8
-; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+; CHECK-NEXT: [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT1:%.*]], [[FOR_BODY1]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV1]]
+; CHECK-NEXT: store i64 [[IV1]], ptr [[ARRAYIDX]], align 8
+; CHECK-NEXT: [[IV_NEXT1]] = add nuw nsw i64 [[IV1]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT1]], [[N]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY1]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
;
@@ -68,3 +109,10 @@ for.body:
for.cond.cleanup:
ret void
}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]], [[META3:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.isvectorized.tailfoldingstyle", !"evl"}
+; CHECK: [[META3]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META3]], [[META1]]}
+;.
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
index 79425ae3a67ec..4fd309bbfabbd 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
@@ -108,36 +108,54 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
; PREDICATED_EVL-LABEL: define void @masked_strided_factor2
; PREDICATED_EVL-SAME: (ptr noalias readonly captures(none) [[P:%.*]], ptr noalias captures(none) [[Q:%.*]], i8 zeroext [[GUARD:%.*]]) #[[ATTR0:[0-9]+]] {
; PREDICATED_EVL-NEXT: entry:
+; PREDICATED_EVL-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; PREDICATED_EVL: vector.ph:
; PREDICATED_EVL-NEXT: [[CONV:%.*]] = zext i8 [[GUARD]] to i32
-; PREDICATED_EVL-NEXT: br label [[FOR_BODY:%.*]]
-; PREDICATED_EVL: for.body:
-; PREDICATED_EVL-NEXT: [[IX_024:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
-; PREDICATED_EVL-NEXT: [[CMP1:%.*]] = icmp samesign ugt i32 [[IX_024]], [[CONV]]
-; PREDICATED_EVL-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[FOR_INC]]
-; PREDICATED_EVL: if.then:
-; PREDICATED_EVL-NEXT: [[MUL:%.*]] = shl nuw nsw i32 [[IX_024]], 1
-; PREDICATED_EVL-NEXT: [[TMP0:%.*]] = zext nneg i32 [[MUL]] to i64
-; PREDICATED_EVL-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 [[TMP0]]
-; PREDICATED_EVL-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
-; PREDICATED_EVL-NEXT: [[ADD:%.*]] = or disjoint i32 [[MUL]], 1
-; PREDICATED_EVL-NEXT: [[TMP2:%.*]] = zext nneg i32 [[ADD]] to i64
-; PREDICATED_EVL-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 [[TMP2]]
-; PREDICATED_EVL-NEXT: [[TMP3:%.*]] = load i8, ptr [[ARRAYIDX4]], align 1
-; PREDICATED_EVL-NEXT: [[SPEC_SELECT_I:%.*]] = call i8 @llvm.smax.i8(i8 [[TMP1]], i8 [[TMP3]])
-; PREDICATED_EVL-NEXT: [[TMP4:%.*]] = zext nneg i32 [[MUL]] to i64
-; PREDICATED_EVL-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds nuw i8, ptr [[Q]], i64 [[TMP4]]
-; PREDICATED_EVL-NEXT: store i8 [[SPEC_SELECT_I]], ptr [[ARRAYIDX6]], align 1
-; PREDICATED_EVL-NEXT: [[SUB:%.*]] = sub i8 0, [[SPEC_SELECT_I]]
-; PREDICATED_EVL-NEXT: [[TMP5:%.*]] = zext nneg i32 [[ADD]] to i64
-; PREDICATED_EVL-NEXT: [[ARRAYIDX11:%.*]] = getelementptr inbounds nuw i8, ptr [[Q]], i64 [[TMP5]]
-; PREDICATED_EVL-NEXT: store i8 [[SUB]], ptr [[ARRAYIDX11]], align 1
-; PREDICATED_EVL-NEXT: br label [[FOR_INC]]
-; PREDICATED_EVL: for.inc:
-; PREDICATED_EVL-NEXT: [[INC]] = add nuw nsw i32 [[IX_024]], 1
-; PREDICATED_EVL-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], 1024
-; PREDICATED_EVL-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
-; PREDICATED_EVL: for.end:
-; PREDICATED_EVL-NEXT: ret void
+; PREDICATED_EVL-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
+; PREDICATED_EVL-NEXT: [[TMP1:%.*]] = shl i32 [[TMP0]], 4
+; PREDICATED_EVL-NEXT: [[N_RND_UP:%.*]] = add i32 [[TMP1]], 1023
+; PREDICATED_EVL-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], [[TMP1]]
+; PREDICATED_EVL-NEXT: [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
+; PREDICATED_EVL-NEXT: [[TMP2:%.*]] = call i32 @llvm.vscale.i32()
+; PREDICATED_EVL-NEXT: [[TMP3:%.*]] = shl i32 [[TMP2]], 4
+; PREDICATED_EVL-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[CONV]], i64 0
+; PREDICATED_EVL-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
+; PREDICATED_EVL-NEXT: [[TMP4:%.*]] = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
+; PREDICATED_EVL-NEXT: br label [[VECTOR_BODY:%.*]]
+; PREDICATED_EVL: vector.body:
+; PREDICATED_EVL-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PREDICATED_EVL-NEXT: [[EVL_BASED_IV:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PREDICATED_EVL-NEXT: [[VEC_IND:%.*]] = phi <vscale x 16 x i32> [ [[TMP4]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PREDICATED_EVL-NEXT: [[AVL:%.*]] = sub i32 1024, [[EVL_BASED_IV]]
+; PREDICATED_EVL-NEXT: [[TMP5:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[AVL]], i32 16, i1 true)
+; PREDICATED_EVL-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[TMP5]], i64 0
+; PREDICATED_EVL-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
+; PREDICATED_EVL-NEXT: [[TMP6:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; PREDICATED_EVL-NEXT: [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
+; PREDICATED_EVL-NEXT: [[TMP8:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
+; PREDICATED_EVL-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP8]]
+; PREDICATED_EVL-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.vp.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> align 1 [[TMP9]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
+; PREDICATED_EVL-NEXT: [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
+; PREDICATED_EVL-NEXT: [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
+; PREDICATED_EVL-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
+; PREDICATED_EVL-NEXT: [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.vp.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> align 1 [[TMP12]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
+; PREDICATED_EVL-NEXT: [[TMP13:%.*]] = icmp slt <vscale x 16 x i8> [[WIDE_MASKED_GATHER]], [[WIDE_MASKED_GATHER3]]
+; PREDICATED_EVL-NEXT: [[TMP14:%.*]] = call <vscale x 16 x i8> @llvm.vp.select.nxv16i8(<vscale x 16 x i1> [[TMP13]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER]], i32 [[TMP5]])
+; PREDICATED_EVL-NEXT: [[TMP15:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
+; PREDICATED_EVL-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP15]]
+; PREDICATED_EVL-NEXT: call void @llvm.vp.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP14]], <vscale x 16 x ptr> align 1 [[TMP16]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
+; PREDICATED_EVL-NEXT: [[TMP17:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP14]]
+; PREDICATED_EVL-NEXT: [[TMP18:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
+; PREDICATED_EVL-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP18]]
+; PREDICATED_EVL-NEXT: call void @llvm.vp.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP17]], <vscale x 16 x ptr> align 1 [[TMP19]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
+; PREDICATED_EVL-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP5]], [[EVL_BASED_IV]]
+; PREDICATED_EVL-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP3]]
+; PREDICATED_EVL-NEXT: [[VEC_IND_NEXT]] = add <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
+; PREDICATED_EVL-NEXT: [[TMP20:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; PREDICATED_EVL-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; PREDICATED_EVL: middle.block:
+; PREDICATED_EVL-NEXT: br label [[FOR_END:%.*]]
+; PREDICATED_EVL: scalar.ph:
;
entry:
%conv = zext i8 %guard to i32
@@ -309,52 +327,71 @@ define void @masked_strided_factor4(ptr noalias nocapture readonly %p, ptr noali
; PREDICATED_EVL-LABEL: define void @masked_strided_factor4
; PREDICATED_EVL-SAME: (ptr noalias readonly captures(none) [[P:%.*]], ptr noalias captures(none) [[Q:%.*]], i8 zeroext [[GUARD:%.*]]) #[[ATTR0]] {
; PREDICATED_EVL-NEXT: entry:
+; PREDICATED_EVL-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; PREDICATED_EVL: vector.ph:
; PREDICATED_EVL-NEXT: [[CONV:%.*]] = zext i8 [[GUARD]] to i32
-; PREDICATED_EVL-NEXT: br label [[FOR_BODY:%.*]]
-; PREDICATED_EVL: for.body:
-; PREDICATED_EVL-NEXT: [[IX_024:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
-; PREDICATED_EVL-NEXT: [[CMP1:%.*]] = icmp samesign ugt i32 [[IX_024]], [[CONV]]
-; PREDICATED_EVL-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[FOR_INC]]
-; PREDICATED_EVL: if.then:
-; PREDICATED_EVL-NEXT: [[IDX0:%.*]] = shl nuw nsw i32 [[IX_024]], 2
-; PREDICATED_EVL-NEXT: [[IDX1:%.*]] = or disjoint i32 [[IDX0]], 1
-; PREDICATED_EVL-NEXT: [[IDX2:%.*]] = or disjoint i32 [[IDX0]], 2
-; PREDICATED_EVL-NEXT: [[IDX3:%.*]] = or disjoint i32 [[IDX0]], 3
-; PREDICATED_EVL-NEXT: [[TMP0:%.*]] = zext nneg i32 [[IDX0]] to i64
-; PREDICATED_EVL-NEXT: [[ARRAY1IDX0:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 [[TMP0]]
-; PREDICATED_EVL-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAY1IDX0]], align 1
-; PREDICATED_EVL-NEXT: [[TMP2:%.*]] = zext nneg i32 [[IDX1]] to i64
-; PREDICATED_EVL-NEXT: [[ARRAY1IDX1:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 [[TMP2]]
-; PREDICATED_EVL-NEXT: [[TMP3:%.*]] = load i8, ptr [[ARRAY1IDX1]], align 1
-; PREDICATED_EVL-NEXT: [[TMP4:%.*]] = zext nneg i32 [[IDX2]] to i64
-; PREDICATED_EVL-NEXT...
[truncated]
    else if (VFTy->getScalarSizeInBits() > StepTy->getScalarSizeInBits())
      VF =
          Builder.createScalarCast(Instruction::CastOps::Trunc, VF, StepTy, DL);
    else if (VFTy->getScalarSizeInBits() < StepTy->getScalarSizeInBits())
      VF = Builder.createScalarCast(Instruction::CastOps::ZExt, VF, StepTy, DL);
Could this code be replaced with https://github.com/llvm/llvm-project/pull/144946/files?
Good catch, definitely. I think we may also be able to use it for truncating the Start and Step above?
Done in b3c55e2, but we can't use it for the start + step since we only want to truncate, not extend
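For reference, the cast-direction decision being factored out can be sketched in isolation. The enum and function below are illustrative stand-ins, not the actual VPlan helper's interface; at the IR level this is the same decision `IRBuilder::CreateZExtOrTrunc` makes.

```cpp
#include <cstdint>

// Sketch of the width-adjustment rule: the EVL is always i32, while the
// induction step type may be narrower or wider, so the cast direction is
// picked by comparing the two scalar bit widths.
enum class Cast { None, Trunc, ZExt };

Cast pickCast(uint32_t vfBits, uint32_t stepBits) {
  if (vfBits > stepBits)
    return Cast::Trunc; // EVL wider than the step type: truncate
  if (vfBits < stepBits)
    return Cast::ZExt;  // EVL narrower (e.g. i32 EVL, i64 step): zero-extend
  return Cast::None;    // same width: no cast needed
}
```

For example, `pickCast(32, 64)` yields `Cast::ZExt`, covering the common case of an i32 EVL feeding an i64 step type.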
Sorry, I accidentally clicked 'Request Changes'. Please ignore it.
Hmm, it seems that it cannot be changed :(
There's an option to dismiss stale reviews hidden away somewhere; I've accidentally done the same thing before :) https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/dismissing-a-pull-request-review
Following on from #118638, this handles widened induction variables with EVL tail folding by setting the VF operand to be EVL, calculated in the vector body.
We need to do this for correctness since with EVL tail folding the number of elements processed in the penultimate iteration may not be VF, but the runtime EVL, and we need to take this into account when updating the backedge value.
On -march=rva23u64 -O3 we get 87.1% more loops vectorized on TSVC, and 42.8% more loops vectorized on SPEC CPU 2017