[LV] Vectorize Epilogues for loops with small VF but high IC #108190
Conversation
juliannagele
commented
Sep 11, 2024
- Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF
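The first bullet can be sketched as follows. This is an illustrative reconstruction, not the patch's exact code: the names are simplified, and the threshold default of 16 corresponds to LLVM's `EpilogueVectorizationMinVF` option around the time of this PR. The multiplier is the interleave count IC for fixed-width VFs, or the vscale-for-tuning estimate for scalable VFs.

```cpp
#include <cassert>

// Hedged sketch: epilogue vectorization is considered profitable when the
// main loop covers at least MinVFThreshold scalar iterations per vector
// iteration, i.e. Multiplier * VF >= threshold. Before this patch the
// fixed-width multiplier was effectively 1; the patch passes IC instead.
bool isEpilogueProfitableSketch(unsigned KnownMinVF, unsigned Multiplier,
                                unsigned MinVFThreshold = 16) {
  return Multiplier * KnownMinVF >= MinVFThreshold;
}
```

With VF = 4 and IC = 4 the check now passes (4 * 4 >= 16), which is exactly the "small VF but high IC" case in the title; with the old fixed-width multiplier of 1 it would not.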
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-transforms Author: Julian Nagele (juliannagele) Changes
Patch is 95.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108190.diff 8 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index b5f87e458833d6..04a50ed3a9c594 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -471,6 +471,10 @@ class LoopVectorizationPlanner {
bool isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B) const;
+ bool isMoreProfitable(const VectorizationFactor &A,
+ const VectorizationFactor &B,
+ const unsigned MaxTripCount) const;
+
/// Determines if we have the infrastructure to vectorize the loop and its
/// epilogue, assuming the main loop is vectorized by \p VF.
bool isCandidateForEpilogueVectorization(const ElementCount VF) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index b821da03c16e94..7450cfe59a28da 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1537,7 +1537,8 @@ class LoopVectorizationCostModel {
/// Returns true if epilogue vectorization is considered profitable, and
/// false otherwise.
/// \p VF is the vectorization factor chosen for the original loop.
- bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
+ bool isEpilogueVectorizationProfitable(const ElementCount VF,
+ const unsigned Multiplier) const;
/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.
@@ -4282,12 +4283,11 @@ getVScaleForTuning(const Loop *L, const TargetTransformInfo &TTI) {
}
bool LoopVectorizationPlanner::isMoreProfitable(
- const VectorizationFactor &A, const VectorizationFactor &B) const {
+ const VectorizationFactor &A, const VectorizationFactor &B,
+ const unsigned MaxTripCount) const {
InstructionCost CostA = A.Cost;
InstructionCost CostB = B.Cost;
- unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
-
// Improve estimate for the vector width if it is scalable.
unsigned EstimatedWidthA = A.Width.getKnownMinValue();
unsigned EstimatedWidthB = B.Width.getKnownMinValue();
@@ -4336,6 +4336,13 @@ bool LoopVectorizationPlanner::isMoreProfitable(
return CmpFn(RTCostA, RTCostB);
}
+bool LoopVectorizationPlanner::isMoreProfitable(
+ const VectorizationFactor &A, const VectorizationFactor &B) const {
+ const unsigned MaxTripCount =
+ PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
+ return LoopVectorizationPlanner::isMoreProfitable(A, B, MaxTripCount);
+}
+
void LoopVectorizationPlanner::emitInvalidCostRemarks(
OptimizationRemarkEmitter *ORE) {
using RecipeVFPair = std::pair<VPRecipeBase *, ElementCount>;
@@ -4648,7 +4655,7 @@ bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
}
bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
- const ElementCount VF) const {
+ const ElementCount VF, const unsigned Multiplier) const {
// FIXME: We need a much better cost-model to take different parameters such
// as register pressure, code size increase and cost of extra branches into
// account. For now we apply a very crude heuristic and only consider loops
@@ -4663,9 +4670,6 @@ bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
if (TTI.getMaxInterleaveFactor(VF) <= 1)
return false;
- unsigned Multiplier = 1;
- if (VF.isScalable())
- Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
return true;
return false;
@@ -4711,7 +4715,11 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
return Result;
}
- if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
+ unsigned Multiplier = IC;
+ if (MainLoopVF.isScalable())
+ Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
+
+ if (!CM.isEpilogueVectorizationProfitable(MainLoopVF, Multiplier)) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
"this loop\n");
return Result;
@@ -4730,16 +4738,20 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
ScalarEvolution &SE = *PSE.getSE();
Type *TCType = Legal->getWidestInductionType();
const SCEV *RemainingIterations = nullptr;
+ unsigned MaxTripCount = 0;
for (auto &NextVF : ProfitableVFs) {
// Skip candidate VFs without a corresponding VPlan.
if (!hasPlanWithVF(NextVF.Width))
continue;
- // Skip candidate VFs with widths >= the estimate runtime VF (scalable
- // vectors) or the VF of the main loop (fixed vectors).
+ // Skip candidate VFs with widths >= the (estimated) runtime VF (scalable
+ // vectors) or > the VF of the main loop (fixed vectors).
if ((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
ElementCount::isKnownGE(NextVF.Width, EstimatedRuntimeVF)) ||
- ElementCount::isKnownGE(NextVF.Width, MainLoopVF))
+ (NextVF.Width.isScalable() &&
+ ElementCount::isKnownGE(NextVF.Width, MainLoopVF)) ||
+ (!NextVF.Width.isScalable() && !MainLoopVF.isScalable() &&
+ ElementCount::isKnownGT(NextVF.Width, MainLoopVF)))
continue;
// If NextVF is greater than the number of remaining iterations, the
@@ -4750,6 +4762,14 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
const SCEV *TC = createTripCountSCEV(TCType, PSE, OrigLoop);
RemainingIterations = SE.getURemExpr(
TC, SE.getConstant(TCType, MainLoopVF.getKnownMinValue() * IC));
+ const APInt MaxRemainingIterations =
+ SE.getUnsignedRangeMax(RemainingIterations);
+ // Guard against huge trip counts.
+ if (MaxRemainingIterations.getActiveBits() <= 32) {
+ MaxTripCount = MaxRemainingIterations.getZExtValue();
+ LLVM_DEBUG(dbgs() << "LEV: Maximum Trip Count for Epilogue: "
+ << MaxTripCount << "\n");
+ }
}
if (SE.isKnownPredicate(
CmpInst::ICMP_UGT,
@@ -4758,7 +4778,8 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
continue;
}
- if (Result.Width.isScalar() || isMoreProfitable(NextVF, Result))
+ if (Result.Width.isScalar() ||
+ isMoreProfitable(NextVF, Result, MaxTripCount))
Result = NextVF;
}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
index 6953d6c48694c2..283a928d0d4884 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
@@ -16,7 +16,7 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
; CHECK-NEXT: br i1 [[CMP_28]], label [[FOR_COND_CLEANUP:%.*]], label [[ITER_CHECK:%.*]]
; CHECK: iter.check:
; CHECK-NEXT: [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i32 [[N]], 16
@@ -50,33 +50,33 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; CHECK: vec.epilog.iter.check:
-; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 8
-; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK_NOT_NOT:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
-; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK_NOT_NOT]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 12
+; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
+; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK: vec.epilog.ph:
; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; CHECK-NEXT: [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967288
+; CHECK-NEXT: [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967292
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX7:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX7]]
-; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i8>, ptr [[TMP14]], align 1
+; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <4 x i8>, ptr [[TMP14]], align 1
; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX7]]
-; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1
-; CHECK-NEXT: [[TMP16:%.*]] = zext <8 x i8> [[WIDE_LOAD9]] to <8 x i16>
-; CHECK-NEXT: [[TMP17:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>
-; CHECK-NEXT: [[TMP18:%.*]] = mul nuw <8 x i16> [[TMP16]], [[TMP17]]
-; CHECK-NEXT: [[TMP19:%.*]] = lshr <8 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
-; CHECK-NEXT: [[TMP20:%.*]] = trunc nuw <8 x i16> [[TMP19]] to <8 x i8>
-; CHECK-NEXT: store <8 x i8> [[TMP20]], ptr [[TMP15]], align 1
+; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <4 x i8>, ptr [[TMP15]], align 1
+; CHECK-NEXT: [[TMP16:%.*]] = zext <4 x i8> [[WIDE_LOAD9]] to <4 x i16>
+; CHECK-NEXT: [[TMP17:%.*]] = zext <4 x i8> [[WIDE_LOAD8]] to <4 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = mul nuw <4 x i16> [[TMP16]], [[TMP17]]
+; CHECK-NEXT: [[TMP19:%.*]] = lshr <4 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT: [[TMP20:%.*]] = trunc nuw <4 x i16> [[TMP19]] to <4 x i8>
+; CHECK-NEXT: store <4 x i8> [[TMP20]], ptr [[TMP15]], align 1
; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX7]]
-; CHECK-NEXT: [[WIDE_LOAD10:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
-; CHECK-NEXT: [[TMP22:%.*]] = zext <8 x i8> [[WIDE_LOAD10]] to <8 x i16>
-; CHECK-NEXT: [[TMP23:%.*]] = mul nuw <8 x i16> [[TMP22]], [[TMP17]]
-; CHECK-NEXT: [[TMP24:%.*]] = lshr <8 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
-; CHECK-NEXT: [[TMP25:%.*]] = trunc nuw <8 x i16> [[TMP24]] to <8 x i8>
-; CHECK-NEXT: store <8 x i8> [[TMP25]], ptr [[TMP21]], align 1
-; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 8
+; CHECK-NEXT: [[WIDE_LOAD10:%.*]] = load <4 x i8>, ptr [[TMP21]], align 1
+; CHECK-NEXT: [[TMP22:%.*]] = zext <4 x i8> [[WIDE_LOAD10]] to <4 x i16>
+; CHECK-NEXT: [[TMP23:%.*]] = mul nuw <4 x i16> [[TMP22]], [[TMP17]]
+; CHECK-NEXT: [[TMP24:%.*]] = lshr <4 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT: [[TMP25:%.*]] = trunc nuw <4 x i16> [[TMP24]] to <4 x i8>
+; CHECK-NEXT: store <4 x i8> [[TMP25]], ptr [[TMP21]], align 1
+; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 4
; CHECK-NEXT: [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC5]]
; CHECK-NEXT: br i1 [[TMP26]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: vec.epilog.middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
new file mode 100644
index 00000000000000..cda2d19521437f
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
@@ -0,0 +1,413 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S < %s -passes=loop-vectorize -force-vector-interleave=4 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64"
+
+define void @add_i8(ptr noalias nocapture noundef writeonly %A, ptr nocapture noundef readonly %B, ptr nocapture noundef readonly %C, i64 noundef %Iterations) {
+; CHECK-LABEL: @add_i8(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[ITERATIONS:%.*]], 0
+; CHECK-NEXT: br i1 [[CMP11]], label [[ITER_CHECK:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK: iter.check:
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[ITERATIONS]], 8
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
+; CHECK: vector.main.loop.iter.check:
+; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[ITERATIONS]], 64
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[ITERATIONS]], 64
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[ITERATIONS]], [[N_MOD_VF]]
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 16
+; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 32
+; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 48
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[B:%.*]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP1]]
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP2]]
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP3]]
+; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 16
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 32
+; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 48
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP8]], align 1
+; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <16 x i8>, ptr [[TMP9]], align 1
+; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT: [[WIDE_LOAD4:%.*]] = load <16 x i8>, ptr [[TMP11]], align 1
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[C:%.*]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP1]]
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP2]]
+; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP3]]
+; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 0
+; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 16
+; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 32
+; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 48
+; CHECK-NEXT: [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP16]], align 1
+; CHECK-NEXT: [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP17]], align 1
+; CHECK-NEXT: [[WIDE_LOAD7:%.*]] = load <16 x i8>, ptr [[TMP18]], align 1
+; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <16 x i8>, ptr [[TMP19]], align 1
+; CHECK-NEXT: [[TMP20:%.*]] = add <16 x i8> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-NEXT: [[TMP21:%.*]] = add <16 x i8> [[WIDE_LOAD6]], [[WIDE_LOAD2]]
+; CHECK-NEXT: [[TMP22:%.*]] = add <16 x i8> [[WIDE_LOAD7]], [[WIDE_LOAD3]]
+; CHECK-NEXT: [[TMP23:%.*]] = add <16 x i8> [[WIDE_LOAD8]], [[WIDE_LOAD4]]
+; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[A:%.*]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP25:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP2]]
+; CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP3]]
+; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 0
+; CHECK-NEXT: [[TMP29:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 16
+; CHECK-NEXT: [[TMP30:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 32
+; CHECK-NEXT: [[TMP31:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 48
+; CHECK-NEXT: store <16 x i8> [[TMP20]], ptr [[TMP28]], align 1
+; CHECK-NEXT: store <16 x i8> [[TMP21]], ptr [[TMP29]], align 1
+; CHECK-NEXT: store <16 x i8> [[TMP22]], ptr [[TMP30]], align 1
+; CHECK-NEXT: store <16 x i8> [[TMP23]], ptr [[TMP31]], align 1
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 64
+; CHECK-NEXT: [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP32]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[ITERATIONS]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
+; CHECK: vec.epilog.iter.check:
+; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[ITERATIONS]], [[N_VEC]]
+; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
+; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK: vec.epilog.ph:
+; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; CHECK-NEXT: [[N_MOD_VF9:%.*]] = urem i64 [[ITERATIONS]], 8
+; CHECK-NEXT: [[N_VEC10:%.*]] = sub i64 [[ITERATIONS]], [[N_MOD_VF9]]
+; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
+; CHECK: vec.epilog.vector.body:
+; CHECK-NEXT: [[INDEX12:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT15:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP33:%.*]] = add i64 [[INDEX12]], 0
+; CHECK-NEXT: [[TMP34:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP33]]
+; CHECK-NEXT: [[TMP35:%.*]] = getelementptr inbounds i8, ptr [[TMP34]], i32 0
+; CHECK-NEXT: [[WIDE_LOAD13:%.*]] = load <8 x i8>, ptr [[TMP35]], align 1
+; CHECK-NEXT: [[TMP36:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP33]]
+; CHECK-NEXT: [[TMP37:%.*]] = getelementptr inbounds i8, ptr [[TMP36]], i32 0
+; CHECK-NEXT: [[WIDE_LOAD14:%.*]] = load <8 x i8>, ptr [[TMP37]], align 1
+; CHECK-NEXT: [[TMP38:%.*]] = add <8 x i8> [[WIDE_LOAD14]], [[WIDE_LOAD13]]
+; CHECK-NEXT: [[TMP39:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP33]]
+; CHECK-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[TMP39]], i32 0
+; CHECK-NEXT: store <8 x i8> [[TMP38]], ptr [[TMP40]], align 1
+; CHECK-NEXT: [[INDEX_NEXT15]] = add nuw i64 [[INDEX12]], 8
+; CHECK-NEXT: [[TMP41:%.*]] = icmp eq i64 [[INDEX_NEXT15]], [[N_VEC10]]
+; CHECK-NEXT: br i1 [[TMP41]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK: vec.epilog.middle.block:
+; CHECK-NEXT: [[CMP_N11:%.*]] = icmp eq i64 [[ITERATIONS]], [[N_VEC10]]
+; CHECK-NEXT: br i1 [[CMP_N11]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
+; CHECK: vec.epilog.scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC10]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.cond.cleanup.loopexit:
+; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
+; CHECK: for.cond.cleanup:
+; CHECK-NEXT: ret void
+; CHECK: for.body:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP42...
[truncated]
Thanks again for working on this.
Hi - A high level comment - the MaxTripCount in isMoreProfitable is expected to be the constant trip count (as there is generally expected to be a single trip count if it is constant).
Should the epilog vectorization cost be based on the maximum epilog trip count (VF*IC - 1) or the average trip count?
Hi, thanks for having a look! Not sure I fully understand the question; from the existing code and name my reading was that it should be the maximum trip count, since for the main loop it's using getSmallConstantMaxTripCount.
I first did a quick performance run for 527.cam4_r from SPEC FP. I was hoping it would avoid the regression that is a result of my extra interleaving patch (#100385), but unfortunately this patch didn't make a difference. I.e., with the extra interleaving for the V2, this patch didn't manage to bring back the lost perf.
Sorry - I meant to reply to this but it fell off my radar. What I mean is that the value returned from getSmallConstantMaxTripCount is usually either a known constant from the trip count, or something like 2147483647 if the loop count is unknown (or it can return 0 for larger induction variables). The assumption is that it is a good approximation for the trip count of the loop, or it is high enough that the vector cost dominates the total. Looking at the code, it should maybe be using getSmallBestKnownTC.

If the trip count is a known constant then everything should be good. It might well be that VF-1 works better than any other choice; it sounds like it will pick lower VFs, which would be a benefit to more trip counts (up to a point). I was interested if you had tried any other alternatives. (A lot of the VF above should be VF*UF.)
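To make the role of the trip-count bound concrete, here is a minimal sketch (an assumed simplification, not the LLVM implementation) of how a maximum trip count can tip the comparison between two VFs: each candidate runs ceil(MaxTripCount / VF) vector iterations, so for short trip counts a narrow VF can beat a wide one whose last iteration is mostly wasted.

```cpp
#include <cassert>

// Total cost of executing at most MaxTripCount scalar iterations with a
// vector body of width VF and per-iteration cost CostPerIter.
unsigned totalVectorCost(unsigned CostPerIter, unsigned VF,
                         unsigned MaxTripCount) {
  unsigned VectorIterations = (MaxTripCount + VF - 1) / VF; // ceil division
  return CostPerIter * VectorIterations;
}

// Candidate A is more profitable than B if it is cheaper over the bounded
// trip count (mirroring the shape, not the details, of isMoreProfitable).
bool isMoreProfitableSketch(unsigned CostA, unsigned VFA, unsigned CostB,
                            unsigned VFB, unsigned MaxTripCount) {
  return totalVectorCost(CostA, VFA, MaxTripCount) <
         totalVectorCost(CostB, VFB, MaxTripCount);
}
```

With at most 3 remaining iterations, VF = 4 at cost 5 beats VF = 8 at cost 8 (total 5 vs. 8), while for a large bound of 1000 iterations VF = 8 wins (1000 vs. 1250); a tight epilogue bound steers the choice toward narrower VFs.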
Ah, I see, thanks for clarifying! I tried both options (max epilogue trip count and average, i.e., max / 2) on some micro benchmarks, llvm/llvm-test-suite#165, and it looks like avg is strictly worse, even resulting in some regressions (baseline is current without this PR; compares runtime, so lower is better):
I finally got round to looking at the regression that would be introduced by setting MaxInterleaving = 4 for the Neoverse V2. There are a couple of loops of this shape:
Some observations: this will result in a NEON loop (i.e. not scalable), and we can't say much about the IC. I don't think that this is criticism of this patch; it's not intended to recognize this, right?
I'd say that's expected, yes. This patch doesn't change the
I do see the epilogue getting vectorized with this patch (because now |
bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
/// \p Multiplier is an additional scaling factor applied to VF before
/// comparing to EpilogueVectorizationMinVF.
bool isEpilogueVectorizationProfitable(const ElementCount VF,
Multiplier is the UF/IC, right? Might be clearer to call it that to avoid confusion.
Only in the non-SVE case; for SVE it's getVScaleForTuning(OrigLoop, TTI).value_or(1). This wouldn't change that, just pulls it out as an argument and passes IC for non-SVE.
const APInt MaxRemainingIterations =
    SE.getUnsignedRangeMax(RemainingIterations);
// Guard against huge trip counts.
if (MaxRemainingIterations.getActiveBits() <= 32) {
Hm, I would assume the max would always be MainLoopVF.getKnownMinValue() * IC - 1 or less, if the trip count is known to be constant? Curious why this guard is needed.
Right, it shouldn't be, I was overly defensive there.
The current code is still a bit confusing; could we not create a constant that is set to MainLoopVF.getKnownMinValue() * IC - 1?
Added MainLoopVF.getKnownMinValue() * IC - 1 as an upper bound, i.e., only use RemainingIterations if it's known to be smaller.
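The bound agreed on above can be sketched like this (illustrative only; names are simplified from the patch): the epilogue processes TC mod (MainLoopVF * IC) iterations, so MainLoopVF * IC - 1 is a structural upper limit, and ScalarEvolution's unsigned-range maximum is only used when it is tighter.

```cpp
#include <algorithm>
#include <cstdint>

// Upper bound on the epilogue's trip count: the remainder after the main
// loop can never reach MainLoopVF * IC, so clamp the (possibly huge)
// range-analysis maximum to that structural bound.
uint64_t epilogueMaxTripCountSketch(uint64_t RangeMax, uint64_t MainLoopVF,
                                    uint64_t IC) {
  return std::min(RangeMax, MainLoopVF * IC - 1);
}
```

For a main loop with VF 16 and IC 4, an unknown trip count (range max in the billions) still yields an epilogue bound of 63, which is what lets the planner cost epilogue VFs against a small trip count.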
I have run SPEC2017 INT + FP on the Neoverse V2, and I think I see a 2.4% improvement in x264 (I haven't done too many iterations, but usually x264 is stable). All the other apps were neutral; I didn't see regressions. So I think that's an okay result.
Forgot to add one nit: I applied the patch and saw quite a few regression test failures; the precommit tests show them too.
Thanks for checking. I don't know how much I would trust micro-benchmarks, but that probably makes sense if it was using the average factor. I imagine what we would really want would not be the "performance of the average trip count", but the "average performance of all the trip counts". The maximum might well be the best approximation of that.
if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
unsigned Multiplier = IC;
if (MainLoopVF.isScalable())
  Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
I believe this could be Multiplier *= for SVE, so that both the IC and the VScaleForTuning are accounted for.
Possibly, but I didn't really look at SVE, and since the remaining part of this change depends on using RemainingIterations for checking profitability, which has an existing // TODO: extend to support scalable VFs, I'd rather keep changing behavior for SVE as a follow-up.
I think this change looks incorrect. Previously in isEpilogueVectorizationProfitable we did:
unsigned Multiplier = 1;
if (VF.isScalable())
Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
return true;
i.e. for fixed-width VFs Multiplier = 1, whereas after this change Multiplier = IC. This is either biasing against or in favour of fixed-width VFs, which doesn't seem right. I think in order to match the previous behaviour the code should be:
unsigned Multiplier = 1;
if (MainLoopVF.isScalable())
Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
Hi @fhahn, can you verify this as well? I think it should be one of the following:
unsigned Multiplier = IC;
if (MainLoopVF.isScalable())
Multiplier *= getVScaleForTuning(OrigLoop, TTI).value_or(1);
or
unsigned Multiplier = 1;
if (MainLoopVF.isScalable())
Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
I wouldn't say it is incorrect; it just keeps the original behavior for scalable vectors, as I think @juliannagele doesn't have access to HW with scalable vectors, which would be needed to evaluate the impact of changing this for scalable vectors.
At this point we have already picked the VF for the main loop, so the only change is that we consider epilogue vectorization for more cases with fixed vectors.
To avoid regressions with fixed vectors, this patch relies on code that checks the number of remaining iterations, which currently doesn't support scalable vectors (look for // TODO: extend to support scalable VFs), which probably should be fixed first. Again, this should probably be done by someone with access to HW supporting scalable vectors.
I can volunteer for that, but I am really keen that this lands first. :)
Ah, okay, didn't realise that, thanks!
Reverse ping: do you have plans to finish this soonish? My patch depends on this, i.e. the bit of extra
One of the test failures was a bit fragile (it arguably shouldn't have passed in the first place), but I should have a separate fix for that. The plan is to finish in the next few days / this week.
Thanks, sounds good!
…incoming values (#113915) This patch aims to strengthen collection of loop guards by processing PHI nodes with multiple incoming values as follows: collect guards for all incoming values/blocks and try to merge them into a single one for the PHI node. The goal is to determine tighter bounds on the trip counts of scalar tail loops after vectorization, helping to avoid unnecessary transforms. In particular we'd like to avoid vectorizing scalar tails of hand-vectorized loops, for example in [Transforms/PhaseOrdering/X86/pr38280.ll](https://github.com/llvm/llvm-project/blob/231e03ba7e82896847dbc27d457dbb208f04699c/llvm/test/Transforms/PhaseOrdering/X86/pr38280.ll), discovered via #108190 Compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=a55248789ed3f653740e0723d016203b9d585f26&to=500e4c46e79f60b93b11a752698c520e345948e3&stat=instructions:u PR: #113915
Whilst rebasing PR llvm#116247 I discovered an issue where PR llvm#108190 seems to have unintentionally introduced an unfairness in selecting epilogue VFs by making potentially better choices for fixed-width VFs compared to scalable VFs. When considering whether epilogue vectorisation is profitable or not the latest algorithm appears to be:
bool IsProfitable = false;
if (VF.isFixed())
  IsProfitable = (IC * VF.getFixedValue()) >= EpilogueVectorizationMinVF;
else
  IsProfitable = (getVScaleForTuning() * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF;
Instead, the estimate for the number of scalar iterations processed in the main vector loop should be (IC * estimatedRuntimeVF).
… IC (#9666) * [SCEV] Collect and merge loop guards through PHI nodes with multiple incoming values (llvm#113915) This patch aims to strengthen collection of loop guards by processing PHI nodes with multiple incoming values as follows: collect guards for all incoming values/blocks and try to merge them into a single one for the PHI node. The goal is to determine tighter bounds on the trip counts of scalar tail loops after vectorization, helping to avoid unnecessary transforms. In particular we'd like to avoid vectorizing scalar tails of hand-vectorized loops, for example in [Transforms/PhaseOrdering/X86/pr38280.ll](https://github.com/llvm/llvm-project/blob/231e03ba7e82896847dbc27d457dbb208f04699c/llvm/test/Transforms/PhaseOrdering/X86/pr38280.ll), discovered via llvm#108190 Compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=a55248789ed3f653740e0723d016203b9d585f26&to=500e4c46e79f60b93b11a752698c520e345948e3&stat=instructions:u PR: llvm#113915 (cherry picked from commit 7c8e05a) * [SCEV] Address post-commit comments for llvm#113915. Address post-commit comments for llvm#113915. (cherry picked from commit feb9b37) * [LV] Vectorize Epilogues for loops with small VF but high IC (llvm#108190) - Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable - Allow the same VF for the Epilogue as for the main loop - Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF PR: llvm#108190 --------- Co-authored-by: Florian Hahn <flo@fhahn.com> (cherry picked from commit a8538b9) --------- Co-authored-by: Florian Hahn <flo@fhahn.com>