
[LV] Vectorize Epilogues for loops with small VF but high IC #108190

Merged (7 commits) on Nov 17, 2024

Conversation

juliannagele
Member

  • Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable
  • Allow the same VF for the Epilogue as for the main loop
  • Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF

@llvmbot
Member

llvmbot commented Sep 11, 2024

@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-transforms

Author: Julian Nagele (juliannagele)

Changes
  • Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable
  • Allow the same VF for the Epilogue as for the main loop
  • Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF
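The first bullet can be sketched in simplified form (this is an illustrative model, not the actual LLVM implementation; the threshold of 16 is the default for `EpilogueVectorizationMinVF` mentioned later in this discussion):

```cpp
#include <cassert>

// Sketch of the profitability check after this patch: the epilogue is worth
// vectorizing once the main loop covers enough elements per iteration.
// Multiplier is the interleave count IC for fixed-width VFs, or the estimated
// vscale for scalable VFs.
constexpr unsigned EpilogueVectorizationMinVF = 16;

bool isEpilogueVectorizationProfitable(unsigned KnownMinVF, unsigned Multiplier) {
  return Multiplier * KnownMinVF >= EpilogueVectorizationMinVF;
}
```

Before the patch, the multiplier was effectively 1 for fixed-width VFs, so a loop vectorized with VF=4 and IC=4 (16 elements per main-loop iteration) would not get a vectorized epilogue; with the patch it does.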

Patch is 95.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108190.diff

8 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+34-13)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll (+20-20)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll (+413)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+35-9)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll (+52-23)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll (+101-67)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+64-64)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index b5f87e458833d6..04a50ed3a9c594 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -471,6 +471,10 @@ class LoopVectorizationPlanner {
   bool isMoreProfitable(const VectorizationFactor &A,
                         const VectorizationFactor &B) const;
 
+  bool isMoreProfitable(const VectorizationFactor &A,
+                        const VectorizationFactor &B,
+                        const unsigned MaxTripCount) const;
+
   /// Determines if we have the infrastructure to vectorize the loop and its
   /// epilogue, assuming the main loop is vectorized by \p VF.
   bool isCandidateForEpilogueVectorization(const ElementCount VF) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index b821da03c16e94..7450cfe59a28da 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1537,7 +1537,8 @@ class LoopVectorizationCostModel {
   /// Returns true if epilogue vectorization is considered profitable, and
   /// false otherwise.
   /// \p VF is the vectorization factor chosen for the original loop.
-  bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
+  bool isEpilogueVectorizationProfitable(const ElementCount VF,
+                                         const unsigned Multiplier) const;
 
   /// Returns the execution time cost of an instruction for a given vector
   /// width. Vector width of one means scalar.
@@ -4282,12 +4283,11 @@ getVScaleForTuning(const Loop *L, const TargetTransformInfo &TTI) {
 }
 
 bool LoopVectorizationPlanner::isMoreProfitable(
-    const VectorizationFactor &A, const VectorizationFactor &B) const {
+    const VectorizationFactor &A, const VectorizationFactor &B,
+    const unsigned MaxTripCount) const {
   InstructionCost CostA = A.Cost;
   InstructionCost CostB = B.Cost;
 
-  unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
-
   // Improve estimate for the vector width if it is scalable.
   unsigned EstimatedWidthA = A.Width.getKnownMinValue();
   unsigned EstimatedWidthB = B.Width.getKnownMinValue();
@@ -4336,6 +4336,13 @@ bool LoopVectorizationPlanner::isMoreProfitable(
   return CmpFn(RTCostA, RTCostB);
 }
 
+bool LoopVectorizationPlanner::isMoreProfitable(
+    const VectorizationFactor &A, const VectorizationFactor &B) const {
+  const unsigned MaxTripCount =
+      PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
+  return LoopVectorizationPlanner::isMoreProfitable(A, B, MaxTripCount);
+}
+
 void LoopVectorizationPlanner::emitInvalidCostRemarks(
     OptimizationRemarkEmitter *ORE) {
   using RecipeVFPair = std::pair<VPRecipeBase *, ElementCount>;
@@ -4648,7 +4655,7 @@ bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
 }
 
 bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
-    const ElementCount VF) const {
+    const ElementCount VF, const unsigned Multiplier) const {
   // FIXME: We need a much better cost-model to take different parameters such
   // as register pressure, code size increase and cost of extra branches into
   // account. For now we apply a very crude heuristic and only consider loops
@@ -4663,9 +4670,6 @@ bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
   if (TTI.getMaxInterleaveFactor(VF) <= 1)
     return false;
 
-  unsigned Multiplier = 1;
-  if (VF.isScalable())
-    Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
   if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
     return true;
   return false;
@@ -4711,7 +4715,11 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
     return Result;
   }
 
-  if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
+  unsigned Multiplier = IC;
+  if (MainLoopVF.isScalable())
+    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
+
+  if (!CM.isEpilogueVectorizationProfitable(MainLoopVF, Multiplier)) {
     LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
                          "this loop\n");
     return Result;
@@ -4730,16 +4738,20 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
   ScalarEvolution &SE = *PSE.getSE();
   Type *TCType = Legal->getWidestInductionType();
   const SCEV *RemainingIterations = nullptr;
+  unsigned MaxTripCount = 0;
   for (auto &NextVF : ProfitableVFs) {
     // Skip candidate VFs without a corresponding VPlan.
     if (!hasPlanWithVF(NextVF.Width))
       continue;
 
-    // Skip candidate VFs with widths >= the estimate runtime VF (scalable
-    // vectors) or the VF of the main loop (fixed vectors).
+    // Skip candidate VFs with widths >= the (estimated) runtime VF (scalable
+    // vectors) or > the VF of the main loop (fixed vectors).
     if ((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
          ElementCount::isKnownGE(NextVF.Width, EstimatedRuntimeVF)) ||
-        ElementCount::isKnownGE(NextVF.Width, MainLoopVF))
+        (NextVF.Width.isScalable() &&
+         ElementCount::isKnownGE(NextVF.Width, MainLoopVF)) ||
+        (!NextVF.Width.isScalable() && !MainLoopVF.isScalable() &&
+         ElementCount::isKnownGT(NextVF.Width, MainLoopVF)))
       continue;
 
     // If NextVF is greater than the number of remaining iterations, the
@@ -4750,6 +4762,14 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
         const SCEV *TC = createTripCountSCEV(TCType, PSE, OrigLoop);
         RemainingIterations = SE.getURemExpr(
             TC, SE.getConstant(TCType, MainLoopVF.getKnownMinValue() * IC));
+        const APInt MaxRemainingIterations =
+            SE.getUnsignedRangeMax(RemainingIterations);
+        // Guard against huge trip counts.
+        if (MaxRemainingIterations.getActiveBits() <= 32) {
+          MaxTripCount = MaxRemainingIterations.getZExtValue();
+          LLVM_DEBUG(dbgs() << "LEV: Maximum Trip Count for Epilogue: "
+                            << MaxTripCount << "\n");
+        }
       }
       if (SE.isKnownPredicate(
               CmpInst::ICMP_UGT,
@@ -4758,7 +4778,8 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
         continue;
     }
 
-    if (Result.Width.isScalar() || isMoreProfitable(NextVF, Result))
+    if (Result.Width.isScalar() ||
+        isMoreProfitable(NextVF, Result, MaxTripCount))
       Result = NextVF;
   }
 
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
index 6953d6c48694c2..283a928d0d4884 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
@@ -16,7 +16,7 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
 ; CHECK-NEXT:    br i1 [[CMP_28]], label [[FOR_COND_CLEANUP:%.*]], label [[ITER_CHECK:%.*]]
 ; CHECK:       iter.check:
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
 ; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
 ; CHECK:       vector.main.loop.iter.check:
 ; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i32 [[N]], 16
@@ -50,33 +50,33 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
 ; CHECK:       vec.epilog.iter.check:
-; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 8
-; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK_NOT_NOT:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
-; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK_NOT_NOT]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 12
+; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
+; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
 ; CHECK:       vec.epilog.ph:
 ; CHECK-NEXT:    [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; CHECK-NEXT:    [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967288
+; CHECK-NEXT:    [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967292
 ; CHECK-NEXT:    br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
 ; CHECK:       vec.epilog.vector.body:
 ; CHECK-NEXT:    [[INDEX7:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX7]]
-; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <8 x i8>, ptr [[TMP14]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <4 x i8>, ptr [[TMP14]], align 1
 ; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX7]]
-; CHECK-NEXT:    [[WIDE_LOAD9:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1
-; CHECK-NEXT:    [[TMP16:%.*]] = zext <8 x i8> [[WIDE_LOAD9]] to <8 x i16>
-; CHECK-NEXT:    [[TMP17:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>
-; CHECK-NEXT:    [[TMP18:%.*]] = mul nuw <8 x i16> [[TMP16]], [[TMP17]]
-; CHECK-NEXT:    [[TMP19:%.*]] = lshr <8 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
-; CHECK-NEXT:    [[TMP20:%.*]] = trunc nuw <8 x i16> [[TMP19]] to <8 x i8>
-; CHECK-NEXT:    store <8 x i8> [[TMP20]], ptr [[TMP15]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD9:%.*]] = load <4 x i8>, ptr [[TMP15]], align 1
+; CHECK-NEXT:    [[TMP16:%.*]] = zext <4 x i8> [[WIDE_LOAD9]] to <4 x i16>
+; CHECK-NEXT:    [[TMP17:%.*]] = zext <4 x i8> [[WIDE_LOAD8]] to <4 x i16>
+; CHECK-NEXT:    [[TMP18:%.*]] = mul nuw <4 x i16> [[TMP16]], [[TMP17]]
+; CHECK-NEXT:    [[TMP19:%.*]] = lshr <4 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT:    [[TMP20:%.*]] = trunc nuw <4 x i16> [[TMP19]] to <4 x i8>
+; CHECK-NEXT:    store <4 x i8> [[TMP20]], ptr [[TMP15]], align 1
 ; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX7]]
-; CHECK-NEXT:    [[WIDE_LOAD10:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
-; CHECK-NEXT:    [[TMP22:%.*]] = zext <8 x i8> [[WIDE_LOAD10]] to <8 x i16>
-; CHECK-NEXT:    [[TMP23:%.*]] = mul nuw <8 x i16> [[TMP22]], [[TMP17]]
-; CHECK-NEXT:    [[TMP24:%.*]] = lshr <8 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
-; CHECK-NEXT:    [[TMP25:%.*]] = trunc nuw <8 x i16> [[TMP24]] to <8 x i8>
-; CHECK-NEXT:    store <8 x i8> [[TMP25]], ptr [[TMP21]], align 1
-; CHECK-NEXT:    [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 8
+; CHECK-NEXT:    [[WIDE_LOAD10:%.*]] = load <4 x i8>, ptr [[TMP21]], align 1
+; CHECK-NEXT:    [[TMP22:%.*]] = zext <4 x i8> [[WIDE_LOAD10]] to <4 x i16>
+; CHECK-NEXT:    [[TMP23:%.*]] = mul nuw <4 x i16> [[TMP22]], [[TMP17]]
+; CHECK-NEXT:    [[TMP24:%.*]] = lshr <4 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT:    [[TMP25:%.*]] = trunc nuw <4 x i16> [[TMP24]] to <4 x i8>
+; CHECK-NEXT:    store <4 x i8> [[TMP25]], ptr [[TMP21]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 4
 ; CHECK-NEXT:    [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC5]]
 ; CHECK-NEXT:    br i1 [[TMP26]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       vec.epilog.middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
new file mode 100644
index 00000000000000..cda2d19521437f
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
@@ -0,0 +1,413 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S < %s -passes=loop-vectorize -force-vector-interleave=4 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64"
+
+define void @add_i8(ptr noalias nocapture noundef writeonly %A, ptr nocapture noundef readonly %B, ptr nocapture noundef readonly %C, i64 noundef %Iterations) {
+; CHECK-LABEL: @add_i8(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i64 [[ITERATIONS:%.*]], 0
+; CHECK-NEXT:    br i1 [[CMP11]], label [[ITER_CHECK:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK:       iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[ITERATIONS]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
+; CHECK:       vector.main.loop.iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[ITERATIONS]], 64
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[ITERATIONS]], 64
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[ITERATIONS]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[INDEX]], 16
+; CHECK-NEXT:    [[TMP2:%.*]] = add i64 [[INDEX]], 32
+; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[INDEX]], 48
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[B:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP2]]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP3]]
+; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 16
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 32
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 48
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP8]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD2:%.*]] = load <16 x i8>, ptr [[TMP9]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <16 x i8>, ptr [[TMP11]], align 1
+; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[C:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP2]]
+; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP3]]
+; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 0
+; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 16
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 32
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 48
+; CHECK-NEXT:    [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP16]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP17]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD7:%.*]] = load <16 x i8>, ptr [[TMP18]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <16 x i8>, ptr [[TMP19]], align 1
+; CHECK-NEXT:    [[TMP20:%.*]] = add <16 x i8> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP21:%.*]] = add <16 x i8> [[WIDE_LOAD6]], [[WIDE_LOAD2]]
+; CHECK-NEXT:    [[TMP22:%.*]] = add <16 x i8> [[WIDE_LOAD7]], [[WIDE_LOAD3]]
+; CHECK-NEXT:    [[TMP23:%.*]] = add <16 x i8> [[WIDE_LOAD8]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[A:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP2]]
+; CHECK-NEXT:    [[TMP27:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP3]]
+; CHECK-NEXT:    [[TMP28:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 0
+; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 16
+; CHECK-NEXT:    [[TMP30:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 32
+; CHECK-NEXT:    [[TMP31:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 48
+; CHECK-NEXT:    store <16 x i8> [[TMP20]], ptr [[TMP28]], align 1
+; CHECK-NEXT:    store <16 x i8> [[TMP21]], ptr [[TMP29]], align 1
+; CHECK-NEXT:    store <16 x i8> [[TMP22]], ptr [[TMP30]], align 1
+; CHECK-NEXT:    store <16 x i8> [[TMP23]], ptr [[TMP31]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 64
+; CHECK-NEXT:    [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP32]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[ITERATIONS]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
+; CHECK:       vec.epilog.iter.check:
+; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = sub i64 [[ITERATIONS]], [[N_VEC]]
+; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
+; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK:       vec.epilog.ph:
+; CHECK-NEXT:    [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; CHECK-NEXT:    [[N_MOD_VF9:%.*]] = urem i64 [[ITERATIONS]], 8
+; CHECK-NEXT:    [[N_VEC10:%.*]] = sub i64 [[ITERATIONS]], [[N_MOD_VF9]]
+; CHECK-NEXT:    br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
+; CHECK:       vec.epilog.vector.body:
+; CHECK-NEXT:    [[INDEX12:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT15:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP33:%.*]] = add i64 [[INDEX12]], 0
+; CHECK-NEXT:    [[TMP34:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP35:%.*]] = getelementptr inbounds i8, ptr [[TMP34]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD13:%.*]] = load <8 x i8>, ptr [[TMP35]], align 1
+; CHECK-NEXT:    [[TMP36:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr inbounds i8, ptr [[TMP36]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD14:%.*]] = load <8 x i8>, ptr [[TMP37]], align 1
+; CHECK-NEXT:    [[TMP38:%.*]] = add <8 x i8> [[WIDE_LOAD14]], [[WIDE_LOAD13]]
+; CHECK-NEXT:    [[TMP39:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[TMP39]], i32 0
+; CHECK-NEXT:    store <8 x i8> [[TMP38]], ptr [[TMP40]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT15]] = add nuw i64 [[INDEX12]], 8
+; CHECK-NEXT:    [[TMP41:%.*]] = icmp eq i64 [[INDEX_NEXT15]], [[N_VEC10]]
+; CHECK-NEXT:    br i1 [[TMP41]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       vec.epilog.middle.block:
+; CHECK-NEXT:    [[CMP_N11:%.*]] = icmp eq i64 [[ITERATIONS]], [[N_VEC10]]
+; CHECK-NEXT:    br i1 [[CMP_N11]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
+; CHECK:       vec.epilog.scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC10]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.cond.cleanup.loopexit:
+; CHECK-NEXT:    br label [[FOR_COND_CLEANUP]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    ret void
+; CHECK:       for.body:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP42...
[truncated]

@sjoerdmeijer
Collaborator

Thanks again for working on this.
I will kick off some performance runs with this, which I was planning to do but haven't done yet.
I am off tomorrow, but promise I will get back within a couple of days.

Collaborator

@davemgreen davemgreen left a comment


Hi - A high level comment - the MaxTripCount in isMoreProfitable is expected to be the constant trip count (as there is generally expected to be a single trip count if it is constant).

Should the epilog vectorization cost be based on the maximum epilog trip count (VF*IC - 1) or the average trip count?

@juliannagele
Member Author

Hi - A high level comment - the MaxTripCount in isMoreProfitable is expected to be the constant trip count (as there is generally expected to be a single trip count if it is constant).

Should the epilog vectorization cost be based on the maximum epilog trip count (VF*IC - 1) or the average trip count?

Hi, thanks for having a look! I'm not sure I fully understand the question. From the existing code and the name, my reading was that it should be the maximum trip count, since for the main loop it's using getSmallConstantMaxTripCount, which describes itself as an "upper bound of the loop trip count".

@sjoerdmeijer
Collaborator

I first did a quick performance run for 527.cam4_r from SPEC FP. I was hoping it would avoid the regression that is a result of my extra interleaving patch (#100385), but unfortunately this patch didn't make a difference. I.e., with the extra interleaving for the V2, this patch didn't manage to bring back the lost perf.

What I will plan to do next:

  • I haven't looked into why this is not helping; I suspect epilogue vectorization is not kicking in, but I will debug why that is the case,
  • will kick off some more perf runs.

@davemgreen
Collaborator

Hi, thanks for having a look! I'm not sure I fully understand the question. From the existing code and the name, my reading was that it should be the maximum trip count, since for the main loop it's using getSmallConstantMaxTripCount, which describes itself as an "upper bound of the loop trip count".

Sorry - I meant to reply to this but it fell off my radar. What I mean is that the value returned from getSmallConstantMaxTripCount is usually either a known constant from the trip count, or something like 2147483647 if the loop count is unknown (or it can return 0 for larger induction variables). The assumption is that this is a good approximation for the trip count of the loop, or it is high enough that the vector cost dominates the total. Looking at the code, it should maybe be using getSmallBestKnownTC.

If the trip count is a known constant then everything should be good, but if it is unknown but large, this patch I think uses VF-1 (either directly or from 2147483647 % VF via SCEVs) to cost the epilog. The real number of iterations will be something in the range 0 to VF-1, with 0 probably being relatively common. That means we will calculate the cost based on the worst-case epilog iteration count, which for small loops like this could skew the results. I would suspect that it probably makes the vectorizer choose a smaller factor for the epilog, so that it would pick 3*VF2 + 1*scalar instead of 1*VF4 if VF/2 were used instead.

It might well be that VF-1 works better than any other choice; it sounds like it will pick lower VFs, which would benefit more trip counts (up to a point). I was interested in whether you had tried any other alternatives.

(A lot of the VF above should be VF*UF).
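To make the trip-count discussion concrete: the main vector loop processes VF*IC elements per iteration, so the epilogue sees TC % (VF*IC) iterations, anywhere from 0 to VF*IC - 1. A sketch of that arithmetic (function names here are illustrative, not from the LLVM sources):

```cpp
#include <cassert>

// Iterations left for the epilogue after the main vector loop.
unsigned epilogueIterations(unsigned TC, unsigned VF, unsigned IC) {
  return TC % (VF * IC);
}

// Worst-case (maximum) epilogue iteration count, which is what this patch
// effectively costs against when the trip count is unknown.
unsigned maxEpilogueIterations(unsigned VF, unsigned IC) {
  return VF * IC - 1;
}
```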

@juliannagele
Member Author

juliannagele commented Sep 24, 2024

Ah, I see, thanks for clarifying! I tried both options (max epilogue trip count and average, i.e., / 2) on some micro benchmarks, llvm/llvm-test-suite#165, and it looks like avg is strictly worse, even resulting in some regressions (baseline is the current compiler without this PR; the table compares runtime, so lower is better):

| name | MaxTCvsBase | AvgTCvsBase | AvgTCvsMaxTC |
| --- | --- | --- | --- |
| benchAutoVecForuint8_tForLoopTC65 | -0.09% | -0.12% | -0.03% |
| benchReductionAutoVecForuint8_tForLoopTC65 | 0.04% | 0.10% | 0.06% |
| benchAutoVecForuint8_tForLoopTC80 | 0.02% | -0.04% | -0.07% |
| benchReductionAutoVecForuint8_tForLoopTC80 | -0.07% | 17.82% | 17.91% |
| benchAutoVecForuint8_tForLoopTC96 | 0.20% | 0.05% | -0.15% |
| benchReductionAutoVecForuint8_tForLoopTC96 | -0.25% | -0.20% | 0.05% |
| benchAutoVecForuint8_tForLoopTC104 | 0.04% | 0.07% | 0.03% |
| benchReductionAutoVecForuint8_tForLoopTC104 | 0.24% | 5.76% | 5.52% |
| benchAutoVecForuint8_tForLoopTC127 | -0.02% | 0.08% | 0.10% |
| benchReductionAutoVecForuint8_tForLoopTC127 | 0.01% | 16.98% | 16.97% |
| benchAutoVecForuint16_tForLoopTC65 | 0.17% | 0.16% | -0.00% |
| benchReductionAutoVecForuint16_tForLoopTC65 | 0.75% | 0.92% | 0.17% |
| benchAutoVecForuint16_tForLoopTC80 | -44.01% | -44.05% | -0.06% |
| benchReductionAutoVecForuint16_tForLoopTC80 | -26.48% | -26.42% | 0.07% |
| benchAutoVecForuint16_tForLoopTC96 | 0.00% | -0.15% | -0.15% |
| benchReductionAutoVecForuint16_tForLoopTC96 | 0.70% | 0.76% | 0.06% |
| benchAutoVecForuint16_tForLoopTC104 | -26.32% | -26.39% | -0.09% |
| benchReductionAutoVecForuint16_tForLoopTC104 | -9.45% | -9.47% | -0.02% |
| benchAutoVecForuint16_tForLoopTC127 | -40.52% | -40.49% | 0.05% |
| benchReductionAutoVecForuint16_tForLoopTC127 | -36.34% | -36.41% | -0.11% |
| benchAutoVecForuint32_tForLoopTC65 | -0.61% | -0.51% | 0.10% |
| benchReductionAutoVecForuint32_tForLoopTC65 | 0.45% | 0.37% | -0.08% |
| benchAutoVecForuint32_tForLoopTC80 | 0.02% | -0.02% | -0.03% |
| benchReductionAutoVecForuint32_tForLoopTC80 | 0.15% | 0.17% | 0.02% |
| benchAutoVecForuint32_tForLoopTC96 | 0.01% | 0.03% | 0.02% |
| benchReductionAutoVecForuint32_tForLoopTC96 | 0.29% | 0.34% | 0.05% |
| benchAutoVecForuint32_tForLoopTC104 | -17.54% | -12.88% | 5.65% |
| benchReductionAutoVecForuint32_tForLoopTC104 | -14.36% | -10.04% | 5.04% |
| benchAutoVecForuint32_tForLoopTC127 | -21.28% | -18.47% | 3.58% |
| benchReductionAutoVecForuint32_tForLoopTC127 | -18.78% | -18.76% | 0.03% |

@sjoerdmeijer
Collaborator

sjoerdmeijer commented Sep 30, 2024

I finally got round to looking at the regression that would be introduced by setting MaxInterLeaving = 4 for the Neoverse V2. There are a couple of loops of this shape:

void add(double * __restrict in1, double * __restrict in2, int n) {
    for (int i =0 ; i < n; i++) 
        in1[i] = in1[i] + in2[i];
}

Some observations: this will result in a NEON loop (i.e. not scalable), and we can't say much about the IC.
As a result, we don't vectorize the epilogue for these cases, also not with this patch applied.

I don't think this is criticism of this patch; it's not intended to handle this case, right?

@juliannagele
Member Author

I finally got round to looking at the regression that would be introduced by setting MaxInterLeaving = 4 for the Neoverse V2. There are a couple of loops of this shape:

void add(double * __restrict in1, double * __restrict in2, int n) {
    for (int i =0 ; i < n; i++) 
        in1[i] = in1[i] + in2[i];
}

Some observations: this will result in a NEON loop (i.e. not scalable), and we can't say much about the IC. As a result, we don't vectorize the epilogue for these cases, also not with this patch applied.

I don't think this is criticism of this patch; it's not intended to handle this case, right?

I'd say that's expected, yes. This patch doesn't change the EpilogueVectorizationMinVF threshold, so with that being 16 and VF=2 and IC=4 in this loop we still end up with 8 when considering VF*IC and don't vectorize the epilogue. If I change it to float instead

int add2(float  * __restrict in1, float * __restrict in2, int n) {
    for (int i =0 ; i < n; i++)
        in1[i] = in1[i] + in2[i];
    return 0;
}

I do see the epilogue getting vectorized with this patch (because now VF=4, IC=4, VF*IC>=16).
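The arithmetic behind that observation can be sketched as follows, assuming 128-bit NEON vectors (the VF and IC values are the ones quoted in the comment above; the function names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Elements per 128-bit NEON vector for a given element type.
template <typename T> constexpr unsigned neonVF() { return 16 / sizeof(T); }

// The crude profitability gate discussed here: VF * IC vs. the threshold of 16.
constexpr bool epilogueConsidered(unsigned VF, unsigned IC) {
  return VF * IC >= 16;
}
```

So with double elements VF=2 and IC=4 give 8 < 16 (no epilogue vectorization), while float elements give VF=4, IC=4 and 16 >= 16.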

bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
/// \p Multiplier is an additional scaling factor applied to VF before
/// comparing to EpilogueVectorizationMinVF.
bool isEpilogueVectorizationProfitable(const ElementCount VF,
Contributor


Multiplier is the UF/IC, right? It might be clearer to call it that to avoid confusion.

Member Author


Only in the non-SVE case; for SVE it's getVScaleForTuning(OrigLoop, TTI).value_or(1). This patch doesn't change that, it just pulls the value out into an argument and passes IC in the non-SVE case.

const APInt MaxRemainingIterations =
SE.getUnsignedRangeMax(RemainingIterations);
// Guard against huge trip counts.
if (MaxRemainingIterations.getActiveBits() <= 32) {
Contributor


Hm, I would assume the max would always be MainLoopVF.getKnownMinValue() * IC - 1 or less, if the trip count is known to be constant? Curious why this guard is needed?

Member Author


Right, it shouldn't be, I was overly defensive there.

Contributor

The current code is still a bit confusing; could we not create a constant set to MainLoopVF.getKnownMinValue() * IC - 1?

Member Author

Added MainLoopVF.getKnownMinValue() * IC - 1 as an upper bound, i.e., only use RemainingIterations if it's known to be smaller.

@sjoerdmeijer
Collaborator

I have run SPEC2017 INT + FP on the Neoverse V2, and I think I see a 2.4% improvement in x264 (I haven't done too many iterations, but usually x264 is stable). All the other apps were neutral; I didn't see regressions. So, I think that's an okay result.

@sjoerdmeijer
Collaborator

Forgot to add one nit: I applied the patch and saw quite a few regression test failures; the precommit tests show them too.

@davemgreen
Collaborator

I tried both options (max epilogue trip count and average, i.e., / 2) on some micro-benchmarks, llvm/llvm-test-suite#165, and it looks like avg is strictly worse.

Thanks for checking. I don't know how much I would trust micro-benchmarks, but that probably makes sense if it was using the average factor. I imagine what we would really want would not be the "performance of the average trip count", but the "average performance of all the trip counts". The maximum might well be the best approximation of that.

if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
  unsigned Multiplier = IC;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
Collaborator

I believe this could be Multiplier *= for SVE, so that both the IC and the VScaleForTuning are accounted for.

Member Author

Possibly, but I didn't really look at SVE and since the remaining part of this change depends on using RemainingIterations for checking profitability, which has an (existing) // TODO: extend to support scalable VFs. I'd rather keep changing behavior for SVE as a follow-up.

Contributor

@david-arm david-arm Nov 18, 2024

I think this change looks incorrect. Previously in isEpilogueVectorizationProfitable we did:

  unsigned Multiplier = 1;
  if (VF.isScalable())
    Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
  if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
    return true;

i.e. for fixed-width VFs Multiplier = 1, whereas after this change Multiplier = IC. This is either biasing against or in favour of fixed-width VFs, which doesn't seem right. I think in order to match the previous behaviour the code should be:

  unsigned Multiplier = 1;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);

Contributor

Hi @fhahn can you verify this as well? I think it should be one of the following:

  unsigned Multiplier = IC;
  if (MainLoopVF.isScalable())
    Multiplier *= getVScaleForTuning(OrigLoop, TTI).value_or(1);

or

  unsigned Multiplier = 1;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);

Contributor

I wouldn't say it is incorrect; it just keeps the original behavior for scalable vectors, as I think @juliannagele doesn't have access to HW with scalable vectors, which would be needed to evaluate the impact of changing this.

At this point we have already picked the VF for the main loop, so the only change is that we consider epilogue vectorization in more cases with fixed vectors.

To avoid regressions with fixed vectors, this patch relies on code that checks the number of remaining iterations, which currently doesn't support scalable vectors (look for // TODO: extend to support scalable VFs) and probably should be fixed first. Again, this should probably be done by someone with access to HW supporting scalable vectors.

Collaborator

I can volunteer for that, but I am really keen that this lands first. :)

Collaborator

I believe this has now landed, and #100385 is hopefully unblocked. Dave put up #116607 for changing this part too.

Collaborator

Ah, okay, didn't realise that, thanks!

@sjoerdmeijer
Collaborator

Reverse ping: do you have plans to finish this soon'ish? My patch depends on this, i.e. the bit of extra Multiplier logic. If you don't have any plans then I could consider integrating that into my patch to break the dependency and unblock that work. Let me know what you think.

@juliannagele
Member Author

Reverse ping: do you have plans to finish this soon'ish? My patch depends on this, i.e. the bit of extra Multiplier logic. If you don't have any plans then I could consider integrating that into my patch to break the dependency and unblock that work. Let me know what you think.

One of the test failures was a bit fragile (it arguably shouldn't have passed in the first place), but I should have a separate fix for that. The plan is to finish in the next few days/this week.

@sjoerdmeijer
Collaborator

Reverse ping: do you have plans to finish this soon'ish? My patch depends on this, i.e. the bit of extra Multiplier logic. If you don't have any plans then I could consider integrating that into my patch to break the dependency and unblock that work. Let me know what you think.

One of the test failures was a bit fragile (it arguably shouldn't have passed in the first place), but I should have a separate fix for that. The plan is to finish in the next few days/this week.

Thanks, sounds good!

fhahn pushed a commit that referenced this pull request Nov 15, 2024
…incoming values (#113915)

This patch aims to strengthen collection of loop guards by processing
PHI nodes with multiple incoming values as follows: collect guards for
all incoming values/blocks and try to merge them into a single one for
the PHI node.

The goal is to determine tighter bounds on the trip counts of scalar
tail loops after vectorization, helping to avoid unnecessary transforms.
In particular we'd like to avoid vectorizing scalar tails of
hand-vectorized loops, for example in
[Transforms/PhaseOrdering/X86/pr38280.ll](https://github.com/llvm/llvm-project/blob/231e03ba7e82896847dbc27d457dbb208f04699c/llvm/test/Transforms/PhaseOrdering/X86/pr38280.ll),
discovered via #108190

Compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=a55248789ed3f653740e0723d016203b9d585f26&to=500e4c46e79f60b93b11a752698c520e345948e3&stat=instructions:u

PR: #113915
fhahn pushed a commit to fhahn/llvm-project that referenced this pull request Nov 15, 2024
…incoming values (llvm#113915)

@fhahn fhahn merged commit a8538b9 into llvm:main Nov 17, 2024
8 checks passed
david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 18, 2024
Whilst rebasing PR llvm#116247 I discovered an issue where
PR llvm#108190 seems to have unintentionally introduced an
unfairness in selecting epilogue VFs by making potentially
better choices for fixed-width VFs compared to scalable VFs.
When considering whether epilogue vectorisation is profitable
or not the latest algorithm appears to be:

bool IsProfitable = false;
if (VF.isFixed())
  IsProfitable = (IC * VF.getFixedValue())
     >= EpilogueVectorizationMinVF;
else
  IsProfitable = (getVScaleForTuning() * VF.getKnownMinValue())
     >= EpilogueVectorizationMinVF;

Instead, the estimate for the number of scalar iterations
processed in the main vector loop should be

  (IC * estimatedRuntimeVF)
akshayrdeodhar pushed a commit to akshayrdeodhar/llvm-project that referenced this pull request Nov 18, 2024
…incoming values (llvm#113915)

juliannagele added a commit to swiftlang/llvm-project that referenced this pull request Dec 2, 2024
… IC (#9666)

* [SCEV] Collect and merge loop guards through PHI nodes with multiple incoming values (llvm#113915)

(cherry picked from commit 7c8e05a)

* [SCEV] Address post-commit comments for llvm#113915.

Address post-commit comments for
llvm#113915.

(cherry picked from commit feb9b37)

* [LV] Vectorize Epilogues for loops with small VF but high IC (llvm#108190)

- Consider MainLoopVF * IC when determining whether Epilogue
Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing
the Epilogue VF

PR: llvm#108190
---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
(cherry picked from commit a8538b9)

---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
6 participants