
[LV] Vectorize Epilogues for loops with small VF but high IC #108190

Merged (7 commits) on Nov 17, 2024

Conversation

juliannagele
Member

  • Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable
  • Allow the same VF for the Epilogue as for the main loop
  • Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF

@llvmbot
Member

llvmbot commented Sep 11, 2024

@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-transforms

Author: Julian Nagele (juliannagele)

Changes
  • Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable
  • Allow the same VF for the Epilogue as for the main loop
  • Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF
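The first bullet can be sketched in simplified form (this is an illustrative model, not the actual LLVM implementation; the threshold of 16 is the default for `EpilogueVectorizationMinVF` mentioned later in this discussion):

```cpp
#include <cassert>

// Sketch of the profitability check after this patch: the epilogue is worth
// vectorizing once the main loop covers enough elements per iteration.
// Multiplier is the interleave count IC for fixed-width VFs, or the estimated
// vscale for scalable VFs.
constexpr unsigned EpilogueVectorizationMinVF = 16;

bool isEpilogueVectorizationProfitable(unsigned KnownMinVF, unsigned Multiplier) {
  return Multiplier * KnownMinVF >= EpilogueVectorizationMinVF;
}
```

Before the patch, the multiplier was effectively 1 for fixed-width VFs, so a loop vectorized with VF=4 and IC=4 (16 elements per main-loop iteration) would not get a vectorized epilogue; with the patch it does.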

Patch is 95.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108190.diff

8 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+34-13)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll (+20-20)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll (+413)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+35-9)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll (+52-23)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll (+101-67)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+64-64)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index b5f87e458833d6..04a50ed3a9c594 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -471,6 +471,10 @@ class LoopVectorizationPlanner {
   bool isMoreProfitable(const VectorizationFactor &A,
                         const VectorizationFactor &B) const;
 
+  bool isMoreProfitable(const VectorizationFactor &A,
+                        const VectorizationFactor &B,
+                        const unsigned MaxTripCount) const;
+
   /// Determines if we have the infrastructure to vectorize the loop and its
   /// epilogue, assuming the main loop is vectorized by \p VF.
   bool isCandidateForEpilogueVectorization(const ElementCount VF) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index b821da03c16e94..7450cfe59a28da 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1537,7 +1537,8 @@ class LoopVectorizationCostModel {
   /// Returns true if epilogue vectorization is considered profitable, and
   /// false otherwise.
   /// \p VF is the vectorization factor chosen for the original loop.
-  bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
+  bool isEpilogueVectorizationProfitable(const ElementCount VF,
+                                         const unsigned Multiplier) const;
 
   /// Returns the execution time cost of an instruction for a given vector
   /// width. Vector width of one means scalar.
@@ -4282,12 +4283,11 @@ getVScaleForTuning(const Loop *L, const TargetTransformInfo &TTI) {
 }
 
 bool LoopVectorizationPlanner::isMoreProfitable(
-    const VectorizationFactor &A, const VectorizationFactor &B) const {
+    const VectorizationFactor &A, const VectorizationFactor &B,
+    const unsigned MaxTripCount) const {
   InstructionCost CostA = A.Cost;
   InstructionCost CostB = B.Cost;
 
-  unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
-
   // Improve estimate for the vector width if it is scalable.
   unsigned EstimatedWidthA = A.Width.getKnownMinValue();
   unsigned EstimatedWidthB = B.Width.getKnownMinValue();
@@ -4336,6 +4336,13 @@ bool LoopVectorizationPlanner::isMoreProfitable(
   return CmpFn(RTCostA, RTCostB);
 }
 
+bool LoopVectorizationPlanner::isMoreProfitable(
+    const VectorizationFactor &A, const VectorizationFactor &B) const {
+  const unsigned MaxTripCount =
+      PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);
+  return LoopVectorizationPlanner::isMoreProfitable(A, B, MaxTripCount);
+}
+
 void LoopVectorizationPlanner::emitInvalidCostRemarks(
     OptimizationRemarkEmitter *ORE) {
   using RecipeVFPair = std::pair<VPRecipeBase *, ElementCount>;
@@ -4648,7 +4655,7 @@ bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
 }
 
 bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
-    const ElementCount VF) const {
+    const ElementCount VF, const unsigned Multiplier) const {
   // FIXME: We need a much better cost-model to take different parameters such
   // as register pressure, code size increase and cost of extra branches into
   // account. For now we apply a very crude heuristic and only consider loops
@@ -4663,9 +4670,6 @@ bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
   if (TTI.getMaxInterleaveFactor(VF) <= 1)
     return false;
 
-  unsigned Multiplier = 1;
-  if (VF.isScalable())
-    Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
   if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
     return true;
   return false;
@@ -4711,7 +4715,11 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
     return Result;
   }
 
-  if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
+  unsigned Multiplier = IC;
+  if (MainLoopVF.isScalable())
+    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
+
+  if (!CM.isEpilogueVectorizationProfitable(MainLoopVF, Multiplier)) {
     LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
                          "this loop\n");
     return Result;
@@ -4730,16 +4738,20 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
   ScalarEvolution &SE = *PSE.getSE();
   Type *TCType = Legal->getWidestInductionType();
   const SCEV *RemainingIterations = nullptr;
+  unsigned MaxTripCount = 0;
   for (auto &NextVF : ProfitableVFs) {
     // Skip candidate VFs without a corresponding VPlan.
     if (!hasPlanWithVF(NextVF.Width))
       continue;
 
-    // Skip candidate VFs with widths >= the estimate runtime VF (scalable
-    // vectors) or the VF of the main loop (fixed vectors).
+    // Skip candidate VFs with widths >= the (estimated) runtime VF (scalable
+    // vectors) or > the VF of the main loop (fixed vectors).
     if ((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
          ElementCount::isKnownGE(NextVF.Width, EstimatedRuntimeVF)) ||
-        ElementCount::isKnownGE(NextVF.Width, MainLoopVF))
+        (NextVF.Width.isScalable() &&
+         ElementCount::isKnownGE(NextVF.Width, MainLoopVF)) ||
+        (!NextVF.Width.isScalable() && !MainLoopVF.isScalable() &&
+         ElementCount::isKnownGT(NextVF.Width, MainLoopVF)))
       continue;
 
     // If NextVF is greater than the number of remaining iterations, the
@@ -4750,6 +4762,14 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
         const SCEV *TC = createTripCountSCEV(TCType, PSE, OrigLoop);
         RemainingIterations = SE.getURemExpr(
             TC, SE.getConstant(TCType, MainLoopVF.getKnownMinValue() * IC));
+        const APInt MaxRemainingIterations =
+            SE.getUnsignedRangeMax(RemainingIterations);
+        // Guard against huge trip counts.
+        if (MaxRemainingIterations.getActiveBits() <= 32) {
+          MaxTripCount = MaxRemainingIterations.getZExtValue();
+          LLVM_DEBUG(dbgs() << "LEV: Maximum Trip Count for Epilogue: "
+                            << MaxTripCount << "\n");
+        }
       }
       if (SE.isKnownPredicate(
               CmpInst::ICMP_UGT,
@@ -4758,7 +4778,8 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
         continue;
     }
 
-    if (Result.Width.isScalar() || isMoreProfitable(NextVF, Result))
+    if (Result.Width.isScalar() ||
+        isMoreProfitable(NextVF, Result, MaxTripCount))
       Result = NextVF;
   }
 
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll b/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
index 6953d6c48694c2..283a928d0d4884 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
@@ -16,7 +16,7 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
 ; CHECK-NEXT:    br i1 [[CMP_28]], label [[FOR_COND_CLEANUP:%.*]], label [[ITER_CHECK:%.*]]
 ; CHECK:       iter.check:
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
 ; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
 ; CHECK:       vector.main.loop.iter.check:
 ; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i32 [[N]], 16
@@ -50,33 +50,33 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
 ; CHECK:       vec.epilog.iter.check:
-; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 8
-; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK_NOT_NOT:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
-; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK_NOT_NOT]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 12
+; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
+; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
 ; CHECK:       vec.epilog.ph:
 ; CHECK-NEXT:    [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; CHECK-NEXT:    [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967288
+; CHECK-NEXT:    [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967292
 ; CHECK-NEXT:    br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
 ; CHECK:       vec.epilog.vector.body:
 ; CHECK-NEXT:    [[INDEX7:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX7]]
-; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <8 x i8>, ptr [[TMP14]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <4 x i8>, ptr [[TMP14]], align 1
 ; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX7]]
-; CHECK-NEXT:    [[WIDE_LOAD9:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1
-; CHECK-NEXT:    [[TMP16:%.*]] = zext <8 x i8> [[WIDE_LOAD9]] to <8 x i16>
-; CHECK-NEXT:    [[TMP17:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>
-; CHECK-NEXT:    [[TMP18:%.*]] = mul nuw <8 x i16> [[TMP16]], [[TMP17]]
-; CHECK-NEXT:    [[TMP19:%.*]] = lshr <8 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
-; CHECK-NEXT:    [[TMP20:%.*]] = trunc nuw <8 x i16> [[TMP19]] to <8 x i8>
-; CHECK-NEXT:    store <8 x i8> [[TMP20]], ptr [[TMP15]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD9:%.*]] = load <4 x i8>, ptr [[TMP15]], align 1
+; CHECK-NEXT:    [[TMP16:%.*]] = zext <4 x i8> [[WIDE_LOAD9]] to <4 x i16>
+; CHECK-NEXT:    [[TMP17:%.*]] = zext <4 x i8> [[WIDE_LOAD8]] to <4 x i16>
+; CHECK-NEXT:    [[TMP18:%.*]] = mul nuw <4 x i16> [[TMP16]], [[TMP17]]
+; CHECK-NEXT:    [[TMP19:%.*]] = lshr <4 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT:    [[TMP20:%.*]] = trunc nuw <4 x i16> [[TMP19]] to <4 x i8>
+; CHECK-NEXT:    store <4 x i8> [[TMP20]], ptr [[TMP15]], align 1
 ; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX7]]
-; CHECK-NEXT:    [[WIDE_LOAD10:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
-; CHECK-NEXT:    [[TMP22:%.*]] = zext <8 x i8> [[WIDE_LOAD10]] to <8 x i16>
-; CHECK-NEXT:    [[TMP23:%.*]] = mul nuw <8 x i16> [[TMP22]], [[TMP17]]
-; CHECK-NEXT:    [[TMP24:%.*]] = lshr <8 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
-; CHECK-NEXT:    [[TMP25:%.*]] = trunc nuw <8 x i16> [[TMP24]] to <8 x i8>
-; CHECK-NEXT:    store <8 x i8> [[TMP25]], ptr [[TMP21]], align 1
-; CHECK-NEXT:    [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 8
+; CHECK-NEXT:    [[WIDE_LOAD10:%.*]] = load <4 x i8>, ptr [[TMP21]], align 1
+; CHECK-NEXT:    [[TMP22:%.*]] = zext <4 x i8> [[WIDE_LOAD10]] to <4 x i16>
+; CHECK-NEXT:    [[TMP23:%.*]] = mul nuw <4 x i16> [[TMP22]], [[TMP17]]
+; CHECK-NEXT:    [[TMP24:%.*]] = lshr <4 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT:    [[TMP25:%.*]] = trunc nuw <4 x i16> [[TMP24]] to <4 x i8>
+; CHECK-NEXT:    store <4 x i8> [[TMP25]], ptr [[TMP21]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 4
 ; CHECK-NEXT:    [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC5]]
 ; CHECK-NEXT:    br i1 [[TMP26]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       vec.epilog.middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
new file mode 100644
index 00000000000000..cda2d19521437f
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
@@ -0,0 +1,413 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S < %s -passes=loop-vectorize -force-vector-interleave=4 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64"
+
+define void @add_i8(ptr noalias nocapture noundef writeonly %A, ptr nocapture noundef readonly %B, ptr nocapture noundef readonly %C, i64 noundef %Iterations) {
+; CHECK-LABEL: @add_i8(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i64 [[ITERATIONS:%.*]], 0
+; CHECK-NEXT:    br i1 [[CMP11]], label [[ITER_CHECK:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK:       iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[ITERATIONS]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
+; CHECK:       vector.main.loop.iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[ITERATIONS]], 64
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[ITERATIONS]], 64
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[ITERATIONS]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[INDEX]], 16
+; CHECK-NEXT:    [[TMP2:%.*]] = add i64 [[INDEX]], 32
+; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[INDEX]], 48
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[B:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP2]]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP3]]
+; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 16
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 32
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 48
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP8]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD2:%.*]] = load <16 x i8>, ptr [[TMP9]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <16 x i8>, ptr [[TMP11]], align 1
+; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[C:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP2]]
+; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP3]]
+; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 0
+; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 16
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 32
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 48
+; CHECK-NEXT:    [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP16]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP17]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD7:%.*]] = load <16 x i8>, ptr [[TMP18]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <16 x i8>, ptr [[TMP19]], align 1
+; CHECK-NEXT:    [[TMP20:%.*]] = add <16 x i8> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP21:%.*]] = add <16 x i8> [[WIDE_LOAD6]], [[WIDE_LOAD2]]
+; CHECK-NEXT:    [[TMP22:%.*]] = add <16 x i8> [[WIDE_LOAD7]], [[WIDE_LOAD3]]
+; CHECK-NEXT:    [[TMP23:%.*]] = add <16 x i8> [[WIDE_LOAD8]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[A:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP2]]
+; CHECK-NEXT:    [[TMP27:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP3]]
+; CHECK-NEXT:    [[TMP28:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 0
+; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 16
+; CHECK-NEXT:    [[TMP30:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 32
+; CHECK-NEXT:    [[TMP31:%.*]] = getelementptr inbounds i8, ptr [[TMP24]], i32 48
+; CHECK-NEXT:    store <16 x i8> [[TMP20]], ptr [[TMP28]], align 1
+; CHECK-NEXT:    store <16 x i8> [[TMP21]], ptr [[TMP29]], align 1
+; CHECK-NEXT:    store <16 x i8> [[TMP22]], ptr [[TMP30]], align 1
+; CHECK-NEXT:    store <16 x i8> [[TMP23]], ptr [[TMP31]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 64
+; CHECK-NEXT:    [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP32]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[ITERATIONS]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
+; CHECK:       vec.epilog.iter.check:
+; CHECK-NEXT:    [[N_VEC_REMAINING:%.*]] = sub i64 [[ITERATIONS]], [[N_VEC]]
+; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
+; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK:       vec.epilog.ph:
+; CHECK-NEXT:    [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; CHECK-NEXT:    [[N_MOD_VF9:%.*]] = urem i64 [[ITERATIONS]], 8
+; CHECK-NEXT:    [[N_VEC10:%.*]] = sub i64 [[ITERATIONS]], [[N_MOD_VF9]]
+; CHECK-NEXT:    br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
+; CHECK:       vec.epilog.vector.body:
+; CHECK-NEXT:    [[INDEX12:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT15:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP33:%.*]] = add i64 [[INDEX12]], 0
+; CHECK-NEXT:    [[TMP34:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP35:%.*]] = getelementptr inbounds i8, ptr [[TMP34]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD13:%.*]] = load <8 x i8>, ptr [[TMP35]], align 1
+; CHECK-NEXT:    [[TMP36:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr inbounds i8, ptr [[TMP36]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD14:%.*]] = load <8 x i8>, ptr [[TMP37]], align 1
+; CHECK-NEXT:    [[TMP38:%.*]] = add <8 x i8> [[WIDE_LOAD14]], [[WIDE_LOAD13]]
+; CHECK-NEXT:    [[TMP39:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[TMP39]], i32 0
+; CHECK-NEXT:    store <8 x i8> [[TMP38]], ptr [[TMP40]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT15]] = add nuw i64 [[INDEX12]], 8
+; CHECK-NEXT:    [[TMP41:%.*]] = icmp eq i64 [[INDEX_NEXT15]], [[N_VEC10]]
+; CHECK-NEXT:    br i1 [[TMP41]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       vec.epilog.middle.block:
+; CHECK-NEXT:    [[CMP_N11:%.*]] = icmp eq i64 [[ITERATIONS]], [[N_VEC10]]
+; CHECK-NEXT:    br i1 [[CMP_N11]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
+; CHECK:       vec.epilog.scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC10]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.cond.cleanup.loopexit:
+; CHECK-NEXT:    br label [[FOR_COND_CLEANUP]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    ret void
+; CHECK:       for.body:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP42...
[truncated]

@sjoerdmeijer
Collaborator

Thanks again for working on this.
I will kick off some performance runs with this, which I was planning to do but haven't done yet.
I am off tomorrow, but promise I will get back within a couple of days.

Collaborator

@davemgreen davemgreen left a comment


Hi - A high level comment - the MaxTripCount in isMoreProfitable is expected to be the constant trip count (as there is generally expected to be a single trip count if it is constant).

Should the epilog vectorization cost be based on the maximum epilog trip count (VF*IC - 1) or the average trip count?

@juliannagele
Member Author

Hi - A high level comment - the MaxTripCount in isMoreProfitable is expected to be the constant trip count (as there is generally expected to be a single trip count if it is constant).

Should the epilog vectorization cost be based on the maximum epilog trip count (VF*IC - 1) or the average trip count?

Hi, thanks for having a look! I'm not sure I fully understand the question. From the existing code and the name, my reading was that it should be the maximum trip count, since for the main loop it's using getSmallConstantMaxTripCount, which describes itself as an "upper bound of the loop trip count".

@sjoerdmeijer
Collaborator

I first did a quick performance run for 527.cam4_r from SPEC FP. I was hoping it would avoid the regression that is a result of my extra interleaving patch (#100385), but unfortunately this patch didn't make a difference. I.e., with the extra interleaving for the V2, this patch didn't manage to bring back the lost perf.

What I will plan to do next:

  • I haven't looked into why this is not helping; I suspect epilogue vectorization is not kicking in, but I will debug why that is the case,
  • will kick off some more perf runs.

@davemgreen
Collaborator

Hi, thanks for having a look! I'm not sure I fully understand the question. From the existing code and the name, my reading was that it should be the maximum trip count, since for the main loop it's using getSmallConstantMaxTripCount, which describes itself as an "upper bound of the loop trip count".

Sorry - I meant to reply to this but it fell off my radar. What I mean is that the value returned from getSmallConstantMaxTripCount is usually either a known constant from the trip count, or something like 2147483647 if the loop count is unknown (or it can return 0 for larger induction variables). The assumption is that this is a good approximation for the trip count of the loop, or it is high enough that the vector cost dominates the total. Looking at the code, it should maybe be using getSmallBestKnownTC.

If the trip count is a known constant then everything should be good, but if it is unknown but large, this patch I think uses VF-1 (either directly or from 2147483647 % VF via SCEVs) to cost the epilog. The real number of iterations will be something in the range 0 to VF-1, with 0 probably being relatively common. That means we will calculate the cost based on the worst-case epilog iteration count, which for small loops like this could skew the results. I would suspect that it probably makes the vectorizer choose a smaller factor for the epilog, so that it would pick 3*VF2 + 1*scalar instead of 1*VF4 if VF/2 were used instead.

It might well be that VF-1 works better than any other choice; it sounds like it will pick lower VFs, which would benefit more trip counts (up to a point). I was interested in whether you had tried any other alternatives.

(A lot of the VF above should be VF*UF).
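To make the trip-count discussion concrete: the main vector loop processes VF*IC elements per iteration, so the epilogue sees TC % (VF*IC) iterations, anywhere from 0 to VF*IC - 1. A sketch of that arithmetic (function names here are illustrative, not from the LLVM sources):

```cpp
#include <cassert>

// Iterations left for the epilogue after the main vector loop.
unsigned epilogueIterations(unsigned TC, unsigned VF, unsigned IC) {
  return TC % (VF * IC);
}

// Worst-case (maximum) epilogue iteration count, which is what this patch
// effectively costs against when the trip count is unknown.
unsigned maxEpilogueIterations(unsigned VF, unsigned IC) {
  return VF * IC - 1;
}
```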

@juliannagele
Member Author

juliannagele commented Sep 24, 2024

Ah, I see, thanks for clarifying! I tried both options (max epilogue trip count and average, i.e., / 2) on some micro benchmarks, llvm/llvm-test-suite#165, and it looks like avg is strictly worse, even resulting in some regressions (baseline is the current compiler without this PR; the table compares runtime, so lower is better):

| name | MaxTCvsBase | AvgTCvsBase | AvgTCvsMaxTC |
| --- | --- | --- | --- |
| benchAutoVecForuint8_tForLoopTC65 | -0.09% | -0.12% | -0.03% |
| benchReductionAutoVecForuint8_tForLoopTC65 | 0.04% | 0.10% | 0.06% |
| benchAutoVecForuint8_tForLoopTC80 | 0.02% | -0.04% | -0.07% |
| benchReductionAutoVecForuint8_tForLoopTC80 | -0.07% | 17.82% | 17.91% |
| benchAutoVecForuint8_tForLoopTC96 | 0.20% | 0.05% | -0.15% |
| benchReductionAutoVecForuint8_tForLoopTC96 | -0.25% | -0.20% | 0.05% |
| benchAutoVecForuint8_tForLoopTC104 | 0.04% | 0.07% | 0.03% |
| benchReductionAutoVecForuint8_tForLoopTC104 | 0.24% | 5.76% | 5.52% |
| benchAutoVecForuint8_tForLoopTC127 | -0.02% | 0.08% | 0.10% |
| benchReductionAutoVecForuint8_tForLoopTC127 | 0.01% | 16.98% | 16.97% |
| benchAutoVecForuint16_tForLoopTC65 | 0.17% | 0.16% | -0.00% |
| benchReductionAutoVecForuint16_tForLoopTC65 | 0.75% | 0.92% | 0.17% |
| benchAutoVecForuint16_tForLoopTC80 | -44.01% | -44.05% | -0.06% |
| benchReductionAutoVecForuint16_tForLoopTC80 | -26.48% | -26.42% | 0.07% |
| benchAutoVecForuint16_tForLoopTC96 | 0.00% | -0.15% | -0.15% |
| benchReductionAutoVecForuint16_tForLoopTC96 | 0.70% | 0.76% | 0.06% |
| benchAutoVecForuint16_tForLoopTC104 | -26.32% | -26.39% | -0.09% |
| benchReductionAutoVecForuint16_tForLoopTC104 | -9.45% | -9.47% | -0.02% |
| benchAutoVecForuint16_tForLoopTC127 | -40.52% | -40.49% | 0.05% |
| benchReductionAutoVecForuint16_tForLoopTC127 | -36.34% | -36.41% | -0.11% |
| benchAutoVecForuint32_tForLoopTC65 | -0.61% | -0.51% | 0.10% |
| benchReductionAutoVecForuint32_tForLoopTC65 | 0.45% | 0.37% | -0.08% |
| benchAutoVecForuint32_tForLoopTC80 | 0.02% | -0.02% | -0.03% |
| benchReductionAutoVecForuint32_tForLoopTC80 | 0.15% | 0.17% | 0.02% |
| benchAutoVecForuint32_tForLoopTC96 | 0.01% | 0.03% | 0.02% |
| benchReductionAutoVecForuint32_tForLoopTC96 | 0.29% | 0.34% | 0.05% |
| benchAutoVecForuint32_tForLoopTC104 | -17.54% | -12.88% | 5.65% |
| benchReductionAutoVecForuint32_tForLoopTC104 | -14.36% | -10.04% | 5.04% |
| benchAutoVecForuint32_tForLoopTC127 | -21.28% | -18.47% | 3.58% |
| benchReductionAutoVecForuint32_tForLoopTC127 | -18.78% | -18.76% | 0.03% |

@sjoerdmeijer
Collaborator

sjoerdmeijer commented Sep 30, 2024

I finally got round to looking at the regression that would be introduced by setting MaxInterLeaving = 4 for the Neoverse V2. There are a couple of loops of this shape:

void add(double * __restrict in1, double * __restrict in2, int n) {
    for (int i =0 ; i < n; i++) 
        in1[i] = in1[i] + in2[i];
}

Some observations: this will result in a NEON loop (i.e. not scalable), and we can't say much about the IC.
As a result, we don't vectorize the epilogue for these cases, also not with this patch applied.

I don't think this is criticism of this patch; it's not intended to handle this case, right?

@juliannagele
Member Author

I finally got round to looking at the regression that would be introduced by setting MaxInterLeaving = 4 for the Neoverse V2. There are a couple of loops of this shape:

void add(double * __restrict in1, double * __restrict in2, int n) {
    for (int i =0 ; i < n; i++) 
        in1[i] = in1[i] + in2[i];
}

Some observations: this will result in a NEON loop (i.e. not scalable), and we can't say much about the IC. As a result, we don't vectorize the epilogue for these cases, also not with this patch applied.

I don't think this is criticism of this patch; it's not intended to handle this case, right?

I'd say that's expected, yes. This patch doesn't change the EpilogueVectorizationMinVF threshold, so with that being 16 and VF=2 and IC=4 in this loop we still end up with 8 when considering VF*IC and don't vectorize the epilogue. If I change it to float instead

int add2(float  * __restrict in1, float * __restrict in2, int n) {
    for (int i =0 ; i < n; i++)
        in1[i] = in1[i] + in2[i];
    return 0;
}

I do see the epilogue getting vectorized with this patch (because now VF=4, IC=4, VF*IC>=16).
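The arithmetic behind that observation can be sketched as follows, assuming 128-bit NEON vectors (the VF and IC values are the ones quoted in the comment above; the function names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Elements per 128-bit NEON vector for a given element type.
template <typename T> constexpr unsigned neonVF() { return 16 / sizeof(T); }

// The crude profitability gate discussed here: VF * IC vs. the threshold of 16.
constexpr bool epilogueConsidered(unsigned VF, unsigned IC) {
  return VF * IC >= 16;
}
```

So with double elements VF=2 and IC=4 give 8 < 16 (no epilogue vectorization), while float elements give VF=4, IC=4 and 16 >= 16.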

bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
/// \p Multiplier is an additional scaling factor applied to VF before
/// comparing to EpilogueVectorizationMinVF.
bool isEpilogueVectorizationProfitable(const ElementCount VF,
Contributor


Multiplier is the UF/IC, right? It might be clearer to call it that to avoid confusion.

Member Author


Only in the non-SVE case; for SVE it's getVScaleForTuning(OrigLoop, TTI).value_or(1). This patch doesn't change that, it just pulls the value out into an argument and passes IC in the non-SVE case.

const APInt MaxRemainingIterations =
SE.getUnsignedRangeMax(RemainingIterations);
// Guard against huge trip counts.
if (MaxRemainingIterations.getActiveBits() <= 32) {
Contributor


Hm, I would assume the max would always be MainLoopVF.getKnownMinValue() * IC - 1 or less, if the trip count is known to be constant? Curious why this guard is needed?

Member Author


Right, it shouldn't be, I was overly defensive there.

Contributor

The current code is still a bit confusing; could we not create a constant set to MainLoopVF.getKnownMinValue() * IC - 1?

Member Author

Added MainLoopVF.getKnownMinValue() * IC - 1 as an upper bound, i.e., only use RemainingIterations if it's known to be smaller.

@sjoerdmeijer
Collaborator

I have run SPEC2017 INT + FP on the Neoverse V2, and I think I see a 2.4% improvement in x264 (I haven't done too many iterations, but usually x264 is stable). All the other apps were neutral; I didn't see regressions. So, I think that's an okay result.

@sjoerdmeijer
Collaborator

Forgot to add one nit: I applied the patch and saw quite a few regression test failures; the precommit tests show them too.

@davemgreen
Collaborator

I tried both options (max epilogue trip count and average, i.e., / 2) on some micro-benchmarks, llvm/llvm-test-suite#165, and it looks like avg is strictly worse.

Thanks for checking. I don't know how much I would trust micro-benchmarks, but that probably makes sense if it was using the average factor. I imagine what we would really want would not be the "performance of the average trip count", but the "average performance of all the trip counts". The maximum might well be the best approximation of that.

if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
  unsigned Multiplier = IC;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
Collaborator

I believe this could be Multiplier *= for SVE, so that both the IC and the VScaleForTuning are accounted for.

Member Author

Possibly, but I didn't really look at SVE and since the remaining part of this change depends on using RemainingIterations for checking profitability, which has an (existing) // TODO: extend to support scalable VFs. I'd rather keep changing behavior for SVE as a follow-up.

Contributor

@david-arm david-arm Nov 18, 2024

I think this change looks incorrect. Previously in isEpilogueVectorizationProfitable we did:

  unsigned Multiplier = 1;
  if (VF.isScalable())
    Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
  if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
    return true;

i.e. for fixed-width VFs Multiplier = 1, whereas after this change Multiplier = IC. This is either biasing against or in favour of fixed-width VFs, which doesn't seem right. I think in order to match the previous behaviour the code should be:

  unsigned Multiplier = 1;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);

Contributor

Hi @fhahn can you verify this as well? I think it should be one of the following:

  unsigned Multiplier = IC;
  if (MainLoopVF.isScalable())
    Multiplier *= getVScaleForTuning(OrigLoop, TTI).value_or(1);

or

  unsigned Multiplier = 1;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);

Contributor

I wouldn't say it is incorrect; it just keeps the original behavior for scalable vectors, as I think @juliannagele doesn't have access to HW with scalable vectors, which would be needed to evaluate the impact of changing this.

At this point we have already picked the VF for the main loop, so the only change is that we consider epilogue vectorization in more cases with fixed vectors.

To avoid regressions with fixed vectors, this patch relies on code that checks the number of remaining iterations, which currently doesn't support scalable vectors (look for // TODO: extend to support scalable VFs) and probably should be fixed first. Again, this should probably be done by someone with access to HW supporting scalable vectors.

Collaborator

I can volunteer for that, but I am really keen that this lands first. :)

Collaborator

I believe this has now landed, and #100385 is hopefully unblocked. Dave put up #116607 for changing this part too.

Collaborator

Ah, okay, didn't realise that, thanks!

@sjoerdmeijer
Collaborator

Reverse ping: do you have plans to finish this soon'ish? My patch depends on this, i.e. the bit of extra Multiplier logic. If you don't have any plans then I could consider integrating that into my patch to break the dependency and unblock that work. Let me know what you think.

@juliannagele
Member Author

Reverse ping: do you have plans to finish this soon'ish? My patch depends on this, i.e. the bit of extra Multiplier logic. If you don't have any plans then I could consider integrating that into my patch to break the dependency and unblock that work. Let me know what you think.

One of the test failures was a bit fragile (it arguably shouldn't have passed in the first place), but I should have a separate fix for that. The plan is to finish in the next few days/this week.

@sjoerdmeijer
Collaborator

Reverse ping: do you have plans to finish this soon'ish? My patch depends on this, i.e. the bit of extra Multiplier logic. If you don't have any plans then I could consider integrating that into my patch to break the dependency and unblock that work. Let me know what you think.

One of the test failures was a bit fragile (it arguably shouldn't have passed in the first place), but I should have a separate fix for that. The plan is to finish in the next few days/this week.

Thanks, sounds good!

fhahn pushed a commit that referenced this pull request Nov 15, 2024
…incoming values (#113915)

This patch aims to strengthen collection of loop guards by processing
PHI nodes with multiple incoming values as follows: collect guards for
all incoming values/blocks and try to merge them into a single one for
the PHI node.

The goal is to determine tighter bounds on the trip counts of scalar
tail loops after vectorization, helping to avoid unnecessary transforms.
In particular we'd like to avoid vectorizing scalar tails of
hand-vectorized loops, for example in
[Transforms/PhaseOrdering/X86/pr38280.ll](https://github.com/llvm/llvm-project/blob/231e03ba7e82896847dbc27d457dbb208f04699c/llvm/test/Transforms/PhaseOrdering/X86/pr38280.ll),
discovered via #108190

Compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=a55248789ed3f653740e0723d016203b9d585f26&to=500e4c46e79f60b93b11a752698c520e345948e3&stat=instructions:u

PR: #113915
fhahn pushed a commit to fhahn/llvm-project that referenced this pull request Nov 15, 2024
…incoming values (llvm#113915)

@fhahn fhahn merged commit a8538b9 into llvm:main Nov 17, 2024
8 checks passed
david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 18, 2024
Whilst rebasing PR llvm#116247 I discovered an issue where
PR llvm#108190 seems to have unintentionally introduced an
unfairness in selecting epilogue VFs by making potentially
better choices for fixed-width VFs compared to scalable VFs.
When considering whether epilogue vectorisation is profitable
or not the latest algorithm appears to be:

bool IsProfitable = false;
if (VF.isFixed())
  IsProfitable = (IC * VF.getFixedValue())
     >= EpilogueVectorizationMinVF;
else
  IsProfitable = (getVScaleForTuning() * VF.getKnownMinValue())
     >= EpilogueVectorizationMinVF;

Instead, the estimate for the number of scalar iterations
processed in the main vector loop should be

  (IC * estimatedRuntimeVF)
akshayrdeodhar pushed a commit to akshayrdeodhar/llvm-project that referenced this pull request Nov 18, 2024
…incoming values (llvm#113915)

juliannagele added a commit to swiftlang/llvm-project that referenced this pull request Dec 2, 2024
… IC (#9666)

* [SCEV] Collect and merge loop guards through PHI nodes with multiple incoming values (llvm#113915)

(cherry picked from commit 7c8e05a)

* [SCEV] Address post-commit comments for llvm#113915.

Address post-commit comments for
llvm#113915.

(cherry picked from commit feb9b37)

* [LV] Vectorize Epilogues for loops with small VF but high IC (llvm#108190)

- Consider MainLoopVF * IC when determining whether Epilogue
Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing
the Epilogue VF

PR: llvm#108190
---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
(cherry picked from commit a8538b9)

---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
6 participants