[AArch64][LoopVectorize] Use upper bound trip count instead of the constant TC when choosing max VF #67697

Rin18 · 2023-09-28T16:05:48Z

This patch is based on #67543 and should not be merged before that PR.

The cost model currently uses the exact (constant) trip count when deciding on the maximum VF. It can instead use the upper-bound trip count, which equals the constant trip count whenever the latter is known.
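
To illustrate the kind of loop this change targets, here is a minimal, hypothetical sketch. The names `boundedTripCount` and `sumLowLanes` and the mask `7` are made up for illustration and do not come from the patch or its test. The trip count `n & 7` is not a compile-time constant, so the exact trip count would be reported as unknown (0), but it can never exceed 7. That is the sort of upper bound `getSmallConstantMaxTripCount` is meant to report, assuming the analysis can derive the bound from the mask.

```cpp
#include <cstdint>

// Hypothetical helper: the loop trip count. It depends on the runtime
// value of n, but masking with 7 guarantees it is at most 7.
inline std::uint32_t boundedTripCount(std::uint32_t n) { return n & 7u; }

// Hypothetical loop: sums the first (n & 7) elements of data.
// The caller must provide at least 7 readable elements.
std::int64_t sumLowLanes(const std::int32_t *data, std::uint32_t n) {
  std::int64_t sum = 0;
  const std::uint32_t tc = boundedTripCount(n);
  for (std::uint32_t i = 0; i < tc; ++i)
    sum += data[i];
  return sum;
}
```

With only the exact trip count available, nothing would stop the vectorizer from considering a VF wider than 7 lanes for such a loop. With the upper bound, the max VF can be clamped in the same way as for a constant trip count.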

llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll

commit 9c2faf15231ac5ebc168161d1731feed55eb177c
Merge: 0a0ac8da5df6 baecc9e
Author: Rin <irina.dobrescu@arm.com>
Date:   Thu Oct 5 11:19:13 2023 +0100

    Merge branch 'main' into maxTC_tailBase

commit 0a0ac8da5df684b865d0fb16f7a806832f37e05b
Author: Rin Dobrescu <rin.dobrescu@arm.com>
Date:   Thu Sep 28 15:48:49 2023 +0000

    [AArch64][LoopVectorize] Use upper bound trip count instead of the constant TC when choosing max VF

commit 26e009c
Author: Rin Dobrescu <rin.dobrescu@arm.com>
Date:   Thu Sep 28 10:30:39 2023 +0000

    Remove 'assertions automatically generated' line from test

commit e056129
Author: Rin Dobrescu <rin.dobrescu@arm.com>
Date:   Wed Sep 27 14:47:42 2023 +0000

    Address comments and fix tests

commit 1bf78c8
Author: Rin Dobrescu <rin.dobrescu@arm.com>
Date:   Mon Sep 25 11:34:15 2023 +0000

    [AArch64][LoopVectorize] Use either fixed-width or scalable VF when tail-folding

llvmbot · 2023-10-06T09:41:18Z

@llvm/pr-subscribers-llvm-transforms

Changes

Full diff: https://github.com/llvm/llvm-project/pull/67697.diff

1 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+34-31)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 53ad37bf3599b5c..26bf92d7d7c02be 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1663,17 +1663,17 @@ class LoopVectorizationCostModel {
   /// disabled or unsupported, then the scalable part will be equal to
   /// ElementCount::getScalable(0).
   FixedScalableVFPair computeFeasibleMaxVF(unsigned ConstTripCount,
+                                           unsigned MaxTripCount,
                                            ElementCount UserVF,
                                            bool FoldTailByMasking);
 
   /// \return the maximized element count based on the targets vector
   /// registers and the loop trip-count, but limited to a maximum safe VF.
   /// This is a helper function of computeFeasibleMaxVF.
-  ElementCount getMaximizedVFForTarget(unsigned ConstTripCount,
-                                       unsigned SmallestType,
-                                       unsigned WidestType,
-                                       ElementCount MaxSafeVF,
-                                       bool FoldTailByMasking);
+  ElementCount
+  getMaximizedVFForTarget(unsigned ConstTripCount, unsigned MaxTripCount,
+                          unsigned SmallestType, unsigned WidestType,
+                          ElementCount MaxSafeVF, bool FoldTailByMasking);
 
   /// \return the maximum legal scalable VF, based on the safe max number
   /// of elements.
@@ -4811,7 +4811,8 @@ LoopVectorizationCostModel::getMaxLegalScalableVF(unsigned MaxSafeElements) {
 }
 
 FixedScalableVFPair LoopVectorizationCostModel::computeFeasibleMaxVF(
-    unsigned ConstTripCount, ElementCount UserVF, bool FoldTailByMasking) {
+    unsigned ConstTripCount, unsigned MaxTripCount, ElementCount UserVF,
+    bool FoldTailByMasking) {
   MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);
   unsigned SmallestType, WidestType;
   std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();
@@ -4898,14 +4899,14 @@ FixedScalableVFPair LoopVectorizationCostModel::computeFeasibleMaxVF(
 
   FixedScalableVFPair Result(ElementCount::getFixed(1),
                              ElementCount::getScalable(0));
-  if (auto MaxVF =
-          getMaximizedVFForTarget(ConstTripCount, SmallestType, WidestType,
-                                  MaxSafeFixedVF, FoldTailByMasking))
+  if (auto MaxVF = getMaximizedVFForTarget(ConstTripCount, MaxTripCount,
+                                           SmallestType, WidestType,
+                                           MaxSafeFixedVF, FoldTailByMasking))
     Result.FixedVF = MaxVF;
 
-  if (auto MaxVF =
-          getMaximizedVFForTarget(ConstTripCount, SmallestType, WidestType,
-                                  MaxSafeScalableVF, FoldTailByMasking))
+  if (auto MaxVF = getMaximizedVFForTarget(
+          ConstTripCount, MaxTripCount, SmallestType, WidestType,
+          MaxSafeScalableVF, FoldTailByMasking))
     if (MaxVF.isScalable()) {
       Result.ScalableVF = MaxVF;
       LLVM_DEBUG(dbgs() << "LV: Found feasible scalable VF = " << MaxVF
@@ -4928,6 +4929,7 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   }
 
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
+  unsigned MaxTC = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);
   LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');
   if (TC == 1) {
     reportVectorizationFailure("Single iteration (non) loop",
@@ -4938,7 +4940,7 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
 
   switch (ScalarEpilogueStatus) {
   case CM_ScalarEpilogueAllowed:
-    return computeFeasibleMaxVF(TC, UserVF, false);
+    return computeFeasibleMaxVF(TC, MaxTC, UserVF, false);
   case CM_ScalarEpilogueNotAllowedUsePredicate:
     [[fallthrough]];
   case CM_ScalarEpilogueNotNeededUsePredicate:
@@ -4976,7 +4978,7 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
       LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking: vectorize with a "
                            "scalar epilogue instead.\n");
       ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
-      return computeFeasibleMaxVF(TC, UserVF, false);
+      return computeFeasibleMaxVF(TC, MaxTC, UserVF, false);
     }
     return FixedScalableVFPair::getNone();
   }
@@ -4993,7 +4995,8 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
     InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();
   }
 
-  FixedScalableVFPair MaxFactors = computeFeasibleMaxVF(TC, UserVF, true);
+  FixedScalableVFPair MaxFactors =
+      computeFeasibleMaxVF(TC, MaxTC, UserVF, true);
 
   // Avoid tail folding if the trip count is known to be a multiple of any VF
   // we choose.
@@ -5069,8 +5072,8 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
 }
 
 ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
-    unsigned ConstTripCount, unsigned SmallestType, unsigned WidestType,
-    ElementCount MaxSafeVF, bool FoldTailByMasking) {
+    unsigned ConstTripCount, unsigned MaxTripCount, unsigned SmallestType,
+    unsigned WidestType, ElementCount MaxSafeVF, bool FoldTailByMasking) {
   bool ComputeScalableMaxVF = MaxSafeVF.isScalable();
   const TypeSize WidestRegister = TTI.getRegisterBitWidth(
       ComputeScalableMaxVF ? TargetTransformInfo::RGK_ScalableVector
@@ -5108,24 +5111,24 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
   }
 
   // When a scalar epilogue is required, at least one iteration of the scalar
-  // loop has to execute. Adjust ConstTripCount accordingly to avoid picking a
+  // loop has to execute. Adjust MaxTripCount accordingly to avoid picking a
   // max VF that results in a dead vector loop.
-  if (ConstTripCount > 0 && requiresScalarEpilogue(true))
-    ConstTripCount -= 1;
-
-  if (ConstTripCount && ConstTripCount <= WidestRegisterMinEC &&
-      (!FoldTailByMasking || isPowerOf2_32(ConstTripCount))) {
-    // If loop trip count (TC) is known at compile time there is no point in
-    // choosing VF greater than TC (as done in the loop below). Select maximum
-    // power of two which doesn't exceed TC.
-    // If MaxVectorElementCount is scalable, we only fall back on a fixed VF
-    // when the TC is less than or equal to the known number of lanes.
-    auto ClampedConstTripCount = llvm::bit_floor(ConstTripCount);
+  if (MaxTripCount > 0 && requiresScalarEpilogue(true))
+    MaxTripCount -= 1;
+
+  if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
+      (!FoldTailByMasking || isPowerOf2_32(MaxTripCount))) {
+    // If upper bound loop trip count (TC) is known at compile time there is no
+    // point in choosing VF greater than TC (as done in the loop below). Select
+    // maximum power of two which doesn't exceed TC. If MaxVectorElementCount is
+    // scalable, we only fall back on a fixed VF when the TC is less than or
+    // equal to the known number of lanes.
+    auto ClampedUpperTripCount = llvm::bit_floor(MaxTripCount);
     LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
                          "exceeding the constant trip count: "
-                      << ClampedConstTripCount << "\n");
+                      << ClampedUpperTripCount << "\n");
     return ElementCount::get(
-        ClampedConstTripCount,
+        ClampedUpperTripCount,
         FoldTailByMasking ? MaxVectorElementCount.isScalable() : false);
   }
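
The clamping decision in the last hunk can be modelled in isolation. The following is a simplified sketch, not LLVM code: `ClampResult`, `bitFloor`, `isPowerOf2` and `clampToMaxTripCount` are hypothetical stand-ins, `ElementCount` is reduced to a lane count plus a scalable flag, and the result of `requiresScalarEpilogue(true)` is passed in as a plain `bool`.

```cpp
// Hypothetical, simplified stand-in for the outcome of the clamping check.
struct ClampResult {
  bool Clamped;      // true if the early-return (clamping) path is taken
  unsigned MinLanes; // known minimum lane count of the clamped VF
  bool Scalable;     // whether the clamped VF stays scalable
};

// Largest power of two that does not exceed X (0 for X == 0).
static unsigned bitFloor(unsigned X) {
  unsigned P = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= X; Bit <<= 1)
    P = Bit;
  return P;
}

static bool isPowerOf2(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

// Mirrors the structure of the check in the hunk above, under the
// simplifications described in the lead-in.
ClampResult clampToMaxTripCount(unsigned MaxTripCount,
                                unsigned WidestRegisterMinEC,
                                bool MaxVFIsScalable, bool FoldTailByMasking,
                                bool RequiresScalarEpilogue) {
  // A required scalar epilogue must execute at least one iteration.
  if (MaxTripCount > 0 && RequiresScalarEpilogue)
    MaxTripCount -= 1;

  // Clamp only when the bound is known, fits in the widest register, and
  // (when folding the tail) is a power of two.
  if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
      (!FoldTailByMasking || isPowerOf2(MaxTripCount)))
    return {true, bitFloor(MaxTripCount),
            FoldTailByMasking ? MaxVFIsScalable : false};

  return {false, 0, false}; // fall through to the normal max-VF search
}
```

For example, in this model an upper bound of 7 with a 16-lane widest register and no tail folding clamps to a fixed VF of 4, while the same bound with tail folding is not clamped because 7 is not a power of two.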

david-arm

Thanks for this @rin-arm! It looks like a nice improvement that makes greater use of information that the compiler already has. I just had a couple of minor comments regarding the test.

llvm/test/Transforms/LoopVectorize/AArch64/wide-trip-count.ll

david-arm

LGTM! Could you rename the new test before merging the patch? Thanks!

llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll

Rin18 requested review from davemgreen, sdesmalen-arm, david-arm and MDevereau September 28, 2023 16:06

david-arm reviewed Sep 29, 2023

llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll (outdated, resolved)

Rin18 force-pushed the maxTC_tailBase branch from 9c2faf1 to 876266a Compare October 5, 2023 15:48

Rin18 marked this pull request as ready for review October 6, 2023 09:40

llvmbot added the vectorizers and llvm:transforms labels Oct 6, 2023

Remove constant TC from getMaximizedVFForTarget and computeFeasibleMaxVF functions and add test

b170876

david-arm reviewed Oct 6, 2023

llvm/test/Transforms/LoopVectorize/AArch64/wide-trip-count.ll (two outdated, resolved threads)

Add test in clamped-trip-count.ll

afe149c

david-arm approved these changes Oct 9, 2023

llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll (outdated, resolved)

Rename test

8737b76

david-arm approved these changes Oct 9, 2023

Rin18 merged commit df8e0d0 into llvm:main Oct 9, 2023
2 checks passed

stepthomas mentioned this pull request Oct 10, 2023

AMDGPU stepthomas atomic csub no rtn forms ver2 stepthomas/llvm-project#1

Closed
