[LoopVectorize] Add support for vectorisation of more early exit loops #88385

Open
wants to merge 1 commit into
base: main

david-arm
Contributor

@david-arm david-arm commented Apr 11, 2024

This patch follows on from PR #107004 by adding support for vectorisation of a simple class of loops that typically involves searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we only consider early exit loops legal to vectorise if
they meet these criteria:

  1. There are no stores in the loop.
  2. The loop must have only one uncountable early exit like those shown in the
    above examples.
  3. The early exit block dominates the latch block.
  4. The latch block must have an exact exit count.
  5. The loop must not contain reductions or recurrences.
  6. We must be able to prove at compile-time that the loop will not contain
    faulting loads.

For point 6, once this patch lands I intend to follow up by supporting
some limited cases of faulting loops where we can version the loop based
on pointer alignment. For example, it turns out in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks for the initial pointer being aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and since the VF is a power of 2, any vector load <= 32/64 bytes
in size that faults will always fault on the first lane, following the
same behaviour as the scalar loop. Given we already do such speculative
versioning for loops with unknown strides, alignment-based versioning
doesn't seem to be any worse, at least for loops with only one load.
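
The alignment argument can be checked with a little address arithmetic. The sketch below is illustrative only and is not part of the patch: the helper names and the 4096-byte page size are assumptions made for the example. It shows that a load whose start address is a multiple of its own power-of-two size never straddles a page boundary, so if such a load faults at all, the first lane already touches the faulting page.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an LLVM API: returns true if the byte range
// [Addr, Addr + Size) lies entirely within one page of PageSize bytes.
// Size must be non-zero.
bool loadStaysInOnePage(std::uint64_t Addr, std::uint64_t Size,
                        std::uint64_t PageSize = 4096) {
  return (Addr / PageSize) == ((Addr + Size - 1) / PageSize);
}

// Checks every start address in [0, Limit) that is a multiple of Size.
// When Size is a power of two no larger than the page size, each such
// load is expected to stay inside a single page.
bool allAlignedLoadsStayInOnePage(std::uint64_t Size, std::uint64_t Limit) {
  for (std::uint64_t Addr = 0; Addr < Limit; Addr += Size)
    if (!loadStaysInOnePage(Addr, Size))
      return false;
  return true;
}
```

An unaligned 32-byte load starting 16 bytes before a page boundary, by contrast, does cross into the next page, which is the case the proposed SCEV alignment check would rule out.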

This patch makes use of the existing experimental_cttz_elts intrinsic
that's required in the vectorised early exit block to determine the first
lane that triggered the exit. This intrinsic has generic lowering support
so it's guaranteed to work for all targets.
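
As a rough mental model of the vectorised loop (a scalar emulation, not the IR the patch generates), each vector iteration compares VF elements, leaves through the early exit if any lane matched, and uses a cttz.elts-style operation to recover the index of the first matching lane. The names `VF`, `firstSetLane` and `findFirst` below are hypothetical and chosen only for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative vectorisation factor; the real VF is chosen by the cost model.
constexpr std::size_t VF = 4;

// Scalar stand-in for a cttz.elts-style operation: index of the first set
// lane, or VF if no lane is set.
std::size_t firstSetLane(const std::array<bool, VF> &Mask) {
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    if (Mask[Lane])
      return Lane;
  return VF;
}

// Emulates the shape of a vectorised search loop: compare VF elements per
// iteration, take the early exit if any lane matched, and finish the
// remaining elements in a scalar tail.
// Precondition (criterion 6): all of P[0..N) must be safe to read, because
// lanes after the first match are loaded too.
std::size_t findFirst(const int *P, std::size_t N, int Val) {
  std::size_t I = 0;
  for (; I + VF <= N; I += VF) {
    std::array<bool, VF> Mask{};
    bool AnyMatch = false;
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      Mask[Lane] = (P[I + Lane] == Val);
      AnyMatch = AnyMatch || Mask[Lane];
    }
    if (AnyMatch)
      return I + firstSetLane(Mask); // early exit block
  }
  for (; I < N; ++I) // scalar remainder
    if (P[I] == Val)
      return I;
  return N;
}
```

The precondition is the important part: the emulation reads every lane of a vector iteration even when an earlier lane matches, which is why the patch insists on proving that the loads cannot fault.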

Tests have been updated here:

Transforms/LoopVectorize/simple_early_exit.ll

@llvmbot
Member

llvmbot commented Apr 11, 2024

@llvm/pr-subscribers-backend-systemz
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-support

@llvm/pr-subscribers-llvm-ir

Author: David Sherwood (david-arm)

Changes

This patch adds support for vectorisation of a simple class of loops that typically involves searching for something, i.e.

  for (int i = 0; i < n; i++) {
    if (p[i] == val)
      return i;
  }
  return n;

or

  for (int i = 0; i < n; i++) {
    if (p1[i] != p2[i])
      return i;
  }
  return n;

In this initial commit we only vectorise loops with the following criteria:

  1. There are no stores in the loop.
  2. The loop must have only one early exit like those shown in the above example. I have referred to such exits as speculative early exits, to distinguish from existing support for early exits where the exit-not-taken count is known exactly at compile time.
  3. The early exit block dominates the latch block.
  4. There are no loads after the early exit block.
  5. The loop must not contain reductions or recurrences. I don't see anything fundamental blocking vectorisation of such loops, but I just haven't done the work to support them yet.
  6. We must be able to prove at compile-time that loops will not contain faulting loads.

For point 6, once this patch lands I intend to follow up by supporting some limited cases of faulting loops where we can version the loop based on pointer alignment. For example, it turns out in the SPEC2017 benchmark there is a std::find loop that we can vectorise provided we add SCEV checks for the initial pointer being aligned to a multiple of the VF. In practice, the pointer is regularly aligned to at least 32/64 bytes and since the VF is a power of 2, any vector load <= 32/64 bytes in size that faults will always fault on the first lane, following the same behaviour as the scalar loop. Given we already do such speculative versioning for loops with unknown strides, alignment-based versioning doesn't seem to be any worse, at least for loops with only one load.

This patch makes use of the existing experimental_cttz_elts intrinsic that's required in the vectorised early exit block to determine the first lane that triggered the exit. This intrinsic has generic lowering support so it's guaranteed to work for all targets.

Tests have been added here:

Transforms/LoopVectorize/AArch64/simple_early_exit.ll


Patch is 226.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/88385.diff

18 Files Affected:

  • (modified) llvm/include/llvm/Analysis/LoopAccessAnalysis.h (+36)
  • (modified) llvm/include/llvm/Analysis/ScalarEvolution.h (+33-3)
  • (modified) llvm/include/llvm/IR/IRBuilder.h (+7)
  • (modified) llvm/include/llvm/Support/GenericLoopInfo.h (+4)
  • (modified) llvm/include/llvm/Support/GenericLoopInfoImpl.h (+10)
  • (modified) llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h (+8-1)
  • (modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+18)
  • (modified) llvm/lib/Analysis/LoopAccessAnalysis.cpp (+180-9)
  • (modified) llvm/lib/Analysis/ScalarEvolution.cpp (+88-6)
  • (modified) llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+10)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+348-42)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+63-5)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+71-7)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+38-11)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+3-1)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll (+2544)
  • (modified) llvm/test/Transforms/LoopVectorize/control-flow.ll (+1-1)
diff --git a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
index e39c371b41ec5c..d79c53f490c927 100644
--- a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
+++ b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -587,6 +587,9 @@ class LoopAccessInfo {
   /// not legal to insert them.
   bool hasConvergentOp() const { return HasConvergentOp; }
 
+  /// Return true if the loop may fault due to memory accesses.
+  bool mayFault() const { return LoopMayFault; }
+
   const RuntimePointerChecking *getRuntimePointerChecking() const {
     return PtrRtChecking.get();
   }
@@ -608,6 +611,24 @@ class LoopAccessInfo {
   unsigned getNumStores() const { return NumStores; }
   unsigned getNumLoads() const { return NumLoads;}
 
+  /// Returns the block that exits early from the loop, if there is one.
+  /// Otherwise returns nullptr.
+  BasicBlock *getSpeculativeEarlyExitingBlock() const {
+    return SpeculativeEarlyExitingBB;
+  }
+
+  /// Returns the successor of the block that exits early from the loop, if
+  /// there is one. Otherwise returns nullptr.
+  BasicBlock *getSpeculativeEarlyExitBlock() const {
+    return SpeculativeEarlyExitBB;
+  }
+
+  /// Returns all blocks with a countable exit, i.e. the exit-not-taken count
+  /// is known exactly at compile time.
+  const SmallVector<BasicBlock *, 4> &getCountableEarlyExitingBlocks() const {
+    return CountableEarlyExitBlocks;
+  }
+
   /// The diagnostics report generated for the analysis.  E.g. why we
   /// couldn't analyze the loop.
   const OptimizationRemarkAnalysis *getReport() const { return Report.get(); }
@@ -659,6 +680,10 @@ class LoopAccessInfo {
   /// pass.
   bool canAnalyzeLoop();
 
+  /// Returns true if this is a supported early exit loop that we can analyze
+  /// in this pass.
+  bool isAnalyzableEarlyExitLoop();
+
   /// Save the analysis remark.
   ///
   /// LAA does not directly emits the remarks.  Instead it stores it which the
@@ -696,6 +721,17 @@ class LoopAccessInfo {
   /// Cache the result of analyzeLoop.
   bool CanVecMem = false;
   bool HasConvergentOp = false;
+  bool LoopMayFault = false;
+
+  /// Keeps track of the early-exiting block, if present.
+  BasicBlock *SpeculativeEarlyExitingBB = nullptr;
+
+  /// Keeps track of the successor of the early-exiting block, if present.
+  BasicBlock *SpeculativeEarlyExitBB = nullptr;
+
+  /// Keeps track of all the early exits with known or countable exit-not-taken
+  /// counts.
+  SmallVector<BasicBlock *, 4> CountableEarlyExitBlocks;
 
   /// Indicator that there are non vectorizable stores to a uniform address.
   bool HasDependenceInvolvingLoopInvariantAddress = false;
diff --git a/llvm/include/llvm/Analysis/ScalarEvolution.h b/llvm/include/llvm/Analysis/ScalarEvolution.h
index 5828cc156cc785..562deab8b4159e 100644
--- a/llvm/include/llvm/Analysis/ScalarEvolution.h
+++ b/llvm/include/llvm/Analysis/ScalarEvolution.h
@@ -892,9 +892,13 @@ class ScalarEvolution {
   /// Similar to getBackedgeTakenCount, except it will add a set of
   /// SCEV predicates to Predicates that are required to be true in order for
   /// the answer to be correct. Predicates can be checked with run-time
-  /// checks and can be used to perform loop versioning.
-  const SCEV *getPredicatedBackedgeTakenCount(const Loop *L,
-                                              SmallVector<const SCEVPredicate *, 4> &Predicates);
+  /// checks and can be used to perform loop versioning. If \p Speculative is
+  /// true, this will attempt to return the speculative backedge count for loops
+  /// with early exits. However, this is only possible if we can formulate an
+  /// exact expression for the backedge count from the latch block.
+  const SCEV *getPredicatedBackedgeTakenCount(
+      const Loop *L, SmallVector<const SCEVPredicate *, 4> &Predicates,
+      bool Speculative = false);
 
   /// When successful, this returns a SCEVConstant that is greater than or equal
   /// to (i.e. a "conservative over-approximation") of the value returend by
@@ -912,6 +916,12 @@ class ScalarEvolution {
     return getBackedgeTakenCount(L, SymbolicMaximum);
   }
 
+  /// Return all the exiting blocks with exact exit counts.
+  void getExactExitingBlocks(const Loop *L,
+                             SmallVector<BasicBlock *, 4> *Blocks) {
+    getBackedgeTakenInfo(L).getExactExitingBlocks(L, this, Blocks);
+  }
+
   /// Return true if the backedge taken count is either the value returned by
   /// getConstantMaxBackedgeTakenCount or zero.
   bool isBackedgeTakenCountMaxOrZero(const Loop *L);
@@ -1534,6 +1544,16 @@ class ScalarEvolution {
     const SCEV *getExact(const Loop *L, ScalarEvolution *SE,
                          SmallVector<const SCEVPredicate *, 4> *Predicates = nullptr) const;
 
+    /// Similar to the above, except we permit unknown exit counts from
+    /// non-latch exit blocks. Any such early exit blocks must dominate the
+    /// latch and so the returned expression represents the speculative, or
+    /// maximum possible, *backedge-taken* count of the loop. If there is no
+    /// exact exit count for the latch this function returns
+    /// SCEVCouldNotCompute.
+    const SCEV *getSpeculative(
+        const Loop *L, ScalarEvolution *SE,
+        SmallVector<const SCEVPredicate *, 4> *Predicates = nullptr) const;
+
     /// Return the number of times this loop exit may fall through to the back
     /// edge, or SCEVCouldNotCompute. The loop is guaranteed not to exit via
     /// this block before this number of iterations, but may exit via another
@@ -1541,6 +1561,10 @@ class ScalarEvolution {
     const SCEV *getExact(const BasicBlock *ExitingBlock,
                          ScalarEvolution *SE) const;
 
+    /// Return all the exiting blocks with exact exit counts.
+    void getExactExitingBlocks(const Loop *L, ScalarEvolution *SE,
+                               SmallVector<BasicBlock *, 4> *Blocks) const;
+
     /// Get the constant max backedge taken count for the loop.
     const SCEV *getConstantMax(ScalarEvolution *SE) const;
 
@@ -2316,6 +2340,9 @@ class PredicatedScalarEvolution {
   /// Get the (predicated) backedge count for the analyzed loop.
   const SCEV *getBackedgeTakenCount();
 
+  /// Get the (predicated) speculative backedge count for the analyzed loop.
+  const SCEV *getSpeculativeBackedgeTakenCount();
+
   /// Adds a new predicate.
   void addPredicate(const SCEVPredicate &Pred);
 
@@ -2384,6 +2411,9 @@ class PredicatedScalarEvolution {
 
   /// The backedge taken count.
   const SCEV *BackedgeCount = nullptr;
+
+  /// The speculative backedge taken count.
+  const SCEV *SpeculativeBackedgeCount = nullptr;
 };
 
 template <> struct DenseMapInfo<ScalarEvolution::FoldID> {
diff --git a/llvm/include/llvm/IR/IRBuilder.h b/llvm/include/llvm/IR/IRBuilder.h
index f381273c46cfb8..81cf8a6f5d4793 100644
--- a/llvm/include/llvm/IR/IRBuilder.h
+++ b/llvm/include/llvm/IR/IRBuilder.h
@@ -2503,6 +2503,13 @@ class IRBuilderBase {
     return CreateShuffleVector(V, PoisonValue::get(V->getType()), Mask, Name);
   }
 
+  Value *CreateCountTrailingZeroElems(Type *ResTy, Value *Mask,
+                                      const Twine &Name = "") {
+    return CreateIntrinsic(
+        Intrinsic::experimental_cttz_elts, {ResTy, Mask->getType()},
+        {Mask, getInt1(/*ZeroIsPoison=*/true)}, nullptr, Name);
+  }
+
   Value *CreateExtractValue(Value *Agg, ArrayRef<unsigned> Idxs,
                             const Twine &Name = "") {
     if (auto *V = Folder.FoldExtractValue(Agg, Idxs))
diff --git a/llvm/include/llvm/Support/GenericLoopInfo.h b/llvm/include/llvm/Support/GenericLoopInfo.h
index d560ca648132c9..83cacf864089cc 100644
--- a/llvm/include/llvm/Support/GenericLoopInfo.h
+++ b/llvm/include/llvm/Support/GenericLoopInfo.h
@@ -294,6 +294,10 @@ template <class BlockT, class LoopT> class LoopBase {
   /// Otherwise return null.
   BlockT *getUniqueExitBlock() const;
 
+  /// Return the exit block for the latch if one exists. This function assumes
+  /// the loop has a latch.
+  BlockT *getLatchExitBlock() const;
+
   /// Return true if this loop does not have any exit blocks.
   bool hasNoExitBlocks() const;
 
diff --git a/llvm/include/llvm/Support/GenericLoopInfoImpl.h b/llvm/include/llvm/Support/GenericLoopInfoImpl.h
index 1e0d0ee446fc41..3beb3e538398ef 100644
--- a/llvm/include/llvm/Support/GenericLoopInfoImpl.h
+++ b/llvm/include/llvm/Support/GenericLoopInfoImpl.h
@@ -159,6 +159,16 @@ BlockT *LoopBase<BlockT, LoopT>::getUniqueExitBlock() const {
   return getExitBlockHelper(this, true).first;
 }
 
+template <class BlockT, class LoopT>
+BlockT *LoopBase<BlockT, LoopT>::getLatchExitBlock() const {
+  BlockT *Latch = getLoopLatch();
+  assert(Latch && "Latch block must exists");
+  for (BlockT *Successor : children<BlockT *>(Latch))
+    if (!contains(Successor))
+      return Successor;
+  return nullptr;
+}
+
 /// getExitEdges - Return all pairs of (_inside_block_,_outside_block_).
 template <class BlockT, class LoopT>
 void LoopBase<BlockT, LoopT>::getExitEdges(
diff --git a/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h b/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h
index 62c1e15a9a60e1..05850f864d042a 100644
--- a/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h
+++ b/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h
@@ -124,6 +124,11 @@ class SCEVExpander : public SCEVVisitor<SCEVExpander, Value *> {
   /// "expanded" form.
   bool LSRMode;
 
+  /// If the loop has an early exit we may have to use the speculative backedge
+  /// count, since the normal backedge count function is unable to compute a
+  /// SCEV expression.
+  bool UseSpeculativeBackedgeCount;
+
   typedef IRBuilder<InstSimplifyFolder, IRBuilderCallbackInserter> BuilderType;
   BuilderType Builder;
 
@@ -176,10 +181,12 @@ class SCEVExpander : public SCEVVisitor<SCEVExpander, Value *> {
 public:
   /// Construct a SCEVExpander in "canonical" mode.
   explicit SCEVExpander(ScalarEvolution &se, const DataLayout &DL,
-                        const char *name, bool PreserveLCSSA = true)
+                        const char *name, bool PreserveLCSSA = true,
+                        bool UseSpeculativeBackedgeCount = false)
       : SE(se), DL(DL), IVName(name), PreserveLCSSA(PreserveLCSSA),
         IVIncInsertLoop(nullptr), IVIncInsertPos(nullptr), CanonicalMode(true),
         LSRMode(false),
+        UseSpeculativeBackedgeCount(UseSpeculativeBackedgeCount),
         Builder(se.getContext(), InstSimplifyFolder(DL),
                 IRBuilderCallbackInserter(
                     [this](Instruction *I) { rememberInstruction(I); })) {
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index a509ebf6a7e1b3..20a53abeb2e5cc 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -374,6 +374,24 @@ class LoopVectorizationLegality {
     return LAI->getDepChecker().getMaxSafeVectorWidthInBits();
   }
 
+  /// Returns true if the loop has an early exit with an exact backedge
+  /// count that is speculative.
+  bool hasSpeculativeEarlyExit() const {
+    return LAI && LAI->getSpeculativeEarlyExitingBlock();
+  }
+
+  /// Returns the early exiting block in a loop with a speculative backedge
+  /// count.
+  BasicBlock *getSpeculativeEarlyExitingBlock() const {
+    return LAI->getSpeculativeEarlyExitingBlock();
+  }
+
+  /// Returns the destination of an early exiting block in a loop with a
+  /// speculative backedge count.
+  BasicBlock *getSpeculativeEarlyExitBlock() const {
+    return LAI->getSpeculativeEarlyExitBlock();
+  }
+
   /// Returns true if vector representation of the instruction \p I
   /// requires mask.
   bool isMaskRequired(const Instruction *I) const {
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 3bfc9700a14559..32e5816644310a 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -730,6 +730,9 @@ class AccessAnalysis {
     return UnderlyingObjects;
   }
 
+  /// Returns true if we cannot prove the loop will not fault.
+  bool mayFault();
+
 private:
   typedef MapVector<MemAccessInfo, SmallSetVector<Type *, 1>> PtrAccessMap;
 
@@ -1281,6 +1284,63 @@ bool AccessAnalysis::canCheckPtrAtRT(RuntimePointerChecking &RtCheck,
   return CanDoRTIfNeeded;
 }
 
+bool AccessAnalysis::mayFault() {
+  auto &DL = TheLoop->getHeader()->getModule()->getDataLayout();
+  for (auto &UO : UnderlyingObjects) {
+    // TODO: For now if we encounter more than one underlying object we just
+    // assume it could fault. However, with more analysis it's possible to look
+    // at all of them and calculate a common range of permitted GEP indices.
+    if (UO.second.size() != 1)
+      return true;
+
+    // For now only the simplest cases are permitted, but this could be
+    // extended further.
+    auto *GEP = dyn_cast<GetElementPtrInst>(UO.first);
+    if (!GEP || GEP->getPointerOperand() != UO.second[0] ||
+        GEP->getNumIndices() != 1)
+      return true;
+
+    // Verify pointer accessed within the loop always falls within the bounds
+    // of the underlying object, but first it's necessary to determine the
+    // object size.
+
+    auto GetKnownObjSize = [&](const Value *Obj) -> uint64_t {
+      // TODO: We should also be able to support global variables too.
+      if (auto *AllocaObj = dyn_cast<AllocaInst>(Obj)) {
+        if (TheLoop->isLoopInvariant(AllocaObj))
+          if (std::optional<TypeSize> AllocaSize =
+                  AllocaObj->getAllocationSize(DL))
+            return !AllocaSize->isScalable() ? AllocaSize->getFixedValue() : 0;
+      } else if (auto *ArgObj = dyn_cast<Argument>(Obj))
+        return ArgObj->getDereferenceableBytes();
+      return 0;
+    };
+
+    uint64_t ObjSize = GetKnownObjSize(UO.second[0]);
+    if (!ObjSize)
+      return true;
+
+    Value *GEPInd = GEP->getOperand(1);
+    const SCEV *IndScev = PSE.getSCEV(GEPInd);
+    if (!isa<SCEVAddRecExpr>(IndScev))
+      return true;
+
+    // Calculate the maximum number of addressable elements in the object.
+    uint64_t ElemSize = GEP->getSourceElementType()->getScalarSizeInBits() / 8;
+    uint64_t MaxNumElems = ObjSize / ElemSize;
+
+    const SCEV *MinScev = PSE.getSE()->getConstant(GEPInd->getType(), 0);
+    const SCEV *MaxScev =
+        PSE.getSE()->getConstant(GEPInd->getType(), MaxNumElems);
+    if (!PSE.getSE()->isKnownOnEveryIteration(
+            ICmpInst::ICMP_SGE, cast<SCEVAddRecExpr>(IndScev), MinScev) ||
+        !PSE.getSE()->isKnownOnEveryIteration(
+            ICmpInst::ICMP_SLT, cast<SCEVAddRecExpr>(IndScev), MaxScev))
+      return true;
+  }
+  return false;
+}
+
 void AccessAnalysis::processMemAccesses() {
   // We process the set twice: first we process read-write pointers, last we
   // process read-only pointers. This allows us to skip dependence tests for
@@ -2292,6 +2352,73 @@ void MemoryDepChecker::Dependence::print(
   OS.indent(Depth + 2) << *Instrs[Destination] << "\n";
 }
 
+bool LoopAccessInfo::isAnalyzableEarlyExitLoop() {
+  // At least one of the exiting blocks must be the latch.
+  BasicBlock *LatchBB = TheLoop->getLoopLatch();
+  if (!LatchBB)
+    return false;
+
+  SmallVector<BasicBlock *, 8> ExitingBlocks;
+  TheLoop->getExitingBlocks(ExitingBlocks);
+
+  // This is definitely not an early exit loop.
+  if (ExitingBlocks.size() < 2)
+    return false;
+
+  SmallVector<BasicBlock *, 4> ExactExitingBlocks;
+  PSE->getSE()->getExactExitingBlocks(TheLoop, &ExactExitingBlocks);
+
+  // We only support one speculative early exit.
+  if ((ExitingBlocks.size() - ExactExitingBlocks.size()) > 1)
+    return false;
+
+  // There could be multiple exiting blocks with an exact exit-not-taken
+  // count. Find the speculative early exit block, i.e. the one with an
+  // unknown count.
+  BasicBlock *TmpBB = nullptr;
+  for (BasicBlock *BB1 : ExitingBlocks) {
+    bool Found = false;
+    for (BasicBlock *BB2 : ExactExitingBlocks)
+      if (BB1 == BB2) {
+        Found = true;
+        break;
+      }
+    if (!Found) {
+      TmpBB = BB1;
+      break;
+    }
+  }
+  assert(TmpBB && "Expected to find speculative early exiting block");
+
+  // For now, let's keep things simple by ensuring the latch block only has
+  // the exiting block as a predecessor.
+  BasicBlock *LatchPredBB = LatchBB->getUniquePredecessor();
+  if (!LatchPredBB || LatchPredBB != TmpBB)
+    return false;
+
+  LLVM_DEBUG(
+      dbgs()
+      << "LAA: Found an early exit. Retrying with speculative exit count.\n");
+  const SCEV *SpecExitCount = PSE->getSpeculativeBackedgeTakenCount();
+  if (isa<SCEVCouldNotCompute>(SpecExitCount))
+    return false;
+
+  LLVM_DEBUG(dbgs() << "LAA: Found speculative backedge taken count: "
+                    << *SpecExitCount << '\n');
+  SpeculativeEarlyExitingBB = TmpBB;
+
+  for (BasicBlock *BB : successors(SpeculativeEarlyExitingBB))
+    if (BB != LatchBB) {
+      SpeculativeEarlyExitBB = BB;
+      break;
+    }
+  assert(SpeculativeEarlyExitBB &&
+         "Expected to find speculative early exit block");
+  CountableEarlyExitBlocks = std::move(ExactExitingBlocks);
+
+  return true;
+}
+
 bool LoopAccessInfo::canAnalyzeLoop() {
   // We need to have a loop header.
   LLVM_DEBUG(dbgs() << "LAA: Found a loop in "
@@ -2317,10 +2444,12 @@ bool LoopAccessInfo::canAnalyzeLoop() {
   // ScalarEvolution needs to be able to find the exit count.
   const SCEV *ExitCount = PSE->getBackedgeTakenCount();
   if (isa<SCEVCouldNotCompute>(ExitCount)) {
-    recordAnalysis("CantComputeNumberOfIterations")
-        << "could not determine number of loop iterations";
     LLVM_DEBUG(dbgs() << "LAA: SCEV could not compute the loop exit count.\n");
-    return false;
+    if (!isAnalyzableEarlyExitLoop()) {
+      recordAnalysis("CantComputeNumberOfIterations")
+          << "could not determine number of loop iterations";
+      return false;
+    }
   }
 
   return true;
@@ -2352,6 +2481,9 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
       EnableMemAccessVersioning &&
       !TheLoop->getHeader()->getParent()->hasOptSize();
 
+  BasicBlock *LatchBB = TheLoop->getLoopLatch();
+  bool HasComplexWorkInEarlyExitLoop = false;
+
   // Traverse blocks in fixed RPOT order, regardless of their storage in the
   // loop info, as it may be arbitrary.
   LoopBlocksRPO RPOT(TheLoop);
@@ -2367,7 +2499,8 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
 
       // With both a non-vectorizable memory instruction and a convergent
       // operation, found in this loop, no reason to continue the search.
-      if (HasComplexMemInst && HasConvergentOp) {
+      if ((HasComplexMemInst && HasConvergentOp) ||
+          HasComplexWorkInEarlyExitLoop) {
         CanVecMem = false;
         return;
       }
@@ -2385,6 +2518,14 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
       // vectorize a loop if it contains known function calls that don't set
       // the flag. Therefore, it is safe to ignore this read from memory.
       auto *Call = dyn_cast<CallInst>(&I);
+      if (Call && SpeculativeEarlyExitingBB) {
+        recordAnalysis("CantVectorizeInstruction", Call)
+            << "cannot vectorize calls in early exit loop";
+        LLVM_DEBUG(dbgs() << "LAA: Found a call in early exit loop.\n");
+        HasComplexWorkInEarlyExitLoop = true;
+        continue;
+      }
+
       if (Call && getVectorIntrinsicIDForCall(Call, TLI))
         continue;
 
@@ -2412,6 +2553,13 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
           HasComplexMemInst = true;
           continue;
         }
+        if (SpeculativeEarlyExitingBB && BB == LatchBB) {
+          recordAnalysis("CantVectorizeInstruction", Call)
+              << "cannot vectorize loads after early exit block";
+          LLVM_DEBUG(dbgs() << "LAA: Found a load after early exit.\n");
+          HasComplexWorkInEarlyExitLoop = true;
+          continue;
+        }
         NumLoads++;
         Loads.push_back(Ld);
         DepChecker->addAccess(Ld);
@@ -2423,6 +2571,13 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
       // Save 'store' instructions. Abort if othe...
[truncated]

@@ -10260,7 +10562,11 @@ bool LoopVectorizePass::processLoop(Loop *L) {
     Hints.setAlreadyVectorized();
   }
 
-  assert(!verifyFunction(*L->getHeader()->getParent(), &dbgs()));
+  // assert(!verifyFunction(*L->getHeader()->getParent(), &dbgs()));
Contributor Author

My apologies - I just realised this debug code is still present. I'll fix asap!

@@ -2423,6 +2571,13 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
       // Save 'store' instructions. Abort if other instructions write to memory.
       if (I.mayWriteToMemory()) {
         auto *St = dyn_cast<StoreInst>(&I);
+        if (SpeculativeEarlyExitingBB) {
+          recordAnalysis("CantVectorizeInstruction", St)
Collaborator

What happens if St is null here?

Contributor Author

There is a similar problem in the code below too - recordAnalysis simply uses the debug information for the loop instead, but it won't crash. However, I think it makes sense to record the instruction using I instead and I'll update the message to show that it might not be a store.

// a later poison exit count should not propagate into the result. This are
// exactly the semantics provided by umin_seq.
return SE->getUMinFromMismatchedTypes(Ops, /* Sequential */ true);
}
Contributor

How does this "speculative" BECount differ from the SymbolicMax BECount?

Contributor Author

Hmm, it does seem like it does the same thing, except there is no predicated version that accepts a vector of SCEVPredicate pointers, which is required for getPredicatedBackedgeTakenCount. I can try adding a predicated version of getSymbolicMax to see if that works.

Contributor Author

The other major difference between getSpeculative and getSymbolicMax is that the former requires the exact exit-not-taken count for the latch to be known, whereas the latter doesn't care. So I think in order to use something close to the existing getSymbolicMax interface I'll need to do two things:

  1. Rewrite getSymbolicMax (or add an overloaded interface) so that it's a const interface (allowing it to be called from getPredicatedBackedgeTakenCount). Also, add a SmallVector<const SCEVPredicate *, 4> *Predicates argument.
  2. Add code to getPredicatedBackedgeTakenCount to explicitly check we have an exact exit-not-taken count for the latch.

I'm happy to do this of course - just pointing out that getSymbolicMax isn't a drop-in replacement, that's all. I'll try it out and see if I get the same behaviour as before.
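
The difference being discussed can be illustrated with a toy model. This is a deliberate simplification and not ScalarEvolution's real data structures: `ExitInfo`, `symbolicMax` and `speculative` are hypothetical names, plain integers stand in for SCEV expressions, and an empty optional stands in for SCEVCouldNotCompute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model of one exiting block: its exit-not-taken count, if known.
struct ExitInfo {
  bool IsLatch;
  std::optional<std::uint64_t> Count; // nullopt models SCEVCouldNotCompute
};

// Symbolic-max style: minimum over the exits whose count is known,
// ignoring uncountable exits. It does not care which exit is the latch.
std::optional<std::uint64_t> symbolicMax(const std::vector<ExitInfo> &Exits) {
  std::optional<std::uint64_t> Result;
  for (const ExitInfo &E : Exits)
    if (E.Count)
      Result = Result ? std::min(*Result, *E.Count) : *E.Count;
  return Result;
}

// Speculative style as described in this thread: the same minimum, but
// only when the latch exit itself has a known count.
std::optional<std::uint64_t> speculative(const std::vector<ExitInfo> &Exits) {
  for (const ExitInfo &E : Exits)
    if (E.IsLatch && E.Count)
      return symbolicMax(Exits);
  return std::nullopt;
}
```

In this model a loop with an uncountable early exit and a countable latch gets the same answer from both functions, while a loop whose latch is uncountable still has a symbolic maximum but no speculative count, which is the extra check item 2 above would add.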

Contributor Author

I've tried posting a new commit that teaches getPredicatedBackedgeTakenCount to use a version of getSymbolicMax that accepts predicates, provided we have an exact count for the latch. Hopefully this makes better reuse of the code.

+  bool LoopMayFault = false;
+
+  /// Keeps track of the early-exiting block, if present.
+  BasicBlock *SpeculativeEarlyExitingBB = nullptr;
Contributor

Maybe a better name would be just EarlyExitingBB? At least to me, Speculative implies that there's speculation on memory, i.e. that it refers to BBs with may-fault accesses.

Contributor Author

In a sense though that is not far off the truth, because when vectorising the loop we are by definition reading ahead in memory which could potentially cause a fault where the scalar loop would not. However, the main reason I added the word 'Speculative' was to distinguish between early exits with exact exit-not-taken counts (which the vectoriser does support) and early exits that cannot be counted.

I'd prefer not to call it EarlyExitingBB to avoid any possible confusion, but I'm happy to take suggestions for better alternative names. Perhaps UncountableEarlyExitingBB?
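
To make the distinction concrete, here are two small loops written for this illustration (the function names are hypothetical). In the first, the early exit depends only on the induction variable, so its exit-not-taken count can be computed before the loop runs. In the second, the exit depends on loaded data, so no such count exists.

```cpp
#include <cassert>
#include <cstddef>

// Countable early exit: the extra exit tests only the induction variable,
// so the loop is known to run min(N, Limit) iterations.
long sumFirst(const int *P, std::size_t N, std::size_t Limit) {
  long Sum = 0;
  for (std::size_t I = 0; I < N; ++I) {
    if (I == Limit)
      break; // exit-not-taken count is known up front
    Sum += P[I];
  }
  return Sum;
}

// Uncountable early exit: the exit tests a value loaded from memory, so
// the iteration at which it fires cannot be computed in advance.
std::size_t findValue(const int *P, std::size_t N, int Val) {
  for (std::size_t I = 0; I < N; ++I)
    if (P[I] == Val)
      return I; // depends on the contents of P
  return N;
}
```

The second shape is the one this PR targets; the first is the kind of early exit the thread describes as already supported.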

auto *UI = cast<Instruction>(U);
if (!L->contains(UI)) {
PHINode *PHI = dyn_cast<PHINode>(UI);
assert(PHI && "Expected LCSSA form");
Collaborator

Nit: checking LCSSA form could be hoisted and checked earlier and just once?

Contributor Author

I can't hoist this assert out, since it's based upon the User U which varies in each loop iteration.

david-arm added a commit to david-arm/llvm-project that referenced this pull request May 1, 2024
In PR llvm#88385 I've added support for auto-vectorisation of some
early exit loops, which requires using the experimental.cttz.elts
to calculate final indices in the early exit block. We need a
more accurate cost model for this intrinsic to better reflect the
cost of work required in the early exit block. I've tried to
accurately represent the expansion code for the intrinsic when
the target does not have efficient lowering for it. It's quite
tricky to model because you need to first figure out what types
will actually be used in the expansion. The type used can have
a significant effect on the cost if you end up using illegal
vector types.

Tests added here:

  Analysis/CostModel/AArch64/cttz_elts.ll
  Analysis/CostModel/RISCV/cttz_elts.ll
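
For readers unfamiliar with the intrinsic, here is a scalar model of what a "count trailing zero elements" operation computes on a mask. It is a sketch only: the function name `cttzElts` and the use of `std::vector<bool>` are assumptions for illustration, and it says nothing about how a target lowers or costs the real intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model: returns the index of the first true lane in the mask, which
// equals the number of false lanes before it. If no lane is set, the result
// is the number of lanes.
std::size_t cttzElts(const std::vector<bool> &mask) {
  assert(!mask.empty() && "a vector mask has at least one lane");
  for (std::size_t lane = 0; lane < mask.size(); ++lane)
    if (mask[lane])
      return lane;
  return mask.size();
}
```

In an early exit block the mask would be the per-lane exit condition, so the result identifies the lane that left the loop first.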
@sjoerdmeijer
Collaborator

I was wondering if it would be good to add some AArch64 codegen tests too so that we can look at some codegen?

@david-arm
Contributor Author

david-arm commented May 1, 2024

I was wondering if it would be good to add some AArch64 codegen tests too so that we can look at some codegen?

If you're referring to the codegen coming out of clang after vectorising the loop, I don't think we typically have tests like that in test/Transforms/LoopVectorize. They are normally IR/opt-based tests. Are you referring specifically to the codegen from the cttz.elts intrinsic? If so, we already have tests for it - see CodeGen/AArch64/intrinsic-cttz-elts-sve.ll, for example.

@sjoerdmeijer
Collaborator

I was wondering if it would be good to add some AArch64 codegen tests too so that we can look at some codegen?

If you're referring to the codegen coming out of clang after vectorising the loop, I don't think we typically have tests like that in test/Transforms/LoopVectorize. They are normally IR/opt-based tests. Are you referring specifically to the codegen from the cttz.elts intrinsic? If so, we already have tests for it - see CodeGen/AArch64/intrinsic-cttz-elts-sve.ll, for example.

Yes, I appreciate we test all things individually, but I was just thinking that it is a bit of a shame we can't look at some codegen for a loop for all of this work. For example, take the resulting IR of some of the tests in test/Transforms/LoopVectorize/AArch64, and create llc tests. Not sure if there's precedent for that, I guess not.

@fhahn
Contributor

fhahn commented May 7, 2024

Yes, I appreciate we test all things individually, but I was just thinking that it is a bit of a shame we can't look at some codegen for a loop for all of this work. For example, take the resulting IR of some of the tests in test/Transforms/LoopVectorize/AArch64, and create llc tests. Not sure if there's precedent for that, I guess not.

It would probably make sense to have some micro-benchmarks for some loops with varying trip counts (both statically known and unknown) to cover the end-to-end flow and allow for easy evaluation. Sharing the generated assembly end-to-end for some of those might help, as @sjoerdmeijer suggested?

(I don't think we should add end-to-end tests directly to llvm-project/llvm/test/ that run the vectorizer, and possibly other passes, all the way down to assembly.)

Collaborator

@sjoerdmeijer sjoerdmeijer left a comment

This patch basically contains two parts: the LAA/SCEV and the vectorisation part.

I have only looked at the vectorisation part and that looks good to me:

  • Thanks for taking the cost-model remarks into account; the added logic seems like a good first step.
  • The option to vectorise loops with early breaks is off by default. This allows us to experiment more with this, and possibly refine the cost model, without creating a lot of turbulence.
  • It's a shame we can't look at final codegen for this sort of patch, but that is not a problem of this patch. I like the idea of some microbenchmarks for this, but given that this is off by default I don't think that this needs to hold up this patch.

So, LGTM, but I haven't looked at the LAA part, perhaps @nikic or @nikolaypanchenko can sign off on that part.

NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
This work is in preparation for PRs llvm#112138 and llvm#88385 where
the middle block is not guaranteed to be the immediate successor
to the region block. I've simply added new getMiddleBlock()
interfaces to VPlan that for now just return

cast<VPBasicBlock>(VectorRegion->getSingleSuccessor())

Once PR llvm#112138 lands we'll need to do more work to discover
the middle block.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 4, 2024
With PR llvm#88385 I am introducing support for vectorising more
loops with early exits that don't require a scalar epilogue.
As such, if a loop doesn't have a unique exit block it will
not automatically imply we require a scalar epilogue. Also,
in the only place in the code today where we use the variable
LoopExitBlock we actually mean the exit block from the latch.
Therefore, it seemed reasonable to add a new
getUniqueLatchExitBlock helper that allows the caller to
determine the exit block taken from the latch and use this
instead of getUniqueExitBlock. I also removed LoopExitBlock
since it was only used in one place.

While doing this I also noticed that one of the comments in
requiresScalarEpilogue is wrong when we require a scalar
epilogue, i.e. when we're not exiting from the latch block.
This doesn't always imply we have multiple exits, e.g. see
the test in

Transforms/LoopVectorize/unroll_nonlatch.ll

where the latch unconditionally branches back to the only
exiting block.
david-arm added a commit that referenced this pull request Nov 6, 2024
With PR #88385 I am introducing support for vectorising more loops with
early exits that don't require a scalar epilogue. As such, if a loop
doesn't have a unique exit block it will not automatically imply we
require a scalar epilogue. Also, in all places in the code today where
we use the variable LoopExitBlock we actually mean the exit block from
the latch. Therefore, it seemed reasonable to add a new
getUniqueLatchExitBlock that allows the caller to determine the exit
block taken from the latch and use this instead of getUniqueExitBlock. I
also renamed LoopExitBlock to be LatchExitBlock. I feel this not only
better reflects how the variable is used today, but also prepares the
code for PR #88385.

While doing this I also noticed that one of the comments in
requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e.
when we're not exiting from the latch block. This doesn't always imply
we have multiple exits, e.g. see the test in

Transforms/LoopVectorize/unroll_nonlatch.ll

where the latch unconditionally branches back to the only exiting block.
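
To illustrate the idea of asking for the exit block reached from the latch specifically, here is a sketch on a toy CFG. `Block` and `ToyLoop` are hypothetical stand-ins and not the LLVM classes; only the helper's name is borrowed from the commit message, and its body here is an assumption about the intended behaviour.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy CFG node; this is NOT llvm::BasicBlock.
struct Block {
  std::string name;
  std::vector<const Block *> succs;
};

// Toy loop: the set of member blocks plus the latch.
struct ToyLoop {
  std::set<const Block *> blocks;
  const Block *latch;

  bool contains(const Block *b) const { return blocks.count(b) != 0; }

  // Returns the single block outside the loop that the latch branches to,
  // or nullptr if the latch has no such successor or more than one.
  const Block *getUniqueLatchExitBlock() const {
    assert(latch && "loop must have a latch");
    const Block *exit = nullptr;
    for (const Block *s : latch->succs) {
      if (contains(s))
        continue; // back edge or other in-loop successor
      if (exit && exit != s)
        return nullptr; // two distinct exits from the latch
      exit = s;
    }
    return exit;
  }
};
```

With an early exiting header and an exiting latch, the loop has two exit blocks overall, yet the latch exit is still unique. If the latch only branches back to the header, the helper returns nullptr even though the loop has a single exit.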
david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 13, 2024
Caching the decision returned by requiresScalarEpilogue means that
we can avoid printing out the same debug message many times, and also
avoids repeating the same calculation. This function will get more
complex when we start to reason about more early exit loops, such
as in PR llvm#88385. The only problem with this is we sometimes have to
invalidate the previous result due to changes in the scalar epilogue
status or interleave groups.
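
The caching pattern described above can be sketched as follows. `CachedDecision` is a hypothetical helper written for this illustration and is not the code in the patch; the point is only that a cached answer must be dropped whenever one of its inputs changes.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical cache for an expensive yes/no decision.
class CachedDecision {
public:
  explicit CachedDecision(std::function<bool()> compute)
      : compute_(std::move(compute)) {
    assert(compute_ && "a compute callback is required");
  }

  // Computes the answer on first use, then returns the stored value.
  bool get() {
    if (!valid_) {
      value_ = compute_();
      valid_ = true;
    }
    return value_;
  }

  // Must be called whenever an input to the decision changes.
  void invalidate() { valid_ = false; }

private:
  std::function<bool()> compute_;
  bool valid_ = false;
  bool value_ = false;
};
```

The cost of this design is exactly the one the commit message names: every place that changes an input has to remember to call `invalidate()`.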
david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 20, 2024
PR llvm#112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:

1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR llvm#88385.
6. The new split middle blocks weren't being added to the
parent loop.
7. VPIRInstruction::execute assumed that the VPIRBasicBlock
predecessors correspond like-for-like with the predecessors
of the scalar exit block prior to vectorisation. For example,
collectUsersInExitBlocks adds the operands to the
VPIRInstruction in the order returned by predecessors(ExitBB),
whereas VPIRInstruction::execute processes the operands in
order of the VPIRBasicBlock predecessors. There is absolutely
no guarantee that they match up, which in some cases (such as
the yacr2 test in the LLVM test suite) they don't. I've fixed
this by maintaining the old behaviour when there is a single
operand, and when there are 2 or more operands we use the
same ordering as the BasicBlock predecessors.
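
The hazard in item 7 is a positional mismatch between two lists that are assumed to line up. The sketch below shows one generic way to avoid it, by pairing each incoming value with its predecessor and looking values up by name. This is not the fix applied in the patch, which instead uses the BasicBlock predecessor ordering; the types and function name are invented for the example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Each incoming value is tagged with the predecessor it arrives from.
using Incoming = std::pair<std::string, int>; // (predecessor name, value)

// Returns the values reordered to follow predOrder, matching by predecessor
// name instead of assuming both lists use the same positions.
std::vector<int> valuesInPredOrder(const std::vector<Incoming> &incoming,
                                   const std::vector<std::string> &predOrder) {
  assert(incoming.size() == predOrder.size());
  std::vector<int> result;
  result.reserve(predOrder.size());
  for (const std::string &pred : predOrder) {
    bool found = false;
    for (const Incoming &in : incoming) {
      if (in.first == pred) {
        result.push_back(in.second);
        found = true;
        break;
      }
    }
    assert(found && "every predecessor needs an incoming value");
    (void)found; // silence unused-variable warnings when asserts are off
  }
  return result;
}
```

If the two lists were consumed by position instead, the value meant for the latch would be attached to the early exit whenever the orders differ.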
david-arm added a commit to david-arm/llvm-project that referenced this pull request Dec 11, 2024
PR llvm#112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:

1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR llvm#88385.
6. The new split middle blocks weren't being added to the
parent loop.
fhahn added a commit that referenced this pull request Dec 11, 2024
A more lightweight variant of
#109193,
which dispatches to multiple exit blocks via the middle blocks.

The patch also introduces a bit of required scaffolding to enable
early-exit vectorization, including an option. At the moment, early-exit
vectorization doesn't come with legality checks, and is only used if the
option is provided and the loop has metadata forcing vectorization. This
is only intended to be used for testing during bring-up, with @david-arm
enabling auto early-exit vectorization plugging in the changes from
#88385.

PR: #112138
david-arm added a commit that referenced this pull request Dec 18, 2024
PR #112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:

1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR #88385.
6. The new split middle blocks weren't being added to the
parent loop.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Dec 19, 2024
This work feeds part of PR llvm#88385, and adds support for
vectorising loops with uncountable early exits and outside users of
loop-defined variables.

I've added a new fixupEarlyExitIVUsers to mirror what happens in
fixupIVUsers when patching up outside users of induction variables
in the early exit block. We have to handle these differently for two
reasons:

1. We can't work backwards from the end value in the middle block
because we didn't leave at the last iteration.
2. We need to generate different IR that calculates the vector lane
that triggered the exit, and hence can determine the induction value
at the point we exited.
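
The two points above can be modelled in scalar C++. This is a sketch under stated assumptions, not the IR the vectoriser emits: `VF` is an illustrative width, `n` is assumed to be a multiple of it, and every load in a chunk is assumed to be safe.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 4; // illustrative vectorisation factor

// Scalar model of a vectorised search loop. Assumes n is a multiple of VF
// and that reading a full chunk never faults.
std::size_t findFirstChunked(const int *p, std::size_t n, int val) {
  assert(n % VF == 0 && "sketch has no scalar remainder loop");
  for (std::size_t base = 0; base < n; base += VF) {
    // "Vector compare": one mask bit per lane.
    bool mask[VF];
    bool any = false;
    for (std::size_t lane = 0; lane < VF; ++lane) {
      mask[lane] = (p[base + lane] == val);
      any = any || mask[lane];
    }
    if (any) {
      // Early exit: find the first lane that triggered it, then rebuild the
      // induction value as chunk start + lane.
      std::size_t lane = 0;
      while (!mask[lane])
        ++lane;
      return base + lane;
    }
  }
  // Latch exit: the final induction value is simply n.
  return n;
}
```

The latch exit can use the known end value, whereas the early exit has to reconstruct the value from the chunk start and the triggering lane.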

I've added a new 'null' VPValue as a dummy placeholder to manage
the incoming operands of PHI nodes in the exit block. We can have
situations where one of the incoming values is an induction variable
(or its update) and the other is not. For example, both the latch
and the early exiting block can jump to the same exit block. However,
VPInstruction::generate walks through all predecessors of the PHI
assuming the value is *not* an IV. In order to ensure that we process
the right value for the right incoming block we use this new 'null'
value as a marker to indicate it should be skipped, since it will be
handled separately in fixupIVUsers or fixupEarlyExitIVUsers.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. I also had to fix
up the vplan verifier because it assumed that the block containing
a definition always dominated the parent of the user. That's no
longer the case because we can arrive at the exit block via one of
the latch or the early exiting block.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.
@david-arm
Contributor Author

I realise this PR is very out of date. For reference, this work is nearing completion, but it has been done through many smaller PRs, including a nice patch by @fhahn adding support for multiple exits in VPlan. Here is one of the remaining PRs, which adds support for loops with live-outs: #120567

I also have a WIP patch to add support for versioning early exit loops with potentially faulting pointers: #120603

david-arm added a commit to david-arm/llvm-project that referenced this pull request Jan 9, 2025
This work feeds part of PR llvm#88385, and adds support for vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
the following fix:

* The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.

NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
david-arm added a commit that referenced this pull request Jan 15, 2025
…AlignedInLoop (#96752)

Currently when we encounter a negative step in the induction
variable isDereferenceableAndAlignedInLoop bails out because
the element size is signed greater than the step. This patch
adds support for negative steps in cases where we detect the
start address for the load is of the form base + offset. In
this case the address decrements in each iteration so we need
to calculate the access size differently. I have done this by
calling getStartAndEndForAccess from LoopAccessAnalysis.cpp.

The motivation for this patch comes from PR #88385 where a
reviewer requested reusing isDereferenceableAndAlignedInLoop,
but that PR itself does support reverse loops.

The changed test in LoopVectorize/X86/load-deref-pred.ll now
passes because previously we were calculating the total access
size incorrectly, whereas now it is 412 bytes and fits
perfectly into the alloca.
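
To show why a negative step changes the access-size calculation, here is a small worked sketch. It is not getStartAndEndForAccess; the struct, function names and numbers are assumptions chosen for illustration and are not taken from the test mentioned above.

```cpp
#include <cassert>
#include <cstdint>

struct ByteRange {
  std::int64_t lo; // first byte touched (inclusive), relative to the object
  std::int64_t hi; // one past the last byte touched
};

// Byte range covered by tripCount accesses of elemSize bytes each, where the
// first access is at startOffset and each iteration moves by step bytes.
ByteRange accessRange(std::int64_t startOffset, std::int64_t step,
                      std::int64_t tripCount, std::int64_t elemSize) {
  assert(tripCount > 0 && elemSize > 0);
  std::int64_t lastOffset = startOffset + (tripCount - 1) * step;
  if (step >= 0)
    return {startOffset, lastOffset + elemSize};
  // Negative step: the last iteration touches the lowest address, and the
  // first iteration's element marks the upper end of the range.
  return {lastOffset, startOffset + elemSize};
}

// True if the whole range lies inside an object of objectSize bytes.
bool fitsInObject(ByteRange r, std::int64_t objectSize) {
  return r.lo >= 0 && r.hi <= objectSize;
}
```

For example, 100 four-byte accesses starting at offset 396 with a step of -4 cover bytes 0 to 400, the same range as a forward walk from offset 0.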
david-arm added a commit to david-arm/llvm-project that referenced this pull request Jan 27, 2025
This work feeds part of PR llvm#88385, and adds support for vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
the following fix:

* The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.

NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
david-arm added a commit that referenced this pull request Jan 30, 2025
…ts (#120567)

This work feeds part of PR
#88385, and adds support for
vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
two fixes:

1. The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.
2. We were adding the new vector.early.exit to the wrong parent loop.
It needs to have the same parent as the actual early exit block from
the original loop.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.
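
A scalar model of this operation, assuming it returns the element held in the first lane whose predicate bit is set. The function below and the constant `kLanes` are illustrative only and are not the VPlan implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4; // illustrative vector width

// Returns the element of values in the first lane whose mask bit is set.
// The caller must guarantee at least one active lane, which holds on the
// early exit path because some lane triggered the exit.
int extractFirstActive(const std::array<int, kLanes> &values,
                       const std::array<bool, kLanes> &mask) {
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    if (mask[lane])
      return values[lane];
  assert(false && "no active lane");
  return values[0];
}
```

For a live-out, `values` would hold the per-lane results of the loop-defined variable and `mask` the per-lane exit condition.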

NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
9 participants