[LoopVectorize] Add support for vectorisation of more early exit loops #88385
@llvm/pr-subscribers-backend-systemz @llvm/pr-subscribers-llvm-ir
Author: David Sherwood (david-arm)
Changes
This patch adds support for vectorisation of a simple class of loops that typically involves searching for something, i.e.
or
In this initial commit we only vectorise loops with the following criteria:
For point 5, once this patch lands I intend to follow up by supporting some limited cases of faulting loops where we can version the loop based on pointer alignment. For example, it turns out that in the SPEC2017 benchmark there is a std::find loop that we can vectorise provided we add SCEV checks for the initial pointer being aligned to a multiple of the VF. In practice, the pointer is regularly aligned to at least 32/64 bytes and, since the VF is a power of 2, any vector load of at most 32/64 bytes that faults will do so on the first lane, following the same behaviour as the scalar loop. Given we already do such speculative versioning for loops with unknown strides, alignment-based versioning doesn't seem to be any worse, at least for loops with only one load.
This patch makes use of the existing experimental_cttz_elts intrinsic, which is required in the vectorised early exit block to determine the first lane that triggered the exit. This intrinsic has generic lowering support, so it's guaranteed to work for all targets.
Tests have been added here: Transforms/LoopVectorize/AArch64/simple_early_exit.ll
Patch is 226.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/88385.diff
18 Files Affected:
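As a rough illustration of the description above (the example loops from the original description are not reproduced on this page, so findFirst below is a hypothetical stand-in rather than the author's example), a std::find-style search leaves the loop as soon as a match is seen. The crossesPage helper sketches the alignment argument under two assumptions that are mine, not the patch's: pages are 4096 bytes, and the vector width in bytes is a power of 2 no larger than a page. A load whose address is aligned to its own width then never straddles a page boundary, so if it faults at all it faults on its first lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical search loop with a data-dependent early exit.
std::size_t findFirst(const int *Data, std::size_t N, int Key) {
  for (std::size_t I = 0; I < N; ++I)
    if (Data[I] == Key)
      return I; // early exit: taken as soon as the key is found
  return N;     // latch exit: the key was not found
}

// Assumption: PageSize-byte pages, VecBytes a power of 2 and <= PageSize.
// Returns true if a VecBytes-wide access starting at Addr touches two pages.
bool crossesPage(std::uintptr_t Addr, std::size_t VecBytes,
                 std::size_t PageSize = 4096) {
  return (Addr / PageSize) != ((Addr + VecBytes - 1) / PageSize);
}
```

With a 32-byte vector, crossesPage is false for every 32-byte-aligned address, while an access starting 16 bytes before the end of a page does cross into the next page.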
diff --git a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
index e39c371b41ec5c..d79c53f490c927 100644
--- a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
+++ b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -587,6 +587,9 @@ class LoopAccessInfo {
/// not legal to insert them.
bool hasConvergentOp() const { return HasConvergentOp; }
+ /// Return true if the loop may fault due to memory accesses.
+ bool mayFault() const { return LoopMayFault; }
+
const RuntimePointerChecking *getRuntimePointerChecking() const {
return PtrRtChecking.get();
}
@@ -608,6 +611,24 @@ class LoopAccessInfo {
unsigned getNumStores() const { return NumStores; }
unsigned getNumLoads() const { return NumLoads;}
+ /// Returns the block that exits early from the loop, if there is one.
+ /// Otherwise returns nullptr.
+ BasicBlock *getSpeculativeEarlyExitingBlock() const {
+ return SpeculativeEarlyExitingBB;
+ }
+
+ /// Returns the successor of the block that exits early from the loop, if
+ /// there is one. Otherwise returns nullptr.
+ BasicBlock *getSpeculativeEarlyExitBlock() const {
+ return SpeculativeEarlyExitBB;
+ }
+
+ /// Returns all blocks with a countable exit, i.e. the exit-not-taken count
+ /// is known exactly at compile time.
+ const SmallVector<BasicBlock *, 4> &getCountableEarlyExitingBlocks() const {
+ return CountableEarlyExitBlocks;
+ }
+
/// The diagnostics report generated for the analysis. E.g. why we
/// couldn't analyze the loop.
const OptimizationRemarkAnalysis *getReport() const { return Report.get(); }
@@ -659,6 +680,10 @@ class LoopAccessInfo {
/// pass.
bool canAnalyzeLoop();
+ /// Returns true if this is a supported early exit loop that we can analyze
+ /// in this pass.
+ bool isAnalyzableEarlyExitLoop();
+
/// Save the analysis remark.
///
/// LAA does not directly emits the remarks. Instead it stores it which the
@@ -696,6 +721,17 @@ class LoopAccessInfo {
/// Cache the result of analyzeLoop.
bool CanVecMem = false;
bool HasConvergentOp = false;
+ bool LoopMayFault = false;
+
+ /// Keeps track of the early-exiting block, if present.
+ BasicBlock *SpeculativeEarlyExitingBB = nullptr;
+
+ /// Keeps track of the successor of the early-exiting block, if present.
+ BasicBlock *SpeculativeEarlyExitBB = nullptr;
+
+ /// Keeps track of all the early exits with known or countable exit-not-taken
+ /// counts.
+ SmallVector<BasicBlock *, 4> CountableEarlyExitBlocks;
/// Indicator that there are non vectorizable stores to a uniform address.
bool HasDependenceInvolvingLoopInvariantAddress = false;
diff --git a/llvm/include/llvm/Analysis/ScalarEvolution.h b/llvm/include/llvm/Analysis/ScalarEvolution.h
index 5828cc156cc785..562deab8b4159e 100644
--- a/llvm/include/llvm/Analysis/ScalarEvolution.h
+++ b/llvm/include/llvm/Analysis/ScalarEvolution.h
@@ -892,9 +892,13 @@ class ScalarEvolution {
/// Similar to getBackedgeTakenCount, except it will add a set of
/// SCEV predicates to Predicates that are required to be true in order for
/// the answer to be correct. Predicates can be checked with run-time
- /// checks and can be used to perform loop versioning.
- const SCEV *getPredicatedBackedgeTakenCount(const Loop *L,
- SmallVector<const SCEVPredicate *, 4> &Predicates);
+ /// checks and can be used to perform loop versioning. If \p Speculative is
+ /// true, this will attempt to return the speculative backedge count for loops
+ /// with early exits. However, this is only possible if we can formulate an
+ /// exact expression for the backedge count from the latch block.
+ const SCEV *getPredicatedBackedgeTakenCount(
+ const Loop *L, SmallVector<const SCEVPredicate *, 4> &Predicates,
+ bool Speculative = false);
/// When successful, this returns a SCEVConstant that is greater than or equal
/// to (i.e. a "conservative over-approximation") of the value returend by
@@ -912,6 +916,12 @@ class ScalarEvolution {
return getBackedgeTakenCount(L, SymbolicMaximum);
}
+ /// Return all the exiting blocks in with exact exit counts.
+ void getExactExitingBlocks(const Loop *L,
+ SmallVector<BasicBlock *, 4> *Blocks) {
+ getBackedgeTakenInfo(L).getExactExitingBlocks(L, this, Blocks);
+ }
+
/// Return true if the backedge taken count is either the value returned by
/// getConstantMaxBackedgeTakenCount or zero.
bool isBackedgeTakenCountMaxOrZero(const Loop *L);
@@ -1534,6 +1544,16 @@ class ScalarEvolution {
const SCEV *getExact(const Loop *L, ScalarEvolution *SE,
SmallVector<const SCEVPredicate *, 4> *Predicates = nullptr) const;
+ /// Similar to the above, except we permit unknown exit counts from
+ /// non-latch exit blocks. Any such early exit blocks must dominate the
+ /// latch and so the returned expression represents the speculative, or
+ /// maximum possible, *backedge-taken* count of the loop. If there is no
+ /// exact exit count for the latch this function returns
+ /// SCEVCouldNotCompute.
+ const SCEV *getSpeculative(
+ const Loop *L, ScalarEvolution *SE,
+ SmallVector<const SCEVPredicate *, 4> *Predicates = nullptr) const;
+
/// Return the number of times this loop exit may fall through to the back
/// edge, or SCEVCouldNotCompute. The loop is guaranteed not to exit via
/// this block before this number of iterations, but may exit via another
@@ -1541,6 +1561,10 @@ class ScalarEvolution {
const SCEV *getExact(const BasicBlock *ExitingBlock,
ScalarEvolution *SE) const;
+ /// Return all the exiting blocks in with exact exit counts.
+ void getExactExitingBlocks(const Loop *L, ScalarEvolution *SE,
+ SmallVector<BasicBlock *, 4> *Blocks) const;
+
/// Get the constant max backedge taken count for the loop.
const SCEV *getConstantMax(ScalarEvolution *SE) const;
@@ -2316,6 +2340,9 @@ class PredicatedScalarEvolution {
/// Get the (predicated) backedge count for the analyzed loop.
const SCEV *getBackedgeTakenCount();
+ /// Get the (predicated) speculative backedge count for the analyzed loop.
+ const SCEV *getSpeculativeBackedgeTakenCount();
+
/// Adds a new predicate.
void addPredicate(const SCEVPredicate &Pred);
@@ -2384,6 +2411,9 @@ class PredicatedScalarEvolution {
/// The backedge taken count.
const SCEV *BackedgeCount = nullptr;
+
+ /// The speculative backedge taken count.
+ const SCEV *SpeculativeBackedgeCount = nullptr;
};
template <> struct DenseMapInfo<ScalarEvolution::FoldID> {
diff --git a/llvm/include/llvm/IR/IRBuilder.h b/llvm/include/llvm/IR/IRBuilder.h
index f381273c46cfb8..81cf8a6f5d4793 100644
--- a/llvm/include/llvm/IR/IRBuilder.h
+++ b/llvm/include/llvm/IR/IRBuilder.h
@@ -2503,6 +2503,13 @@ class IRBuilderBase {
return CreateShuffleVector(V, PoisonValue::get(V->getType()), Mask, Name);
}
+ Value *CreateCountTrailingZeroElems(Type *ResTy, Value *Mask,
+ const Twine &Name = "") {
+ return CreateIntrinsic(
+ Intrinsic::experimental_cttz_elts, {ResTy, Mask->getType()},
+ {Mask, getInt1(/*ZeroIsPoison=*/true)}, nullptr, Name);
+ }
+
Value *CreateExtractValue(Value *Agg, ArrayRef<unsigned> Idxs,
const Twine &Name = "") {
if (auto *V = Folder.FoldExtractValue(Agg, Idxs))
diff --git a/llvm/include/llvm/Support/GenericLoopInfo.h b/llvm/include/llvm/Support/GenericLoopInfo.h
index d560ca648132c9..83cacf864089cc 100644
--- a/llvm/include/llvm/Support/GenericLoopInfo.h
+++ b/llvm/include/llvm/Support/GenericLoopInfo.h
@@ -294,6 +294,10 @@ template <class BlockT, class LoopT> class LoopBase {
/// Otherwise return null.
BlockT *getUniqueExitBlock() const;
+ /// Return the exit block for the latch if one exists. This function assumes
+ /// the loop has a latch.
+ BlockT *getLatchExitBlock() const;
+
/// Return true if this loop does not have any exit blocks.
bool hasNoExitBlocks() const;
diff --git a/llvm/include/llvm/Support/GenericLoopInfoImpl.h b/llvm/include/llvm/Support/GenericLoopInfoImpl.h
index 1e0d0ee446fc41..3beb3e538398ef 100644
--- a/llvm/include/llvm/Support/GenericLoopInfoImpl.h
+++ b/llvm/include/llvm/Support/GenericLoopInfoImpl.h
@@ -159,6 +159,16 @@ BlockT *LoopBase<BlockT, LoopT>::getUniqueExitBlock() const {
return getExitBlockHelper(this, true).first;
}
+template <class BlockT, class LoopT>
+BlockT *LoopBase<BlockT, LoopT>::getLatchExitBlock() const {
+ BlockT *Latch = getLoopLatch();
+ assert(Latch && "Latch block must exists");
+ for (BlockT *Successor : children<BlockT *>(Latch))
+ if (!contains(Successor))
+ return Successor;
+ return nullptr;
+}
+
/// getExitEdges - Return all pairs of (_inside_block_,_outside_block_).
template <class BlockT, class LoopT>
void LoopBase<BlockT, LoopT>::getExitEdges(
diff --git a/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h b/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h
index 62c1e15a9a60e1..05850f864d042a 100644
--- a/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h
+++ b/llvm/include/llvm/Transforms/Utils/ScalarEvolutionExpander.h
@@ -124,6 +124,11 @@ class SCEVExpander : public SCEVVisitor<SCEVExpander, Value *> {
/// "expanded" form.
bool LSRMode;
+ /// If the loop has an early exit we may have to use the speculative backedge
+ /// count, since the normal backedge count function is unable to compute a
+ /// SCEV expression.
+ bool UseSpeculativeBackedgeCount;
+
typedef IRBuilder<InstSimplifyFolder, IRBuilderCallbackInserter> BuilderType;
BuilderType Builder;
@@ -176,10 +181,12 @@ class SCEVExpander : public SCEVVisitor<SCEVExpander, Value *> {
public:
/// Construct a SCEVExpander in "canonical" mode.
explicit SCEVExpander(ScalarEvolution &se, const DataLayout &DL,
- const char *name, bool PreserveLCSSA = true)
+ const char *name, bool PreserveLCSSA = true,
+ bool UseSpeculativeBackedgeCount = false)
: SE(se), DL(DL), IVName(name), PreserveLCSSA(PreserveLCSSA),
IVIncInsertLoop(nullptr), IVIncInsertPos(nullptr), CanonicalMode(true),
LSRMode(false),
+ UseSpeculativeBackedgeCount(UseSpeculativeBackedgeCount),
Builder(se.getContext(), InstSimplifyFolder(DL),
IRBuilderCallbackInserter(
[this](Instruction *I) { rememberInstruction(I); })) {
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index a509ebf6a7e1b3..20a53abeb2e5cc 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -374,6 +374,24 @@ class LoopVectorizationLegality {
return LAI->getDepChecker().getMaxSafeVectorWidthInBits();
}
+ /// Returns true if the loop has a early exit with a exact backedge
+ /// count that is speculative.
+ bool hasSpeculativeEarlyExit() const {
+ return LAI && LAI->getSpeculativeEarlyExitingBlock();
+ }
+
+ /// Returns the early exiting block in a loop with a speculative backedge
+ /// count.
+ BasicBlock *getSpeculativeEarlyExitingBlock() const {
+ return LAI->getSpeculativeEarlyExitingBlock();
+ }
+
+ /// Returns the destination of an early exiting block in a loop with a
+ /// speculative backedge count.
+ BasicBlock *getSpeculativeEarlyExitBlock() const {
+ return LAI->getSpeculativeEarlyExitBlock();
+ }
+
/// Returns true if vector representation of the instruction \p I
/// requires mask.
bool isMaskRequired(const Instruction *I) const {
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 3bfc9700a14559..32e5816644310a 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -730,6 +730,9 @@ class AccessAnalysis {
return UnderlyingObjects;
}
+ /// Returns true if we cannot prove the loop will not fault.
+ bool mayFault();
+
private:
typedef MapVector<MemAccessInfo, SmallSetVector<Type *, 1>> PtrAccessMap;
@@ -1281,6 +1284,63 @@ bool AccessAnalysis::canCheckPtrAtRT(RuntimePointerChecking &RtCheck,
return CanDoRTIfNeeded;
}
+bool AccessAnalysis::mayFault() {
+ auto &DL = TheLoop->getHeader()->getModule()->getDataLayout();
+ for (auto &UO : UnderlyingObjects) {
+ // TODO: For now if we encounter more than one underlying object we just
+ // assume it could fault. However, with more analysis it's possible to look
+ // at all of them and calculate a common range of permitted GEP indices.
+ if (UO.second.size() != 1)
+ return true;
+
+ // For now only the simplest cases are permitted, but this could be
+ // extended further.
+ auto *GEP = dyn_cast<GetElementPtrInst>(UO.first);
+ if (!GEP || GEP->getPointerOperand() != UO.second[0] ||
+ GEP->getNumIndices() != 1)
+ return true;
+
+ // Verify pointer accessed within the loop always falls within the bounds
+ // of the underlying object, but first it's necessary to determine the
+ // object size.
+
+ auto GetKnownObjSize = [&](const Value *Obj) -> uint64_t {
+ // TODO: We should also be able to support global variables too.
+ if (auto *AllocaObj = dyn_cast<AllocaInst>(Obj)) {
+ if (TheLoop->isLoopInvariant(AllocaObj))
+ if (std::optional<TypeSize> AllocaSize =
+ AllocaObj->getAllocationSize(DL))
+ return !AllocaSize->isScalable() ? AllocaSize->getFixedValue() : 0;
+ } else if (auto *ArgObj = dyn_cast<Argument>(Obj))
+ return ArgObj->getDereferenceableBytes();
+ return 0;
+ };
+
+ uint64_t ObjSize = GetKnownObjSize(UO.second[0]);
+ if (!ObjSize)
+ return true;
+
+ Value *GEPInd = GEP->getOperand(1);
+ const SCEV *IndScev = PSE.getSCEV(GEPInd);
+ if (!isa<SCEVAddRecExpr>(IndScev))
+ return true;
+
+ // Calculate the maximum number of addressable elements in the object.
+ uint64_t ElemSize = GEP->getSourceElementType()->getScalarSizeInBits() / 8;
+ uint64_t MaxNumElems = ObjSize / ElemSize;
+
+ const SCEV *MinScev = PSE.getSE()->getConstant(GEPInd->getType(), 0);
+ const SCEV *MaxScev =
+ PSE.getSE()->getConstant(GEPInd->getType(), MaxNumElems);
+ if (!PSE.getSE()->isKnownOnEveryIteration(
+ ICmpInst::ICMP_SGE, cast<SCEVAddRecExpr>(IndScev), MinScev) ||
+ !PSE.getSE()->isKnownOnEveryIteration(
+ ICmpInst::ICMP_SLT, cast<SCEVAddRecExpr>(IndScev), MaxScev))
+ return true;
+ }
+ return false;
+}
+
void AccessAnalysis::processMemAccesses() {
// We process the set twice: first we process read-write pointers, last we
// process read-only pointers. This allows us to skip dependence tests for
@@ -2292,6 +2352,73 @@ void MemoryDepChecker::Dependence::print(
OS.indent(Depth + 2) << *Instrs[Destination] << "\n";
}
+bool LoopAccessInfo::isAnalyzableEarlyExitLoop() {
+ // At least one of the exiting blocks must be the latch.
+ BasicBlock *LatchBB = TheLoop->getLoopLatch();
+ if (!LatchBB)
+ return false;
+
+ SmallVector<BasicBlock *, 8> ExitingBlocks;
+ TheLoop->getExitingBlocks(ExitingBlocks);
+
+ // This is definitely not an early exit loop.
+ if (ExitingBlocks.size() < 2)
+ return false;
+
+ SmallVector<BasicBlock *, 4> ExactExitingBlocks;
+ PSE->getSE()->getExactExitingBlocks(TheLoop, &ExactExitingBlocks);
+
+ // We only support one speculative early exit.
+ if ((ExitingBlocks.size() - ExactExitingBlocks.size()) > 1)
+ return false;
+
+ // There could be multiple exiting blocks with an exact exit-not-taken
+ // count. Find the speculative early exit block, i.e. the one with an
+ // unknown count.
+ BasicBlock *TmpBB = nullptr;
+ for (BasicBlock *BB1 : ExitingBlocks) {
+ bool Found = false;
+ for (BasicBlock *BB2 : ExactExitingBlocks)
+ if (BB1 == BB2) {
+ Found = true;
+ break;
+ }
+ if (!Found) {
+ TmpBB = BB1;
+ break;
+ }
+ }
+ assert(TmpBB && "Expected to find speculative early exiting block");
+
+ // For now, let's keep things simple by ensuring the latch block only has
+ // the exiting block as a predecessor.
+ BasicBlock *LatchPredBB = LatchBB->getUniquePredecessor();
+ if (!LatchPredBB || LatchPredBB != TmpBB)
+ return false;
+
+ LLVM_DEBUG(
+ dbgs()
+ << "LAA: Found an early exit. Retrying with speculative exit count.\n");
+ const SCEV *SpecExitCount = PSE->getSpeculativeBackedgeTakenCount();
+ if (isa<SCEVCouldNotCompute>(SpecExitCount))
+ return false;
+
+ LLVM_DEBUG(dbgs() << "LAA: Found speculative backedge taken count: "
+ << *SpecExitCount << '\n');
+ SpeculativeEarlyExitingBB = TmpBB;
+
+ for (BasicBlock *BB : successors(SpeculativeEarlyExitingBB))
+ if (BB != LatchBB) {
+ SpeculativeEarlyExitBB = BB;
+ break;
+ }
+ assert(SpeculativeEarlyExitBB &&
+ "Expected to find speculative early exit block");
+ CountableEarlyExitBlocks = std::move(ExactExitingBlocks);
+
+ return true;
+}
+
bool LoopAccessInfo::canAnalyzeLoop() {
// We need to have a loop header.
LLVM_DEBUG(dbgs() << "LAA: Found a loop in "
@@ -2317,10 +2444,12 @@ bool LoopAccessInfo::canAnalyzeLoop() {
// ScalarEvolution needs to be able to find the exit count.
const SCEV *ExitCount = PSE->getBackedgeTakenCount();
if (isa<SCEVCouldNotCompute>(ExitCount)) {
- recordAnalysis("CantComputeNumberOfIterations")
- << "could not determine number of loop iterations";
LLVM_DEBUG(dbgs() << "LAA: SCEV could not compute the loop exit count.\n");
- return false;
+ if (!isAnalyzableEarlyExitLoop()) {
+ recordAnalysis("CantComputeNumberOfIterations")
+ << "could not determine number of loop iterations";
+ return false;
+ }
}
return true;
@@ -2352,6 +2481,9 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
EnableMemAccessVersioning &&
!TheLoop->getHeader()->getParent()->hasOptSize();
+ BasicBlock *LatchBB = TheLoop->getLoopLatch();
+ bool HasComplexWorkInEarlyExitLoop = false;
+
// Traverse blocks in fixed RPOT order, regardless of their storage in the
// loop info, as it may be arbitrary.
LoopBlocksRPO RPOT(TheLoop);
@@ -2367,7 +2499,8 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
// With both a non-vectorizable memory instruction and a convergent
// operation, found in this loop, no reason to continue the search.
- if (HasComplexMemInst && HasConvergentOp) {
+ if ((HasComplexMemInst && HasConvergentOp) ||
+ HasComplexWorkInEarlyExitLoop) {
CanVecMem = false;
return;
}
@@ -2385,6 +2518,14 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
// vectorize a loop if it contains known function calls that don't set
// the flag. Therefore, it is safe to ignore this read from memory.
auto *Call = dyn_cast<CallInst>(&I);
+ if (Call && SpeculativeEarlyExitingBB) {
+ recordAnalysis("CantVectorizeInstruction", Call)
+ << "cannot vectorize calls in early exit loop";
+ LLVM_DEBUG(dbgs() << "LAA: Found a call in early exit loop.\n");
+ HasComplexWorkInEarlyExitLoop = true;
+ continue;
+ }
+
if (Call && getVectorIntrinsicIDForCall(Call, TLI))
continue;
@@ -2412,6 +2553,13 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
HasComplexMemInst = true;
continue;
}
+ if (SpeculativeEarlyExitingBB && BB == LatchBB) {
+ recordAnalysis("CantVectorizeInstruction", Call)
+ << "cannot vectorize loads after early exit block";
+ LLVM_DEBUG(dbgs() << "LAA: Found a load after early exit.\n");
+ HasComplexWorkInEarlyExitLoop = true;
+ continue;
+ }
NumLoads++;
Loads.push_back(Ld);
DepChecker->addAccess(Ld);
@@ -2423,6 +2571,13 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
// Save 'store' instructions. Abort if othe...
[truncated]
@@ -10260,7 +10562,11 @@ bool LoopVectorizePass::processLoop(Loop *L) {
Hints.setAlreadyVectorized();
}

- assert(!verifyFunction(*L->getHeader()->getParent(), &dbgs()));
+ // assert(!verifyFunction(*L->getHeader()->getParent(), &dbgs()));
My apologies - I just realised this debug code is still present. I'll fix asap!
@@ -2423,6 +2571,13 @@ void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
// Save 'store' instructions. Abort if other instructions write to memory.
if (I.mayWriteToMemory()) {
auto *St = dyn_cast<StoreInst>(&I);
if (SpeculativeEarlyExitingBB) {
recordAnalysis("CantVectorizeInstruction", St)
What happens if St is null here?
There is a similar problem in the code below too - recordAnalysis simply uses the debug information for the loop instead, but it won't crash. However, I think it makes sense to record the instruction using I instead and I'll update the message to show that it might not be a store.
// a later poison exit count should not propagate into the result. This are
// exactly the semantics provided by umin_seq.
return SE->getUMinFromMismatchedTypes(Ops, /* Sequential */ true);
}
How does this "speculative" BECount differ from the SymbolicMax BECount?
Hmm, it does seem like it does the same thing, except there is no predicated version that accepts a vector of SCEVPredicate pointers, which is required for getPredicatedBackedgeTakenCount. I can try adding a predicated version of getSymbolicMax to see if that works.
The other major difference between getSpeculative and getSymbolicMax is that the former requires the exact exit-not-taken count for the latch to be known, whereas the latter doesn't care. So I think in order to use something close to the existing getSymbolicMax interface I'll need to do two things:
- Rewrite getSymbolicMax (or add an overloaded interface) so that it's a const interface (allowing it to be called from getPredicatedBackedgeTakenCount). Also, add a SmallVector<const SCEVPredicate *, 4> *Predicates argument.
- Add code to getPredicatedBackedgeTakenCount to explicitly check we have an exact exit-not-taken count for the latch.
I'm happy to do this of course - just pointing out that getSymbolicMax isn't a drop-in replacement, that's all. I'll try it out and see if I get the same behaviour as before.
I've tried posting a new commit that teaches getPredicatedBackedgeTakenCount to use a version of getSymbolicMax that accepts predicates, provided we have an exact count for the latch. Hopefully this makes better reuse of the code.
bool LoopMayFault = false;

/// Keeps track of the early-exiting block, if present.
BasicBlock *SpeculativeEarlyExitingBB = nullptr;
Maybe a better name would be just EarlyExitingBB? At least to me, Speculative implies that there's speculation on memory, i.e. that it refers to BBs with accesses that may fault.
In a sense though that is not far off the truth, because when vectorising the loop we are by definition reading ahead in memory which could potentially cause a fault where the scalar loop would not. However, the main reason I added the word 'Speculative' was to distinguish between early exits with exact exit-not-taken counts (which the vectoriser does support) and early exits that cannot be counted.
I'd prefer not to call it EarlyExitingBB to avoid any possible confusion, but I'm happy to take suggestions on alternative names that are better? Perhaps UncountableEarlyExitingBB?
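To make the distinction in this reply concrete, here is a small sketch (all names are hypothetical and not taken from the patch) of a loop with three exits. The check against Limit is an early exit whose exit-not-taken count can be expressed from loop-invariant values before the loop runs, so it is countable. The comparison against Key depends on the data being loaded, so its count cannot be computed up front; that is the kind of exit the patch calls speculative, or uncountable in the suggested renaming.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the three ways of leaving the loop below.
enum class ExitKind { Latch, CountableEarly, UncountableEarly };

struct SearchResult {
  ExitKind Kind;
  std::size_t Index;
};

SearchResult searchWithLimit(const int *Data, std::size_t N,
                             std::size_t Limit, int Key) {
  for (std::size_t I = 0; I < N; ++I) {
    // Countable early exit: taken when I reaches the loop-invariant Limit.
    if (I == Limit)
      return {ExitKind::CountableEarly, I};
    // Uncountable early exit: depends on the values stored in Data.
    if (Data[I] == Key)
      return {ExitKind::UncountableEarly, I};
  }
  // Latch exit: the induction variable reached N.
  return {ExitKind::Latch, N};
}
```

For the array {3, 1, 4, 1, 5, 9, 2, 6} with Limit = 6, searching for 5 leaves through the uncountable exit at index 4, while searching for an absent key leaves through the countable exit at index 6.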
auto *UI = cast<Instruction>(U);
if (!L->contains(UI)) {
PHINode *PHI = dyn_cast<PHINode>(UI);
assert(PHI && "Expected LCSSA form");
Nit: checking LCSSA form could be hoisted and checked earlier and just once?
I can't hoist this assert out, since it's based upon the User U which varies in each loop iteration.
In PR llvm#88385 I've added support for auto-vectorisation of some early exit loops, which requires using the experimental.cttz.elts to calculate final indices in the early exit block. We need a more accurate cost model for this intrinsic to better reflect the cost of work required in the early exit block. I've tried to accurately represent the expansion code for the intrinsic when the target does not have efficient lowering for it. It's quite tricky to model because you need to first figure out what types will actually be used in the expansion. The type used can have a significant effect on the cost if you end up using illegal vector types. Tests added here: Analysis/CostModel/AArch64/cttz_elts.ll Analysis/CostModel/RISCV/cttz_elts.ll
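For readers unfamiliar with the intrinsic, the following is a scalar model only, not how any target lowers it. cttzElts mimics experimental.cttz.elts on a fixed-width mask by returning the index of the first set lane, and returns the number of lanes when no lane is set (the intrinsic's result when its zero-is-poison flag is false). findFirstVectorModel then shows how that index is used in the early exit block to recover the final induction value. VF = 4 and the requirement that N is a multiple of VF are simplifying assumptions of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 4; // hypothetical vectorisation factor

// Scalar model of cttz.elts: number of leading false lanes, counted from lane 0.
std::size_t cttzElts(const std::array<bool, VF> &Mask) {
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    if (Mask[Lane])
      return Lane;
  return VF; // no lane set
}

// Model of a vectorised search loop. Assumes N is a multiple of VF.
std::size_t findFirstVectorModel(const int *Data, std::size_t N, int Key) {
  assert(N % VF == 0 && "sketch only handles whole vectors");
  for (std::size_t Base = 0; Base < N; Base += VF) {
    std::array<bool, VF> Mask{};
    bool AnyLaneHit = false;
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      Mask[Lane] = Data[Base + Lane] == Key; // vector compare
      AnyLaneHit = AnyLaneHit || Mask[Lane]; // or-reduction of the mask
    }
    if (AnyLaneHit)
      return Base + cttzElts(Mask); // early exit block: first matching lane
  }
  return N; // latch exit
}
```

The result matches the scalar search: the vector iteration that contains the first match exits early, and the lane index from cttzElts is added to the base of that iteration.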
I was wondering if it would be good to add some AArch64 codegen tests too so that we can look at some codegen?
If you're referring to the codegen coming out of clang after vectorising the loop, I don't think we typically have tests like that in test/Transform/LoopVectorize. They are normally IR/opt based tests. Are you referring specifically to the codegen from the cttz.elts intrinsic? If so, we already have tests for them - see CodeGen/AArch64/intrinsic-cttz-elts-sve.ll, for example.
Yes, I appreciate we test all things individually, but I was just thinking that it is a bit of a shame we can't look at some codegen for a loop for all of this work. For example, take the resulting IR of some of the tests in test/Transform/LoopVectorize/AArch64, and create llc tests. Not sure if there's precedent for that, I guess not.
It would probably make sense to have some micro-benchmarks for some loops with varying trip counts (both statically known and unknown) to cover the end-to-end flow and allow for easy evaluation. Sharing the generated assembly end-to-end for some of those might help, as @sjoerdmeijer suggested? (I don't think we should add end-to-end tests to llvm-project/llvm/tests/ directly that run the vectorizer (and possibly other passes) all the way down to assembly)
This patch basically contains two parts: the LAA/SCEV and the vectorisation part.
I have only looked at the vectorisation part and that looks good to me:
- thanks for taking the cost-model remarks into account, the added logic seems like a good first step,
- the option to vectorise loops with early breaks is off by default. This allows us to experiment more with this, possibly refine the cost-model, without creating a lot of turbulence.
- It's a shame we can't look at final codegen for these sort of patches, but that is not a problem of this patch. I like the idea of some microbenchmarks for this, but given that this is off by default I don't think that this needs to hold up this patch.
So, LGTM, but I haven't looked at the LAA part, perhaps @nikic or @nikolaypanchenko can sign off on that part.
This work is in preparation for PRs llvm#112138 and llvm#88385 where the middle block is not guaranteed to be the immediate successor to the region block. I've simply added new getMiddleBlock() interfaces to VPlan that for now just return cast<VPBasicBlock>(VectorRegion->getSingleSuccessor()). Once PR llvm#112138 lands we'll need to do more work to discover the middle block.
With PR llvm#88385 I am introducing support for vectorising more loops with early exits that don't require a scalar epilogue. As such, if a loop doesn't have a unique exit block it will not automatically imply we require a scalar epilogue. Also, in the only place in the code today where we use the variable LoopExitBlock we actually mean the exit block from the latch. Therefore, it seemed reasonable to add a new getUniqueLatchExitBlock helper that allows the caller to determine the exit block taken from the latch and use this instead of getUniqueExitBlock. I also removed LoopExitBlock since it was only used in one place. While doing this I also noticed that one of the comments in requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e. when we're not exiting from the latch block. This doesn't always imply we have multiple exits, e.g. see the test in Transforms/LoopVectorize/unroll_nonlatch.ll where the latch unconditionally branches back to the only exiting block.
With PR #88385 I am introducing support for vectorising more loops with early exits that don't require a scalar epilogue. As such, if a loop doesn't have a unique exit block it will not automatically imply we require a scalar epilogue. Also, in all places in the code today where we use the variable LoopExitBlock we actually mean the exit block from the latch. Therefore, it seemed reasonable to add a new getUniqueLatchExitBlock that allows the caller to determine the exit block taken from the latch and use this instead of getUniqueExitBlock. I also renamed LoopExitBlock to be LatchExitBlock. I feel this not only better reflects how the variable is used today, but also prepares the code for PR #88385. While doing this I also noticed that one of the comments in requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e. when we're not exiting from the latch block. This doesn't always imply we have multiple exits, e.g. see the test in Transforms/LoopVectorize/unroll_nonlatch.ll where the latch unconditionally branches back to the only exiting block.
Caching the decision returned by requiresScalarEpilogue means that we can avoid printing out the same debug many times, and also avoids repeating the same calculation. This function will get more complex when we start to reason about more early exit loops, such as in PR llvm#88385. The only problem with this is we sometimes have to invalidate the previous result due to changes in the scalar epilogue status or interleave groups.
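The caching-with-invalidation pattern described above can be sketched roughly as follows. `EpilogueDecisionCache` and its members are hypothetical, and the "analysis" is a placeholder; the real cost model tracks more state than a single flag.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the part of the cost model that owns the decision.
class EpilogueDecisionCache {
public:
  explicit EpilogueDecisionCache(bool ExitsFromLatch)
      : ExitsFromLatch(ExitsFromLatch) {}

  // Computes the decision on the first query, then reuses the cached value.
  bool requiresScalarEpilogue() {
    if (!Cached) {
      ++NumComputations;        // counts how often the analysis really ran
      Cached = !ExitsFromLatch; // placeholder for the real analysis
    }
    assert(Cached.has_value() && "decision must be available here");
    return *Cached;
  }

  // Any change to an input of the analysis must drop the cached result.
  void setExitsFromLatch(bool Value) {
    ExitsFromLatch = Value;
    Cached.reset();
  }

  int numComputations() const { return NumComputations; }

private:
  bool ExitsFromLatch;
  std::optional<bool> Cached;
  int NumComputations = 0;
};
```

Repeated queries run the analysis (and any debug output attached to it) once; changing an input forces exactly one recomputation on the next query.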
PR llvm#112138 introduced initial support for dispatching to multiple exit blocks via split middle blocks. This patch fixes a few issues so that we can enable more tests to use the new enable-early-exit-vectorization flag. Fixes are: 1. The code to bail out for any loop live-out values happens too late. This is because collectUsersInExitBlocks ignores induction variables, which get dealt with in fixupIVUsers. I've moved the check much earlier in processLoop by looking for outside users of loop-defined values. 2. We shouldn't yet be interleaving when vectorising loops with uncountable early exits, since we've not added support for this yet. 3. Similarly, we also shouldn't be creating vector epilogues. 4. Similarly, we shouldn't enable tail-folding. 5. The existing implementation doesn't yet support loops that require scalar epilogues, although I plan to add that as part of PR llvm#88385. 6. The new split middle blocks weren't being added to the parent loop. 7. VPIRInstruction::execute assumed that the VPIRBasicBlock predecessors correspond like-for-like with the predecessors of the scalar exit block prior to vectorisation. For example, collectUsersInExitBlocks adds the operands to the VPIRInstruction in the order returned by predecessors(ExitBB), whereas VPIRInstruction::execute processes the operands in order of the VPIRBasicBlock predecessors. There is absolutely no guarantee that they match up, which in some cases (such as the yacr2 test in the LLVM test suite) they don't. I've fixed this by maintaining the old behaviour when there is a single operand, and when there are 2 or more operands we use the same ordering as the BasicBlock predecessors.
PR llvm#112138 introduced initial support for dispatching to multiple exit blocks via split middle blocks. This patch fixes a few issues so that we can enable more tests to use the new enable-early-exit-vectorization flag. Fixes are: 1. The code to bail out for any loop live-out values happens too late. This is because collectUsersInExitBlocks ignores induction variables, which get dealt with in fixupIVUsers. I've moved the check much earlier in processLoop by looking for outside users of loop-defined values. 2. We shouldn't yet be interleaving when vectorising loops with uncountable early exits, since we've not added support for this yet. 3. Similarly, we also shouldn't be creating vector epilogues. 4. Similarly, we shouldn't enable tail-folding. 5. The existing implementation doesn't yet support loops that require scalar epilogues, although I plan to add that as part of PR llvm#88385. 6. The new split middle blocks weren't being added to the parent loop.
A more lightweight variant of #109193, which dispatches to multiple exit blocks via the middle blocks. The patch also introduces a bit of required scaffolding to enable early-exit vectorization, including an option. At the moment, early-exit vectorization doesn't come with legality checks, and is only used if the option is provided and the loop has metadata forcing vectorization. This is only intended to be used for testing during bring-up, with @david-arm enabling auto early-exit vectorization plugging in the changes from #88385. PR: #112138
PR #112138 introduced initial support for dispatching to multiple exit blocks via split middle blocks. This patch fixes a few issues so that we can enable more tests to use the new enable-early-exit-vectorization flag. Fixes are: 1. The code to bail out for any loop live-out values happens too late. This is because collectUsersInExitBlocks ignores induction variables, which get dealt with in fixupIVUsers. I've moved the check much earlier in processLoop by looking for outside users of loop-defined values. 2. We shouldn't yet be interleaving when vectorising loops with uncountable early exits, since we've not added support for this yet. 3. Similarly, we also shouldn't be creating vector epilogues. 4. Similarly, we shouldn't enable tail-folding. 5. The existing implementation doesn't yet support loops that require scalar epilogues, although I plan to add that as part of PR #88385. 6. The new split middle blocks weren't being added to the parent loop.
This work feeds part of PR llvm#88385, and adds support for vectorising loops with uncountable early exits and outside users of loop-defined variables. I've added a new fixupEarlyExitIVUsers to mirror what happens in fixupIVUsers when patching up outside users of induction variables in the early exit block. We have to handle these differently for two reasons: 1. We can't work backwards from the end value in the middle block because we didn't leave at the last iteration. 2. We need to generate different IR that calculates the vector lane that triggered the exit, and hence can determine the induction value at the point we exited. I've added a new 'null' VPValue as a dummy placeholder to manage the incoming operands of PHI nodes in the exit block. We can have situations where one of the incoming values is an induction variable (or its update) and the other is not. For example, both the latch and the early exiting block can jump to the same exit block. However, VPInstruction::generate walks through all predecessors of the PHI assuming the value is *not* an IV. In order to ensure that we process the right value for the right incoming block we use this new 'null' value as a marker to indicate it should be skipped, since it will be handled separately in fixupIVUsers or fixupEarlyExitIVUsers. All code for calculating the last value when exiting the loop early now lives in a new vector.early.exit block, which sits between the middle.split block and the original exit block. I also had to fix up the vplan verifier because it assumed that the block containing a definition always dominated the parent of the user. That's no longer the case because we can arrive at the exit block via one of the latch or the early exiting block. I've added a new ExtractFirstActive VPInstruction that extracts the first active lane of a vector, i.e. the lane of the vector predicate that triggered the exit.
I realise this PR is very out of date. For reference, this work is nearing completion, but it has been done through many other smaller PRs, including a nice patch by @fhahn adding support for multiple exits in VPlan. Here is one of the remaining PRs to add support for loops with live-outs: #120567. I also have a WIP patch to add support for versioning early exit loops with potentially faulting pointers: #120603
This work feeds part of PR llvm#88385, and adds support for vectorising loops with uncountable early exits and outside users of loop-defined variables. When calculating the final value from an uncountable early exit we need to calculate the vector lane that triggered the exit, and hence determine the value at the point we exited. All code for calculating the last value when exiting the loop early now lives in a new vector.early.exit block, which sits between the middle.split block and the original exit block. Doing this required two fixes: 1. The vplan verifier incorrectly assumed that the block containing a definition always dominates the block of the user. That's not true if you can arrive at the use block from multiple incoming blocks. This is possible for early exit loops where both the early exit and the latch jump to the same block. I've added a new ExtractFirstActive VPInstruction that extracts the first active lane of a vector, i.e. the lane of the vector predicate that triggered the exit. NOTE: The IR generated for dealing with live-outs from early exit loops is unoptimised, as opposed to normal loops. This inevitably leads to poor quality code, but this can be fixed up later.
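As a rough scalar model of "calculate the vector lane that triggered the exit, and hence determine the value at the point we exited", consider an induction variable of the form `Start + Step * i`. The helper names below are hypothetical and the code is only a sketch of the arithmetic, not the IR the vectoriser emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the first set lane in Mask, or Mask.size() if no lane is set.
std::size_t firstActiveLane(const std::vector<bool> &Mask) {
  for (std::size_t Lane = 0; Lane < Mask.size(); ++Lane)
    if (Mask[Lane])
      return Lane;
  return Mask.size();
}

// Value of the induction variable Start + Step * i at the scalar iteration
// that triggered the early exit. VectorIterStart is the scalar iteration
// index handled by lane 0 of the current vector iteration.
int64_t inductionAtEarlyExit(int64_t Start, int64_t Step,
                             int64_t VectorIterStart,
                             const std::vector<bool> &ExitMask) {
  std::size_t Lane = firstActiveLane(ExitMask);
  assert(Lane < ExitMask.size() && "early exit taken but no lane is active");
  return Start + Step * (VectorIterStart + static_cast<int64_t>(Lane));
}
```

For example, with `Start = 0`, `Step = 1`, a vector iteration covering scalar iterations 8 to 11 and an exit mask whose first set lane is lane 2, the live-out value is 10. This is why the end value from the middle block cannot be reused: the loop did not leave at its last iteration.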
…AlignedInLoop (#96752) Currently when we encounter a negative step in the induction variable isDereferenceableAndAlignedInLoop bails out because the element size is signed greater than the step. This patch adds support for negative steps in cases where we detect the start address for the load is of the form base + offset. In this case the address decrements in each iteration so we need to calculate the access size differently. I have done this by calling getStartAndEndForAccess from LoopAccessAnalysis.cpp. The motivation for this patch comes from PR #88385 where a reviewer requested reusing isDereferenceableAndAlignedInLoop, but that PR itself does support reverse loops. The changed test in LoopVectorize/X86/load-deref-pred.ll now passes because previously we were calculating the total access size incorrectly, whereas now it is 412 bytes and fits perfectly into the alloca.
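The reason a negative step needs a different access-size calculation can be seen in a small sketch. `ByteRange`, `accessRange` and `fitsInObject` are hypothetical helpers that work on byte offsets from the base object; they are not the implementation of getStartAndEndForAccess.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open byte range [Begin, End) relative to the start of the object.
struct ByteRange {
  int64_t Begin;
  int64_t End;
};

// Bytes touched by TripCount accesses of ElemSize bytes each, where the
// first access is at StartOffset and each iteration moves by StepBytes
// (which may be negative).
ByteRange accessRange(int64_t StartOffset, int64_t StepBytes,
                      uint64_t TripCount, uint64_t ElemSize) {
  assert(TripCount > 0 && ElemSize > 0 && "need at least one access");
  int64_t Last = StartOffset + StepBytes * static_cast<int64_t>(TripCount - 1);
  int64_t Low = std::min(StartOffset, Last);
  int64_t High = std::max(StartOffset, Last);
  // The highest-addressed element still occupies ElemSize bytes.
  return {Low, High + static_cast<int64_t>(ElemSize)};
}

// True if every touched byte lies inside an object of ObjectSize bytes.
bool fitsInObject(ByteRange R, uint64_t ObjectSize) {
  return R.Begin >= 0 && static_cast<uint64_t>(R.End) <= ObjectSize;
}
```

With a negative step the first access is the highest address, so the range extends downwards from `StartOffset + ElemSize` rather than upwards from `StartOffset`. A loop that starts at offset 396 and steps by -4 for 100 iterations touches bytes [0, 400), exactly like the forward loop starting at offset 0.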
This work feeds part of PR llvm#88385, and adds support for vectorising loops with uncountable early exits and outside users of loop-defined variables. When calculating the final value from an uncountable early exit we need to calculate the vector lane that triggered the exit, and hence determine the value at the point we exited. All code for calculating the last value when exiting the loop early now lives in a new vector.early.exit block, which sits between the middle.split block and the original exit block. Doing this required the following fix: * The vplan verifier incorrectly assumed that the block containing a definition always dominates the block of the user. That's not true if you can arrive at the use block from multiple incoming blocks. This is possible for early exit loops where both the early exit and the latch jump to the same block. I've added a new ExtractFirstActive VPInstruction that extracts the first active lane of a vector, i.e. the lane of the vector predicate that triggered the exit. NOTE: The IR generated for dealing with live-outs from early exit loops is unoptimised, as opposed to normal loops. This inevitably leads to poor quality code, but this can be fixed up later.
…ts (#120567) This work feeds part of PR #88385, and adds support for vectorising loops with uncountable early exits and outside users of loop-defined variables. When calculating the final value from an uncountable early exit we need to calculate the vector lane that triggered the exit, and hence determine the value at the point we exited. All code for calculating the last value when exiting the loop early now lives in a new vector.early.exit block, which sits between the middle.split block and the original exit block. Doing this required two fixes: 1. The vplan verifier incorrectly assumed that the block containing a definition always dominates the block of the user. That's not true if you can arrive at the use block from multiple incoming blocks. This is possible for early exit loops where both the early exit and the latch jump to the same block. 2. We were adding the new vector.early.exit to the wrong parent loop. It needs to have the same parent as the actual early exit block from the original loop. I've added a new ExtractFirstActive VPInstruction that extracts the first active lane of a vector, i.e. the lane of the vector predicate that triggered the exit. NOTE: The IR generated for dealing with live-outs from early exit loops is unoptimised, as opposed to normal loops. This inevitably leads to poor quality code, but this can be fixed up later.
…ith live-outs (#120567) This work feeds part of PR llvm/llvm-project#88385, and adds support for vectorising loops with uncountable early exits and outside users of loop-defined variables. When calculating the final value from an uncountable early exit we need to calculate the vector lane that triggered the exit, and hence determine the value at the point we exited. All code for calculating the last value when exiting the loop early now lives in a new vector.early.exit block, which sits between the middle.split block and the original exit block. Doing this required two fixes: 1. The vplan verifier incorrectly assumed that the block containing a definition always dominates the block of the user. That's not true if you can arrive at the use block from multiple incoming blocks. This is possible for early exit loops where both the early exit and the latch jump to the same block. 2. We were adding the new vector.early.exit to the wrong parent loop. It needs to have the same parent as the actual early exit block from the original loop. I've added a new ExtractFirstActive VPInstruction that extracts the first active lane of a vector, i.e. the lane of the vector predicate that triggered the exit. NOTE: The IR generated for dealing with live-outs from early exit loops is unoptimised, as opposed to normal loops. This inevitably leads to poor quality code, but this can be fixed up later.
This patch follows on from PR #107004 by adding support for vectorisation of a simple class of loops that typically involves searching for something, i.e.
or
In this initial commit we will only consider early exit loops legal to vectorise if they
follow these criteria:
above example.
faulting loads.
For point 7 once this patch lands I intend to follow up by supporting
some limited cases of faulting loops where we can version the loop based
on pointer alignment. For example, it turns out in the SPEC2017 benchmark
(xalancbmk) there is a std::find loop that we can vectorise provided we
add SCEV checks for the initial pointer being aligned to a multiple of
the VF. In practice, the pointer is regularly aligned to at least 32/64
bytes and since the VF is a power of 2, any vector loads <= 32/64 bytes
in size will always fault on the first lane, following the same behaviour
as the scalar loop. Given we already do such speculative versioning for
loops with unknown strides, alignment-based versioning doesn't seem to be
any worse at least for loops with only one load.
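The alignment argument can be checked with a small model: if a vector load's address is aligned to the load size, and the load size is a power of 2 no larger than the page size, the load cannot straddle a page boundary, so it can only fault if its first lane faults. The sketch below assumes a power-of-2 page size (4096 bytes in the usage note) and uses hypothetical helper names; it models the runtime check, not the SCEV checks the follow-up patch would emit.

```cpp
#include <cassert>
#include <cstdint>

bool isPowerOf2(uint64_t X) { return X != 0 && (X & (X - 1)) == 0; }

// True if the bytes [Addr, Addr + LoadBytes) all lie on the same page.
bool staysWithinOnePage(uint64_t Addr, uint64_t LoadBytes, uint64_t PageSize) {
  assert(LoadBytes > 0 && isPowerOf2(PageSize) && "invalid load or page size");
  return Addr / PageSize == (Addr + LoadBytes - 1) / PageSize;
}

// Model of the runtime check: is the pointer aligned to the vector load size?
bool passesAlignmentCheck(uint64_t Addr, uint64_t VF, uint64_t ElemBytes) {
  uint64_t LoadBytes = VF * ElemBytes;
  assert(isPowerOf2(LoadBytes) && "vector load size must be a power of 2");
  return (Addr & (LoadBytes - 1)) == 0;
}

// Exhaustively checks two pages' worth of addresses: every address that
// passes the alignment check must yield a load confined to a single page.
bool alignedLoadsNeverCrossPages(uint64_t VF, uint64_t ElemBytes,
                                 uint64_t PageSize) {
  const uint64_t LoadBytes = VF * ElemBytes;
  assert(LoadBytes <= PageSize && "model assumes loads no larger than a page");
  for (uint64_t Addr = 0; Addr < 2 * PageSize; ++Addr)
    if (passesAlignmentCheck(Addr, VF, ElemBytes) &&
        !staysWithinOnePage(Addr, LoadBytes, PageSize))
      return false;
  return true;
}
```

With 4096-byte pages, a 16-byte load at address 4090 crosses into the next page, but it also fails the alignment check, so the versioned loop would fall back to the scalar path. Because each vector iteration advances the pointer by the load size, an initially aligned pointer stays aligned.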
This patch makes use of the existing experimental_cttz_elems intrinsic
that's required in the vectorised early exit block to determine the first
lane that triggered the exit. This intrinsic has generic lowering support
so it's guaranteed to work for all targets.
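As a scalar model of how the first triggering lane is found, the sketch below processes a search loop in chunks of a fixed `VF` and counts the leading zero lanes of the exit mask, starting at lane 0. `cttzElts` and `findFirst` are hypothetical names; `cttzElts` models the behaviour of the intrinsic when an all-zero mask is defined to return the number of elements, and the chunked loop assumes the input length is a multiple of `VF` so no remainder handling is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // assumed vectorisation factor for this sketch

// Number of zero lanes before the first set lane, counting from lane 0;
// returns VF when no lane is set.
std::size_t cttzElts(const std::array<bool, VF> &Mask) {
  std::size_t Lane = 0;
  while (Lane < VF && !Mask[Lane])
    ++Lane;
  return Lane;
}

// Chunked model of "for (i = 0; i < n; i++) if (Data[i] == Val) return i;".
// Returns Data.size() when Val is not found.
std::size_t findFirst(const std::vector<int> &Data, int Val) {
  assert(Data.size() % VF == 0 && "sketch assumes no scalar remainder");
  for (std::size_t Base = 0; Base < Data.size(); Base += VF) {
    std::array<bool, VF> Mask{};
    bool AnyLaneExits = false;
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      Mask[Lane] = Data[Base + Lane] == Val; // vector compare
      AnyLaneExits |= Mask[Lane];            // or-reduction of the mask
    }
    if (AnyLaneExits)
      return Base + cttzElts(Mask); // early exit: first lane that matched
  }
  return Data.size();
}
```

The whole chunk is compared before the exit is taken, which is why the loads must be known not to fault for lanes beyond the one that triggers the exit.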
Tests have been updated here:
Transforms/LoopVectorize/simple_early_exit.ll