Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VPlan] Introduce ResumePhi VPInstruction, use to create phi for FOR. #94760

Merged
merged 10 commits into from
Jul 11, 2024

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Jun 7, 2024

This patch introduces a new ResumePhi VPInstruction which creates a phi in
a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.

The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.

In follow-up changes, it will also be used to create phis for reduction and
induction resume values.

Depends on #92651

@llvmbot
Copy link
Member

llvmbot commented Jun 7, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Florian Hahn (fhahn)

Changes

This patch introduces a new ExitPhi VPInstruction which creates a phi in
a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.

The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.

In follow-up changes, it will also be used to create phis for reduction and
induction resume values.

Depends on #92651


Patch is 177.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/94760.diff

38 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+118-103)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+91-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+26-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+56-16)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+19-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/first-order-recurrence-fold-tail.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/fixed-order-recurrence.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/synthesize-mask-for-call.ll (+48)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/widen-call-with-intrinsic-or-libfunc.ll (+16)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+18)
  • (modified) llvm/test/Transforms/LoopVectorize/SystemZ/predicated-first-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/branch-weights.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains-vplan.ll (+27-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll (+28-28)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-complex.ll (+21-21)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-multiply-recurrences.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+54-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+78-78)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+21-21)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-and-scalarize-only.ll (+18-1)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/pr36983.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/pr59319-loop-access-info-invalidation.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-iv-transforms.ll (+9-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+17)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-outer-loop.ll (+10-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+124-3)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge-vf1.ll (+9-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+15)
  • (modified) llvm/unittests/Transforms/Vectorize/VPlanTestBase.h (+8-8)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c7c19ef456c7c..ae62df3aed207 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -607,10 +607,6 @@ class InnerLoopVectorizer {
                     BasicBlock *MiddleBlock, BasicBlock *VectorHeader,
                     VPlan &Plan, VPTransformState &State);
 
-  /// Create the phi node for the resume value of first order recurrences in the
-  /// scalar preheader and update the users in the scalar loop.
-  void fixFixedOrderRecurrence(VPLiveOut *LO, VPTransformState &State);
-
   /// Iteratively sink the scalarized operands of a predicated instruction into
   /// the block that was created for it.
   void sinkScalarOperands(Instruction *PredInst);
@@ -2972,22 +2968,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
       SplitBlock(LoopMiddleBlock, LoopMiddleBlock->getTerminator(), DT, LI,
                  nullptr, Twine(Prefix) + "scalar.ph");
 
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Set up the middle block terminator.  Two cases:
-  // 1) If we know that we must execute the scalar epilogue, emit an
-  //    unconditional branch.
-  // 2) Otherwise, we must have a single unique exit block (due to how we
-  //    implement the multiple exit case).  In this case, set up a conditional
-  //    branch from the middle block to the loop scalar preheader, and the
-  //    exit block.  completeLoopSkeleton will update the condition to use an
-  //    iteration check, if required to decide whether to execute the remainder.
-  BranchInst *BrInst =
-      Cost->requiresScalarEpilogue(VF.isVector())
-          ? BranchInst::Create(LoopScalarPreHeader)
-          : BranchInst::Create(LoopExitBlock, LoopScalarPreHeader,
-                               Builder.getTrue());
-  BrInst->setDebugLoc(ScalarLatchTerm->getDebugLoc());
+  auto *BrInst = new UnreachableInst(LoopMiddleBlock->getContext());
   ReplaceInstWithInst(LoopMiddleBlock->getTerminator(), BrInst);
 
   // Update dominator for loop exit. During skeleton creation, only the vector
@@ -3094,51 +3075,6 @@ void InnerLoopVectorizer::createInductionResumeValues(
   }
 }
 
-BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
-  // The trip counts should be cached by now.
-  Value *Count = getTripCount();
-  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
-
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Add a check in the middle block to see if we have completed
-  // all of the iterations in the first vector loop.  Three cases:
-  // 1) If we require a scalar epilogue, there is no conditional branch as
-  //    we unconditionally branch to the scalar preheader.  Do nothing.
-  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
-  //    Thus if tail is to be folded, we know we don't need to run the
-  //    remainder and we can use the previous value for the condition (true).
-  // 3) Otherwise, construct a runtime check.
-  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
-      !Cost->foldTailByMasking()) {
-    // Here we use the same DebugLoc as the scalar loop latch terminator instead
-    // of the corresponding compare because they may have ended up with
-    // different line numbers and we want to avoid awkward line stepping while
-    // debugging. Eg. if the compare has got a line number inside the loop.
-    // TODO: At the moment, CreateICmpEQ will simplify conditions with constant
-    // operands. Perform simplification directly on VPlan once the branch is
-    // modeled there.
-    IRBuilder<> B(LoopMiddleBlock->getTerminator());
-    B.SetCurrentDebugLocation(ScalarLatchTerm->getDebugLoc());
-    Value *CmpN = B.CreateICmpEQ(Count, VectorTripCount, "cmp.n");
-    BranchInst &BI = *cast<BranchInst>(LoopMiddleBlock->getTerminator());
-    BI.setCondition(CmpN);
-    if (hasBranchWeightMD(*ScalarLatchTerm)) {
-      // Assume that `Count % VectorTripCount` is equally distributed.
-      unsigned TripCount = UF * VF.getKnownMinValue();
-      assert(TripCount > 0 && "trip count should not be zero");
-      const uint32_t Weights[] = {1, TripCount - 1};
-      setBranchWeights(BI, Weights);
-    }
-  }
-
-#ifdef EXPENSIVE_CHECKS
-  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
-#endif
-
-  return LoopVectorPreHeader;
-}
-
 std::pair<BasicBlock *, Value *>
 InnerLoopVectorizer::createVectorizedLoopSkeleton(
     const SCEV2ValueTy &ExpandedSCEVs) {
@@ -3198,7 +3134,7 @@ InnerLoopVectorizer::createVectorizedLoopSkeleton(
   // Emit phis for the new starting index of the scalar loop.
   createInductionResumeValues(ExpandedSCEVs);
 
-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }
 
 // Fix up external users of the induction variable. At this point, we are
@@ -3399,8 +3335,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
   for (const auto &[_, LO] : to_vector(Plan.getLiveOuts())) {
     if (!Legal->isFixedOrderRecurrence(LO->getPhi()))
       continue;
-    fixFixedOrderRecurrence(LO, State);
-    Plan.removeLiveOut(LO->getPhi());
   }
 
   // Forget the original basic block.
@@ -3470,31 +3404,16 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                VF.getKnownMinValue() * UF);
 }
 
-void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
-                                                  VPTransformState &State) {
-  // Extract the last vector element in the middle block. This will be the
-  // initial value for the recurrence when jumping to the scalar loop.
-  VPValue *VPExtract = LO->getOperand(0);
-  using namespace llvm::VPlanPatternMatch;
-  assert(match(VPExtract, m_VPInstruction<VPInstruction::ExtractFromEnd>(
-                              m_VPValue(), m_VPValue())) &&
-         "FOR LiveOut expects to use an extract from end.");
-  Value *ResumeScalarFOR = State.get(VPExtract, UF - 1, true);
-
-  // Fix the initial value of the original recurrence in the scalar loop.
-  PHINode *ScalarHeaderPhi = LO->getPhi();
-  auto *InitScalarFOR =
-      ScalarHeaderPhi->getIncomingValueForBlock(LoopScalarPreHeader);
-  Builder.SetInsertPoint(LoopScalarPreHeader, LoopScalarPreHeader->begin());
-  auto *ScalarPreheaderPhi =
-      Builder.CreatePHI(ScalarHeaderPhi->getType(), 2, "scalar.recur.init");
-  for (auto *BB : predecessors(LoopScalarPreHeader)) {
-    auto *Incoming = BB == LoopMiddleBlock ? ResumeScalarFOR : InitScalarFOR;
-    ScalarPreheaderPhi->addIncoming(Incoming, BB);
-  }
-  ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
-                                            ScalarPreheaderPhi);
-  ScalarHeaderPhi->setName("scalar.recur");
+// Helper to reorder blocks so they match the original order even after the
+// order of the predecessors changes. This is only used to avoid a number of
+// test changes due to reordering of incoming blocks in phi nodes and should be
+// removed soon, with the tests being updated.
+static void reorderIncomingBlocks(SmallVectorImpl<BasicBlock *> &Blocks,
+                                  BasicBlock *LoopMiddleBlock) {
+  if (Blocks.front() == LoopMiddleBlock)
+    std::swap(Blocks.front(), Blocks.back());
+  if (Blocks.size() == 3)
+    std::swap(Blocks[0], Blocks[1]);
 }
 
 void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {
@@ -7388,7 +7307,9 @@ static void createAndCollectMergePhiForReduction(
   // If we are fixing reductions in the epilogue loop then we should already
   // have created a bc.merge.rdx Phi after the main vector body. Ensure that
   // we carry over the incoming values correctly.
-  for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
+  SmallVector<BasicBlock *> Blocks(predecessors(LoopScalarPreHeader));
+  reorderIncomingBlocks(Blocks, LoopMiddleBlock);
+  for (auto *Incoming : Blocks) {
     if (Incoming == LoopMiddleBlock)
       BCBlockPhi->addIncoming(FinalValue, Incoming);
     else if (ResumePhi && is_contained(ResumePhi->blocks(), Incoming))
@@ -7459,6 +7380,21 @@ LoopVectorizationPlanner::executePlan(
   std::tie(State.CFG.PrevBB, CanonicalIVStartValue) =
       ILV.createVectorizedLoopSkeleton(ExpandedSCEVs ? *ExpandedSCEVs
                                                      : State.ExpandedSCEVs);
+#ifdef EXPENSIVE_CHECKS
+  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
+#endif
+
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(BestVPlan.getVectorLoopRegion()->getSingleSuccessor());
+
+  using namespace llvm::VPlanPatternMatch;
+  if (MiddleVPBB->begin() != MiddleVPBB->end() &&
+      match(&MiddleVPBB->back(), m_BranchOnCond(m_VPValue()))) {
+    cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[1])
+        ->resetBlock(OrigLoop->getLoopPreheader());
+  } else
+    cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[0])
+        ->resetBlock(OrigLoop->getLoopPreheader());
 
   // Only use noalias metadata when using memory checks guaranteeing no overlap
   // across all iterations.
@@ -7539,6 +7475,18 @@ LoopVectorizationPlanner::executePlan(
 
   ILV.printDebugTracesAtEnd();
 
+  // Adjust branch weight of the branch in the middle block.
+  auto *MiddleTerm =
+      cast<BranchInst>(State.CFG.VPBB2IRBB[ExitVPBB]->getTerminator());
+  if (MiddleTerm->isConditional() &&
+      hasBranchWeightMD(*OrigLoop->getLoopLatch()->getTerminator())) {
+    // Assume that `Count % VectorTripCount` is equally distributed.
+    unsigned TripCount = State.UF * State.VF.getKnownMinValue();
+    assert(TripCount > 0 && "trip count should not be zero");
+    const uint32_t Weights[] = {1, TripCount - 1};
+    setBranchWeights(*MiddleTerm, Weights);
+  }
+
   return {State.ExpandedSCEVs, ReductionResumeValues};
 }
 
@@ -7595,7 +7543,7 @@ EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton(
   // inductions in the epilogue loop are created before executing the plan for
   // the epilogue loop.
 
-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }
 
 void EpilogueVectorizerMainLoop::printDebugTracesAtStart() {
@@ -7719,8 +7667,11 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
       VecEpilogueIterationCountCheck,
       VecEpilogueIterationCountCheck->getSinglePredecessor());
 
-  DT->changeImmediateDominator(LoopScalarPreHeader,
-                               EPI.EpilogueIterationCountCheck);
+  if (auto *N = DT->getNode(LoopScalarPreHeader))
+    DT->changeImmediateDominator(LoopScalarPreHeader,
+                                 EPI.EpilogueIterationCountCheck);
+  else
+    DT->addNewBlock(LoopScalarPreHeader, EPI.EpilogueIterationCountCheck);
   if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
     // If there is an epilogue which must run, there's no edge from the
     // middle block to exit blocks  and thus no need to update the immediate
@@ -7784,7 +7735,7 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
                               {VecEpilogueIterationCountCheck,
                                EPI.VectorTripCount} /* AdditionalBypass */);
 
-  return {completeLoopSkeleton(), EPResumeVal};
+  return {LoopVectorPreHeader, EPResumeVal};
 }
 
 BasicBlock *
@@ -8515,7 +8466,9 @@ static void addUsersInExitBlock(VPBasicBlock *HeaderVPBB, Loop *OrigLoop,
     Value *IncomingValue =
         ExitPhi.getIncomingValueForBlock(ExitingBB);
     VPValue *V = Builder.getVPValueOrAddLiveIn(IncomingValue, Plan);
-    Plan.addLiveOut(&ExitPhi, V);
+    Plan.addLiveOut(
+        &ExitPhi, V,
+        cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
   }
 }
 
@@ -8534,9 +8487,25 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
   // modified; a basic block for the vector pre-header, followed by a region for
   // the vector loop, followed by the middle basic block. The skeleton vector
   // loop region contains a header and latch basic blocks.
+
+  // Add a check in the middle block to see if we have completed
+  // all of the iterations in the first vector loop.  Three cases:
+  // 1) If we require a scalar epilogue, there is no conditional branch as
+  //    we unconditionally branch to the scalar preheader.  Do nothing.
+  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
+  //    Thus if tail is to be folded, we know we don't need to run the
+  //    remainder and we can use the previous value for the condition (true).
+  // 3) Otherwise, construct a runtime check.
+  bool RequiresScalarEpilogueCheck =
+      LoopVectorizationPlanner::getDecisionAndClampRange(
+          [this](ElementCount VF) {
+            return !CM.requiresScalarEpilogue(VF.isVector());
+          },
+          Range);
   VPlanPtr Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE(), OrigLoop->getLoopPreheader());
+      *PSE.getSE(), RequiresScalarEpilogueCheck, CM.foldTailByMasking(),
+      OrigLoop);
   VPBasicBlock *HeaderVPBB = new VPBasicBlock("vector.body");
   VPBasicBlock *LatchVPBB = new VPBasicBlock("vector.latch");
   VPBlockUtils::insertBlockAfter(LatchVPBB, HeaderVPBB);
@@ -8679,6 +8648,49 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
          "VPBasicBlock");
   RecipeBuilder.fixHeaderPhis();
 
+  auto *MiddleVPBB =
+      cast<VPBasicBlock>(Plan->getVectorLoopRegion()->getSingleSuccessor());
+
+  VPBasicBlock *ScalarPH = nullptr;
+  for (VPBlockBase *Succ : MiddleVPBB->getSuccessors()) {
+    auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Succ);
+    if (VPIRBB && VPIRBB->getIRBasicBlock() == OrigLoop->getHeader()) {
+      ScalarPH = VPIRBB;
+      break;
+    }
+  }
+
+  if (ScalarPH) {
+    for (auto &H : HeaderVPBB->phis()) {
+      auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
+      if (!FOR)
+        continue;
+      VPBuilder B(ScalarPH);
+      VPBuilder MiddleBuilder;
+      // Set insert point so new recipes are inserted before terminator and
+      // condition, if there is either the former or both.
+      if (MiddleVPBB->getNumSuccessors() != 2)
+        MiddleBuilder.setInsertPoint(MiddleVPBB);
+      else if (isa<VPInstruction>(MiddleVPBB->getTerminator()->getOperand(0)))
+        MiddleBuilder.setInsertPoint(
+            &*std::prev(MiddleVPBB->getTerminator()->getIterator()));
+      else
+        MiddleBuilder.setInsertPoint(MiddleVPBB->getTerminator());
+
+      // Extract the resume value and create a new VPLiveOut for it.
+      auto *Resume = MiddleBuilder.createNaryOp(
+          VPInstruction::ExtractFromEnd,
+          {FOR->getBackedgeValue(),
+           Plan->getOrAddLiveIn(
+               ConstantInt::get(Plan->getCanonicalIV()->getScalarType(), 1))},
+          {}, "vector.recur.extract");
+      auto *R =
+          B.createNaryOp(VPInstruction::ExitPhi, {Resume, FOR->getStartValue()},
+                         {}, "scalar.recur.init");
+      Plan->addLiveOut(cast<PHINode>(FOR->getUnderlyingInstr()), R, ScalarPH);
+    }
+  }
+
   // ---------------------------------------------------------------------------
   // Transform initial VPlan: Apply previously taken decisions, in order, to
   // bring the VPlan to its final state.
@@ -8784,7 +8796,7 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
   // Create new empty VPlan
   auto Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE(), OrigLoop->getLoopPreheader());
+      *PSE.getSE(), true, false, OrigLoop);
 
   // Build hierarchical CFG
   VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan);
@@ -8993,6 +9005,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     }
   }
   Builder.setInsertPoint(&*LatchVPBB->begin());
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor());
+  VPBasicBlock::iterator IP = MiddleVPBB->begin();
   for (VPRecipeBase &R :
        Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
     VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
@@ -9101,8 +9116,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     // also modeled in VPlan.
     auto *FinalReductionResult = new VPInstruction(
         VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
-    cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor())
-        ->appendRecipe(FinalReductionResult);
+    FinalReductionResult->insertBefore(*MiddleVPBB, IP);
+    IP = std::next(FinalReductionResult->getIterator());
     OrigExitingVPV->replaceUsesWithIf(
         FinalReductionResult,
         [](VPUser &User, unsigned) { return isa<VPLiveOut>(&User); });
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index f17be451e6846..4138bee310ece 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -443,11 +443,29 @@ VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
 }
 
 void VPIRBasicBlock::execute(VPTransformState *State) {
-  assert(getHierarchicalPredecessors().empty() &&
-         "VPIRBasicBlock cannot have predecessors at the moment");
   assert(getHierarchicalSuccessors().empty() &&
          "VPIRBasicBlock cannot have successors at the moment");
 
+  for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
+    VPBasicBlock *PredVPBB = PredVPBlock->getExitingBasicBlock();
+    auto &PredVPSuccessors = PredVPBB->getHierarchicalSuccessors();
+    BasicBlock *PredBB = State->CFG.VPBB2IRBB[PredVPBB];
+
+    assert(PredBB && "Predecessor basic-block not found building successor.");
+    auto *PredBBTerminator = PredBB->getTerminator();
+    LLVM_DEBUG(dbgs() << "LV: draw edge from" << PredBB->getName() << '\n');
+
+    auto *TermBr = dyn_cast<BranchInst>(PredBBTerminator);
+    if (TermBr) {
+      // Set each forward successor here when it is created, excluding
+      // backedges. A backward successor is set when the branch is created.
+      unsigned idx = PredVPSuccessors.front() == this ? 0 : 1;
+      assert(!TermBr->getSuccessor(idx) &&
+             "Trying to reset an existing successor block.");
+      TermBr->setSuccessor(idx, IRBB);
+    }
+  }
+
   State->Builder.SetInsertPoint(getIRBasicBlock()->getTerminator());
   executeRecipes(State, getIRBasicBlock());
 }
@@ -479,6 +497,14 @@ void VPBasicBlock::execute(VPTransformState *State) {
     // The Exit block of a loop is always set to be successor 0 of the Exiting
     // block.
     cast<BranchInst>(ExitingBB->getTerminator())->setSuccessor(0, NewBB);
+    // Set the insert point for recipe execution in the block.
+    State->Builder.SetInsertPoint(NewBB->getTerminator());
+    if (getSuccessors().size() == 1) {
+      BranchInst *Br = State->Builder.CreateBr(NewBB);
+      Br->setSuccessor(0, nullptr);
+      NewBB->getTerminator()->eraseFromParent();
+      State->Builder.SetInsertPoint(NewBB->getTerminator());
+    }
     State->CFG.DTU.applyUpdates({{DominatorTree::Insert, ExitingBB, NewBB}});
   } else if (PrevVPBB && /* A */
              !((SingleHPred = getSingleHierarchicalPredecessor()) &&
@@ -639,6 +665,7 @@ void VPBasicBlock::print(raw_ostream &O, const Twine &Indent,
 
   printSuccessors(O, Indent);
 }
+
 #endif
 
 static std::pair<VPBlockBase *, VPBlockBase *> cloneSESE(VPBlockBase *Entry);
@@ -654,12 +681,23 @@ static std::pair<VPBlockBase *, VPBlockBase *> cloneSESE(VPBlockBase *Entry) {
       Entry);
   for (VPBlockBase *BB : RPOT) {
     VPBlockBase *NewBB = BB->clone();
-    for (VPBlockBase *Pred : BB->getPredecessors())
-      VPBlockUtils::connectBlocks(Old2NewVPBlocks[Pred], NewBB);
-
     Old2NewVPBlocks[BB] = NewBB;
   }
 
+  for (VPBlockBase *BB : RPOT) {
+    VPBlockBase *NewBB = Old2NewVPBlocks[BB];
+    SmallVector<VPBlockBase *> NewPreds;
+    for (VPBlockBase *Pred : BB->getPredecessors()) {
+      NewPreds.push_back(Old2NewVPBlocks[Pred]);
+    }
+    NewBB->setPredecessors(NewPreds);
+    SmallVector<VPBlockBase *> NewSuccs;
+    for (VPBlockBase *Succ : BB->successors()) {
+      NewSuccs.push_back(Old2NewVPBlocks[Succ]);
+    }
+    NewBB->setSuccessors(NewSuccs);
+  }
+
 #if !defined(NDEBUG)
   // Verif...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Jun 7, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch introduces a new ExitPhi VPInstruction which creates a phi in
a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.

The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.

In follow-up changes, it will also be used to create phis for reduction and
induction resume values.

Depends on #92651


Patch is 177.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/94760.diff

38 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+118-103)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+91-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+26-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+56-16)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+19-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/first-order-recurrence-fold-tail.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/fixed-order-recurrence.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/synthesize-mask-for-call.ll (+48)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/widen-call-with-intrinsic-or-libfunc.ll (+16)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+18)
  • (modified) llvm/test/Transforms/LoopVectorize/SystemZ/predicated-first-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/branch-weights.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains-vplan.ll (+27-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll (+28-28)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-complex.ll (+21-21)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-multiply-recurrences.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+54-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+78-78)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+21-21)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-and-scalarize-only.ll (+18-1)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/pr36983.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/pr59319-loop-access-info-invalidation.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-iv-transforms.ll (+9-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+17)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-outer-loop.ll (+10-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+124-3)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge-vf1.ll (+9-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+15)
  • (modified) llvm/unittests/Transforms/Vectorize/VPlanTestBase.h (+8-8)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c7c19ef456c7c..ae62df3aed207 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -607,10 +607,6 @@ class InnerLoopVectorizer {
                     BasicBlock *MiddleBlock, BasicBlock *VectorHeader,
                     VPlan &Plan, VPTransformState &State);
 
-  /// Create the phi node for the resume value of first order recurrences in the
-  /// scalar preheader and update the users in the scalar loop.
-  void fixFixedOrderRecurrence(VPLiveOut *LO, VPTransformState &State);
-
   /// Iteratively sink the scalarized operands of a predicated instruction into
   /// the block that was created for it.
   void sinkScalarOperands(Instruction *PredInst);
@@ -2972,22 +2968,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
       SplitBlock(LoopMiddleBlock, LoopMiddleBlock->getTerminator(), DT, LI,
                  nullptr, Twine(Prefix) + "scalar.ph");
 
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Set up the middle block terminator.  Two cases:
-  // 1) If we know that we must execute the scalar epilogue, emit an
-  //    unconditional branch.
-  // 2) Otherwise, we must have a single unique exit block (due to how we
-  //    implement the multiple exit case).  In this case, set up a conditional
-  //    branch from the middle block to the loop scalar preheader, and the
-  //    exit block.  completeLoopSkeleton will update the condition to use an
-  //    iteration check, if required to decide whether to execute the remainder.
-  BranchInst *BrInst =
-      Cost->requiresScalarEpilogue(VF.isVector())
-          ? BranchInst::Create(LoopScalarPreHeader)
-          : BranchInst::Create(LoopExitBlock, LoopScalarPreHeader,
-                               Builder.getTrue());
-  BrInst->setDebugLoc(ScalarLatchTerm->getDebugLoc());
+  auto *BrInst = new UnreachableInst(LoopMiddleBlock->getContext());
   ReplaceInstWithInst(LoopMiddleBlock->getTerminator(), BrInst);
 
   // Update dominator for loop exit. During skeleton creation, only the vector
@@ -3094,51 +3075,6 @@ void InnerLoopVectorizer::createInductionResumeValues(
   }
 }
 
-BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
-  // The trip counts should be cached by now.
-  Value *Count = getTripCount();
-  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
-
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Add a check in the middle block to see if we have completed
-  // all of the iterations in the first vector loop.  Three cases:
-  // 1) If we require a scalar epilogue, there is no conditional branch as
-  //    we unconditionally branch to the scalar preheader.  Do nothing.
-  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
-  //    Thus if tail is to be folded, we know we don't need to run the
-  //    remainder and we can use the previous value for the condition (true).
-  // 3) Otherwise, construct a runtime check.
-  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
-      !Cost->foldTailByMasking()) {
-    // Here we use the same DebugLoc as the scalar loop latch terminator instead
-    // of the corresponding compare because they may have ended up with
-    // different line numbers and we want to avoid awkward line stepping while
-    // debugging. Eg. if the compare has got a line number inside the loop.
-    // TODO: At the moment, CreateICmpEQ will simplify conditions with constant
-    // operands. Perform simplification directly on VPlan once the branch is
-    // modeled there.
-    IRBuilder<> B(LoopMiddleBlock->getTerminator());
-    B.SetCurrentDebugLocation(ScalarLatchTerm->getDebugLoc());
-    Value *CmpN = B.CreateICmpEQ(Count, VectorTripCount, "cmp.n");
-    BranchInst &BI = *cast<BranchInst>(LoopMiddleBlock->getTerminator());
-    BI.setCondition(CmpN);
-    if (hasBranchWeightMD(*ScalarLatchTerm)) {
-      // Assume that `Count % VectorTripCount` is equally distributed.
-      unsigned TripCount = UF * VF.getKnownMinValue();
-      assert(TripCount > 0 && "trip count should not be zero");
-      const uint32_t Weights[] = {1, TripCount - 1};
-      setBranchWeights(BI, Weights);
-    }
-  }
-
-#ifdef EXPENSIVE_CHECKS
-  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
-#endif
-
-  return LoopVectorPreHeader;
-}
-
 std::pair<BasicBlock *, Value *>
 InnerLoopVectorizer::createVectorizedLoopSkeleton(
     const SCEV2ValueTy &ExpandedSCEVs) {
@@ -3198,7 +3134,7 @@ InnerLoopVectorizer::createVectorizedLoopSkeleton(
   // Emit phis for the new starting index of the scalar loop.
   createInductionResumeValues(ExpandedSCEVs);
 
-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }
 
 // Fix up external users of the induction variable. At this point, we are
@@ -3399,8 +3335,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
   for (const auto &[_, LO] : to_vector(Plan.getLiveOuts())) {
     if (!Legal->isFixedOrderRecurrence(LO->getPhi()))
       continue;
-    fixFixedOrderRecurrence(LO, State);
-    Plan.removeLiveOut(LO->getPhi());
   }
 
   // Forget the original basic block.
@@ -3470,31 +3404,16 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                VF.getKnownMinValue() * UF);
 }
 
-void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
-                                                  VPTransformState &State) {
-  // Extract the last vector element in the middle block. This will be the
-  // initial value for the recurrence when jumping to the scalar loop.
-  VPValue *VPExtract = LO->getOperand(0);
-  using namespace llvm::VPlanPatternMatch;
-  assert(match(VPExtract, m_VPInstruction<VPInstruction::ExtractFromEnd>(
-                              m_VPValue(), m_VPValue())) &&
-         "FOR LiveOut expects to use an extract from end.");
-  Value *ResumeScalarFOR = State.get(VPExtract, UF - 1, true);
-
-  // Fix the initial value of the original recurrence in the scalar loop.
-  PHINode *ScalarHeaderPhi = LO->getPhi();
-  auto *InitScalarFOR =
-      ScalarHeaderPhi->getIncomingValueForBlock(LoopScalarPreHeader);
-  Builder.SetInsertPoint(LoopScalarPreHeader, LoopScalarPreHeader->begin());
-  auto *ScalarPreheaderPhi =
-      Builder.CreatePHI(ScalarHeaderPhi->getType(), 2, "scalar.recur.init");
-  for (auto *BB : predecessors(LoopScalarPreHeader)) {
-    auto *Incoming = BB == LoopMiddleBlock ? ResumeScalarFOR : InitScalarFOR;
-    ScalarPreheaderPhi->addIncoming(Incoming, BB);
-  }
-  ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
-                                            ScalarPreheaderPhi);
-  ScalarHeaderPhi->setName("scalar.recur");
+// Helper to reorder blocks so they match the original order even after the
+// order of the predecessors changes. This is only used to avoid a number of
+// test changes due to reordering of incoming blocks in phi nodes and should be
+// removed soon, with the tests being updated.
+static void reorderIncomingBlocks(SmallVectorImpl<BasicBlock *> &Blocks,
+                                  BasicBlock *LoopMiddleBlock) {
+  if (Blocks.front() == LoopMiddleBlock)
+    std::swap(Blocks.front(), Blocks.back());
+  if (Blocks.size() == 3)
+    std::swap(Blocks[0], Blocks[1]);
 }
 
 void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {
@@ -7388,7 +7307,9 @@ static void createAndCollectMergePhiForReduction(
   // If we are fixing reductions in the epilogue loop then we should already
   // have created a bc.merge.rdx Phi after the main vector body. Ensure that
   // we carry over the incoming values correctly.
-  for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
+  SmallVector<BasicBlock *> Blocks(predecessors(LoopScalarPreHeader));
+  reorderIncomingBlocks(Blocks, LoopMiddleBlock);
+  for (auto *Incoming : Blocks) {
     if (Incoming == LoopMiddleBlock)
       BCBlockPhi->addIncoming(FinalValue, Incoming);
     else if (ResumePhi && is_contained(ResumePhi->blocks(), Incoming))
@@ -7459,6 +7380,21 @@ LoopVectorizationPlanner::executePlan(
   std::tie(State.CFG.PrevBB, CanonicalIVStartValue) =
       ILV.createVectorizedLoopSkeleton(ExpandedSCEVs ? *ExpandedSCEVs
                                                      : State.ExpandedSCEVs);
+#ifdef EXPENSIVE_CHECKS
+  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
+#endif
+
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(BestVPlan.getVectorLoopRegion()->getSingleSuccessor());
+
+  using namespace llvm::VPlanPatternMatch;
+  if (MiddleVPBB->begin() != MiddleVPBB->end() &&
+      match(&MiddleVPBB->back(), m_BranchOnCond(m_VPValue()))) {
+    cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[1])
+        ->resetBlock(OrigLoop->getLoopPreheader());
+  } else
+    cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[0])
+        ->resetBlock(OrigLoop->getLoopPreheader());
 
   // Only use noalias metadata when using memory checks guaranteeing no overlap
   // across all iterations.
@@ -7539,6 +7475,18 @@ LoopVectorizationPlanner::executePlan(
 
   ILV.printDebugTracesAtEnd();
 
+  // Adjust branch weight of the branch in the middle block.
+  auto *MiddleTerm =
+      cast<BranchInst>(State.CFG.VPBB2IRBB[ExitVPBB]->getTerminator());
+  if (MiddleTerm->isConditional() &&
+      hasBranchWeightMD(*OrigLoop->getLoopLatch()->getTerminator())) {
+    // Assume that `Count % VectorTripCount` is equally distributed.
+    unsigned TripCount = State.UF * State.VF.getKnownMinValue();
+    assert(TripCount > 0 && "trip count should not be zero");
+    const uint32_t Weights[] = {1, TripCount - 1};
+    setBranchWeights(*MiddleTerm, Weights);
+  }
+
   return {State.ExpandedSCEVs, ReductionResumeValues};
 }
 
@@ -7595,7 +7543,7 @@ EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton(
   // inductions in the epilogue loop are created before executing the plan for
   // the epilogue loop.
 
-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }
 
 void EpilogueVectorizerMainLoop::printDebugTracesAtStart() {
@@ -7719,8 +7667,11 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
       VecEpilogueIterationCountCheck,
       VecEpilogueIterationCountCheck->getSinglePredecessor());
 
-  DT->changeImmediateDominator(LoopScalarPreHeader,
-                               EPI.EpilogueIterationCountCheck);
+  if (auto *N = DT->getNode(LoopScalarPreHeader))
+    DT->changeImmediateDominator(LoopScalarPreHeader,
+                                 EPI.EpilogueIterationCountCheck);
+  else
+    DT->addNewBlock(LoopScalarPreHeader, EPI.EpilogueIterationCountCheck);
   if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
     // If there is an epilogue which must run, there's no edge from the
     // middle block to exit blocks  and thus no need to update the immediate
@@ -7784,7 +7735,7 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
                               {VecEpilogueIterationCountCheck,
                                EPI.VectorTripCount} /* AdditionalBypass */);
 
-  return {completeLoopSkeleton(), EPResumeVal};
+  return {LoopVectorPreHeader, EPResumeVal};
 }
 
 BasicBlock *
@@ -8515,7 +8466,9 @@ static void addUsersInExitBlock(VPBasicBlock *HeaderVPBB, Loop *OrigLoop,
     Value *IncomingValue =
         ExitPhi.getIncomingValueForBlock(ExitingBB);
     VPValue *V = Builder.getVPValueOrAddLiveIn(IncomingValue, Plan);
-    Plan.addLiveOut(&ExitPhi, V);
+    Plan.addLiveOut(
+        &ExitPhi, V,
+        cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
   }
 }
 
@@ -8534,9 +8487,25 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
   // modified; a basic block for the vector pre-header, followed by a region for
   // the vector loop, followed by the middle basic block. The skeleton vector
   // loop region contains a header and latch basic blocks.
+
+  // Add a check in the middle block to see if we have completed
+  // all of the iterations in the first vector loop.  Three cases:
+  // 1) If we require a scalar epilogue, there is no conditional branch as
+  //    we unconditionally branch to the scalar preheader.  Do nothing.
+  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
+  //    Thus if tail is to be folded, we know we don't need to run the
+  //    remainder and we can use the previous value for the condition (true).
+  // 3) Otherwise, construct a runtime check.
+  bool RequiresScalarEpilogueCheck =
+      LoopVectorizationPlanner::getDecisionAndClampRange(
+          [this](ElementCount VF) {
+            return !CM.requiresScalarEpilogue(VF.isVector());
+          },
+          Range);
   VPlanPtr Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE(), OrigLoop->getLoopPreheader());
+      *PSE.getSE(), RequiresScalarEpilogueCheck, CM.foldTailByMasking(),
+      OrigLoop);
   VPBasicBlock *HeaderVPBB = new VPBasicBlock("vector.body");
   VPBasicBlock *LatchVPBB = new VPBasicBlock("vector.latch");
   VPBlockUtils::insertBlockAfter(LatchVPBB, HeaderVPBB);
@@ -8679,6 +8648,49 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
          "VPBasicBlock");
   RecipeBuilder.fixHeaderPhis();
 
+  auto *MiddleVPBB =
+      cast<VPBasicBlock>(Plan->getVectorLoopRegion()->getSingleSuccessor());
+
+  VPBasicBlock *ScalarPH = nullptr;
+  for (VPBlockBase *Succ : MiddleVPBB->getSuccessors()) {
+    auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Succ);
+    if (VPIRBB && VPIRBB->getIRBasicBlock() == OrigLoop->getHeader()) {
+      ScalarPH = VPIRBB;
+      break;
+    }
+  }
+
+  if (ScalarPH) {
+    for (auto &H : HeaderVPBB->phis()) {
+      auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
+      if (!FOR)
+        continue;
+      VPBuilder B(ScalarPH);
+      VPBuilder MiddleBuilder;
+      // Set insert point so new recipes are inserted before terminator and
+      // condition, if there is either the former or both.
+      if (MiddleVPBB->getNumSuccessors() != 2)
+        MiddleBuilder.setInsertPoint(MiddleVPBB);
+      else if (isa<VPInstruction>(MiddleVPBB->getTerminator()->getOperand(0)))
+        MiddleBuilder.setInsertPoint(
+            &*std::prev(MiddleVPBB->getTerminator()->getIterator()));
+      else
+        MiddleBuilder.setInsertPoint(MiddleVPBB->getTerminator());
+
+      // Extract the resume value and create a new VPLiveOut for it.
+      auto *Resume = MiddleBuilder.createNaryOp(
+          VPInstruction::ExtractFromEnd,
+          {FOR->getBackedgeValue(),
+           Plan->getOrAddLiveIn(
+               ConstantInt::get(Plan->getCanonicalIV()->getScalarType(), 1))},
+          {}, "vector.recur.extract");
+      auto *R =
+          B.createNaryOp(VPInstruction::ExitPhi, {Resume, FOR->getStartValue()},
+                         {}, "scalar.recur.init");
+      Plan->addLiveOut(cast<PHINode>(FOR->getUnderlyingInstr()), R, ScalarPH);
+    }
+  }
+
   // ---------------------------------------------------------------------------
   // Transform initial VPlan: Apply previously taken decisions, in order, to
   // bring the VPlan to its final state.
@@ -8784,7 +8796,7 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
   // Create new empty VPlan
   auto Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE(), OrigLoop->getLoopPreheader());
+      *PSE.getSE(), true, false, OrigLoop);
 
   // Build hierarchical CFG
   VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan);
@@ -8993,6 +9005,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     }
   }
   Builder.setInsertPoint(&*LatchVPBB->begin());
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor());
+  VPBasicBlock::iterator IP = MiddleVPBB->begin();
   for (VPRecipeBase &R :
        Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
     VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
@@ -9101,8 +9116,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     // also modeled in VPlan.
     auto *FinalReductionResult = new VPInstruction(
         VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
-    cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor())
-        ->appendRecipe(FinalReductionResult);
+    FinalReductionResult->insertBefore(*MiddleVPBB, IP);
+    IP = std::next(FinalReductionResult->getIterator());
     OrigExitingVPV->replaceUsesWithIf(
         FinalReductionResult,
         [](VPUser &User, unsigned) { return isa<VPLiveOut>(&User); });
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index f17be451e6846..4138bee310ece 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -443,11 +443,29 @@ VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
 }
 
 void VPIRBasicBlock::execute(VPTransformState *State) {
-  assert(getHierarchicalPredecessors().empty() &&
-         "VPIRBasicBlock cannot have predecessors at the moment");
   assert(getHierarchicalSuccessors().empty() &&
          "VPIRBasicBlock cannot have successors at the moment");
 
+  for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
+    VPBasicBlock *PredVPBB = PredVPBlock->getExitingBasicBlock();
+    auto &PredVPSuccessors = PredVPBB->getHierarchicalSuccessors();
+    BasicBlock *PredBB = State->CFG.VPBB2IRBB[PredVPBB];
+
+    assert(PredBB && "Predecessor basic-block not found building successor.");
+    auto *PredBBTerminator = PredBB->getTerminator();
+    LLVM_DEBUG(dbgs() << "LV: draw edge from" << PredBB->getName() << '\n');
+
+    auto *TermBr = dyn_cast<BranchInst>(PredBBTerminator);
+    if (TermBr) {
+      // Set each forward successor here when it is created, excluding
+      // backedges. A backward successor is set when the branch is created.
+      unsigned idx = PredVPSuccessors.front() == this ? 0 : 1;
+      assert(!TermBr->getSuccessor(idx) &&
+             "Trying to reset an existing successor block.");
+      TermBr->setSuccessor(idx, IRBB);
+    }
+  }
+
   State->Builder.SetInsertPoint(getIRBasicBlock()->getTerminator());
   executeRecipes(State, getIRBasicBlock());
 }
@@ -479,6 +497,14 @@ void VPBasicBlock::execute(VPTransformState *State) {
     // The Exit block of a loop is always set to be successor 0 of the Exiting
     // block.
     cast<BranchInst>(ExitingBB->getTerminator())->setSuccessor(0, NewBB);
+    // Set the insert point for recipe execution in the block.
+    State->Builder.SetInsertPoint(NewBB->getTerminator());
+    if (getSuccessors().size() == 1) {
+      BranchInst *Br = State->Builder.CreateBr(NewBB);
+      Br->setSuccessor(0, nullptr);
+      NewBB->getTerminator()->eraseFromParent();
+      State->Builder.SetInsertPoint(NewBB->getTerminator());
+    }
     State->CFG.DTU.applyUpdates({{DominatorTree::Insert, ExitingBB, NewBB}});
   } else if (PrevVPBB && /* A */
              !((SingleHPred = getSingleHierarchicalPredecessor()) &&
@@ -639,6 +665,7 @@ void VPBasicBlock::print(raw_ostream &O, const Twine &Indent,
 
   printSuccessors(O, Indent);
 }
+
 #endif
 
 static std::pair<VPBlockBase *, VPBlockBase *> cloneSESE(VPBlockBase *Entry);
@@ -654,12 +681,23 @@ static std::pair<VPBlockBase *, VPBlockBase *> cloneSESE(VPBlockBase *Entry) {
       Entry);
   for (VPBlockBase *BB : RPOT) {
     VPBlockBase *NewBB = BB->clone();
-    for (VPBlockBase *Pred : BB->getPredecessors())
-      VPBlockUtils::connectBlocks(Old2NewVPBlocks[Pred], NewBB);
-
     Old2NewVPBlocks[BB] = NewBB;
   }
 
+  for (VPBlockBase *BB : RPOT) {
+    VPBlockBase *NewBB = Old2NewVPBlocks[BB];
+    SmallVector<VPBlockBase *> NewPreds;
+    for (VPBlockBase *Pred : BB->getPredecessors()) {
+      NewPreds.push_back(Old2NewVPBlocks[Pred]);
+    }
+    NewBB->setPredecessors(NewPreds);
+    SmallVector<VPBlockBase *> NewSuccs;
+    for (VPBlockBase *Succ : BB->successors()) {
+      NewSuccs.push_back(Old2NewVPBlocks[Succ]);
+    }
+    NewBB->setSuccessors(NewSuccs);
+  }
+
 #if !defined(NDEBUG)
   // Verif...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Jun 7, 2024

@llvm/pr-subscribers-backend-systemz

Author: Florian Hahn (fhahn)

Changes

This patch introduces a new ExitPhi VPInstruction which creates a phi in
a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.

The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.

In follow-up changes, it will also be used to create phis for reduction and
induction resume values.

Depends on #92651


Patch is 177.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/94760.diff

38 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+118-103)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+91-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+26-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+56-16)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+19-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/first-order-recurrence-fold-tail.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/fixed-order-recurrence.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/synthesize-mask-for-call.ll (+48)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/widen-call-with-intrinsic-or-libfunc.ll (+16)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+18)
  • (modified) llvm/test/Transforms/LoopVectorize/SystemZ/predicated-first-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/branch-weights.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains-vplan.ll (+27-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll (+28-28)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-complex.ll (+21-21)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-multiply-recurrences.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+54-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+78-78)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+21-21)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-and-scalarize-only.ll (+18-1)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/pr36983.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/pr59319-loop-access-info-invalidation.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-iv-transforms.ll (+9-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+17)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-outer-loop.ll (+10-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+124-3)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge-vf1.ll (+9-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+15)
  • (modified) llvm/unittests/Transforms/Vectorize/VPlanTestBase.h (+8-8)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c7c19ef456c7c..ae62df3aed207 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -607,10 +607,6 @@ class InnerLoopVectorizer {
                     BasicBlock *MiddleBlock, BasicBlock *VectorHeader,
                     VPlan &Plan, VPTransformState &State);
 
-  /// Create the phi node for the resume value of first order recurrences in the
-  /// scalar preheader and update the users in the scalar loop.
-  void fixFixedOrderRecurrence(VPLiveOut *LO, VPTransformState &State);
-
   /// Iteratively sink the scalarized operands of a predicated instruction into
   /// the block that was created for it.
   void sinkScalarOperands(Instruction *PredInst);
@@ -2972,22 +2968,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
       SplitBlock(LoopMiddleBlock, LoopMiddleBlock->getTerminator(), DT, LI,
                  nullptr, Twine(Prefix) + "scalar.ph");
 
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Set up the middle block terminator.  Two cases:
-  // 1) If we know that we must execute the scalar epilogue, emit an
-  //    unconditional branch.
-  // 2) Otherwise, we must have a single unique exit block (due to how we
-  //    implement the multiple exit case).  In this case, set up a conditional
-  //    branch from the middle block to the loop scalar preheader, and the
-  //    exit block.  completeLoopSkeleton will update the condition to use an
-  //    iteration check, if required to decide whether to execute the remainder.
-  BranchInst *BrInst =
-      Cost->requiresScalarEpilogue(VF.isVector())
-          ? BranchInst::Create(LoopScalarPreHeader)
-          : BranchInst::Create(LoopExitBlock, LoopScalarPreHeader,
-                               Builder.getTrue());
-  BrInst->setDebugLoc(ScalarLatchTerm->getDebugLoc());
+  auto *BrInst = new UnreachableInst(LoopMiddleBlock->getContext());
   ReplaceInstWithInst(LoopMiddleBlock->getTerminator(), BrInst);
 
   // Update dominator for loop exit. During skeleton creation, only the vector
@@ -3094,51 +3075,6 @@ void InnerLoopVectorizer::createInductionResumeValues(
   }
 }
 
-BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
-  // The trip counts should be cached by now.
-  Value *Count = getTripCount();
-  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
-
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Add a check in the middle block to see if we have completed
-  // all of the iterations in the first vector loop.  Three cases:
-  // 1) If we require a scalar epilogue, there is no conditional branch as
-  //    we unconditionally branch to the scalar preheader.  Do nothing.
-  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
-  //    Thus if tail is to be folded, we know we don't need to run the
-  //    remainder and we can use the previous value for the condition (true).
-  // 3) Otherwise, construct a runtime check.
-  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
-      !Cost->foldTailByMasking()) {
-    // Here we use the same DebugLoc as the scalar loop latch terminator instead
-    // of the corresponding compare because they may have ended up with
-    // different line numbers and we want to avoid awkward line stepping while
-    // debugging. Eg. if the compare has got a line number inside the loop.
-    // TODO: At the moment, CreateICmpEQ will simplify conditions with constant
-    // operands. Perform simplification directly on VPlan once the branch is
-    // modeled there.
-    IRBuilder<> B(LoopMiddleBlock->getTerminator());
-    B.SetCurrentDebugLocation(ScalarLatchTerm->getDebugLoc());
-    Value *CmpN = B.CreateICmpEQ(Count, VectorTripCount, "cmp.n");
-    BranchInst &BI = *cast<BranchInst>(LoopMiddleBlock->getTerminator());
-    BI.setCondition(CmpN);
-    if (hasBranchWeightMD(*ScalarLatchTerm)) {
-      // Assume that `Count % VectorTripCount` is equally distributed.
-      unsigned TripCount = UF * VF.getKnownMinValue();
-      assert(TripCount > 0 && "trip count should not be zero");
-      const uint32_t Weights[] = {1, TripCount - 1};
-      setBranchWeights(BI, Weights);
-    }
-  }
-
-#ifdef EXPENSIVE_CHECKS
-  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
-#endif
-
-  return LoopVectorPreHeader;
-}
-
 std::pair<BasicBlock *, Value *>
 InnerLoopVectorizer::createVectorizedLoopSkeleton(
     const SCEV2ValueTy &ExpandedSCEVs) {
@@ -3198,7 +3134,7 @@ InnerLoopVectorizer::createVectorizedLoopSkeleton(
   // Emit phis for the new starting index of the scalar loop.
   createInductionResumeValues(ExpandedSCEVs);
 
-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }
 
 // Fix up external users of the induction variable. At this point, we are
@@ -3399,8 +3335,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
   for (const auto &[_, LO] : to_vector(Plan.getLiveOuts())) {
     if (!Legal->isFixedOrderRecurrence(LO->getPhi()))
       continue;
-    fixFixedOrderRecurrence(LO, State);
-    Plan.removeLiveOut(LO->getPhi());
   }
 
   // Forget the original basic block.
@@ -3470,31 +3404,16 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                VF.getKnownMinValue() * UF);
 }
 
-void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
-                                                  VPTransformState &State) {
-  // Extract the last vector element in the middle block. This will be the
-  // initial value for the recurrence when jumping to the scalar loop.
-  VPValue *VPExtract = LO->getOperand(0);
-  using namespace llvm::VPlanPatternMatch;
-  assert(match(VPExtract, m_VPInstruction<VPInstruction::ExtractFromEnd>(
-                              m_VPValue(), m_VPValue())) &&
-         "FOR LiveOut expects to use an extract from end.");
-  Value *ResumeScalarFOR = State.get(VPExtract, UF - 1, true);
-
-  // Fix the initial value of the original recurrence in the scalar loop.
-  PHINode *ScalarHeaderPhi = LO->getPhi();
-  auto *InitScalarFOR =
-      ScalarHeaderPhi->getIncomingValueForBlock(LoopScalarPreHeader);
-  Builder.SetInsertPoint(LoopScalarPreHeader, LoopScalarPreHeader->begin());
-  auto *ScalarPreheaderPhi =
-      Builder.CreatePHI(ScalarHeaderPhi->getType(), 2, "scalar.recur.init");
-  for (auto *BB : predecessors(LoopScalarPreHeader)) {
-    auto *Incoming = BB == LoopMiddleBlock ? ResumeScalarFOR : InitScalarFOR;
-    ScalarPreheaderPhi->addIncoming(Incoming, BB);
-  }
-  ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
-                                            ScalarPreheaderPhi);
-  ScalarHeaderPhi->setName("scalar.recur");
+// Helper to reorder blocks so they match the original order even after the
+// order of the predecessors changes. This is only used to avoid a number of
+// test changes due to reordering of incoming blocks in phi nodes and should be
+// removed soon, with the tests being updated.
+static void reorderIncomingBlocks(SmallVectorImpl<BasicBlock *> &Blocks,
+                                  BasicBlock *LoopMiddleBlock) {
+  if (Blocks.front() == LoopMiddleBlock)
+    std::swap(Blocks.front(), Blocks.back());
+  if (Blocks.size() == 3)
+    std::swap(Blocks[0], Blocks[1]);
 }
 
 void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {
@@ -7388,7 +7307,9 @@ static void createAndCollectMergePhiForReduction(
   // If we are fixing reductions in the epilogue loop then we should already
   // have created a bc.merge.rdx Phi after the main vector body. Ensure that
   // we carry over the incoming values correctly.
-  for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
+  SmallVector<BasicBlock *> Blocks(predecessors(LoopScalarPreHeader));
+  reorderIncomingBlocks(Blocks, LoopMiddleBlock);
+  for (auto *Incoming : Blocks) {
     if (Incoming == LoopMiddleBlock)
       BCBlockPhi->addIncoming(FinalValue, Incoming);
     else if (ResumePhi && is_contained(ResumePhi->blocks(), Incoming))
@@ -7459,6 +7380,21 @@ LoopVectorizationPlanner::executePlan(
   std::tie(State.CFG.PrevBB, CanonicalIVStartValue) =
       ILV.createVectorizedLoopSkeleton(ExpandedSCEVs ? *ExpandedSCEVs
                                                      : State.ExpandedSCEVs);
+#ifdef EXPENSIVE_CHECKS
+  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
+#endif
+
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(BestVPlan.getVectorLoopRegion()->getSingleSuccessor());
+
+  using namespace llvm::VPlanPatternMatch;
+  if (MiddleVPBB->begin() != MiddleVPBB->end() &&
+      match(&MiddleVPBB->back(), m_BranchOnCond(m_VPValue()))) {
+    cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[1])
+        ->resetBlock(OrigLoop->getLoopPreheader());
+  } else
+    cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[0])
+        ->resetBlock(OrigLoop->getLoopPreheader());
 
   // Only use noalias metadata when using memory checks guaranteeing no overlap
   // across all iterations.
@@ -7539,6 +7475,18 @@ LoopVectorizationPlanner::executePlan(
 
   ILV.printDebugTracesAtEnd();
 
+  // Adjust branch weight of the branch in the middle block.
+  auto *MiddleTerm =
+      cast<BranchInst>(State.CFG.VPBB2IRBB[ExitVPBB]->getTerminator());
+  if (MiddleTerm->isConditional() &&
+      hasBranchWeightMD(*OrigLoop->getLoopLatch()->getTerminator())) {
+    // Assume that `Count % VectorTripCount` is equally distributed.
+    unsigned TripCount = State.UF * State.VF.getKnownMinValue();
+    assert(TripCount > 0 && "trip count should not be zero");
+    const uint32_t Weights[] = {1, TripCount - 1};
+    setBranchWeights(*MiddleTerm, Weights);
+  }
+
   return {State.ExpandedSCEVs, ReductionResumeValues};
 }
 
@@ -7595,7 +7543,7 @@ EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton(
   // inductions in the epilogue loop are created before executing the plan for
   // the epilogue loop.
 
-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }
 
 void EpilogueVectorizerMainLoop::printDebugTracesAtStart() {
@@ -7719,8 +7667,11 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
       VecEpilogueIterationCountCheck,
       VecEpilogueIterationCountCheck->getSinglePredecessor());
 
-  DT->changeImmediateDominator(LoopScalarPreHeader,
-                               EPI.EpilogueIterationCountCheck);
+  if (auto *N = DT->getNode(LoopScalarPreHeader))
+    DT->changeImmediateDominator(LoopScalarPreHeader,
+                                 EPI.EpilogueIterationCountCheck);
+  else
+    DT->addNewBlock(LoopScalarPreHeader, EPI.EpilogueIterationCountCheck);
   if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
     // If there is an epilogue which must run, there's no edge from the
     // middle block to exit blocks  and thus no need to update the immediate
@@ -7784,7 +7735,7 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
                               {VecEpilogueIterationCountCheck,
                                EPI.VectorTripCount} /* AdditionalBypass */);
 
-  return {completeLoopSkeleton(), EPResumeVal};
+  return {LoopVectorPreHeader, EPResumeVal};
 }
 
 BasicBlock *
@@ -8515,7 +8466,9 @@ static void addUsersInExitBlock(VPBasicBlock *HeaderVPBB, Loop *OrigLoop,
     Value *IncomingValue =
         ExitPhi.getIncomingValueForBlock(ExitingBB);
     VPValue *V = Builder.getVPValueOrAddLiveIn(IncomingValue, Plan);
-    Plan.addLiveOut(&ExitPhi, V);
+    Plan.addLiveOut(
+        &ExitPhi, V,
+        cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
   }
 }
 
@@ -8534,9 +8487,25 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
   // modified; a basic block for the vector pre-header, followed by a region for
   // the vector loop, followed by the middle basic block. The skeleton vector
   // loop region contains a header and latch basic blocks.
+
+  // Add a check in the middle block to see if we have completed
+  // all of the iterations in the first vector loop.  Three cases:
+  // 1) If we require a scalar epilogue, there is no conditional branch as
+  //    we unconditionally branch to the scalar preheader.  Do nothing.
+  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
+  //    Thus if tail is to be folded, we know we don't need to run the
+  //    remainder and we can use the previous value for the condition (true).
+  // 3) Otherwise, construct a runtime check.
+  bool RequiresScalarEpilogueCheck =
+      LoopVectorizationPlanner::getDecisionAndClampRange(
+          [this](ElementCount VF) {
+            return !CM.requiresScalarEpilogue(VF.isVector());
+          },
+          Range);
   VPlanPtr Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE(), OrigLoop->getLoopPreheader());
+      *PSE.getSE(), RequiresScalarEpilogueCheck, CM.foldTailByMasking(),
+      OrigLoop);
   VPBasicBlock *HeaderVPBB = new VPBasicBlock("vector.body");
   VPBasicBlock *LatchVPBB = new VPBasicBlock("vector.latch");
   VPBlockUtils::insertBlockAfter(LatchVPBB, HeaderVPBB);
@@ -8679,6 +8648,49 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
          "VPBasicBlock");
   RecipeBuilder.fixHeaderPhis();
 
+  auto *MiddleVPBB =
+      cast<VPBasicBlock>(Plan->getVectorLoopRegion()->getSingleSuccessor());
+
+  VPBasicBlock *ScalarPH = nullptr;
+  for (VPBlockBase *Succ : MiddleVPBB->getSuccessors()) {
+    auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Succ);
+    if (VPIRBB && VPIRBB->getIRBasicBlock() == OrigLoop->getHeader()) {
+      ScalarPH = VPIRBB;
+      break;
+    }
+  }
+
+  if (ScalarPH) {
+    for (auto &H : HeaderVPBB->phis()) {
+      auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
+      if (!FOR)
+        continue;
+      VPBuilder B(ScalarPH);
+      VPBuilder MiddleBuilder;
+      // Set insert point so new recipes are inserted before terminator and
+      // condition, if there is either the former or both.
+      if (MiddleVPBB->getNumSuccessors() != 2)
+        MiddleBuilder.setInsertPoint(MiddleVPBB);
+      else if (isa<VPInstruction>(MiddleVPBB->getTerminator()->getOperand(0)))
+        MiddleBuilder.setInsertPoint(
+            &*std::prev(MiddleVPBB->getTerminator()->getIterator()));
+      else
+        MiddleBuilder.setInsertPoint(MiddleVPBB->getTerminator());
+
+      // Extract the resume value and create a new VPLiveOut for it.
+      auto *Resume = MiddleBuilder.createNaryOp(
+          VPInstruction::ExtractFromEnd,
+          {FOR->getBackedgeValue(),
+           Plan->getOrAddLiveIn(
+               ConstantInt::get(Plan->getCanonicalIV()->getScalarType(), 1))},
+          {}, "vector.recur.extract");
+      auto *R =
+          B.createNaryOp(VPInstruction::ExitPhi, {Resume, FOR->getStartValue()},
+                         {}, "scalar.recur.init");
+      Plan->addLiveOut(cast<PHINode>(FOR->getUnderlyingInstr()), R, ScalarPH);
+    }
+  }
+
   // ---------------------------------------------------------------------------
   // Transform initial VPlan: Apply previously taken decisions, in order, to
   // bring the VPlan to its final state.
@@ -8784,7 +8796,7 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
   // Create new empty VPlan
   auto Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE(), OrigLoop->getLoopPreheader());
+      *PSE.getSE(), true, false, OrigLoop);
 
   // Build hierarchical CFG
   VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan);
@@ -8993,6 +9005,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     }
   }
   Builder.setInsertPoint(&*LatchVPBB->begin());
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor());
+  VPBasicBlock::iterator IP = MiddleVPBB->begin();
   for (VPRecipeBase &R :
        Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
     VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
@@ -9101,8 +9116,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     // also modeled in VPlan.
     auto *FinalReductionResult = new VPInstruction(
         VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
-    cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor())
-        ->appendRecipe(FinalReductionResult);
+    FinalReductionResult->insertBefore(*MiddleVPBB, IP);
+    IP = std::next(FinalReductionResult->getIterator());
     OrigExitingVPV->replaceUsesWithIf(
         FinalReductionResult,
         [](VPUser &User, unsigned) { return isa<VPLiveOut>(&User); });
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index f17be451e6846..4138bee310ece 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -443,11 +443,29 @@ VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
 }
 
 void VPIRBasicBlock::execute(VPTransformState *State) {
-  assert(getHierarchicalPredecessors().empty() &&
-         "VPIRBasicBlock cannot have predecessors at the moment");
   assert(getHierarchicalSuccessors().empty() &&
          "VPIRBasicBlock cannot have successors at the moment");
 
+  for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
+    VPBasicBlock *PredVPBB = PredVPBlock->getExitingBasicBlock();
+    auto &PredVPSuccessors = PredVPBB->getHierarchicalSuccessors();
+    BasicBlock *PredBB = State->CFG.VPBB2IRBB[PredVPBB];
+
+    assert(PredBB && "Predecessor basic-block not found building successor.");
+    auto *PredBBTerminator = PredBB->getTerminator();
+    LLVM_DEBUG(dbgs() << "LV: draw edge from" << PredBB->getName() << '\n');
+
+    auto *TermBr = dyn_cast<BranchInst>(PredBBTerminator);
+    if (TermBr) {
+      // Set each forward successor here when it is created, excluding
+      // backedges. A backward successor is set when the branch is created.
+      unsigned idx = PredVPSuccessors.front() == this ? 0 : 1;
+      assert(!TermBr->getSuccessor(idx) &&
+             "Trying to reset an existing successor block.");
+      TermBr->setSuccessor(idx, IRBB);
+    }
+  }
+
   State->Builder.SetInsertPoint(getIRBasicBlock()->getTerminator());
   executeRecipes(State, getIRBasicBlock());
 }
@@ -479,6 +497,14 @@ void VPBasicBlock::execute(VPTransformState *State) {
     // The Exit block of a loop is always set to be successor 0 of the Exiting
     // block.
     cast<BranchInst>(ExitingBB->getTerminator())->setSuccessor(0, NewBB);
+    // Set the insert point for recipe execution in the block.
+    State->Builder.SetInsertPoint(NewBB->getTerminator());
+    if (getSuccessors().size() == 1) {
+      BranchInst *Br = State->Builder.CreateBr(NewBB);
+      Br->setSuccessor(0, nullptr);
+      NewBB->getTerminator()->eraseFromParent();
+      State->Builder.SetInsertPoint(NewBB->getTerminator());
+    }
     State->CFG.DTU.applyUpdates({{DominatorTree::Insert, ExitingBB, NewBB}});
   } else if (PrevVPBB && /* A */
              !((SingleHPred = getSingleHierarchicalPredecessor()) &&
@@ -639,6 +665,7 @@ void VPBasicBlock::print(raw_ostream &O, const Twine &Indent,
 
   printSuccessors(O, Indent);
 }
+
 #endif
 
 static std::pair<VPBlockBase *, VPBlockBase *> cloneSESE(VPBlockBase *Entry);
@@ -654,12 +681,23 @@ static std::pair<VPBlockBase *, VPBlockBase *> cloneSESE(VPBlockBase *Entry) {
       Entry);
   for (VPBlockBase *BB : RPOT) {
     VPBlockBase *NewBB = BB->clone();
-    for (VPBlockBase *Pred : BB->getPredecessors())
-      VPBlockUtils::connectBlocks(Old2NewVPBlocks[Pred], NewBB);
-
     Old2NewVPBlocks[BB] = NewBB;
   }
 
+  for (VPBlockBase *BB : RPOT) {
+    VPBlockBase *NewBB = Old2NewVPBlocks[BB];
+    SmallVector<VPBlockBase *> NewPreds;
+    for (VPBlockBase *Pred : BB->getPredecessors()) {
+      NewPreds.push_back(Old2NewVPBlocks[Pred]);
+    }
+    NewBB->setPredecessors(NewPreds);
+    SmallVector<VPBlockBase *> NewSuccs;
+    for (VPBlockBase *Succ : BB->successors()) {
+      NewSuccs.push_back(Old2NewVPBlocks[Succ]);
+    }
+    NewBB->setSuccessors(NewSuccs);
+  }
+
 #if !defined(NDEBUG)
   // Verif...
[truncated]

This patch introduces a new ExitPhi VPInstruction which creates a phi in
a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.

The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.

In follow-up changes, it will also be used to create phis for reduction and
induction resume values.
@fhahn fhahn force-pushed the vplan-scalar-ph-phi-for branch from 984e005 to 1ea4a7c Compare July 5, 2024 15:07
Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice extension! Adding various comments.

Comment on lines -3318 to -3299
fixFixedOrderRecurrence(LO, State);
Plan.removeLiveOut(LO->getPhi());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's more to remove here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, removed, thanks!

Plan.addLiveOut(&ExitPhi, V);
Plan.addLiveOut(
&ExitPhi, V,
cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could V's parent be used instead of recording MiddleVPBB, when it's not a live-in?
Can V be a live-in, where ExitPhi uses multiple predecessors to select between invariant values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could V's parent be used instead of recording MiddleVPBB, when it's not a live-in?
Not at the moment I think, as V may be defined in the vector loop region, as the required extracts aren't modeled explicitly yet (which would be placed in MiddleVPBB)

Can V be a live-in, where ExitPhi uses multiple predecessors to select between invariant values?
V can be a live-in

Comment on lines 8607 to 8649
auto *MiddleVPBB =
cast<VPBasicBlock>(Plan->getVectorLoopRegion()->getSingleSuccessor());

VPBasicBlock *ScalarPH = nullptr;
for (VPBlockBase *Succ : MiddleVPBB->getSuccessors()) {
auto *VPBB = dyn_cast<VPBasicBlock>(Succ);
if (VPBB && !isa<VPIRBasicBlock>(VPBB)) {
ScalarPH = VPBB;
break;
}
}

if (ScalarPH) {
for (auto &H : HeaderVPBB->phis()) {
auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
if (!FOR)
continue;
VPBuilder B(ScalarPH);
VPBuilder MiddleBuilder;
// Set insert point so new recipes are inserted before terminator and
// condition, if there is either the former or both.
if (MiddleVPBB->getNumSuccessors() != 2)
MiddleBuilder.setInsertPoint(MiddleVPBB);
else if (isa<VPInstruction>(MiddleVPBB->getTerminator()->getOperand(0)))
MiddleBuilder.setInsertPoint(
&*std::prev(MiddleVPBB->getTerminator()->getIterator()));
else
MiddleBuilder.setInsertPoint(MiddleVPBB->getTerminator());

// Extract the resume value and create a new VPLiveOut for it.
auto *Resume = MiddleBuilder.createNaryOp(
VPInstruction::ExtractFromEnd,
{FOR->getBackedgeValue(),
Plan->getOrAddLiveIn(
ConstantInt::get(Plan->getCanonicalIV()->getScalarType(), 1))},
{}, "vector.recur.extract");
auto *R =
B.createNaryOp(VPInstruction::ExitPhi, {Resume, FOR->getStartValue()},
{}, "scalar.recur.init");
Plan->addLiveOut(cast<PHINode>(FOR->getUnderlyingInstr()), R, ScalarPH);
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extending an already too long function - it now exceeds 280 LOC...

auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
if (!FOR)
continue;
VPBuilder B(ScalarPH);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
VPBuilder B(ScalarPH);
VPBuilder ScalarPHBuilder(ScalarPH);

Comment on lines 8625 to 8634
VPBuilder MiddleBuilder;
// Set insert point so new recipes are inserted before terminator and
// condition, if there is either the former or both.
if (MiddleVPBB->getNumSuccessors() != 2)
MiddleBuilder.setInsertPoint(MiddleVPBB);
else if (isa<VPInstruction>(MiddleVPBB->getTerminator()->getOperand(0)))
MiddleBuilder.setInsertPoint(
&*std::prev(MiddleVPBB->getTerminator()->getIterator()));
else
MiddleBuilder.setInsertPoint(MiddleVPBB->getTerminator());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
VPBuilder MiddleBuilder;
// Set insert point so new recipes are inserted before terminator and
// condition, if there is either the former or both.
if (MiddleVPBB->getNumSuccessors() != 2)
MiddleBuilder.setInsertPoint(MiddleVPBB);
else if (isa<VPInstruction>(MiddleVPBB->getTerminator()->getOperand(0)))
MiddleBuilder.setInsertPoint(
&*std::prev(MiddleVPBB->getTerminator()->getIterator()));
else
MiddleBuilder.setInsertPoint(MiddleVPBB->getTerminator());
VPBuilder MiddleBuilder(MiddleVPBB);
// Reset insert point so new recipes are inserted before terminator and
// condition, if there is either the former or both.
if (auto *Terminator = MiddleVPBB->getTerminator()) {
auto *Condition = dyn_cast<VPInstruction>(Terminator->getOperand(0));
assert((!Condition || Condition->parent() == MiddleVPBB) && "Condition expected in MiddleVPBB");
MiddleBuilder.setInsertPoint(Condition ? Condition : Terminator);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified, thanks!

}
ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
ScalarPreheaderPhi);
ScalarHeaderPhi->setName("scalar.recur");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this setting of scalar header phi's name to "scalar.recur" retained?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No at the moment it is left as is. It would need to be done as post-processing, but adjusting the name of the phi in the scalar loop seems a bit inconsistent already (we don't rename any other phis or values in the scalar loop), so IMO removing it makes sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, better to remove it in a separate preparatory patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do ,thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 88e9c56

Comment on lines -3411 to -3392
ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
ScalarPreheaderPhi);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this incoming is set here, rather than by VPLiveIn's fixPhi.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarPreheaderPhi is being generated by ResumePhi instead, and is being set here as the incoming value of ScalarHeaderPhi - w/o any vector-to-scalar extraction - so can ResumePhi (in its current use) entail a vector-to-scalar extraction (as opposed to reusing it for feeding exit block phis)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, ResumePhi won't generate any extracts, as it will be fed by the extracted scalar resume value of the FOR. Possibly more accurately to mark ResumePhi as only having its first lane used instead of vector-to-scalar. (Done in latest version)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Would be good to simply indicate somehow that its dealing with a single scalar - coming in (from each pred) and going out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an isSingleScalar helper to handle this

@@ -8635,6 +8604,49 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
"VPBasicBlock");
RecipeBuilder.fixHeaderPhis();

auto *MiddleVPBB =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain what is about to happen next, possibly directing to VPlanTransforms::adjustFixedOrderRecurrences() for further explanation.

// Feed a resume value for every FOR from the vector loop to the scalar loop, if middle block branches to scalar preheader, by introducing ExtractFromEnd and ExitPhi recipes in each, respectively, and a VPLiveOut which uses the latter and corresponds to the scalar header.
// Start by finding out if middle block branches to scalar preheader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to separate function, with the explanation as comment, thanks!

auto *NewPhi = Builder.CreatePHI(IncomingFromOtherPred->getType(), 2, Name);
BasicBlock *VPlanPred =
State.CFG
.VPBB2IRBB[cast<VPBasicBlock>(getParent()->getSinglePredecessor())];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that since middle block is a VPIRBasicBlock, its corresponding IRBB can be obtained directly via getIRBasicBlock().

@@ -3386,33 +3380,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
VF.getKnownMinValue() * UF);
}

void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (independent): is the second part of the following comment from above along with setting of insertion point still relevant?

  // Fix LCSSA phis not already fixed earlier. Extracts may need to be generated
  // in the exit block, so update the builder.
  State.Builder.SetInsertPoint(State.CFG.ExitBB,
                               State.CFG.ExitBB->getFirstNonPHIIt());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved setting the insert point to VPLiveOut based on the predecessor.

fhahn added a commit that referenced this pull request Jul 9, 2024
The HeaderVPBB is retrieved in a similar fashion already.

This is in preparation for splitting up the very large
tryToBuildVPlanWithVPRecipes into several distinct functions, as
suggested multiple times, including in
#94760
fhahn added a commit that referenced this pull request Jul 9, 2024
The empty header and latch blocks can be created together with the
vector loop region.

This is in preparation for splitting up the very large
tryToBuildVPlanWithVPRecipes into several distinct functions, as
suggested multiple times, including in
#94760
fhahn added a commit that referenced this pull request Jul 9, 2024
Split off from #94760, clarify
as suggested.
Copy link

github-actions bot commented Jul 9, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@fhahn fhahn force-pushed the vplan-scalar-ph-phi-for branch from 09411ed to f028ab3 Compare July 9, 2024 13:45
@@ -3343,8 +3326,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
VectorLoop->getHeader(), Plan, State);
}

// Fix LCSSA phis not already fixed earlier. Extracts may need to be generated
// in the exit block, so update the builder.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to retain the first sentence: // Fix live-out phis not already fixed earlier., (LCSSA holds only for exit block phis, now live-outs are fixed also for non LCSSA header phis in scalar header) and remove the following line

  State.Builder.SetInsertPoint(State.CFG.ExitBB,
                               State.CFG.ExitBB->getFirstNonPHIIt());

?
(also good to remove the subsequent loop sinking scalar operands, when ready :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

static void addLiveOutsForFirstOrderRecurrences(VPlan &Plan) {
VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();

// Start by finding out if middle block branches to scalar preheader.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Start by finding out if middle block branches to scalar preheader.
// Start by finding out if middle block branches to scalar preheader, which is not a VPIRBasicBlock, unlike Exit block - the other possible successor of middle block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

Comment on lines 8501 to 8502
for (auto &H : VectorRegion->getEntryBasicBlock()->phis()) {
auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (auto &H : VectorRegion->getEntryBasicBlock()->phis()) {
auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&H);
for (auto &HeaderPhi : VectorRegion->getEntryBasicBlock()->phis()) {
auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&HeaderPhi);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed, thanks!

if (!FOR)
continue;

VPBuilder B(ScalarPHVPBB);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
VPBuilder B(ScalarPHVPBB);
VPBuilder ScalarPHBuilder(ScalarPHVPBB);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed thanks! Also hoisted out defining the builders.

Comment on lines 8521 to 8522
Plan.getOrAddLiveIn(
ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1))},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be clearer to pre-generate a VPValue of 1 and reuse it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Comment on lines 700 to 701
/// Fixup the wrapped LCSSA phi node. This means we need to add the
/// appropriate incoming value from the precessor Pred.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Fixup the wrapped LCSSA phi node. This means we need to add the
/// appropriate incoming value from the precessor Pred.
/// Fix the wrapped phi node. This means adding an incoming value to exit block phi's from the vector loop via middle block (values from scalar loop already reach these phi's), and updating the value to scalar header phi's from the scalar preheader.

Note that a phi (ResumePhi?) could be added inside middle block for the former case, thereby having VPlan maintain LCSSA form, and ensuring parent of incoming value is always the desired predecessor. Can be done as a follow-up (or preparatory?) patch if preferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the wording.

Note that a phi (ResumePhi?) could be added inside middle block for the former case, thereby having VPlan maintain LCSSA form, and ensuring parent of incoming value is always the desired predecessor. Can be done as a follow-up (or preparatory?) patch if preferred.

Would be good as follow-up

/// The first operand is the incoming value from the predecessor in VPlan,
/// the second operand is the incoming value for all other predecessors
/// (which are currently not modeled in VPlan).
ResumePhi,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for renaming, worth updating the title and summary of the patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Title + description should be updated now

Value *IncomingFromVPlanPred =
State.get(getOperand(0), Part, /* IsScalar */ true);
Value *IncomingFromOtherPreds =
State.get(getOperand(1), Part, /* IsScalar */ true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it conceivable for the scalar preheader to have a single predecessor, namely, VPlanPred/middle-block? I.e., when trip count is known to be larger that VFxUF and no (other) runtime checks are needed to bypass the vector loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that should be possible, but it should be handled correctly (generating a phi with a single incoming value from the VPlan predecessor)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, the second operand (FOR->getStartValue()) is redundant in this case, but exists - so getOperand(1) works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@@ -729,6 +759,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
case VPInstruction::ActiveLaneMask:
O << "active lane mask";
break;
case VPInstruction::ResumePhi:
O << "exit-phi";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
O << "exit-phi";
O << "resume-phi";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

Comment on lines -3411 to -3392
ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
ScalarPreheaderPhi);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarPreheaderPhi is being generated by ResumePhi instead, and is being set here as the incoming value of ScalarHeaderPhi - w/o any vector-to-scalar extraction - so can ResumePhi (in its current use) entail a vector-to-scalar extraction (as opposed to reusing it for feeding exit block phis)?

fhahn added a commit that referenced this pull request Jul 10, 2024
Adjusting the name of the recurrence phi in the scalar loop is a bit
inconsistent, as we do not adjust any other names in the scalar loops
(including other phis).

Remove this adjustment in preparation for
#94760 and as discussed there.
@fhahn fhahn changed the title [VPlan] Introduce ExitPhi VPInstruction, use to create phi for FOR. [VPlan] Introduce ResumePhi VPInstruction, use to create phi for FOR. Jul 10, 2024
Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, thanks!
Adding last couple of comments and thoughts.

// in the exit block, so update the builder.
State.Builder.SetInsertPoint(State.CFG.ExitBB,
State.CFG.ExitBB->getFirstNonPHIIt());
// Fix LCSSA phis not already fixed earlier.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Fix LCSSA phis not already fixed earlier.
// Fix live-out phis not already fixed earlier.

Comment on lines 199 to 200
VPRecipeBase *DefRecipe = ExitValue->getDefiningRecipe();
auto *ExitingVPBB = DefRecipe ? DefRecipe->getParent() : nullptr;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
VPRecipeBase *DefRecipe = ExitValue->getDefiningRecipe();
auto *ExitingVPBB = DefRecipe ? DefRecipe->getParent() : nullptr;
VPRecipeBase *ExitingRecipe = ExitValue->getDefiningRecipe();
auto *ExitingVPBB = ExitingRecipe ? ExitingRecipe->getParent() : nullptr;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!

Value *IncomingFromVPlanPred =
State.get(getOperand(0), Part, /* IsScalar */ true);
Value *IncomingFromOtherPreds =
State.get(getOperand(1), Part, /* IsScalar */ true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, the second operand (FOR->getStartValue()) is redundant in this case, but exists - so getOperand(1) works.

Comment on lines +616 to +621
NewPhi->addIncoming(IncomingFromVPlanPred, VPlanPred);
for (auto *OtherPred : predecessors(Builder.GetInsertBlock())) {
assert(OtherPred != VPlanPred &&
"VPlan predecessors should not be connected yet");
NewPhi->addIncoming(IncomingFromOtherPreds, OtherPred);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

default:
llvm_unreachable("Unsupported opcode for instruction");
}
}

bool VPInstruction::isVectorToScalar() const {
return getOpcode() == VPInstruction::ExtractFromEnd ||
getOpcode() == VPInstruction::ResumePhi ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, superseded by isSingleScalar.

Comment on lines +50 to +51
; CHECK-NEXT: Live-out i16 %for.1 = vp<[[RESUME_1_P]]>
; CHECK-NEXT: Live-out i16 %for.2 = vp<[[RESUME_2_P]]>
Copy link
Collaborator

@ayalz ayalz Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continuing above follow-up thought: live-outs were first introduced in exit block only, with middle-block as its implicit predecessor, so there was no need to specify the parental VPBB/VPIRBB, which was probably introduced later anyhow.

Now that live-outs correspond to both scalar header and exit block, how about promoting them from a distinct VPLiveOut derivative of VPUser to a recipe/VPInstruction? Say, VPIRRecipe/VPIRPhi/VPIRLiveOut, which wraps a original phi in an original basic-block - the latter wrapped in a VPIRBasicBlock? Then live-outs will appear as recipes of their ir-bb when printing, can reason about their parental VPBB with its predecessor(s), and their fixPhis() would fold into the standard execute(). Note that the scalar header block is not (yet) represented as a VPIRBasicBlock (inside a VPIRLoopRegion..).

Comment on lines -3411 to -3392
ScalarHeaderPhi->setIncomingValueForBlock(LoopScalarPreHeader,
ScalarPreheaderPhi);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Would be good to simply indicate somehow that its dealing with a single scalar - coming in (from each pred) and going out.

BasicBlock *MiddleBB = State.CFG.VPBB2IRBB[MiddleVPBB];
Phi->addIncoming(State.get(ExitValue, VPIteration(State.UF - 1, Lane)),
MiddleBB);
VPRecipeBase *DefRecipe = ExitValue->getDefiningRecipe();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, perhaps a live-in feeding a live-out can be useful, e.g., to indicate if the loop was entered(?).

Comment on lines 699 to 737
State.VF.isScalar()) &&
State.VF.isSingleScalar()) &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if VF is 1 should stay isScalar()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, should be fixed, thanks!

@fhahn fhahn merged commit 9a5a873 into llvm:main Jul 11, 2024
5 of 6 checks passed
@fhahn fhahn deleted the vplan-scalar-ph-phi-for branch July 11, 2024 15:08
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 12, 2024

LLVM Buildbot has detected a new failure on builder lld-x86_64-ubuntu-fast running on as-builder-4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/603

Here is the relevant piece of the build log for the reference:

Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'lld :: ELF/basic-sparcv9.s' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llvm-mc -filetype=obj -triple=sparc64-unknown-openbsd /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/lld/test/ELF/basic-sparcv9.s -o /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llvm-mc -filetype=obj -triple=sparc64-unknown-openbsd /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/lld/test/ELF/basic-sparcv9.s -o /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp
RUN: at line 3: /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/ld.lld /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp -o /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp2
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/ld.lld /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp -o /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp2
RUN: at line 4: /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llvm-readobj --file-headers --sections -l --symbols /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp2    | /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/FileCheck /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/lld/test/ELF/basic-sparcv9.s
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llvm-readobj --file-headers --sections -l --symbols /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/tools/lld/test/ELF/Output/basic-sparcv9.s.tmp2
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/FileCheck /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/lld/test/ELF/basic-sparcv9.s
/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/lld/test/ELF/basic-sparcv9.s:20:15: error: CHECK-NEXT: expected string not found in input
# CHECK-NEXT: OS/ABI: SystemV (0x0)
              ^
<stdin>:12:16: note: scanning from here
 FileVersion: 1
               ^
<stdin>:13:2: note: possible intended match here
 OS/ABI: OpenBSD (0xC)
 ^

Input file: <stdin>
Check file: /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/lld/test/ELF/basic-sparcv9.s

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           .
           .
           .
           7: ElfHeader { 
           8:  Ident { 
           9:  Magic: (7F 45 4C 46) 
          10:  Class: 64-bit (0x2) 
          11:  DataEncoding: BigEndian (0x2) 
          12:  FileVersion: 1 
next:20'0                    X error: no match found
          13:  OS/ABI: OpenBSD (0xC) 
next:20'0     ~~~~~~~~~~~~~~~~~~~~~~~
next:20'1      ?                      possible intended match
          14:  ABIVersion: 0 
next:20'0     ~~~~~~~~~~~~~~~
          15:  Unused: (00 00 00 00 00 00 00) 
next:20'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          16:  } 
next:20'0     ~~~
          17:  Type: Executable (0x2) 
next:20'0     ~~~~~~~~~~~~~~~~~~~~~~~~
...

aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
The HeaderVPBB is retrieved in a similar fashion already.

This is in preparation for splitting up the very large
tryToBuildVPlanWithVPRecipes into several distinct functions, as
suggested multiple times, including in
llvm#94760
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
The empty header and latch blocks can be created together with the
vector loop region.

This is in preparation for splitting up the very large
tryToBuildVPlanWithVPRecipes into several distinct functions, as
suggested multiple times, including in
llvm#94760
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
Split off from llvm#94760, clarify
as suggested.
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
Adjusting the name of the recurrence phi in the scalar loop is a bit
inconsistent, as we do not adjust any other names in the scalar loops
(including other phis).

Remove this adjustment in preparation for
llvm#94760 and as discussed there.
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
…llvm#94760)

This patch introduces a new ResumePhi VPInstruction which creates a phi
in a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.

The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.

In follow-up changes, it will also be used to create phis for reduction
and induction resume values.

Depends on llvm#92651

PR: llvm#94760
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants