
[SLP] Initial vectorization of non-power-of-2 ops. #77790

Merged
merged 43 commits into llvm:main from fhahn:slp-vec3 on Apr 13, 2024

Conversation

fhahn
Contributor

@fhahn fhahn commented Jan 11, 2024

This patch enables vectorization for non-power-of-2 VFs. Initially, only
VFs where adding 1 makes the VF a power-of-2 are supported, i.e. cases
where we can still make relatively effective use of the vectors.

It relies on the existing target cost-models to return accurate costs for
non-power-of-2 vectors. I checked mostly AArch64 and X86, and the costs
seem reasonable for the cases I checked, although I expect both the
cost-models and the lowering will need refinement to make the most
effective use of non-power-of-2 SLP vectorization.

Note that re-ordering and shuffling are not yet implemented for nodes
requiring padding, to keep the initial implementation simpler.

The feature is guarded by a new flag, off by default for now.
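
As a hypothetical illustration (a sketch only; the function and values are made up, not taken from the patch's tests), three consecutive scalar float load/add/store chains can become a single group of <3 x float> operations:

define void @vec3(ptr %src, ptr %dst) {
  ; replaces three scalar load/fadd/store chains over src[0..2]
  %l = load <3 x float>, ptr %src, align 4
  %a = fadd <3 x float> %l, <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
  store <3 x float> %a, ptr %dst, align 4
  ret void
}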

@llvmbot
Member

llvmbot commented Jan 11, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch introduces a new VectorizeWithPadding node type for root and
leaf nodes to allow vectorizing loads/stores with a non-power-of-2 number
of elements.

VectorizeWithPadding load nodes will pad the result to the next power of 2
with poison elements.

Non-leaf nodes will operate on normal power-of-2 vectors. For those
non-leaf nodes, we still track the number of padding elements needed to
go to the next power-of-2, to be used in various places, like cost
computation.

VectorizeWithPadding store nodes strip away the padding elements and
store the non-power-of-2 number of data elements.

Note that re-ordering and shuffling are not yet implemented for nodes
requiring padding, to keep the initial implementation simpler.

The initial implementation also only tries to vectorize with padding if
the original number of elements + 1 is a power-of-2, i.e. if only a
single padding element is needed.

The feature is guarded by a new flag, off by default for now.
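
A sketch of the padding scheme described above (illustrative IR, not emitted verbatim by the patch; names are made up): a leaf load group of 3 elements is padded to 4 lanes with poison, the non-leaf fadd operates on the power-of-2 type, and the root store strips the padding lane again:

define void @pad3(ptr %src, ptr %dst) {
  %l = load <3 x float>, ptr %src, align 4
  ; pad the loaded values to the next power of 2 with a poison lane
  %p = shufflevector <3 x float> %l, <3 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ; non-leaf nodes operate on normal power-of-2 vectors
  %a = fadd <4 x float> %p, <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float poison>
  ; strip the padding lane and store the 3 data elements
  %s = shufflevector <4 x float> %a, <4 x float> poison, <3 x i32> <i32 0, i32 1, i32 2>
  store <3 x float> %s, ptr %dst, align 4
  ret void
}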


Patch is 135.40 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/77790.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+233-48)
  • (added) llvm/test/Transforms/SLPVectorizer/AArch64/vec15-base.ll (+155)
  • (added) llvm/test/Transforms/SLPVectorizer/AArch64/vec3-base.ll (+402)
  • (added) llvm/test/Transforms/SLPVectorizer/AArch64/vec3-calls.ll (+64)
  • (added) llvm/test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll (+583)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/odd_store.ll (+43-26)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vect_copyable_in_binops.ll (+202-149)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 055fbb00871f89..a281ec3acb3b46 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -179,6 +179,10 @@ static cl::opt<bool>
     ViewSLPTree("view-slp-tree", cl::Hidden,
                 cl::desc("Display the SLP trees with Graphviz"));
 
+static cl::opt<bool> VectorizeWithPadding(
+    "slp-vectorize-with-padding", cl::init(false), cl::Hidden,
+    cl::desc("Try to vectorize non-power-of-2 operations using padding."));
+
 // Limit the number of alias checks. The limit is chosen so that
 // it has no negative effect on the llvm benchmarks.
 static const unsigned AliasedCheckLimit = 10;
@@ -2557,7 +2561,7 @@ class BoUpSLP {
     unsigned getVectorFactor() const {
       if (!ReuseShuffleIndices.empty())
         return ReuseShuffleIndices.size();
-      return Scalars.size();
+      return Scalars.size() + getNumPadding();
     };
 
     /// A vector of scalars.
@@ -2574,6 +2578,7 @@ class BoUpSLP {
     /// intrinsics for store/load)?
     enum EntryState {
       Vectorize,
+      VectorizeWithPadding,
       ScatterVectorize,
       PossibleStridedVectorize,
       NeedToGather
@@ -2611,6 +2616,9 @@ class BoUpSLP {
     Instruction *MainOp = nullptr;
     Instruction *AltOp = nullptr;
 
+    /// The number of padding lanes (containing poison).
+    unsigned NumPadding = 0;
+
   public:
     /// Set this bundle's \p OpIdx'th operand to \p OpVL.
     void setOperand(unsigned OpIdx, ArrayRef<Value *> OpVL) {
@@ -2733,6 +2741,15 @@ class BoUpSLP {
                           SmallVectorImpl<Value *> *OpScalars = nullptr,
                           SmallVectorImpl<Value *> *AltScalars = nullptr) const;
 
+    /// Set the number of padding lanes for this node.
+    void setNumPadding(unsigned Padding) {
+      assert(NumPadding == 0 && "Cannot change padding more than once.");
+      NumPadding = Padding;
+    }
+
+    /// Return the number of padding lanes (containing poison) for this node.
+    unsigned getNumPadding() const { return NumPadding; }
+
 #ifndef NDEBUG
     /// Debug printer.
     LLVM_DUMP_METHOD void dump() const {
@@ -2750,6 +2767,9 @@ class BoUpSLP {
       case Vectorize:
         dbgs() << "Vectorize\n";
         break;
+      case VectorizeWithPadding:
+        dbgs() << "VectorizeWithPadding\n";
+        break;
       case ScatterVectorize:
         dbgs() << "ScatterVectorize\n";
         break;
@@ -2790,6 +2810,8 @@ class BoUpSLP {
       for (const auto &EInfo : UserTreeIndices)
         dbgs() << EInfo << ", ";
       dbgs() << "\n";
+      if (getNumPadding() > 0)
+        dbgs() << "Padding: " << getNumPadding() << "\n";
     }
 #endif
   };
@@ -2891,9 +2913,19 @@ class BoUpSLP {
           ValueToGatherNodes.try_emplace(V).first->getSecond().insert(Last);
     }
 
-    if (UserTreeIdx.UserTE)
+    if (UserTreeIdx.UserTE) {
       Last->UserTreeIndices.push_back(UserTreeIdx);
-
+      if (!isPowerOf2_32(Last->Scalars.size()) &&
+          Last->State != TreeEntry::VectorizeWithPadding) {
+        if (UserTreeIdx.UserTE->State == TreeEntry::VectorizeWithPadding)
+          Last->setNumPadding(1);
+        else {
+          Last->setNumPadding(UserTreeIdx.UserTE->getNumPadding());
+          assert((Last->getNumPadding() == 0 || Last->ReorderIndices.empty()) &&
+                 "Reordering isn't implemented for nodes with padding yet");
+        }
+      }
+    }
     return Last;
   }
 
@@ -2921,7 +2953,8 @@ class BoUpSLP {
   /// and fills required data before actual scheduling of the instructions.
   TreeEntry::EntryState getScalarsVectorizationState(
       InstructionsState &S, ArrayRef<Value *> VL, bool IsScatterVectorizeUserTE,
-      OrdersType &CurrentOrder, SmallVectorImpl<Value *> &PointerOps) const;
+      OrdersType &CurrentOrder, SmallVectorImpl<Value *> &PointerOps,
+      bool HasPadding) const;
 
   /// Maps a specific scalar to its tree entry.
   SmallDenseMap<Value *, TreeEntry *> ScalarToTreeEntry;
@@ -3822,6 +3855,7 @@ namespace {
 enum class LoadsState {
   Gather,
   Vectorize,
+  VectorizeWithPadding,
   ScatterVectorize,
   PossibleStridedVectorize
 };
@@ -3898,8 +3932,10 @@ static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
       std::optional<int> Diff =
           getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
       // Check that the sorted loads are consecutive.
+      bool NeedsPadding = !isPowerOf2_32(VL.size());
       if (static_cast<unsigned>(*Diff) == VL.size() - 1)
-        return LoadsState::Vectorize;
+        return NeedsPadding ? LoadsState::VectorizeWithPadding
+                            : LoadsState::Vectorize;
       // Simple check if not a strided access - clear order.
       IsPossibleStrided = *Diff % (VL.size() - 1) == 0;
     }
@@ -4534,7 +4570,8 @@ void BoUpSLP::reorderTopToBottom() {
         continue;
       }
       if ((TE->State == TreeEntry::Vectorize ||
-           TE->State == TreeEntry::PossibleStridedVectorize) &&
+           TE->State == TreeEntry::PossibleStridedVectorize ||
+           TE->State == TreeEntry::VectorizeWithPadding) &&
           isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
               InsertElementInst>(TE->getMainOp()) &&
           !TE->isAltShuffle()) {
@@ -4568,6 +4605,10 @@ bool BoUpSLP::canReorderOperands(
     TreeEntry *UserTE, SmallVectorImpl<std::pair<unsigned, TreeEntry *>> &Edges,
     ArrayRef<TreeEntry *> ReorderableGathers,
     SmallVectorImpl<TreeEntry *> &GatherOps) {
+  // Reordering isn't implemented for nodes with padding yet.
+  if (UserTE->getNumPadding() > 0)
+    return false;
+
   for (unsigned I = 0, E = UserTE->getNumOperands(); I < E; ++I) {
     if (any_of(Edges, [I](const std::pair<unsigned, TreeEntry *> &OpData) {
           return OpData.first == I &&
@@ -4746,6 +4787,10 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
         auto Res = OrdersUses.insert(std::make_pair(OrdersType(), 0));
         const auto &&AllowsReordering = [IgnoreReorder, &GathersToOrders](
                                             const TreeEntry *TE) {
+          // Reordering for nodes with padding not implemented yet.
+          if (TE->getNumPadding() > 0 ||
+              TE->State == TreeEntry::VectorizeWithPadding)
+            return false;
           if (!TE->ReorderIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
               (TE->State == TreeEntry::Vectorize && TE->isAltShuffle()) ||
               (IgnoreReorder && TE->Idx == 0))
@@ -5233,7 +5278,8 @@ static bool isAlternateInstruction(const Instruction *I,
 
 BoUpSLP::TreeEntry::EntryState BoUpSLP::getScalarsVectorizationState(
     InstructionsState &S, ArrayRef<Value *> VL, bool IsScatterVectorizeUserTE,
-    OrdersType &CurrentOrder, SmallVectorImpl<Value *> &PointerOps) const {
+    OrdersType &CurrentOrder, SmallVectorImpl<Value *> &PointerOps,
+    bool HasPadding) const {
   assert(S.MainOp && "Expected instructions with same/alternate opcodes only.");
 
   unsigned ShuffleOrOp =
@@ -5256,7 +5302,7 @@ BoUpSLP::TreeEntry::EntryState BoUpSLP::getScalarsVectorizationState(
   }
   case Instruction::ExtractValue:
   case Instruction::ExtractElement: {
-    bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);
+    bool Reuse = !HasPadding && canReuseExtract(VL, VL0, CurrentOrder);
     if (Reuse || !CurrentOrder.empty())
       return TreeEntry::Vectorize;
     LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");
@@ -5294,6 +5340,8 @@ BoUpSLP::TreeEntry::EntryState BoUpSLP::getScalarsVectorizationState(
                               PointerOps)) {
     case LoadsState::Vectorize:
       return TreeEntry::Vectorize;
+    case LoadsState::VectorizeWithPadding:
+      return TreeEntry::VectorizeWithPadding;
     case LoadsState::ScatterVectorize:
       return TreeEntry::ScatterVectorize;
     case LoadsState::PossibleStridedVectorize:
@@ -5353,6 +5401,15 @@ BoUpSLP::TreeEntry::EntryState BoUpSLP::getScalarsVectorizationState(
     }
     return TreeEntry::Vectorize;
   }
+  case Instruction::UDiv:
+  case Instruction::SDiv:
+  case Instruction::URem:
+  case Instruction::SRem:
+    // The instruction may trigger immediate UB on the poison/undef padding
+    // elements, so force gather to avoid introducing new UB.
+    if (HasPadding)
+      return TreeEntry::NeedToGather;
+    [[fallthrough]];
   case Instruction::Select:
   case Instruction::FNeg:
   case Instruction::Add:
@@ -5361,11 +5418,7 @@ BoUpSLP::TreeEntry::EntryState BoUpSLP::getScalarsVectorizationState(
   case Instruction::FSub:
   case Instruction::Mul:
   case Instruction::FMul:
-  case Instruction::UDiv:
-  case Instruction::SDiv:
   case Instruction::FDiv:
-  case Instruction::URem:
-  case Instruction::SRem:
   case Instruction::FRem:
   case Instruction::Shl:
   case Instruction::LShr:
@@ -5548,6 +5601,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
                                  bool DoNotFail = false) {
     // Check that every instruction appears once in this bundle.
     DenseMap<Value *, unsigned> UniquePositions(VL.size());
+    auto OriginalVL = VL;
     for (Value *V : VL) {
       if (isConstant(V)) {
         ReuseShuffleIndicies.emplace_back(
@@ -5560,6 +5614,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
       if (Res.second)
         UniqueValues.emplace_back(V);
     }
+
     size_t NumUniqueScalarValues = UniqueValues.size();
     if (NumUniqueScalarValues == VL.size()) {
       ReuseShuffleIndicies.clear();
@@ -5587,6 +5642,15 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
             NonUniqueValueVL.append(PWSz - UniqueValues.size(),
                                     UniqueValues.back());
             VL = NonUniqueValueVL;
+
+            if (UserTreeIdx.UserTE &&
+                UserTreeIdx.UserTE->getNumPadding() != 0) {
+              LLVM_DEBUG(dbgs() << "SLP: Reshuffling scalars not yet supported "
+                                   "for nodes with padding.\n");
+              newTreeEntry(OriginalVL, std::nullopt /*not vectorized*/, S,
+                           UserTreeIdx);
+              return false;
+            }
           }
           return true;
         }
@@ -5595,6 +5659,13 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
         return false;
       }
       VL = UniqueValues;
+      if (UserTreeIdx.UserTE && UserTreeIdx.UserTE->getNumPadding() != 0) {
+        LLVM_DEBUG(dbgs() << "SLP: Reshuffling scalars not yet supported for "
+                             "nodes with padding.\n");
+        newTreeEntry(OriginalVL, std::nullopt /*not vectorized*/, S,
+                     UserTreeIdx);
+        return false;
+      }
     }
     return true;
   };
@@ -5859,7 +5930,8 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
   OrdersType CurrentOrder;
   SmallVector<Value *> PointerOps;
   TreeEntry::EntryState State = getScalarsVectorizationState(
-      S, VL, IsScatterVectorizeUserTE, CurrentOrder, PointerOps);
+      S, VL, IsScatterVectorizeUserTE, CurrentOrder, PointerOps,
+      UserTreeIdx.UserTE && UserTreeIdx.UserTE->getNumPadding() > 0);
   if (State == TreeEntry::NeedToGather) {
     newTreeEntry(VL, std::nullopt /*not vectorized*/, S, UserTreeIdx,
                  ReuseShuffleIndicies);
@@ -6001,16 +6073,25 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
       fixupOrderingIndices(CurrentOrder);
       switch (State) {
       case TreeEntry::Vectorize:
+      case TreeEntry::VectorizeWithPadding:
         if (CurrentOrder.empty()) {
           // Original loads are consecutive and does not require reordering.
-          TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
+          TE = newTreeEntry(VL, State, Bundle, S, UserTreeIdx,
                             ReuseShuffleIndicies);
-          LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");
+          LLVM_DEBUG(dbgs() << "SLP: added a vector of loads"
+                            << (State == TreeEntry::VectorizeWithPadding
+                                    ? " with padding"
+                                    : "")
+                            << ".\n");
         } else {
           // Need to reorder.
-          TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
+          TE = newTreeEntry(VL, State, Bundle, S, UserTreeIdx,
                             ReuseShuffleIndicies, CurrentOrder);
-          LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
+          LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads"
+                            << (State == TreeEntry::VectorizeWithPadding
+                                    ? " with padding"
+                                    : "")
+                            << ".\n");
         }
         TE->setOperandsInOrder();
         break;
@@ -6211,21 +6292,32 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
         *OIter = SI->getValueOperand();
         ++OIter;
       }
+      TreeEntry::EntryState State = isPowerOf2_32(VL.size())
+                                        ? TreeEntry::Vectorize
+                                        : TreeEntry::VectorizeWithPadding;
       // Check that the sorted pointer operands are consecutive.
       if (CurrentOrder.empty()) {
         // Original stores are consecutive and does not require reordering.
-        TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
+        TreeEntry *TE = newTreeEntry(VL, State, Bundle, S, UserTreeIdx,
                                      ReuseShuffleIndicies);
         TE->setOperandsInOrder();
         buildTree_rec(Operands, Depth + 1, {TE, 0});
-        LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");
+        LLVM_DEBUG(dbgs() << "SLP: added a vector of stores"
+                          << (State == TreeEntry::VectorizeWithPadding
+                                  ? " with padding"
+                                  : "")
+                          << ".\n");
       } else {
         fixupOrderingIndices(CurrentOrder);
-        TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
+        TreeEntry *TE = newTreeEntry(VL, State, Bundle, S, UserTreeIdx,
                                      ReuseShuffleIndicies, CurrentOrder);
         TE->setOperandsInOrder();
         buildTree_rec(Operands, Depth + 1, {TE, 0});
-        LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");
+        LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores"
+                          << (State == TreeEntry::VectorizeWithPadding
+                                  ? " with padding"
+                                  : "")
+                          << ".\n");
       }
       return;
     }
@@ -6955,7 +7047,8 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
     return Constant::getAllOnesValue(Ty);
   }
 
-  InstructionCost getBuildVectorCost(ArrayRef<Value *> VL, Value *Root) {
+  InstructionCost getBuildVectorCost(ArrayRef<Value *> VL, Value *Root,
+                                     bool WithPadding = false) {
     if ((!Root && allConstant(VL)) || all_of(VL, UndefValue::classof))
       return TTI::TCC_Free;
     auto *VecTy = FixedVectorType::get(VL.front()->getType(), VL.size());
@@ -6966,7 +7059,7 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
     InstructionsState S = getSameOpcode(VL, *R.TLI);
     const unsigned Sz = R.DL->getTypeSizeInBits(VL.front()->getType());
     unsigned MinVF = R.getMinVF(2 * Sz);
-    if (VL.size() > 2 &&
+    if (!WithPadding && VL.size() > 2 &&
         ((S.getOpcode() == Instruction::Load && !S.isAltShuffle()) ||
          (InVectors.empty() &&
           any_of(seq<unsigned>(0, VL.size() / MinVF),
@@ -7002,6 +7095,7 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
                                   *R.LI, *R.TLI, CurrentOrder, PointerOps);
             switch (LS) {
             case LoadsState::Vectorize:
+            case LoadsState::VectorizeWithPadding:
             case LoadsState::ScatterVectorize:
             case LoadsState::PossibleStridedVectorize:
               // Mark the vectorized loads so that we don't vectorize them
@@ -7077,7 +7171,7 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
         }
         GatherCost -= ScalarsCost;
       }
-    } else if (!Root && isSplat(VL)) {
+    } else if (!WithPadding && !Root && isSplat(VL)) {
       // Found the broadcasting of the single scalar, calculate the cost as
       // the broadcast.
       const auto *It =
@@ -7638,8 +7732,8 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
         CommonMask[Idx] = Mask[Idx] + VF;
   }
   Value *gather(ArrayRef<Value *> VL, unsigned MaskVF = 0,
-                Value *Root = nullptr) {
-    Cost += getBuildVectorCost(VL, Root);
+                Value *Root = nullptr, bool WithPadding = false) {
+    Cost += getBuildVectorCost(VL, Root, WithPadding);
     if (!Root) {
       // FIXME: Need to find a way to avoid use of getNullValue here.
       SmallVector<Constant *> Vals;
@@ -7743,7 +7837,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
   }
   if (!FixedVectorType::isValidElementType(ScalarTy))
     return InstructionCost::getInvalid();
-  auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());
+  auto *VecTy = FixedVectorType::get(ScalarTy, VL.size() + E->getNumPadding());
   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
 
   // If we have computed a smaller type for the expression, update VecTy so
@@ -7751,7 +7845,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
   auto It = MinBWs.find(E);
   if (It != MinBWs.end()) {
     ScalarTy = IntegerType::get(F->getContext(), It->second.first);
-    VecTy = FixedVectorType::get(ScalarTy, VL.size());
+    VecTy = FixedVectorType::get(ScalarTy, VL.size() + E->getNumPadding());
   }
   unsigned EntryVF = E->getVectorFactor();
   auto *FinalVecTy = FixedVectorType::get(ScalarTy, EntryVF);
@@ -7785,6 +7879,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
     CommonCost =
         TTI->getShuffleCost(TTI::SK_PermuteSingleSrc, FinalVecTy, Mask);
   assert((E->State == TreeEntry::Vectorize ||
+          E->State == TreeEntry::VectorizeWithPadding ||
           E->State == TreeEntry::ScatterVectorize ||
           E->State == TreeEntry::PossibleStridedVectorize) &&
          "Unhandled state");
@@ -7890,7 +7985,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
     // loads) or (2) when Ptrs are the arguments of loads or stores being
     // vectorized as plane wide unit-stride load/store since all the
     // loads/stores are known to be from/to adjacent locations.
-    assert(E->State == TreeEntry::Vectorize &&
+    assert((E->State == TreeEntry::Vectorize ||
+            E->State == TreeEntry::VectorizeWithPadding) &&
            "Entry state expected to be Vectorize here.");
     if (isa<LoadInst, StoreInst>(VL0)) {
       // Case 2: estimate costs for pointer related costs when vectorizing to
@@ -8146,7 +8242,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
   case Instruction::BitCast: {
     auto SrcIt = MinBWs.find(getOperandEntry(E, 0));
     Type *SrcScalarTy = VL0->getOperand(0)->getType();
-    auto *SrcVecTy = FixedVectorType::get(SrcScalarTy, VL.size());
+    auto *SrcVecTy =
+        FixedVectorType::get(SrcScalarTy, VL.size() + E->getNumPadding());
     unsigned Opcode = ShuffleOrOp;
     unsigned VecOpcode = Opcode;
     if (!ScalarTy->isFloatingPointTy() && !SrcScalarTy->isFloatingPointTy() &&
@@ -8156,7 +8253,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
       if (SrcIt != MinBWs.end()) {
         SrcBWSz = SrcIt->second.first;
         SrcScalarTy = IntegerType::get(F->getContext(), SrcBWSz);
-        SrcVecTy = FixedVectorType::get(SrcScalarTy, VL.size());
+        SrcVecTy =
+            FixedVectorType::get(SrcScalarTy, VL.size() + E->getNumPadding());
       }
       unsigned BWSz = DL->getTypeSizeInBits(ScalarTy);
       if (BWSz == SrcBWSz) {
@@ -8299,10 +8397,19 @@ BoUpSLP::getEntryCost(con...
[truncated]
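
One hunk above forces UDiv/SDiv/URem/SRem nodes to gather when padding is involved. A sketch of why (illustrative IR, not from the patch): a poison padding lane in a divisor is immediate UB for integer division, unlike e.g. fadd, where a poison lane merely yields a poison lane in the result:

define <4 x i32> @div_poison_lane(<4 x i32> %a) {
  ; lane 3 of the divisor stands in for a padding lane; udiv by a
  ; poison divisor is immediate UB, so padded nodes must not be
  ; widened into integer divisions
  %q = udiv <4 x i32> %a, <i32 3, i32 5, i32 7, i32 poison>
  ret <4 x i32> %q
}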

@alexey-bataev
Member

Thanks for the patch, but this is too early to land. I'm working on non-power-of-2 vectorization (it is WIP and still requires a couple more patches). Implemented without those extra patches, it leads to perf regressions in some cases. We need to handle all this stuff correctly.
It would be good if you could land the new tests separately; that may help with evaluating the perf gains/regressions later.

@fhahn
Contributor Author

fhahn commented Jan 11, 2024

Thanks for the patch, but this is too early to land. I'm working on non-power-of-2 vectorization (it is WIP and still requires a couple more patches). Implemented without those extra patches, it leads to perf regressions in some cases. We need to handle all this stuff correctly.

Do you have any more info about where those perf regressions are showing up (like architecture, benchmark, code pattern)? Curious if they are mostly related to reordering/shuffling/improved gathering? This patch only supports full gather, which should hopefully reflect the cost quite accurately (modulo cost-model gaps).

It would be good if you could land the new tests separately; that may help with evaluating the perf gains/regressions later.

Done! Some of the tests are for profitability, but many of them also cover crashes.

@alexey-bataev
Member

Thanks for the patch, but this is too early to land. I'm working on non-power-of-2 vectorization (it is WIP and still requires a couple more patches). Implemented without those extra patches, it leads to perf regressions in some cases. We need to handle all this stuff correctly.

Do you have any more info about where those perf regressions are showing up (like architecture, benchmark, code pattern)? Curious if they are mostly related to reordering/shuffling/improved gathering? This patch only supports full gather, which should hopefully reflect the cost quite accurately (modulo cost-model gaps).

I investigated SPEC plus some other benchmarks; mostly the regressions are because of too-early SLP vectorization in LTO scenarios. To fix this, we need to implement a couple more patches (vectorization of gathered loads + reordering) and add some limitations.
The reordering is also a problem, but not the biggest one.
This new node will be a burden to support; later, the generic vectorization will be extended to support non-power-of-2 nodes directly.

It would be good if you could land the new tests separately; that may help with evaluating the perf gains/regressions later.

Done! Some of the tests are for profitability, but many of them also cover crashes.

@fhahn
Contributor Author

fhahn commented Jan 12, 2024

This new node will be a burden to support; later, the generic vectorization will be extended to support non-power-of-2 nodes directly.

Agreed, it turns out the new node is not really needed, and non-power-of-2 vector ops are handled quite well directly by backends (checked AArch64). I adjusted the patch to drop the special TreeEntry and the tracking of NumPadding, and to generate non-power-of-2 vector ops directly for all nodes in the tree. WDYT as a starting point? 0ddbdd3

@alexey-bataev
Member

As I said before, landing this alone causes perf regressions. I'm working on this and need to land a couple of patches to avoid it. I've been working on this feature for 3 years already (or more); most of my previous patches are related to non-power-of-2 support.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Jan 15, 2024
Improve cost computation for odd vector mem ops by breaking them down
into smaller power-of-2 parts and summing up the cost of those parts.

This fixes the current cost estimates, which for the most part
underestimated the cost due to using getTypeLegalizationCost, which
in most cases widens to the next power-of-2 in a single step. This
doesn't reflect the actual cost.

See https://llvm.godbolt.org/z/vMsnxMf1v for codegen for the tests.

Note that there is a special case for v3i8, for which current codegen is
pretty bad, due to automatic widening to v4i8, which in turn requires
the conversion to go through memory ops on the stack. I am planning on
fixing that as a follow-up, but I am not yet sure where best to fix
this.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: llvm#77790
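
As a worked example of the decomposition (hypothetical costs, assuming each legal power-of-2 part costs 1): a <3 x i8> store is now costed as

  Cost(<3 x i8> store) = Cost(<2 x i8> store) + Cost(<1 x i8> store) = 1 + 1 = 2

instead of the previous single-step legalization estimate Cost(<4 x i8> store) = 1, which was too low for the actual lowering.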
@fhahn
Contributor Author

fhahn commented Jan 16, 2024

As I said before, landing this alone causes perf regressions. I'm working on this and need to land a couple of patches to avoid it. I've been working on this feature for 3 years already (or more); most of my previous patches are related to non-power-of-2 support.

I did perf evaluation for this change on SPEC2017 and some large internal benchmarks on AArch64, and the only regressions I was able to spot were due to AArch64 cost-model issues, where the cost of non-power-of-2 vector ops wasn't estimated accurately (one instance fixed by #78181).

With the initial level of support, do you think there are fundamental issues in SLPVectorizer that aren't related to target-specific cost models not providing accurate costs? To iterate on the latter, having limited support early would be helpful, as people can fix their cost models early and iteratively as more features are enabled for non-power-of-2 vectorization.

@alexey-bataev
Member

As I said before, landing this alone causes perf regressions. I'm working on this and need to land a couple of patches to avoid it. I've been working on this feature for 3 years already (or more); most of my previous patches are related to non-power-of-2 support.

I did perf evaluation for this change on SPEC2017 and some large internal benchmarks on AArch64, and the only regressions I was able to spot were due to AArch64 cost-model issues, where the cost of non-power-of-2 vector ops wasn't estimated accurately (one instance fixed by #78181).

With the initial level of support, do you think there are fundamental issues in SLPVectorizer that aren't related to target-specific cost models not providing accurate costs? To iterate on the latter, having limited support early would be helpful, as people can fix their cost models early and iteratively as more features are enabled for non-power-of-2 vectorization.

It contradicts the work I have been doing for several years. I have a plan for implementing this feature and am moving forward step by step. Most of this newly added stuff, like padding etc., is not needed if everything is implemented correctly. This is not a complete solution, but just a small hack to enable the feature, which may cause issues in the future and adds an extra burden of maintaining and removing it later.

@fhahn
Contributor Author

fhahn commented Jan 16, 2024

Most of this newly added stuff, like padding etc., is not needed if everything is implemented correctly. This is not a complete solution, but just a small hack to enable the feature, which may cause issues in the future and adds an extra burden of maintaining and removing it later.

Agreed that the padding isn't needed; did you see 0ddbdd3052e694eedef6bccce3b17f92dff7add7, mentioned in one of my previous comments, which removes all the padding stuff?

There are still a few things that can be cleaned up there, but with that removed, the changes mostly boil down to disabling shuffling/reordering, as most of that code isn't able to handle non-power-of-2 vectors yet. As support is added for those, the restrictions could be loosened gradually.

fhahn added a commit that referenced this pull request Jan 17, 2024
Improve cost computation for odd vector mem ops by breaking them down
into smaller power-of-2 parts and summing up the cost of those parts.

This fixes the current cost estimates, which for the most part
underestimated the cost due to using getTypeLegalizationCost, which
in most cases widens to the next power-of-2 in a single step. This
doesn't reflect the actual cost.

See https://llvm.godbolt.org/z/vMsnxMf1v for codegen for the tests.

Note that there is a special case for v3i8, for which current codegen is
pretty bad, due to automatic widening to v4i8, which in turn requires
the conversion to go through memory ops on the stack. I am planning on
fixing that as a follow-up, but I am not yet sure where best to fix
this.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: #77790

PR: #78181
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jan 18, 2024
Improve cost computation for odd vector mem ops by breaking them down
into smaller power-of-2 parts and summing up the cost of those parts.

This fixes the current cost estimates, which for the most part
underestimated the cost due to using getTypeLegalizationCost, which
in most cases widens to the next power-of-2 in a single step. This
doesn't reflect the actual cost.

See https://llvm.godbolt.org/z/vMsnxMf1v for codegen for the tests.

Note that there is a special case for v3i8, for which current codegen is
pretty bad, due to automatic widening to v4i8, which in turn requires
the conversion to go through memory ops on the stack. I am planning on
fixing that as a follow-up, but I am not yet sure where best to fix
this.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: llvm#77790

PR: llvm#78181

(cherry-picked from e473daa)
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jan 18, 2024
Add custom combine to lower load <3 x i8> as the more efficient sequence
below:
   ldrb wX, [x0, #2]
   ldrh wY, [x0]
   orr wX, wY, wX, lsl #16
   fmov s0, wX

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: llvm#77790
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jan 18, 2024
Improve codegen for (trunc X to <3 x i8>) by converting it to a sequence
of 3 ST1.b, but first converting the truncate operand to either v8i8 or
v16i8, extracting the lanes for the truncate results and storing them.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: llvm#77790
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
Improve cost computation for odd vector mem ops by breaking them down
into smaller power-of-2 parts and summing up the cost of those parts.

This fixes the current cost estimates, which for the most part
underestimated the cost due to using getTypeLegalizationCost, which
in most cases widens to the next power-of-2 in a single step. This
doesn't reflect the actual cost.

See https://llvm.godbolt.org/z/vMsnxMf1v for codegen for the tests.

Note that there is a special case for v3i8, for which current codegen is
pretty bad, due to automatic widening to v4i8, which in turn requires
the conversion to go through memory ops on the stack. I am planning on
fixing that as a follow-up, but I am not yet sure where best to fix
this.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: llvm#77790

PR: llvm#78181
@fhahn fhahn changed the title from "[SLP] Vectorize non-power-of-2 ops with padding." to "[SLP] Vectorize non-power-of-2 ops." on Jan 19, 2024
Contributor Author

@fhahn fhahn left a comment

Updated the PR with the version without padding, including addressing @alexey-bataev 's comments at 0ddbdd3052e694eedef6bccce3b17f92dff7add7

; NON-POW2-NEXT: store float [[TMP0]], ptr [[DST]], align 4
; NON-POW2-NEXT: [[TMP1:%.*]] = load <3 x float>, ptr [[INCDEC_PTR]], align 4
; NON-POW2-NEXT: [[TMP2:%.*]] = fadd <3 x float> [[TMP1]], <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
; NON-POW2-NEXT: store <3 x float> [[TMP2]], ptr [[INCDEC_PTR1]], align 4
Contributor Author

(Moved comment here 0ddbdd3#r137302312)

Is this supported on all platforms? Or do you rely on the check that this will end up with the scalarized version in the end?

I checked on AArch64 and X86 and both support OK lowering of such operations, although in a number of cases improvements can be made, e.g. #78632 and #78637 for loads and stores of <3 x i8> on AArch64.

This effectively relies on the cost-models to return accurate costs for non-power-of-2 vector operations. If a vector op of a non-power-of-2 type will get scalarized in the backend, the cost model should return a high cost.

@fhahn
Contributor Author

fhahn commented Mar 25, 2024

Ping :)


✅ With the latest revision this PR passed the Python code formatter.

@fhahn
Contributor Author

fhahn commented Apr 2, 2024

ping

Comment on lines 6138 to 6143
if (UserTreeIdx.UserTE && UserTreeIdx.UserTE->isNonPowOf2Vec()) {
LLVM_DEBUG(dbgs() << "SLP: Reshuffling scalars not yet supported "
"for nodes with padding.\n");
newTreeEntry(VL, std::nullopt /*not vectorized*/, S, UserTreeIdx);
return false;
}
Member

Better to move this check and the one below to line 6399 (or after) in the original source code to avoid duplication.

Contributor Author

I moved the check up to the beginning of the else { in the lambda to avoid the duplication. I think it cannot be moved out of TryToFindDuplicates, as we only need to bail out when reshuffling would be needed.

// FIXME: Vectorizing is not supported yet for non-power-of-2 ops.
if (!isPowerOf2_32(VL.size()))
return TreeEntry::NeedToGather;
if ((Reuse || !CurrentOrder.empty()))
Member

Remove extra parens, restore original check here

Contributor Author

Done, thanks!

@@ -6111,6 +6141,12 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
if (NumUniqueScalarValues == VL.size()) {
ReuseShuffleIndicies.clear();
} else {
if (UserTreeIdx.UserTE && UserTreeIdx.UserTE->isNonPowOf2Vec()) {
Member

Add FIXME here for non-power-of-2 support

Contributor Author

Added, thanks!

Comment on lines 14852 to 14854
if (isPowerOf2_32(CandVF + 1) && CandVF <= MaxVF) {
NonPowerOf2VF = CandVF;
}
Member

Suggested change
if (isPowerOf2_32(CandVF + 1) && CandVF <= MaxVF) {
NonPowerOf2VF = CandVF;
}
if (isPowerOf2_32(CandVF + 1) && CandVF <= MaxVF)
NonPowerOf2VF = CandVF;

Contributor Author

Done, thanks!

@@ -14798,14 +14843,23 @@ bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,
continue;
}

std::optional<unsigned> NonPowerOf2VF;
Member

Suggested change
std::optional<unsigned> NonPowerOf2VF;
unsigned NonPowerOf2VF = 0;

Contributor Author

Done, thanks!

for_each(CandidateVFs, [&](unsigned &VF) {
VF = Size;
Size /= 2;
SmallVector<unsigned> CandidateVFs(Sz + bool(NonPowerOf2VF));
Member

Suggested change
SmallVector<unsigned> CandidateVFs(Sz + bool(NonPowerOf2VF));
SmallVector<unsigned> CandidateVFs(Sz + (NonPowerOf2VF > 0 ? 1 : 0));

Better to avoid adding bool

Contributor Author

Done, thanks!

SmallVector<unsigned> CandidateVFs(Sz + bool(NonPowerOf2VF));
unsigned Size = MinVF;
for_each(reverse(CandidateVFs), [&](unsigned &VF) {
VF = Size > MaxVF ? *NonPowerOf2VF : Size;
Member

Suggested change
VF = Size > MaxVF ? *NonPowerOf2VF : Size;
VF = Size > MaxVF ? NonPowerOf2VF : Size;

Contributor Author

Done, thanks!

Contributor Author

@fhahn fhahn left a comment

Latest comments should be addressed, thanks!

// FIXME: Vectorizing is not supported yet for non-power-of-2 ops.
if (!isPowerOf2_32(VL.size()))
return TreeEntry::NeedToGather;
if ((Reuse || !CurrentOrder.empty()))
Contributor Author

Done, thanks!

@@ -6111,6 +6141,12 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
if (NumUniqueScalarValues == VL.size()) {
ReuseShuffleIndicies.clear();
} else {
if (UserTreeIdx.UserTE && UserTreeIdx.UserTE->isNonPowOf2Vec()) {
Contributor Author

Added, thanks!

@@ -14798,14 +14843,23 @@ bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,
continue;
}

std::optional<unsigned> NonPowerOf2VF;
Contributor Author

Done, thanks!

Comment on lines 14852 to 14854
if (isPowerOf2_32(CandVF + 1) && CandVF <= MaxVF) {
NonPowerOf2VF = CandVF;
}
Contributor Author

Done, thanks!

for_each(CandidateVFs, [&](unsigned &VF) {
VF = Size;
Size /= 2;
SmallVector<unsigned> CandidateVFs(Sz + bool(NonPowerOf2VF));
Contributor Author

Done, thanks!

SmallVector<unsigned> CandidateVFs(Sz + bool(NonPowerOf2VF));
unsigned Size = MinVF;
for_each(reverse(CandidateVFs), [&](unsigned &VF) {
VF = Size > MaxVF ? *NonPowerOf2VF : Size;
Contributor Author

Done, thanks!

@@ -2806,6 +2810,9 @@ class BoUpSLP {
SmallVectorImpl<Value *> *OpScalars = nullptr,
SmallVectorImpl<Value *> *AltScalars = nullptr) const;

/// Return true if this is a non-power-of-2 node.
bool isNonPowOf2Vec() const { return !isPowerOf2_32(Scalars.size()); }
Member

What if ReuseShuffleIndices is not empty? Will it work? Can you add a test?

Contributor Author

All code paths should guard against that AFAICT. I added an assertion to make sure. I couldn't find any test case that triggers this across large code bases (SPEC2006, SPEC2017, llvm-test-suite, clang bootstrap, and large internal benchmarks).

@fhahn
Contributor Author

fhahn commented Apr 11, 2024

ping

Member

@alexey-bataev alexey-bataev left a comment

LG

@fhahn fhahn merged commit 6d66db3 into llvm:main Apr 13, 2024
3 of 4 checks passed
@fhahn fhahn deleted the slp-vec3 branch April 13, 2024 07:14
@dtcxzyw
Member

dtcxzyw commented Apr 13, 2024

@fhahn This PR breaks our CI: dtcxzyw/llvm-opt-benchmark#514

opt: /home/dtcxzyw/WorkSpace/Projects/compilers/llvm-project/llvm/include/llvm/ADT/ArrayRef.h:169: const T& llvm::ArrayRef<T>::front() const [with T = llvm::Value*]: Assertion `!empty()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: bin/opt -passes=slp-vectorizer reduced.ll
 #0 0x0000721336bed7b0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/libLLVMSupport.so.19.0git+0x1ed7b0)
 #1 0x0000721336beabbf llvm::sys::RunSignalHandlers() (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/libLLVMSupport.so.19.0git+0x1eabbf)
 #2 0x0000721336bead15 SignalHandler(int) Signals.cpp:0:0
 #3 0x0000721336642520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007213366969fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007213366969fc __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #6 0x00007213366969fc pthread_kill ./nptl/pthread_kill.c:89:10
 #7 0x0000721336642476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #8 0x00007213366287f3 abort ./stdlib/abort.c:81:7
 #9 0x000072133662871b _nl_load_domain ./intl/loadmsgcat.c:1177:9
#10 0x0000721336639e96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#11 0x000072133178328c llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&)::'lambda'(std::set<std::pair<unsigned int, int>, llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&)::StoreDistCompare, std::allocator<std::pair<unsigned int, int>>> const&)::operator()(std::set<std::pair<unsigned int, int>, llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&)::StoreDistCompare, std::allocator<std::pair<unsigned int, int>>> const&) const SLPVectorizer.cpp:0:0
#12 0x0000721331783b90 llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMVectorize.so.19.0git+0x183b90)
#13 0x0000721331784286 llvm::SLPVectorizerPass::vectorizeStoreChains(llvm::slpvectorizer::BoUpSLP&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMVectorize.so.19.0git+0x184286)
#14 0x000072133178540f llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (.part.0) SLPVectorizer.cpp:0:0
#15 0x0000721331785e1b llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMVectorize.so.19.0git+0x185e1b)
#16 0x000072133206f636 llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMPasses.so.19.0git+0x6f636)
#17 0x000072132f51eec1 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMCore.so.19.0git+0x31eec1)
#18 0x00007213358a45d6 llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMX86CodeGen.so.19.0git+0xa45d6)
#19 0x000072132f51db9b llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMCore.so.19.0git+0x31db9b)
#20 0x00007213358a59c6 llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMX86CodeGen.so.19.0git+0xa59c6)
#21 0x000072132f51b9d1 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/../lib/libLLVMCore.so.19.0git+0x31b9d1)
#22 0x000072133702e5c5 llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::PassPlugin>, llvm::ArrayRef<std::function<void (llvm::PassBuilder&)>>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool, bool, bool) (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/libLLVMOptDriver.so.19.0git+0x245c5)
#23 0x0000721337039216 optMain (/home/dtcxzyw/WorkSpace/Projects/compilers/LLVM/llvm-build/bin/../lib/libLLVMOptDriver.so.19.0git+0x2f216)
#24 0x0000721336629d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#25 0x0000721336629e40 call_init ./csu/../csu/libc-start.c:128:20
#26 0x0000721336629e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#27 0x0000566543d65095 _start (bin/opt+0x1095)
Aborted (core dumped)

Reduced test case:

; opt -passes=slp-vectorizer reduced.ll -S

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @test(i24 %0, ptr %.pre) {
  br label %2

2:                                                ; preds = %2, %1
  %.sroa.28.0.extract.trunc = trunc i24 %0 to i8
  store i8 %.sroa.28.0.extract.trunc, ptr %.pre, align 1
  %3 = getelementptr i8, ptr %.pre, i64 1
  store i8 0, ptr %3, align 1
  br label %2
}

@fhahn
Contributor Author

fhahn commented Apr 13, 2024

@dtcxzyw thanks for the reproducer; the issue was that in some cases on X86, MinVF wasn't a power-of-2. Fixed by using the stored type to compute the VF passed to getStoreMinimumVF.

bazuzi pushed a commit to bazuzi/llvm-project that referenced this pull request Apr 15, 2024
This patch enables vectorization for non-power-of-2 VFs. Initially, only
VFs where adding 1 makes the VF a power-of-2 are supported, i.e. cases
where we can still make relatively effective use of the vectors.

It relies on the existing target cost-models to return accurate costs
for non-power-of-2 vectors. I checked mostly AArch64 and X86, and the
costs seem reasonable for the cases I checked, although I expect both
the cost-models and the lowering will need refinement to make the most
effective use of non-power-of-2 SLP vectorization.

Note that re-ordering and shuffling are not yet implemented for nodes
requiring padding, to keep the initial implementation simpler.

The feature is guarded by a new flag, off by default for now.

PR: llvm#77790
kito-cheng added a commit to kito-cheng/llvm-project that referenced this pull request Jun 28, 2024
SLP supports non-power-of-2 vectors [1], so we should consider supporting this
for RISC-V vector code generation. It is natural to support non-power-of-2 VLS
vectors for the vector extension, as VL does not impose any constraints on this.

In theory, we could support any length, but we want to prevent the
number of MVTs from growing too quickly. Therefore, we only add v3, v5,
v7 and v15.

[1] llvm#77790