
[LoopVectorize] Allow Early-Exit Loop Vectorization with EVL #130918


Closed
wants to merge 3 commits

Conversation

arcbbb (Contributor) commented Mar 12, 2025

This patch enables vectorization of early-exit loops when tail-folding with EVL.

Although these early-exit loops do not require a scalar epilogue, CM_ScalarEpilogueAllowed skips setTailFoldingStyles in the current path, preventing tryAddExplicitVectorLength() from being invoked. To ensure early-exit loops are vectorized with the EVL transform, both -force-tail-folding-style=data-with-evl and -prefer-predicate-over-epilogue=predicate-dont-vectorize must currently be used together.

This patch updates the check on bottom-test loops and ensures the tail-folding style is applied.
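For context, the kind of loop this enables corresponds to source like the following sketch (hand-written here to mirror the loop in the new test, not taken from the patch):

  // A counted loop (trip count 128) with an additional data-dependent,
  // "uncountable" early exit: the loop may leave through the early return
  // on any iteration, depending on the loaded value.
  long first_equals_ten(const int *src) {
    for (long i = 0; i < 128; ++i)
      if (src[i] == 10)
        return 0; // uncountable early exit (e1 in the test below)
    return 1;     // countable latch exit (e2 in the test below)
  }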

llvmbot (Member) commented Mar 12, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Shih-Po Hung (arcbbb)



Full diff: https://github.com/llvm/llvm-project/pull/130918.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+9-6)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-early-exit.ll (+108)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index bab2c6efd4035..1ee4de8d108e8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4038,10 +4038,12 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   }
 
   // The only loops we can vectorize without a scalar epilogue, are loops with
-  // a bottom-test and a single exiting block. We'd have to handle the fact
-  // that not every instruction executes on the last iteration.  This will
-  // require a lane mask which varies through the vector loop body.  (TODO)
-  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
+  // a bottom-test and a single exiting block or those with early exits. We'd
+  // have to handle the fact that not every instruction executes on the last
+  // iteration. This will require a lane mask which varies through the vector
+  // loop body. (TODO)
+  if ((TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) &&
+      !Legal->hasUncountableEarlyExit()) {
     // If there was a tail-folding hint/switch, but we can't fold the tail by
     // masking, fallback to a vectorization with a scalar epilogue.
     if (ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate) {
@@ -4092,8 +4094,9 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
     // uncountable exits whilst also ensuring the symbolic maximum and known
     // back-edge taken count remain identical for loops with countable exits.
     const SCEV *BackedgeTakenCount = PSE.getSymbolicMaxBackedgeTakenCount();
-    assert(BackedgeTakenCount == PSE.getBackedgeTakenCount() &&
-           "Invalid loop count");
+    assert(Legal->hasUncountableEarlyExit() ||
+           (BackedgeTakenCount == PSE.getBackedgeTakenCount()) &&
+               "Invalid loop count");
     const SCEV *ExitCount = SE->getAddExpr(
         BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));
     const SCEV *Rem = SE->getURemExpr(
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-early-exit.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-early-exit.ll
new file mode 100644
index 0000000000000..a53b4019242f9
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-early-exit.ll
@@ -0,0 +1,108 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=loop-vectorize -force-tail-folding-style=data-with-evl -prefer-predicate-over-epilogue=predicate-dont-vectorize \
+; RUN: -mtriple=riscv64 -mattr=+v -S -enable-early-exit-vectorization  %s | FileCheck %s
+
+; REQUIRES: asserts
+
+declare void @init(ptr)
+
+define i64 @multi_exiting_to_different_exits_live_in_exit_values() {
+; CHECK-LABEL: define i64 @multi_exiting_to_different_exits_live_in_exit_values(
+; CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[SRC:%.*]] = alloca [128 x i32], align 4
+; CHECK-NEXT:    call void @init(ptr [[SRC]])
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT:    [[TMP2:%.*]] = sub i64 [[TMP1]], 1
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 128, [[TMP2]]
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[AVL:%.*]] = sub i64 128, [[EVL_BASED_IV]]
+; CHECK-NEXT:    [[TMP5:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
+; CHECK-NEXT:    [[IV:%.*]] = add i64 [[EVL_BASED_IV]], 0
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[EVL_BASED_IV]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[TMP8:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP7]]
+; CHECK-NEXT:    [[VEC_IV:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP8]]
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[VEC_IV]], splat (i64 127)
+; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i32, ptr [[GEP_SRC]], i32 0
+; CHECK-NEXT:    [[VP_OP_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 [[TMP11]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP5]])
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <vscale x 4 x i32> [[VP_OP_LOAD]], splat (i32 10)
+; CHECK-NEXT:    [[TMP13:%.*]] = xor <vscale x 4 x i1> [[TMP12]], splat (i1 true)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i1> [[TMP13]], <vscale x 4 x i1> zeroinitializer
+; CHECK-NEXT:    [[TMP15:%.*]] = zext i32 [[TMP5]] to i64
+; CHECK-NEXT:    [[INDEX_EVL_NEXT]] = add nuw i64 [[TMP15]], [[EVL_BASED_IV]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[TMP16:%.*]] = xor <vscale x 4 x i1> [[TMP14]], splat (i1 true)
+; CHECK-NEXT:    [[TMP17:%.*]] = call i1 @llvm.vector.reduce.or.nxv4i1(<vscale x 4 x i1> [[TMP16]])
+; CHECK-NEXT:    [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP19:%.*]] = or i1 [[TMP17]], [[TMP18]]
+; CHECK-NEXT:    br i1 [[TMP19]], label %[[MIDDLE_SPLIT:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP17]], label %[[VECTOR_EARLY_EXIT:.*]], label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br i1 true, label %[[E2:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[VECTOR_EARLY_EXIT]]:
+; CHECK-NEXT:    br label %[[E1:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[LOOP_HEADER:.*]]
+; CHECK:       [[LOOP_HEADER]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[INC:%.*]], %[[LOOP_LATCH:.*]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[GEP_SRC1:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[GEP_SRC1]], align 4
+; CHECK-NEXT:    [[C_1:%.*]] = icmp eq i32 [[L]], 10
+; CHECK-NEXT:    br i1 [[C_1]], label %[[E1]], label %[[LOOP_LATCH]]
+; CHECK:       [[LOOP_LATCH]]:
+; CHECK-NEXT:    [[INC]] = add nuw i64 [[IV1]], 1
+; CHECK-NEXT:    [[C_2:%.*]] = icmp eq i64 [[INC]], 128
+; CHECK-NEXT:    br i1 [[C_2]], label %[[E2]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[E1]]:
+; CHECK-NEXT:    [[P1:%.*]] = phi i64 [ 0, %[[LOOP_HEADER]] ], [ 0, %[[VECTOR_EARLY_EXIT]] ]
+; CHECK-NEXT:    ret i64 [[P1]]
+; CHECK:       [[E2]]:
+; CHECK-NEXT:    [[P2:%.*]] = phi i64 [ 1, %[[LOOP_LATCH]] ], [ 1, %[[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    ret i64 [[P2]]
+;
+entry:
+  %src = alloca [128 x i32]
+  call void @init(ptr %src)
+  br label %loop.header
+
+loop.header:
+  %iv = phi i64 [ %inc, %loop.latch ], [ 0, %entry ]
+  %gep.src = getelementptr inbounds i32, ptr %src, i64 %iv
+  %l = load i32, ptr %gep.src
+  %c.1 = icmp eq i32 %l, 10
+  br i1 %c.1, label %e1, label %loop.latch
+
+loop.latch:
+  %inc = add nuw i64 %iv, 1
+  %c.2 = icmp eq i64 %inc, 128
+  br i1 %c.2, label %e2, label %loop.header
+
+e1:
+  %p1 = phi i64 [ 0, %loop.header ]
+  ret i64 %p1
+
+e2:
+  %p2 = phi i64 [ 1, %loop.latch ]
+  ret i64 %p2
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.

    // iteration. This will require a lane mask which varies through the vector
    // loop body. (TODO)
    if ((TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) &&
        !Legal->hasUncountableEarlyExit()) {
Contributor

This looks like you are in general enabling tail-folding for early exit loops, which I haven't tested at all for AArch64/SVE targets. It may work, but it would be good to add variants of test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll with tail-folding enabled so that we can verify the correct behaviour. Also, have you run the LLVM test suite with tail-folding and early-exit vectorisation enabled to verify all the tests build and pass? When developing the early exit work I found it was a great test suite for exposing issues. To get better coverage you can modify this code in LoopVectorizationLegality::isVectorizableEarlyExitLoop:

  if (!isDereferenceableReadOnlyLoop(TheLoop, PSE.getSE(), DT, AC,
                                     &Predicates)) {

to always vectorise:

  if (false && !isDereferenceableReadOnlyLoop(TheLoop, PSE.getSE(), DT, AC,
                                              &Predicates)) {

It's technically unsafe, but the LLVM test suite is well behaved and tests shouldn't crash.

Contributor Author

I appreciate the tip! I'll apply the modification and run the LLVM test suite locally to check for potential issues and follow up once I have results.

Contributor Author

I’ve added a test at llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit_predication.ll.
From what I observed, the main difference is that vectorization is not considered beneficial when predication is enabled.

I also ran the LLVM test suite with -march=rv64gcv_zvl256b, patched to always vectorize regardless of isDereferenceableReadOnlyLoop. All 2041 tests passed under both of the following flag sets:

  -O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize
  -O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -force-tail-folding-style=data-with-evl

Contributor

So it seems tail predication only works with some early-exit loops. Some of the tests in simple_early_exit_predication.ll fail to vectorise because of this:

  LV: checking if tail can be folded by masking.
  LV: Cannot fold tail by masking, loop has an outside user for   %retval = phi i64 [ %index, %loop ], [ 66, %loop.inc ]
  LV: Can't fold tail by masking: don't vectorize
  LV: Vectorization is possible but not beneficial.

That explains why it's not considered beneficial for many loops. This error comes from LoopVectorizationLegality::canFoldTailByMasking. In all likelihood a simple search loop such as std::find will have an outside use of an induction variable, so I'm not sure how much value there is right now in enabling early-exit vectorisation with tail-folding. I'm not against enabling it, but I wonder what loops you're specifically interested in here?
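For reference, a minimal search loop of that shape might look like the IR below (a hand-written sketch: the early exit returns the index at which the match was found, so %iv has a user outside the loop, which is exactly what canFoldTailByMasking rejects):

  define i64 @find_first_42(ptr %p) {
  entry:
    br label %loop

  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop.inc ]
    %gep = getelementptr inbounds i32, ptr %p, i64 %iv
    %l = load i32, ptr %gep, align 4
    %found = icmp eq i32 %l, 42
    br i1 %found, label %exit, label %loop.inc

  loop.inc:
    %iv.next = add nuw i64 %iv, 1
    %done = icmp eq i64 %iv.next, 66
    br i1 %done, label %exit, label %loop

  exit:
    ; the outside user the debug output above complains about
    %retval = phi i64 [ %iv, %loop ], [ 66, %loop.inc ]
    ret i64 %retval
  }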

Also, some of the tests have an exact trip count of 64, where we know there will not be a tail, so we avoid using predication. It would be good to change the tests same_exit_block_pre_inc_use1, same_exit_block_pre_inc_use4, loop_contains_safe_call, loop_contains_safe_div and loop_contains_load_after_early_exit to have a trip count of 63 instead of 64. It would also be good to have at least one test that doesn't have any outside uses so we can verify it's working correctly.

Contributor

Also, I just tried out your patch, and when we use tail-folding in combination with get.active.lane.mask for early-exit loops the IR is broken. I modified same_exit_block_pre_inc_use1 to have a trip count of 63 and removed outside uses of the induction variable, so that I ended up with vectorised IR like this:

  %wide.masked.load2 = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %9, i32 1, <vscale x 16 x i1> %active.lane.mask, <vscale x 16 x i8> poison)
  %10 = icmp eq <vscale x 16 x i8> %wide.masked.load, %wide.masked.load2
  %11 = select <vscale x 16 x i1> %active.lane.mask, <vscale x 16 x i1> %10, <vscale x 16 x i1> zeroinitializer
  %index.next3 = add i64 %index1, %4
  %12 = xor <vscale x 16 x i1> %11, splat (i1 true)
  %13 = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %12)
  %active.lane.mask.next = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next3, i64 63)
  %14 = xor <vscale x 16 x i1> %active.lane.mask.next, splat (i1 true)
  %15 = extractelement <vscale x 16 x i1> %14, i32 0
  br i1 %15, label %middle.split, label %vector.body, !llvm.loop !0

The branch condition should be an or of %13 and %15. Unfortunately, I think I have to request changes on this PR for now. Sorry about that!
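Presumably the fixed latch exit would have to combine the early-exit and lane-mask conditions, something like:

  %16 = or i1 %13, %15
  br i1 %16, label %middle.split, label %vector.body, !llvm.loop !0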

Contributor Author

Thanks for the detailed feedback, it's super helpful!
I am inspired by your patch (PR #120603) and would like to extend it by leveraging the vp.load.ff intrinsic (PR #128593) to safely handle out-of-bounds accesses. My goal is to introduce a new WidenFFLoad recipe to enable vectorization of loops like std::find using EVL-based tail-folding.
You're right that canFoldTailByMasking() currently blocks this. Since EVL-based tail-folding doesn't mask the loop body, I'll probably need to relax that check specifically for EVL tail-folding. The current VPLane::getAsRuntimeExpr() implementation, which uses ElementCount to calculate the last-lane index, gets in the way too.
Good catch on the tests: I now realize they're currently using multiples of VF for the trip count, meaning the tail-folding paths aren't really being tested properly. I'll fix that up.
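For illustration, a fault-only-first load could drive the early-exit check under EVL tail-folding roughly as below (a sketch only: the intrinsic shape follows the PR #128593 proposal, where the second result is the count of lanes actually loaded, and the exact signature may differ):

  ; load up to %evl lanes; a fault past the first lane shrinks the result count
  %ff = call { <vscale x 4 x i32>, i32 } @llvm.vp.load.ff.nxv4i32.p0(ptr align 4 %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %data = extractvalue { <vscale x 4 x i32>, i32 } %ff, 0
  %new.evl = extractvalue { <vscale x 4 x i32>, i32 } %ff, 1
  ; only the first %new.evl lanes are valid, so the exit test and the IV
  ; increment must use %new.evl rather than the requested %evl
  %cmp = icmp eq <vscale x 4 x i32> %data, splat (i32 10)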

david-arm (Contributor) left a comment

As mentioned in my most recent comment, the vector latch branch condition is incorrect when using the tail-folding style that requires calling get.active.lane.mask.


arcbbb (Contributor Author) commented May 6, 2025

Thanks for the review and discussion. There's still some up-front work needed before we can meaningfully move this forward. Closing this for now to reduce noise; we can revisit when the prerequisites are in place.

arcbbb closed this May 6, 2025