[LoopVectorize] Allow Early-Exit Loop Vectorization with EVL #130918

Closed, 3 commits

23 changes: 13 additions & 10 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4038,10 +4038,12 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   }

   // The only loops we can vectorize without a scalar epilogue, are loops with
-  // a bottom-test and a single exiting block. We'd have to handle the fact
-  // that not every instruction executes on the last iteration. This will
-  // require a lane mask which varies through the vector loop body. (TODO)
-  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
+  // a bottom-test and a single exiting block or those with early exits. We'd
+  // have to handle the fact that not every instruction executes on the last
+  // iteration. This will require a lane mask which varies through the vector
+  // loop body. (TODO)
+  if ((TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) &&
+      !Legal->hasUncountableEarlyExit()) {
Contributor

This looks like you are in general enabling tail-folding for early exit loops, which I haven't tested at all for AArch64/SVE targets. It may work, but it would be good to add variants of test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll with tail-folding enabled so that we can verify the correct behaviour. Also, have you run the LLVM test suite with tail-folding and early-exit vectorisation enabled to verify all the tests build and pass? When developing the early exit work I found it was a great test suite for exposing issues. To get better coverage you can modify this code in LoopVectorizationLegality::isVectorizableEarlyExitLoop:

  if (!isDereferenceableReadOnlyLoop(TheLoop, PSE.getSE(), DT, AC,
                                     &Predicates)) {

to always vectorise:

  if (false && !isDereferenceableReadOnlyLoop(TheLoop, PSE.getSE(), DT, AC,
                                     &Predicates)) {

It's technically unsafe, but the LLVM test suite is well behaved and tests shouldn't crash.
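
As a rough illustration of why skipping the dereferenceability check is unsafe in general, the standalone C++ sketch below (not LLVM code; `scalarElementsRead` and `vectorElementsRead` are hypothetical names introduced here) compares how many elements a scalar early-exit loop reads with how many an unpredicated vector loop of width VF would read for the same exit position. It assumes whole-vector loads and no tail-folding.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements a scalar early-exit loop reads when the exit
// condition first fires at index ExitIdx (0-based).
std::size_t scalarElementsRead(std::size_t ExitIdx) { return ExitIdx + 1; }

// Number of elements an unpredicated vector loop with VF lanes reads for
// the same exit. Whole vectors are loaded, so the count is rounded up to
// a multiple of VF.
std::size_t vectorElementsRead(std::size_t ExitIdx, std::size_t VF) {
  assert(VF > 0 && "vector factor must be non-zero");
  const std::size_t VectorIters = ExitIdx / VF + 1;
  return VectorIters * VF;
}
```

With an exit at index 5 and VF = 4, the scalar loop reads 6 elements while the vector loop reads 8. The two extra elements are only safe to load if the memory is known to be dereferenceable, which is what the check normally establishes.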

Contributor Author

I appreciate the tip! I'll apply the modification and run the LLVM test suite locally to check for potential issues and follow up once I have results.

Contributor Author

I’ve added a test at llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit_predication.ll.
From what I observed, the main difference is that vectorization is not considered beneficial when predication is enabled.

I also ran the LLVM test suite with -march=rv64gcv_zvl256b, patched to always vectorize regardless of isDereferenceableReadOnlyLoop. All 2041 tests passed under each of the following flag sets:
-O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize
-O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -force-tail-folding-style=data-with-evl

Contributor

So tail predication only works with some early exit loops it seems. Some of the tests in simple_early_exit_predication.ll fail to vectorise because of this:

LV: checking if tail can be folded by masking.
LV: Cannot fold tail by masking, loop has an outside user for   %retval = phi i64 [ %index, %loop ], [ 66, %loop.inc ]
LV: Can't fold tail by masking: don't vectorize
LV: Vectorization is possible but not beneficial.

That explains why it's not considered beneficial for many loops. This error comes from LoopVectorizationLegality::canFoldTailByMasking. In all likelihood a simple search loop such as std::find will have an outside use of an induction variable, so I'm not sure how much value there is right now in enabling early exit vectorisation with tail-folding. I'm not against enabling it, but I wonder what loops you're specifically interested in here?

Also, some of the tests have an exact trip count of 64, where we know there will not be a tail, so we avoid using predication. It would be good to change the tests same_exit_block_pre_inc_use1, same_exit_block_pre_inc_use4, loop_contains_safe_call, loop_contains_safe_div and loop_contains_load_after_early_exit to have a trip count of 63 instead of 64. It would also be good to have at least one test that doesn't have any outside uses so we can verify it's working correctly.
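
The two points above can be sketched in standalone C++ (not LLVM code; `hasTail` and `findFirstMismatch` are hypothetical names, and the -1 "not found" value is an assumption made for this illustration). The first function shows why a trip count of 64 never exercises tail-folding for power-of-two vector factors up to 64, while 63 does. The second is a search loop in the style of std::find whose induction variable is returned, which is the kind of outside user that canFoldTailByMasking rejects in the log above.

```cpp
#include <cassert>
#include <cstdint>

// True when a loop with TripCount iterations leaves a remainder after
// running whole vector iterations of VF lanes, i.e. when tail-folding
// actually has lanes to mask off.
bool hasTail(std::uint64_t TripCount, std::uint64_t VF) {
  assert(VF > 0 && "vector factor must be non-zero");
  return TripCount % VF != 0;
}

// A search loop with an early exit. Index is the induction variable and
// it is returned, so it has a user outside the loop.
std::int64_t findFirstMismatch(const std::uint8_t *A, const std::uint8_t *B,
                               std::int64_t N) {
  for (std::int64_t Index = 0; Index < N; ++Index)
    if (A[Index] != B[Index])
      return Index; // early exit: outside use of the induction variable
  return -1;        // hypothetical "not found" value
}
```

For example, `hasTail(64, 16)` is false and `hasTail(63, 16)` is true, so only the second trip count forces the predicated path to do real work.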

Contributor

Also, I just tried out your patch and when we use tail-folding in combination with get.active.lane.mask for early exit loops the IR is broken. I modified same_exit_block_pre_inc_use1 to have a trip count of 63 and removed outside uses of the induction variable so that I ended up with vectorised IR like this:

  %wide.masked.load2 = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %9, i32 1, <vscale x 16 x i1> %active.lane.mask, <vscale x 16 x i8> poison)
  %10 = icmp eq <vscale x 16 x i8> %wide.masked.load, %wide.masked.load2
  %11 = select <vscale x 16 x i1> %active.lane.mask, <vscale x 16 x i1> %10, <vscale x 16 x i1> zeroinitializer
  %index.next3 = add i64 %index1, %4
  %12 = xor <vscale x 16 x i1> %11, splat (i1 true)
  %13 = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %12)
  %active.lane.mask.next = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next3, i64 63)
  %14 = xor <vscale x 16 x i1> %active.lane.mask.next, splat (i1 true)
  %15 = extractelement <vscale x 16 x i1> %14, i32 0
  br i1 %15, label %middle.split, label %vector.body, !llvm.loop !0

The branch condition should be an or of %13 and %15. Unfortunately, I think I have to request changes on this PR for now. Sorry about that!
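
The required exit condition can be modelled with a scalar C++ sketch (a simplified model written for this thread, not the vectoriser's output; `firstMismatchTailFolded` is a hypothetical name). Each outer iteration stands for one vector iteration of VF lanes. `AnyEarlyExit` roughly plays the role the comment above gives to %13 and `NoLanesLeft` the role of %15, and the loop is left when either of them is true.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a tail-folded vector loop with one uncountable early
// exit. Returns the index of the first mismatch between A and B, or
// A.size() if the two buffers are equal.
std::size_t firstMismatchTailFolded(const std::vector<std::uint8_t> &A,
                                    const std::vector<std::uint8_t> &B,
                                    std::size_t VF) {
  assert(VF > 0 && A.size() == B.size());
  const std::size_t N = A.size();
  for (std::size_t Index = 0;; Index += VF) {
    bool AnyEarlyExit = false; // some active lane wants to leave early
    std::size_t FirstExitLane = 0;
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const bool Active = Index + Lane < N; // models the active lane mask
      if (Active && !AnyEarlyExit && A[Index + Lane] != B[Index + Lane]) {
        AnyEarlyExit = true;
        FirstExitLane = Lane;
      }
    }
    // Lane 0 of the next iteration's mask is inactive: the counted exit.
    const bool NoLanesLeft = Index + VF >= N;
    // Leave the vector loop when either exit fires.
    if (AnyEarlyExit || NoLanesLeft)
      return AnyEarlyExit ? Index + FirstExitLane : N;
  }
}
```

If the loop branched on `NoLanesLeft` alone, as the IR above does with %15, it would keep iterating past the first mismatch and the early exit would never be taken from inside the vector body.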

Contributor Author

Thanks for the detailed feedback—it's super helpful!
I am inspired by your patch (PR #120603) and would like to extend it by leveraging the vp.load.ff intrinsics (PR #128593) to safely handle out-of-bounds accesses. My goal is to introduce a new WidenFFLoad recipe to enable vectorization of loops like std::find using EVL-based tail-folding.
You're right that canFoldTailByMasking() currently blocks this. Since EVL-based tail-folding doesn't mask the loop body, I'll probably need to relax that check specifically for EVL tail-folding. Also, the current VPLane::getAsRuntimeExpr() implementation, which uses ElementCount to calculate the last lane index, gets in the way too.
Good catch on the tests—I now realize they're currently using multiples of VF for the trip count, meaning the tail-folding paths aren't really being tested properly. I'll fix that up.
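
A rough standalone C++ model of that plan is sketched below. It is not the semantics of vp.load.ff as defined in PR #128593 and not the proposed WidenFFLoad recipe; `loadFaultOnlyFirst` and `findWithEVL` are hypothetical names, and the buffer size stands in for the end of readable memory. The assumption modelled here is the usual fault-only-first behaviour: the first lane must be readable, and the load may return fewer lanes than requested instead of faulting on later ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a fault-only-first load. It reads at most EVL
// elements starting at Start, stops at the end of readable memory
// (Mem.size() here), and returns how many lanes were actually loaded.
std::size_t loadFaultOnlyFirst(const std::vector<std::uint8_t> &Mem,
                               std::size_t Start, std::size_t EVL,
                               std::vector<std::uint8_t> &Out) {
  assert(Start < Mem.size() && "the first lane must be readable");
  const std::size_t Loaded = std::min(EVL, Mem.size() - Start);
  Out.assign(Mem.data() + Start, Mem.data() + Start + Loaded);
  return Loaded;
}

// EVL-style search. TripCount is the loop's upper bound and may exceed
// the readable size as long as Key is found first. Returns the index of
// Key, or TripCount if the loop runs to completion.
std::size_t findWithEVL(const std::vector<std::uint8_t> &Mem,
                        std::uint8_t Key, std::size_t TripCount,
                        std::size_t VF) {
  assert(VF > 0 && "vector factor must be non-zero");
  std::vector<std::uint8_t> Lanes;
  for (std::size_t Index = 0; Index < TripCount;) {
    // Explicit vector length: never more lanes than iterations remain.
    const std::size_t EVL = std::min(VF, TripCount - Index);
    const std::size_t Loaded = loadFaultOnlyFirst(Mem, Index, EVL, Lanes);
    for (std::size_t Lane = 0; Lane < Loaded; ++Lane)
      if (Lanes[Lane] == Key)
        return Index + Lane; // early exit inside the loaded lanes
    Index += Loaded; // advance only by the lanes that were processed
  }
  return TripCount;
}
```

In this model the body works on `Loaded` lanes rather than on a mask, which mirrors the point that EVL-based tail-folding does not mask the loop body. The returned index is still an outside use of the induction variable, so the sketch says nothing about how canFoldTailByMasking() would need to change.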

   // If there was a tail-folding hint/switch, but we can't fold the tail by
   // masking, fallback to a vectorization with a scalar epilogue.
   if (ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate) {
@@ -4087,13 +4089,14 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
     unsigned MaxVFtimesIC =
         UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
     ScalarEvolution *SE = PSE.getSE();
-    // Currently only loops with countable exits are vectorized, but calling
-    // getSymbolicMaxBackedgeTakenCount allows enablement work for loops with
-    // uncountable exits whilst also ensuring the symbolic maximum and known
-    // back-edge taken count remain identical for loops with countable exits.
+
+    // Calling getSymbolicMaxBackedgeTakenCount enables support for loops
+    // with uncountable exits. For countable loops, the symbolic maximum must
+    // remain identical to the known back-edge taken count.
     const SCEV *BackedgeTakenCount = PSE.getSymbolicMaxBackedgeTakenCount();
-    assert(BackedgeTakenCount == PSE.getBackedgeTakenCount() &&
-           "Invalid loop count");
+    assert(Legal->hasUncountableEarlyExit() ||
+           (BackedgeTakenCount == PSE.getBackedgeTakenCount()) &&
+               "Invalid loop count");
     const SCEV *ExitCount = SE->getAddExpr(
         BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));
     const SCEV *Rem = SE->getURemExpr(