Vectorized loops do not get proper handling by LoopStrengthReduction #30061

JonPsson · 2016-10-16T16:07:45Z


Bugzilla Link	30713
Version	trunk
OS	Linux
CC	@hfinkel,@JonPsson,@uweigand

Extended Description

I find that many vectorized loops are not handled properly by the LSR pass.

I first thought this was caused by the address computations being vectorized, but even when I forced the vectorizer to scalarize all address computations (on my experimental branch), the LSR pass did nothing.

The vectorization factor was 2.

What I see is that IVUsers does not recognize the loads as IV users in the vectorized loop. It turns out that the vectorized loop's 'sext' instruction is !isInteresting(), because its SCEV:
(sext i32 {(1 + (2 * %mm.651771)),+,4}<%vector.body3549> to i64)
is not an SCEVAddRecExpr.

In the scalar loop, however, the sext is interesting, with this SCEVAddRecExpr:
{(sext i32 (-1 + (2 * %kk1108.11765.in.ph)) to i64),+,-2}<%for.body1153>
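The difference between the two SCEV shapes can be illustrated outside LLVM. The following Python sketch is only a model, not LLVM code: it compares a sign-extension wrapped around a 32-bit recurrence with a 64-bit recurrence built from sign-extended operands. The two agree only while the 32-bit addition does not wrap, which is one plausible reason (an assumption, not confirmed in this report) why SCEV keeps the sext on the outside when it cannot prove no-signed-wrap. The helper names `sext32`, `sext_of_addrec`, `addrec_of_sext` and `first_mismatch` are hypothetical.

```python
def sext32(x):
    """Model `sext i32 x to i64`: keep the low 32 bits, read them as signed."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sext_of_addrec(start, step, n):
    """(sext i32 {start,+,step} to i64) at iteration n: the add is done in 32 bits."""
    return sext32(start + step * n)


def addrec_of_sext(start, step, n):
    """{sext(start),+,sext(step)} at iteration n: the add is done in the wide type."""
    return sext32(start) + sext32(step) * n


def first_mismatch(start, step, trip_count):
    """Return the first iteration where the two forms differ, or None if they never do."""
    for n in range(trip_count):
        if sext_of_addrec(start, step, n) != addrec_of_sext(start, step, n):
            return n
    return None


if __name__ == "__main__":
    # Small start value: the 32-bit add never wraps, so both forms agree.
    print(first_mismatch(1, 4, 1000))
    # Start close to INT32_MAX: the 32-bit add wraps and the forms diverge.
    print(first_mismatch(2**31 - 6, 4, 10))
```

In this model, pulling the sext into the recurrence is safe only when the narrow addition is known not to overflow for the whole trip count.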

The basic uses of the PHINode follow similar patterns:
phi -> add -> shl -> mul -> sext -> gep -> bitcast -> load
phi -> sub -> add -> add -> shl -> sext -> gep -> bitcast -> load

I would like to ask for input on whether IVUsers / LSR is supposed to work well with vectorized loops, and whether anyone has encountered this before.

/Jonas

My example looks like this:

Loop before vectorize pass:

for.body1153: ; preds = %for.body1153, %for.body1153.preheader
%kk1108.11765.in = phi i32 [ %kk1108.11765, %for.body1153 ], [ %kstart1109.11769, %for.body1153.preheader ]
%mm.661764 = phi i32 [ %inc1171, %for.body1153 ], [ %mm.651771, %for.body1153.preheader ]
%ii1106.11763 = phi i32 [ %inc1169, %for.body1153 ], [ %add1150, %for.body1153.preheader ]
%kk1108.11765 = add nsw i32 %kk1108.11765.in, -1
%mul1154 = shl nsw i32 %kk1108.11765, 1
%idxprom1155 = sext i32 %mul1154 to i64
%arrayidx1156 = getelementptr inbounds double, double* %call113, i64 %idxprom1155
%387 = bitcast double* %arrayidx1156 to i64*
%388 = load i64, i64* %387, align 8, !tbaa !19
%mul1157 = shl nsw i32 %mm.661764, 1
%idxprom1158 = sext i32 %mul1157 to i64
%arrayidx1159 = getelementptr inbounds double, double* %dvec, i64 %idxprom1158
%389 = bitcast double* %arrayidx1159 to i64*
store i64 %388, i64* %389, align 8, !tbaa !19
%add1161 = or i32 %mul1154, 1
%idxprom1162 = sext i32 %add1161 to i64
%arrayidx1163 = getelementptr inbounds double, double* %call113, i64 %idxprom1162
%390 = bitcast double* %arrayidx1163 to i64*
%391 = load i64, i64* %390, align 8, !tbaa !19
%add1165 = or i32 %mul1157, 1
%idxprom1166 = sext i32 %add1165 to i64
%arrayidx1167 = getelementptr inbounds double, double* %dvec, i64 %idxprom1166
%392 = bitcast double* %arrayidx1167 to i64*
store i64 %391, i64* %392, align 8, !tbaa !19
%inc1169 = add nuw nsw i32 %ii1106.11763, 1
%inc1171 = add nsw i32 %mm.661764, 1
%exitcond2440 = icmp eq i32 %inc1169, %3
br i1 %exitcond2440, label %for.end1172.loopexit, label %for.body1153

Loop after vectorize pass (with scalarized address computations to try to help LSR):

vector.body3549: ; preds = %vector.body3549, %vector.ph3582
%index3583 = phi i32 [ 0, %vector.ph3582 ], [ %index.next3584, %vector.body3549 ]
%offset.idx3592 = sub i32 %kstart1109.11769, %index3583
%broadcast.splatinsert3593 = insertelement <2 x i32> undef, i32 %offset.idx3592, i32 0
%broadcast.splat3594 = shufflevector <2 x i32> %broadcast.splatinsert3593, <2 x i32> undef, <2 x i32> zeroinitializer
%induction3595 = add <2 x i32> %broadcast.splat3594, <i32 0, i32 -1>
%418 = add i32 %offset.idx3592, 0
%419 = add i32 %offset.idx3592, -1
%offset.idx3596 = add i32 %mm.651771, %index3583
%broadcast.splatinsert3597 = insertelement <2 x i32> undef, i32 %offset.idx3596, i32 0
%broadcast.splat3598 = shufflevector <2 x i32> %broadcast.splatinsert3597, <2 x i32> undef, <2 x i32> zeroinitializer
%induction3599 = add <2 x i32> %broadcast.splat3598, <i32 0, i32 1>
%420 = add i32 %offset.idx3596, 0
%offset.idx3600 = add i32 %add1150, %index3583
%broadcast.splatinsert3601 = insertelement <2 x i32> undef, i32 %offset.idx3600, i32 0
%broadcast.splat3602 = shufflevector <2 x i32> %broadcast.splatinsert3601, <2 x i32> undef, <2 x i32> zeroinitializer
%induction3603 = add <2 x i32> %broadcast.splat3602, <i32 0, i32 1>
%421 = add i32 %offset.idx3600, 0
%422 = add nsw i32 %418, -1
%423 = add nsw i32 %419, -1
%424 = shl nsw i32 %422, 1
%425 = shl nsw i32 %423, 1
%426 = sext i32 %424 to i64
%427 = sext i32 %425 to i64
%428 = getelementptr inbounds double, double* %call113, i64 %426
%429 = getelementptr inbounds double, double* %call113, i64 %427
%430 = bitcast double* %428 to i64*
%431 = bitcast double* %429 to i64*
%432 = load i64, i64* %430, align 8, !tbaa !19, !alias.scope !21
%433 = load i64, i64* %431, align 8, !tbaa !19, !alias.scope !21
%434 = insertelement <2 x i64> undef, i64 %432, i32 0
%435 = insertelement <2 x i64> %434, i64 %433, i32 1
%436 = shl nsw i32 %420, 1
%437 = sext i32 %436 to i64
%438 = getelementptr inbounds double, double* %dvec, i64 %437
%439 = bitcast double* %438 to i64*
%440 = or i32 %424, 1
%441 = or i32 %425, 1
%442 = sext i32 %440 to i64
%443 = sext i32 %441 to i64
%444 = getelementptr inbounds double, double* %call113, i64 %442
%445 = getelementptr inbounds double, double* %call113, i64 %443
%446 = bitcast double* %444 to i64*
%447 = bitcast double* %445 to i64*
%448 = load i64, i64* %446, align 8, !tbaa !19, !alias.scope !24
%449 = load i64, i64* %447, align 8, !tbaa !19, !alias.scope !24
%450 = insertelement <2 x i64> undef, i64 %448, i32 0
%451 = insertelement <2 x i64> %450, i64 %449, i32 1
%452 = or i32 %436, 1
%453 = sext i32 %452 to i64
%454 = getelementptr inbounds double, double* %dvec, i64 %453
%455 = bitcast double* %454 to i64*
%456 = getelementptr i64, i64* %455, i32 -1
%457 = bitcast i64* %456 to <4 x i64>*
%458 = shufflevector <2 x i64> %435, <2 x i64> %451, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%interleaved.vec3604 = shufflevector <4 x i64> %458, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
store <4 x i64> %interleaved.vec3604, <4 x i64>* %457, align 8, !tbaa !19, !alias.scope !26, !noalias !28
%459 = add nuw nsw i32 %421, 1
%460 = add nsw i32 %420, 1
%461 = icmp eq i32 %459, %3
%index.next3584 = add i32 %index3583, 2
%462 = icmp eq i32 %index.next3584, %n.vec3555
br i1 %462, label %middle.block3550, label %vector.body3549, !llvm.loop !29

Final vectorized MachineLoop :
Final Header:
vector.body3549: ; preds = %vector.body3549, %vector.body3549.preheader
%lsr.iv467 = phi i32 [ %lsr.iv.next468, %vector.body3549 ], [ %lsr.iv465, %vector.body3549.preheader ]
%lsr.iv461 = phi i32 [ %lsr.iv.next462, %vector.body3549 ], [ %749, %vector.body3549.preheader ]
%lsr.iv459 = phi i32 [ %lsr.iv.next460, %vector.body3549 ], [ %727, %vector.body3549.preheader ]
%750 = add i32 %lsr.iv467, -1
%751 = add i32 %lsr.iv467, -3
%752 = sext i32 %750 to i64
%753 = sext i32 %751 to i64
%754 = getelementptr inbounds double, double* %call113, i64 %752
%755 = getelementptr inbounds double, double* %call113, i64 %753
%756 = bitcast double* %754 to i64*
%757 = bitcast double* %755 to i64*
%758 = load i64, i64* %756, align 8, !tbaa !19, !alias.scope !93
%759 = load i64, i64* %757, align 8, !tbaa !19, !alias.scope !93
%760 = insertelement <2 x i64> undef, i64 %758, i32 0
%761 = insertelement <2 x i64> %760, i64 %759, i32 1
%762 = add i32 %lsr.iv467, -2
%763 = sext i32 %lsr.iv467 to i64
%764 = sext i32 %762 to i64
%765 = getelementptr inbounds double, double* %call113, i64 %763
%766 = getelementptr inbounds double, double* %call113, i64 %764
%767 = bitcast double* %765 to i64*
%768 = bitcast double* %766 to i64*
%769 = load i64, i64* %767, align 8, !tbaa !19, !alias.scope !96
%770 = load i64, i64* %768, align 8, !tbaa !19, !alias.scope !96
%771 = insertelement <2 x i64> undef, i64 %769, i32 0
%772 = insertelement <2 x i64> %771, i64 %770, i32 1
%773 = sext i32 %lsr.iv461 to i64
%774 = getelementptr inbounds double, double* %dvec, i64 %773
%775 = getelementptr double, double* %774, i64 -1
%776 = bitcast double* %775 to <4 x i64>*
%interleaved.vec3604 = shufflevector <2 x i64> %761, <2 x i64> %772, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
store <4 x i64> %interleaved.vec3604, <4 x i64>* %776, align 8, !tbaa !19, !alias.scope !98, !noalias !100
%lsr.iv.next460 = add i32 %lsr.iv459, -2
%lsr.iv.next462 = add i32 %lsr.iv461, 4
%lsr.iv.next468 = add i32 %lsr.iv467, -4
%777 = icmp eq i32 %lsr.iv.next460, 0
br i1 %777, label %middle.block3550, label %vector.body3549, !llvm.loop !101

Final scalar MachineLoop :
Final Header:
for.body1153: ; preds = %for.body1153, %for.body1153.preheader4255
%lsr.iv483 = phi i32 [ %lsr.iv.next484, %for.body1153 ], [ %785, %for.body1153.preheader4255 ]
%lsr.iv479 = phi double* [ %scevgep480, %for.body1153 ], [ %scevgep478, %for.body1153.preheader4255 ]
%lsr.iv474 = phi double* [ %scevgep475, %for.body1153 ], [ %scevgep473, %for.body1153.preheader4255 ]
%lsr.iv470 = phi double* [ %scevgep471, %for.body1153 ], [ %scevgep469, %for.body1153.preheader4255 ]
%lsr.iv479481 = bitcast double* %lsr.iv479 to i64*
%lsr.iv474476 = bitcast double* %lsr.iv474 to i64*
%lsr.iv470472 = bitcast double* %lsr.iv470 to i64*
%786 = load i64, i64* %lsr.iv474476, align 8, !tbaa !19
%scevgep482 = getelementptr i64, i64* %lsr.iv479481, i64 -1
store i64 %786, i64* %scevgep482, align 8, !tbaa !19
%787 = load i64, i64* %lsr.iv470472, align 8, !tbaa !19
store i64 %787, i64* %lsr.iv479481, align 8, !tbaa !19
%scevgep471 = getelementptr double, double* %lsr.iv470, i64 -2
%scevgep475 = getelementptr double, double* %lsr.iv474, i64 -2
%scevgep480 = getelementptr double, double* %lsr.iv479, i64 2
%lsr.iv.next484 = add i32 %lsr.iv483, -1
%exitcond2440 = icmp eq i32 %lsr.iv.next484, 0
br i1 %exitcond2440, label %for.end1172.loopexit, label %for.body1153, !llvm.loop !102
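For reference, the effect LSR has on the scalar loop above can be modelled in a few lines of Python. This is a simplified sketch, not LLVM code: it assumes 8-byte doubles and no 32-bit overflow, and it covers only the first load from %call113. It checks that recomputing `base + 8 * sext(2 * (kk - 1))` on every iteration yields the same addresses as a pointer induction variable that is initialized once and then stepped by -2 elements, as the `scevgep` instructions do. The function names are hypothetical.

```python
ELEM = 8  # size of a double in bytes (assumption of this model)


def sext32(x):
    """Model `sext i32 x to i64`: keep the low 32 bits, read them as signed."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def indexed_addresses(base, kk_start, trip_count):
    """Addresses as computed before LSR: add, shl, sext, gep on every iteration."""
    addrs = []
    kk_in = kk_start
    for _ in range(trip_count):
        kk = kk_in - 1                    # add nsw i32 %kk.in, -1
        idx = sext32(kk << 1)             # shl nsw by 1, then sext to i64
        addrs.append(base + ELEM * idx)   # getelementptr double, base, idx
        kk_in = kk
    return addrs


def pointer_iv_addresses(base, kk_start, trip_count):
    """Addresses after strength reduction: one pointer IV stepped by -2 elements."""
    ptr = base + ELEM * sext32((kk_start - 1) << 1)  # computed once, outside the loop
    addrs = []
    for _ in range(trip_count):
        addrs.append(ptr)
        ptr -= 2 * ELEM                   # getelementptr double, ptr, i64 -2
    return addrs


if __name__ == "__main__":
    before = indexed_addresses(0x1000, 100, 20)
    after = pointer_iv_addresses(0x1000, 100, 20)
    print(before == after)
```

The step of -2 doubles corresponds to the -16 byte adjustments applied by the `LAY` instructions in the final scalar machine loop.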

Final vectorized MachineLoop :
BB#261: derived from LLVM BB %vector.body3549
Live Ins: %R6D %R9D %R0H %R0L %R1L %R3L %R4L %R5L %R7L %R8L %R10L %R11L %R12L %R13L
Predecessors according to CFG: BB#260 BB#261
%R2D = LGFR %R1L
%R2D = SLLG %R2D, %noreg, 3
%R2D = LG %R6D, 0, %R2D; mem:LD8%767(alias.scope=#97)
%R14L = AHIK %R1L, -2, %CC<imp-def,dead>
%R14D = LGFR %R14L
%R14D = SLLG %R14D, %noreg, 3
%R14D = LG %R6D, 0, %R14D; mem:LD8%768(alias.scope=#97)
%V0 = VLVGP %R2D, %R14D
%R2L = AHIK %R1L, -3, %CC<imp-def,dead>
%R2D = LGFR %R2L
%R2D = SLLG %R2D, %noreg, 3
%R2D = LG %R6D, 0, %R2D; mem:LD8%757(alias.scope=#94)
%R14L = AHIK %R1L, -1, %CC<imp-def,dead>
%R14D = LGFR %R14L
%R14D = SLLG %R14D, %noreg, 3
%R14D = LG %R6D, 0, %R14D; mem:LD8%756(alias.scope=#94)
%V1 = VLVGP %R14D, %R2D
%V2 = VMRLG %V1, %V0
%R2D = LGFR %R11L
%R2D = SLLG %R2D, %noreg, 3
VST %V2, %R9D, 8, %R2D; mem:ST16%776+16(tbaa=#20)(alias.scope=#99)(noalias=#97,#94)
%V0 = VMRHG %V1, %V0
%R2D = LAY %R9D, -8, %R2D
VST %V0, %R2D, 0, %noreg; mem:ST16%776(tbaa=#20)(alias.scope=#99)(noalias=#97,#94)
%R1L<def,tied1> = AHI %R1L<kill,tied0>, -4, %CC<imp-def,dead>
%R11L<def,tied1> = AHI %R11L<kill,tied0>, 4, %CC<imp-def,dead>
%R5L<def,tied1> = AHI %R5L<kill,tied0>, -2, %CC
BRC 15, 7, <BB#261>, %CC<imp-use,kill>
Successors according to CFG: BB#262(0x04000000 / 0x80000000 = 3.12%) BB#261(0x7c000000 / 0x80000000 = 96.88%)

Final scalar MachineLoop :
BB#265: derived from LLVM BB %for.body1153
Live Ins: %R1D %R2D %R5D %R6D %R0H %R3L %R4L %R7L %R8L %R9L %R10L %R11L %R12L %R13L
Predecessors according to CFG: BB#264 BB#265
%R14D = LG %R5D, 0, %noreg; mem:LD8%lsr.iv474476
STG %R14D, %R1D, -8, %noreg; mem:ST8%scevgep482
%R14D = LG %R2D, 0, %noreg; mem:LD8%lsr.iv470472
STG %R14D, %R1D, 0, %noreg; mem:ST8%lsr.iv479481
%R1D = LA %R1D, 16, %noreg
%R5D = LAY %R5D, -16, %noreg
%R2D = LAY %R2D, -16, %noreg
%R4L<def,tied1> = BRCT %R4L<kill,tied0>, <BB#265>, %CC<imp-def,dead>
Successors according to CFG: BB#266(0x04000000 / 0x80000000 = 3.12%) BB#265(0x7c000000 / 0x80000000 = 96.88%)

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

Endilll added the llvm:optimizations label Apr 27, 2025
