I find that many loops that get vectorized suffer because the LSR pass does not handle them.
I first thought this was caused by the vectorized address computations, but even when I forced the vectorizer (on my experimental branch) to scalarize all address computations, the LSR pass did nothing.
The vectorization factor was 2.
What I see is that IVUsers does not recognize the loads as IV users in the vectorized loop. It turns out that the vectorized loop's 'sext' instruction is !isInteresting(), because this SCEV:
(sext i32 {(1 + (2 * %mm.651771)),+,4}<%vector.body3549> to i64)
is not an SCEVAddRecExpr: the sext stays outside the recurrence.
In the scalar loop, however, the sext is interesting, with this SCEVAddRecExpr:
{(sext i32 (-1 + (2 * %kk1108.11765.in.ph)) to i64),+,-2}<%for.body1153>
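To make the difference between the two SCEVs concrete: a sign extension can only be moved inside a recurrence when the narrow i32 arithmetic is known not to wrap. The C++ sketch below is an illustration only, not LLVM code, and all helper names are hypothetical. It evaluates sext(i32 {start,+,step}) and the i64 recurrence {sext(start),+,sext(step)} at iteration n; the two agree for small values but diverge once the i32 add wraps. In the vectorized IR of this report, the adds feeding the index (for example %offset.idx3596) carry no nsw flag, which may be why SCEV keeps the sext outside. That is an assumption, not something confirmed here.

```cpp
#include <cassert>
#include <cstdint>

// Value of the i32 recurrence {start,+,step} at iteration n, computed with
// two's-complement wraparound, i.e. what a plain `add i32` without nsw does.
int32_t addrec_i32(int32_t start, int32_t step, uint32_t n) {
    uint32_t v = static_cast<uint32_t>(start) + static_cast<uint32_t>(step) * n;
    return static_cast<int32_t>(v);
}

// sext(i32 {start,+,step}): narrow arithmetic first, then sign-extend.
int64_t sext_of_addrec(int32_t start, int32_t step, uint32_t n) {
    return static_cast<int64_t>(addrec_i32(start, step, n));
}

// {sext(start),+,sext(step)}: sign-extend first, then do the arithmetic in i64.
int64_t addrec_of_sext(int32_t start, int32_t step, uint32_t n) {
    return static_cast<int64_t>(start) +
           static_cast<int64_t>(step) * static_cast<int64_t>(n);
}

// True when moving the sext inside the recurrence gives the same value at n.
bool sext_commutes(int32_t start, int32_t step, uint32_t n) {
    return sext_of_addrec(start, step, n) == addrec_of_sext(start, step, n);
}

// Small self-check: no wrap means the two forms agree, a wrap makes them differ.
void self_check() {
    assert(sext_commutes(9, -2, 3));           // like the scalar loop, step -2
    assert(!sext_commutes(INT32_MAX - 2, 4, 1)); // i32 add wraps at n = 1
}
```

With no-wrap information the two forms are interchangeable, which matches the scalar loop's SCEV; without it, only the outer sext form is safe.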
The uses of the PHINode follow patterns like these:
phi -> add -> shl -> mul -> sext -> gep -> bitcast -> load
phi -> sub -> add -> add -> shl -> sext -> gep -> bitcast -> load
I would like to ask whether IVUsers / LSR is supposed to work well with vectorized loops, and whether anyone has encountered this before.
/Jonas
My example looks like this:
Loop before vectorize pass:
for.body1153: ; preds = %for.body1153, %for.body1153.preheader
%kk1108.11765.in = phi i32 [ %kk1108.11765, %for.body1153 ], [ %kstart1109.11769, %for.body1153.preheader ]
%mm.661764 = phi i32 [ %inc1171, %for.body1153 ], [ %mm.651771, %for.body1153.preheader ]
%ii1106.11763 = phi i32 [ %inc1169, %for.body1153 ], [ %add1150, %for.body1153.preheader ]
%kk1108.11765 = add nsw i32 %kk1108.11765.in, -1
%mul1154 = shl nsw i32 %kk1108.11765, 1
%idxprom1155 = sext i32 %mul1154 to i64
%arrayidx1156 = getelementptr inbounds double, double* %call113, i64 %idxprom1155
%387 = bitcast double* %arrayidx1156 to i64*
%388 = load i64, i64* %387, align 8, !tbaa !19
%mul1157 = shl nsw i32 %mm.661764, 1
%idxprom1158 = sext i32 %mul1157 to i64
%arrayidx1159 = getelementptr inbounds double, double* %dvec, i64 %idxprom1158
%389 = bitcast double* %arrayidx1159 to i64*
store i64 %388, i64* %389, align 8, !tbaa !19
%add1161 = or i32 %mul1154, 1
%idxprom1162 = sext i32 %add1161 to i64
%arrayidx1163 = getelementptr inbounds double, double* %call113, i64 %idxprom1162
%390 = bitcast double* %arrayidx1163 to i64*
%391 = load i64, i64* %390, align 8, !tbaa !19
%add1165 = or i32 %mul1157, 1
%idxprom1166 = sext i32 %add1165 to i64
%arrayidx1167 = getelementptr inbounds double, double* %dvec, i64 %idxprom1166
%392 = bitcast double* %arrayidx1167 to i64*
store i64 %391, i64* %392, align 8, !tbaa !19
%inc1169 = add nuw nsw i32 %ii1106.11763, 1
%inc1171 = add nsw i32 %mm.661764, 1
%exitcond2440 = icmp eq i32 %inc1169, %3
br i1 %exitcond2440, label %for.end1172.loopexit, label %for.body1153
Loop after vectorize pass (with scalarized address computations to try to help LSR):
vector.body3549: ; preds = %vector.body3549, %vector.ph3582
%index3583 = phi i32 [ 0, %vector.ph3582 ], [ %index.next3584, %vector.body3549 ]
%offset.idx3592 = sub i32 %kstart1109.11769, %index3583
%broadcast.splatinsert3593 = insertelement <2 x i32> undef, i32 %offset.idx3592, i32 0
%broadcast.splat3594 = shufflevector <2 x i32> %broadcast.splatinsert3593, <2 x i32> undef, <2 x i32> zeroinitializer
%induction3595 = add <2 x i32> %broadcast.splat3594, <i32 0, i32 -1>
%418 = add i32 %offset.idx3592, 0
%419 = add i32 %offset.idx3592, -1
%offset.idx3596 = add i32 %mm.651771, %index3583
%broadcast.splatinsert3597 = insertelement <2 x i32> undef, i32 %offset.idx3596, i32 0
%broadcast.splat3598 = shufflevector <2 x i32> %broadcast.splatinsert3597, <2 x i32> undef, <2 x i32> zeroinitializer
%induction3599 = add <2 x i32> %broadcast.splat3598, <i32 0, i32 1>
%420 = add i32 %offset.idx3596, 0
%offset.idx3600 = add i32 %add1150, %index3583
%broadcast.splatinsert3601 = insertelement <2 x i32> undef, i32 %offset.idx3600, i32 0
%broadcast.splat3602 = shufflevector <2 x i32> %broadcast.splatinsert3601, <2 x i32> undef, <2 x i32> zeroinitializer
%induction3603 = add <2 x i32> %broadcast.splat3602, <i32 0, i32 1>
%421 = add i32 %offset.idx3600, 0
%422 = add nsw i32 %418, -1
%423 = add nsw i32 %419, -1
%424 = shl nsw i32 %422, 1
%425 = shl nsw i32 %423, 1
%426 = sext i32 %424 to i64
%427 = sext i32 %425 to i64
%428 = getelementptr inbounds double, double* %call113, i64 %426
%429 = getelementptr inbounds double, double* %call113, i64 %427
%430 = bitcast double* %428 to i64*
%431 = bitcast double* %429 to i64*
%432 = load i64, i64* %430, align 8, !tbaa !19, !alias.scope !21
%433 = load i64, i64* %431, align 8, !tbaa !19, !alias.scope !21
%434 = insertelement <2 x i64> undef, i64 %432, i32 0
%435 = insertelement <2 x i64> %434, i64 %433, i32 1
%436 = shl nsw i32 %420, 1
%437 = sext i32 %436 to i64
%438 = getelementptr inbounds double, double* %dvec, i64 %437
%439 = bitcast double* %438 to i64*
%440 = or i32 %424, 1
%441 = or i32 %425, 1
%442 = sext i32 %440 to i64
%443 = sext i32 %441 to i64
%444 = getelementptr inbounds double, double* %call113, i64 %442
%445 = getelementptr inbounds double, double* %call113, i64 %443
%446 = bitcast double* %444 to i64*
%447 = bitcast double* %445 to i64*
%448 = load i64, i64* %446, align 8, !tbaa !19, !alias.scope !24
%449 = load i64, i64* %447, align 8, !tbaa !19, !alias.scope !24
%450 = insertelement <2 x i64> undef, i64 %448, i32 0
%451 = insertelement <2 x i64> %450, i64 %449, i32 1
%452 = or i32 %436, 1
%453 = sext i32 %452 to i64
%454 = getelementptr inbounds double, double* %dvec, i64 %453
%455 = bitcast double* %454 to i64*
%456 = getelementptr i64, i64* %455, i32 -1
%457 = bitcast i64* %456 to <4 x i64>*
%458 = shufflevector <2 x i64> %435, <2 x i64> %451, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%interleaved.vec3604 = shufflevector <4 x i64> %458, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
store <4 x i64> %interleaved.vec3604, <4 x i64>* %457, align 8, !tbaa !19, !alias.scope !26, !noalias !28
%459 = add nuw nsw i32 %421, 1
%460 = add nsw i32 %420, 1
%461 = icmp eq i32 %459, %3
%index.next3584 = add i32 %index3583, 2
%462 = icmp eq i32 %index.next3584, %n.vec3555
br i1 %462, label %middle.block3550, label %vector.body3549, !llvm.loop !29
Final vectorized loop, IR of the header block (after LSR):
vector.body3549: ; preds = %vector.body3549, %vector.body3549.preheader
%lsr.iv467 = phi i32 [ %lsr.iv.next468, %vector.body3549 ], [ %lsr.iv465, %vector.body3549.preheader ]
%lsr.iv461 = phi i32 [ %lsr.iv.next462, %vector.body3549 ], [ %749, %vector.body3549.preheader ]
%lsr.iv459 = phi i32 [ %lsr.iv.next460, %vector.body3549 ], [ %727, %vector.body3549.preheader ]
%750 = add i32 %lsr.iv467, -1
%751 = add i32 %lsr.iv467, -3
%752 = sext i32 %750 to i64
%753 = sext i32 %751 to i64
%754 = getelementptr inbounds double, double* %call113, i64 %752
%755 = getelementptr inbounds double, double* %call113, i64 %753
%756 = bitcast double* %754 to i64*
%757 = bitcast double* %755 to i64*
%758 = load i64, i64* %756, align 8, !tbaa !19, !alias.scope !93
%759 = load i64, i64* %757, align 8, !tbaa !19, !alias.scope !93
%760 = insertelement <2 x i64> undef, i64 %758, i32 0
%761 = insertelement <2 x i64> %760, i64 %759, i32 1
%762 = add i32 %lsr.iv467, -2
%763 = sext i32 %lsr.iv467 to i64
%764 = sext i32 %762 to i64
%765 = getelementptr inbounds double, double* %call113, i64 %763
%766 = getelementptr inbounds double, double* %call113, i64 %764
%767 = bitcast double* %765 to i64*
%768 = bitcast double* %766 to i64*
%769 = load i64, i64* %767, align 8, !tbaa !19, !alias.scope !96
%770 = load i64, i64* %768, align 8, !tbaa !19, !alias.scope !96
%771 = insertelement <2 x i64> undef, i64 %769, i32 0
%772 = insertelement <2 x i64> %771, i64 %770, i32 1
%773 = sext i32 %lsr.iv461 to i64
%774 = getelementptr inbounds double, double* %dvec, i64 %773
%775 = getelementptr double, double* %774, i64 -1
%776 = bitcast double* %775 to <4 x i64>*
%interleaved.vec3604 = shufflevector <2 x i64> %761, <2 x i64> %772, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
store <4 x i64> %interleaved.vec3604, <4 x i64>* %776, align 8, !tbaa !19, !alias.scope !98, !noalias !100
%lsr.iv.next460 = add i32 %lsr.iv459, -2
%lsr.iv.next462 = add i32 %lsr.iv461, 4
%lsr.iv.next468 = add i32 %lsr.iv467, -4
%777 = icmp eq i32 %lsr.iv.next460, 0
br i1 %777, label %middle.block3550, label %vector.body3549, !llvm.loop !101
Final scalar loop, IR of the header block (after LSR):
for.body1153: ; preds = %for.body1153, %for.body1153.preheader4255
%lsr.iv483 = phi i32 [ %lsr.iv.next484, %for.body1153 ], [ %785, %for.body1153.preheader4255 ]
%lsr.iv479 = phi double* [ %scevgep480, %for.body1153 ], [ %scevgep478, %for.body1153.preheader4255 ]
%lsr.iv474 = phi double* [ %scevgep475, %for.body1153 ], [ %scevgep473, %for.body1153.preheader4255 ]
%lsr.iv470 = phi double* [ %scevgep471, %for.body1153 ], [ %scevgep469, %for.body1153.preheader4255 ]
%lsr.iv479481 = bitcast double* %lsr.iv479 to i64*
%lsr.iv474476 = bitcast double* %lsr.iv474 to i64*
%lsr.iv470472 = bitcast double* %lsr.iv470 to i64*
%786 = load i64, i64* %lsr.iv474476, align 8, !tbaa !19
%scevgep482 = getelementptr i64, i64* %lsr.iv479481, i64 -1
store i64 %786, i64* %scevgep482, align 8, !tbaa !19
%787 = load i64, i64* %lsr.iv470472, align 8, !tbaa !19
store i64 %787, i64* %lsr.iv479481, align 8, !tbaa !19
%scevgep471 = getelementptr double, double* %lsr.iv470, i64 -2
%scevgep475 = getelementptr double, double* %lsr.iv474, i64 -2
%scevgep480 = getelementptr double, double* %lsr.iv479, i64 2
%lsr.iv.next484 = add i32 %lsr.iv483, -1
%exitcond2440 = icmp eq i32 %lsr.iv.next484, 0
br i1 %exitcond2440, label %for.end1172.loopexit, label %for.body1153, !llvm.loop !102
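Read as source code, the scalar loop before and after LSR corresponds roughly to the two functions below. This is a sketch reconstructed from the IR in this report, not the original program: the function and parameter names are hypothetical, and doubles are copied directly where the IR loads and stores their bits as i64. The first function recomputes and sign-extends an i32 index for every access, as in for.body1153 before LSR. The second keeps three bumped pointers and a down-counter, as in the %lsr.iv / %scevgep form above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Index form: each access shifts an i32 index, sign-extends it and lets the
// addressing scale it (phi -> add -> shl -> sext -> gep -> load/store).
void copy_pairs_indexed(const double* src, double* dst,
                        int32_t kstart, int32_t mm, int32_t count) {
    int32_t kk = kstart;
    for (int32_t i = 0; i < count; ++i, ++mm) {
        --kk;
        dst[static_cast<int64_t>(2 * mm)]     = src[static_cast<int64_t>(2 * kk)];
        dst[static_cast<int64_t>(2 * mm + 1)] = src[static_cast<int64_t>(2 * kk + 1)];
    }
}

// Strength-reduced form: the index math is hoisted out of the loop and the
// addresses become pointer recurrences with steps -2, -2 and +2.
// Assumption for this sketch: the buffers are large enough that the final
// pointer bump after the last iteration still points into them.
void copy_pairs_strength_reduced(const double* src, double* dst,
                                 int32_t kstart, int32_t mm, int32_t count) {
    if (count <= 0) return;
    const double* p_even = src + 2 * (static_cast<int64_t>(kstart) - 1);
    const double* p_odd  = p_even + 1;
    double* q = dst + 2 * static_cast<int64_t>(mm) + 1;  // points at the odd slot
    for (int32_t n = count; n != 0; --n) {
        q[-1] = *p_even;   // store at offset -1, like %scevgep482
        q[0]  = *p_odd;
        p_even -= 2;
        p_odd  -= 2;
        q += 2;
    }
}
```

The vectorized loop in this report keeps the i32 induction variables and a sext per access instead, which is the form the first function models.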
Final vectorized MachineLoop :
BB#261: derived from LLVM BB %vector.body3549
Live Ins: %R6D %R9D %R0H %R0L %R1L %R3L %R4L %R5L %R7L %R8L %R10L %R11L %R12L %R13L
Predecessors according to CFG: BB#260 BB#261
%R2D = LGFR %R1L
%R2D = SLLG %R2D, %noreg, 3
%R2D = LG %R6D, 0, %R2D; mem:LD8%767(alias.scope=#97)
%R14L = AHIK %R1L, -2, %CC<imp-def,dead>
%R14D = LGFR %R14L
%R14D = SLLG %R14D, %noreg, 3
%R14D = LG %R6D, 0, %R14D; mem:LD8%768(alias.scope=#97)
%V0 = VLVGP %R2D, %R14D
%R2L = AHIK %R1L, -3, %CC<imp-def,dead>
%R2D = LGFR %R2L
%R2D = SLLG %R2D, %noreg, 3
%R2D = LG %R6D, 0, %R2D; mem:LD8%757(alias.scope=#94)
%R14L = AHIK %R1L, -1, %CC<imp-def,dead>
%R14D = LGFR %R14L
%R14D = SLLG %R14D, %noreg, 3
%R14D = LG %R6D, 0, %R14D; mem:LD8%756(alias.scope=#94)
%V1 = VLVGP %R14D, %R2D
%V2 = VMRLG %V1, %V0
%R2D = LGFR %R11L
%R2D = SLLG %R2D, %noreg, 3
VST %V2, %R9D, 8, %R2D; mem:ST16%776+16(tbaa=#20)(alias.scope=#99)(noalias=#97,#94)
%V0 = VMRHG %V1, %V0
%R2D = LAY %R9D, -8, %R2D
VST %V0, %R2D, 0, %noreg; mem:ST16%776(tbaa=#20)(alias.scope=#99)(noalias=#97,#94)
%R1L<def,tied1> = AHI %R1L<kill,tied0>, -4, %CC<imp-def,dead>
%R11L<def,tied1> = AHI %R11L<kill,tied0>, 4, %CC<imp-def,dead>
%R5L<def,tied1> = AHI %R5L<kill,tied0>, -2, %CC
BRC 15, 7, <BB#261>, %CC<imp-use,kill>
Successors according to CFG: BB#262(0x04000000 / 0x80000000 = 3.12%) BB#261(0x7c000000 / 0x80000000 = 96.88%)
Final scalar MachineLoop :
BB#265: derived from LLVM BB %for.body1153
Live Ins: %R1D %R2D %R5D %R6D %R0H %R3L %R4L %R7L %R8L %R9L %R10L %R11L %R12L %R13L
Predecessors according to CFG: BB#264 BB#265
%R14D = LG %R5D, 0, %noreg; mem:LD8%lsr.iv474476
STG %R14D, %R1D, -8, %noreg; mem:ST8%scevgep482
%R14D = LG %R2D, 0, %noreg; mem:LD8%lsr.iv470472
STG %R14D, %R1D, 0, %noreg; mem:ST8%lsr.iv479481
%R1D = LA %R1D, 16, %noreg
%R5D = LAY %R5D, -16, %noreg
%R2D = LAY %R2D, -16, %noreg
%R4L<def,tied1> = BRCT %R4L<kill,tied0>, <BB#265>, %CC<imp-def,dead>
Successors according to CFG: BB#266(0x04000000 / 0x80000000 = 3.12%) BB#265(0x7c000000 / 0x80000000 = 96.88%)