[InstCombine] Transform high latency, dependent FSQRT/FDIV into FMUL #87474
Conversation
I tried adjusting the comment to use the original variable names instead of expressing it as an assignment to the original names.
@llvm/pr-subscribers-llvm-transforms

Author: None (sushgokh)

Changes

The proposed patch, in general, tries to transform the below code sequence:

x = 1.0 / sqrt (a);
r1 = x * x;        // same as 1.0 / a
r2 = a / sqrt(a);  // same as sqrt (a)

TO (If x, r1 and r2 are all used further in the code)

r1 = 1.0 / a
r2 = sqrt (a)
x = r1 * r2

The transform tries to make the high latency sqrt and div operations independent and also saves one multiplication.

The patch was tested with the SPEC17 suite with cpu=neoverse-v2. The performance uplift achieved was: 544.nab_r ~4%.

No other regressions were observed. Also, no compile time differences were observed with the patch.

Closes #54652

Patch is 27.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87474.diff

2 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp b/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
index 8c698e52b5a0e6..bfe65264738c4d 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
@@ -626,6 +626,127 @@ Instruction *InstCombinerImpl::foldPowiReassoc(BinaryOperator &I) {
return nullptr;
}
+bool isFSqrtDivToFMulLegal(Instruction *X, SmallSetVector<Instruction *, 2> &R1,
+ SmallSetVector<Instruction *, 2> &R2) {
+
+ BasicBlock *BBx = X->getParent();
+ BasicBlock *BBr1 = R1[0]->getParent();
+ BasicBlock *BBr2 = R2[0]->getParent();
+
+ auto IsStrictFP = [](Instruction *I) {
+ IntrinsicInst *II = dyn_cast<IntrinsicInst>(I);
+ return II && II->isStrictFP();
+ };
+
+ // Check the constraints on instruction X.
+ auto XConstraintsSatisfied = [X, &IsStrictFP]() {
+ if (IsStrictFP(X))
+ return false;
+ // X must have at least 4 uses.
+ // 3 uses as part of
+ // r1 = x * x
+ // r2 = a * x
+ // Now, post-transform, r1/r2 will no longer have usage of 'x' and if the
+ // changes to 'x' need to persist, we must have one more usage of 'x'
+ if (!X->hasNUsesOrMore(4))
+ return false;
+ // Check if reciprocalFP is enabled.
+ bool RecipFPMath = dyn_cast<FPMathOperator>(X)->hasAllowReciprocal();
+ return RecipFPMath;
+ };
+ if (!XConstraintsSatisfied())
+ return false;
+
+ // Check the constraints on instructions in R1.
+ auto R1ConstraintsSatisfied = [BBr1, &IsStrictFP](Instruction *I) {
+ if (IsStrictFP(I))
+ return false;
+ // When you have multiple instructions residing in R1 and R2 respectively,
+ // it's difficult to generate combinations of (R1,R2) and then check if we
+ // have the required pattern. So, for now, just be conservative.
+ if (I->getParent() != BBr1)
+ return false;
+ if (!I->hasNUsesOrMore(1))
+ return false;
+ // The optimization tries to convert
+ // R1 = div * div where, div = 1/sqrt(a)
+ // to
+ // R1 = 1/a
+ // Now, this simplification does not work because sqrt(a)=NaN when a<0
+ if (!I->hasNoNaNs())
+ return false;
+ // sqrt(-0.0) = -0.0, and doing this simplification would change the sign of
+ // the result.
+ return I->hasNoSignedZeros();
+ };
+ if (!std::all_of(R1.begin(), R1.end(), R1ConstraintsSatisfied))
+ return false;
+
+ // Check the constraints on instructions in R2.
+ auto R2ConstraintsSatisfied = [BBr2, &IsStrictFP](Instruction *I) {
+ if (IsStrictFP(I))
+ return false;
+ // When you have multiple instructions residing in R1 and R2 respectively,
+ // it's difficult to generate combinations of (R1,R2) and then check if we
+ // have the required pattern. So, for now, just be conservative.
+ if (I->getParent() != BBr2)
+ return false;
+ if (!I->hasNUsesOrMore(1))
+ return false;
+ // This simplification changes
+ // R2 = a * 1/sqrt(a)
+ // to
+ // R2 = sqrt(a)
+ // Now, sqrt(-0.0) = -0.0 and doing this simplification would produce -0.0
+ // instead of NaN.
+ return I->hasNoSignedZeros();
+ };
+ if (!std::all_of(R2.begin(), R2.end(), R2ConstraintsSatisfied))
+ return false;
+
+ // Check the constraints on X, R1 and R2 combined.
+ // fdiv instruction and one of the multiplications must reside in the same
+ // block. If not, the optimized code may execute more ops than before and
+ // this may hamper the performance.
+ return (BBx == BBr1 || BBx == BBr2);
+}
+
+void getFSqrtDivOptPattern(Value *Div, SmallSetVector<Instruction *, 2> &R1,
+ SmallSetVector<Instruction *, 2> &R2) {
+ Value *A;
+ if (match(Div, m_FDiv(m_FPOne(), m_Sqrt(m_Value(A)))) ||
+ match(Div, m_FDiv(m_SpecificFP(-1.0), m_Sqrt(m_Value(A))))) {
+ for (auto U : Div->users()) {
+ Instruction *I = dyn_cast<Instruction>(U);
+ if (!(I && I->getOpcode() == Instruction::FMul))
+ continue;
+
+ if (match(I, m_FMul(m_Specific(Div), m_Specific(Div)))) {
+ R1.insert(I);
+ continue;
+ }
+
+ Value *X;
+ if (match(I, m_FMul(m_Specific(Div), m_Value(X))) && X == A) {
+ R2.insert(I);
+ continue;
+ }
+
+ if (match(I, m_FMul(m_Value(X), m_Specific(Div))) && X == A) {
+ R2.insert(I);
+ continue;
+ }
+ }
+ }
+}
+
+bool delayFMulSqrtTransform(Value *Div) {
+ SmallSetVector<Instruction *, 2> R1, R2;
+ getFSqrtDivOptPattern(Div, R1, R2);
+ return (!(R1.empty() || R2.empty()) &&
+ isFSqrtDivToFMulLegal((Instruction *)Div, R1, R2));
+}
+
Instruction *InstCombinerImpl::foldFMulReassoc(BinaryOperator &I) {
Value *Op0 = I.getOperand(0);
Value *Op1 = I.getOperand(1);
@@ -705,11 +826,11 @@ Instruction *InstCombinerImpl::foldFMulReassoc(BinaryOperator &I) {
// has the necessary (reassoc) fast-math-flags.
if (I.hasNoSignedZeros() &&
match(Op0, (m_FDiv(m_SpecificFP(1.0), m_Value(Y)))) &&
- match(Y, m_Sqrt(m_Value(X))) && Op1 == X)
+ match(Y, m_Sqrt(m_Value(X))) && Op1 == X && !delayFMulSqrtTransform(Op0))
return BinaryOperator::CreateFDivFMF(X, Y, &I);
if (I.hasNoSignedZeros() &&
match(Op1, (m_FDiv(m_SpecificFP(1.0), m_Value(Y)))) &&
- match(Y, m_Sqrt(m_Value(X))) && Op0 == X)
+ match(Y, m_Sqrt(m_Value(X))) && Op0 == X && !delayFMulSqrtTransform(Op1))
return BinaryOperator::CreateFDivFMF(X, Y, &I);
// Like the similar transform in instsimplify, this requires 'nsz' because
@@ -717,7 +838,8 @@ Instruction *InstCombinerImpl::foldFMulReassoc(BinaryOperator &I) {
if (I.hasNoNaNs() && I.hasNoSignedZeros() && Op0 == Op1 && Op0->hasNUses(2)) {
// Peek through fdiv to find squaring of square root:
// (X / sqrt(Y)) * (X / sqrt(Y)) --> (X * X) / Y
- if (match(Op0, m_FDiv(m_Value(X), m_Sqrt(m_Value(Y))))) {
+ if (match(Op0, m_FDiv(m_Value(X), m_Sqrt(m_Value(Y)))) &&
+ !delayFMulSqrtTransform(Op0)) {
Value *XX = Builder.CreateFMulFMF(X, X, &I);
return BinaryOperator::CreateFDivFMF(XX, Y, &I);
}
@@ -1796,6 +1918,35 @@ static Instruction *foldFDivSqrtDivisor(BinaryOperator &I,
return BinaryOperator::CreateFMulFMF(Op0, NewSqrt, &I);
}
+Value *convertFSqrtDivIntoFMul(CallInst *CI, Instruction *X,
+ SmallSetVector<Instruction *, 2> &R1,
+ SmallSetVector<Instruction *, 2> &R2,
+ Value *SqrtOp, InstCombiner::BuilderTy &B) {
+
+ // 1. synthesize tmp1 = 1/a and replace uses of r1
+ B.SetInsertPoint(X);
+ Value *Tmp1 =
+ B.CreateFDivFMF(ConstantFP::get(R1[0]->getType(), 1.0), SqrtOp, R1[0]);
+ for (auto *I : R1)
+ I->replaceAllUsesWith(Tmp1);
+
+ // 2. No need of synthesizing Tmp2 again. In this scenario, tmp2 = CI. Replace
+ // uses of r2 with tmp2
+ for (auto *I : R2)
+ I->replaceAllUsesWith(CI);
+
+ // 3. synthesize tmp3 = tmp1 * tmp2 . Replace uses of 'x' with tmp3
+ Value *Tmp3;
+ // If x = -1/sqrt(a) initially, then Tmp3 = -(Tmp1*tmp2)
+ if (match(X, m_FDiv(m_SpecificFP(-1.0), m_Specific(CI)))) {
+ Value *Mul = B.CreateFMul(Tmp1, CI);
+ Tmp3 = B.CreateFNegFMF(Mul, X);
+ } else
+ Tmp3 = B.CreateFMulFMF(Tmp1, CI, X);
+
+ return Tmp3;
+}
+
Instruction *InstCombinerImpl::visitFDiv(BinaryOperator &I) {
Module *M = I.getModule();
@@ -1820,6 +1971,26 @@ Instruction *InstCombinerImpl::visitFDiv(BinaryOperator &I) {
return R;
Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);
+
+ // Convert
+ // x = 1.0/sqrt(a)
+ // r1 = x * x;
+ // r2 = a * x;
+ //
+ // TO
+ //
+ // r1 = 1/a
+ // r2 = sqrt(a)
+ // x = r1 * r2
+ SmallSetVector<Instruction *, 2> R1, R2;
+ getFSqrtDivOptPattern(&I, R1, R2);
+ if (!(R1.empty() || R2.empty()) && isFSqrtDivToFMulLegal(&I, R1, R2)) {
+ CallInst *CI = (CallInst *)((&I)->getOperand(1));
+ Value *SqrtOp = CI->getArgOperand(0);
+ if (Value *D = convertFSqrtDivIntoFMul(CI, &I, R1, R2, SqrtOp, Builder))
+ return replaceInstUsesWith(I, D);
+ }
+
if (isa<Constant>(Op0))
if (SelectInst *SI = dyn_cast<SelectInst>(Op1))
if (Instruction *R = FoldOpIntoSelect(I, SI))
diff --git a/llvm/test/Transforms/InstCombine/fsqrtdiv-transform.ll b/llvm/test/Transforms/InstCombine/fsqrtdiv-transform.ll
new file mode 100644
index 00000000000000..4852337d4b6586
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/fsqrtdiv-transform.ll
@@ -0,0 +1,463 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -S -passes='instcombine<no-verify-fixpoint>' < %s | FileCheck %s
+
+@x = global double 0.000000e+00
+@r1 = global double 0.000000e+00
+@r2 = global double 0.000000e+00
+@r3 = global double 0.000000e+00
+
+; div/mul/mul1 all in the same block.
+define void @bb_constraint_case1(double %a) {
+; CHECK-LABEL: define void @bb_constraint_case1(
+; CHECK-SAME: double [[A:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = tail call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: [[TMP1:%.*]] = fdiv nnan nsz double 1.000000e+00, [[A]]
+; CHECK-NEXT: [[DIV:%.*]] = fmul arcp double [[TMP1]], [[TMP0]]
+; CHECK-NEXT: store double [[DIV]], ptr @x, align 8
+; CHECK-NEXT: store double [[TMP1]], ptr @r1, align 8
+; CHECK-NEXT: store double [[TMP0]], ptr @r2, align 8
+; CHECK-NEXT: ret void
+entry:
+ %sqrt = tail call double @llvm.sqrt.f64(double %a)
+ %div = fdiv arcp double 1.000000e+00, %sqrt
+ store double %div, ptr @x
+ %mul = fmul nnan nsz double %div, %div
+ store double %mul, ptr @r1
+ %mul1 = fmul nsz double %a, %div
+ store double %mul1, ptr @r2
+ ret void
+}
+; div/mul in one block and mul1 in other block with conditional guard.
+define void @bb_constraint_case2(double %a, i32 %d) {
+; CHECK-LABEL: define void @bb_constraint_case2(
+; CHECK-SAME: double [[A:%.*]], i32 [[D:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: [[TMP1:%.*]] = fdiv nnan nsz double 1.000000e+00, [[A]]
+; CHECK-NEXT: [[DIV:%.*]] = fmul arcp double [[TMP1]], [[TMP0]]
+; CHECK-NEXT: store double [[DIV]], ptr @x, align 8
+; CHECK-NEXT: store double [[TMP1]], ptr @r1, align 8
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[D]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END:%.*]], label [[IF_THEN:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: store double [[TMP0]], ptr @r2, align 8
+; CHECK-NEXT: br label [[IF_END]]
+; CHECK: if.end:
+; CHECK-NEXT: ret void
+entry:
+ %sqrt = call double @llvm.sqrt.f64(double %a)
+ %div = fdiv arcp double 1.000000e+00, %sqrt
+ store double %div, ptr @x
+ %mul = fmul nnan nsz double %div, %div
+ store double %mul, ptr @r1
+ %tobool.not = icmp eq i32 %d, 0
+ br i1 %tobool.not, label %if.end, label %if.then
+
+if.then: ; preds = %entry
+ %mul1 = fmul nsz double %div, %a
+ store double %mul1, ptr @r2
+ br label %if.end
+
+if.end: ; preds = %if.then, %entry
+ ret void
+}
+
+; div in one block. mul/mul1 in other block and conditionally guarded. Don't optimize.
+define void @bb_constraint_case3(double %a, i32 %d) {
+; CHECK-LABEL: define void @bb_constraint_case3(
+; CHECK-SAME: double [[A:%.*]], i32 [[D:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: [[DIV:%.*]] = fdiv arcp double 1.000000e+00, [[TMP0]]
+; CHECK-NEXT: store double [[DIV]], ptr @x, align 8
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[D]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END:%.*]], label [[IF_THEN:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: [[MUL:%.*]] = fmul nnan nsz double [[DIV]], [[DIV]]
+; CHECK-NEXT: store double [[MUL]], ptr @r1, align 8
+; CHECK-NEXT: [[TMP1:%.*]] = load double, ptr @x, align 8
+; CHECK-NEXT: [[MUL1:%.*]] = fmul nsz double [[TMP1]], [[A]]
+; CHECK-NEXT: store double [[MUL1]], ptr @r2, align 8
+; CHECK-NEXT: br label [[IF_END]]
+; CHECK: if.end:
+; CHECK-NEXT: ret void
+entry:
+ %sqrt = call double @llvm.sqrt.f64(double %a)
+ %div = fdiv arcp double 1.000000e+00, %sqrt
+ store double %div, ptr @x
+ %tobool = icmp ne i32 %d, 0
+ br i1 %tobool, label %if.then, label %if.end
+
+if.then: ; preds = %entry
+ %mul = fmul nnan nsz double %div, %div
+ store double %mul, ptr @r1
+ %1 = load double, ptr @x
+ %mul1 = fmul nsz double %a, %1
+ store double %mul1, ptr @r2
+ br label %if.end
+
+if.end: ; preds = %if.then, %entry
+ ret void
+}
+
+; div in one block. mul/mul3 each in different block and conditionally guarded. Don't optimize.
+define void @bb_constraint_case4(double %a, i32 %c, i32 %d) {
+; CHECK-LABEL: define void @bb_constraint_case4(
+; CHECK-SAME: double [[A:%.*]], i32 [[C:%.*]], i32 [[D:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: [[DIV:%.*]] = fdiv arcp double 1.000000e+00, [[TMP0]]
+; CHECK-NEXT: store double [[DIV]], ptr @x, align 8
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[C]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END:%.*]], label [[IF_THEN:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: [[MUL:%.*]] = fmul nnan nsz double [[DIV]], [[DIV]]
+; CHECK-NEXT: store double [[MUL]], ptr @r1, align 8
+; CHECK-NEXT: br label [[IF_END]]
+; CHECK: if.end:
+; CHECK-NEXT: [[TOBOOL1_NOT:%.*]] = icmp eq i32 [[D]], 0
+; CHECK-NEXT: br i1 [[TOBOOL1_NOT]], label [[IF_END4:%.*]], label [[IF_THEN2:%.*]]
+; CHECK: if.then2:
+; CHECK-NEXT: [[TMP1:%.*]] = load double, ptr @x, align 8
+; CHECK-NEXT: [[MUL3:%.*]] = fmul nsz double [[TMP1]], [[A]]
+; CHECK-NEXT: store double [[MUL3]], ptr @r2, align 8
+; CHECK-NEXT: br label [[IF_END4]]
+; CHECK: if.end4:
+; CHECK-NEXT: ret void
+entry:
+ %sqrt = call double @llvm.sqrt.f64(double %a)
+ %div = fdiv arcp double 1.000000e+00, %sqrt
+ store double %div, ptr @x
+ %tobool = icmp ne i32 %c, 0
+ br i1 %tobool, label %if.then, label %if.end
+
+if.then: ; preds = %entry
+ %mul = fmul nnan nsz double %div, %div
+ store double %mul, ptr @r1
+ br label %if.end
+
+if.end: ; preds = %if.then, %entry
+ %tobool1 = icmp ne i32 %d, 0
+ br i1 %tobool1, label %if.then2, label %if.end4
+
+if.then2: ; preds = %if.end
+ %1 = load double, ptr @x
+ %mul3 = fmul nsz double %a, %1
+ store double %mul3, ptr @r2
+ br label %if.end4
+
+if.end4: ; preds = %if.then2, %if.end
+ ret void
+}
+
+; sqrt value comes from different blocks. Don't optimize.
+define void @bb_constraint_case5(double %a, i32 %c) {
+; CHECK-LABEL: define void @bb_constraint_case5(
+; CHECK-SAME: double [[A:%.*]], i32 [[C:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[C]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_ELSE:%.*]], label [[IF_THEN:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: br label [[IF_END:%.*]]
+; CHECK: if.else:
+; CHECK-NEXT: [[ADD:%.*]] = fadd double [[A]], 1.000000e+01
+; CHECK-NEXT: [[TMP1:%.*]] = call double @llvm.sqrt.f64(double [[ADD]])
+; CHECK-NEXT: br label [[IF_END]]
+; CHECK: if.end:
+; CHECK-NEXT: [[DOTPN:%.*]] = phi double [ [[TMP0]], [[IF_THEN]] ], [ [[TMP1]], [[IF_ELSE]] ]
+; CHECK-NEXT: [[DIV:%.*]] = fdiv arcp double 1.000000e+00, [[DOTPN]]
+; CHECK-NEXT: [[MUL:%.*]] = fmul nnan nsz double [[DIV]], [[DIV]]
+; CHECK-NEXT: store double [[MUL]], ptr @r1, align 8
+; CHECK-NEXT: [[MUL2:%.*]] = fmul nsz double [[DIV]], [[A]]
+; CHECK-NEXT: store double [[MUL2]], ptr @r2, align 8
+; CHECK-NEXT: ret void
+entry:
+ %tobool = icmp ne i32 %c, 0
+ br i1 %tobool, label %if.then, label %if.else
+
+if.then: ; preds = %entry
+ %0 = call double @llvm.sqrt.f64(double %a)
+ br label %if.end
+
+if.else: ; preds = %entry
+ %add = fadd double %a, 1.000000e+01
+ %1 = call double @llvm.sqrt.f64(double %add)
+ br label %if.end
+
+if.end: ; preds = %if.else, %if.then
+ %sqrt = phi double[ %0, %if.then], [ %1, %if.else]
+ %div = fdiv arcp double 1.000000e+00, %sqrt
+ %mul = fmul nnan nsz double %div, %div
+ store double %mul, ptr @r1
+ %mul2 = fmul nsz double %a, %div
+ store double %mul2, ptr @r2
+ ret void
+}
+
+; div in one block and conditionally guarded. mul/mul1 in other block. Don't optimize.
+define void @bb_constraint_case6(double %a, i32 %d) {
+; CHECK-LABEL: define void @bb_constraint_case6(
+; CHECK-SAME: double [[A:%.*]], i32 [[D:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[D]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END:%.*]], label [[IF_THEN:%.*]]
+; CHECK: entry.if.end_crit_edge:
+; CHECK-NEXT: [[DOTPRE:%.*]] = load double, ptr @x, align 8
+; CHECK-NEXT: br label [[IF_END1:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: [[TMP0:%.*]] = tail call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: [[DIV:%.*]] = fdiv arcp double 1.000000e+00, [[TMP0]]
+; CHECK-NEXT: store double [[DIV]], ptr @x, align 8
+; CHECK-NEXT: br label [[IF_END1]]
+; CHECK: if.end:
+; CHECK-NEXT: [[TMP1:%.*]] = phi double [ [[DOTPRE]], [[IF_END]] ], [ [[DIV]], [[IF_THEN]] ]
+; CHECK-NEXT: [[MUL:%.*]] = fmul nnan nsz double [[TMP1]], [[TMP1]]
+; CHECK-NEXT: store double [[MUL]], ptr @r1, align 8
+; CHECK-NEXT: [[MUL1:%.*]] = fmul nsz double [[TMP1]], [[A]]
+; CHECK-NEXT: store double [[MUL1]], ptr @r2, align 8
+; CHECK-NEXT: ret void
+entry:
+ %tobool.not = icmp eq i32 %d, 0
+ br i1 %tobool.not, label %entry.if.end_crit_edge, label %if.then
+
+entry.if.end_crit_edge: ; preds = %entry
+ %.pre = load double, ptr @x
+ br label %if.end
+
+if.then: ; preds = %entry
+ %sqrt = tail call double @llvm.sqrt.f64(double %a)
+ %div = fdiv arcp double 1.000000e+00, %sqrt
+ store double %div, ptr @x
+ br label %if.end
+
+if.end: ; preds = %entry.if.end_crit_edge, %if.then
+ %1 = phi double [ %.pre, %entry.if.end_crit_edge ], [ %div, %if.then ]
+ %mul = fmul nnan nsz double %1, %1
+ store double %mul, ptr @r1
+ %mul1 = fmul nsz double %1, %a
+ store double %mul1, ptr @r2
+ ret void
+}
+
+; value for first mul(i.e. div4.sink) comes from different blocks. Don't optimize.
+define void @bb_constraint_case7(double %a, i32 %c, i32 %d) {
+; CHECK-LABEL: define void @bb_constraint_case7(
+; CHECK-SAME: double [[A:%.*]], i32 [[C:%.*]], i32 [[D:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = tail call double @llvm.sqrt.f64(double [[A]])
+; CHECK-NEXT: [[DIV:%.*]] = fdiv arcp double 1.000000e+00, [[TMP0]]
+; CHECK-NEXT: store double [[DIV]], ptr @x, align 8
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i32 [[C]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_ELSE:%.*]], label [[IF_THEN:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: [[DIV1:%.*]] = fdiv double 3.000000e+00, [[A]]
+; CHECK-NEXT: br label [[IF_END6:%.*]]
+; CHECK: if.else:
+; CHECK-NEXT: [[TOBOOL2_NOT:%.*]] = icmp eq i32 [[D]], 0
+; CHECK-NEXT: br i1 [[TOBOOL2_NOT]], label [[IF_ELSE5:%.*]], label [[IF_THEN3:%.*]]
+; CHECK: if.then3:
+; CHECK-NEXT: [[DIV4:%.*]] = fdiv double 2.000000e+00, [[A]]
+; CHECK-NEXT: br label [[IF_END6]]
+; CHECK: if.else5:
+; CHECK-NEXT: [[MUL:%.*]] = fmul nnan nsz double [[DIV]], [[DIV]]
+; CHECK-NEXT: br label [[IF_END6]]
+; CHECK: if.end6:
+; CHECK-NEXT: [[DIV4_SINK:%.*]] = phi double [ [[DIV4]], [[IF_THEN3]] ], [ [[MUL]], [[IF_ELSE5]] ], [ [[DIV1]], [[IF_THEN]] ]
+; CHECK-NEXT: store double [[DIV4_SINK]], ptr @r1, align 8
+; CHECK-NEXT: [[MUL7:%.*]] = fmul nsz double [[DIV]], [[A]]
+; CHECK-NEXT: store double [[MUL7]], ptr @r2, align 8
+; CHECK-NEXT: ret void
+entry:
+ %sqrt = tail call double @llvm.sqrt.f64(double %a)
...
[truncated]
I haven't gone through all of the thinking on fast-math flags yet, but I've noted at least one incorrect flag:
(drive-by comments only, I don't review FP transforms.)
@arsenm @jcranmer-intel I have tried addressing all the floating point issues. Are there any more issues I need to address?
ping @arsenm @jcranmer-intel
// r1 = 1/a
// r2 = sqrt(a)
// x = r1 * r2
SmallPtrSet<Instruction *, 2> R1, R2;
Had to change from SmallVector to SmallPtrSet since invoking the users() API somehow got me duplicate users.
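For illustration only (V here is a stand-in for the fdiv value, not a name from the patch), a minimal sketch of why users() can report the same instruction more than once and how a set-based container absorbs the duplicates:

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: Value::users() yields one entry per *use*, so an instruction such
// as "%mul = fmul double %div, %div" shows up twice among %div's users.
// Inserting into a SmallPtrSet keeps a single copy of each user.
static void collectUniqueUsers(Value *V,
                               SmallPtrSetImpl<Instruction *> &Users) {
  for (User *U : V->users())
    if (auto *I = dyn_cast<Instruction>(U))
      Users.insert(I); // duplicates collapse here
}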
B.SetInsertPoint(X);

// Every instance of R1 may have different fpmath metadata and fpmath flags.
// We try to preserve them by having seperate fdiv instruction per R1
seperate -> separate
done
FSqrt->insertBefore(CI);
FSqrt->copyFastMathFlags(I);
FSqrt->copyMetadata(*I);
I->replaceAllUsesWith(FSqrt);
Did you use InstCombine's eraseInstFromFunction?
FSqrt->insertBefore(CI);
FSqrt->copyFastMathFlags(I);
FSqrt->copyMetadata(*I);
I->replaceAllUsesWith(FSqrt);
You should maybe use InstCombiner::replaceInstUsesWith so the users of I are added to the worklist.
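A minimal sketch of the suggested change, assuming it runs inside an InstCombinerImpl member function and that FSqrt and I are as in the quoted lines above:

// replaceInstUsesWith is InstCombine's wrapper around RAUW: it pushes the
// users of I onto the worklist (so they are revisited) before replacing them.
replaceInstUsesWith(*I, FSqrt);
// ...as opposed to the bare IR-level update, which leaves the worklist alone:
// I->replaceAllUsesWith(FSqrt);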
done
for (Instruction *I : R1) {
  FDiv = cast<Instruction>(
      B.CreateFDiv(ConstantFP::get((*R1.begin())->getType(), 1.0), SqrtOp));
Can you get the type from I instead of *R1.begin()?
done
// Replace Uses Of R2 With FSqrt
// Replace Uses Of X With FMul
static Value *convertFSqrtDivIntoFMul(CallInst *CI, Instruction *X,
                                      SmallPtrSetImpl<Instruction *> &R1,
Can these be const references?
done
ping
BasicBlock *BBr1 = (*R1.begin())->getParent();
BasicBlock *BBr2 = (*R2.begin())->getParent();

CallInst *FSqrt = cast<CallInst>(X->getOperand(1));
I don't like the use of cast<> here. I understand that you've verified in getFSqrtDivOptPattern() that this will be a call, but the cast here creates a tight coupling between the functions that isn't enforced by the function semantics. That is, someone could call this function without having called the other. I think it makes more sense to combine them or to call getFSqrtDivOptPattern from here and not require it to be called separately by users of this function.
Firstly, any attempt to call the function directly where the cast<> fails should be immediately visible in a debug build.
Second, this part is kept separate just to increase readability and to separate functionally different things. The scenario you have mentioned is bound to happen everywhere. For the same reason, I have mentioned in the description of the function how x/r1/r2 should look. If you are not satisfied with the cast, maybe I can add an assert.
The cast contains an assert, so there's no need to add one. I understand your point about readability. I think it's better to have code that avoids failures than to have code that fails in obvious ways if misused, but I'm willing to leave that up to you.
ok sure. Will make the necessary change.
if (!FSqrt->hasAllowReassoc() || !FSqrt->hasNoNaNs() ||
    !FSqrt->hasNoSignedZeros() || !FSqrt->hasNoInfs())
The nnan restriction severely limits the usefulness of this transformation, so I think it's worth trying to find alternative ways for the transformation to trigger. I understand that you want it to trigger for cases where x, r1, and r2 have an arbitrary number of users, but if the nnan condition isn't met, you could still perform the transformation if r1 is the only user of x. In addition, you could check isKnownNonNegative() for a.
On the other hand, the ninf requirement is similarly restrictive, so maybe it's just necessary to accept the limitations on this transformation. Since you mentioned CPU2017 in your description, I would mention that ninf can't be used with all CPU2017 benchmarks. In particular, it breaks povray. That's not to say this transformation isn't general enough. I'm just highlighting the limitation.
// Check legality for transforming
// x = 1.0/sqrt(a)
// r1 = x * x;
// r2 = a/sqrt(a);
Excuse me if this was covered in one of the many resolved conversations, but it's not clear to me why you're transforming a/sqrt(a) as part of this pattern. Is it because you need to hoist R2 into the same block as X?
I see that in InstCombinerImpl::foldFMulReassoc() we are transforming a number of patterns into x/sqrt(x) with a comment that the backend is expected to transform that into sqrt(x) if the necessary fast-math flags are present. I'm not sure why that's being left to the backend, but I don't see any reason to perform the transformation here. If you want InstCombine to do that, it could just as easily be an independent transformation.
It may be that this is a case where we decided we needed the "unsafe-fp-math" function attribute because none of the individual fast-math flags clearly allows it. @jcranmer-intel has been working on clarifying the semantics of these flags and might have more to say on this. I think it's definitely a transformation we want to allow, but I would argue that it requires more than just reassoc.
but it's not clear to me why you're transforming a/sqrt(a) as part of this pattern
we would like to express x = r1 * r2
where r1 and r2 are in suitable form. If r2 is not in the required form, there is no point in doing this transformation (i.e. we won't be saving one division in the backend).
We can't wait for the backend to transform r2 here.
I think it's definitely a transformation we want to allow, but I would argue that it requires more than just reassoc.
As far as I remember, I had this discussion with @jcranmer-intel on this PR itself. This is just considered an algebraic rewrite and hence the reassoc flag. But if there are any other flags required, coming post acceptance of his proposal, I am not sure if we should wait.
The semantics of the 'reassoc' flag don't explicitly allow algebraic transformations other than reassociation. It's often treated that way, but that's not what the language reference says. This is one of the things @jcranmer-intel is hoping to correct long-term. For now, we don't have another flag that clearly allows this, so I suppose we'll need to rely on 'reassoc' here.
return false;

// Check the constraints on instructions in R2.
return all_of(R2, [BBr2](Instruction *I) {
Is there a reason to use any_of above and all_of here?
Not really. Initially I had all_of everywhere and someone suggested using any_of. I can make it consistent everywhere, i.e. use all_of. My understanding is it won't matter.
// Although, by value, FSqrt = CI, every instance of R2 may have different
// fpmath metadata and fpmath flags. We try to preserve them by cloning the
// call instruction per R2 instance.
I would think combining the calls and using the most generic fast-math flags would be preferable. You seem to be losing fast-math flags on the sqrt call anyway if R2 and the original sqrt call had different flags.
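One possible shape for that suggestion (a sketch only, reusing the FSqrt/I names from the snippets above and assuming a single surviving call): intersect the IR flags so the one remaining sqrt carries only flags that are valid for every use it now serves.

// Keep a single sqrt call; andIRFlags intersects fast-math (and poison)
// flags, leaving only those common to both instructions.
FSqrt->andIRFlags(I);
I->replaceAllUsesWith(FSqrt);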
; CHECK-NEXT: entry:
; CHECK-NEXT: [[SQRT1:%.*]] = call reassoc double @llvm.sqrt.f64(double [[A]])
; CHECK-NEXT: [[TMP0:%.*]] = fdiv reassoc double 1.000000e+00, [[A]]
; CHECK-NEXT: [[DIV:%.*]] = fmul reassoc ninf arcp double [[TMP0]], [[SQRT1]]
I hate to drag this out further, but this is not quite right. Because this transformation depends on rearranging multiple instructions, the fast-math flags kept should be the intersection of the fast-math flags present on the incoming instructions. You can make a case that r2 is just a simplification of the original r2 instruction, but x and r1 in the output are both formed by combining the original x and r1 instructions, so the arcp flag here should be dropped. The nnan and ninf are slightly different because they can be inferred from their presence on other instructions, so you can use the union of those. Helper functions were recently introduced to let you do it like this:
FastMathFlags NewFMF = FastMathFlags::intersectRewrite(Op1FMF, Op2FMF) |
FastMathFlags::unionValue(Op1FMF, Op2FMF);
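A rough sketch of how those helpers could be applied here (Op1/Op2/NewI are placeholder names for the two combined instructions and the newly created one; illustration only, not the patch's exact code):

FastMathFlags Op1FMF = Op1->getFastMathFlags();
FastMathFlags Op2FMF = Op2->getFastMathFlags();
// Rewrite flags (reassoc, arcp, contract, afn) must be present on both
// inputs to survive; value flags (nnan, ninf, nsz) may be taken from either.
FastMathFlags NewFMF = FastMathFlags::intersectRewrite(Op1FMF, Op2FMF) |
                       FastMathFlags::unionValue(Op1FMF, Op2FMF);
NewI->setFastMathFlags(NewFMF);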
In the context of the arcp flag, just trying to clarify the last sentence of the definition from the langref:

arcp: Allows division to be treated as a multiplication by a reciprocal. Specifically, this permits a / b to be considered equivalent to a * (1.0 / b) (which may subsequently be susceptible to code motion), and it also permits a / (b / c) to be considered equivalent to a * (c / b). Both of these rewrites can be applied in either direction: a * (c / b) can be rewritten into a / (b / c).

This means that arcp can be used to switch back and forth.
For the concerned example, if the transformed operation is
[[DIV:%.*]] = fmul reassoc ninf arcp double [[TMP0]], [[SQRT1]]  -->  div = (1/a) * sqrt(a)
then I should have the arcp flag on this operation because, as per the last sentence of the arcp definition, I should be able to rewrite it as:
div = sqrt(a)/a
If we remove the arcp flag, we won't be able to do so. Am I misinterpreting anything?
sorry, ignore above comment. Got confused.
FastMathFlags NewFMF = FastMathFlags::intersectRewrite(Op1FMF, Op2FMF) |
FastMathFlags::unionValue(Op1FMF, Op2FMF);
Do you mean to say this:
FastMathFlags NewFMF = FastMathFlags::intersectRewrite(Op1 - rewrite-flags, Op2 - rewrite-flags) |
FastMathFlags::unionValue(Op1 - Nonrewrite-flags, Op2 - Nonrewrite-flags);
I think it should just be
FastMathFlags NewFMF = FastMathFlags::intersectRewrite(Op1FMF, Op2FMF)
since this is a multiplication operation, right?
I'm not clear why you're suggesting that this being a multiplication operation means that the value flags should be dropped. The intersectRewrite and unionValue functions mask off the flags that aren't rewrite or value flags respectively, so you don't need to explicitly remove them as in your question above.
Consider the generic multiply case here:
Op3 = Op1 * Op2
where,
Op1 = ninf ...
Op2 = ...
So,
unionValue(Op1FMF, Op2FMF) = ninf
which is not correct, right? I.e., Op3 can't guarantee ninf.
But if you are assuming for this particular transformation that this won't happen because of the restrictions on the values of a, then it's fine.
I see what you are saying, and I do think I was probably oversimplifying this in my mind, but I think it works out as I suggested. Because the incoming values of x, r1, and r2 are being mutually transformed to create the outgoing x and r1 values, the rewrite fast-math flags must be intersected, but the value flags require further analysis. You are checking for the ninf flag on the incoming x value (1/sqrt(a)). I think we can conclude that since 1/sqrt(a) does not have infinite inputs or result then 1/a doesn't either, so the ninf flag can be applied to the outgoing r1 value. The nnan flag can be transferred by similar logic. You've also checked for the nnan and ninf flags on the incoming sqrt(a) instruction, so r2 after the transformation must meet the conditions for nnan and ninf as well.
Since outgoing r1 and r2 both meet the conditions for nnan and ninf, then the outgoing x value can also have these flags applied. So the nnan and ninf flags can be set using the unionValue() function as I suggested. I don't think the reasoning is quite as clear for nsz, since it means that the sign can be ignored and not that the -0.0 will not be seen, but I don't think it matters for this transformation since the existing checks would make a must be positive and any zero appearing would thus be positive zero.
The proposed patch, in general, tries to transform the below code sequence:

x = 1.0 / sqrt (a);
r1 = x * x;        // same as 1.0 / a
r2 = a / sqrt(a);  // same as sqrt (a)

TO (If x, r1 and r2 are all used further in the code)

tmp1 = 1.0 / a
tmp2 = sqrt (a)
tmp3 = tmp1 * tmp2
x = tmp3
r1 = tmp1
r2 = tmp2

The transform tries to make high latency sqrt and div operations independent and also saves on one multiplication.

The patch was tested with SPEC17 suite with cpu=neoverse-v2. The performance uplift achieved was: 544.nab_r ~4%.

No other regressions were observed. Also, no compile time differences were observed with the patch.

Closes llvm#54652
I think this looks good now. Thanks for your perseverance.
Thanks @andykaylor for diligently looking into this and helping me bring it to the approval stage :)
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/190/builds/12997 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/55/builds/5688 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/81/builds/3742 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/85/builds/4445 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/24/builds/4294 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/169/builds/7477 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/23/builds/6711 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/52/builds/5306 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/25/builds/5712 Here is the relevant piece of the build log for the reference
…QRT/FDIV into FMUL" (#123289) Reverts llvm/llvm-project#87474
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/164/builds/6527 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/94/builds/3830 Here is the relevant piece of the build log for the reference
The proposed patch, in general, tries to transform the below code sequence:
x = 1.0 / sqrt (a);
r1 = x * x; // same as 1.0 / a
r2 = a / sqrt(a); // same as sqrt (a)
TO
(If x, r1 and r2 are all used further in the code)
r1 = 1.0 / a
r2 = sqrt (a)
x = r1 * r2
The transform tries to make high latency sqrt and div operations independent and also saves on one multiplication.
The patch was tested with SPEC17 suite with cpu=neoverse-v2. The performance uplift achieved was:
544.nab_r ~4%
No other regressions were observed. Also, no compile time differences were observed with the patch.
Closes #54652
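As a stand-alone sanity check of the algebra behind the rewrite (plain C++, not part of the patch; variable names here simply mirror the description), the two sequences can be compared numerically. The results may differ in the last bits, which is one reason the transform is gated on fast-math flags.

#include <cmath>
#include <cstdio>

int main() {
  const double a = 2.0;

  // Original sequence: the div depends on the sqrt, and r1/r2 depend on x.
  const double x  = 1.0 / std::sqrt(a);
  const double r1 = x * x;  // algebraically 1.0 / a
  const double r2 = a * x;  // algebraically sqrt(a)

  // Transformed sequence: the div and the sqrt are independent operations.
  const double t1 = 1.0 / a;
  const double t2 = std::sqrt(a);
  const double tx = t1 * t2;

  std::printf("x : %.17g vs %.17g\n", x, tx);
  std::printf("r1: %.17g vs %.17g\n", r1, t1);
  std::printf("r2: %.17g vs %.17g\n", r2, t2);
  return 0;
}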