
[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize #67657

Merged: 1 commit into llvm:main on May 16, 2024

Conversation

@nikic (Contributor) commented Sep 28, 2023

The znver3/znver4 scheduler models are outliers, specifying very large LoopMicroOpBufferSizes of 512, while typical values for other subtargets are on the order of ~50. Even if this information is micro-architecturally correct (*), this does not mean that we want to runtime unroll all loops to a size that completely fills the loop buffer. Unless this is the single hot loop in the entire application, the massive code size increase will bust the micro-op and instruction caches.

Protect against this by clamping to the default PartialThreshold of 150, which is the same as the default full-unroll threshold and half the aggressive full-unroll threshold. Allowing more partial unrolling than full unrolling is certainly nonsensical.

(*) I strongly doubt that this is actually correct -- I believe this may derive from an incorrect reading of Agner Fog's micro-architecture guide. The number 4096 that was originally used here is the size of the general micro-op cache, not that of a loop buffer. A separate loop buffer is not listed for the Zen microarchitecture. Comparing this to the listing for Skylake, it has a 1536 micro-op buffer, but only a 64 micro-op loopback buffer, with a note that it's rarely fully utilized. Our scheduling model specifies LoopMicroOpBufferSize of 50 in that case.
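To put these numbers in perspective, here is a purely illustrative, standalone C++ calculation (not LLVM API; the per-iteration op count is a made-up value) of how the clamp changes the runtime-unroll budget:

```
// Illustrative arithmetic only: the old budget was the full
// LoopMicroOpBufferSize (512 on znver3/znver4), the new budget is clamped to
// the default PartialThreshold (150). The unroller then typically rounds the
// runtime unroll count down to a power of two, which is consistent with the
// 128 -> 32 iteration change visible in the znver3.ll test update below.
#include <algorithm>
#include <cstdio>

int main() {
  unsigned LoopMicroOpBufferSize = 512;  // znver3/znver4 scheduler model value
  unsigned DefaultPartialThreshold = 150;
  unsigned OpsPerIteration = 4;          // hypothetical size of one loop body

  unsigned OldBudget = LoopMicroOpBufferSize;
  unsigned NewBudget = std::min(LoopMicroOpBufferSize, DefaultPartialThreshold);

  std::printf("old: %u ops -> up to %u iterations\n", OldBudget,
              OldBudget / OpsPerIteration);   // 512 ops -> 128 iterations
  std::printf("new: %u ops -> up to %u iterations\n", NewBudget,
              NewBudget / OpsPerIteration);   // 150 ops -> 37 (rounded down to 32)
  return 0;
}
```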

@llvmbot (Member) commented Sep 28, 2023

@llvm/pr-subscribers-llvm-transforms

Changes



Patch is 65.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67657.diff

2 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+7-1)
  • (modified) llvm/test/Transforms/LoopUnroll/X86/znver3.ll (+23-741)
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c11d558a73e9d09..91daeee19acad34 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -575,7 +575,13 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     if (PartialUnrollingThreshold.getNumOccurrences() > 0)
       MaxOps = PartialUnrollingThreshold;
     else if (ST->getSchedModel().LoopMicroOpBufferSize > 0)
-      MaxOps = ST->getSchedModel().LoopMicroOpBufferSize;
+      // Upper bound by the default PartialThreshold, which is the same as
+      // the default full-unroll Threshold. Even if the loop micro-op buffer
+      // is very large, this does not mean that we want to unroll all loops
+      // to that length, as it would increase code size beyond the limits of
+      // what unrolling normally allows.
+      MaxOps = std::min(ST->getSchedModel().LoopMicroOpBufferSize,
+                        UP.PartialThreshold);
     else
       return;
 
diff --git a/llvm/test/Transforms/LoopUnroll/X86/znver3.ll b/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
index 30389062a096788..467c57906d88827 100644
--- a/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
+++ b/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
@@ -9,8 +9,8 @@ define i32 @test(ptr %ary) "target-cpu"="znver3" {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
-; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT_127:%.*]], [[FOR_BODY]] ]
-; CHECK-NEXT:    [[SUM:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[SUM_NEXT_127:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT_31:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[SUM:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[SUM_NEXT_31:%.*]], [[FOR_BODY]] ]
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[SUM_NEXT:%.*]] = add nsw i32 [[VAL]], [[SUM]]
@@ -137,396 +137,12 @@ define i32 @test(ptr %ary) "target-cpu"="znver3" {
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT_30:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 31
 ; CHECK-NEXT:    [[ARRAYIDX_31:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_30]]
 ; CHECK-NEXT:    [[VAL_31:%.*]] = load i32, ptr [[ARRAYIDX_31]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_31:%.*]] = add nsw i32 [[VAL_31]], [[SUM_NEXT_30]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_31:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 32
-; CHECK-NEXT:    [[ARRAYIDX_32:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_31]]
-; CHECK-NEXT:    [[VAL_32:%.*]] = load i32, ptr [[ARRAYIDX_32]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_32:%.*]] = add nsw i32 [[VAL_32]], [[SUM_NEXT_31]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_32:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 33
-; CHECK-NEXT:    [[ARRAYIDX_33:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_32]]
-; CHECK-NEXT:    [[VAL_33:%.*]] = load i32, ptr [[ARRAYIDX_33]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_33:%.*]] = add nsw i32 [[VAL_33]], [[SUM_NEXT_32]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_33:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 34
-; CHECK-NEXT:    [[ARRAYIDX_34:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_33]]
-; CHECK-NEXT:    [[VAL_34:%.*]] = load i32, ptr [[ARRAYIDX_34]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_34:%.*]] = add nsw i32 [[VAL_34]], [[SUM_NEXT_33]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_34:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 35
-; CHECK-NEXT:    [[ARRAYIDX_35:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_34]]
-; CHECK-NEXT:    [[VAL_35:%.*]] = load i32, ptr [[ARRAYIDX_35]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_35:%.*]] = add nsw i32 [[VAL_35]], [[SUM_NEXT_34]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_35:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 36
-; CHECK-NEXT:    [[ARRAYIDX_36:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_35]]
-; CHECK-NEXT:    [[VAL_36:%.*]] = load i32, ptr [[ARRAYIDX_36]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_36:%.*]] = add nsw i32 [[VAL_36]], [[SUM_NEXT_35]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_36:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 37
-; CHECK-NEXT:    [[ARRAYIDX_37:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_36]]
-; CHECK-NEXT:    [[VAL_37:%.*]] = load i32, ptr [[ARRAYIDX_37]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_37:%.*]] = add nsw i32 [[VAL_37]], [[SUM_NEXT_36]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_37:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 38
-; CHECK-NEXT:    [[ARRAYIDX_38:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_37]]
-; CHECK-NEXT:    [[VAL_38:%.*]] = load i32, ptr [[ARRAYIDX_38]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_38:%.*]] = add nsw i32 [[VAL_38]], [[SUM_NEXT_37]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_38:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 39
-; CHECK-NEXT:    [[ARRAYIDX_39:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_38]]
-; CHECK-NEXT:    [[VAL_39:%.*]] = load i32, ptr [[ARRAYIDX_39]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_39:%.*]] = add nsw i32 [[VAL_39]], [[SUM_NEXT_38]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_39:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 40
-; CHECK-NEXT:    [[ARRAYIDX_40:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_39]]
-; CHECK-NEXT:    [[VAL_40:%.*]] = load i32, ptr [[ARRAYIDX_40]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_40:%.*]] = add nsw i32 [[VAL_40]], [[SUM_NEXT_39]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_40:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 41
-; CHECK-NEXT:    [[ARRAYIDX_41:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_40]]
-; CHECK-NEXT:    [[VAL_41:%.*]] = load i32, ptr [[ARRAYIDX_41]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_41:%.*]] = add nsw i32 [[VAL_41]], [[SUM_NEXT_40]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_41:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 42
-; CHECK-NEXT:    [[ARRAYIDX_42:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_41]]
-; CHECK-NEXT:    [[VAL_42:%.*]] = load i32, ptr [[ARRAYIDX_42]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_42:%.*]] = add nsw i32 [[VAL_42]], [[SUM_NEXT_41]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_42:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 43
-; CHECK-NEXT:    [[ARRAYIDX_43:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_42]]
-; CHECK-NEXT:    [[VAL_43:%.*]] = load i32, ptr [[ARRAYIDX_43]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_43:%.*]] = add nsw i32 [[VAL_43]], [[SUM_NEXT_42]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_43:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 44
-; CHECK-NEXT:    [[ARRAYIDX_44:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_43]]
-; CHECK-NEXT:    [[VAL_44:%.*]] = load i32, ptr [[ARRAYIDX_44]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_44:%.*]] = add nsw i32 [[VAL_44]], [[SUM_NEXT_43]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_44:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 45
-; CHECK-NEXT:    [[ARRAYIDX_45:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_44]]
-; CHECK-NEXT:    [[VAL_45:%.*]] = load i32, ptr [[ARRAYIDX_45]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_45:%.*]] = add nsw i32 [[VAL_45]], [[SUM_NEXT_44]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_45:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 46
-; CHECK-NEXT:    [[ARRAYIDX_46:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_45]]
-; CHECK-NEXT:    [[VAL_46:%.*]] = load i32, ptr [[ARRAYIDX_46]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_46:%.*]] = add nsw i32 [[VAL_46]], [[SUM_NEXT_45]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_46:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 47
-; CHECK-NEXT:    [[ARRAYIDX_47:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_46]]
-; CHECK-NEXT:    [[VAL_47:%.*]] = load i32, ptr [[ARRAYIDX_47]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_47:%.*]] = add nsw i32 [[VAL_47]], [[SUM_NEXT_46]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_47:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 48
-; CHECK-NEXT:    [[ARRAYIDX_48:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_47]]
-; CHECK-NEXT:    [[VAL_48:%.*]] = load i32, ptr [[ARRAYIDX_48]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_48:%.*]] = add nsw i32 [[VAL_48]], [[SUM_NEXT_47]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_48:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 49
-; CHECK-NEXT:    [[ARRAYIDX_49:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_48]]
-; CHECK-NEXT:    [[VAL_49:%.*]] = load i32, ptr [[ARRAYIDX_49]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_49:%.*]] = add nsw i32 [[VAL_49]], [[SUM_NEXT_48]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_49:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 50
-; CHECK-NEXT:    [[ARRAYIDX_50:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_49]]
-; CHECK-NEXT:    [[VAL_50:%.*]] = load i32, ptr [[ARRAYIDX_50]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_50:%.*]] = add nsw i32 [[VAL_50]], [[SUM_NEXT_49]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_50:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 51
-; CHECK-NEXT:    [[ARRAYIDX_51:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_50]]
-; CHECK-NEXT:    [[VAL_51:%.*]] = load i32, ptr [[ARRAYIDX_51]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_51:%.*]] = add nsw i32 [[VAL_51]], [[SUM_NEXT_50]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_51:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 52
-; CHECK-NEXT:    [[ARRAYIDX_52:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_51]]
-; CHECK-NEXT:    [[VAL_52:%.*]] = load i32, ptr [[ARRAYIDX_52]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_52:%.*]] = add nsw i32 [[VAL_52]], [[SUM_NEXT_51]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_52:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 53
-; CHECK-NEXT:    [[ARRAYIDX_53:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_52]]
-; CHECK-NEXT:    [[VAL_53:%.*]] = load i32, ptr [[ARRAYIDX_53]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_53:%.*]] = add nsw i32 [[VAL_53]], [[SUM_NEXT_52]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_53:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 54
-; CHECK-NEXT:    [[ARRAYIDX_54:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_53]]
-; CHECK-NEXT:    [[VAL_54:%.*]] = load i32, ptr [[ARRAYIDX_54]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_54:%.*]] = add nsw i32 [[VAL_54]], [[SUM_NEXT_53]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_54:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 55
-; CHECK-NEXT:    [[ARRAYIDX_55:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_54]]
-; CHECK-NEXT:    [[VAL_55:%.*]] = load i32, ptr [[ARRAYIDX_55]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_55:%.*]] = add nsw i32 [[VAL_55]], [[SUM_NEXT_54]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_55:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 56
-; CHECK-NEXT:    [[ARRAYIDX_56:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_55]]
-; CHECK-NEXT:    [[VAL_56:%.*]] = load i32, ptr [[ARRAYIDX_56]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_56:%.*]] = add nsw i32 [[VAL_56]], [[SUM_NEXT_55]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_56:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 57
-; CHECK-NEXT:    [[ARRAYIDX_57:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_56]]
-; CHECK-NEXT:    [[VAL_57:%.*]] = load i32, ptr [[ARRAYIDX_57]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_57:%.*]] = add nsw i32 [[VAL_57]], [[SUM_NEXT_56]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_57:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 58
-; CHECK-NEXT:    [[ARRAYIDX_58:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_57]]
-; CHECK-NEXT:    [[VAL_58:%.*]] = load i32, ptr [[ARRAYIDX_58]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_58:%.*]] = add nsw i32 [[VAL_58]], [[SUM_NEXT_57]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_58:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 59
-; CHECK-NEXT:    [[ARRAYIDX_59:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_58]]
-; CHECK-NEXT:    [[VAL_59:%.*]] = load i32, ptr [[ARRAYIDX_59]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_59:%.*]] = add nsw i32 [[VAL_59]], [[SUM_NEXT_58]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_59:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 60
-; CHECK-NEXT:    [[ARRAYIDX_60:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_59]]
-; CHECK-NEXT:    [[VAL_60:%.*]] = load i32, ptr [[ARRAYIDX_60]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_60:%.*]] = add nsw i32 [[VAL_60]], [[SUM_NEXT_59]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_60:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 61
-; CHECK-NEXT:    [[ARRAYIDX_61:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_60]]
-; CHECK-NEXT:    [[VAL_61:%.*]] = load i32, ptr [[ARRAYIDX_61]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_61:%.*]] = add nsw i32 [[VAL_61]], [[SUM_NEXT_60]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_61:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 62
-; CHECK-NEXT:    [[ARRAYIDX_62:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_61]]
-; CHECK-NEXT:    [[VAL_62:%.*]] = load i32, ptr [[ARRAYIDX_62]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_62:%.*]] = add nsw i32 [[VAL_62]], [[SUM_NEXT_61]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_62:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 63
-; CHECK-NEXT:    [[ARRAYIDX_63:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_62]]
-; CHECK-NEXT:    [[VAL_63:%.*]] = load i32, ptr [[ARRAYIDX_63]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_63:%.*]] = add nsw i32 [[VAL_63]], [[SUM_NEXT_62]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_63:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 64
-; CHECK-NEXT:    [[ARRAYIDX_64:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_63]]
-; CHECK-NEXT:    [[VAL_64:%.*]] = load i32, ptr [[ARRAYIDX_64]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_64:%.*]] = add nsw i32 [[VAL_64]], [[SUM_NEXT_63]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_64:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 65
-; CHECK-NEXT:    [[ARRAYIDX_65:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_64]]
-; CHECK-NEXT:    [[VAL_65:%.*]] = load i32, ptr [[ARRAYIDX_65]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_65:%.*]] = add nsw i32 [[VAL_65]], [[SUM_NEXT_64]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_65:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 66
-; CHECK-NEXT:    [[ARRAYIDX_66:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_65]]
-; CHECK-NEXT:    [[VAL_66:%.*]] = load i32, ptr [[ARRAYIDX_66]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_66:%.*]] = add nsw i32 [[VAL_66]], [[SUM_NEXT_65]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_66:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 67
-; CHECK-NEXT:    [[ARRAYIDX_67:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_66]]
-; CHECK-NEXT:    [[VAL_67:%.*]] = load i32, ptr [[ARRAYIDX_67]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_67:%.*]] = add nsw i32 [[VAL_67]], [[SUM_NEXT_66]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_67:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 68
-; CHECK-NEXT:    [[ARRAYIDX_68:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_67]]
-; CHECK-NEXT:    [[VAL_68:%.*]] = load i32, ptr [[ARRAYIDX_68]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_68:%.*]] = add nsw i32 [[VAL_68]], [[SUM_NEXT_67]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_68:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 69
-; CHECK-NEXT:    [[ARRAYIDX_69:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_68]]
-; CHECK-NEXT:    [[VAL_69:%.*]] = load i32, ptr [[ARRAYIDX_69]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_69:%.*]] = add nsw i32 [[VAL_69]], [[SUM_NEXT_68]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_69:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 70
-; CHECK-NEXT:    [[ARRAYIDX_70:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_69]]
-; CHECK-NEXT:    [[VAL_70:%.*]] = load i32, ptr [[ARRAYIDX_70]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_70:%.*]] = add nsw i32 [[VAL_70]], [[SUM_NEXT_69]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_70:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 71
-; CHECK-NEXT:    [[ARRAYIDX_71:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_70]]
-; CHECK-NEXT:    [[VAL_71:%.*]] = load i32, ptr [[ARRAYIDX_71]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_71:%.*]] = add nsw i32 [[VAL_71]], [[SUM_NEXT_70]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_71:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 72
-; CHECK-NEXT:    [[ARRAYIDX_72:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_71]]
-; CHECK-NEXT:    [[VAL_72:%.*]] = load i32, ptr [[ARRAYIDX_72]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_72:%.*]] = add nsw i32 [[VAL_72]], [[SUM_NEXT_71]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_72:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 73
-; CHECK-NEXT:    [[ARRAYIDX_73:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_72]]
-; CHECK-NEXT:    [[VAL_73:%.*]] = load i32, ptr [[ARRAYIDX_73]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_73:%.*]] = add nsw i32 [[VAL_73]], [[SUM_NEXT_72]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_73:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 74
-; CHECK-NEXT:    [[ARRAYIDX_74:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_73]]
-; CHECK-NEXT:    [[VAL_74:%.*]] = load i32, ptr [[ARRAYIDX_74]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_74:%.*]] = add nsw i32 [[VAL_74]], [[SUM_NEXT_73]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_74:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 75
-; CHECK-NEXT:    [[ARRAYIDX_75:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_74]]
-; CHECK-NEXT:    [[VAL_75:%.*]] = load i32, ptr [[ARRAYIDX_75]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_75:%.*]] = add nsw i32 [[VAL_75]], [[SUM_NEXT_74]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_75:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 76
-; CHECK-NEXT:    [[ARRAYIDX_76:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_75]]
-; CHECK-NEXT:    [[VAL_76:%.*]] = load i32, ptr [[ARRAYIDX_76]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_76:%.*]] = add nsw i32 [[VAL_76]], [[SUM_NEXT_75]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_76:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 77
-; CHECK-NEXT:    [[ARRAYIDX_77:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_76]]
-; CHECK-NEXT:    [[VAL_77:%.*]] = load i32, ptr [[ARRAYIDX_77]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_77:%.*]] = add nsw i32 [[VAL_77]], [[SUM_NEXT_76]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_77:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 78
-; CHECK-NEXT:    [[ARRAYIDX_78:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_77]]
-; CHECK-NEXT:    [[VAL_78:%.*]] = load i32, ptr [[ARRAYIDX_78]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_78:%.*]] = add nsw i32 [[VAL_78]], [[SUM_NEXT_77]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_78:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 79
-; CHECK-NEXT:    [[ARRAYIDX_79:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_78]]
-; CHECK-NEXT:    [[VAL_79:%.*]] = load i32, ptr [[ARRAYIDX_79]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_79:%.*]] = add nsw i32 [[VAL_79]], [[SUM_NEXT_78]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_79:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 80
-; CHECK-NEXT:    [[ARRAYIDX_80:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_79]]
-; CHECK-NEXT:    [[VAL_80:%.*]] = load i32, ptr [[ARRAYIDX_80]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_80:%.*]] = add nsw i32 [[VAL_80]], [[SUM_NEXT_79]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_80:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 81
-; CHECK-NEXT:    [[ARRAYIDX_81:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_80]]
-; CHECK-NEXT:    [[VAL_81:%.*]] = load i32, ptr [[ARRAYIDX_81]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_81:%.*]] = add nsw i32 [[VAL_81]], [[SUM_NEXT_80]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_81:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 82
-; CHECK-NEXT:    [[ARRAYIDX_82:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_81]]
-; CHECK-NEXT:    [[VAL_82:%.*]] = load i32, ptr [[ARRAYIDX_82]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_82:%.*]] = add nsw i32 [[VAL_82]], [[SUM_NEXT_81]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_82:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 83
-; CHECK-NEXT:    [[ARRAYIDX_83:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_82]]
-; CHECK-NEXT:    [[VAL_83:%.*]] = load i32, ptr [[ARRAYIDX_83]], align 4
-; CHECK-NEXT:    [[SUM_NEXT_83:%.*]] = add nsw i32 [[VAL_83]], [[SUM_NEXT_82]]
-; CHECK-NEXT:    [[INDVARS_IV_NEXT_83:%.*]] = add nuw nsw i6...
[truncated]

@RKSimon added the backend:X86 Scheduler Models label (Accuracy of X86 scheduler models) on Sep 28, 2023
@RKSimon (Collaborator) commented Oct 4, 2023

I've always struggled with the idea of LoopMicroOpBufferSize in general. It appears to be used just as a way of throttling unrolls, to avoid the potential cost of an unnecessary compare+branch. In which case, shouldn't we be trying to better estimate the cost of the loop control flow, based on branch distance etc.? That way we avoid estimation issues for CPUs without a dedicated loopback buffer, etc.

@nikic (Contributor, Author) commented Oct 4, 2023

> I've always struggled with the idea of LoopMicroOpBufferSize in general. It appears to be used just as a way of throttling unrolls, to avoid the potential cost of an unnecessary compare+branch.

I believe the primary purpose of LoopMicroOpBufferSize here is to make sure that we don't runtime-unroll past the loop buffer size: Doing so could mean that a (non-unrolled) loop that previously used the loop buffer may no longer do so after unrolling, which would be a clear pessimization.

So I think it should primarily serve as an upper bound on runtime unrolling. However, in practice we take it as the desired target, i.e. try to unroll all loops that we can to the loop buffer size. I'm not sure this part really makes sense, at least for the kind of wide out-of-order cores we're talking about here (see also #42332 (comment)).

Possibly the more correct fix here would be to set LoopMicroOpBufferSize=0 for these subtargets and disable runtime unrolling entirely. My current proposed fix is just a very conservative starting point.

> In which case, shouldn't we be trying to better estimate the cost of the loop control flow, based on branch distance etc.? That way we avoid estimation issues for CPUs without a dedicated loopback buffer, etc.

Can you explain in more detail how we would estimate the cost/benefit of runtime unrolling? It's not really clear what we can do here beyond the fact that the compare and branch go away, as we generally don't expect runtime unrolling to result in any simplification beyond that.
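For reference, a condensed sketch of the budget selection discussed above, paraphrasing the BasicTTIImpl.h diff earlier on this page (simplified names; the real code honors the PartialUnrollingThreshold command-line override via getNumOccurrences()):

```
// Sketch of how BasicTTIImplBase::getUnrollingPreferences picks the partial
// unrolling budget after this patch; only the three cases discussed here.
#include <algorithm>
#include <optional>

unsigned pickMaxOps(std::optional<unsigned> CmdLineOverride, // PartialUnrollingThreshold, if given
                    unsigned LoopMicroOpBufferSize,          // from the scheduler model
                    unsigned DefaultPartialThreshold = 150) {
  if (CmdLineOverride)
    return *CmdLineOverride;  // an explicit user override always wins
  if (LoopMicroOpBufferSize > 0)
    // The clamp added by this patch: even a very large loop buffer must not
    // push partial unrolling past the default PartialThreshold.
    return std::min(LoopMicroOpBufferSize, DefaultPartialThreshold);
  // LoopMicroOpBufferSize == 0: the heuristic enables no partial unrolling,
  // i.e. the "disable runtime unrolling entirely" option mentioned above.
  return 0;
}

int main() {
  // znver3-style model, no override: 512 is clamped down to 150.
  return pickMaxOps(std::nullopt, 512) == 150 ? 0 : 1;
}
```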

@goldsteinn (Contributor) commented Oct 4, 2023

> I've always struggled with the idea of LoopMicroOpBufferSize in general. It appears to be used just as a way of throttling unrolls, to avoid the potential cost of an unnecessary compare+branch.

> I believe the primary purpose of LoopMicroOpBufferSize here is to make sure that we don't runtime-unroll past the loop buffer size: Doing so could mean that a (non-unrolled) loop that previously used the loop buffer may no longer do so after unrolling, which would be a clear pessimization.

Not that this patch should address it, but on Intel x86, if the loop has a highly predictable iteration count in the ~[24, 150] range (24 = the iteration count at which the loop starts using the LSD, 150 = max depth of the branch predictor), using the loop micro-op buffer can be a pessimization. This is because the LSD exits via a branch miss, so a loop that would be perfectly predicted running from the DSB gets a branch miss added to it.

@nikic force-pushed the clamp-partial-threshold branch from 3e0d138 to 5c7fbce on May 6, 2024 at 06:32
@nikic (Contributor, Author) commented May 6, 2024

Any suggestions on how to move forward with this?

@RKSimon (Collaborator) left a comment


I think this is a good generic limit to put in place so the numbers are compatible with what the loop transforms expect. I'm still hoping we can come up with some solution on #91340 for znver3/4 specifically, but that's being hindered by a lack of test hardware.

@nikic merged commit f0b3654 into llvm:main on May 16, 2024
4 checks passed
@nikic deleted the clamp-partial-threshold branch on May 16, 2024 at 01:21
gribozavr added a commit that referenced this pull request on May 16, 2024: Revert "[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (#67657)"

This reverts commit f0b3654.

This commit triggers UB by reading an uninitialized variable.

`UP.PartialThreshold` is used uninitialized in `getUnrollingPreferences()` when
it is called from `LoopVectorizationPlanner::executePlan()`. In this case the
`UP` variable is created on the stack and its fields are not initialized.

```
==8802==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x557c0b081b99 in llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h
    #1 0x557c0b07a40c in llvm::TargetTransformInfo::Model<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2277:17
    #2 0x557c0f5d69ee in llvm::TargetTransformInfo::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) const llvm-project/llvm/lib/Analysis/TargetTransformInfo.cpp:387:19
    #3 0x557c0e6b96a0 in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7624:7
    #4 0x557c0e6e4b63 in llvm::LoopVectorizePass::processLoop(llvm::Loop*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10253:13
    #5 0x557c0e6f2429 in llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo*, llvm::TargetLibraryInfo*, llvm::DemandedBits&, llvm::AssumptionCache&, llvm::LoopAccessInfoManager&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10344:30
    #6 0x557c0e6f2f97 in llvm::LoopVectorizePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10383:9

[...]

  Uninitialized value was created by an allocation of 'UP' in the stack frame
    #0 0x557c0e6b961e in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7623:3
```
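For readers less familiar with this failure mode, below is a minimal, self-contained sketch of the pattern MSan is flagging (assumed, simplified names; deliberately not the real LLVM types): a callee now reads a struct field that one caller never initialized.

```
// Minimal illustration of the use-of-uninitialized-value described above: the
// patch made getUnrollingPreferences() read UP.PartialThreshold, but the
// executePlan() call path passes a stack-allocated preferences struct whose
// fields were never set. Compile with clang -fsanitize=memory for an
// analogous MSan report.
#include <algorithm>
#include <cstdio>

struct UnrollingPrefs {
  unsigned PartialThreshold;  // no default value; left uninitialized below
};

void getUnrollingPreferences(UnrollingPrefs &UP, unsigned LoopMicroOpBufferSize) {
  // Reads UP.PartialThreshold before any caller initialized it -> UB.
  unsigned MaxOps = std::min(LoopMicroOpBufferSize, UP.PartialThreshold);
  std::printf("MaxOps = %u\n", MaxOps);
}

int main() {
  UnrollingPrefs UP;                 // on the stack, fields uninitialized
  getUnrollingPreferences(UP, 512);  // analogous to the executePlan() call
  return 0;
}
```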
@gribozavr (Collaborator) commented:

@nikic Sorry, I reverted this PR in 83974a4 due to UB detected by MSan. See the revert description for details.

Labels: backend:X86 Scheduler Models (Accuracy of X86 scheduler models), llvm:transforms