[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A510 and A520 #132246
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-aarch64

Author: Nashe Mncube (nasherm)

Changes: The default MaxInterleaveFactor for AArch64 targets is 2. This produces inefficient codegen on at least two in-order cores, Cortex-A510 and Cortex-A520. For example, a simple vector add vectorizes its inner loop into an interleaved sequence of instructions, while reducing MaxInterleaveFactor to 1 produces a tighter, non-interleaved loop.
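The loop being discussed can be reconstructed from the commit message later in this thread (the pointer asterisks in the original were eaten by markdown formatting, so their restoration here is an assumption, though `dst[i] = a[i] + b[i]` only compiles with pointer parameters):

```cpp
// Reconstruction of the example loop from the commit message below.
// The loop vectorizer's interleaving decision applies to this inner loop.
void foo(float *a, float *b, float *dst, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    dst[i] = a[i] + b[i];
}
```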
This patch also introduces a test. Patch is 30.69 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/132246.diff. 2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index bb36af8fce5cc..57ae4dfb71c36 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -181,6 +181,7 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
VScaleForTuning = 1;
PrefLoopAlignment = Align(16);
MaxBytesForLoopAlignment = 8;
+ MaxInterleaveFactor = 1;
break;
case CortexA710:
case CortexA715:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll
new file mode 100644
index 0000000000000..a3bf37726943f
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll
@@ -0,0 +1,360 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA510-NOINTERLEAVE
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve -passes=loop-vectorize -force-target-max-vector-interleave=2 -S | FileCheck %s --check-prefix=CHECK-CA510-INTERLEAVE
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a520 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA520-NOINTERLEAVE
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a520 -mattr=+sve -passes=loop-vectorize -force-target-max-vector-interleave=2 -S | FileCheck %s --check-prefix=CHECK-CA520-INTERLEAVE
+
+define void @sve_add(ptr %dst, ptr %a, ptr %b, i64 %n) {
+; CHECK-CA510-NOINTERLEAVE-LABEL: define void @sve_add(
+; CHECK-CA510-NOINTERLEAVE-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ENTRY:.*:]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 12, i64 [[TMP1]])
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[VECTOR_MEMCHECK]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP8:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[VECTOR_PH]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP10]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[VECTOR_BODY]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP14:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[TMP13]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP15:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i32 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP15]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP16:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[TMP13]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP17:%.*]] = getelementptr inbounds nuw float, ptr [[TMP16]], i32 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 4 x float>, ptr [[TMP17]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP18:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP19:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[TMP13]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP20:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i32 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP18]], ptr [[TMP20]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA510-NOINTERLEAVE: [[MIDDLE_BLOCK]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA510-NOINTERLEAVE: [[SCALAR_PH]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_BODY]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDVARS_IV]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP22:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP23:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ADD:%.*]] = fadd fast float [[TMP23]], [[TMP22]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[INDVARS_IV]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: store float [[ADD]], ptr [[ARRAYIDX4]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: br label %[[FOR_COND_CLEANUP]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_COND_CLEANUP]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: ret void
+;
+; CHECK-CA510-INTERLEAVE-LABEL: define void @sve_add(
+; CHECK-CA510-INTERLEAVE-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA510-INTERLEAVE-NEXT: [[ENTRY:.*:]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA510-INTERLEAVE-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA510-INTERLEAVE-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA510-INTERLEAVE-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA510-INTERLEAVE: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 12, i64 [[TMP1]])
+; CHECK-CA510-INTERLEAVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA510-INTERLEAVE: [[VECTOR_MEMCHECK]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP8:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA510-INTERLEAVE: [[VECTOR_PH]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP10]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA510-INTERLEAVE: [[VECTOR_BODY]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP14:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[TMP13]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP15:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i32 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i64 [[TMP17]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP15]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 4 x float>, ptr [[TMP18]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP19:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[TMP13]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP20:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i32 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP23:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i64 [[TMP22]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD6:%.*]] = load <vscale x 4 x float>, ptr [[TMP20]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD7:%.*]] = load <vscale x 4 x float>, ptr [[TMP23]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP24:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD6]], [[WIDE_LOAD]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP25:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD7]], [[WIDE_LOAD5]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP26:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[TMP13]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP27:%.*]] = getelementptr inbounds nuw float, ptr [[TMP26]], i32 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP28:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP29:%.*]] = mul i64 [[TMP28]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP30:%.*]] = getelementptr inbounds nuw float, ptr [[TMP26]], i64 [[TMP29]]
+; CHECK-CA510-INTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP24]], ptr [[TMP27]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP25]], ptr [[TMP30]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP31:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[TMP31]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA510-INTERLEAVE: [[MIDDLE_BLOCK]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA510-INTERLEAVE: [[SCALAR_PH]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA510-INTERLEAVE-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA510-INTERLEAVE: [[FOR_BODY]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA510-INTERLEAVE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDVARS_IV]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP32:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP33:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[ADD:%.*]] = fadd fast float [[TMP33]], [[TMP32]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[INDVARS_IV]]
+; CHECK-CA510-INTERLEAVE-NEXT: store float [[ADD]], ptr [[ARRAYIDX4]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-CA510-INTERLEAVE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-CA510-INTERLEAVE: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; CHECK-CA510-INTERLEAVE-NEXT: br label %[[FOR_COND_CLEANUP]]
+; CHECK-CA510-INTERLEAVE: [[FOR_COND_CLEANUP]]:
+; CHECK-CA510-INTERLEAVE-NEXT: ret void
+;
+; CHECK-CA520-NOINTERLEAVE-LABEL: define void @sve_add(
+; CHECK-CA520-NOINTERLEAVE-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[ENTRY:.*:]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 12, i64 [[TMP1]])
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[VECTOR_MEMCHECK]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP8:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[VECTOR_PH]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP10]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[VECTOR_BODY]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP14:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[TMP13]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP15:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i32 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP15]], align 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP16:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[TMP13]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP17:%.*]] = getelementptr inbounds nuw float, ptr [[TMP16]], i32 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 4 x float>, ptr [[TMP17]], align 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP18:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP19:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[TMP13]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP20:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i32 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP18]], ptr [[TMP20]], align 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA520-NOINTERLEAVE: [[MIDDLE_BLOCK]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA520-NOINTERLEAVE: [[SCALAR_PH]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[FOR_BODY]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA520-NOINTERLEAVE...
[truncated]
I would expect in-order cores to like unrolling, as it enables more hiding of latency hazards. (There are always cases, depending on the trip count of the loop or the overheads, where it ends up making things worse, but I would expect at least some level of interleaving to usually be useful overall.) It looks from the example code that the addressing modes in the loops are not doing very well. They are usually calculated in LSR. Could they do better, and then get the benefit of interleaving without the cost of the inefficient addressing-mode calculations?
Also: I think this controls Neon too, and Neon will prefer interleaving at least a bit to make use of LDP/STP.
The other alternative might be to prefer fixed-width over scalable vectors when the costs in the vectorizer are equal, if that is more beneficial on these cores. This is controlled via the FeatureUseFixedOverScalableIfEqualCost feature.
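The tie-break this feature enables can be sketched as follows. This is an illustrative model only: the struct and function names are hypothetical and are not LLVM's actual cost-model API; the real feature is a subtarget feature consulted by the loop vectorizer when plan costs tie.

```cpp
// Hypothetical sketch of a cost tie-break: when a fixed-width plan and a
// scalable plan have equal cost, a flag (modeled on
// FeatureUseFixedOverScalableIfEqualCost) makes the fixed-width plan win.
struct Plan {
  unsigned Cost;    // estimated cost of the vectorization plan
  bool Scalable;    // true if the plan uses scalable (SVE) vectors
};

Plan pickPlan(Plan Fixed, Plan Scalable, bool PreferFixedOnTie) {
  if (Fixed.Cost < Scalable.Cost)
    return Fixed;
  if (Scalable.Cost < Fixed.Cost)
    return Scalable;
  // Costs are equal: only here does the feature change the decision.
  return PreferFixedOnTie ? Fixed : Scalable;
}
```

Note that, as the review comment below also stresses, the feature only matters when the scores are exactly equal; a genuinely cheaper scalable plan is still chosen.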
Inefficient SVE codegen occurs on at least two in-order cores, Cortex-A510 and Cortex-A520. For example, a simple vector add

```
void foo(float *a, float *b, float *dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

vectorizes the inner loop into the following interleaved sequence of instructions:

```
add  x12, x1, x10
ld1b { z0.b }, p0/z, [x1, x10]
add  x13, x2, x10
ld1b { z1.b }, p0/z, [x2, x10]
ldr  z2, [x12, #1, mul vl]
ldr  z3, [x13, #1, mul vl]
dech x11
add  x12, x0, x10
fadd z0.s, z1.s, z0.s
fadd z1.s, z3.s, z2.s
st1b { z0.b }, p0, [x0, x10]
addvl x10, x10, #2
str  z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable vectors when the cost is equal, we get the following vectorized loop:

```
ldp  q0, q3, [x11, #-16]
subs x13, x13, #8
ldp  q1, q2, [x10, #-16]
add  x10, x10, #32
add  x11, x11, #32
fadd v0.4s, v1.4s, v0.4s
fadd v1.4s, v2.4s, v3.4s
stp  q0, q1, [x12, #-16]
add  x12, x12, #32
```

which is more efficient.

Change-Id: Ie1e862f6a1db851182a95534b3b987feb670d7ca
Force-pushed from c16f09c to 9df06bb.
LGTM, can you change the title to something like "[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A510 and A520". It is only when the scores are equal and the vectorizer has no reason to pick one vs the other that this will cause the vectorizer to pick fixed-width.
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/14312
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/153/builds/27853
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/16134
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/16375
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/23895
Had to revert due to buildbot failures. Investigating.
…ualCost for A510 and A520" (#134382) Reverts llvm/llvm-project#132246
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/190/builds/17694
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/16254
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/187/builds/5218
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/52/builds/7327
Inefficient SVE codegen occurs on at least two in-order cores, Cortex-A510 and Cortex-A520. For example, a simple vector add vectorizes the inner loop into an interleaved sequence of instructions. By adjusting the target features to prefer fixed over scalable vectors when the cost is equal, we get a more efficient vectorized loop.