[AArch64][SVE] Detect MOV (imm, pred, zeroing/merging) #116032
Conversation
Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred, merging) or MOV (imm, pred, zeroing) as appropriate.
@llvm/pr-subscribers-backend-aarch64
Author: Ricardo Jesus (rj-jesus)
Changes: Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred, merging) or MOV (imm, pred, zeroing) as appropriate.
Full diff: https://github.com/llvm/llvm-project/pull/116032.diff
2 Files Affected:
- llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+20)
- llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll (+83)
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index c10653e05841cd..d0b4b71a93f641 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -892,6 +892,26 @@ let Predicates = [HasSVEorSME] in {
def : Pat<(nxv2i64 (splat_vector (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)))),
(DUP_ZI_D $a, $b)>;
+ // Duplicate Int immediate to active vector elements (zeroing).
+ def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_B $pg, $a, $b)>;
+ def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_H $pg, $a, $b)>;
+ def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_S $pg, $a, $b)>;
+ def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_D $pg, $a, $b)>;
+
+ // Duplicate Int immediate to active vector elements (merging).
+ def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (nxv16i8 ZPR:$z))),
+ (CPY_ZPmI_B $z, $pg, $a, $b)>;
+ def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (nxv8i16 ZPR:$z))),
+ (CPY_ZPmI_H $z, $pg, $a, $b)>;
+ def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (nxv4i32 ZPR:$z))),
+ (CPY_ZPmI_S $z, $pg, $a, $b)>;
+ def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (nxv2i64 ZPR:$z))),
+ (CPY_ZPmI_D $z, $pg, $a, $b)>;
+
// Duplicate immediate FP into all vector elements.
def : Pat<(nxv2f16 (splat_vector (f16 fpimm:$val))),
(DUP_ZR_H (MOVi32imm (bitcast_fpimm_to_i32 f16:$val)))>;
diff --git a/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
new file mode 100644
index 00000000000000..43be70c9590fb0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+
+; Zeroing.
+
+define dso_local <vscale x 16 x i8> @mov_z_b(<vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_z_b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %pg, i8 1)
+ ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_z_h(<vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_z_h:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.h, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> zeroinitializer, <vscale x 8 x i1> %pg, i16 1)
+ ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_z_s(<vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_z_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.s, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> %pg, i32 1)
+ ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_z_d(<vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_z_d:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.d, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> %pg, i64 1)
+ ret <vscale x 2 x i64> %r
+}
+
+; Merging.
+
+define dso_local <vscale x 16 x i8> @mov_m_b(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_m_b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.b, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg, i8 1)
+ ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_m_h(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_m_h:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.h, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg, i16 1)
+ ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_m_s(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_m_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.s, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg, i32 1)
+ ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_m_d(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_m_d:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.d, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg, i64 1)
+ ret <vscale x 2 x i64> %r
+}
+
+declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)
+declare <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16>, <vscale x 8 x i1>, i16)
+declare <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i1>, i32)
+declare <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i1>, i64)
Can you say where the uses are coming from? I ask because I'm wondering if we should instead be lowering the intrinsics to select instructions to allow generic combines to do their thing. Plus we already have patterns to spot the select based sequences.
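(As a non-authoritative aside for readers: "lowering the intrinsics to select instructions" would presumably mean rewriting a predicated dup such as the `mov_z_b` test above into a plain IR select, which generic combines and the existing select-based patterns mentioned by the reviewer can already pick up. The sketch below is illustrative only; the function name is made up.)
```
; Hypothetical select-based form of the zeroing dup from the tests above:
; the governing predicate chooses between a splat of the immediate and zero.
define <vscale x 16 x i8> @mov_z_b_select(<vscale x 16 x i1> %pg) {
  %r = select <vscale x 16 x i1> %pg, <vscale x 16 x i8> splat (i8 1), <vscale x 16 x i8> zeroinitializer
  ret <vscale x 16 x i8> %r
}
```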
The uses are coming from ACLE intrinsics as in the description above. Would you rather see it done with a select in
which generates:
instead of:
So, we could probably address all these cases by going the select way. What do you think?
Thanks for the clarification. I just wasn't sure if the example was a real reproducer or just a means to demonstrate the problem. Given this is ACLE-related, let's park the decision to replace
The missed optimisation of the all-active case isn't too surprising, given there's already an unpredicated builtin that should be used instead. That said, I see no harm in updating
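(For context, a minimal sketch of the contrast being drawn, assuming the usual ACLE spellings: `svdup_n_u8` for the unpredicated builtin, and `svdup_n_u8_z` with `svptrue_b8` for the all-active predicated case. The function names and this pairing are illustrative, not taken from the patch.)
```
#include <arm_sve.h>

// Unpredicated builtin: every lane is written; no governing predicate needed.
svuint8_t dup_unpredicated(void) {
  return svdup_n_u8(1);
}

// Predicated zeroing builtin with an all-active predicate: the same result,
// but routed through the predicated dup intrinsic this patch pattern-matches.
svuint8_t dup_all_active(void) {
  return svdup_n_u8_z(svptrue_b8(), 1);
}
```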
Thanks for the feedback! I've now pushed the patterns into
As for the unpredicated case, does it make sense to make that change here as well?
A minor quibble with the tests but otherwise this looks good to me.
I recommend fixing it as a separate PR.
Sounds good, thanks, and many thanks for the feedback!
Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred, merging) or MOV (imm, pred, zeroing) as appropriate. This affects the `@llvm.aarch64.sve.dup` intrinsics, which currently generate MOV (scalar, predicated) instructions even when the immediate forms are possible. For example:
```
svuint8_t mov_z_b(svbool_t p) {
  return svdup_u8_z(p, 1);
}
```
Currently generates:
```
mov_z_b(__SVBool_t):
  mov z0.b, #0
  mov w8, #1
  mov z0.b, p0/m, w8
  ret
```
Instead of:
```
mov_z_b(__SVBool_t):
  mov z0.b, p0/z, #1
  ret
```
https://godbolt.org/z/GPzrW1nY8