
[AArch64][SVE] Detect MOV (imm, pred, zeroing/merging) #116032

Merged: 4 commits merged into llvm:main on Nov 15, 2024

Conversation

rj-jesus (Contributor) commented on Nov 13, 2024:

Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.

This affects the `@llvm.aarch64.sve.dup` intrinsics, which currently
generate MOV (scalar, predicated) instructions even when the
immediate forms are possible. For example:

```c
svuint8_t mov_z_b(svbool_t p) {
  return svdup_u8_z(p, 1);
}
```

Currently generates:

```
mov_z_b(__SVBool_t):
        mov     z0.b, #0
        mov     w8, #1
        mov     z0.b, p0/m, w8
        ret
```

Instead of:

```
mov_z_b(__SVBool_t):
        mov     z0.b, p0/z, #1
        ret
```

https://godbolt.org/z/GPzrW1nY8
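
For reference, the ACLE call above reaches instruction selection as a call to the `@llvm.aarch64.sve.dup` intrinsic with a zero passthru; the merging variant differs only in passing the existing vector instead of zero. A minimal IR sketch, mirroring the zeroing test case added in this patch:

```llvm
; IR equivalent of svdup_u8_z(p, 1): a predicated dup with a zeroinitializer
; passthru, which the new patterns select to MOV (imm, pred, zeroing).
define <vscale x 16 x i8> @mov_z_b(<vscale x 16 x i1> %pg) {
  %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %pg, i8 1)
  ret <vscale x 16 x i8> %r
}

declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)
```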

llvmbot (Member) commented on Nov 13, 2024:

@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Changes

Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.


Full diff: https://github.com/llvm/llvm-project/pull/116032.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+20)
  • (added) llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll (+83)
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index c10653e05841cd..d0b4b71a93f641 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -892,6 +892,26 @@ let Predicates = [HasSVEorSME] in {
   def : Pat<(nxv2i64 (splat_vector (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)))),
             (DUP_ZI_D $a, $b)>;
 
+  // Duplicate Int immediate to active vector elements (zeroing).
+  def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_B $pg, $a, $b)>;
+  def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_H $pg, $a, $b)>;
+  def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_S $pg, $a, $b)>;
+  def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_D $pg, $a, $b)>;
+
+  // Duplicate Int immediate to active vector elements (merging).
+  def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (nxv16i8 ZPR:$z))),
+            (CPY_ZPmI_B $z, $pg, $a, $b)>;
+  def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (nxv8i16 ZPR:$z))),
+            (CPY_ZPmI_H $z, $pg, $a, $b)>;
+  def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (nxv4i32 ZPR:$z))),
+            (CPY_ZPmI_S $z, $pg, $a, $b)>;
+  def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (nxv2i64 ZPR:$z))),
+            (CPY_ZPmI_D $z, $pg, $a, $b)>;
+
   // Duplicate immediate FP into all vector elements.
   def : Pat<(nxv2f16 (splat_vector (f16 fpimm:$val))),
             (DUP_ZR_H (MOVi32imm (bitcast_fpimm_to_i32 f16:$val)))>;
diff --git a/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
new file mode 100644
index 00000000000000..43be70c9590fb0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+
+; Zeroing.
+
+define dso_local <vscale x 16 x i8> @mov_z_b(<vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_z_b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %pg, i8 1)
+  ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_z_h(<vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_z_h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.h, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> zeroinitializer, <vscale x 8 x i1> %pg, i16 1)
+  ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_z_s(<vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_z_s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.s, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> %pg, i32 1)
+  ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_z_d(<vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_z_d:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.d, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> %pg, i64 1)
+  ret <vscale x 2 x i64> %r
+}
+
+; Merging.
+
+define dso_local <vscale x 16 x i8> @mov_m_b(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_m_b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.b, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg, i8 1)
+  ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_m_h(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_m_h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.h, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg, i16 1)
+  ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_m_s(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_m_s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.s, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg, i32 1)
+  ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_m_d(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_m_d:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.d, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg, i64 1)
+  ret <vscale x 2 x i64> %r
+}
+
+declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)
+declare <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16>, <vscale x 8 x i1>, i16)
+declare <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i1>, i32)
+declare <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i1>, i64)

paulwalker-arm (Collaborator) left a comment:

Can you say where the uses are coming from? I ask because I'm wondering if we should instead be lowering the intrinsics to select instructions to allow generic combines to do their thing. Plus we already have patterns to spot the select based sequences.
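
For illustration, this is the kind of select-based IR that the existing patterns already recognise; lowering the intrinsic this way would let generic combines handle the zeroing and merging cases. A sketch only (the function name is illustrative), not what the intrinsic currently lowers to:

```llvm
; Hypothetical select-based form of svdup_u8_z(p, 1): splat the immediate,
; then select between the splat and zero under the governing predicate.
define <vscale x 16 x i8> @mov_z_b_select(<vscale x 16 x i1> %pg) {
  %ins = insertelement <vscale x 16 x i8> poison, i8 1, i64 0
  %splat = shufflevector <vscale x 16 x i8> %ins, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
  %r = select <vscale x 16 x i1> %pg, <vscale x 16 x i8> %splat, <vscale x 16 x i8> zeroinitializer
  ret <vscale x 16 x i8> %r
}
```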

rj-jesus (Contributor, Author) replied:

> Can you say where the uses are coming from? I ask because I'm wondering if we should instead be lowering the intrinsics to select instructions to allow generic combines to do their thing. Plus we already have patterns to spot the select based sequences.

The uses are coming from ACLE intrinsics, as in the description above. Would you rather see this done with a select in LowerSVEIntrinsicDUP? I think that should also work well. After sending this patch, I realised we are also missing patterns for the MOV (immediate, unpredicated) case, e.g.:

```c
svuint64_t foo() {
  svbool_t pg = svptrue_b64();
  return svdup_u64_z(pg, 1);
}
```

which generates:

```
foo:
        ptrue   p0.d
        mov     z0.d, #0
        mov     w8, #1
        mov     z0.d, p0/m, x8
        ret
```

instead of:

```
foo:
        mov     z0.d, #1
        ret
```

So, we could probably address all these cases by going the select route. What do you think?

paulwalker-arm (Collaborator) replied:

Thanks for the clarification. I just wasn't sure if the example was a real reproducer or just a means to demonstrate the problem. Given this is ACLE-related, let's park the decision to replace DUP_MERGE_PASSTHRU for now and stick with adding the new patterns.

The missed optimisation of the all-active case isn't too surprising given there's already an unpredicated builtin that should be used instead. That said, I see no harm in updating instCombineSVEDup to replace such instances with a vector splat.
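
To make the suggestion concrete, here is a before/after sketch in IR (assuming svptrue_b64 lowers to `sve.ptrue` with the all pattern, 31; the exact instCombineSVEDup change itself is not shown):

```llvm
; Before: the all-active example above as it reaches InstCombine.
define <vscale x 2 x i64> @foo() {
  %pg = call <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 31)
  %r = call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> %pg, i64 1)
  ret <vscale x 2 x i64> %r
}

; After the suggested fold: a plain vector splat, which the existing
; splat_vector patterns already select to the unpredicated MOV (immediate).
define <vscale x 2 x i64> @foo_splat() {
  %ins = insertelement <vscale x 2 x i64> poison, i64 1, i64 0
  %splat = shufflevector <vscale x 2 x i64> %ins, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
  ret <vscale x 2 x i64> %splat
}

declare <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32)
declare <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i1>, i64)
```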

rj-jesus (Contributor, Author) replied:

Thanks for the feedback, I've now pushed the patterns into sve_int_dup_imm_pred_zero and sve_int_dup_imm_pred_merge as appropriate. Is this what you had in mind?

As for the unpredicated case, does it make sense to make that change here as well?
As you say, there are unpredicated variants that can already be used, though it might still be a nice fold to have.

paulwalker-arm (Collaborator) left a comment:

A minor quibble with the tests but otherwise this looks good to me.

paulwalker-arm (Collaborator) replied:

> As for the unpredicated case, does it make sense to make that change here as well?

I recommend fixing it as a separate PR.

rj-jesus (Contributor, Author) replied:

> I recommend fixing it as a separate PR.

Sounds good, thanks, and many thanks for the feedback!

rj-jesus merged commit e543650 into llvm:main on Nov 15, 2024 (8 checks passed).
rj-jesus deleted the rjj/aarch64-sve-mov-imm-pred branch on Nov 15, 2024 at 09:11.
akshayrdeodhar pushed a commit to akshayrdeodhar/llvm-project that referenced this pull request on Nov 18, 2024.