
[AArch64] Use INDEX for constant Neon step vectors #113424

Merged
rj-jesus merged 2 commits into llvm:main from rjj/aarch64-enable-index-for-neon on Oct 23, 2024

Conversation

@rj-jesus (Contributor) commented Oct 23, 2024

When compiling for an SVE target we can use INDEX to generate constant fixed-length step vectors, e.g.:

uint32x4_t bar() {
  return (uint32x4_t){0, 1, 2, 3};
}

Currently:

bar():
        adrp    x8, .LCPI1_0
        ldr     q0, [x8, :lo12:.LCPI1_0]
        ret

With INDEX:

bar():
        index   z0.s, #0, #1
        ret

The logic for this was already in LowerBUILD_VECTOR, though it was hidden under a check for !Subtarget->isNeonAvailable(). This patch refactors this to enable the corresponding code path unconditionally for constant step vectors (as long as we can use SVE for them).
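For reference, the C snippet above boils down to returning a constant step vector at the IR level. A minimal LLVM IR sketch (hand-written here, not taken from the patch's tests) would be:

```
; Constant step vector <0, 1, 2, 3>; with SVE available this should now go
; through LowerFixedLengthBuildVectorToSVE and select "index z0.s, #0, #1"
; instead of a literal-pool load.
define <4 x i32> @bar() {
  ret <4 x i32> <i32 0, i32 1, i32 2, i32 3>
}
```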

GCC implemented this recently.

https://godbolt.org/z/6sP3Kj3db

Resolves #110410

@llvmbot (Collaborator) commented Oct 23, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Full diff: https://github.com/llvm/llvm-project/pull/113424.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+3-1)
  • (modified) llvm/test/CodeGen/AArch64/active_lane_mask.ll (+8-12)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 4aa123b42d1966..e016a905c934bf 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -14512,7 +14512,9 @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op,
                                                  SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
 
-  if (useSVEForFixedLengthVectorVT(VT, !Subtarget->isNeonAvailable()))
+  bool OverrideNEON = !Subtarget->isNeonAvailable() ||
+                      cast<BuildVectorSDNode>(Op)->isConstantSequence();
+  if (useSVEForFixedLengthVectorVT(VT, OverrideNEON))
     return LowerFixedLengthBuildVectorToSVE(Op, DAG);
 
   // Try to build a simple constant vector.
diff --git a/llvm/test/CodeGen/AArch64/active_lane_mask.ll b/llvm/test/CodeGen/AArch64/active_lane_mask.ll
index bd5d076d1ba82e..025bbf749fc71b 100644
--- a/llvm/test/CodeGen/AArch64/active_lane_mask.ll
+++ b/llvm/test/CodeGen/AArch64/active_lane_mask.ll
@@ -430,10 +430,9 @@ define <2 x i1> @lane_mask_v2i1_i64(i64 %index, i64 %TC) {
 define <16 x i1> @lane_mask_v16i1_i8(i8 %index, i8 %TC) {
 ; CHECK-LABEL: lane_mask_v16i1_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    adrp x8, .LCPI24_0
-; CHECK-NEXT:    dup v0.16b, w0
-; CHECK-NEXT:    ldr q1, [x8, :lo12:.LCPI24_0]
-; CHECK-NEXT:    uqadd v0.16b, v0.16b, v1.16b
+; CHECK-NEXT:    index z0.b, #0, #1
+; CHECK-NEXT:    dup v1.16b, w0
+; CHECK-NEXT:    uqadd v0.16b, v1.16b, v0.16b
 ; CHECK-NEXT:    dup v1.16b, w1
 ; CHECK-NEXT:    cmhi v0.16b, v1.16b, v0.16b
 ; CHECK-NEXT:    ret
@@ -444,10 +443,9 @@ define <16 x i1> @lane_mask_v16i1_i8(i8 %index, i8 %TC) {
 define <8 x i1> @lane_mask_v8i1_i8(i8 %index, i8 %TC) {
 ; CHECK-LABEL: lane_mask_v8i1_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v0.8b, w0
-; CHECK-NEXT:    adrp x8, .LCPI25_0
-; CHECK-NEXT:    ldr d1, [x8, :lo12:.LCPI25_0]
-; CHECK-NEXT:    uqadd v0.8b, v0.8b, v1.8b
+; CHECK-NEXT:    index z0.b, #0, #1
+; CHECK-NEXT:    dup v1.8b, w0
+; CHECK-NEXT:    uqadd v0.8b, v1.8b, v0.8b
 ; CHECK-NEXT:    dup v1.8b, w1
 ; CHECK-NEXT:    cmhi v0.8b, v1.8b, v0.8b
 ; CHECK-NEXT:    ret
@@ -459,9 +457,8 @@ define <4 x i1> @lane_mask_v4i1_i8(i8 %index, i8 %TC) {
 ; CHECK-LABEL: lane_mask_v4i1_i8:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    dup v0.4h, w0
-; CHECK-NEXT:    adrp x8, .LCPI26_0
+; CHECK-NEXT:    index z1.h, #0, #1
 ; CHECK-NEXT:    movi d2, #0xff00ff00ff00ff
-; CHECK-NEXT:    ldr d1, [x8, :lo12:.LCPI26_0]
 ; CHECK-NEXT:    dup v3.4h, w1
 ; CHECK-NEXT:    bic v0.4h, #255, lsl #8
 ; CHECK-NEXT:    bic v3.4h, #255, lsl #8
@@ -478,8 +475,7 @@ define <2 x i1> @lane_mask_v2i1_i8(i8 %index, i8 %TC) {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    movi d0, #0x0000ff000000ff
 ; CHECK-NEXT:    dup v1.2s, w0
-; CHECK-NEXT:    adrp x8, .LCPI27_0
-; CHECK-NEXT:    ldr d2, [x8, :lo12:.LCPI27_0]
+; CHECK-NEXT:    index z2.s, #0, #1
 ; CHECK-NEXT:    dup v3.2s, w1
 ; CHECK-NEXT:    and v1.8b, v1.8b, v0.8b
 ; CHECK-NEXT:    add v1.2s, v1.2s, v2.2s

@MacDue (Member) commented Oct 23, 2024

Related issue #110410

@paulwalker-arm (Collaborator) left a comment


Please add a more specific/dedicated test because the existing active_lane_mask tests should all ideally result in while instructions when SVE is available, so we'll be missing test coverage when that happens.

When compiling for an SVE target we can use INDEX to generate constant
fixed-length step vectors.

The logic for this was already in `LowerBUILD_VECTOR`, though it was
hidden under `!Subtarget->isNeonAvailable()`. This patch refactors this
to enable the corresponding code path unconditionally for constant step
vectors (as long as we can use SVE for them).
@rj-jesus force-pushed the rjj/aarch64-enable-index-for-neon branch from b3e3195 to 44a1040 on October 23, 2024 at 11:18
@rj-jesus (Contributor, Author)

Please add a more specific/dedicated test because the existing active_lane_mask tests should all ideally result in while instructions when SVE is available, so we'll be missing test coverage when that happens.

Thanks, done.

A couple of questions:

  1. Should we limit this to cases where the start and step of the sequence are in [-16, 15] so that we can use INDEX (immediates)? Or should we allow the general case (e.g. test case v4i32_out_range_start_step; a sketch of such a case follows at the end of this comment), where we might need to use INDEX (scalars) or INDEX (scalar, immediate)? What do you think?

  2. For the sequence:

define <4 x i32> @v4i32_non_zero_non_one() {
  ret <4 x i32> <i32 1, i32 3, i32 5, i32 7>
}

We generate:

v4i32_non_zero_non_one:
  index z0.s, #0, #2
  orr z0.s, z0.s, #0x1
  ret

This doesn't seem specific to this patch so I think we can address it separately.
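For context, here is a hand-written sketch of the kind of out-of-range sequence question 1 refers to; the constants are only illustrative and the actual v4i32_out_range_start_step test in the patch may use different values:

```
; Start (100) and step (100) are both outside the [-16, 15] immediate range,
; so this sequence cannot be materialised with INDEX (immediates) alone and
; would need the scalar or scalar-plus-immediate forms of INDEX.
define <4 x i32> @v4i32_out_range_start_step() {
  ret <4 x i32> <i32 100, i32 200, i32 300, i32 400>
}
```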

@paulwalker-arm (Collaborator) commented Oct 23, 2024

  1. We've tried to be conservative when using all the different forms of INDEX. At least on Neoverse cores their bandwidth can be limiting, so we generally prefer the immediate-only forms, which can be hoisted out of loops, leaving only high-bandwidth arithmetic instructions in the loop rather than an in-loop register variant of INDEX.

I have some work that I need to extract into a pull request to better enforce this by making the decision explicit (i.e. lowering code can choose exactly the variant it wants) rather than the current use count method we use during instruction selection.

  2. Agreed.

@rj-jesus (Contributor, Author)

Sounds great, thanks very much!

We've tried to be conservative when using all the different forms of INDEX. At least on Neoverse cores their bandwidth can be limiting, so we generally prefer the immediate-only forms, which can be hoisted out of loops, leaving only high-bandwidth arithmetic instructions in the loop rather than an in-loop register variant of INDEX.

Should I then enforce the sequence to be within the immediate range, or would you rather handle this later as part of your follow-up patch?

@paulwalker-arm (Collaborator)

Should I then enforce the sequence to be within the immediate range, or would you rather handle this later as part of your follow-up patch?

You can leave it with me.

@rj-jesus (Contributor, Author)

Perfect, thanks.

@rj-jesus merged commit 8a9921f into llvm:main Oct 23, 2024
8 checks passed
@frobtech mentioned this pull request Oct 25, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
When compiling for an SVE target we can use INDEX to generate constant
fixed-length step vectors, e.g.:
```
uint32x4_t foo() {
  return (uint32x4_t){0, 1, 2, 3};
}
```
Currently:
```
foo():
        adrp    x8, .LCPI1_0
        ldr     q0, [x8, :lo12:.LCPI1_0]
        ret
```
With INDEX:
```
foo():
        index   z0.s, #0, #1
        ret
```

The logic for this was already in `LowerBUILD_VECTOR`, though it was
hidden under a check for `!Subtarget->isNeonAvailable()`. This patch
refactors this to enable the corresponding code path unconditionally for
constant step vectors (as long as we can use SVE for them).