
[AArch64] Use INDEX for constant Neon step vectors #113424

Merged
rj-jesus merged 2 commits into llvm:main from rjj/aarch64-enable-index-for-neon on Oct 23, 2024

Conversation

@rj-jesus (Contributor) commented Oct 23, 2024

When compiling for an SVE target we can use INDEX to generate constant fixed-length step vectors, e.g.:

uint32x4_t bar() {
  return (uint32x4_t){0, 1, 2, 3};
}

Currently:

bar():
        adrp    x8, .LCPI1_0
        ldr     q0, [x8, :lo12:.LCPI1_0]
        ret

With INDEX:

bar():
        index   z0.s, #0, #1
        ret

The logic for this was already in LowerBUILD_VECTOR, though it was hidden under a check for !Subtarget->isNeonAvailable(). This patch refactors this to enable the corresponding code path unconditionally for constant step vectors (as long as we can use SVE for them).
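For reference, the C snippet above boils down to returning a constant step vector at the IR level. A minimal LLVM IR sketch (hand-written here, not taken from the patch's tests) would be:

```
; Constant step vector <0, 1, 2, 3>; with SVE available this should now go
; through LowerFixedLengthBuildVectorToSVE and select "index z0.s, #0, #1"
; instead of a literal-pool load.
define <4 x i32> @bar() {
  ret <4 x i32> <i32 0, i32 1, i32 2, i32 3>
}
```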

GCC implemented this recently.

https://godbolt.org/z/6sP3Kj3db

Resolves #110410

@llvmbot (Collaborator) commented Oct 23, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Full diff: https://github.com/llvm/llvm-project/pull/113424.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+3-1)
  • (modified) llvm/test/CodeGen/AArch64/active_lane_mask.ll (+8-12)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 4aa123b42d1966..e016a905c934bf 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -14512,7 +14512,9 @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op,
                                                  SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
 
-  if (useSVEForFixedLengthVectorVT(VT, !Subtarget->isNeonAvailable()))
+  bool OverrideNEON = !Subtarget->isNeonAvailable() ||
+                      cast<BuildVectorSDNode>(Op)->isConstantSequence();
+  if (useSVEForFixedLengthVectorVT(VT, OverrideNEON))
     return LowerFixedLengthBuildVectorToSVE(Op, DAG);
 
   // Try to build a simple constant vector.
diff --git a/llvm/test/CodeGen/AArch64/active_lane_mask.ll b/llvm/test/CodeGen/AArch64/active_lane_mask.ll
index bd5d076d1ba82e..025bbf749fc71b 100644
--- a/llvm/test/CodeGen/AArch64/active_lane_mask.ll
+++ b/llvm/test/CodeGen/AArch64/active_lane_mask.ll
@@ -430,10 +430,9 @@ define <2 x i1> @lane_mask_v2i1_i64(i64 %index, i64 %TC) {
 define <16 x i1> @lane_mask_v16i1_i8(i8 %index, i8 %TC) {
 ; CHECK-LABEL: lane_mask_v16i1_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    adrp x8, .LCPI24_0
-; CHECK-NEXT:    dup v0.16b, w0
-; CHECK-NEXT:    ldr q1, [x8, :lo12:.LCPI24_0]
-; CHECK-NEXT:    uqadd v0.16b, v0.16b, v1.16b
+; CHECK-NEXT:    index z0.b, #0, #1
+; CHECK-NEXT:    dup v1.16b, w0
+; CHECK-NEXT:    uqadd v0.16b, v1.16b, v0.16b
 ; CHECK-NEXT:    dup v1.16b, w1
 ; CHECK-NEXT:    cmhi v0.16b, v1.16b, v0.16b
 ; CHECK-NEXT:    ret
@@ -444,10 +443,9 @@ define <16 x i1> @lane_mask_v16i1_i8(i8 %index, i8 %TC) {
 define <8 x i1> @lane_mask_v8i1_i8(i8 %index, i8 %TC) {
 ; CHECK-LABEL: lane_mask_v8i1_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v0.8b, w0
-; CHECK-NEXT:    adrp x8, .LCPI25_0
-; CHECK-NEXT:    ldr d1, [x8, :lo12:.LCPI25_0]
-; CHECK-NEXT:    uqadd v0.8b, v0.8b, v1.8b
+; CHECK-NEXT:    index z0.b, #0, #1
+; CHECK-NEXT:    dup v1.8b, w0
+; CHECK-NEXT:    uqadd v0.8b, v1.8b, v0.8b
 ; CHECK-NEXT:    dup v1.8b, w1
 ; CHECK-NEXT:    cmhi v0.8b, v1.8b, v0.8b
 ; CHECK-NEXT:    ret
@@ -459,9 +457,8 @@ define <4 x i1> @lane_mask_v4i1_i8(i8 %index, i8 %TC) {
 ; CHECK-LABEL: lane_mask_v4i1_i8:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    dup v0.4h, w0
-; CHECK-NEXT:    adrp x8, .LCPI26_0
+; CHECK-NEXT:    index z1.h, #0, #1
 ; CHECK-NEXT:    movi d2, #0xff00ff00ff00ff
-; CHECK-NEXT:    ldr d1, [x8, :lo12:.LCPI26_0]
 ; CHECK-NEXT:    dup v3.4h, w1
 ; CHECK-NEXT:    bic v0.4h, #255, lsl #8
 ; CHECK-NEXT:    bic v3.4h, #255, lsl #8
@@ -478,8 +475,7 @@ define <2 x i1> @lane_mask_v2i1_i8(i8 %index, i8 %TC) {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    movi d0, #0x0000ff000000ff
 ; CHECK-NEXT:    dup v1.2s, w0
-; CHECK-NEXT:    adrp x8, .LCPI27_0
-; CHECK-NEXT:    ldr d2, [x8, :lo12:.LCPI27_0]
+; CHECK-NEXT:    index z2.s, #0, #1
 ; CHECK-NEXT:    dup v3.2s, w1
 ; CHECK-NEXT:    and v1.8b, v1.8b, v0.8b
 ; CHECK-NEXT:    add v1.2s, v1.2s, v2.2s

@MacDue (Member) commented Oct 23, 2024

Related issue #110410

@paulwalker-arm (Collaborator) left a comment


Please add a more specific/dedicated test because the existing active_lane_mask tests should all ideally result in while instructions when SVE is available, so we'll be missing test coverage when that happens.

When compiling for an SVE target we can use INDEX to generate constant
fixed-length step vectors.

The logic for this was already in `LowerBUILD_VECTOR`, though it was
hidden under `!Subtarget->isNeonAvailable()`. This patch refactors this
to enable the corresponding code path unconditionally for constant step
vectors (as long as we can use SVE for them).
@rj-jesus force-pushed the rjj/aarch64-enable-index-for-neon branch from b3e3195 to 44a1040 on October 23, 2024 at 11:18
@rj-jesus (Contributor, Author)

Please add a more specific/dedicated test because the existing active_lane_mask tests should all ideally result in while instructions when SVE is available, so we'll be missing test coverage when that happens.

Thanks, done.

A couple of questions:

  1. Should we limit this to cases where the start and step of the sequence are in [-16, 15] so that we can use INDEX (immediates)? Or should we allow the general case (e.g. test case v4i32_out_range_start_step; a sketch of such a case follows at the end of this comment), where we might need to use INDEX (scalars) or INDEX (scalar, immediate)? What do you think?

  2. For the sequence:

define <4 x i32> @v4i32_non_zero_non_one() {
  ret <4 x i32> <i32 1, i32 3, i32 5, i32 7>
}

We generate:

v4i32_non_zero_non_one:
  index z0.s, #0, #2
  orr z0.s, z0.s, #0x1
  ret

This doesn't seem specific to this patch so I think we can address it separately.
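For context, here is a hand-written sketch of the kind of out-of-range sequence question 1 refers to; the constants are only illustrative and the actual v4i32_out_range_start_step test in the patch may use different values:

```
; Start (100) and step (100) are both outside the [-16, 15] immediate range,
; so this sequence cannot be materialised with INDEX (immediates) alone and
; would need the scalar or scalar-plus-immediate forms of INDEX.
define <4 x i32> @v4i32_out_range_start_step() {
  ret <4 x i32> <i32 100, i32 200, i32 300, i32 400>
}
```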

@paulwalker-arm (Collaborator) commented Oct 23, 2024

  1. We've tried to be conservative when using all the different forms of INDEX. At least on Neoverse cores their bandwidth can be limiting, so we generally prefer the immediate-only forms, which can be hoisted out of loops, leaving only high-bandwidth arithmetic instructions in the loop rather than an in-loop register variant of INDEX.

I have some work that I need to extract into a pull request to better enforce this by making the decision explicit (i.e. lowering code can choose exactly the variant it wants) rather than the current use count method we use during instruction selection.

  2. Agreed.

@rj-jesus (Contributor, Author)

Sounds great, thanks very much!

We've tried to be conservative when using all the different forms of INDEX. At least on Neoverse cores their bandwidth can be limiting, so we generally prefer the immediate-only forms, which can be hoisted out of loops, leaving only high-bandwidth arithmetic instructions in the loop rather than an in-loop register variant of INDEX.

Should I then enforce the sequence to be within the immediate range, or would you rather handle this later as part of your follow-up patch?

@paulwalker-arm (Collaborator)

Should I then enforce the sequence to be within the immediate range, or would you rather handle this later as part of your follow-up patch?

You can leave it with me.

@rj-jesus (Contributor, Author)

Perfect, thanks.

@rj-jesus merged commit 8a9921f into llvm:main Oct 23, 2024
8 checks passed
@frobtech mentioned this pull request Oct 25, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
When compiling for an SVE target we can use INDEX to generate constant
fixed-length step vectors, e.g.:
```
uint32x4_t foo() {
  return (uint32x4_t){0, 1, 2, 3};
}
```
Currently:
```
foo():
        adrp    x8, .LCPI1_0
        ldr     q0, [x8, :lo12:.LCPI1_0]
        ret
```
With INDEX:
```
foo():
        index   z0.s, #0, #1
        ret
```

The logic for this was already in `LowerBUILD_VECTOR`, though it was
hidden under a check for `!Subtarget->isNeonAvailable()`. This patch
refactors this to enable the corresponding code path unconditionally for
constant step vectors (as long as we can use SVE for them).