
[AArch64] Lower __builtin_bswap16 to rev16 if bswap followed by any_extend #105375

Merged
merged 1 commit into llvm:main from adprasad/rev-lsr-to-rev16 on Sep 10, 2024

Conversation

adprasad-nvidia
Contributor

GCC compiles the built-in function __builtin_bswap16 to the ARM instruction rev16, which reverses the byte order of 16-bit data. Clang, on the other hand, compiles the same built-in function to e.g.

        rev     w8, w0
        lsr     w0, w8, #16

i.e. it performs a byte reversal of a 32-bit register (which moves the lower half, containing the 16-bit data, to the upper half) and then right-shifts the reversed 16-bit data back to the lower half of the register.
We can improve Clang codegen by generating rev16 instead of rev and lsr, like GCC.
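As a minimal illustration (the function name is made up; the commented assembly simply restates the GCC output and the pre-patch Clang output described above):

```cpp
// Byte-swap a 16-bit value and use the result as a 16-bit value.
// GCC (AArch64):            rev16 w0, w0
// Clang before this patch:  rev w8, w0 ; lsr w0, w8, #16
unsigned short swap16(unsigned short x) {
  return __builtin_bswap16(x);
}
```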


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 20, 2024

@llvm/pr-subscribers-backend-aarch64

Author: None (adprasad-nvidia)

Changes

GCC compiles the built-in function __builtin_bswap16 to the ARM instruction rev16, which reverses the byte order of 16-bit data. Clang, on the other hand, compiles the same built-in function to e.g.

        rev     w8, w0
        lsr     w0, w8, #16

i.e. it performs a byte reversal of a 32-bit register (which moves the lower half, containing the 16-bit data, to the upper half) and then right-shifts the reversed 16-bit data back to the lower half of the register.
We can improve Clang codegen by generating rev16 instead of rev and lsr, like GCC.


Full diff: https://github.com/llvm/llvm-project/pull/105375.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/arm64-rev.ll (+13-28)
  • (modified) llvm/test/CodeGen/AArch64/bswap.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/memcmp.ll (+5-10)
  • (modified) llvm/test/CodeGen/AArch64/merge-trunc-store.ll (+4-8)
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index a9324af5beb784..227ed141075582 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -2825,9 +2825,9 @@ def : InstAlias<"rev64 $Rd, $Rn", (REVXr GPR64:$Rd, GPR64:$Rn), 0>;
 def : Pat<(bswap (rotr GPR32:$Rn, (i64 16))), (REV16Wr GPR32:$Rn)>;
 def : Pat<(bswap (rotr GPR64:$Rn, (i64 32))), (REV32Xr GPR64:$Rn)>;
 
-// Match (srl (bswap x), C) -> revC if the upper bswap bits are known zero.
-def : Pat<(srl (bswap top16Zero:$Rn), (i64 16)), (REV16Wr GPR32:$Rn)>;
-def : Pat<(srl (bswap top32Zero:$Rn), (i64 32)), (REV32Xr GPR64:$Rn)>;
+// Match (srl (bswap x), C) -> revC.
+def : Pat<(srl (bswap GPR32:$Rn), (i64 16)), (REV16Wr GPR32:$Rn)>;
+def : Pat<(srl (bswap GPR64:$Rn), (i64 32)), (REV32Xr GPR64:$Rn)>;
 
 def : Pat<(or (and (srl GPR64:$Rn, (i64 8)), (i64 0x00ff00ff00ff00ff)),
               (and (shl GPR64:$Rn, (i64 8)), (i64 0xff00ff00ff00ff00))),
diff --git a/llvm/test/CodeGen/AArch64/arm64-rev.ll b/llvm/test/CodeGen/AArch64/arm64-rev.ll
index f548a0e01feee6..5973a6a0cf113f 100644
--- a/llvm/test/CodeGen/AArch64/arm64-rev.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-rev.ll
@@ -27,15 +27,13 @@ entry:
 define i32 @test_rev_w_srl16(i16 %a) {
 ; CHECK-SD-LABEL: test_rev_w_srl16:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    rev w8, w0
-; CHECK-SD-NEXT:    lsr w0, w8, #16
+; CHECK-SD-NEXT:    rev16 w0, w0
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: test_rev_w_srl16:
 ; CHECK-GI:       // %bb.0: // %entry
 ; CHECK-GI-NEXT:    and w8, w0, #0xffff
-; CHECK-GI-NEXT:    rev w8, w8
-; CHECK-GI-NEXT:    lsr w0, w8, #16
+; CHECK-GI-NEXT:    rev16 w0, w8
 ; CHECK-GI-NEXT:    ret
 entry:
   %0 = zext i16 %a to i32
@@ -48,8 +46,7 @@ define i32 @test_rev_w_srl16_load(ptr %a) {
 ; CHECK-LABEL: test_rev_w_srl16_load:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldrh w8, [x0]
-; CHECK-NEXT:    rev w8, w8
-; CHECK-NEXT:    lsr w0, w8, #16
+; CHECK-NEXT:    rev16 w0, w8
 ; CHECK-NEXT:    ret
 entry:
   %0 = load i16, ptr %a
@@ -71,8 +68,7 @@ define i32 @test_rev_w_srl16_add(i8 %a, i8 %b) {
 ; CHECK-GI:       // %bb.0: // %entry
 ; CHECK-GI-NEXT:    and w8, w1, #0xff
 ; CHECK-GI-NEXT:    add w8, w8, w0, uxtb
-; CHECK-GI-NEXT:    rev w8, w8
-; CHECK-GI-NEXT:    lsr w0, w8, #16
+; CHECK-GI-NEXT:    rev16 w0, w8
 ; CHECK-GI-NEXT:    ret
 entry:
   %0 = zext i8 %a to i32
@@ -89,15 +85,13 @@ define i64 @test_rev_x_srl32(i32 %a) {
 ; CHECK-SD-LABEL: test_rev_x_srl32:
 ; CHECK-SD:       // %bb.0: // %entry
 ; CHECK-SD-NEXT:    // kill: def $w0 killed $w0 def $x0
-; CHECK-SD-NEXT:    rev x8, x0
-; CHECK-SD-NEXT:    lsr x0, x8, #32
+; CHECK-SD-NEXT:    rev32 x0, x0
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: test_rev_x_srl32:
 ; CHECK-GI:       // %bb.0: // %entry
 ; CHECK-GI-NEXT:    mov w8, w0
-; CHECK-GI-NEXT:    rev x8, x8
-; CHECK-GI-NEXT:    lsr x0, x8, #32
+; CHECK-GI-NEXT:    rev32 x0, x8
 ; CHECK-GI-NEXT:    ret
 entry:
   %0 = zext i32 %a to i64
@@ -110,8 +104,7 @@ define i64 @test_rev_x_srl32_load(ptr %a) {
 ; CHECK-LABEL: test_rev_x_srl32_load:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr w8, [x0]
-; CHECK-NEXT:    rev x8, x8
-; CHECK-NEXT:    lsr x0, x8, #32
+; CHECK-NEXT:    rev32 x0, x8
 ; CHECK-NEXT:    ret
 entry:
   %0 = load i32, ptr %a
@@ -122,18 +115,11 @@ entry:
 }
 
 define i64 @test_rev_x_srl32_shift(i64 %a) {
-; CHECK-SD-LABEL: test_rev_x_srl32_shift:
-; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    ubfx x8, x0, #2, #29
-; CHECK-SD-NEXT:    rev32 x0, x8
-; CHECK-SD-NEXT:    ret
-;
-; CHECK-GI-LABEL: test_rev_x_srl32_shift:
-; CHECK-GI:       // %bb.0: // %entry
-; CHECK-GI-NEXT:    ubfx x8, x0, #2, #29
-; CHECK-GI-NEXT:    rev x8, x8
-; CHECK-GI-NEXT:    lsr x0, x8, #32
-; CHECK-GI-NEXT:    ret
+; CHECK-LABEL: test_rev_x_srl32_shift:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    ubfx x8, x0, #2, #29
+; CHECK-NEXT:    rev32 x0, x8
+; CHECK-NEXT:    ret
 entry:
   %0 = shl i64 %a, 33
   %1 = lshr i64 %0, 35
@@ -472,8 +458,7 @@ define void @test_rev16_truncstore() {
 ; CHECK-GI-NEXT:  .LBB30_1: // %cleanup
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-GI-NEXT:    ldrh w8, [x8]
-; CHECK-GI-NEXT:    rev w8, w8
-; CHECK-GI-NEXT:    lsr w8, w8, #16
+; CHECK-GI-NEXT:    rev16 w8, w8
 ; CHECK-GI-NEXT:    strh w8, [x8]
 ; CHECK-GI-NEXT:    tbz wzr, #0, .LBB30_1
 ; CHECK-GI-NEXT:  .LBB30_2: // %fail
diff --git a/llvm/test/CodeGen/AArch64/bswap.ll b/llvm/test/CodeGen/AArch64/bswap.ll
index 071613b9cc011e..2a60abdc2308f0 100644
--- a/llvm/test/CodeGen/AArch64/bswap.ll
+++ b/llvm/test/CodeGen/AArch64/bswap.ll
@@ -6,8 +6,7 @@
 define i16 @bswap_i16(i16 %a){
 ; CHECK-LABEL: bswap_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    rev w8, w0
-; CHECK-NEXT:    lsr w0, w8, #16
+; CHECK-NEXT:    rev16 w0, w0
 ; CHECK-NEXT:    ret
     %3 = call i16 @llvm.bswap.i16(i16 %a)
     ret i16 %3
diff --git a/llvm/test/CodeGen/AArch64/memcmp.ll b/llvm/test/CodeGen/AArch64/memcmp.ll
index 4da7c8c95a4e4f..0a6a03844128c3 100644
--- a/llvm/test/CodeGen/AArch64/memcmp.ll
+++ b/llvm/test/CodeGen/AArch64/memcmp.ll
@@ -39,9 +39,8 @@ define i32 @length2(ptr %X, ptr %Y) nounwind {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ldrh w8, [x0]
 ; CHECK-NEXT:    ldrh w9, [x1]
-; CHECK-NEXT:    rev w8, w8
+; CHECK-NEXT:    rev16 w8, w8
 ; CHECK-NEXT:    rev w9, w9
-; CHECK-NEXT:    lsr w8, w8, #16
 ; CHECK-NEXT:    sub w0, w8, w9, lsr #16
 ; CHECK-NEXT:    ret
   %m = tail call i32 @memcmp(ptr %X, ptr %Y, i64 2) nounwind
@@ -93,9 +92,8 @@ define i1 @length2_lt(ptr %X, ptr %Y) nounwind {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ldrh w8, [x0]
 ; CHECK-NEXT:    ldrh w9, [x1]
-; CHECK-NEXT:    rev w8, w8
+; CHECK-NEXT:    rev16 w8, w8
 ; CHECK-NEXT:    rev w9, w9
-; CHECK-NEXT:    lsr w8, w8, #16
 ; CHECK-NEXT:    sub w8, w8, w9, lsr #16
 ; CHECK-NEXT:    lsr w0, w8, #31
 ; CHECK-NEXT:    ret
@@ -109,9 +107,8 @@ define i1 @length2_gt(ptr %X, ptr %Y) nounwind {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ldrh w8, [x0]
 ; CHECK-NEXT:    ldrh w9, [x1]
-; CHECK-NEXT:    rev w8, w8
+; CHECK-NEXT:    rev16 w8, w8
 ; CHECK-NEXT:    rev w9, w9
-; CHECK-NEXT:    lsr w8, w8, #16
 ; CHECK-NEXT:    sub w8, w8, w9, lsr #16
 ; CHECK-NEXT:    cmp w8, #0
 ; CHECK-NEXT:    cset w0, gt
@@ -536,10 +533,8 @@ define i32 @length10(ptr %X, ptr %Y) nounwind {
 ; CHECK-NEXT:  // %bb.1: // %loadbb1
 ; CHECK-NEXT:    ldrh w8, [x0, #8]
 ; CHECK-NEXT:    ldrh w9, [x1, #8]
-; CHECK-NEXT:    rev w8, w8
-; CHECK-NEXT:    rev w9, w9
-; CHECK-NEXT:    lsr w8, w8, #16
-; CHECK-NEXT:    lsr w9, w9, #16
+; CHECK-NEXT:    rev16 w8, w8
+; CHECK-NEXT:    rev16 w9, w9
 ; CHECK-NEXT:    cmp x8, x9
 ; CHECK-NEXT:    b.ne .LBB32_3
 ; CHECK-NEXT:  // %bb.2:
diff --git a/llvm/test/CodeGen/AArch64/merge-trunc-store.ll b/llvm/test/CodeGen/AArch64/merge-trunc-store.ll
index b161d746ad11d5..4fcd030db1bace 100644
--- a/llvm/test/CodeGen/AArch64/merge-trunc-store.ll
+++ b/llvm/test/CodeGen/AArch64/merge-trunc-store.ll
@@ -10,8 +10,7 @@ define void @le_i16_to_i8(i16 %x, ptr %p0) {
 ;
 ; BE-LABEL: le_i16_to_i8:
 ; BE:       // %bb.0:
-; BE-NEXT:    rev w8, w0
-; BE-NEXT:    lsr w8, w8, #16
+; BE-NEXT:    rev16 w8, w0
 ; BE-NEXT:    strh w8, [x1]
 ; BE-NEXT:    ret
   %sh1 = lshr i16 %x, 8
@@ -31,8 +30,7 @@ define void @le_i16_to_i8_order(i16 %x, ptr %p0) {
 ;
 ; BE-LABEL: le_i16_to_i8_order:
 ; BE:       // %bb.0:
-; BE-NEXT:    rev w8, w0
-; BE-NEXT:    lsr w8, w8, #16
+; BE-NEXT:    rev16 w8, w0
 ; BE-NEXT:    strh w8, [x1]
 ; BE-NEXT:    ret
   %sh1 = lshr i16 %x, 8
@@ -47,8 +45,7 @@ define void @le_i16_to_i8_order(i16 %x, ptr %p0) {
 define void @be_i16_to_i8_offset(i16 %x, ptr %p0) {
 ; LE-LABEL: be_i16_to_i8_offset:
 ; LE:       // %bb.0:
-; LE-NEXT:    rev w8, w0
-; LE-NEXT:    lsr w8, w8, #16
+; LE-NEXT:    rev16 w8, w0
 ; LE-NEXT:    sturh w8, [x1, #11]
 ; LE-NEXT:    ret
 ;
@@ -69,8 +66,7 @@ define void @be_i16_to_i8_offset(i16 %x, ptr %p0) {
 define void @be_i16_to_i8_order(i16 %x, ptr %p0) {
 ; LE-LABEL: be_i16_to_i8_order:
 ; LE:       // %bb.0:
-; LE-NEXT:    rev w8, w0
-; LE-NEXT:    lsr w8, w8, #16
+; LE-NEXT:    rev16 w8, w0
 ; LE-NEXT:    strh w8, [x1]
 ; LE-NEXT:    ret
 ;

@adprasad-nvidia adprasad-nvidia changed the title from "[AArch64] [CodeGen] [REV] Generate revC for all (srl (bswap x), C) instructions when C is 16 or 32" to "[AArch64] Generate revC for all (srl (bswap x), C) instructions when C is 16 or 32" on Sep 2, 2024
@adprasad-nvidia adprasad-nvidia changed the title from "[AArch64] Generate revC for all (srl (bswap x), C) instructions when C is 16 or 32" to "[AArch64] Generate rev16 for certain uses of __builtin_bswap16" on Sep 3, 2024
@adprasad-nvidia
Contributor Author

@DTeachs Thanks for pointing this out. As you pointed out, rev16 should only be generated from __builtin_bswap16 when the result is used as an i16; if it is used as an i32, the top bits are still needed.
The updated patch now checks for this and only lowers to rev16 if the result is used as an i16. It does this by checking that the bswap is followed by an any_extend, which means the top half is not used, i.e. the result is used as an i16. If it were instead followed by a zero_extend, the top half would be used, and the updated patch does not lower to rev16.
I have added a test in CodeGen/AArch64/bswap.ll that checks bswap is not lowered to rev16 when the result is used as an i32.
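To make the distinction concrete, here is a hypothetical pair of functions (names invented) contrasting the case the patch now targets with the case it deliberately leaves alone:

```cpp
// Result used as an i16: in the DAG the bswap is followed by an any_extend,
// so after this patch it can be lowered to rev16.
unsigned short used_as_i16(unsigned short x) {
  return __builtin_bswap16(x);
}

// Result used as an i32: the IR contains a zext (zero_extend in the DAG), the
// top bits are needed, and the old rev + lsr sequence is kept.
unsigned int used_as_i32(unsigned short x) {
  return __builtin_bswap16(x);
}
```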

Collaborator

@davemgreen davemgreen left a comment


Hi - I think this sounds good. There might be a way to handle this where it is performed from the demanded bits of the operation, but that isn't easy to do during selection and performing it at least for the anyext case sounds good for the time being.

I had some comments inline.

Comment on lines 22147 to 22148
N->getOperand(0).getValueType().isScalarInteger() &&
N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
Collaborator

I think this could check N->getOperand(0).getValueType() == MVT::i16. Should it check that the input type is i32 too? Otherwise it would need to check i64 or i32, and add a pattern for i64 (and a test).
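A rough sketch of the condition being suggested, paraphrasing the review rather than quoting the patch (N is assumed to be the any_extend node reached by the combine):

```cpp
// Simplified guard: the value being extended is a 16-bit bswap.
if (N->getOperand(0).getOpcode() == ISD::BSWAP &&
    N->getOperand(0).getValueType() == MVT::i16) {
  // ... build the replacement nodes discussed later in this thread ...
}
```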

Contributor Author

Thanks for the suggestion on simplifying to == MVT::i16.

Contributor Author

@adprasad-nvidia adprasad-nvidia Sep 4, 2024

By input type, I assume you mean the input type to the rev16 / output type of the old any_extend? As far as I can tell, this is only ever an i32. If an i64 is needed, then a zero_extend is generated instead.
In the absence of writing a test, do you think it would be worthwhile to write a comment explaining this? No changes to the code would need to be made, because I am following your suggestion of using N->getValueType(0) instead of EVT(MVT::i32), which means the logic in the code is independent of whether it is i64 or i32.

Contributor Author

@adprasad-nvidia adprasad-nvidia Sep 4, 2024

To explain further - there are 2 cases.
Case 1: If the source code uses the result of the __builtin_bswap16 as either an i32 or an i64, then a zext from i16 to i32/i64 is inserted in the IR before instruction selection. This patch's optimisation doesn't trigger, because in the DAG this zext becomes a zero_extend, not an any_extend.
Case 2: It's only if the result is used as an i16 that the zext is not inserted. In this case, the any_extend from i16 to i32 seems to be inserted later, during the initial building of the DAG from IR (presumably to help with type legalisation later on). The code generator should always insert an any_extend from i16 to i32, as there's never any reason for an any_extend from i16 to i64: if we needed the result as an i64, we would be in Case 1, not Case 2.

Collaborator

I think Dave's suggestion is to add a check:

 bswap->getOperand(0).getValueType() == MVT::i32

where bswap = N->getOperand(0).

That's a very good point, it guarantees that we are only dealing with REV16 for i32 where we know the upper bits are zero. It avoids us having to worry about i64s and different patterns.

Contributor Author

@adprasad-nvidia adprasad-nvidia Sep 4, 2024

I'm not sure I understand. bswap's output type is always its input type, so bswap->getOperand(0).getValueType() is always i16, not i32 (assuming we've checked its output type is i16, which we already do). I just verified this by adding a check in the if statement that N->getOperand(0)->getOperand(0).getValueType() == MVT::i32. This makes the test that checks rev16 is generated fail, i.e. we generate the old rev and lsr instead of rev16, because the check evaluates to false.
We guarantee we are only dealing with REV16 for i32 because we insert an any_extend before the REV16, and that any_extend is guaranteed to be extending to i32.

Collaborator

I'm worried about anyext(bswap) coming up for other types. These two look like they should end up with i64 anyext and i128 anyext after a little optimization: https://godbolt.org/z/G1Kc9K1cv
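I don't have the source behind the godbolt link, but something along these lines is one way (purely an assumed reconstruction, not the linked code) to end up with wider any_extends of a bswap: when the extended bits are shifted away, the DAG combiner can relax a zero_extend to an any_extend.

```cpp
// The upper bits of the extension are shifted out, so only the low 16 bits of the
// extended bswap are demanded; the zero_extend may then relax to an i64 any_extend.
unsigned long long maybe_i64_anyext(unsigned short x) {
  return (unsigned long long)__builtin_bswap16(x) << 48;
}

// Same pattern with a 128-bit result type.
unsigned __int128 maybe_i128_anyext(unsigned short x) {
  return (unsigned __int128)__builtin_bswap16(x) << 112;
}
```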

Contributor Author

OK, got it. I didn't realise that we can also get an any_extend from an optimised zero_extend.
I can, as suggested, avoid this by adding a check that the old any_extend output type / rev16 input type is i32, i.e. add a check that N.getValueType() == MVT::i32.
But it might also be relatively simple to handle the i64 and i128 cases and still get the optimised rev16 codegen. We could insert two any_extends instead of one: one immediately before the rev16 that extends i16 to i32, and one immediately after the rev16 that extends i32 to whatever the value type of the old any_extend was (i32, i64, i128...).
Would you be happy with the second option too?

Collaborator

Yeah, adding i64 sounds like a good idea. We can just add a new tablegen pattern for REV16Xr if I understand it correctly.

Usually with the way types legalize, if you handle i32 and i64 then the other types will follow suit, as they get legalized to i32/i64 anyway. We just need to make sure we don't generate an i128 AArch64ISD::REV16 that cannot be selected.

N->getOperand(0).getOpcode() == ISD::BSWAP &&
N->getOperand(0).getValueType().isScalarInteger() &&
N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
SDNode *BswapNode = N->getOperand(0).getNode();
Collaborator

Usually we would do SDLoc DL(N);, and use that elsewhere in both the nodes.

Contributor Author

@adprasad-nvidia adprasad-nvidia Sep 4, 2024

By this do you mean just replacing the inline SDLoc(N) in the second DAG.getNode with DL, defined as you suggest?
I'm not sure how else I could use SDLoc DL(N) and how this would replace needing to define BswapNode. The first DAG.getNode uses SDLoc(BswapNode) instead of SDLoc(N), and I also use the definition of BswapNode in the BswapNode->getOperand(0) expression, not just as a location.

Collaborator

Replace SDLoc(BswapNode) with DL in the getNode calls and define:

SDLoc DL(N);

N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
SDNode *BswapNode = N->getOperand(0).getNode();
SDValue NewAnyExtend = DAG.getNode(ISD::ANY_EXTEND, SDLoc(BswapNode),
EVT(MVT::i32), BswapNode->getOperand(0));
Collaborator

EVT(MVT::i32) can just be MVT::i32, or probably just N->getValueType(0).

@adprasad-nvidia
Contributor Author

@davemgreen @sjoerdmeijer Thanks for the comments, I have pushed changes addressing them.

Collaborator

@davemgreen davemgreen left a comment

Thanks. LGTM with an extra comment. You might have to rebase the patch too.

N->getOperand(0).getOpcode() == ISD::BSWAP &&
N->getOperand(0).getValueType() == MVT::i16 &&
(N->getValueType(0) == MVT::i32 || N->getValueType(0) == MVT::i64)) {
SDNode *BswapNode = N->getOperand(0).getNode();
Collaborator

SDValue BSwap = N->getOperand(0);, then it can use BSwap.getOperand(0) below.
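Putting the review suggestions together (SDLoc DL(N), N->getValueType(0) in place of EVT(MVT::i32), and the BSwap local), the quoted fragment might end up looking roughly like this; it is a sketch of the suggestions only, not the exact code that landed:

```cpp
// N is assumed to be the ISD::ANY_EXTEND node; only the fragment quoted above is shown.
SDLoc DL(N);
SDValue BSwap = N->getOperand(0);
SDValue NewAnyExtend =
    DAG.getNode(ISD::ANY_EXTEND, DL, N->getValueType(0), BSwap.getOperand(0));
// The second node is then built with the same DL (not shown here).
```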

@adprasad-nvidia adprasad-nvidia changed the title from "[AArch64] Generate rev16 for certain uses of __builtin_bswap16" to "[AArch64] Lower __builtin_bswap16 to rev16 if bswap followed by any_extend" on Sep 6, 2024
@adprasad-nvidia adprasad-nvidia force-pushed the adprasad/rev-lsr-to-rev16 branch from 2d935a9 to bbcebb7 on September 6, 2024 at 15:18
@adprasad-nvidia adprasad-nvidia force-pushed the adprasad/rev-lsr-to-rev16 branch from bbcebb7 to fd5e47e on September 6, 2024 at 15:19
Collaborator

@davemgreen davemgreen left a comment

LGTM, thanks.

@sjoerdmeijer sjoerdmeijer merged commit 23595d1 into llvm:main Sep 10, 2024
5 of 8 checks passed
@sjoerdmeijer
Collaborator

Thanks for your help with this Dave!


@adprasad-nvidia Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

VitaNuo pushed a commit to VitaNuo/llvm-project that referenced this pull request Sep 12, 2024
…xtend (llvm#105375)

GCC compiles the built-in function `__builtin_bswap16` to the ARM instruction rev16, which reverses the byte order of 16-bit data. Clang, on the other hand, compiles the same built-in function to e.g.
```     
        rev     w8, w0
        lsr     w0, w8, #16
```
i.e. it performs a byte reversal of a 32-bit register (which moves the lower half, containing the 16-bit data, to the upper half) and then right-shifts the reversed 16-bit data back to the lower half of the register.
We can improve Clang codegen by generating `rev16` instead of `rev` and
`lsr`, like GCC.