[AArch64] Lower __builtin_bswap16 to rev16 if bswap followed by any_extend #105375
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository; in that case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.
@llvm/pr-subscribers-backend-aarch64

Author: None (adprasad-nvidia)

Changes

GCC compiles the built-in function __builtin_bswap16 to the ARM instruction rev16, which reverses the byte order of 16-bit data. Clang, on the other hand, compiles the same built-in function to e.g.

rev w8, w0
lsr w0, w8, #16

i.e. it performs a byte reversal of a 32-bit register (which moves the lower half, containing the 16-bit data, to the upper half) and then right shifts the reversed 16-bit data back to the lower half of the register.

Full diff: https://github.com/llvm/llvm-project/pull/105375.diff

5 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index a9324af5beb784..227ed141075582 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -2825,9 +2825,9 @@ def : InstAlias<"rev64 $Rd, $Rn", (REVXr GPR64:$Rd, GPR64:$Rn), 0>;
def : Pat<(bswap (rotr GPR32:$Rn, (i64 16))), (REV16Wr GPR32:$Rn)>;
def : Pat<(bswap (rotr GPR64:$Rn, (i64 32))), (REV32Xr GPR64:$Rn)>;
-// Match (srl (bswap x), C) -> revC if the upper bswap bits are known zero.
-def : Pat<(srl (bswap top16Zero:$Rn), (i64 16)), (REV16Wr GPR32:$Rn)>;
-def : Pat<(srl (bswap top32Zero:$Rn), (i64 32)), (REV32Xr GPR64:$Rn)>;
+// Match (srl (bswap x), C) -> revC.
+def : Pat<(srl (bswap GPR32:$Rn), (i64 16)), (REV16Wr GPR32:$Rn)>;
+def : Pat<(srl (bswap GPR64:$Rn), (i64 32)), (REV32Xr GPR64:$Rn)>;
def : Pat<(or (and (srl GPR64:$Rn, (i64 8)), (i64 0x00ff00ff00ff00ff)),
(and (shl GPR64:$Rn, (i64 8)), (i64 0xff00ff00ff00ff00))),
diff --git a/llvm/test/CodeGen/AArch64/arm64-rev.ll b/llvm/test/CodeGen/AArch64/arm64-rev.ll
index f548a0e01feee6..5973a6a0cf113f 100644
--- a/llvm/test/CodeGen/AArch64/arm64-rev.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-rev.ll
@@ -27,15 +27,13 @@ entry:
define i32 @test_rev_w_srl16(i16 %a) {
; CHECK-SD-LABEL: test_rev_w_srl16:
; CHECK-SD: // %bb.0: // %entry
-; CHECK-SD-NEXT: rev w8, w0
-; CHECK-SD-NEXT: lsr w0, w8, #16
+; CHECK-SD-NEXT: rev16 w0, w0
; CHECK-SD-NEXT: ret
;
; CHECK-GI-LABEL: test_rev_w_srl16:
; CHECK-GI: // %bb.0: // %entry
; CHECK-GI-NEXT: and w8, w0, #0xffff
-; CHECK-GI-NEXT: rev w8, w8
-; CHECK-GI-NEXT: lsr w0, w8, #16
+; CHECK-GI-NEXT: rev16 w0, w8
; CHECK-GI-NEXT: ret
entry:
%0 = zext i16 %a to i32
@@ -48,8 +46,7 @@ define i32 @test_rev_w_srl16_load(ptr %a) {
; CHECK-LABEL: test_rev_w_srl16_load:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldrh w8, [x0]
-; CHECK-NEXT: rev w8, w8
-; CHECK-NEXT: lsr w0, w8, #16
+; CHECK-NEXT: rev16 w0, w8
; CHECK-NEXT: ret
entry:
%0 = load i16, ptr %a
@@ -71,8 +68,7 @@ define i32 @test_rev_w_srl16_add(i8 %a, i8 %b) {
; CHECK-GI: // %bb.0: // %entry
; CHECK-GI-NEXT: and w8, w1, #0xff
; CHECK-GI-NEXT: add w8, w8, w0, uxtb
-; CHECK-GI-NEXT: rev w8, w8
-; CHECK-GI-NEXT: lsr w0, w8, #16
+; CHECK-GI-NEXT: rev16 w0, w8
; CHECK-GI-NEXT: ret
entry:
%0 = zext i8 %a to i32
@@ -89,15 +85,13 @@ define i64 @test_rev_x_srl32(i32 %a) {
; CHECK-SD-LABEL: test_rev_x_srl32:
; CHECK-SD: // %bb.0: // %entry
; CHECK-SD-NEXT: // kill: def $w0 killed $w0 def $x0
-; CHECK-SD-NEXT: rev x8, x0
-; CHECK-SD-NEXT: lsr x0, x8, #32
+; CHECK-SD-NEXT: rev32 x0, x0
; CHECK-SD-NEXT: ret
;
; CHECK-GI-LABEL: test_rev_x_srl32:
; CHECK-GI: // %bb.0: // %entry
; CHECK-GI-NEXT: mov w8, w0
-; CHECK-GI-NEXT: rev x8, x8
-; CHECK-GI-NEXT: lsr x0, x8, #32
+; CHECK-GI-NEXT: rev32 x0, x8
; CHECK-GI-NEXT: ret
entry:
%0 = zext i32 %a to i64
@@ -110,8 +104,7 @@ define i64 @test_rev_x_srl32_load(ptr %a) {
; CHECK-LABEL: test_rev_x_srl32_load:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ldr w8, [x0]
-; CHECK-NEXT: rev x8, x8
-; CHECK-NEXT: lsr x0, x8, #32
+; CHECK-NEXT: rev32 x0, x8
; CHECK-NEXT: ret
entry:
%0 = load i32, ptr %a
@@ -122,18 +115,11 @@ entry:
}
define i64 @test_rev_x_srl32_shift(i64 %a) {
-; CHECK-SD-LABEL: test_rev_x_srl32_shift:
-; CHECK-SD: // %bb.0: // %entry
-; CHECK-SD-NEXT: ubfx x8, x0, #2, #29
-; CHECK-SD-NEXT: rev32 x0, x8
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: test_rev_x_srl32_shift:
-; CHECK-GI: // %bb.0: // %entry
-; CHECK-GI-NEXT: ubfx x8, x0, #2, #29
-; CHECK-GI-NEXT: rev x8, x8
-; CHECK-GI-NEXT: lsr x0, x8, #32
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: test_rev_x_srl32_shift:
+; CHECK: // %bb.0: // %entry
+; CHECK-NEXT: ubfx x8, x0, #2, #29
+; CHECK-NEXT: rev32 x0, x8
+; CHECK-NEXT: ret
entry:
%0 = shl i64 %a, 33
%1 = lshr i64 %0, 35
@@ -472,8 +458,7 @@ define void @test_rev16_truncstore() {
; CHECK-GI-NEXT: .LBB30_1: // %cleanup
; CHECK-GI-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-GI-NEXT: ldrh w8, [x8]
-; CHECK-GI-NEXT: rev w8, w8
-; CHECK-GI-NEXT: lsr w8, w8, #16
+; CHECK-GI-NEXT: rev16 w8, w8
; CHECK-GI-NEXT: strh w8, [x8]
; CHECK-GI-NEXT: tbz wzr, #0, .LBB30_1
; CHECK-GI-NEXT: .LBB30_2: // %fail
diff --git a/llvm/test/CodeGen/AArch64/bswap.ll b/llvm/test/CodeGen/AArch64/bswap.ll
index 071613b9cc011e..2a60abdc2308f0 100644
--- a/llvm/test/CodeGen/AArch64/bswap.ll
+++ b/llvm/test/CodeGen/AArch64/bswap.ll
@@ -6,8 +6,7 @@
define i16 @bswap_i16(i16 %a){
; CHECK-LABEL: bswap_i16:
; CHECK: // %bb.0:
-; CHECK-NEXT: rev w8, w0
-; CHECK-NEXT: lsr w0, w8, #16
+; CHECK-NEXT: rev16 w0, w0
; CHECK-NEXT: ret
%3 = call i16 @llvm.bswap.i16(i16 %a)
ret i16 %3
diff --git a/llvm/test/CodeGen/AArch64/memcmp.ll b/llvm/test/CodeGen/AArch64/memcmp.ll
index 4da7c8c95a4e4f..0a6a03844128c3 100644
--- a/llvm/test/CodeGen/AArch64/memcmp.ll
+++ b/llvm/test/CodeGen/AArch64/memcmp.ll
@@ -39,9 +39,8 @@ define i32 @length2(ptr %X, ptr %Y) nounwind {
; CHECK: // %bb.0:
; CHECK-NEXT: ldrh w8, [x0]
; CHECK-NEXT: ldrh w9, [x1]
-; CHECK-NEXT: rev w8, w8
+; CHECK-NEXT: rev16 w8, w8
; CHECK-NEXT: rev w9, w9
-; CHECK-NEXT: lsr w8, w8, #16
; CHECK-NEXT: sub w0, w8, w9, lsr #16
; CHECK-NEXT: ret
%m = tail call i32 @memcmp(ptr %X, ptr %Y, i64 2) nounwind
@@ -93,9 +92,8 @@ define i1 @length2_lt(ptr %X, ptr %Y) nounwind {
; CHECK: // %bb.0:
; CHECK-NEXT: ldrh w8, [x0]
; CHECK-NEXT: ldrh w9, [x1]
-; CHECK-NEXT: rev w8, w8
+; CHECK-NEXT: rev16 w8, w8
; CHECK-NEXT: rev w9, w9
-; CHECK-NEXT: lsr w8, w8, #16
; CHECK-NEXT: sub w8, w8, w9, lsr #16
; CHECK-NEXT: lsr w0, w8, #31
; CHECK-NEXT: ret
@@ -109,9 +107,8 @@ define i1 @length2_gt(ptr %X, ptr %Y) nounwind {
; CHECK: // %bb.0:
; CHECK-NEXT: ldrh w8, [x0]
; CHECK-NEXT: ldrh w9, [x1]
-; CHECK-NEXT: rev w8, w8
+; CHECK-NEXT: rev16 w8, w8
; CHECK-NEXT: rev w9, w9
-; CHECK-NEXT: lsr w8, w8, #16
; CHECK-NEXT: sub w8, w8, w9, lsr #16
; CHECK-NEXT: cmp w8, #0
; CHECK-NEXT: cset w0, gt
@@ -536,10 +533,8 @@ define i32 @length10(ptr %X, ptr %Y) nounwind {
; CHECK-NEXT: // %bb.1: // %loadbb1
; CHECK-NEXT: ldrh w8, [x0, #8]
; CHECK-NEXT: ldrh w9, [x1, #8]
-; CHECK-NEXT: rev w8, w8
-; CHECK-NEXT: rev w9, w9
-; CHECK-NEXT: lsr w8, w8, #16
-; CHECK-NEXT: lsr w9, w9, #16
+; CHECK-NEXT: rev16 w8, w8
+; CHECK-NEXT: rev16 w9, w9
; CHECK-NEXT: cmp x8, x9
; CHECK-NEXT: b.ne .LBB32_3
; CHECK-NEXT: // %bb.2:
diff --git a/llvm/test/CodeGen/AArch64/merge-trunc-store.ll b/llvm/test/CodeGen/AArch64/merge-trunc-store.ll
index b161d746ad11d5..4fcd030db1bace 100644
--- a/llvm/test/CodeGen/AArch64/merge-trunc-store.ll
+++ b/llvm/test/CodeGen/AArch64/merge-trunc-store.ll
@@ -10,8 +10,7 @@ define void @le_i16_to_i8(i16 %x, ptr %p0) {
;
; BE-LABEL: le_i16_to_i8:
; BE: // %bb.0:
-; BE-NEXT: rev w8, w0
-; BE-NEXT: lsr w8, w8, #16
+; BE-NEXT: rev16 w8, w0
; BE-NEXT: strh w8, [x1]
; BE-NEXT: ret
%sh1 = lshr i16 %x, 8
@@ -31,8 +30,7 @@ define void @le_i16_to_i8_order(i16 %x, ptr %p0) {
;
; BE-LABEL: le_i16_to_i8_order:
; BE: // %bb.0:
-; BE-NEXT: rev w8, w0
-; BE-NEXT: lsr w8, w8, #16
+; BE-NEXT: rev16 w8, w0
; BE-NEXT: strh w8, [x1]
; BE-NEXT: ret
%sh1 = lshr i16 %x, 8
@@ -47,8 +45,7 @@ define void @le_i16_to_i8_order(i16 %x, ptr %p0) {
define void @be_i16_to_i8_offset(i16 %x, ptr %p0) {
; LE-LABEL: be_i16_to_i8_offset:
; LE: // %bb.0:
-; LE-NEXT: rev w8, w0
-; LE-NEXT: lsr w8, w8, #16
+; LE-NEXT: rev16 w8, w0
; LE-NEXT: sturh w8, [x1, #11]
; LE-NEXT: ret
;
@@ -69,8 +66,7 @@ define void @be_i16_to_i8_offset(i16 %x, ptr %p0) {
define void @be_i16_to_i8_order(i16 %x, ptr %p0) {
; LE-LABEL: be_i16_to_i8_order:
; LE: // %bb.0:
-; LE-NEXT: rev w8, w0
-; LE-NEXT: lsr w8, w8, #16
+; LE-NEXT: rev16 w8, w0
; LE-NEXT: strh w8, [x1]
; LE-NEXT: ret
;
@DTeachs Thanks for pointing this out.
Force-pushed from 28aa56d to d6b0448.
Hi - I think this sounds good. There might be a way to handle this where it is performed from the demanded bits of the operation, but that isn't easy to do during selection and performing it at least for the anyext case sounds good for the time being.
I had some comments inline.
N->getOperand(0).getValueType().isScalarInteger() &&
N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
I think this could check N->getOperand(0).getValueType() == MVT::i16. Should it check that the input type is i32 too? Otherwise it would need to check i64 or i32, and add a pattern for i64 (and a test).
Thanks for the suggestion on simplifying to == MVT::i16.
By input type, I assume you mean the input type to the rev16 / output type of the old any_extend? As far as I can tell, this is only ever an i32. If an i64 is needed then a zero_extend is generated instead.
In the absence of writing a test, do you think it would be worthwhile to write a comment explaining this? No changes to the code would need to be made, because I am following your suggestion of using N->getValueType(0) instead of EVT(MVT::i32), which means the logic in the code is independent of whether it is i64 or i32.
To explain further, there are two cases.
Case 1: if the source code uses the result of the __builtin_bswap16 as either an i32 or an i64, then a zext i16 to i32/i64 is inserted in the IR before instruction selection. This patch's optimisation doesn't trigger because in the DAG, this zext becomes a zero_extend, not an any_extend.
Case 2: it's only if the result is used as an i16 that the zext is not inserted. In this case, the any_extend i16 to i32 seems to be inserted later, during the initial building of the DAG from IR (presumably to help with type legalisation later on). The code generator should always insert an any_extend i16 to i32, as there's never any reason for an any_extend i16 to i64. If there were (i.e. we needed the result to be an i64), we would be in case 1, not case 2.
I think Dave's suggestion is to add a check bswap->getOperand(0).getValueType() == MVT::i32, where bswap = N->getOperand(0).

That's a very good point: it guarantees that we are only dealing with REV16 for i32 where we know the upper bits are zero, and it avoids us having to worry about i64s and different patterns.
I'm not sure I understand. bswap's output type is always its input type, so bswap->getOperand(0).getValueType() is always i16, not i32 (assuming we've checked its output type is i16, which we already do). I just verified this by adding a check in the if statement that N->getOperand(0)->getOperand(0).getValueType() == MVT::i32. This makes the test that checks rev16 is generated fail, i.e. we generate the old rev and lsr instead of rev16, because the check evaluates to false.
We guarantee we are only dealing with REV16 for i32 because we insert an any_extend before the REV16, and that any_extend is guaranteed to be extending to i32.
I'm worried about anyext(bswap) coming up for other types. These two look like they should end up with an i64 anyext and an i128 anyext after a little optimization: https://godbolt.org/z/G1Kc9K1cv
OK, got it. I didn't realise that we can also get an any_extend from an optimised zero_extend.

I can, as suggested, avoid this by adding a check that the old any_extend output type / rev16 input type is i32, i.e. add a check that N.getValueType() == MVT::i32.
But it might also be relatively simple to handle the i64 and i128 cases and still get the optimised rev16 codegen. We could insert two any_extends instead of one: one immediately before the rev16 that extends i16 to i32, and one immediately after the rev16 that extends i32 to whatever the value type of the old any_extend was (i32, i64, i128...).

Would you be happy with the second option too?
Yeah, adding i64 sounds like a good idea. We can just add a new tablegen pattern for REV16Xr, if I understand it correctly.

Usually with the way types legalize, if you handle i32 and i64 then the other types will follow suit, as they get legalized to i32/i64 anyway. We just need to make sure we don't generate an i128 AArch64ISD::REV16 that cannot be selected.
N->getOperand(0).getOpcode() == ISD::BSWAP &&
N->getOperand(0).getValueType().isScalarInteger() &&
N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
SDNode *BswapNode = N->getOperand(0).getNode();
Usually we would do SDLoc DL(N); and use that elsewhere in both of the nodes.
By this do you mean just replacing the inline SDLoc(N) in the second DAG.getNode with DL, defined as you suggest?

I'm not sure how else I could use SDLoc DL(N), or how this would replace needing to define BswapNode. The first DAG.getNode uses SDLoc(BswapNode) instead of SDLoc(N), and I also use the definition of BswapNode in the BswapNode->getOperand(0) expression, not just as a location.
Replace SDLoc(BswapNode) with DL in the getNode calls, and define: SDLoc DL(N);
N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
SDNode *BswapNode = N->getOperand(0).getNode();
SDValue NewAnyExtend = DAG.getNode(ISD::ANY_EXTEND, SDLoc(BswapNode),
                                   EVT(MVT::i32), BswapNode->getOperand(0));
EVT(MVT::i32) can just be MVT::i32, or probably just N->getValueType(0).
Force-pushed from d6b0448 to 2d935a9.
@davemgreen @sjoerdmeijer Thanks for the comments, I have pushed changes addressing them.
Thanks. LGTM with an extra comment. You might have to rebase the patch too.
N->getOperand(0).getOpcode() == ISD::BSWAP &&
N->getOperand(0).getValueType() == MVT::i16 &&
(N->getValueType(0) == MVT::i32 || N->getValueType(0) == MVT::i64)) {
SDNode *BswapNode = N->getOperand(0).getNode();
SDValue BSwap = N->getOperand(0);, then it can use BSwap.getOperand(0) below.
Force-pushed from 2d935a9 to bbcebb7.
Force-pushed from bbcebb7 to fd5e47e.
LGTM, thanks.
Thanks for your help with this, Dave!
@adprasad-nvidia Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
[AArch64] Lower __builtin_bswap16 to rev16 if bswap followed by any_extend (llvm#105375)

GCC compiles the built-in function __builtin_bswap16 to the ARM instruction rev16, which reverses the byte order of 16-bit data. Clang, on the other hand, compiles the same built-in function to e.g.

```
rev w8, w0
lsr w0, w8, #16
```

i.e. it performs a byte reversal of a 32-bit register (which moves the lower half, containing the 16-bit data, to the upper half) and then right shifts the reversed 16-bit data back to the lower half of the register. We can improve Clang codegen by generating rev16 instead of rev and lsr, like GCC.