-
Notifications
You must be signed in to change notification settings - Fork 12.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[X86][NVPTX][LegalizeDAG] If i16 legal, legalize FABS/FNEG/FCOPYSIGN (f16) with Expand #106153
base: main
Are you sure you want to change the base?
[X86][NVPTX][LegalizeDAG] If i16 legal, legalize FABS/FNEG/FCOPYSIGN (f16) with Expand #106153
Conversation
@llvm/pr-subscribers-backend-nvptx @llvm/pr-subscribers-backend-x86 Author: None (v01dXYZ) ChangesFor the concerned targets (X86, NVPTX), Full diff: https://github.com/llvm/llvm-project/pull/106153.diff 6 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 73791102fc04de..5b0a87d5f150f8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -849,7 +849,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
AddPromotedToType(Op, MVT::bf16, MVT::f32);
}
for (const auto &Op : {ISD::FABS}) {
- setOperationAction(Op, MVT::f16, Promote);
+ // Expand instead of Promote to clear sign bit by bitcasting to i16
+ setOperationAction(Op, MVT::f16, Expand);
setOperationAction(Op, MVT::f32, Legal);
setOperationAction(Op, MVT::f64, Legal);
setOperationAction(Op, MVT::v2f16, Expand);
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1a6be4eb5af1ef..274a03b6664736 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -599,7 +599,8 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f64, Legal);
auto setF16Action = [&] (MVT VT, LegalizeAction Action) {
- setOperationAction(ISD::FABS, VT, Action);
+ // Expand instead of Promote to clear sign bit by bitcasting to i16.
+ setOperationAction(ISD::FABS, VT, Expand);
setOperationAction(ISD::FNEG, VT, Action);
setOperationAction(ISD::FCOPYSIGN, VT, Expand);
setOperationAction(ISD::FREM, VT, Action);
diff --git a/llvm/test/CodeGen/NVPTX/f16-instructions.ll b/llvm/test/CodeGen/NVPTX/f16-instructions.ll
index 14e02a49f6e5e4..cfabf1d6639c2a 100644
--- a/llvm/test/CodeGen/NVPTX/f16-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/f16-instructions.ll
@@ -981,14 +981,10 @@ define half @test_fma(half %a, half %b, half %c) #0 {
}
; CHECK-LABEL: test_fabs(
-; CHECK: ld.param.b16 [[A:%rs[0-9]+]], [test_fabs_param_0];
-; CHECK-NOFTZ: cvt.f32.f16 [[AF:%f[0-9]+]], [[A]];
-; CHECK-NOFTZ: abs.f32 [[RF:%f[0-9]+]], [[AF]];
-; CHECK-F16-FTZ: cvt.ftz.f32.f16 [[AF:%f[0-9]+]], [[A]];
-; CHECK-F16-FTZ: abs.ftz.f32 [[RF:%f[0-9]+]], [[AF]];
-; CHECK: cvt.rn.f16.f32 [[R:%rs[0-9]+]], [[RF]];
-; CHECK: st.param.b16 [func_retval0+0], [[R]];
-; CHECK: ret;
+; CHECK: ld.param.b16 [[A:%rs[0-9]+]], [test_fabs_param_0];
+; CHECK: and.b16 [[RF:%rs[0-9]+]], [[A]], 32767;
+; CHECK: st.param.b16 [func_retval0+0], [[RF]];
+; CHECK: ret;
define half @test_fabs(half %a) #0 {
%r = call half @llvm.fabs.f16(half %a)
ret half %r
diff --git a/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
index 464b3a754804fe..c2189a20c98c25 100644
--- a/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
@@ -1183,17 +1183,13 @@ define <2 x half> @test_fma(<2 x half> %a, <2 x half> %b, <2 x half> %c) #0 {
}
; CHECK-LABEL: test_fabs(
-; CHECK: ld.param.b32 [[A:%r[0-9]+]], [test_fabs_param_0];
-; CHECK: mov.b32 {[[A0:%rs[0-9]+]], [[A1:%rs[0-9]+]]}, [[A]]
-; CHECK-DAG: cvt.f32.f16 [[AF0:%f[0-9]+]], [[A0]];
-; CHECK-DAG: cvt.f32.f16 [[AF1:%f[0-9]+]], [[A1]];
-; CHECK-DAG: abs.f32 [[RF0:%f[0-9]+]], [[AF0]];
-; CHECK-DAG: abs.f32 [[RF1:%f[0-9]+]], [[AF1]];
-; CHECK-DAG: cvt.rn.f16.f32 [[R0:%rs[0-9]+]], [[RF0]];
-; CHECK-DAG: cvt.rn.f16.f32 [[R1:%rs[0-9]+]], [[RF1]];
-; CHECK: mov.b32 [[R:%r[0-9]+]], {[[R0]], [[R1]]}
-; CHECK: st.param.b32 [func_retval0+0], [[R]];
-; CHECK: ret;
+; CHECK: ld.param.b32 [[A:%r[0-9]+]], [test_fabs_param_0];
+; CHECK: mov.b32 {[[A0:%rs[0-9]+]], [[A1:%rs[0-9]+]]}, [[A]]
+; CHECK: and.b16 [[A2:%rs[0-9]+]], [[A1]], 32767;
+; CHECK: and.b16 [[A3:%rs[0-9]+]], [[A0]], 32767;
+; CHECK: mov.b32 [[B:%r[0-9]+]], {[[A3]], [[A2]]};
+; CHECK: st.param.b32 [func_retval0+0], [[B]];
+; CHECK: ret;
define <2 x half> @test_fabs(<2 x half> %a) #0 {
%r = call <2 x half> @llvm.fabs.f16(<2 x half> %a)
ret <2 x half> %r
diff --git a/llvm/test/CodeGen/X86/fp16-libcalls.ll b/llvm/test/CodeGen/X86/fp16-libcalls.ll
index db3d031a8fe3fb..ee342f22eeff66 100644
--- a/llvm/test/CodeGen/X86/fp16-libcalls.ll
+++ b/llvm/test/CodeGen/X86/fp16-libcalls.ll
@@ -124,11 +124,7 @@ define void @test_half_fabs(half %a0, ptr %p0) nounwind {
; F16C-LABEL: test_half_fabs:
; F16C: # %bb.0:
; F16C-NEXT: vpextrw $0, %xmm0, %eax
-; F16C-NEXT: vmovd %eax, %xmm0
-; F16C-NEXT: vcvtph2ps %xmm0, %xmm0
-; F16C-NEXT: vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; F16C-NEXT: vcvtps2ph $4, %xmm0, %xmm0
-; F16C-NEXT: vmovd %xmm0, %eax
+; F16C-NEXT: andl $32767, %eax # imm = 0x7FFF
; F16C-NEXT: movw %ax, (%rdi)
; F16C-NEXT: retq
;
@@ -141,34 +137,17 @@ define void @test_half_fabs(half %a0, ptr %p0) nounwind {
;
; X64-LABEL: test_half_fabs:
; X64: # %bb.0:
-; X64-NEXT: pushq %rbx
-; X64-NEXT: movq %rdi, %rbx
-; X64-NEXT: callq __extendhfsf2@PLT
-; X64-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
-; X64-NEXT: callq __truncsfhf2@PLT
; X64-NEXT: pextrw $0, %xmm0, %eax
-; X64-NEXT: movw %ax, (%rbx)
-; X64-NEXT: popq %rbx
+; X64-NEXT: andl $32767, %eax # imm = 0x7FFF
+; X64-NEXT: movw %ax, (%rdi)
; X64-NEXT: retq
;
; X86-LABEL: test_half_fabs:
; X86: # %bb.0:
-; X86-NEXT: pushl %esi
-; X86-NEXT: subl $8, %esp
-; X86-NEXT: pinsrw $0, {{[0-9]+}}(%esp), %xmm0
-; X86-NEXT: movl {{[0-9]+}}(%esp), %esi
-; X86-NEXT: pextrw $0, %xmm0, %eax
-; X86-NEXT: movw %ax, (%esp)
-; X86-NEXT: calll __extendhfsf2
-; X86-NEXT: fstps {{[0-9]+}}(%esp)
-; X86-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
-; X86-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0
-; X86-NEXT: movd %xmm0, (%esp)
-; X86-NEXT: calll __truncsfhf2
-; X86-NEXT: pextrw $0, %xmm0, %eax
-; X86-NEXT: movw %ax, (%esi)
-; X86-NEXT: addl $8, %esp
-; X86-NEXT: popl %esi
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: movzwl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT: andl $32767, %ecx # imm = 0x7FFF
+; X86-NEXT: movw %cx, (%eax)
; X86-NEXT: retl
%res = call half @llvm.fabs.half(half %a0)
store half %res, ptr %p0, align 2
diff --git a/llvm/test/CodeGen/X86/half.ll b/llvm/test/CodeGen/X86/half.ll
index 9f01d07e6a6705..6838925240058e 100644
--- a/llvm/test/CodeGen/X86/half.ll
+++ b/llvm/test/CodeGen/X86/half.ll
@@ -1059,7 +1059,6 @@ define void @main.158() #0 {
; CHECK-LIBCALL: # %bb.0: # %entry
; CHECK-LIBCALL-NEXT: pushq %rax
; CHECK-LIBCALL-NEXT: xorps %xmm0, %xmm0
-; CHECK-LIBCALL-NEXT: callq __truncsfhf2@PLT
; CHECK-LIBCALL-NEXT: callq __extendhfsf2@PLT
; CHECK-LIBCALL-NEXT: movss {{.*#+}} xmm1 = [8.0E+0,0.0E+0,0.0E+0,0.0E+0]
; CHECK-LIBCALL-NEXT: ucomiss %xmm0, %xmm1
@@ -1077,10 +1076,10 @@ define void @main.158() #0 {
; BWON-F16C-LABEL: main.158:
; BWON-F16C: # %bb.0: # %entry
; BWON-F16C-NEXT: vxorps %xmm0, %xmm0, %xmm0
-; BWON-F16C-NEXT: vcvtps2ph $4, %xmm0, %xmm1
-; BWON-F16C-NEXT: vcvtph2ps %xmm1, %xmm1
-; BWON-F16C-NEXT: vmovss {{.*#+}} xmm2 = [8.0E+0,0.0E+0,0.0E+0,0.0E+0]
-; BWON-F16C-NEXT: vucomiss %xmm1, %xmm2
+; BWON-F16C-NEXT: vcvtph2ps %xmm0, %xmm0
+; BWON-F16C-NEXT: vmovss {{.*#+}} xmm1 = [8.0E+0,0.0E+0,0.0E+0,0.0E+0]
+; BWON-F16C-NEXT: vucomiss %xmm0, %xmm1
+; BWON-F16C-NEXT: vxorps %xmm0, %xmm0, %xmm0
; BWON-F16C-NEXT: jae .LBB20_2
; BWON-F16C-NEXT: # %bb.1: # %entry
; BWON-F16C-NEXT: vmovss {{.*#+}} xmm0 = [NaN,0.0E+0,0.0E+0,0.0E+0]
@@ -1093,8 +1092,7 @@ define void @main.158() #0 {
; CHECK-I686-LABEL: main.158:
; CHECK-I686: # %bb.0: # %entry
; CHECK-I686-NEXT: subl $12, %esp
-; CHECK-I686-NEXT: movl $0, (%esp)
-; CHECK-I686-NEXT: calll __truncsfhf2
+; CHECK-I686-NEXT: pxor %xmm0, %xmm0
; CHECK-I686-NEXT: pextrw $0, %xmm0, %eax
; CHECK-I686-NEXT: movw %ax, (%esp)
; CHECK-I686-NEXT: calll __extendhfsf2
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for X86 change, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
; CHECK: mov.b32 {[[A0:%rs[0-9]+]], [[A1:%rs[0-9]+]]}, [[A]] | ||
; CHECK: and.b16 [[A2:%rs[0-9]+]], [[A1]], 32767; | ||
; CHECK: and.b16 [[A3:%rs[0-9]+]], [[A0]], 32767; | ||
; CHECK: mov.b32 [[B:%r[0-9]+]], {[[A3]], [[A2]]}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this could be improved further by performing and
on the i32 directly. I suspect ptxas may do that for us.
Maybe add a TODO note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a TODO is fine, but that we should not rely on ptxas
doing it for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LegalizeType would type-expand FABS
(<2 x f16>
).
The DAG just before instruction selection:
Optimized legalized selection DAG: %bb.0 'test_fabs:'
SelectionDAG has 21 nodes:
t0: ch,glue = EntryToken
t14: v2f16,ch = load<(dereferenceable invariant load (s32) from `ptr addrspace(101) null`, addrspace 101)> t0, TargetExternalSymbol:i64'test_fabs_param_0', undef:i64
t8: ch = CopyToReg t0, Register:v2f16 %0, t14
t16: f16 = extract_vector_elt t14, Constant:i64<0>
t22: i16 = bitcast t16
t24: i16 = and t22, Constant:i16<32767>
t25: f16 = bitcast t24
t19: f16 = extract_vector_elt t14, Constant:i64<1>
t26: i16 = bitcast t19
t27: i16 = and t26, Constant:i16<32767>
t28: f16 = bitcast t27
t21: v2f16 = BUILD_VECTOR t25, t28
t11: ch = NVPTXISD::StoreRetval<(store (s32), align 1)> t8, Constant:i32<0>, t21
t12: ch = NVPTXISD::RET_GLUE t11
Maybe detect a composition of bitwise operations on each element and merge it into a composition on the bitcasted vector itself (starting from BUILD_VECTOR
).
BTW that's also the case with x86
:
define <2 x half> @test_fabs(<2 x half> %a) #0 {
%r = call <2 x half> @llvm.fabs.f16(<2 x half> %a)
ret <2 x half> %r
}
.text
.file "fabs_nvptx.ll"
.globl test_fabs # -- Begin function test_fabs
.p2align 4, 0x90
.type test_fabs,@function
test_fabs: # @test_fabs
.cfi_startproc
# %bb.0:
pextrw $0, %xmm0, %eax
psrld $16, %xmm0
pextrw $0, %xmm0, %ecx
andl $32767, %ecx # imm = 0x7FFF
pinsrw $0, %ecx, %xmm1
andl $32767, %eax # imm = 0x7FFF
pinsrw $0, %eax, %xmm0
punpcklwd %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
retq
.Lfunc_end0:
.size test_fabs, .Lfunc_end0-test_fabs
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits
But AArch64
seems not to do so (there is a combiner that combine the build_vector into a concat_vector ..., undef
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not get sidetracked. This is something for a separate patch or github issue, and would be best discussed there.
A comment highlighting a future optimization opportunity is all that we need in this patch.
@@ -849,7 +849,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM, | |||
AddPromotedToType(Op, MVT::bf16, MVT::f32); | |||
} | |||
for (const auto &Op : {ISD::FABS}) { | |||
setOperationAction(Op, MVT::f16, Promote); | |||
// Expand instead of Promote to clear sign bit by bitcasting to i16 | |||
setOperationAction(Op, MVT::f16, Expand); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would operations other than FABS benefit from that, too? E.g. fneg
or fcopysign
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a valid point.
For NVPTX
, we have for f16
(so no change required)
FNEG
Legal / ExpandFCOPYSIGN
Expand
For X86-64
, we have for f16
FNEG
PromoteFCOPYSIGN
Expand
87097a5
to
00d591c
Compare
@phoebewang I added |
// Expand instead of Promote to clear/flip/copy sign bit by bitcasting to | ||
// i16. | ||
setOperationAction(ISD::FABS, MVT::f16, Expand); | ||
setOperationAction(ISD::FNEG, MVT::f16, Expand); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X86 could use test cases for fneg/fcopysign.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !
00d591c
to
ff5805c
Compare
@@ -601,7 +601,7 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM, | |||
auto setF16Action = [&] (MVT VT, LegalizeAction Action) { | |||
setOperationAction(ISD::FABS, VT, Action); | |||
setOperationAction(ISD::FNEG, VT, Action); | |||
setOperationAction(ISD::FCOPYSIGN, VT, Expand); | |||
setOperationAction(ISD::FCOPYSIGN, VT, Action); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason to change this? And why don't set above Expand here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it would create even more special cases that don't reflect the function name.
The function is called elsewhere with Expand
unless it's f16
.
That's just for consistency, I could do it here if it's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called not only by scalar f16
but also both vector FP16 and BF16, though we don't have tests to cover them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When it is for a vector FP/BF, it is always called with Expand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, then it's fine for me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait.. should we remove it from setF16Action
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait.. should we remove it from
setF16Action
instead?
Sorry, I misunderstood. No problem then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can choose b/w:
- isolate those Opcodes into another function.
- keep them and overwrite the actions just after calling
setF16Action
if they are the right actions (which is the path I've taken, and that's what's in the current code withMVT::v16f16
for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking you meant FCOPYSIGN
is set Expand by default, then I got you mean it's all set Expand in X86ISelLowering.cpp.
Just have a double check, it does set Expand by default, see
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/TargetLoweringBase.cpp#L775
So we can simply remove it from setF16Action
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. The less code the better.
f38c1c9
to
04e7cea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please can you rebase? I've already added additional test coverage to fp16-libcalls.ll, including copysign/fabs/fneg etc.
…(f16) with Expand For the concerned targets, `Expand` clears/flips/copies the sign bit after bitcasting to i16. This is cheaper than `Promote` that requires converting back and forth from f16 to f32 (with possibly expensive libcalls) in order to call FABS/FNEG/FCOPYSIGN (f32). Reminder: By default, FCOPYSIGN action is Expand.
04e7cea
to
d8a6a62
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - cheers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still papering over a bug - the promote action is just busted and needs to be replaced or removed
Yes, promote seems to be overkill for such operations that could be implemented as bitwise ops after move back and forth to a GPR.
The weird case would be an arch having f16 registers but only GPR registers that are strictly smaller than i16. |
The problem isn't that it's overkill, it's that it is wrong. You cannot have the canonicalization effects of the fpextend.
This is more of a ridiculous DAG design issue; you should be able to create the bitcast and have later legalization steps deal with it. Failing that, you could do a replacement bitcast by doing store to stack and reload |
Indeed, LangRef clearly mentions the fact that only the bit sign is changed with fabs/fneg/fcopysign but says nothing about that with fpext/fptrunc. |
For the concerned targets (X86, NVPTX),
Expand
clears the sign bit after bitcasting to i16. This is cheaper thanPromote
that requires converting back and forth from f16 to f32 (with possibly expensive libcalls) in order to call FABS.f32Fixes #104915.