Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU][GFX11] Do not rewrite V_FMA/FMAC_* to V_FMAAK_F16_t16 on operand legalization. #66202

Merged
merged 1 commit into from
Oct 4, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3318,6 +3318,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
if (pseudoToMCOpcode(NewOpc) == -1)
return false;

// V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite
// would also require restricting their register classes. For now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just handle this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is of any practical value. For fake True16 instructions, it seems we already have enough support and generally don't want to invest more in that. For real True16 instructions that new code wouldn't be used anyway; VGPR_16_Lo128 is not even allocatable downstream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the reasoning for real True16 is right, but the conclusion we don't want to shrink is right. VGPR_16_Lo128 (or something derived from it ) would still be used downstream.
dag InOperandList = (ins VSrcT_f16_Lo128_Deferred:$src0, VGPRSrc_16_Lo128:$src1, KImmFP16:$imm);

We don't want to shrink pre-RA because it would restrict the register allocation to only use the bottom 128 vgprs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks irrelevant to me. The comment aims to explain why we bail out for fake True16 and VGPR_32_Lo128, not VGPR_16_Lo128. The instruction in the condition is supposed to be changed to V_FMAAK_F16_fake16 when we have one, and then removed along with the rest of the fake16 support.

// just bail out.
if (NewOpc == AMDGPU::V_FMAAK_F16_t16)
Sisyph marked this conversation as resolved.
Show resolved Hide resolved
return false;

const int64_t Imm = ImmOp->getImm();

// FIXME: This would be a lot easier if we could return a new instruction
Expand Down
99 changes: 95 additions & 4 deletions llvm/test/CodeGen/AMDGPU/fma.f16.ll
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s -check-prefixes=GFX9,GFX9-SDAG
; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s -check-prefixes=GFX9,GFX9-GISEL
; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10,GFX10-SDAG
; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10,GFX10-GISEL
; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9,GFX9-SDAG
; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9,GFX9-GISEL
; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10,GFX10-SDAG
; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10,GFX10-GISEL
; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX11,GFX11-SDAG
; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX11,GFX11-GISEL

declare half @llvm.fma.f16(half, half, half)
declare half @llvm.maxnum.f16(half, half)
Expand All @@ -19,6 +21,12 @@ define half @test_fma(half %x, half %y, half %z) {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: v_fma_f16 v0, v0, v1, v2
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-LABEL: test_fma:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: v_fma_f16 v0, v0, v1, v2
; GFX11-NEXT: s_setpc_b64 s[30:31]
%r = call half @llvm.fma.f16(half %x, half %y, half %z)
ret half %r
}
Expand All @@ -36,6 +44,12 @@ define half @test_fmac(half %x, half %y, half %z) {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: v_fmac_f16_e32 v0, v1, v2
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-LABEL: test_fmac:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: v_fmac_f16_e32 v0, v1, v2
; GFX11-NEXT: s_setpc_b64 s[30:31]
%r = call half @llvm.fma.f16(half %y, half %z, half %x)
ret half %r
}
Expand All @@ -61,6 +75,12 @@ define half @test_fmaak(half %x, half %y, half %z) {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: v_fmaak_f16 v0, v0, v1, 0x4200
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-LABEL: test_fmaak:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: v_fmaak_f16 v0, v0, v1, 0x4200
; GFX11-NEXT: s_setpc_b64 s[30:31]
%r = call half @llvm.fma.f16(half %x, half %y, half 0xH4200)
ret half %r
}
Expand All @@ -86,6 +106,12 @@ define half @test_fmamk(half %x, half %y, half %z) {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: v_fmamk_f16 v0, v0, 0x4200, v2
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-LABEL: test_fmamk:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: v_fmamk_f16 v0, v0, 0x4200, v2
; GFX11-NEXT: s_setpc_b64 s[30:31]
%r = call half @llvm.fma.f16(half %x, half 0xH4200, half %z)
ret half %r
}
Expand Down Expand Up @@ -139,6 +165,33 @@ define i32 @test_D139469_f16(half %arg) {
; GFX10-GISEL-NEXT: s_or_b32 s4, vcc_lo, s4
; GFX10-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, s4
; GFX10-GISEL-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-LABEL: test_D139469_f16:
; GFX11-SDAG: ; %bb.0: ; %bb
; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-SDAG-NEXT: v_mov_b32_e32 v1, 0x211e
; GFX11-SDAG-NEXT: v_mul_f16_e32 v2, 0x291e, v0
; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-NEXT: v_fmac_f16_e32 v1, 0x291e, v0
; GFX11-SDAG-NEXT: v_min_f16_e32 v0, v2, v1
; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v0
; GFX11-SDAG-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc_lo
; GFX11-SDAG-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-GISEL-LABEL: test_D139469_f16:
; GFX11-GISEL: ; %bb.0: ; %bb
; GFX11-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-GISEL-NEXT: s_movk_i32 s0, 0x291e
; GFX11-GISEL-NEXT: v_mul_f16_e32 v1, 0x291e, v0
; GFX11-GISEL-NEXT: v_fmaak_f16 v0, s0, v0, 0x211e
; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-GISEL-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v1
; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s0, 0, v0
; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-GISEL-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX11-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0
; GFX11-GISEL-NEXT: s_setpc_b64 s[30:31]
bb:
%i = fmul contract half %arg, 0xH291E
%i1 = fcmp olt half %i, 0xH0000
Expand Down Expand Up @@ -213,6 +266,44 @@ define <2 x i32> @test_D139469_v2f16(<2 x half> %arg) {
; GFX10-GISEL-NEXT: s_or_b32 s4, s6, s5
; GFX10-GISEL-NEXT: v_cndmask_b32_e64 v1, 0, 1, s4
; GFX10-GISEL-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-LABEL: test_D139469_v2f16:
; GFX11-SDAG: ; %bb.0: ; %bb
; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-SDAG-NEXT: s_movk_i32 s0, 0x211e
; GFX11-SDAG-NEXT: v_pk_mul_f16 v1, 0x291e, v0 op_sel_hi:[0,1]
; GFX11-SDAG-NEXT: v_pk_fma_f16 v0, 0x291e, v0, s0 op_sel_hi:[0,1,0]
; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-NEXT: v_pk_min_f16 v0, v1, v0
; GFX11-SDAG-NEXT: v_lshrrev_b32_e32 v1, 16, v0
; GFX11-SDAG-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v0
; GFX11-SDAG-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc_lo
; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-SDAG-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v1
; GFX11-SDAG-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc_lo
; GFX11-SDAG-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-GISEL-LABEL: test_D139469_v2f16:
; GFX11-GISEL: ; %bb.0: ; %bb
; GFX11-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-GISEL-NEXT: s_mov_b32 s0, 0x291e291e
; GFX11-GISEL-NEXT: v_pk_mul_f16 v1, v0, 0x291e op_sel_hi:[1,0]
; GFX11-GISEL-NEXT: v_pk_fma_f16 v0, v0, s0, 0x211e op_sel_hi:[1,1,0]
; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-GISEL-NEXT: v_lshrrev_b32_e32 v2, 16, v1
; GFX11-GISEL-NEXT: v_lshrrev_b32_e32 v3, 16, v0
; GFX11-GISEL-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v1
; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s0, 0, v0
; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s1, 0, v2
; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s2, 0, v3
; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-GISEL-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX11-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0
; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-GISEL-NEXT: s_or_b32 s0, s1, s2
; GFX11-GISEL-NEXT: v_cndmask_b32_e64 v1, 0, 1, s0
; GFX11-GISEL-NEXT: s_setpc_b64 s[30:31]
bb:
%i = fmul contract <2 x half> %arg, <half 0xH291E, half 0xH291E>
%i1 = fcmp olt <2 x half> %i, <half 0xH0000, half 0xH0000>
Expand Down