DPP combining is not working for addition/subtraction in certain cases [HCC, HIP, LLVM] #1233
Comments
Few more cases:

```cpp
#include <hc.hpp>

int main()
{
    hc::array_view<int> data(1);
    parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
    {
        int __amdgcn_update_dpp(int old, int src, int dpp_ctrl, int row_mask, int bank_mask, bool bound_ctrl) [[hc]] asm("llvm.amdgcn.update.dpp.i32");
        int d = data[i[0]];
        // old = 1 (identity of multiplication), dpp_ctrl = 1 (quad_perm:[1,0,0,0]),
        // row_mask = 14 (0xe), bank_mask = 15 (0xf)
        d = hc::__mul24(__amdgcn_update_dpp(1, d, 1, 14, 15, false), d);
        data[i[0]] = d;
    });
    return 0;
}
```

Output (the `v_mov_b32_dpp` is not folded into the multiply):

```
v_mov_b32_dpp v2, v3 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
v_mul_i32_i24_e32 v2, v2, v3
```

Although XOR is normally combined, it is not combined in this variant (which adds an `s_nop` and reads `data[0]`):

```cpp
#include <hc.hpp>

int main()
{
    hc::array_view<int> data(1);
    parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
    {
        asm("s_nop 0");
        int __amdgcn_update_dpp(int old, int src, int dpp_ctrl, int row_mask, int bank_mask, bool bound_ctrl) [[hc]] asm("llvm.amdgcn.update.dpp.i32");
        int d = data[0];
        // old = 0 (identity of XOR), same DPP controls as above
        d = __amdgcn_update_dpp(0, d, 1, 14, 15, false) ^ d;
        data[i[0]] = d;
    });
    return 0;
}
```

```
v_mov_b32_dpp v4, v2 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
v_xor_b32_e32 v2, v4, v2
```
@b-sumner for awareness.
This also affects HIP, so maybe I should move this issue there. Actually, it is rather an LLVM issue. I put it here because I encountered it with HCC. Many issues here in HCC also apply to HIP.
dump-gfx900.isa:
This is probably happening because, when the "GCN DPP Combine" pass runs, the `v_add` instruction is still in the form `V_ADD_U32_e64` with an immediate value of 0, and it seems the DPP combine pass will not combine it when the row and bank masks are not full:

Only later is `v_add` changed to `V_ADD_U32_e32`, and in that form DPP combining would work, as shown below with dpp_combine.mir.
But this works:
When I change the operation to, for example, xor or max:
then DPP combining works:
XOR and max work because the `old` argument to `llvm.amdgcn.update.dpp` is the identity for the respective operation. When the source lane is out of bounds or the lane is disabled by the row or bank mask, `__amdgcn_update_dpp` "returns" the identity, so the xor/max operation is a no-op; hence `v_mov_dpp` can be combined with `v_xor` into `v_xor_dpp`, which behaves equivalently. For addition the identity is zero, so it should work as well.
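To make the identity argument concrete, here is a minimal CPU-side model (not device code, and not the LLVM pass itself) comparing the unfused `v_mov_b32_dpp` + op pair with a combined `*_dpp` op. It assumes a 64-lane wavefront with every lane active, handles only `quad_perm` controls (which never read outside the quad, so `bound_ctrl` plays no role), uses a simplified reading of `row_mask`/`bank_mask`, and assumes the combined instruction leaves lanes disabled by the masks holding the plain operand `d`. All names (`unfused`, `fused`, `dpp_lane_enabled`, `quad_perm_source`, `sample_wave`, `same`) are hypothetical.

```cpp
#include <array>
#include <functional>

// Hypothetical CPU model of one 64-lane wavefront (all lanes active in EXEC).
constexpr int kLanes = 64;
using Wave = std::array<int, kLanes>;

// Simplified row_mask/bank_mask check: row r = lanes 16r..16r+15,
// bank b = lanes 4b..4b+3 within each row.
constexpr bool dpp_lane_enabled(int lane, unsigned row_mask, unsigned bank_mask) {
    const int row = lane / 16;
    const int bank = (lane % 16) / 4;
    return ((row_mask >> row) & 1u) != 0 && ((bank_mask >> bank) & 1u) != 0;
}

// Source lane picked by quad_perm:[p0,p1,p2,p3]; it never leaves the quad.
constexpr int quad_perm_source(int lane, const std::array<int, 4>& perm) {
    return (lane & ~3) + perm[lane % 4];
}

// Unfused pair: v_mov_b32_dpp (disabled lanes receive `old`), then op(moved, d).
template <typename Op>
constexpr Wave unfused(const Wave& d, int old, const std::array<int, 4>& perm,
                       unsigned row_mask, unsigned bank_mask, Op op) {
    Wave out{};
    for (int lane = 0; lane < kLanes; ++lane) {
        const int moved = dpp_lane_enabled(lane, row_mask, bank_mask)
                              ? d[quad_perm_source(lane, perm)]
                              : old;
        out[lane] = op(moved, d[lane]);
    }
    return out;
}

// Assumed behaviour of the combined op_dpp: disabled lanes are left holding d.
template <typename Op>
constexpr Wave fused(const Wave& d, const std::array<int, 4>& perm,
                     unsigned row_mask, unsigned bank_mask, Op op) {
    Wave out{};
    for (int lane = 0; lane < kLanes; ++lane) {
        out[lane] = dpp_lane_enabled(lane, row_mask, bank_mask)
                        ? op(d[quad_perm_source(lane, perm)], d[lane])
                        : d[lane];
    }
    return out;
}

// Sample input: lane i holds i + 1.
constexpr Wave sample_wave() {
    Wave w{};
    for (int lane = 0; lane < kLanes; ++lane) w[lane] = lane + 1;
    return w;
}

// Element-wise comparison (std::array's operator== is only constexpr in C++20).
constexpr bool same(const Wave& a, const Wave& b) {
    for (int lane = 0; lane < kLanes; ++lane)
        if (a[lane] != b[lane]) return false;
    return true;
}

// quad_perm:[1,0,0,0] with row_mask:0xe bank_mask:0xf, as in the listings above.
constexpr std::array<int, 4> kPerm{1, 0, 0, 0};

// Folding is equivalent when `old` is the identity of the operation ...
static_assert(same(unfused(sample_wave(), 0, kPerm, 0xe, 0xf, std::bit_xor<int>{}),
                   fused(sample_wave(), kPerm, 0xe, 0xf, std::bit_xor<int>{})), "xor");
static_assert(same(unfused(sample_wave(), 0, kPerm, 0xe, 0xf, std::plus<int>{}),
                   fused(sample_wave(), kPerm, 0xe, 0xf, std::plus<int>{})), "add");
// ... and not otherwise: the masked row 0 would get 1 + d instead of d.
static_assert(!same(unfused(sample_wave(), 1, kPerm, 0xe, 0xf, std::plus<int>{}),
                    fused(sample_wave(), kPerm, 0xe, 0xf, std::plus<int>{})), "add, old=1");
```

The checks are `static_assert`s, so they are evaluated at compile time (C++17 or later). With `old = 0`, XOR and addition agree between the two forms; with `old = 1` for addition, the 16 lanes of row 0 differ, which is why `old` has to be the operation's identity for the fold to be valid.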
The test "old_is_0" in dpp_combine.mir demonstrates this: https://github.com/llvm/llvm-project/blob/master/llvm/test/CodeGen/AMDGPU/dpp_combine.mir
Result from `/opt/rocm/hcc/bin/llc -march=amdgcn -mcpu=gfx900 -run-pass=gcn-dpp-combine`:

By the way, this combining also happens on gfx803, where `v_add` modifies `vcc`, if I am not mistaken. That probably does not matter if the `vcc` produced by this `v_add` is not used. However, it does not seem to work when compiling from LLVM IR.
Also, `llvm.amdgcn.update.dpp` is not an ideal solution: when I want to implement, for example, a parallel reduction that takes the binary operation as a template argument, I also need to define an identity value for every possible binary operation. Ideally, it should be easier to generate `_dpp` instructions without needing an identity.
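One way to express the per-operation bookkeeping described above is an identity trait that a templated reduction consults. The sketch below is a CPU model in plain C++17, not HCC device code: `dpp_identity`, `max_op` and `quad_allreduce` are hypothetical names, the reduction covers a single quad of 4 lanes, and inactive lanes are modelled by substituting the identity, mirroring what the `old` operand of `llvm.amdgcn.update.dpp` does for masked lanes in the workaround above.

```cpp
#include <array>
#include <climits>
#include <functional>

// Hypothetical trait: the identity element a reduction operator needs.
// The primary template is left undefined so unsupported operators fail to compile.
template <typename Op> struct dpp_identity;
template <> struct dpp_identity<std::plus<int>>       { static constexpr int value = 0; };
template <> struct dpp_identity<std::bit_xor<int>>    { static constexpr int value = 0; };
template <> struct dpp_identity<std::bit_or<int>>     { static constexpr int value = 0; };
template <> struct dpp_identity<std::bit_and<int>>    { static constexpr int value = -1; };  // all bits set
template <> struct dpp_identity<std::multiplies<int>> { static constexpr int value = 1; };

// Signed maximum; the standard library has no function object for it.
struct max_op {
    constexpr int operator()(int a, int b) const { return a > b ? a : b; }
};
template <> struct dpp_identity<max_op> { static constexpr int value = INT_MIN; };

// CPU model of an all-reduce over one quad. Lanes not set in active_mask
// contribute the identity, so they cannot change the result.
template <typename Op>
constexpr std::array<int, 4> quad_allreduce(std::array<int, 4> v, unsigned active_mask) {
    const Op op{};
    for (int lane = 0; lane < 4; ++lane)
        if (((active_mask >> lane) & 1u) == 0) v[lane] = dpp_identity<Op>::value;
    // Butterfly steps: partner lane ^ 1, then lane ^ 2
    // (the lanes quad_perm:[1,0,3,2] and quad_perm:[2,3,0,1] would select).
    for (int offset = 1; offset < 4; offset <<= 1) {
        std::array<int, 4> next{};
        for (int lane = 0; lane < 4; ++lane) next[lane] = op(v[lane ^ offset], v[lane]);
        v = next;
    }
    return v;
}
```

Supporting another operator means adding one more `dpp_identity` specialization; forgetting it is a compile-time error because the primary template is left undefined. After the two steps every lane of the quad holds the reduction over the active lanes.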