[AMD] inThreadTranspose: Transpose between global load and local store for non-TN layouts: part 1 of 4 #5148
Conversation
```cpp
// traverse 2nd dimension (K-dim in GEMM case)
int dim = order[1];
for (int basis = 1; basis < shape[dim]; basis <<= 1) {
  bases.push_back({0, basis});
}
// traverse 1st dimension (N-dim in GEMM K-major B-tensor)
// this is the consecutive dimension loaded from global memory
dim = order[0];
for (int basis = 1; basis < shape[dim]; basis <<= 1) {
  bases.push_back({basis, 0});
}
auto dimMinor = "dim" + std::to_string(order[0]);
auto dimMajor = "dim" + std::to_string(order[1]);
StringAttr kDimMinor = S(dimMinor);
StringAttr kDimMajor = S(dimMajor);
ret = LinearLayout(
    {{kRegister, bases}},
    {{kDimMinor, shape[order[0]]}, {kDimMajor, shape[order[1]]}}, false);
```
You don't need to "unroll" the loop. Try to do it in a similar way as in identityND
I think the problem here is that if I fuse the loops, the basis I push_back into the `bases` vector becomes awkward. Notice that it is `{0, basis}` in the first loop and `{basis, 0}` in the second loop. Technically I can still fuse the loops together, but keeping it as-is, instead of adding a conditional inside a single for-loop, may be more readable.
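For reference, the fused version under discussion might look like the following (a sketch only, reusing `bases`, `shape`, and `order` from the snippet above):

```cpp
// Sketch: one loop over both dimensions; order[1] (the K-dim) is
// traversed first to match the unrolled version above.
for (int i = 1; i >= 0; --i) {
  int dim = order[i];
  for (int basis = 1; basis < shape[dim]; basis <<= 1) {
    // {0, basis} for the 2nd dimension, {basis, 0} for the 1st
    bases.push_back(i == 1 ? std::vector<int32_t>{0, basis}
                           : std::vector<int32_t>{basis, 0});
  }
}
```

The ternary on every push is exactly the readability cost being weighed here.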
```cpp
auto numMaxIters = elemsTotal / elemsPerIter;
auto bitwidth = tensorTy.getElementType().getIntOrFloatBitWidth();
// Currently the widest is set to ds_write_b64
auto newKOuterDim = std::min(numMaxIters, 64 / bitwidth);
```
Why can't we do ds_write_b128?
tbh I just fixed this at 64 bits initially and never changed it afterwards. I can experiment with this later, once all the PRs are in, to see whether relaxing it to 128 bits is better, and change it accordingly.
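For illustration, the cap itself would be a one-line change (a sketch, assuming nothing else in the pass hard-codes the 64-bit width):

```cpp
// Widest LDS write considered by the pass, in bits. 64 matches
// ds_write_b64; 128 would correspond to ds_write_b128. Whether 128
// actually helps is the open experiment discussed above.
constexpr int maxDsWriteBits = 64; // candidate: 128
auto newKOuterDim = std::min(numMaxIters, maxDsWriteBits / bitwidth);
```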
```cpp
    for (auto loadOp : loadOps)
      convertLayout(newBlockedEnc, (Operation *)loadOp);
  } else {
    LDBG("opB is K-inner and nothing to be done");
```
The code for opA and opB is duplicated. Can you try to merge them?
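One possible shape for the merge (a hypothetical sketch; `processDotOperand` and its parameters are illustrative stand-ins, not code from this PR):

```cpp
// Hypothetical helper so the opA and opB paths share one body.
// isKMajor, newBlockedEnc, loadOps, convertLayout, and LDBG mirror
// names visible in the surrounding code but are assumptions here.
static void processDotOperand(llvm::StringRef name, bool isKMajor,
                              mlir::Attribute newBlockedEnc,
                              llvm::ArrayRef<mlir::Operation *> loadOps) {
  if (!isKMajor) {
    LDBG(name << " is K-inner and nothing to be done");
    return;
  }
  for (mlir::Operation *loadOp : loadOps)
    convertLayout(newBlockedEnc, loadOp);
}
```

Called once with the A-operand state and once with the B-operand state, this would remove the duplication.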
Please see my inline comments.
When you address review comments, make sure to submit new commits rather than rebase. Otherwise the comments are gone.
```cpp
// in-thread, we'd want to iterate over the 2nd order, i.e. dim of 4, so that
// we can pack the element of 4 into a single vector, and AMD backend LLVM
// compiler will pack elements into consecutive VGPR and therefore achieve high
// vectorization LDS write and also keep data in K-minor inside LDS.
```
We should not say "high vectorization LDS write", because the old way, i.e. traversing the dim of 8, has a higher ds_write vector size.
So the ds_write can have a short or long vector size, but we are guaranteed to have a vectorized ds_read.
How about changing this:
"and AMD backend LLVM compiler will pack elements into consecutive VGPR and therefore achieve high vectorization LDS write and also keep data in K-minor inside LDS."
to:
"and AMD backend LLVM compiler will pack elements into consecutive VGPRs to write data contiguous in the K dimension into LDS. This way we guarantee a vectorized ds_read, while ds_write can be vectorized to 64-bit or 32-bit depending on the block size and number of warps."
Sounds good
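To make the traversal concrete, here is a toy model of the regrouping the comment describes (a sketch, not the pass itself; it assumes a 4x8 per-thread tile of 16-bit elements, where 8 is the contiguous global-load dimension and 4 is K):

```cpp
#include <cstdint>

constexpr int kK = 4; // elements packed along K for the LDS write
constexpr int kN = 8; // contiguous elements per global_load

// Regroup a per-thread tile so each LDS store is a kK-wide vector
// that is contiguous along K.
void regroupForLdsWrite(const uint16_t in[kK][kN], uint16_t out[kN][kK]) {
  for (int n = 0; n < kN; ++n)
    for (int k = 0; k < kK; ++k)
      out[n][k] = in[k][n]; // the AMDGCN backend lowers this kind of
                            // register shuffle to v_perm_b32
}
```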
LGTM.
@antiagainst please take a look
inThreadTranspose: part 1 of 4
Introduction
This PR introduces the AMD-specific inThreadTranspose feature to improve shared memory access efficiency for non-TN GEMM and the 2nd dotOp in Flash Attention.
The entire feature has been broken into 4 pieces for more reliable integration, and this PR is the 1st of 4.
Feature description
Currently on AMD hardware, if the dot operand is K-major, we'd use the same vectorization for `ds_write` as for `global_load`, but won't coalesce on `ds_read`, resulting in poor shared memory/LDS read efficiency prior to the MFMA operation.

This feature, inThreadTranspose, groups multiple `global_load`s together and packs vectors across the grain to write to LDS with vectorization, so that when the matrix is written into LDS it is already consecutive in the K dimension, and vectorized `ds_read` is therefore also enabled. This is achieved via the `v_perm_b32` assembly instruction in AMDGCN, which allows independent registers to become contiguous in VGPR space so that we can write them together into LDS.

PR description
The 1st of 4, i.e. this PR, introduces the TTGIR pass to pack multiple `global_load`s together, so that in C = A @ B, where B is K-major, its blocked layout is changed from an encoding whose `sizePerThread` spans a single dimension to one whose `sizePerThread` spans two, as illustrated below.
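As a purely hypothetical illustration of the shape of this change (made-up values in TTGIR blocked-encoding syntax, not the encodings from this PR):

```mlir
// Hypothetical illustration only.
// Before: each thread holds 8 contiguous elements in one dimension.
#blocked  = #triton_gpu.blocked<{sizePerThread = [1, 8], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
// After: each thread also holds 4 elements along K, enabling the
// in-register transpose before the LDS write.
#blocked1 = #triton_gpu.blocked<{sizePerThread = [4, 8], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
```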
To accompany this update, the following changes are also added:

- With `sizePerThread` extended from 1D to 2D, there will be 2 ways to convert it to a linear layout. We want to keep the original way, i.e. along the dimension of 8 in the example above, when reading from global memory, but convert along the dimension of 4 when transferring from blocked to shared.

Note that this feature is tested but not activated in `compiler.py` until the 2nd PR lands in the future.

New contributor declaration
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

Select one of the following.
- I have added tests:
  - `/test` for `lit` tests
  - `/unittest` for C++ tests
  - `/python/test` for end-to-end tests
- This PR does not need a test because `FILL THIS IN`.

Select one of the following.
- I have not added any `lit` tests.
- The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)