Add benchmark sample for vector times matrix transposed #38
Conversation
This adds benchmarks for `vmt`, with supporting structure very similar to the existing `mmt` benchmark, but with different strategies tuned for matvec. It adds three strategies (strategy 1 is sketched below):

1. Treat it like a reduction with one workgroup per row, relying on the cache for reuse of the vector.
2. Copy the vector to shared memory using all threads in the workgroup and then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
3. Use a fixed number of workgroups, each of which strides the full problem space. This should limit the overhead of setting up the vector in shared memory, as well as reduce scheduling overhead.

Currently, the best configurations for the three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900xtx).
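For orientation, the core of strategy 1 looks roughly like this (an illustrative sketch, not the exact shader in this PR; `inputA` as the binding-0 vector buffer and `K_VEC = K / 4` are assumptions):

```glsl
// Strategy 1: one workgroup per output row; each lane accumulates a strided
// slice of the row dot product, then a subgroup reduction finishes the row.
void main() {
  const uint row = gl_WorkGroupID.x;
  int32_t laneResult = 0;
  // Vector reuse across rows is left entirely to the cache.
  for (uint k = gl_LocalInvocationID.x; k < K_VEC; k += gl_WorkGroupSize.x) {
    ivec4 prod = ivec4(inputA.x[k]) * ivec4(inputB.x[row * K_VEC + k]);
    laneResult += prod.x + prod.y + prod.z + prod.w;
  }
  int32_t wgResult = subgroupAdd(laneResult);  // assumes WG_X == subgroup size
  if (subgroupElect()) {
    outputO.x[row] = wgResult;
  }
}
```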
I don't see a way to assign reviewers, so @antiagainst @kuhar I am posting progress here as discussed offline.
@@ -0,0 +1,106 @@
// Copyright 2023 Google LLC
nit: For any substantially modified file, I believe you should put your affiliation here instead of Google. For files to which multiple parties made significant contributions, you can list multiple copyright lines, e.g.:
// Copyright 2020-2022 Google LLC
// Copyright 2023 Costco Inc.
Or we can switch it to "uVkCompute Authors" if that's all too annoying.
const uint threadID = gl_SubgroupInvocationID;
const uvec2 gridDim = gl_NumWorkGroups.xy;

const uint laneCount = gl_WorkGroupSize.x;
Do we rely on the subgroup size being the same as the workgroup size, or a specific multiple of it? If yes, I think we can set it with VK_EXT_subgroup_size_control.
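For reference, a minimal sketch of pinning the subgroup size at pipeline creation via that extension (core in Vulkan 1.3); `shaderModule` and the chosen size of 64 are placeholders:

```c++
VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroupSize = {};
subgroupSize.sType =
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
subgroupSize.requiredSubgroupSize = 64;  // e.g. wave64 on RDNA

VkPipelineShaderStageCreateInfo stage = {};
stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
stage.pNext = &subgroupSize;  // chain in the required subgroup size
stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
stage.module = shaderModule;
stage.pName = "main";
```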
int32_t wgResult = subgroupAdd(laneResult);
if (subgroupElect()) {
  outputO.x[r] = wgResult;
}
Here it's not clear to me if we process all the elements, because the outer loop has the trip count `gl_WorkGroupSize.x * K0_VEC` but we reduce only within the subgroup. Don't we need an outer loop that reduces over the whole workgroup?
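If a workgroup-wide sum is indeed intended, the usual two-level pattern looks roughly like this (a sketch reusing this hunk's `laneResult` / `outputO.x[r]`; the shared-array bound of 16 is a placeholder):

```glsl
shared int32_t subgroupSums[16];  // at file scope; max subgroups per workgroup

// Level 1: per-subgroup reduction.
int32_t sgResult = subgroupAdd(laneResult);
if (subgroupElect()) {
  subgroupSums[gl_SubgroupID] = sgResult;
}
barrier();

// Level 2: the first subgroup combines the partial sums and writes the result.
if (gl_SubgroupID == 0) {
  int32_t partial = gl_SubgroupInvocationID < gl_NumSubgroups
                        ? subgroupSums[gl_SubgroupInvocationID]
                        : 0;
  int32_t wgResult = subgroupAdd(partial);
  if (subgroupElect()) {
    outputO.x[r] = wgResult;
  }
}
```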
for (const ShaderCode &shader : kShaderCodeCases) {
  std::string vecmat_size = absl::StrCat(N, "x", K);
  std::string tiling_scheme = absl::StrCat(shader.N0, "x", shader.K0);
  BM_CHECK(isMultipleOf(N, shader.N0))
      << "Incompatible tiling scheme: " << tiling_scheme;
  BM_CHECK(isMultipleOf(K, shader.K0))
      << "Incompatible tiling scheme: " << tiling_scheme;
  BM_CHECK(isMultipleOf(shader.K0, 4))
      << "Incompatible tiling scheme: " << tiling_scheme;

  std::string workgroup_size =
      absl::StrCat(shader.wg_size_x, "x", shader.wg_size_y, "x1");
  std::string type_info = absl::StrCat(GetName(shader.input_type), "->",
                                       GetName(shader.output_type));
  std::string test_name = absl::StrCat(
      gpu_name, "/vmt[", vecmat_size, "]/", type_info, "/", shader.name,
      "/Workgroup", "_", shader.B0, "x", "[", workgroup_size, "]");
  ::benchmark::RegisterBenchmark(test_name.c_str(), Vmt, device,
                                 latency_measure, shader, N, K)
      ->UseManualTime()
      ->Unit(::benchmark::kMicrosecond);
Can we have one `main.cc` file for both variants? There is a lot here that is hard to tell apart without running it under `diff`. Or alternatively, maybe move common code into a header/library?
Overall looks pretty nice! Just a few nits. I mainly reviewed the grid_strided_vmt_i8.glsl file; the comments there apply to the other, simpler strategies too. Thanks for adding it!
layout(binding = 1) buffer InputB { i8vec4 x[]; } inputB;
layout(binding = 2) buffer Output { int32_t x[]; } outputO;

layout(local_size_x = WG_X, local_size_y = WG_Y, local_size_z = 1) in;
It would be nice to add a comment here mentioning that `WG_X` and `WG_Y` are macros that get their values during shader compilation.
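e.g. something along the lines of:

```glsl
// WG_X / WG_Y are macros; their values are injected by the build system at
// shader compilation time (see the PERMUTATION entries in CMake).
layout(local_size_x = WG_X, local_size_y = WG_Y, local_size_z = 1) in;
```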
SRC
  "vmt_i8.glsl"
PERMUTATION
  "N0=[1|2|4]"
Just to provide some background on why we have such permutation in the build system: it was to bypass limitations in driver compilers for mobile GPUs. These could and should actually be specialization constants in the kernel; but mobile GPU driver compilers used to have an issue where a value appearing as a loop bound could not be properly unrolled, tanking the perf. That might not be a problem anymore today, but we never checked. Anyway, nothing to do here; just wanted to explain a bit.
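To illustrate the difference (a sketch; `N0` as used in this PR):

```glsl
// As done here: N0 is a macro, substituted per PERMUTATION value at shader
// compile time, so the bound is a literal and unrolling is trivial.
[[unroll]] for (uint n = 0; n < N0; ++n) { /* ... */ }

// The specialization-constant alternative: a single SPIR-V module whose value
// is set at pipeline creation. Historically some mobile driver compilers
// failed to unroll loops with such bounds.
// layout(constant_id = 2) const uint N0 = 1;
```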
VkExtent3D dimensions1 = {uint32_t(N / 8), uint32_t(K), 1};
BM_CHECK_OK_AND_ASSIGN(
    auto src_image1,
The images and samplers aren't used and should be removed.
layout(constant_id = 0) const uint N = 1;
layout(constant_id = 1) const uint K = 1;
Similarly, it would be nice to explain what `M`, `N`, `K`, `M0`, `N0`, and `K0` mean, and mention that the latter three are defined during shader compilation too.
const uint strideB = K_VEC;  // Stride of the `inputB` matrix.

// Each workgroup processes a total of N0 rows per iteration, therefore |
It would be nice to give an overview of the algorithm at the beginning, after explaining the various macros and constant values: saying how we decide `WG_X` (subgroup size) and `WG_Y` (subgroup count), and how the workload is distributed (which is this comment).
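Based on the PR description, such an overview could read roughly like this (wording is mine, to be checked against the kernel):

```glsl
// Overview: WG_X matches the subgroup size and WG_Y is the number of
// subgroups per workgroup. Each workgroup first stages the vector (LHS) into
// shared memory using all of its threads, then processes N0 rows of the
// matrix per iteration, striding over the grid until all N rows are covered.
```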
void main() {
  const uvec2 wgID = gl_WorkGroupID.xy;
  const uvec2 localID = gl_LocalInvocationID.xy;
  const uint threadID = gl_SubgroupInvocationID;
`threadID` is just `localID.x`? It would be nice to use just one source to avoid confusion.
shared i8vec4 LHS[K_VEC];  // Shared data for the LHS.

void main() {
  const uvec2 wgID = gl_WorkGroupID.xy;
We only ever use the workgroup X dim, so it's better to assign only that to `wgID`.
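i.e.:

```glsl
const uint wgID = gl_WorkGroupID.x;  // only the X dimension is used
```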
int32_t laneResult = 0;

for (uint k = 0; k < K_VEC; k += partialVec) {
  [[unroll]] for (uint kk = 0; kk < K0_VEC; ++kk) {
Curious, do you know what kind of ISA it generates with this unrolled scalar (i8vec4=>i32) load? Does the driver compiler potentially merge them into 128-bit loads?
}

double numOperation = double(N) * double(K) * 2.;
state.counters["Ops"] =
Instead of showing Ops/s, it's more meaningful to show bytes/s, given this is mostly memory bound, so we can compare against the theoretical peak more easily.

That's (4096 LHS + 4096 * 4096 RHS + 4096 * 4 OUTPUT) bytes / (20 * 10^(-6)) s ~= 0.84 TB/s? The theoretical peak is 3.5 TB/s, so that's still quite far from it; memory access is still not optimal. May need to dump the ISA and see if there is anything suspicious, and grab RGP traces to check.
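A sketch of reporting that with the Google Benchmark counter API (byte counts follow the breakdown above; i8 inputs, i32 output):

```c++
double numBytes = double(K)              // LHS vector, i8 elements
                + double(N) * double(K)  // RHS matrix, i8 elements
                + double(N) * 4;         // output, i32 elements
state.counters["Bytes"] = ::benchmark::Counter(
    numBytes, ::benchmark::Counter::kIsIterationInvariantRate,
    ::benchmark::Counter::kIs1000);
```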
Closing this in favor of #40. If any of the other strategies tried here seem relevant at a later point, I will open a new PR on top.