-
Notifications
You must be signed in to change notification settings - Fork 77
Description
Environment
- This issue has been reproduced on ROCm 6.0 and ROCm 7.0
- Kernel launch API: hipExtModuleLaunchKernel
- Total thread size not divisible by block size (tail workgroup exists)
- Add the -fno-offload-uniform-block option during compilation.
Problem Description
When launching a kernel via hipExtModuleLaunchKernel with:
totalThreads % blockSize != 0
the kernel exhibits incorrect global id (gid) computation:
- Some lower gid values are executed twice
- This only happens when a tail workgroup is present(i.e., global size is not divisible by block size)
- When the global size is divisible by block size, the problem disappears.
demo
-
Reproducing the problem
hipcc --offload-device-only kernel.hip -o kernel.co -fno-offload-uniform-block #-mcode-object-version=5
hipcc main.cpp -o main_test
./main_test -
kernel.hip
#include <hip/hip_runtime.h>
#include <stdint.h>
extern "C" __global__
void write_gid_kernel(uint64_t* out)
{
uint64_t gid = blockIdx.x * blockDim.x + threadIdx.x;
printf("gid = %lu\n", gid);
out[gid] = gid;
}
- main.cpp
#include <hip/hip_runtime.h>
#include <hip/hip_ext.h>
#include <cstdio>
#include <cstdlib>
#include <vector>
int main()
{
hipModule_t module;
hipFunction_t kernel;
hipModuleLoad(&module, "kernel.co");
hipModuleGetFunction(&kernel, module, "write_gid_kernel");
// const int block_size = 256;
// const int total_threads = 1000;
const int block_size = 512;
const int total_threads = 822528; // 822528 % 512 = 256
printf("logical threads = %d\n", total_threads);
printf("block size = %d\n", block_size);
uint64_t* d_out;
hipMalloc(&d_out, total_threads * sizeof(uint64_t));
void* args[] = {
(void*)&d_out
};
hipExtModuleLaunchKernel(
kernel,
total_threads, 1, 1,
block_size, 1, 1,
0,
nullptr,
args,
nullptr
);
hipDeviceSynchronize();
std::vector<uint64_t> h_out(total_threads, 0);
hipMemcpy(
h_out.data(), d_out,
total_threads * sizeof(uint64_t),
hipMemcpyDeviceToHost);
hipFree(d_out);
hipModuleUnload(module);
return 0;
}
Observed Behavior
When running the above demo,
-
gid in the range:[411136, 411391] is printed twice
-
gid in the final tail region: [822272, totalThreads) is not printed at all
-
822528/512=1606; 822528 % 512 = 256; 1606*256=411136; 411136+256-1=411391
-
It is suspected that, for the final remainder workgroup(256 threads), gid computation uses 256 * blockIdx + lane_id, leading to incorrect global IDs.
Expected Behavior
Each gid in [0, totalThreads) should be printed exactly once