Skip to content

[opt] Incorrect global id (gid) computation via hipExtModuleLaunchKernel for non-divisible global sizes when using -fno-offload-uniform-block #947

@0oyyo0

Description

@0oyyo0

Environment

  • This issue has been reproduced on ROCm 6.0 and ROCm 7.0
  • Kernel launch API: hipExtModuleLaunchKernel
  • Total thread size not divisible by block size (tail workgroup exists)
  • Add the -fno-offload-uniform-block option during compilation.

Problem Description
When launching a kernel via hipExtModuleLaunchKernel with:
totalThreads % blockSize != 0
the kernel exhibits incorrect global id (gid) computation:

  • Some lower gid values are executed twice
  • This only happens when a tail workgroup is present(i.e., global size is not divisible by block size)
  • When the global size is divisible by block size, the problem disappears.

demo

  • Reproducing the problem
    hipcc --offload-device-only kernel.hip -o kernel.co -fno-offload-uniform-block #-mcode-object-version=5
    hipcc main.cpp -o main_test
    ./main_test

  • kernel.hip

#include <hip/hip_runtime.h>
#include <stdint.h>

extern "C" __global__
void write_gid_kernel(uint64_t* out)
{
    uint64_t gid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("gid = %lu\n", gid);
    out[gid] = gid;
}
  • main.cpp
#include <hip/hip_runtime.h>
#include <hip/hip_ext.h>
#include <cstdio>
#include <cstdlib>
#include <vector>


int main()
{
    hipModule_t module;
    hipFunction_t kernel;

    hipModuleLoad(&module, "kernel.co");
    hipModuleGetFunction(&kernel, module, "write_gid_kernel");

    // const int block_size    = 256;
    // const int total_threads = 1000;

    const int block_size    = 512;
    const int total_threads = 822528; // 822528 % 512 = 256 

    printf("logical threads = %d\n", total_threads);
    printf("block size      = %d\n", block_size);

    uint64_t* d_out;
    hipMalloc(&d_out, total_threads * sizeof(uint64_t));

    void* args[] = {
        (void*)&d_out
    };

    
    hipExtModuleLaunchKernel(
        kernel,
        total_threads, 1, 1,   
        block_size, 1, 1, 
        0,
        nullptr,
        args,
        nullptr
    );

    hipDeviceSynchronize();

    std::vector<uint64_t> h_out(total_threads, 0);
    hipMemcpy(
        h_out.data(), d_out,
        total_threads * sizeof(uint64_t),
        hipMemcpyDeviceToHost);

    hipFree(d_out);
    hipModuleUnload(module);
    return 0;
}

Observed Behavior
When running the above demo,

  • gid in the range:[411136, 411391] is printed twice

  • gid in the final tail region: [822272, totalThreads) is not printed at all

  • 822528/512=1606; 822528 % 512 = 256; 1606*256=411136; 411136+256-1=411391

  • It is suspected that, for the final remainder workgroup(256 threads), gid computation uses 256 * blockIdx + lane_id, leading to incorrect global IDs.

Expected Behavior
Each gid in [0, totalThreads) should be printed exactly once

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions