Contributor

@tkf tkf commented Feb 22, 2022

I tried whether a simplistic approach can solve #1353. This patch simply lets users specify the atomic ordering by passing Val(:monotonic) etc. to CUDA.atomic_*. It seems to generate the correct LLVM IR, but the orderings are not reflected in the PTX (and hence at run time).

julia> function global_atomic_demo!(xs, order)
           ptr = pointer(xs, 1)
           CUDA.atomic_add!(ptr, 1, order)
           nothing
       end;

julia> @device_code_ptx @cuda global_atomic_demo!(CUDA.zeros(Int, 1), Val(:monotonic))

prints

...
// PTX CompilerJob of kernel global_atomic_demo!(CuDeviceVector{Int64, 1}, Val{:monotonic}) for sm_61

//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_61
.address_size 64

        // .globl       _Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE // -- Begin function _Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE
                                        // @_Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE
.visible .entry _Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE(
        .param .align 8 .b8 _Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE_param_0[8],
        .param .align 8 .b8 _Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE_param_1[32]
)
{
        .reg .b64       %rd<3>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_Z31julia_global_atomic_demo__1619313CuDeviceArrayI5Int64Li1ELi1EE3ValI10_monotonicE_param_1];
        atom.global.add.u64     %rd2, [%rd1], 1;
        ret;
                                        // -- End function
}

and

julia> @device_code_ptx @cuda global_atomic_demo!(CUDA.zeros(Int, 1), Val(:sequentially_consistent))

prints

...
// PTX CompilerJob of kernel global_atomic_demo!(CuDeviceVector{Int64, 1}, Val{:sequentially_consistent}) for sm_61

//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_61
.address_size 64

        // .globl       _Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE // -- Begin function _Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE
                                        // @_Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE
.visible .entry _Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE(
        .param .align 8 .b8 _Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE_param_0[8],
        .param .align 8 .b8 _Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE_param_1[32]
)
{
        .reg .b64       %rd<3>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_Z31julia_global_atomic_demo__1640313CuDeviceArrayI5Int64Li1ELi1EE3ValI24_sequentially_consistentE_param_1];
        atom.global.add.u64     %rd2, [%rd1], 1;
        ret;
                                        // -- End function
}

i.e., atomic_add! with both Val(:monotonic) and Val(:sequentially_consistent) compiles down to the same atom.global.add.u64. Note that @device_code_llvm prints the expected LLVM IR (i.e., %2 = atomicrmw add i64 addrspace(1)* %1, i64 1 monotonic for Val(:monotonic) and %2 = atomicrmw add i64 addrspace(1)* %1, i64 1 seq_cst for Val(:sequentially_consistent)). A similar program on shared memory also shows that both Val(:monotonic) and Val(:sequentially_consistent) generate the same instruction (atom.shared.add.u64).

I'm not sure how to properly generate more accurate orderings as libcu++ does. Do we need to generate more ptx-specific LLVM IR?
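For reference, LLVM IR can already carry a synchronization scope alongside the ordering via syncscope (the string is target-defined), which is how some back-ends distinguish device-wide from system-wide atomics. A sketch of such IR, purely illustrative since whether the NVPTX back-end honors any of this is exactly the open question here:

```llvm
; Sketch only: the same atomic add, but with an explicit (hypothetical) syncscope.
; If the NVPTX back-end lowered ordering/scope faithfully, one would expect
; qualifiers such as .relaxed/.sys on the resulting atom instruction.
define void @kernel_scoped(i64 addrspace(1)* %ptr) {
entry:
  %0 = atomicrmw add i64 addrspace(1)* %ptr, i64 1 syncscope("device") monotonic
  ret void
}
```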

Member

maleadt commented Feb 23, 2022

I'm not sure how to properly generate more accurate orderings as libcu++ does.

Which exact instruction (modifiers) are you expecting?

This is just LLVM not handling these orderings:

source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define void @kernel_monotonic(i32 addrspace(1)* %ptr) {
entry:
    %0 = atomicrmw add i32 addrspace(1)* %ptr, i32 1 monotonic
    ret void
}

define void @kernel_seq_cst(i32 addrspace(1)* %ptr) {
entry:
    %0 = atomicrmw add i32 addrspace(1)* %ptr, i32 1 seq_cst
    ret void
}
$ llc test.ll -o -                                                                                                                                                                            
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

        // .globl       kernel_monotonic        // -- Begin function kernel_monotonic
                                        // @kernel_monotonic
.visible .func kernel_monotonic(
        .param .b64 kernel_monotonic_param_0
)
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<2>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [kernel_monotonic_param_0];
        atom.global.add.u32     %r1, [%rd1], 1;
        ret;
                                        // -- End function
}
        // .globl       kernel_seq_cst          // -- Begin function kernel_seq_cst
.visible .func kernel_seq_cst(
        .param .b64 kernel_seq_cst_param_0
)                                       // @kernel_seq_cst
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<2>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [kernel_seq_cst_param_0];
        atom.global.add.u32     %r1, [%rd1], 1;
        ret;
                                        // -- End function
}

At the same time, NVVM (NVIDIA's closed-source back-end) doesn't seem to handle these either:

#include <nvvm.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define check(ans) { _check((ans), __FILE__, __LINE__); }
void _check(nvvmResult code, const char *file, int line)
{
   if (code != NVVM_SUCCESS)
   {
      fprintf(stderr,"NVVM error: %s at %s:%d\n", nvvmGetErrorString(code), file, line);
      exit(code);
   }
}

int main() {
    FILE *f = fopen("/tmp/test.ll", "rb");
    fseek(f, 0, SEEK_END);
    long input_size = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *input = malloc(input_size);
    fread(input, input_size, 1, f);
    fclose(f);

    nvvmProgram program;
    check(nvvmCreateProgram(&program));

    check(nvvmAddModuleToProgram(program, input, input_size, "main"));

    if (nvvmCompileProgram(program, 0, NULL) != NVVM_SUCCESS) {
        size_t log_size;
        check(nvvmGetProgramLogSize(program, &log_size));
        char *log = malloc(log_size);
        check(nvvmGetProgramLog(program, log));
        fprintf(stderr, "Compilation failed: %s\n", log);
        return EXIT_FAILURE;
    }

    size_t result_size;
    check(nvvmGetCompiledResultSize(program, &result_size));
    char *result = malloc(result_size);
    check(nvvmGetCompiledResult(program, result));
    fprintf(stdout, "%s\n", result);

    return EXIT_SUCCESS;
}
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-30794723
// Cuda compilation tools, release 11.6, V11.6.55
// Based on NVVM 7.0.1
//

.version 7.6
.target sm_52
.address_size 64

        // .globl       kernel_monotonic

.visible .func kernel_monotonic(
        .param .b64 kernel_monotonic_param_0
)
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<2>;


        ld.param.u64    %rd1, [kernel_monotonic_param_0];
        atom.global.add.u32     %r1, [%rd1], 1;
        ret;

}
        // .globl       kernel_seq_cst
.visible .func kernel_seq_cst(
        .param .b64 kernel_seq_cst_param_0
)
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<2>;


        ld.param.u64    %rd1, [kernel_seq_cst_param_0];
        atom.global.add.u32     %r1, [%rd1], 1;
        ret;

}

Contributor Author

tkf commented Feb 23, 2022

I only tried it on godbolt (https://godbolt.org/z/o6areY84z), but NVCC with libcu++ compiles a->fetch_add(1, cuda::std::memory_order_relaxed) to atom.add.relaxed.sys.u32. So I was hoping we would get something similar with :monotonic. The other orderings with libcu++ are also reflected in the assembly. I tried clang++ on godbolt to check the LLVM IR, but I couldn't make it compile. Interestingly, the old C APIs like atomicAdd and atomicAdd_system seem to use different instructions; atomicAdd compiles to atom.global.add.u32. So I wonder if LLVM has not been updated yet to use the new instructions? Or maybe we need to use some @llvm.nvvm.* intrinsics?
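The libcu++ experiment was along these lines (a reconstruction, not the exact godbolt source; it assumes nvcc with a recent CUDA toolkit, and the scope-qualified lowering reportedly requires sm_70 or newer):

```cuda
// Reconstruction of the godbolt experiment: a system-scoped relaxed
// fetch_add via libcu++, which NVCC reportedly lowers to
// atom.add.relaxed.sys.u32 when targeting sm_70+.
#include <cuda/atomic>

__global__ void kernel(cuda::atomic<unsigned int, cuda::thread_scope_system>* a) {
    a->fetch_add(1, cuda::std::memory_order_relaxed);
}
```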
