
Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread #2449

Closed
miniskar opened this issue Jul 14, 2024 · 14 comments
Labels
bug Something isn't working upstream Somebody else's problem.

Comments

@miniskar

miniskar commented Jul 14, 2024


Describe the bug

I am facing a weird problem in our application. We have a Julia function calling a C function, which creates a pthread and calls a Julia CUDA kernel. I have created a small example to illustrate and reproduce the problem; unfortunately, we were not able to simplify it further. This is the simplest example that reproduces the issue.

The "call_c_function_direct" Julia function calls the C function "c_function_direct", which calls the Julia CUDA kernel path "ccall_saxpy() -> saxpy_kernel()". This works without any issue.

However, when I create a pthread inside the C function and call the Julia CUDA kernel from it, execution hangs and no useful stack trace is available.

The "call_c_function_pthread" Julia function calls the C function "c_function_pthread", which creates a pthread and calls the same "ccall_saxpy() -> saxpy_kernel()" path. Execution hangs at the "@cuda saxpy_kernel" launch.

To switch between direct and pthread-based execution of the Julia CUDA kernel, the file "julia_cuda.jl" defines a variable "direct = true"; set it to false to run via the pthread.

To reproduce

The Minimal Working Example (MWE) for this bug:

File: julia_cuda.jl

using CUDA

# Define the CUDA kernel for saxpy
function saxpy_kernel(A, B, C, alpha)
    i = threadIdx().x
    if i <= length(A)
        C[i] = alpha * A[i] + B[i]
    end
    return
end

# Make the Julia function callable from C
export ccall_saxpy
function ccall_saxpy(ctx::Ptr{Cvoid}, device::Cint, A::Ptr{Float32}, B::Ptr{Float32}, C::Ptr{Float32}, alpha::Cfloat, n::Cint)::Cvoid
    cu_ctx = unsafe_load(reinterpret(Ptr{CuContext}, ctx))
    c_println("CUDA ctx: $cu_ctx Device:$device")
    CUDA.context!(cu_ctx)
    CUDA.device!(device)
    size_dims=Tuple(Int64[n])
    A_array = unsafe_wrap(CuArray, reinterpret(CuPtr{Float32}, A), size_dims, own=false)
    B_array = unsafe_wrap(CuArray, reinterpret(CuPtr{Float32}, B), size_dims, own=false)
    C_array = unsafe_wrap(CuArray, reinterpret(CuPtr{Float32}, C), size_dims, own=false)
    c_println("A: $A_array, B:$B_array, C:$C_array Alpha:$alpha")
    c_println("Calling CUDA function")
    CUDA.@sync @cuda threads=size_dims saxpy_kernel(A_array, B_array, C_array, alpha)
    c_println("CUDA call completed")
end

# Initialize the C function
function initialize_c_function()
    func_ptr = @cfunction(ccall_saxpy, Cvoid, (Ptr{Cvoid}, Cint, Ptr{Float32}, Ptr{Float32}, Ptr{Float32}, Cfloat, Cint))
    global stored_func_ptr = func_ptr
    ccall((:initialize_c, "libcfunction"), Cvoid, (Ptr{Cvoid},), func_ptr)
end

function c_println(data::String)::Cvoid
    ccall((:c_println, "libcfunction"), Cvoid, (Ptr{Cchar},), pointer(data))
end

# Define a function to call the C function
function call_c_function_pthread(A::Vector{Float32}, B::Vector{Float32}, C::Vector{Float32}, alpha::Float32, n::Cint)
    ccall((:c_function_pthread, "./libcfunction.so"), Cvoid,
          (Ptr{Float32}, Ptr{Float32}, Ptr{Float32}, Float32, Cint),
          A, B, C, alpha, n)
end
# Define a function to call the C function
function call_c_function_direct(A::Vector{Float32}, B::Vector{Float32}, C::Vector{Float32}, alpha::Float32, n::Cint)
    ccall((:c_function_direct, "./libcfunction.so"), Cvoid,
          (Ptr{Float32}, Ptr{Float32}, Ptr{Float32}, Float32, Cint),
          A, B, C, alpha, n)
end

# Example usage
A = Float32[1.0, 2.0, 3.0]
B = Float32[4.0, 5.0, 6.0]
C = Float32[0.0, 0.0, 0.0]
alpha = 2.0f0
n = 3
#direct = true
direct = false
initialize_c_function()
# Call the C function
if direct
    call_c_function_direct(A, B, C, alpha, Int32(n))
else
    call_c_function_pthread(A, B, C, alpha, Int32(n))
end
# Print the result
println("Result: $C")

C File: c_cuda.c

#include <julia.h>
#include <pthread.h>
#include <cuda_runtime.h>
#include <cuda.h>

typedef struct {
    float *A;
    float *B;
    float *C;
    float alpha;
    int n;
} thread_data_t;

// Declare the Julia function
typedef void (*julia_saxpy_t)(void *ctx, int device, float *A, float *B, float *C, float alpha, int n);

julia_saxpy_t ccall_saxpy = NULL;
CUcontext cuContext;
CUdevice cuDevice;

void initialize_c(void *func) {
    jl_init();
    ccall_saxpy = (julia_saxpy_t) func;
    cuInit(0);
    cuDeviceGet(&cuDevice, 0);
    cuCtxCreate(&cuContext, 0, cuDevice);
    printf("C CUDA ctx: %p\n", cuContext);
}

void c_println(char *data) {
    printf("%s\n", data);
}
void call_cuda_saxpy(float *A, float *B, float *C, float alpha, int n) {
    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cuCtxSetCurrent(cuContext);
    cudaMalloc((void**)&d_A, n * sizeof(float));
    cudaMalloc((void**)&d_B, n * sizeof(float));
    cudaMalloc((void**)&d_C, n * sizeof(float));

    // Copy data from host to device
    cudaMemcpy(d_A, A, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, n * sizeof(float), cudaMemcpyHostToDevice);

    // Call the Julia function
    ccall_saxpy((void *)&cuContext, (int)cuDevice, d_A, d_B, d_C, alpha, n);

    // Copy result from device to host
    cudaMemcpy(C, d_C, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

void *thread_function(void *arg) {
    thread_data_t *data = (thread_data_t *)arg;
    call_cuda_saxpy(data->A, data->B, data->C, data->alpha, data->n);
    return NULL;
}
void c_function_pthread(float *A, float *B, float *C, float alpha, int n) {
    pthread_t thread;
    thread_data_t data = {A, B, C, alpha, n};
    pthread_create(&thread, NULL, thread_function, &data);
    pthread_join(thread, NULL);
}
void c_function_direct(float *A, float *B, float *C, float alpha, int n) {
    call_cuda_saxpy(A, B, C, alpha, n);
}

Build command and run:

$ gcc -g -O0 -fPIC -shared -o libcfunction.so c_cuda.c -I$(JULIA)/include/julia -L$(JULIA)/lib -ljulia -lpthread -I$(NVHPC_ROOT)/cuda/include -L$(NVHPC_ROOT)/cuda/lib64 -lcuda -lcudart
$ julia julia_cuda.jl

Expected behavior

The pthread-based path should run the saxpy kernel to completion and print Result: Float32[6.0, 9.0, 12.0], just like the direct path. Instead, it hangs.

Version info

Details on Julia: 1.10.4

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper 3970X 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  JULIA_VERSION = 1.10.3
  LD_LIBRARY_PATH = /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/nvshmem/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/nccl/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/compilers/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/compilers/extras/qd/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/cuda/extras/CUPTI/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/24.3/cuda/lib64:

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.5, artifact installation
CUDA driver 12.4
NVIDIA driver 550.54.15

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 23.0.0
- NVML: 12.0.0+550.54.15

Julia packages:
- CUDA: 5.4.2
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3090 (sm_86, 22.880 GiB / 24.000 GiB available)
@miniskar miniskar added the bug Something isn't working label Jul 14, 2024
@maleadt
Member

maleadt commented Jul 19, 2024

Unrelated to CUDA.jl.

MWE:

#include <julia.h>
#include <pthread.h>

typedef void (*julia_callback)();

void call_directly(julia_callback callback) {
    printf("Calling Julia directly\n");
    callback();
}

void *thread_function(void* callback) {
    printf("Calling Julia from thread\n");
    ((julia_callback)callback)();
    return NULL;
}
void call_on_thread(julia_callback callback) {
    printf("Creating thread\n");
    pthread_t thread;
    pthread_create(&thread, NULL, thread_function, callback);
    pthread_join(thread, NULL);
}

function callback()::Cvoid
    println(Core.stdout, "Calling the GC")
    GC.gc()
    println(Core.stdout, "GC call done")
end

callback_ptr = @cfunction(callback, Cvoid, ())
ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
println()
gc_state = @ccall(jl_gc_safe_enter()::Int8)
ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
@ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
println("Done")

This blocks waiting for the GC. The reason is that you're joining the thread immediately after creating it, so call_on_thread blocks. Because that blocking happens inside a random C library, the main thread isn't marked safe to transition to the GC, which leaves your callback unable to acquire the GC.

The solution is either not to have your ccall block (e.g., by not joining the thread), or to indicate that the blocking ccall is safe to transition into GC from. The latter requires two modifications: surrounding the ccall with jl_gc_safe_enter/jl_gc_safe_leave (x-ref JuliaLang/julia#49933), and putting everything in a function.

function main()
    callback_ptr = @cfunction(callback, Cvoid, ())
    ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    println()
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end

isinteractive() || main()

@maleadt maleadt closed this as not planned Jul 19, 2024
@maleadt maleadt removed the bug Something isn't working label Jul 19, 2024
@miniskar
Author

miniskar commented Aug 7, 2024

Thank you for the response. I have tried the solution. However, when CUDA.@sync runs inside the callback invoked via call_on_thread, execution hangs. I attached gdb to the process: there are many threads on the Julia side, while the pthread side has a single thread.

There are roughly 35 Julia threads; a representative gdb backtrace is given below.

#0  __futex_abstimed_wait_common64 (private=0, cancel=true,
    abstime=0x0, op=393, expected=0,
    futex_word=0x7c4aef2b9ae0 <thread_status+224>)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0,
    abstime=0x0, clockid=0, expected=0,
    futex_word=0x7c4aef2b9ae0 <thread_status+224>)
    at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (
    futex_word=futex_word@entry=0x7c4aef2b9ae0 <thread_status+224>,
    expected=expected@entry=0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=0)
    at ./nptl/futex-internal.c:139
#3  0x00007c4b08893a41 in __pthread_cond_wait_common (abstime=0x0,
    clockid=0, mutex=0x7c4aef2b9a90 <thread_status+144>,
    cond=0x7c4aef2b9ab8 <thread_status+184>)
    at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7c4aef2b9ab8 <thread_status+184>,
    mutex=0x7c4aef2b9a90 <thread_status+144>)
    at ./nptl/pthread_cond_wait.c:627
#5  0x00007c4aed795fdb in blas_thread_server ()
   from /home/nqx/packages/julia-1.10.4/lib/julia/libopenblas64_.so
#6  0x00007c4b08894ac3 in start_thread (arg=<optimized out>)
    at ./nptl/pthread_create.c:442
#7  0x00007c4b08926850 in clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Another thread's backtrace:

#0  __futex_abstimed_wait_common64 (private=128, cancel=true,
    abstime=0x0, op=265, expected=57295, futex_word=0x7c4912fde910)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=128,
    abstime=0x0, clockid=0, expected=57295, futex_word=0x7c4912fde910)
    at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (
    futex_word=futex_word@entry=0x7c4912fde910, expected=57295,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007c4b08896624 in __pthread_clockjoin_ex (
    threadid=136653293086272, thread_return=0x0, clockid=0,
    abstime=0x0, block=<optimized out>)
    at ./nptl/pthread_join_common.c:105
...
#34 0x00007c4b07bb51b9 in jl_compile_method_internal (world=136653665882432, mi=<optimized out>)
    at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481
#35 jl_compile_method_internal (mi=<optimized out>, world=136653665882432)
    at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368

Any suggestions on how to resolve this?
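
For reference, per-thread backtraces like the ones above can be captured by attaching gdb to the running julia process and dumping all threads; <pid> below is a placeholder for the process id of the hung julia process:

$ gdb -p <pid>
(gdb) thread apply all bt

thread apply all bt prints a backtrace for every thread in the process, which is how the OpenBLAS and pthread_join frames above can be found.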

@maleadt
Member

maleadt commented Aug 7, 2024

Works fine here; putting CUDA.nonblocking_synchronize(context()) (triggering the threaded synchronization functionality) in the callback finishes as expected. Please provide an MWE based on the one I posted above.
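
Concretely, the check described above amounts to a callback like this (a sketch based on the earlier MWE; CUDA.nonblocking_synchronize is internal CUDA.jl functionality, used here precisely because it exercises the threaded synchronization path):

using CUDA

function callback()::Cvoid
    println(Core.stdout, "Synchronizing the context")
    # Internal API: synchronize the context via the non-blocking,
    # thread-based mechanism instead of spinning in the driver.
    CUDA.nonblocking_synchronize(CUDA.context())
    println(Core.stdout, "Synchronization done")
    return
end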

@miniskar
Author

miniskar commented Aug 8, 2024

I ran the MWE as-is, without any CUDA statements.

c_mwe.c:

#include <julia.h>
#include <pthread.h>

typedef void (*julia_callback)();

void call_directly(julia_callback callback) {
    printf("Calling Julia directly\n");
    callback();
}

void *thread_function(void* callback) {
    printf("Calling Julia from thread\n");
    ((julia_callback)callback)();
    return NULL;
}
void call_on_thread(julia_callback callback) {
    jl_init();
    printf("Creating thread\n");
    pthread_t thread;
    pthread_create(&thread, NULL, thread_function, callback);
    pthread_join(thread, NULL);
}

julia_cuda_mwe.jl:

export callback
function callback()::Cvoid
    println(Core.stdout, "Calling the GC")
    GC.gc()
    println(Core.stdout, "GC call done")
end

callback_ptr = @cfunction(callback, Cvoid, ())
ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
println()
gc_state = @ccall(jl_gc_safe_enter()::Int8)
ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
@ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
println("Done")

Here are the compilation and run commands in the Makefile.

build:
    gcc -g -O0 -fPIC -shared  -o wip.so c_mwe.c -I$(JULIA)/include/julia -L$(JULIA)/lib -ljulia -lpthread -I$(NVHPC_ROOT)/cuda/include -L$(NVHPC_ROOT)/cuda/lib64  -lcuda -lcudart
run:
    julia julia_cuda_mwe.jl

Output:

$ make build run
Calling Julia directly
Calling the GC
GC call done

Creating thread
Calling Julia from thread
Calling the GC

It hangs at GC.gc().

@miniskar
Author

miniskar commented Aug 8, 2024

I have replaced the Julia code as given below, which resolves the GC hang.

export callback
function callback()::Cvoid
    println(Core.stdout, "Calling the GC")
    GC.gc()
    println(Core.stdout, "GC call done")
end
function main()
    callback_ptr = @cfunction(callback, Cvoid, ())
    ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    println()
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end
main()

It works. I am still wondering why it didn't work without wrapping the code in a function. However, CUDA still has an issue; I will upload an MWE with minimal CUDA code in my next post.

@miniskar
Author

miniskar commented Aug 9, 2024

Here is the MWE with CUDA code illustrating the new findings.

c_cuda.c:

#include <julia.h>
#include <pthread.h>
#include <cuda_runtime.h>
#include <cuda.h>

typedef void (*julia_callback)(void *ctx, int device, float *A, float *B, float *C, float alpha, int n);

void call_saxpy(julia_callback callback) {
    printf("Calling Julia from C thread\n");
    int n=8;
    float alpha=2.0f;
    // Allocate device memory
    float *d_A, *d_B, *d_C;
    float A[8]={1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
    float B[8]={2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f};
    float C[8]={3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f};
    CUcontext cuContext;
    CUdevice cuDevice;
    cuInit(0);
    cuDeviceGet(&cuDevice, 0);
    cuCtxCreate(&cuContext, 0, cuDevice);
    cuCtxSetCurrent(cuContext);
    cudaMalloc((void**)&d_A, n * sizeof(float));
    cudaMalloc((void**)&d_B, n * sizeof(float));
    cudaMalloc((void**)&d_C, n * sizeof(float));

    // Copy data from host to device
    cudaMemcpy(d_A, A, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, C, n * sizeof(float), cudaMemcpyHostToDevice);

    // Call the Julia function
    callback((void *)&cuContext, (int)cuDevice, d_A, d_B, d_C, alpha, n);
    // Copy result from device to host
    cudaMemcpy(C, d_C, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

void call_directly(julia_callback callback) {
    printf("Calling Julia directly\n");
    call_saxpy(callback);
}

void *thread_function(void* callback) {
    call_saxpy((julia_callback)callback);
    return NULL;
}
void call_on_thread(julia_callback callback) {
    jl_init();
    printf("Creating thread\n");
    pthread_t thread;
    pthread_create(&thread, NULL, thread_function, callback);
    pthread_join(thread, NULL);
}

julia_cuda.jl:

using CUDA

# Define the CUDA kernel for saxpy
function saxpy_kernel(A, B, C, alpha)
    i = threadIdx().x
    #i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(A)
        C[i] = alpha * A[i] + B[i]
    end
    return nothing
end

export callback
function callback(ctx::Ptr{Cvoid}, device::Cint, A::Ptr{Float32}, B::Ptr{Float32}, C::Ptr{Float32}, alpha::Cfloat, n::Cint)::Cvoid
    GC.gc()
    # Limit BLAS to a single thread
    cu_ctx = unsafe_load(reinterpret(Ptr{CuContext}, ctx))
    CUDA.context!(cu_ctx)
    CUDA.device!(device)
    size_dims=Tuple(Int64[n])
    nthreads =Tuple(Int64[n])
    A_array = unsafe_wrap(CuArray, reinterpret(CuPtr{Float32}, A), size_dims, own=false)
    B_array = unsafe_wrap(CuArray, reinterpret(CuPtr{Float32}, B), size_dims, own=false)
    C_array = unsafe_wrap(CuArray, reinterpret(CuPtr{Float32}, C), size_dims, own=false)
    println(Core.stdout, "CUDA.ctx:$cu_ctx Device:$device Before A: $A_array, B:$B_array, C:$C_array Alpha:$alpha")
    #CUDA.nonblocking_synchronize(CUDA.context())
    CUDA.@sync @cuda threads=nthreads saxpy_kernel(A_array, B_array, C_array, alpha)
    #CUDA.synchronize()
    println(Core.stdout, "After A: $A_array, B:$B_array, C:$C_array Alpha:$alpha")
    println(Core.stdout, "GC call done")
end


function main()
    callback_ptr = @cfunction(callback, Cvoid, (Ptr{Cvoid}, Cint, Ptr{Float32}, Ptr{Float32}, Ptr{Float32}, Cfloat, Cint))
    disable = true
    if !disable
        ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    end
    println()
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end

main()

The above code works correctly if I set disable = false in the Julia main() function.
It seems the CUDA saxpy_kernel has to be called directly at least once before it is called through the pthread; otherwise, when first called from the pthread, it seems like it does not recognize saxpy_kernel and hangs. Is there a way to make this work without calling call_directly first? (See the sketch below for one idea.)
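
One untested idea, assuming the hang is tied to first-time kernel compilation happening on the adopted thread: compile the kernel up front on the main thread with @cuda launch=false, which is standard CUDA.jl functionality for compiling without launching. The helper name and dummy arrays here are illustrative only:

using CUDA

# Hypothetical warm-up: compile saxpy_kernel for the argument types used
# later, without actually launching it, so the adopted thread finds the
# kernel already compiled.
function warmup_saxpy(n::Integer)
    dummy = CUDA.zeros(Float32, n)
    CUDA.@cuda launch=false saxpy_kernel(dummy, dummy, dummy, 2.0f0)
    return nothing
end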

Here is the makefile to build and run.
Makefile:

build:
    gcc -g -O0 -fPIC -shared  -o wip.so c_cuda.c -I$(JULIA)/include/julia -L$(JULIA)/lib -ljulia -lpthread -I$(NVHPC_ROOT)/cuda/include -L$(NVHPC_ROOT)/cuda/lib64  -lcuda -lcudart
run:
    julia julia_cuda.jl

@vchuravy
Member

vchuravy commented Aug 9, 2024

Otherwise, when it is called from pthread, it seems like it is not recognizing the saxpy_kernel.

What do you mean by that? Please always include relevant output to explain your issue.

@miniskar
Author

miniskar commented Aug 9, 2024

Here is the output for two scenarios.

1. Scenario 1: with main() where disable is set to false
function main()
    callback_ptr = @cfunction(callback, Cvoid, (Ptr{Cvoid}, Cint, Ptr{Float32}, Ptr{Float32}, Ptr{Float32}, Cfloat, Cint))
    disable = false
    if !disable
        ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    end
    println()
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end

main()

Output:

Calling Julia directly
Calling Julia from C thread
CUDA.ctx:CuContext(0x0000000003986450) Device:0 Before A: Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], B:Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], C:Float32[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0] Alpha:2.0
After A: Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], B:Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], C:Float32[4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0] Alpha:2.0
GC call done

Creating thread
Calling Julia from C thread
CUDA.ctx:CuContext(0x0000768328000e00) Device:0 Before A: Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], B:Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], C:Float32[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0] Alpha:2.0
After A: Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], B:Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], C:Float32[4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0] Alpha:2.0
GC call done
Done
2. Scenario 2: with main() where disable is set to true
function main()
    callback_ptr = @cfunction(callback, Cvoid, (Ptr{Cvoid}, Cint, Ptr{Float32}, Ptr{Float32}, Ptr{Float32}, Cfloat, Cint))
    disable = true
    if !disable
        ccall((:call_directly, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    end
    println()
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end

main()

Output:

Creating thread
Calling Julia from C thread
CUDA.ctx:CuContext(0x00007b6154000e00) Device:0 Before A: Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], B:Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], C:Float32[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0] Alpha:2.0
.... # it hangs here

@miniskar
Author

I would like to reopen this issue, as it fails for Scenario 2.

@miniskar
Author

When I debug the code further, it hangs during execution in GPUCompiler when it calls Core.Compiler.typeinf_type:
https://github.com/JuliaGPU/GPUCompiler.jl/blob/master/src/validation.jl#L17

function typeinf_type(mi::MethodInstance; interp::CC.AbstractInterpreter)
    ty = Core.Compiler.typeinf_type(interp, mi.def, mi.specTypes, mi.sparam_vals)
    return something(ty, Any)
end

This issue exists even when I use AMDGPU instead of CUDA.

I have now reported this issue in the GPUCompiler.jl repository:
JuliaGPU/GPUCompiler.jl#615

Please collaborate with us to fix this issue.

@maleadt
Member

maleadt commented Aug 19, 2024

Reduced to:

#include <julia.h>
#include <pthread.h>

typedef void (*julia_callback)();

void call_directly(julia_callback callback) {
    printf("Calling Julia directly\n");
    callback();
}

void *thread_function(void* callback) {
    printf("Calling Julia from thread\n");
    ((julia_callback)callback)();
    return NULL;
}
void call_on_thread(julia_callback callback) {
    printf("Creating thread\n");
    pthread_t thread;
    pthread_create(&thread, NULL, thread_function, callback);
    pthread_join(thread, NULL);
}

function callback()::Cvoid
    println("Running a command")
    run(`echo 42`)
    return
end

function main()
    callback_ptr = @cfunction(callback, Cvoid, ())
    gc_state = @ccall(jl_gc_safe_enter()::Int8)
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    @ccall(jl_gc_safe_leave(gc_state::Int8)::Cvoid)
    println("Done")
end

main()

I'm pretty sure this is not guaranteed to work. Julia code can run on a foreign thread, as you're doing here, but you're also concurrently blocking the main thread of execution by calling pthread_join on it. The gc_safe handling at least unblocks the GC in that respect, but other locks may have been taken too, such as in the scheduler. Maybe @vchuravy or @vtjnash can confirm this.

If you really want this to work, I'd advise filing an issue on the Julia main repository. In the meantime, I would only call pthread_join once you know the thread is done, e.g., by setting a flag at its end. Or, if you don't care about portability, switch to pthread_tryjoin_np (but crucially, call it in a loop from Julia, not from C).

Hope this helps you resolve the issue! In any case, there isn't much we can do from the CUDA.jl side about this...

@maleadt maleadt reopened this Aug 19, 2024
@maleadt maleadt closed this as completed Aug 19, 2024
@maleadt maleadt reopened this Aug 19, 2024
@maleadt maleadt changed the title Julia -> C function (Create thread) -> Julia CUDA kernel issue Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread Aug 19, 2024
@maleadt maleadt added bug Something isn't working upstream Somebody else's problem. labels Aug 19, 2024
@vchuravy
Member

Ah, I missed the GC lock interaction when I read this issue initially. When we enter C we shouldn't hold any locks (except the ones the user holds). So marking the thread as "GCSafe" ought to be enough.
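
For completeness, the enter/leave pattern used throughout this thread can be wrapped in a small helper; a minimal sketch, assuming the same unexported jl_gc_safe_enter/jl_gc_safe_leave runtime entry points used in the examples above:

# Run f inside a GC-safe region, so other threads can reach a GC
# safepoint while this thread blocks in C. f should do nothing but the
# blocking foreign call: running arbitrary Julia code while GC-safe is
# not allowed.
function gc_safe(f)
    state = @ccall(jl_gc_safe_enter()::Int8)
    try
        return f()
    finally
        @ccall(jl_gc_safe_leave(state::Int8)::Cvoid)
    end
end

# Usage, mirroring main() above:
#   gc_safe() do
#       ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
#   end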

@maleadt
Member

maleadt commented Aug 19, 2024

When we enter C we shouldn't hold any locks (except the ones the user holds). So marking the thread as "GCSafe" ought to be enough.

In that case, let's file this on the Julia repo.

@maleadt
Member

maleadt commented Aug 19, 2024

Filed upstream: JuliaLang/julia#55525

I think we can close this then, as there's nothing actionable on the CUDA.jl side.

And again, as a potential workaround, don't have the main thread block. For example, using the non-portable API:

#define _GNU_SOURCE
#include <julia.h>
#include <pthread.h>

typedef void (*julia_callback)();

void call_directly(julia_callback callback) {
    printf("Calling Julia directly\n");
    callback();
}

void *thread_function(void* callback) {
    printf("Calling Julia from thread\n");
    ((julia_callback)callback)();
    return NULL;
}
pthread_t thread;
void call_on_thread(julia_callback callback) {
    printf("Creating thread\n");
    pthread_create(&thread, NULL, thread_function, callback);
}
int wait_for_thread() {
    return pthread_tryjoin_np(thread, NULL);
}

function callback()::Cvoid
    println("Running a command")
    run(`echo 42`)
    return
end

function main()
    callback_ptr = @cfunction(callback, Cvoid, ())
    ccall((:call_on_thread, "./wip.so"), Cvoid, (Ptr{Cvoid},), callback_ptr)
    ret = -1
    while ret != 0
        ret = ccall((:wait_for_thread, "./wip.so"), Cint, ())
        yield()
    end
    println("Done")
end

main()

@maleadt maleadt closed this as completed Aug 19, 2024