Combining Cuba.jl with GPU #36

Closed
CharlesRSmith44 opened this issue May 6, 2021 · 3 comments
Comments


CharlesRSmith44 commented May 6, 2021

Hello,

I'm trying to use GPU computation via CUDA.jl to speed up calculating integrals. Is it possible to do so? If so, how? Here is my example code:

using Cuba, Distributions
using BenchmarkTools, Test, CUDA
using FLoops, FoldsCUDA
using SpecialFunctions

# user input
M= 10  #number of independent uniform random variables
atol=1e-6
rtol=1e-3
nvec=1000000
maxevals=100000000

# setting up integration function
function int_thread_col_gpu_precompile_test(x, f)
    w = CuArray(x)
    @floop CUDAEx() for i in 1:size(x, 2)
    # @floop for i in 1:size(x, 2)   # CPU executor: this version runs, but doesn't exploit the GPU
        val = reduce(*, @view(w[:, i]))
        f[i] = val
    end
    return f
end

cuhre(int_thread_col_gpu_precompile_test, M, 1, atol=atol, rtol=rtol)
suave(int_thread_col_gpu_precompile_test, M, 1, atol=atol, rtol=rtol, nvec = nvec, maxevals = maxevals)

If I use the CUDAEx() call, the code errors. If I don't, the code works fine but isn't exploiting the GPU effectively.

If I include the CUDAEx() call, the error message is:

GPU compilation of kernel transduce_kernel!(Nothing, Transducers.Reduction{Transducers.Map{typeof(first)}, Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}}}, Transducers.InitOf{Transducers.DefaultInitOf}, Int64, Base.OneTo{Int64}, UnitRange{Int64}) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type Transducers.Reduction{Transducers.Map{typeof(first)}, Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}}}, which is not isbits:
  .inner is of type Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}} which is not isbits.
    .inner is of type Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"} which is not isbits.
      .next is of type InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}} which is not isbits.
        .op is of type var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}} which is not isbits.
          .f is of type Matrix{Float64} which is not isbits.
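
The failing argument is the reducing-function closure, which captures the CPU-side Matrix{Float64} buffer f that Cuba passes to the integrand; a Matrix is not isbits, so it cannot be passed to a GPU kernel. A minimal sketch of one possible workaround, untested here: write into a device-side buffer and copy back once per call. The function name, the buffer fd, and the hand-written inner loop (replacing reduce, which may not compile inside a kernel) are illustrative, not from the original report.

# Sketch only: keep all kernel writes on the device, then do a single
# device-to-host copy into the buffer Cuba provides.
function int_thread_col_device_out(x, f)
    w  = CuArray(x)                            # host -> device copy of the points
    fd = CuArray{Float64}(undef, size(x, 2))   # hypothetical device-side results buffer
    @floop CUDAEx() for i in 1:size(w, 2)
        val = one(eltype(w))
        for j in 1:size(w, 1)                  # plain scalar loop instead of reduce
            val *= w[j, i]
        end
        fd[i] = val                            # closure now captures only device arrays
    end
    f[1, :] = Array(fd)                        # device -> host copy back to Cuba's buffer
    return f
end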

CharlesRSmith44 commented May 13, 2021

Alternatively, I can bind arrays as CuArrays after drawing the random variables, but the time cost of doing so seems prohibitive. Instead, it seems much faster to avoid the GPU and use the CPU exclusively. Is there any way to decrease the time of the GPU option? I.e., could I draw the random variables on the GPU initially rather than having to transfer them over? I think the transferring from the GPU to the CPU is what makes the code so slow for the GPU option.

For example:

import Pkg
Pkg.activate("joint_timing")
Pkg.instantiate()
using Cuba, Distributions
using BenchmarkTools, Test, CUDA
using FLoops, FoldsCUDA
using SpecialFunctions

@test Threads.nthreads()>1

# User Inputs
M= 5 # number of independent uniform random variables
atol=1e-6
rtol=1e-3
nvec=1000000
maxevals=100000000

# Initializing Functions
function int_cpu(x, f)
    # CPU: product of M independent Beta(1, 2) pdfs via Distributions
    f[1] = pdf(Product(Beta.(1.0, 2.0*ones(M))), x)
end

function int_cpu2(x, f)
    # CPU: the same Beta(1, 2) product pdf written out by hand
    f[1] = vec(prod(x' .^ (1.0 - 1.0) .* (1.0 .- x') .^ (2.0 - 1.0) ./ (gamma(1.0)*gamma(2.0)/gamma(3.0)), dims=2))[1]
end

function beta_pdf_gpu(x, a, b)
    # GPU: broadcasted Beta(a, b) pdf, multiplied along dims=1 (over the M coordinates of each point)
    prod(x .^ (a - 1.0f0) .* (1.0f0 .- x) .^ (b - 1.0f0) ./ (gamma(a)*gamma(b)/gamma(a + b)), dims=1)
end

function int_gpu(x, f)
    # copies x to the device on every call, evaluates on the GPU, then reads the scalar back
    f[1] = vec(beta_pdf_gpu(CuArray(x), 1.0f0, 2.0f0))[1]
end

display(@benchmark cuhre($int_gpu, $M, 1, atol=$atol, rtol=$rtol))  # 70 ms for M = 5, 11.7 s for M = 15
display(@benchmark cuhre($int_cpu, $M, 1, atol=$atol, rtol=$rtol))  # 2.0 ms for M = 5, 650 ms for M = 15
display(@benchmark cuhre($int_cpu2, $M, 1, atol=$atol, rtol=$rtol)) # 500 μs for M = 5, 100 ms for M = 15, 38 s for M = 25
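
To check whether the host-device copies are indeed the bottleneck, a rough timing sketch (the names xs and xd are illustrative; with the default nvec, cuhre passes one point per integrand call, so these are per-evaluation costs):

# Rough sketch: the two transfers that int_gpu above pays on every call
xs = rand(M, 1)                                  # one point, as cuhre passes with nvec = 1
@btime CuArray($xs)                              # host -> device copy
xd = CuArray(xs)
@btime Array(beta_pdf_gpu($xd, 1.0f0, 2.0f0))    # kernel launch + device -> host readback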


giordano commented May 14, 2021

I'm not really sure what you want to do here. Cuba.jl calls into a C shared library which runs on the CPU, so you're bound to constantly move data between the GPU and the CPU; there is really no way around it. Whether this is worth it depends on your specific problem, but in the vast majority of (or all?) cases I'd assume it isn't. I'm going to close this ticket because I don't think there is any action to take here.

It may be interesting to see whether HCubature.jl works out-of-the-box with CuArrays: that's a pure-Julia package which implements the same algorithm as the cuhre function here.
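
For reference, a minimal CPU-only sketch of what the HCubature.jl counterpart of the cuhre calls above might look like; this is illustrative and untested in this thread, and whether CuArrays inside the integrand actually help is exactly the open question:

using HCubature, SpecialFunctions

M = 5
# product of M independent Beta(a, b) pdfs, written out by hand as above
beta_pdf(x, a, b) = prod(@. x^(a - 1) * (1 - x)^(b - 1) / (gamma(a) * gamma(b) / gamma(a + b)))

# integrate over the M-dimensional unit cube; hcubature passes one point at a time
val, err = hcubature(x -> beta_pdf(x, 1.0, 2.0), zeros(M), ones(M); rtol = 1e-3, atol = 1e-6)

Since the integrand is a normalized pdf, val should come out close to 1.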

@CharlesRSmith44

Thank you for the quick response! I will look into that.
