Combining Cuba.jl with GPU #36

Closed
CharlesRSmith44 opened this issue May 6, 2021 · 3 comments
Comments


CharlesRSmith44 commented May 6, 2021

Hello,

I'm trying to use GPU computation via CUDA.jl to speed up calculating integrals. Is it possible to do so? If so, how? Here is my example code:

using Cuba, Distributions
using BenchmarkTools, Test, CUDA
using FLoops, FoldsCUDA
using SpecialFunctions

# user input
M= 10  #number of independent uniform random variables
atol=1e-6
rtol=1e-3
nvec=1000000
maxevals=100000000

# setting up integration function
function int_thread_col_gpu_precompile_test(x, f)
    w = CuArray(x)
    @floop CUDAEx() for i in 1:size(x, 2)
    # @floop for i in 1:size(x, 2)   # CPU executor: this version runs, but doesn't exploit the GPU
        val = reduce(*, @view(w[:, i]))
        f[i] = val
    end
    return f
end

cuhre(int_thread_col_gpu_precompile_test, M, 1, atol=atol, rtol=rtol)
suave(int_thread_col_gpu_precompile_test, M, 1, atol=atol, rtol=rtol, nvec = nvec, maxevals = maxevals)

If I use the CUDAEx() call, the code errors. If I don't, the code works fine but isn't exploiting the GPU effectively.

If I include the CUDAEx() call, the error message is:

GPU compilation of kernel transduce_kernel!(Nothing, Transducers.Reduction{Transducers.Map{typeof(first)}, Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}}}, Transducers.InitOf{Transducers.DefaultInitOf}, Int64, Base.OneTo{Int64}, UnitRange{Int64}) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type Transducers.Reduction{Transducers.Map{typeof(first)}, Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}}}, which is not isbits:
  .inner is of type Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}} which is not isbits.
    .inner is of type Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"} which is not isbits.
      .next is of type InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}} which is not isbits.
        .op is of type var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}} which is not isbits.
          .f is of type Matrix{Float64} which is not isbits.
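
The failing argument is the reducing-function closure, which captures the CPU-side Matrix{Float64} buffer f that Cuba passes to the integrand; a Matrix is not isbits, so it cannot be passed to a GPU kernel. A minimal sketch of one possible workaround, untested here: write into a device-side buffer and copy back once per call. The function name, the buffer fd, and the hand-written inner loop (replacing reduce, which may not compile inside a kernel) are illustrative, not from the original report.

# Sketch only: keep all kernel writes on the device, then do a single
# device-to-host copy into the buffer Cuba provides.
function int_thread_col_device_out(x, f)
    w  = CuArray(x)                            # host -> device copy of the points
    fd = CuArray{Float64}(undef, size(x, 2))   # hypothetical device-side results buffer
    @floop CUDAEx() for i in 1:size(w, 2)
        val = one(eltype(w))
        for j in 1:size(w, 1)                  # plain scalar loop instead of reduce
            val *= w[j, i]
        end
        fd[i] = val                            # closure now captures only device arrays
    end
    f[1, :] = Array(fd)                        # device -> host copy back to Cuba's buffer
    return f
end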

CharlesRSmith44 commented May 13, 2021

Alternatively, I can bind arrays as CuArrays after drawing the random variables, but the time cost of doing so seems prohibitive. Instead, it seems much faster to avoid the GPU and use the CPU exclusively. Is there any way to decrease the time of the GPU option? I.e., could I draw the random variables on the GPU initially rather than having to transfer them over? I think the transferring from the GPU to the CPU is what makes the code so slow for the GPU option.

For example:

import Pkg
Pkg.activate("joint_timing")
Pkg.instantiate()
using Cuba, Distributions
using BenchmarkTools, Test, CUDA
using FLoops, FoldsCUDA
using SpecialFunctions

@test Threads.nthreads()>1

# User Inputs
M= 5 # number of independent uniform random variables
atol=1e-6
rtol=1e-3
nvec=1000000
maxevals=100000000

# Initializing Functions
function int_cpu(x, f)
    # CPU: product of M independent Beta(1, 2) pdfs via Distributions
    f[1] = pdf(Product(Beta.(1.0, 2.0*ones(M))), x)
end

function int_cpu2(x, f)
    # CPU: the same Beta(1, 2) product pdf written out by hand
    f[1] = vec(prod(x' .^ (1.0 - 1.0) .* (1.0 .- x') .^ (2.0 - 1.0) ./ (gamma(1.0)*gamma(2.0)/gamma(3.0)), dims=2))[1]
end

function beta_pdf_gpu(x, a, b)
    # GPU: broadcasted Beta(a, b) pdf, multiplied along dims=1 (over the M coordinates of each point)
    prod(x .^ (a - 1.0f0) .* (1.0f0 .- x) .^ (b - 1.0f0) ./ (gamma(a)*gamma(b)/gamma(a + b)), dims=1)
end

function int_gpu(x, f)
    # copies x to the device on every call, evaluates on the GPU, then reads the scalar back
    f[1] = vec(beta_pdf_gpu(CuArray(x), 1.0f0, 2.0f0))[1]
end

display(@benchmark cuhre($int_gpu, $M, 1, atol=$atol, rtol=$rtol))  # 70 ms for M = 5, 11.7 s for M = 15
display(@benchmark cuhre($int_cpu, $M, 1, atol=$atol, rtol=$rtol))  # 2.0 ms for M = 5, 650 ms for M = 15
display(@benchmark cuhre($int_cpu2, $M, 1, atol=$atol, rtol=$rtol)) # 500 μs for M = 5, 100 ms for M = 15, 38 s for M = 25
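
To check whether the host-device copies are indeed the bottleneck, a rough timing sketch (the names xs and xd are illustrative; with the default nvec, cuhre passes one point per integrand call, so these are per-evaluation costs):

# Rough sketch: the two transfers that int_gpu above pays on every call
xs = rand(M, 1)                                  # one point, as cuhre passes with nvec = 1
@btime CuArray($xs)                              # host -> device copy
xd = CuArray(xs)
@btime Array(beta_pdf_gpu($xd, 1.0f0, 2.0f0))    # kernel launch + device -> host readback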


giordano commented May 14, 2021

I'm not really sure what you want to do here. Cuba.jl calls into a C shared library which runs on the CPU, so you're bound to constantly move data between the GPU and the CPU; there is really no way around it. Whether this is worth it depends on your specific problem, but in the vast majority of (or all?) cases I'd assume it isn't. I'm going to close this ticket because I don't think there is any action to take here.

It may be interesting to see whether HCubature.jl works out-of-the-box with CuArrays: that's a pure-Julia package which implements the same algorithm as the cuhre function here.
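
For reference, a minimal CPU-only sketch of what the HCubature.jl counterpart of the cuhre calls above might look like; this is illustrative and untested in this thread, and whether CuArrays inside the integrand actually help is exactly the open question:

using HCubature, SpecialFunctions

M = 5
# product of M independent Beta(a, b) pdfs, written out by hand as above
beta_pdf(x, a, b) = prod(@. x^(a - 1) * (1 - x)^(b - 1) / (gamma(a) * gamma(b) / gamma(a + b)))

# integrate over the M-dimensional unit cube; hcubature passes one point at a time
val, err = hcubature(x -> beta_pdf(x, 1.0, 2.0), zeros(M), ones(M); rtol = 1e-3, atol = 1e-6)

Since the integrand is a normalized pdf, val should come out close to 1.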

@CharlesRSmith44

Thank you for the quick response! I will look into that.
