Need more accurate threads per block #58

PhilipFackler · 2024-03-21T18:10:19Z

I hit a CUDA error about "too many resources" and discovered it was because my kernel required a lot of registers. I found the following answer helpful, but it uses the deprecated CUDAnative package. Calling maxthreads on the kernel compiled by cufunction returns a thread limit that takes the number of registers needed by the kernel into account. Based on that example, here's what I came up with for the one-dimensional JACC.parallel_for:

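function JACC.parallel_for(N::I, f::F, x...) where {I<:Integer,F<:Function}
  # Convert the arguments to their device-side representations.
  parallel_args = (f, x...)
  parallel_kargs = cudaconvert.(parallel_args)
  parallel_tt = Tuple{Core.Typeof.(parallel_kargs)...}
  # Compile the kernel without launching it so its resource usage can be queried.
  parallel_kernel = cufunction(_parallel_for_cuda, parallel_tt)
  # Maximum threads per block for this kernel, accounting for its register usage.
  maxPossibleThreads = CUDA.maxthreads(parallel_kernel)
  threads = min(N, maxPossibleThreads)
  blocks = cld(N, threads)  # integer ceiling division
  parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
end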

This works, although it probably needs more exploration.
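
To make the threads/blocks arithmetic easier to check without a GPU, here is a minimal sketch of it in plain Julia. The helper name `launch_dims` and the 640-thread limit are hypothetical, chosen only for illustration; they are not part of JACC or CUDA.jl. In practice the limit would be the value returned by `CUDA.maxthreads` for the compiled kernel.

```julia
# Hypothetical helper (not part of JACC or CUDA.jl): compute a 1D launch
# configuration from the problem size N and a per-kernel thread limit.
function launch_dims(N::Integer, max_threads::Integer)
    # Nothing to launch for an empty range; the caller should skip the launch.
    N > 0 || return (threads = 0, blocks = 0)
    threads = min(N, max_threads)
    blocks = cld(N, threads)  # integer ceiling division
    return (threads = threads, blocks = blocks)
end

# Assumed example: a register-heavy kernel limited to 640 threads per block.
cfg = launch_dims(1_000, 640)
# Threads launched beyond N; these have no element to process.
surplus = cfg.threads * cfg.blocks - 1_000
println(cfg, " surplus = ", surplus)
```

With N = 1000 and a limit of 640, this gives 2 blocks of 640 threads, so 280 threads fall past the end of the range. Whenever N is larger than the limit and not a multiple of it, the kernel needs a bounds check on the computed index for those surplus threads.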

PhilipFackler mentioned this issue Nov 1, 2024

WIP: Better blocks/threads calculations for CUDA backend #136

Open
