This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

ERROR: CUDA error: too many resources requested for launch (code #701, ERROR_LAUNCH_OUT_OF_RESOURCES) #211

Closed
mohamed82008 opened this issue Nov 22, 2018 · 3 comments

mohamed82008 (Contributor) commented Nov 22, 2018

The following script reproduces an error I am not sure how to solve. I could not reduce it to an MWE, but the trigger is the last line of the script test/resources_error.jl, which is not long. The actual problematic kernel is here. Interestingly, making the kernel function a no-op by removing all its lines makes the error go away!

run(`git clone https://github.com/mohamed82008/TopOpt.jl TopOpt`)
cd(() -> run(`git checkout resources_error`), "TopOpt")
using Pkg
Pkg.activate("./TopOpt")
Pkg.instantiate()

include("TopOpt/test/resources_error.jl")

The error I am getting is:

ERROR: CUDA error: too many resources requested for launch (code #701, ERROR_LAUNCH_OUT_OF_RESOURCES)
Stacktrace:
 [1] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
 [2] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:90 [inlined]
 [3] macro expansion at .\gcutils.jl:87 [inlined]
 [4] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:88 [inlined]
 [5] _launch at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:68 [inlined]
 [6] launch at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:60 [inlined]
 [7] macro expansion at .\gcutils.jl:87 [inlined]
 [8] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:171 [inlined]
 [9] #_cudacall#22(::Int64, ::Int64, ::Int64, ::CUDAdrv.CuStream, ::typeof(CUDAdrv._cudacall), ::CUDAdrv.CuFunction, ::Type{Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}}, ::Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:154
 [10] (::getfield(CUDAdrv, Symbol("#kw##_cudacall")))(::NamedTuple{(:blocks, :threads),Tuple{Int64,Int64}}, ::typeof(CUDAdrv._cudacall), ::CUDAdrv.CuFunction, ::Type, ::Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}) at .\none:0
 [11] #cudacall#21 at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:146 [inlined]
 [12] #cudacall at .\none:0 [inlined]
 [13] macro expansion at C:\Users\user\.julia\packages\CUDAnative\nqSUm\src\execution.jl:313 [inlined]
 [14] #call#93(::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol,Symbol},NamedTuple{(:blocks, :threads),Tuple{Int64,Int64}}}, ::CUDAnative.Kernel{TopOpt.kernel1,Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}}, ::CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global}, ::Float64, ::TopOpt.PowerPenalty{Float64}, ::Int64) at C:\Users\user\.julia\packages\CUDAnative\nqSUm\src\execution.jl:290
 [15] (::getfield(CUDAnative, Symbol("#kw#Kernel")))(::NamedTuple{(:blocks, :threads),Tuple{Int64,Int64}}, ::CUDAnative.Kernel{TopOpt.kernel1,Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}}, ::CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at .\none:0
 [16] mul!(::CuArray{Float64,1}, ::TopOpt.MatrixFreeOperator{Float64,2,ElementFEAInfo{2,Float64,Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},StaticArrays.SArray{Tuple{8},Float64,1,8},CuArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1},CuArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1},CuArray{Float64,1},JuAFEM.RefCube,JuAFEM.CellScalarValues{2,Float64,JuAFEM.RefCube},2,JuAFEM.FaceScalarValues{2,Float64,JuAFEM.RefCube},TopOptProblems.Metadata{CuArray{Tuple{Int64,Int64},1},CuArray{Int64,1},CuArray{Int64,2}},CuArray{Bool,1},CuArray{Int64,1}},PointLoadCantilever{2,Float64,4,4,Array{Int64,1},TopOptProblems.Metadata{Array{Tuple{Int64,Int64},1},Array{Int64,1},Array{Int64,2}}},CuArray{Float64,1},TopOpt.PowerPenalty{Float64}}, ::CuArray{Float64,1}) at C:\Users\user\.julia\dev\TopOpt\src\fea_solvers\matrix_free_operator.jl:70
 [17] top-level scope at none:0

My versioninfo() is:

Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

This is quite important to me, so any workaround will do for now. Thanks a lot!

maleadt transferred this issue from JuliaGPU/CUDAnative.jl Nov 22, 2018
maleadt added the bug label Nov 22, 2018
maleadt (Member) commented Nov 22, 2018

Ah, I figured this was mul! in CuArrays, but you implemented that method yourself? This isn't really a bug then, but probably an issue in your code. Please use Discourse for such issues.

Anyhow, you're probably launching too many threads and exhausting the resources of your GPU. I see you're checking CUDAdrv.MAX_THREADS_PER_BLOCK, which is good, but not enough: if your kernel uses many registers, that also limits the number of threads you can launch. You should use the lower-level CUDAnative APIs in that case, which let you introspect the maximum number of threads a compiled kernel can be launched with. For example:

parallel_args = (f, op, R, A, CIS, Rlength, Slength)
GC.@preserve parallel_args begin
    parallel_kargs = cudaconvert.(parallel_args)
    parallel_tt = Tuple{Core.Typeof.(parallel_kargs)...}
    parallel_kernel = cufunction(mapreducedim_kernel_parallel, parallel_tt)

    # we are limited in how many threads we can launch...
    ## by the kernel
    kernel_threads = CUDAnative.maxthreads(parallel_kernel)
    ## by the device
    dev = CUDAdrv.device()
    block_threads = (x=attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X),
                     y=attribute(dev, CUDAdrv.MAX_BLOCK_DIM_Y),
                     total=attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))

    # figure out a legal launch configuration
    y_thr = min(nextpow(2, Rlength ÷ 512 + 1), 512, block_threads.y, kernel_threads)
    x_thr = min(512 ÷ y_thr, Slength, block_threads.x,
                ceil(Int, block_threads.total / y_thr),
                ceil(Int, kernel_threads / y_thr))

    if x_thr >= 8
        blk, thr = (Rlength - 1) ÷ y_thr + 1, (x_thr, y_thr, 1)
        parallel_kernel(parallel_kargs...; threads=thr, blocks=blk)
    end
end
(Or just use maxthreads if you don't care about a 2D launch.) Also see https://github.com/JuliaGPU/CUDAnative.jl/blob/b0bdecb551f942acdc82309a954fb42c9230b44d/src/execution.jl#L163-L169
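Applied to a simple 1D launch like the one in your mul!, a minimal sketch could look like the following. It assumes kernel1 (the kernel from the stack trace above); `args` and `n` are placeholders for the kernel's argument tuple and the number of work items:

using CUDAnative, CUDAdrv

# placeholder inputs: `args` = kernel argument tuple, `n` = number of work items
kernel_args = cudaconvert.(args)                 # convert arguments to their device-side counterparts
kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
kernel = cufunction(kernel1, kernel_tt)          # compile the kernel without launching it

# the launch is limited both by the compiled kernel (register usage)
# and by the device itself
dev = CUDAdrv.device()
max_threads = min(CUDAnative.maxthreads(kernel),
                  attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))

threads = min(n, max_threads)
blocks = cld(n, threads)                         # enough blocks to cover all n items
kernel(kernel_args...; threads=threads, blocks=blocks)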

Another tip: launch Julia with JULIA_DEBUG="CUDAnative" to see some details about the number of registers your kernel uses; that number, multiplied by the block size, shouldn't exceed the device limits.
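For reference, one way to enable that debug output (a sketch; setting the variable before Julia starts is the usual route):

# from a shell, before starting Julia:
#   JULIA_DEBUG=CUDAnative julia             (bash)
#   set JULIA_DEBUG=CUDAnative && julia      (Windows cmd)
ENV["JULIA_DEBUG"] = "CUDAnative"   # setting it from within a session may also work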

maleadt closed this as completed Nov 22, 2018
mohamed82008 (Contributor, Author) commented

I will try these out tomorrow. Thanks for the tips. Sorry for blasting you with many issues today!

maleadt (Member) commented Nov 22, 2018

No problem; you're in luck I have some spare time today 🙂
