This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

ERROR: CUDA error: too many resources requested for launch (code #701, ERROR_LAUNCH_OUT_OF_RESOURCES) #211

Closed
mohamed82008 opened this issue Nov 22, 2018 · 3 comments

mohamed82008 (Contributor) commented Nov 22, 2018

The following script reproduces an error I am not sure how to solve. I could not reduce it to an MWE, but the trigger is the last line of the script test/resources_error.jl, which is not long. The actual problematic kernel is here. Interestingly, making the kernel function a no-op by removing all its lines makes the error go away!

run(`git clone https://github.com/mohamed82008/TopOpt.jl TopOpt`)
cd(() -> run(`git checkout resources_error`), "TopOpt")
using Pkg
Pkg.activate("./TopOpt")
Pkg.instantiate()

include("TopOpt/test/resources_error.jl")

The error I am getting is:

ERROR: CUDA error: too many resources requested for launch (code #701, ERROR_LAUNCH_OUT_OF_RESOURCES)
Stacktrace:
 [1] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
 [2] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:90 [inlined]
 [3] macro expansion at .\gcutils.jl:87 [inlined]
 [4] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:88 [inlined]
 [5] _launch at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:68 [inlined]
 [6] launch at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:60 [inlined]
 [7] macro expansion at .\gcutils.jl:87 [inlined]
 [8] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:171 [inlined]
 [9] #_cudacall#22(::Int64, ::Int64, ::Int64, ::CUDAdrv.CuStream, ::typeof(CUDAdrv._cudacall), ::CUDAdrv.CuFunction, ::Type{Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}}, ::Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:154
 [10] (::getfield(CUDAdrv, Symbol("#kw##_cudacall")))(::NamedTuple{(:blocks, :threads),Tuple{Int64,Int64}}, ::typeof(CUDAdrv._cudacall), ::CUDAdrv.CuFunction, ::Type, ::Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}) at .\none:0
 [11] #cudacall#21 at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\execution.jl:146 [inlined]
 [12] #cudacall at .\none:0 [inlined]
 [13] macro expansion at C:\Users\user\.julia\packages\CUDAnative\nqSUm\src\execution.jl:313 [inlined]
 [14] #call#93(::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol,Symbol},NamedTuple{(:blocks, :threads),Tuple{Int64,Int64}}}, ::CUDAnative.Kernel{TopOpt.kernel1,Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}}, ::CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global}, ::Float64, ::TopOpt.PowerPenalty{Float64}, ::Int64) at C:\Users\user\.julia\packages\CUDAnative\nqSUm\src\execution.jl:290
 [15] (::getfield(CUDAnative, Symbol("#kw#Kernel")))(::NamedTuple{(:blocks, :threads),Tuple{Int64,Int64}}, ::CUDAnative.Kernel{TopOpt.kernel1,Tuple{CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1,CUDAnative.AS.Global},Float64,TopOpt.PowerPenalty{Float64},Int64}}, ::CUDAnative.CuDeviceArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at .\none:0
 [16] mul!(::CuArray{Float64,1}, ::TopOpt.MatrixFreeOperator{Float64,2,ElementFEAInfo{2,Float64,Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},StaticArrays.SArray{Tuple{8},Float64,1,8},CuArray{Symmetric{Float64,StaticArrays.SArray{Tuple{8,8},Float64,2,64}},1},CuArray{StaticArrays.SArray{Tuple{8},Float64,1,8},1},CuArray{Float64,1},JuAFEM.RefCube,JuAFEM.CellScalarValues{2,Float64,JuAFEM.RefCube},2,JuAFEM.FaceScalarValues{2,Float64,JuAFEM.RefCube},TopOptProblems.Metadata{CuArray{Tuple{Int64,Int64},1},CuArray{Int64,1},CuArray{Int64,2}},CuArray{Bool,1},CuArray{Int64,1}},PointLoadCantilever{2,Float64,4,4,Array{Int64,1},TopOptProblems.Metadata{Array{Tuple{Int64,Int64},1},Array{Int64,1},Array{Int64,2}}},CuArray{Float64,1},TopOpt.PowerPenalty{Float64}}, ::CuArray{Float64,1}) at C:\Users\user\.julia\dev\TopOpt\src\fea_solvers\matrix_free_operator.jl:70
 [17] top-level scope at none:0

My versioninfo() is:

Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

This is quite important to me, so any workaround will do for now. Thanks a lot!

maleadt transferred this issue from JuliaGPU/CUDAnative.jl Nov 22, 2018
maleadt added the bug label Nov 22, 2018
maleadt (Member) commented Nov 22, 2018

Ah, I figured this was mul! in CuArrays, but you implemented that method yourself? This isn't really a bug then, but probably an issue in your code. Please use Discourse for such issues.

Anyhow, you're probably launching too many threads and exhausting the resources of your GPU. I see you're checking CUDAdrv.MAX_THREADS_PER_BLOCK, which is good, but not enough: if your kernel uses many registers, that also limits the number of threads you can launch. You should use the lower-level CUDAnative APIs in that case, which let you introspect the maximum number of threads a compiled kernel can be launched with. For example:

parallel_args = (f, op, R, A, CIS, Rlength, Slength)
GC.@preserve parallel_args begin
    parallel_kargs = cudaconvert.(parallel_args)
    parallel_tt = Tuple{Core.Typeof.(parallel_kargs)...}
    parallel_kernel = cufunction(mapreducedim_kernel_parallel, parallel_tt)

    # we are limited in how many threads we can launch...
    ## by the kernel
    kernel_threads = CUDAnative.maxthreads(parallel_kernel)
    ## by the device
    dev = CUDAdrv.device()
    block_threads = (x=attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X),
                     y=attribute(dev, CUDAdrv.MAX_BLOCK_DIM_Y),
                     total=attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))

    # figure out a legal launch configuration
    y_thr = min(nextpow(2, Rlength ÷ 512 + 1), 512, block_threads.y, kernel_threads)
    x_thr = min(512 ÷ y_thr, Slength, block_threads.x,
                ceil(Int, block_threads.total / y_thr),
                ceil(Int, kernel_threads / y_thr))

    if x_thr >= 8
        blk, thr = (Rlength - 1) ÷ y_thr + 1, (x_thr, y_thr, 1)
        parallel_kernel(parallel_kargs...; threads=thr, blocks=blk)
    end
end
(Or just use maxthreads if you don't care about a 2D launch.) Also see https://github.com/JuliaGPU/CUDAnative.jl/blob/b0bdecb551f942acdc82309a954fb42c9230b44d/src/execution.jl#L163-L169
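Applied to a simple 1D launch like the one in your mul!, a minimal sketch could look like the following. It assumes kernel1 (the kernel from the stack trace above); `args` and `n` are placeholders for the kernel's argument tuple and the number of work items:

using CUDAnative, CUDAdrv

# placeholder inputs: `args` = kernel argument tuple, `n` = number of work items
kernel_args = cudaconvert.(args)                 # convert arguments to their device-side counterparts
kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
kernel = cufunction(kernel1, kernel_tt)          # compile the kernel without launching it

# the launch is limited both by the compiled kernel (register usage)
# and by the device itself
dev = CUDAdrv.device()
max_threads = min(CUDAnative.maxthreads(kernel),
                  attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))

threads = min(n, max_threads)
blocks = cld(n, threads)                         # enough blocks to cover all n items
kernel(kernel_args...; threads=threads, blocks=blocks)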

Another tip: launch Julia with JULIA_DEBUG="CUDAnative" to see some details about the number of registers your kernel uses; that number, multiplied by the block size, shouldn't exceed the device limits.
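For reference, one way to enable that debug output (a sketch; setting the variable before Julia starts is the usual route):

# from a shell, before starting Julia:
#   JULIA_DEBUG=CUDAnative julia             (bash)
#   set JULIA_DEBUG=CUDAnative && julia      (Windows cmd)
ENV["JULIA_DEBUG"] = "CUDAnative"   # setting it from within a session may also work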

maleadt closed this as completed Nov 22, 2018
mohamed82008 (Contributor, Author) commented

I will try these out tomorrow. Thanks for the tips. Sorry for blasting you with many issues today!

maleadt (Member) commented Nov 22, 2018

No problem; you're in luck I have some spare time today 🙂
