Explore CUDA graph API #65
There is also …, which includes the default stream IIUC. (In any case we might want to switch to the per-thread default stream.)
BLAS has cublasSetStream, so it might require some work across the package, though.
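For reference, a rough sketch of the driver-level call involved, not the package's actual plumbing; the cublas_handle() accessor below is a made-up placeholder for wherever CuArrays keeps its global cuBLAS handle:
using CUDAdrv
import CUDAdrv: CuStream_t
stream = CuStream()
handle = cublas_handle()  # placeholder for CuArrays' global cublasHandle_t
# after this call, cuBLAS operations made through `handle` are enqueued on `stream`
ccall((:cublasSetStream_v2, "libcublas"), Cint,
      (Ptr{Cvoid}, CuStream_t), handle, stream)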
The legacy default stream seems different from the regular default one? |
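For reference, cuda.h gives those special streams fixed handle values; a minimal sketch of the raw handles (not wrapped in CuStream objects):
const CU_STREAM_LEGACY     = Ptr{Cvoid}(1)  # the legacy default stream
const CU_STREAM_PER_THREAD = Ptr{Cvoid}(2)  # the per-thread default stream
# the regular default stream is just the NULL handle, Ptr{Cvoid}(0)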
This will require the (inevitable) work of putting streams everywhere:
using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t
stream = CuStream()
@apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
A = cu(rand(2,2)) # implicitly uploads on the default stream
B = cu(rand(2,2))
... which I wasn't planning on attempting in the near future.
The reason is that I haven't put enough thought into what the API should look like, and how it would be compatible with CUDA.
The question is also where this functionality should go, and how it should interact with foreign libraries. It also requires figuring out which operations to make asynchronous, because AFAIK you can only do a synchronous cuMemcpyHtoD on the default stream. Maybe we should just make everything asynchronous.
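As a minimal sketch of what such an asynchronous upload on an explicit stream looks like, using the same Mem.upload! call as the exploration below (the exact Mem.alloc signature depends on the CUDAdrv version, and a truly asynchronous copy would additionally need page-locked host memory):
using CUDAdrv
import CUDAdrv: Mem
stream = CuStream()
a = rand(Float32, 1024)
buf = Mem.alloc(sizeof(a))               # device buffer of matching size
Mem.upload!(buf, a, stream; async=true)  # enqueued on `stream` instead of blocking on the default stream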
Some more exploration:
using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t, isvalid
# graph
const CuGraph_t = Ptr{Cvoid}
mutable struct CuGraph
    handle::CuGraph_t
    ctx::CuContext

    function CuGraph(f::Function, stream::CuStream)
        handle_ref = Ref{CuGraph_t}()

        # capture the work that `f` enqueues on `stream` into a graph
        @apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
        f()
        @apicall(:cuStreamEndCapture, (CuStream_t, Ptr{CuGraph_t}), stream, handle_ref)

        ctx = CuCurrentContext()
        obj = new(handle_ref[], ctx)
        finalizer(unsafe_destroy!, obj)
        return obj
    end
end

function unsafe_destroy!(x::CuGraph)
    if isvalid(x.ctx)
        @apicall(:cuGraphDestroy, (CuGraph_t,), x)
    end
end

Base.unsafe_convert(::Type{CuGraph_t}, x::CuGraph) = x.handle
# graph node
const CuGraphNode_t = Ptr{Cvoid}
# graph execution
const CuGraphExec_t = Ptr{Cvoid}
function instantiate(graph::CuGraph)
    exec_ref = Ref{CuGraphExec_t}()
    error_node = Ref{CuGraphNode_t}()
    buflen = 256
    buf = Vector{Cchar}(undef, buflen)  # receives a diagnostic message if instantiation fails
    @apicall(:cuGraphInstantiate,
             (Ptr{CuGraphExec_t}, CuGraph_t, Ptr{CuGraphNode_t}, Ptr{Cchar}, Csize_t),
             exec_ref, graph, error_node, buf, buflen)
    return exec_ref[]
end

function launch(exec::CuGraphExec_t, stream::CuStream=CuDefaultStream())
    @apicall(:cuGraphLaunch, (CuGraphExec_t, CuStream_t), exec, stream)
end

launch(graph::CuGraph, stream::CuStream=CuDefaultStream()) =
    launch(instantiate(graph), stream)
# demo
stream = CuStream()
graph = CuGraph(stream) do
    dims = (3,4)

    a = rand(Float32, dims)
    #d_a = cu(a)   # would implicitly upload on the default stream, outside the capture
    buf_a = Mem.alloc(a)
    Mem.upload!(buf_a, a, stream; async=true)
    d_a = CuArray{Float32,2}(buf_a, dims)

    b = rand(Float32, dims)
    #d_b = cu(b)
    buf_b = Mem.alloc(b)
    Mem.upload!(buf_b, b, stream; async=true)
    d_b = CuArray{Float32,2}(buf_b, dims)

    c = rand(Float32, dims)
    #d_c = cu(c)
    buf_c = Mem.alloc(c)
    Mem.upload!(buf_c, c, stream; async=true)
    d_c = CuArray{Float32,2}(buf_c, dims)

    #d_out = similar(d_a)
    buf_out = Mem.alloc(b)
    d_out = CuArray{Float32,2}(buf_out, dims)

    function vadd(c, a, b)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        c[i] = a[i] + b[i]
        return
    end

    # d_out .= d_a .+ d_b
    @cuda threads=prod(dims) stream=stream vadd(d_out, d_a, d_b)
    # d_out .= d_out .+ d_c
    @cuda threads=prod(dims) stream=stream vadd(d_out, d_out, d_c)
end
launch(graph)
I had thought it would merge kernels, but it doesn't; it just avoids multiple launches. For this to be efficient we'd have to cache graphs, which seems hard to do in an automatic fashion. Maybe I'm overthinking this and it should just be explicit, but then it might not be worth it. It would be useful to have some workloads that would benefit from this, in order to estimate that (cc @MikeInnes).
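For reference, a hand-rolled sketch of what that explicit caching could look like on top of the instantiate/launch helpers above (graph_cache and cached_launch are made up for illustration):
const graph_cache = Dict{Symbol,CuGraphExec_t}()

function cached_launch(f::Function, key::Symbol, stream::CuStream)
    # capture and instantiate only the first time a key is seen;
    # afterwards just relaunch the cached executable graph
    exec = get!(graph_cache, key) do
        instantiate(CuGraph(f, stream))
    end
    launch(exec, stream)
end
An iterative workload would then pay the capture/instantiate cost once and only a cuGraphLaunch per subsequent iteration.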
Doing this automatically seems like one of the big wins we can get in Julia. I'm imagining seeing this as a compiler feature rather than an API as such; as part of our optimisation passes we'll look for multiple kernel launches in sequence and fuse them. There are obviously a lot of details to be worked out there, but as long as we can build a graph using only type information we should be fine; it's not so far off from fusing a broadcast tree. AFAIK this feature is pretty squarely aimed at DL, as it's increasingly difficult to stress a V100 with only matmuls and broadcasts. But I agree that it's easy to check the numbers here and we should be looking for a use case first.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH
https://devblogs.nvidia.com/cuda-10-features-revealed/
I also came across http://www.cudahandbook.com/2018/09/cuda-graphs-roi-and-api-adoption/ today, but haven't given it a proper read yet.