Explore CUDA graph API #65

Closed · maleadt opened this issue Nov 29, 2018 · 8 comments · Fixed by #877

Labels: cuda kernels (Stuff about writing CUDA kernels.), speculative (Not sure about this one yet.)

Comments


maleadt commented Nov 29, 2018

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH
https://devblogs.nvidia.com/cuda-10-features-revealed/

I also came across http://www.cudahandbook.com/2018/09/cuda-graphs-roi-and-api-adoption/ today but haven't given it a proper read yet.

vchuravy commented

There is also cudaStreamBeginCapture to turn a Stream into a Graph.

Capture may not be initiated if stream is cudaStreamLegacy.

Which includes the default stream, IIUC. (In any case, we might want to switch to the per-thread default stream.)

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g1811d555e88205c2f60d61535294c4fe
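For reference, the driver API exposes both built-in default streams as fixed handles; a minimal sketch, assuming CuStream_t is CUDAdrv's raw Ptr{Cvoid} stream handle alias, with the values taken from cuda.h:

import CUDAdrv: CuStream_t

# built-in stream handles defined by cuda.h
const CU_STREAM_LEGACY     = CuStream_t(1)  # the legacy default stream; cannot begin a capture
const CU_STREAM_PER_THREAD = CuStream_t(2)  # the per-thread default stream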


maleadt commented Nov 29, 2018

BLAS has cublasSetStream, so this might require some work across the package, though.
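A minimal sketch of what that would look like, assuming direct access to a cuBLAS handle (the set_stream helper and the :libcublas name are placeholders; the real handle and library path live inside CuArrays.CUBLAS):

import CUDAdrv: CuStream, CuStream_t

const cublasHandle_t = Ptr{Cvoid}

# point cuBLAS at the capturing stream before issuing any captured BLAS calls
function set_stream(handle::cublasHandle_t, stream::CuStream)
    status = ccall((:cublasSetStream_v2, :libcublas), Cint,
                   (cublasHandle_t, CuStream_t), handle, stream)
    status == 0 || error("cublasSetStream_v2 failed with status $status")
    return
end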


maleadt commented Nov 29, 2018

This will require the (inevitable) work of putting streams everywhere:

using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t

stream = CuStream()
@apicall(:cuStreamBeginCapture, (CuStream_t,), stream)

A = cu(rand(2,2)) # implicitly uploads on the default stream
B = cu(rand(2,2))
ERROR: LoadError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream (code #906, ERROR_STREAM_CAPTURE_IMPLICIT)
Stacktrace:
 [1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float32}, ::Int64, ::CuStream) at /home/tbesard/Julia/CUDAdrv/src/memory.jl:235

... which I wasn't planning on attempting in the near future.


maleadt commented Nov 29, 2018

... which I wasn't planning on attempting in the near future.

The reason being that I haven't put enough thought into how the API should look, and how it would be compatible with CUDA:

  1. like contexts, have a global default stream and use do blocks to switch it (see the sketch below), or
  2. put the stream in the array and thread it through everywhere.
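
A rough sketch of option 1 as a task-local default stream (with_stream and default_stream are hypothetical helpers, they don't exist in CUDAdrv; option 2 would instead store the stream in the CuArray and have every operation pick it up from its inputs):

using CUDAdrv

# hypothetical: the stream uploads and launches would use when none is given
default_stream() = get(task_local_storage(), :CuStream, CuDefaultStream())

# hypothetical: switch the task-local default stream for the duration of `f`
function with_stream(f, s::CuStream)
    old = get(task_local_storage(), :CuStream, nothing)
    task_local_storage(:CuStream, s)
    try
        return f()
    finally
        if old === nothing
            delete!(task_local_storage(), :CuStream)
        else
            task_local_storage(:CuStream, old)
        end
    end
end

Every upload and kernel launch in CuArrays/CUDAnative would then have to consult default_stream() instead of hard-coding the legacy stream.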

There's also the question of where this functionality should go, and how it should interact with foreign libraries.

It also requires figuring out which operations to make asynchronous, because AFAIK you can only do a synchronous cuMemcpyHtoD on the default stream. Maybe we should just make everything asynchronous.


maleadt commented Nov 29, 2018

Some more exploration:

using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t, isvalid

# graph
const CuGraph_t = Ptr{Cvoid}
mutable struct CuGraph
    handle::CuGraph_t
    ctx::CuContext

    function CuGraph(f::Function, stream::CuStream)
        handle_ref = Ref{CuGraph_t}()

        @apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
        f()
        @apicall(:cuStreamEndCapture, (CuStream_t, Ptr{CuGraph_t}), stream, handle_ref)

        ctx = CuCurrentContext()
        obj = new(handle_ref[], ctx)
        finalizer(unsafe_destroy!, obj)
        return obj
    end 
end
function unsafe_destroy!(x::CuGraph)
    if isvalid(x.ctx)
        @apicall(:cuGraphDestroy, (CuGraph_t,), x)
    end
end
Base.unsafe_convert(::Type{CuGraph_t}, x::CuGraph) = x.handle

# graph node
const CuGraphNode_t = Ptr{Cvoid}

# graph execution
const CuGraphExec_t = Ptr{Cvoid}
function instantiate(graph::CuGraph)
    exec_ref = Ref{CuGraphExec_t}()
    error_node = Ref{CuGraphNode_t}()
    buflen = 256
    buf = Vector{Cchar}(undef, buflen)
    @apicall(:cuGraphInstantiate,
             (Ptr{CuGraphExec_t}, CuGraph_t, Ptr{CuGraphNode_t}, Ptr{Cchar}, Csize_t),
             exec_ref, graph, error_node, buf, buflen)
    return exec_ref[]
end
function launch(exec::CuGraphExec_t, stream::CuStream=CuDefaultStream())
    @apicall(:cuGraphLaunch, (CuGraphExec_t, CuStream_t), exec, stream)
end
launch(graph::CuGraph, stream::CuStream=CuDefaultStream()) =
    launch(instantiate(graph), stream)

# demo
stream = CuStream()
graph = CuGraph(stream) do
    dims=(3,4)

    a = rand(Float32, dims)
    #d_a = cu(a)
    buf_a = Mem.alloc(a)
    Mem.upload!(buf_a, a, stream; async=true)
    d_a = CuArray{Float32,2}(buf_a, dims)

    b = rand(Float32, dims)
    #d_b = cu(b)
    buf_b = Mem.alloc(b)
    Mem.upload!(buf_b, b, stream; async=true)
    d_b = CuArray{Float32,2}(buf_b, dims)

    c = rand(Float32, dims)
    #d_c = cu(c)
    buf_c = Mem.alloc(c)
    Mem.upload!(buf_c, c, stream; async=true)
    d_c = CuArray{Float32,2}(buf_c, dims)

    #d_out = similar(d_a)
    buf_out = Mem.alloc(b)
    d_out = CuArray{Float32,2}(buf_out, dims)

    # d_out .= d_a .+ d_b
    function vadd(a, b, c)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        c[i] = a[i] + b[i]
        return
    end
    @cuda threads=prod(dims) stream=stream vadd(d_a, d_b, d_out)

    # d_out .= d_out .+ d_c
    @cuda threads=prod(dims) stream=stream vadd(d_out, d_c, d_out)
end
launch(graph)
==8594== NVPROF is profiling process 8594, command: julia wip.jl
==8594== Profiling application: julia wip.jl
==8594== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   72.22%  5.8240us         2  2.9120us  1.8880us  3.9360us  ptxcall_vadd_1
                   27.78%  2.2400us         3     746ns     608ns     960ns  [CUDA memcpy HtoD]
      API calls:   70.19%  101.91ms         1  101.91ms  101.91ms  101.91ms  cuCtxCreate
                   28.57%  41.479ms         1  41.479ms  41.479ms  41.479ms  cuCtxDestroy
                    0.56%  807.98us         1  807.98us  807.98us  807.98us  cuModuleLoadDataEx
                    0.37%  534.67us         1  534.67us  534.67us  534.67us  cuGraphInstantiate
                    0.14%  208.24us         4  52.059us  4.7680us  180.50us  cuMemAlloc
                    0.07%  95.018us         1  95.018us  95.018us  95.018us  cuModuleUnload
                    0.03%  41.206us         1  41.206us  41.206us  41.206us  cuStreamCreate
                    0.02%  30.955us         1  30.955us  30.955us  30.955us  cuGraphLaunch
                    0.01%  17.924us         1  17.924us  17.924us  17.924us  cuStreamDestroy
                    0.01%  14.153us         3  4.7170us  1.6700us  10.220us  cuMemcpyHtoDAsync
                    0.01%  9.8060us        11     891ns     337ns  2.4880us  cuCtxGetCurrent
                    0.00%  6.3230us         1  6.3230us  6.3230us  6.3230us  cuDeviceGetPCIBusId
                    0.00%  6.2120us         1  6.2120us  6.2120us  6.2120us  cuGraphDestroy
                    0.00%  6.0810us         5  1.2160us     420ns  2.2970us  cuDeviceGetAttribute
                    0.00%  4.2560us         2  2.1280us  1.5410us  2.7150us  cuDeviceGet
                    0.00%  4.2250us         1  4.2250us  4.2250us  4.2250us  cuStreamBeginCapture
                    0.00%  3.4370us         2  1.7180us     826ns  2.6110us  cuDeviceGetCount
                    0.00%  2.0160us         1  2.0160us  2.0160us  2.0160us  cuDriverGetVersion
                    0.00%  1.3710us         1  1.3710us  1.3710us  1.3710us  cuStreamEndCapture
                    0.00%     840ns         1     840ns     840ns     840ns  cuModuleGetFunction
                    0.00%     657ns         1     657ns     657ns     657ns  cuCtxGetDevice

I had thought it would merge kernels, but it doesn't; it just avoids the separate kernel launch calls.

For this to be efficient we'd have to cache graphs, which seems hard to do automatically. I could imagine the CuGraph constructor doing something dispatch-like on the graph-construction body and its arguments, but that seems iffy since graph construction might depend on information that isn't in the types, such as array sizes, or worse, actual values.
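
For concreteness, a rough sketch of such a cache keyed on the graph-construction body and its argument types (cached_exec and _exec_cache are hypothetical; as noted, anything the body depends on that isn't in the types, like sizes or values, would break this keying):

const _exec_cache = Dict{Any,CuGraphExec_t}()

function cached_exec(f::F, stream::CuStream, args...) where {F<:Function}
    key = (F, map(typeof, args))
    get!(_exec_cache, key) do
        # capture once, instantiate once; later calls reuse the executable graph
        instantiate(CuGraph(() -> f(args...), stream))
    end
end

# repeated launches with the same body and argument types then skip capture:
# launch(cached_exec(capture_body, stream, d_a, d_b, d_out))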

Maybe I'm overthinking this and it should just be explicit, but then it might not be worth it. It would be useful to have some workloads that would benefit from this, in order to estimate that (cc @MikeInnes).

MikeInnes commented

Doing this automatically seems like one of the big wins we can get in Julia. I'm imagining seeing this as a compiler feature rather than an API as such; as part of our optimisation passes we'll look for multiple kernel launches in sequence and fuse them. There are obviously a lot of details to be worked out there, but as long as we can build a graph using only type information we should be fine; it's not so far off from fusing a broadcast tree.

AFAIK this feature is pretty squarely aimed at DL, as it's increasingly difficult to stress a V100 with only matmuls and broadcasts. But I agree that it's easy to check the numbers here and we should be looking for a use case first.

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl May 27, 2020
@maleadt maleadt added the cuda kernels (Stuff about writing CUDA kernels.) and speculative (Not sure about this one yet.) labels May 27, 2020