Explore CUDA graph API #65

Closed · maleadt opened this issue Nov 29, 2018 · 8 comments · Fixed by #877

Labels: cuda kernels (Stuff about writing CUDA kernels.), speculative (Not sure about this one yet.)

Comments


maleadt commented Nov 29, 2018

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH
https://devblogs.nvidia.com/cuda-10-features-revealed/

I also came across http://www.cudahandbook.com/2018/09/cuda-graphs-roi-and-api-adoption/ today but haven't given it a proper read yet.

vchuravy commented

There is also cudaStreamBeginCapture to turn a Stream into a Graph.

Capture may not be initiated if stream is cudaStreamLegacy.

Which includes the default stream, IIUC. (In any case, we might want to switch to the per-thread default stream.)

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g1811d555e88205c2f60d61535294c4fe
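For reference, the driver API exposes both built-in default streams as fixed handles; a minimal sketch, assuming CuStream_t is CUDAdrv's raw Ptr{Cvoid} stream handle alias, with the values taken from cuda.h:

import CUDAdrv: CuStream_t

# built-in stream handles defined by cuda.h
const CU_STREAM_LEGACY     = CuStream_t(1)  # the legacy default stream; cannot begin a capture
const CU_STREAM_PER_THREAD = CuStream_t(2)  # the per-thread default stream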


maleadt commented Nov 29, 2018

BLAS has cublasSetStream, so this might require some work across the package, though.
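A minimal sketch of what that would look like, assuming direct access to a cuBLAS handle (the set_stream helper and the :libcublas name are placeholders; the real handle and library path live inside CuArrays.CUBLAS):

import CUDAdrv: CuStream, CuStream_t

const cublasHandle_t = Ptr{Cvoid}

# point cuBLAS at the capturing stream before issuing any captured BLAS calls
function set_stream(handle::cublasHandle_t, stream::CuStream)
    status = ccall((:cublasSetStream_v2, :libcublas), Cint,
                   (cublasHandle_t, CuStream_t), handle, stream)
    status == 0 || error("cublasSetStream_v2 failed with status $status")
    return
end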


maleadt commented Nov 29, 2018

This will require the (inevitable) work of putting streams everywhere:

using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t

stream = CuStream()
@apicall(:cuStreamBeginCapture, (CuStream_t,), stream)

A = cu(rand(2,2)) # implicitly uploads on the default stream
B = cu(rand(2,2))
ERROR: LoadError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream (code #906, ERROR_STREAM_CAPTURE_IMPLICIT)
Stacktrace:
 [1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float32}, ::Int64, ::CuStream) at /home/tbesard/Julia/CUDAdrv/src/memory.jl:235

... which I wasn't planning on attempting in the near future.


maleadt commented Nov 29, 2018

... which I wasn't planning on attempting in the near future.

The reason being that I haven't put enough thought into how the API should look, and how it would be compatible with CUDA:

  1. like contexts, have a global default stream and use do blocks to switch it (see the sketch below), or
  2. put the stream in the array and thread it through everywhere.
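
A rough sketch of option 1 as a task-local default stream (with_stream and default_stream are hypothetical helpers, they don't exist in CUDAdrv; option 2 would instead store the stream in the CuArray and have every operation pick it up from its inputs):

using CUDAdrv

# hypothetical: the stream uploads and launches would use when none is given
default_stream() = get(task_local_storage(), :CuStream, CuDefaultStream())

# hypothetical: switch the task-local default stream for the duration of `f`
function with_stream(f, s::CuStream)
    old = get(task_local_storage(), :CuStream, nothing)
    task_local_storage(:CuStream, s)
    try
        return f()
    finally
        if old === nothing
            delete!(task_local_storage(), :CuStream)
        else
            task_local_storage(:CuStream, old)
        end
    end
end

Every upload and kernel launch in CuArrays/CUDAnative would then have to consult default_stream() instead of hard-coding the legacy stream.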

There's also the question of where this functionality should go, and how it should interact with foreign libraries.

It also requires figuring out which operations to make asynchronous, because AFAIK you can only do a synchronous cuMemcpyHtoD on the default stream. Maybe we should just make everything asynchronous.


maleadt commented Nov 29, 2018

Some more exploration:

using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t, isvalid

# graph
const CuGraph_t = Ptr{Cvoid}
mutable struct CuGraph
    handle::CuGraph_t
    ctx::CuContext

    function CuGraph(f::Function, stream::CuStream)
        handle_ref = Ref{CuGraph_t}()

        @apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
        f()
        @apicall(:cuStreamEndCapture, (CuStream_t, Ptr{CuGraph_t}), stream, handle_ref)

        ctx = CuCurrentContext()
        obj = new(handle_ref[], ctx)
        finalizer(unsafe_destroy!, obj)
        return obj
    end 
end
function unsafe_destroy!(x::CuGraph)
    if isvalid(x.ctx)
        @apicall(:cuGraphDestroy, (CuGraph_t,), x)
    end
end
Base.unsafe_convert(::Type{CuGraph_t}, x::CuGraph) = x.handle

# graph node
const CuGraphNode_t = Ptr{Cvoid}

# graph execution
const CuGraphExec_t = Ptr{Cvoid}
function instantiate(graph::CuGraph)
    exec_ref = Ref{CuGraphExec_t}()
    error_node = Ref{CuGraphNode_t}()
    buflen = 256
    buf = Vector{Cchar}(undef, buflen)
    @apicall(:cuGraphInstantiate,
             (Ptr{CuGraphExec_t}, CuGraph_t, Ptr{CuGraphNode_t}, Ptr{Cchar}, Csize_t),
             exec_ref, graph, error_node, buf, buflen)
    return exec_ref[]
end
function launch(exec::CuGraphExec_t, stream::CuStream=CuDefaultStream())
    @apicall(:cuGraphLaunch, (CuGraphExec_t, CuStream_t), exec, stream)
end
launch(graph::CuGraph, stream::CuStream=CuDefaultStream()) =
    launch(instantiate(graph), stream)

# demo
stream = CuStream()
graph = CuGraph(stream) do
    dims=(3,4)

    a = rand(Float32, dims)
    #d_a = cu(a)
    buf_a = Mem.alloc(a)
    Mem.upload!(buf_a, a, stream; async=true)
    d_a = CuArray{Float32,2}(buf_a, dims)

    b = rand(Float32, dims)
    #d_b = cu(b)
    buf_b = Mem.alloc(b)
    Mem.upload!(buf_b, b, stream; async=true)
    d_b = CuArray{Float32,2}(buf_b, dims)

    c = rand(Float32, dims)
    #d_c = cu(c)
    buf_c = Mem.alloc(c)
    Mem.upload!(buf_c, c, stream; async=true)
    d_c = CuArray{Float32,2}(buf_c, dims)

    #d_out = similar(d_a)
    buf_out = Mem.alloc(b)
    d_out = CuArray{Float32,2}(buf_out, dims)

    # d_out .= d_a .+ d_b
    function vadd(a, b, c)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        c[i] = a[i] + b[i]
        return
    end
    @cuda threads=prod(dims) stream=stream vadd(d_a, d_b, d_out)

    # d_out .= d_out .+ d_c
    @cuda threads=prod(dims) stream=stream vadd(d_out, d_c, d_out)
end
launch(graph)
==8594== NVPROF is profiling process 8594, command: julia wip.jl
==8594== Profiling application: julia wip.jl
==8594== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   72.22%  5.8240us         2  2.9120us  1.8880us  3.9360us  ptxcall_vadd_1
                   27.78%  2.2400us         3     746ns     608ns     960ns  [CUDA memcpy HtoD]
      API calls:   70.19%  101.91ms         1  101.91ms  101.91ms  101.91ms  cuCtxCreate
                   28.57%  41.479ms         1  41.479ms  41.479ms  41.479ms  cuCtxDestroy
                    0.56%  807.98us         1  807.98us  807.98us  807.98us  cuModuleLoadDataEx
                    0.37%  534.67us         1  534.67us  534.67us  534.67us  cuGraphInstantiate
                    0.14%  208.24us         4  52.059us  4.7680us  180.50us  cuMemAlloc
                    0.07%  95.018us         1  95.018us  95.018us  95.018us  cuModuleUnload
                    0.03%  41.206us         1  41.206us  41.206us  41.206us  cuStreamCreate
                    0.02%  30.955us         1  30.955us  30.955us  30.955us  cuGraphLaunch
                    0.01%  17.924us         1  17.924us  17.924us  17.924us  cuStreamDestroy
                    0.01%  14.153us         3  4.7170us  1.6700us  10.220us  cuMemcpyHtoDAsync
                    0.01%  9.8060us        11     891ns     337ns  2.4880us  cuCtxGetCurrent
                    0.00%  6.3230us         1  6.3230us  6.3230us  6.3230us  cuDeviceGetPCIBusId
                    0.00%  6.2120us         1  6.2120us  6.2120us  6.2120us  cuGraphDestroy
                    0.00%  6.0810us         5  1.2160us     420ns  2.2970us  cuDeviceGetAttribute
                    0.00%  4.2560us         2  2.1280us  1.5410us  2.7150us  cuDeviceGet
                    0.00%  4.2250us         1  4.2250us  4.2250us  4.2250us  cuStreamBeginCapture
                    0.00%  3.4370us         2  1.7180us     826ns  2.6110us  cuDeviceGetCount
                    0.00%  2.0160us         1  2.0160us  2.0160us  2.0160us  cuDriverGetVersion
                    0.00%  1.3710us         1  1.3710us  1.3710us  1.3710us  cuStreamEndCapture
                    0.00%     840ns         1     840ns     840ns     840ns  cuModuleGetFunction
                    0.00%     657ns         1     657ns     657ns     657ns  cuCtxGetDevice

I had thought it would merge kernels, but it doesn't; it just avoids the separate kernel launch calls.

For this to be efficient we'd have to cache graphs, which seems hard to do automatically. I could imagine the CuGraph constructor doing something dispatch-like on the graph-construction body and its arguments, but that seems iffy since graph construction might depend on information that isn't in the types, such as array sizes, or worse, actual values.
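
For concreteness, a rough sketch of such a cache keyed on the graph-construction body and its argument types (cached_exec and _exec_cache are hypothetical; as noted, anything the body depends on that isn't in the types, like sizes or values, would break this keying):

const _exec_cache = Dict{Any,CuGraphExec_t}()

function cached_exec(f::F, stream::CuStream, args...) where {F<:Function}
    key = (F, map(typeof, args))
    get!(_exec_cache, key) do
        # capture once, instantiate once; later calls reuse the executable graph
        instantiate(CuGraph(() -> f(args...), stream))
    end
end

# repeated launches with the same body and argument types then skip capture:
# launch(cached_exec(capture_body, stream, d_a, d_b, d_out))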

Maybe I'm overthinking this and it should just be explicit, but then it might not be worth it. It would be useful to have some workloads that would benefit from this, in order to estimate that (cc @MikeInnes).

MikeInnes commented

Doing this automatically seems like one of the big wins we can get in Julia. I'm imagining seeing this as a compiler feature rather than an API as such; as part of our optimisation passes we'll look for multiple kernel launches in sequence and fuse them. There are obviously a lot of details to be worked out there, but as long as we can build a graph using only type information we should be fine; it's not so far off from fusing a broadcast tree.

AFAIK this feature is pretty squarely aimed at DL, as it's increasingly difficult to stress a V100 with only matmuls and broadcasts. But I agree that it's easy to check the numbers here and we should be looking for a use case first.

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl May 27, 2020
@maleadt maleadt added the cuda kernels (Stuff about writing CUDA kernels.) and speculative (Not sure about this one yet.) labels May 27, 2020