Stream synchronization is slow when waiting on the event from CUDA #1910
Comments
Are you doing work on other tasks that would explain why the scheduler doesn't immediately continue with the synchronization task? Could you extend our |
@maleadt I only have one task in my benchmark, and the number of threads doesn't change much (I run julia with Command to run the profiler:
Here's the extended profile with NVTX info (purple bars are the number of spins in the busy-wait, the green bar is querying the stream after waiting for the timer): Full profile: copy_julia_long_t2.zip
Apparently, the slow part is waiting for the event. At first, I thought that the thread spawn is the issue, but replacing
The code for `nonblocking_synchronize` with NVTX ranges/events:
@inline function nonblocking_synchronize(stream::CuStream)
    # fast path
    isdone(stream) && return
    NVTX.@range "busy-wait" begin
        # minimize latency of short operations by busy-waiting,
        # initially without even yielding to other tasks
        spins = 0
        while spins < 256
            if spins < 32
                ccall(:jl_cpu_pause, Cvoid, ())
                # Temporary solution before we have gc transition support in codegen.
                ccall(:jl_gc_safepoint, Cvoid, ())
            else
                NVTX.@mark "yield" payload=spins
                yield()
            end
            isdone(stream) && return
            spins += 1
        end
    end
    NVTX.@range "wait for an event" begin
        # minimize CPU usage of long-running kernels by waiting for an event signalled by CUDA
        event = Base.Event()
        launch(; stream) do
            notify(event)
        end
        NVTX.@mark "launched CUDA function"
        # if an error occurs, the callback may never fire, so use a timer to detect such cases
        dev = device()
        NVTX.@mark "switched to current device"
        timer = Timer(0; interval=1)
        NVTX.@range "spawn threads and sync" begin
            Base.@sync begin
                Threads.@spawn begin
                    NVTX.@range "wait for timer and check" begin
                        try
                            device!(dev)
                            while true
                                try
                                    Base.wait(timer)
                                catch err
                                    err isa EOFError && break
                                    rethrow()
                                end
                                if unsafe_cuStreamQuery(stream) != ERROR_NOT_READY
                                    break
                                end
                                NVTX.@mark "checked stream"
                            end
                        finally
                            notify(event)
                        end
                    end
                end
                Threads.@spawn begin
                    NVTX.@range "wait for event and close timer" begin
                        Base.wait(event)
                        close(timer)
                    end
                end
            end
        end
    end
    return
end
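The busy-wait-then-block structure of the function above can be tried without a GPU. The following is a minimal sketch, not CUDA.jl code: `FakeStream`, `spin_then_wait`, and the returned symbols are hypothetical names, and the stream is simulated by an atomic flag that a worker task sets after a short sleep. It only illustrates which phase ends up observing completion.

```julia
# Hypothetical stand-in for a CUDA stream: `done` becomes true when the "work" finishes.
struct FakeStream
    done::Threads.Atomic{Bool}
end
FakeStream() = FakeStream(Threads.Atomic{Bool}(false))

isdone(s::FakeStream) = s.done[]

# Spin briefly, then yield to other tasks, then fall back to blocking on an event.
# Returns a symbol naming the phase in which completion was observed.
function spin_then_wait(s::FakeStream, event::Base.Event; max_spins::Int=256)
    isdone(s) && return :fast_path
    spins = 0
    while spins < max_spins
        if spins < 32
            ccall(:jl_cpu_pause, Cvoid, ())   # cheap CPU pause, no task switch
        else
            yield()                           # let other tasks run
        end
        isdone(s) && return :busy_wait
        spins += 1
    end
    wait(event)                               # long operation: block until notified
    return :event
end

# Usage: a worker task simulates a kernel that takes about 50 ms.
s = FakeStream()
ev = Base.Event()
worker = Threads.@spawn begin
    sleep(0.05)
    s.done[] = true
    notify(ev)
end
result = spin_then_wait(s, ev)
wait(worker)
println("completion observed in phase: ", result)
```

Which phase is reported depends on how long the simulated work takes relative to the spin budget; the latency discussed in this issue arises in the final, event-based phase.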
Possibly related to this issue we ran into a while ago, where the nonblocking_synchronize() timeout was stalling due to I/O on the main thread. Our hacky workaround at the time was to use a local CUDA.jl branch with nonblocking_synchronize() disabled. I haven't revisited this in a while, so not sure if it was ever resolved, but your issue may be related |
@jpdoane Thanks for the link! The discussion was an interesting read, and it seems like indeed the issue here is the same. I think the solution proposed by @maleadt in the discussion here, i.e., having only a busy-loop with |
As synchronize may be called by non-toplevel code (e.g. array copy functions), maybe it's better to introduce a preference instead? |
I'm experiencing this issue, too: @vchuravy suggested that this can be made faster in 1.10? |
@vchuravy Can you elaborate on your suggestion? IIUC, wait for on a Condition, and have a |
This addresses JuliaGPU#1910 by adding the boolean environment variable `JULIA_CUDA_NONBLOCKING_SYNCHRONIZE` to control if nonblocking synchronizes are used or not.
Should be fixed by #2025. Synchronization is still slower than CUDA C, but it's much faster than previously (from 150us or so down to 5us, while CUDA C is 0.5us). In case that's still too much, there's a preference to disable nonblocking synchronization.
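For readers who want to try the preference mentioned above, here is a hedged sketch using Preferences.jl. The key name `nonblocking_synchronization` is an assumption that is not stated in this thread; check the CUDA.jl documentation for your installed version before relying on it.

```julia
# Assumption: CUDA.jl reads a package preference named "nonblocking_synchronization"
# (key name not confirmed in this thread; verify against your CUDA.jl version).
using Preferences, CUDA

# Writes the setting to LocalPreferences.toml in the active project.
set_preferences!(CUDA, "nonblocking_synchronization" => false; force=true)

# Preferences are read when the package is loaded, so restart Julia
# for the change to take effect.
```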
Describe the bug
Synchronizing streams in CUDA.jl can sometimes be slower by a factor of ~200 compared to CUDA C when waiting on the event signaled by CUDA, as implemented in `nonblocking_synchronize`:
CUDA.jl/lib/cudadrv/stream.jl, lines 159 to 188 in 883db71
Screenshots from Nsight Systems
The CUDA C version for reference:
The Julia version: Note the gaps between consecutive kernel runs when waiting for an event.
To reproduce
I use the following code to profile the CUDA.jl version:
And the following code for a reference C implementation:
Manifest.toml
Expected behavior
Synchronizing streams in CUDA.jl should be comparable to CUDA C in performance.
Version info
Details on Julia:
Details on CUDA:
Additional context
My use case is a physics simulation running in a multi-GPU and multi-node environment. I use KernelAbstractions.jl for backend-agnostic kernels, and MPI.jl for communication. For scalability, the computations on GPU and MPI communications need to overlap. I use the following pattern for hiding the communication behind computations:
I rely on tasks to run the device-to-host copy and MPI communication, and I need synchronization after calling `update_A!` to avoid a data race with kernels in nested tasks which depend on the results of `update_A!`. I noticed that stream synchronization becomes a bottleneck for our typical kernel runtimes. On top of that, the waiting times are non-uniform, spiking randomly to up to several milliseconds, which is orders of magnitude slower than just calling `cudaStreamSynchronize` and is longer than our typical kernel running times. With many MPI processes, the chances that at least one of the processes hits a spike are high, leading to all MPI processes having to wait almost every loop iteration.