
sync_threads() appears to not be sync'ing threads #61

Closed
nrxszvo opened this issue Jan 31, 2019 · 10 comments
Labels: bug (Something isn't working), cuda kernels (Stuff about writing CUDA kernels)

Comments


nrxszvo commented Jan 31, 2019

My setup:
GTX 1080 Ti
CUDA v9, driver v396.26
CUDAnative v0.9.1
Julia v1.0.2

I have been experimenting with @maetshju's implementation of CTC loss from this Flux Pull Request. I noticed that when using multiple threads to execute one of the CUDA kernels, the results differ from single-thread. After extensive debugging, I discovered that adding @cuprintf calls immediately after two of the sync_threads() calls causes the multi-threaded code to produce the same results as the single-threaded code. It appears that the sync_threads() calls are not actually sync'ing the threads in this particular instance (and @cuprintf seems to have the side-effect of forcing thread synchronization).
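
To make the failure mode concrete, the kernel boils down to a pattern like the following (an illustrative sketch with made-up names and values, not the actual ctc.jl code; CUDAnative/CuArrays-era syntax):

using CUDAnative, CuArrays

function kernel_sketch(accum, grad)
    tid = threadIdx().x
    if tid == 1                      # only thread 1 initializes accum
        for i in 1:length(accum)
            accum[i] = Float32(i)
        end
    end
    sync_threads()                   # every thread should wait here...
    grad[tid] = 2f0 * accum[tid]     # ...before any thread reads accum
    # Uncommenting a @cuprintf at this point was enough to restore correct ordering:
    # @cuprintf("thread %ld grad = %f\n", Int64(tid), Float64(grad[tid]))
    return nothing
end

accum = CuArrays.zeros(Float32, 3)
grad  = CuArrays.zeros(Float32, 3)
@cuda threads=3 kernel_sketch(accum, grad)
println(Array(grad))                 # expected: [2.0, 4.0, 6.0]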

Unfortunately, I haven't been able to reproduce this issue in a simple code example, so I have attached my version of the entire CTC implementation with a test case that reproduces the problem:
ctc.jl.txt

To reproduce:

julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 1 grad[8]: -0.829811   accum[8]: -3.782628
thread 1 grad[9]: 0.164570   accum[9]: -6.222818

multi-thread intermediate printouts:
thread 2 grad[8]: 0.090031   accum[8]: -inf
thread 3 grad[9]: 0.244728   accum[9]: -inf
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241   accum[7]: -inf

max grad diff: 9.20e-01
        single-thread grad: -0.82981
        multi-thread grad: 0.09003

The lines beginning with "thread" are printouts of intermediate results from the computeBetasAndGrads kernel. In the multi-thread case, threads 2 and 3 should not be printing out values for grad and accum (line 168 in ctc.jl) before thread 1 has finished initializing accum (line 147). The maximum difference between the single- and multi-thread gradients is about 0.9.

Then modify ctc.jl by uncommenting the two @cuprintf calls on lines 152 and 240 ("sync 1" and "sync 2" printouts) and re-run:

julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 1 grad[8]: -0.829811   accum[8]: -3.782628
thread 1 grad[9]: 0.164570   accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 1

multi-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
sync 1 thread: 2
sync 1 thread: 3
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 2 grad[8]: -0.829811   accum[8]: -3.782628
thread 3 grad[9]: 0.164570   accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3

max grad diff: 0.00e+00
        single-thread grad: -0.31767
        multi-thread grad: -0.31767

Now thread 1 finishes initializing accum before the other threads begin calculating values for grad. The single- and multi-thread gradients are now identical.

So, although I suspect the actual problem is something else, this example does seem to indicate that the sync_threads() calls on lines 151 and 239 are simply not working.

Is a ccall within a CUDA kernel guaranteed to be synchronous with respect to the kernel thread?
Do you have any suggestions for how to debug this further?

Thanks!


maleadt commented Jan 31, 2019

That sounds an awful lot like JuliaGPU/CUDAnative.jl#4 ... Is there shared memory involved?

Try running under cuda-memcheck --tool=synccheck to rule out other synchronization-related errors.

If that doesn't help, I'm afraid the only way forward is to minimize the kernel, decomposing any abstractions to their lowest-level llvmcalls along the way. It's a pretty painful process (do use Revise to make it somewhat easier), but the LLVM IR needs to be reasonably simple in order to try and test alternative IR rewrite passes to deal with the divergent control flow that breaks ptxas.


maleadt commented Jan 31, 2019

I see you're using a fairly old version of CUDAnative, though; did you encounter any warnings during compilation along the lines of unreachable control flow with ... predecessors?
Please try the latest version of CUDAnative, if possible. The CFG rewrite passes have changed quite a bit.


nrxszvo commented Jan 31, 2019

Oh wow, I didn't realize I was so far behind versions, thanks!

Well, updating to CUDAnative and CUDAdrv 1.0.1 seems to have changed the behavior but not fixed the problem; now one of the intermediate @cuprintfs seems to determine whether threads are properly sync'ed or not.

synccheck prints out one of these barrier errors for each thread:

========= Barrier error detected. Divergent thread(s) in warp
=========     at 0x000046b8 in /home/mhorg/.julia/packages/CUDAnative/Mdd3w/src/device/cuda_intrinsics/synchronization.jl:12:ptxcall_computeBetasAndGradKernel_3
=========     by thread (2,0,0) in block (0,0,0)
=========     Device Frame:/home/mhorg/.julia/packages/CUDAnative/Mdd3w/src/device/cuda_intrinsics/synchronization.jl:12:ptxcall_computeBetasAndGradKernel_3 (ptxcall_computeBetasAndGradKernel_3 : 0x46c0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:[0x7f25181a598a]
=========     Host Frame:[0x7f25181a5bb0]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:[0x7f25181a53f9]
=========     Host Frame:[0x7f25181a548a]
=========     Host Frame:[0x7f25181a50d6]
=========     Host Frame:[0x7f25181a51e7]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_f__apply + 0x246) [0x564f6]
=========     Host Frame:[0x7f25181a4856]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_f__apply + 0x246) [0x564f6]
=========     Host Frame:[0x7f256415bbf9]
=========     Host Frame:[0x7f256415bf9f]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b1740]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b1469]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b1dec]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b256f]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x5e5ec]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b303d]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x7dd9c]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x5266e]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x7f0e3]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0xc1eee6]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0x6aa89d]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0xc27edb]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0x6aa59d]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:julia [0x1ae9]
=========     Host Frame:julia [0x1514]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:julia [0x15b5]


maleadt commented Jan 31, 2019

OK, that might or might not be a sign of JuliaGPU/CUDAnative.jl#4. I haven't looked at your code in detail, but you know that you shouldn't be syncing threads in divergent branches, right?


nrxszvo commented Jan 31, 2019

I don't believe any of the syncs are within divergent branches; however, both of the failing sync_threads calls occur immediately after divergent branches (the if tid == 1 blocks).


maleadt commented Jan 31, 2019

I'll have to look at it closer then. I'd appreciate it if you could minimize the test case though, e.g. by getting rid of Flux. I mentioned Revise before; one way I typically tackle this is to define the two kernels, put the global code in a main function, and call that function after doing a Revise.includet("bug.jl"). Thanks.
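
For reference, that workflow looks roughly like this (file and function names are just placeholders):

using Revise
Revise.includet("bug.jl")    # track the file; subsequent edits are picked up automatically
main()                       # run the reproducer
# ...edit bug.jl, save, then simply call main() again instead of restarting Julia
main()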


maetshju commented Feb 4, 2019

This is a strange bug, because I don't recall it happening when I performed the initial testing of the loss function; perhaps I chalked it up to precision differences between the GPU version and the solutions I worked by hand.

In any case, I've put a version of ctc without Flux on Gist, and I've also moved the testing code into a function called main. I also took a look to see if I could help create a simpler test case, but was running into strange errors when indexing arrays in the kernel that I haven't yet figured out. I can keep trying though.

And I'm happy to help with understanding the kernels if anyone needs it!


maetshju commented Feb 5, 2019

After paring down a lot of the code, I may have pinpointed the source of the error. It involves a div function call. Here's a Gist of the pared down version.

On line 43, explicitly casting T to Float32 resolves the synchronization issue for me. Similarly, using a constant value of 3 (the value that T resolves to in this particular use case) prevents the synchronization error from occurring. So the line would read while idx <= div(length(grad), Float32(T)).

Making the same edit to the test case from the original post (on line 297, I believe) also prevents the synchronization issue from cropping up; using the constant value of 3 works there as well. I'm not sure why this works, though.
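
In other words (reconstructed from the description above, not copied verbatim from the Gist):

# Before: T is used directly in div(), which triggers the barrier error
while idx <= div(length(grad), T)
    # loop body unchanged
end

# After: the explicit cast (or the constant 3) avoids the error
while idx <= div(length(grad), Float32(T))
    # loop body unchanged
end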

Does this fix resolve the issue on your end, @nrxszvo?


nrxszvo commented Feb 5, 2019

@maetshju I can confirm that the cast to Float32 fixes the issue for me. I can also confirm that after removing all @cuprintfs from the original example and adding the cast, the issue is fixed. Nice detective work! And very bizarre behavior!

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda kernels Stuff about writing CUDA kernels. labels May 27, 2020

maleadt commented Aug 17, 2023

Going to assume this was a case of #1746, which is fixed now.

@maleadt maleadt closed this as completed Aug 17, 2023