
sync_threads() appears to not be sync'ing threads #61

Closed
nrxszvo opened this issue Jan 31, 2019 · 10 comments
Labels: bug (Something isn't working), cuda kernels (Stuff about writing CUDA kernels)

Comments


nrxszvo commented Jan 31, 2019

My setup:
GTX 1080 Ti
CUDA v9, driver v396.26
CUDAnative v0.9.1
Julia v1.0.2

I have been experimenting with @maetshju's implementation of CTC loss from this Flux Pull Request. I noticed that when using multiple threads to execute one of the CUDA kernels, the results differ from single-thread. After extensive debugging, I discovered that adding @cuprintf calls immediately after two of the sync_threads() calls causes the multi-threaded code to produce the same results as the single-threaded code. It appears that the sync_threads() calls are not actually sync'ing the threads in this particular instance (and @cuprintf seems to have the side-effect of forcing thread synchronization).
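
To make the failure mode concrete, the kernel boils down to a pattern like the following (an illustrative sketch with made-up names and values, not the actual ctc.jl code; CUDAnative/CuArrays-era syntax):

using CUDAnative, CuArrays

function kernel_sketch(accum, grad)
    tid = threadIdx().x
    if tid == 1                      # only thread 1 initializes accum
        for i in 1:length(accum)
            accum[i] = Float32(i)
        end
    end
    sync_threads()                   # every thread should wait here...
    grad[tid] = 2f0 * accum[tid]     # ...before any thread reads accum
    # Uncommenting a @cuprintf at this point was enough to restore correct ordering:
    # @cuprintf("thread %ld grad = %f\n", Int64(tid), Float64(grad[tid]))
    return nothing
end

accum = CuArrays.zeros(Float32, 3)
grad  = CuArrays.zeros(Float32, 3)
@cuda threads=3 kernel_sketch(accum, grad)
println(Array(grad))                 # expected: [2.0, 4.0, 6.0]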

Unfortunately, I haven't been able to reproduce this issue in a simple code example, so I have attached my version of the entire CTC implementation with a test case that reproduces the problem:
ctc.jl.txt

To reproduce:

julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 1 grad[8]: -0.829811   accum[8]: -3.782628
thread 1 grad[9]: 0.164570   accum[9]: -6.222818

multi-thread intermediate printouts:
thread 2 grad[8]: 0.090031   accum[8]: -inf
thread 3 grad[9]: 0.244728   accum[9]: -inf
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241   accum[7]: -inf

max grad diff: 9.20e-01
        single-thread grad: -0.82981
        multi-thread grad: 0.09003

The lines beginning with "thread" are printouts of intermediate results from the computeBetasAndGrads kernel. In the multi-thread case, threads 2 and 3 should not be printing out values for grad and accum (line 168 in ctc.jl) before thread 1 has finished initializing accum (line 147). The maximum difference between the single- and multi-thread gradients is about 0.9.

Then modify ctc.jl by uncommenting the two @cuprintf calls on lines 152 and 240 ("sync 1" and "sync 2" printouts) and re-run:

julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 1 grad[8]: -0.829811   accum[8]: -3.782628
thread 1 grad[9]: 0.164570   accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 1

multi-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
sync 1 thread: 2
sync 1 thread: 3
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 2 grad[8]: -0.829811   accum[8]: -3.782628
thread 3 grad[9]: 0.164570   accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3

max grad diff: 0.00e+00
        single-thread grad: -0.31767
        multi-thread grad: -0.31767

Now thread 1 finishes initializing accum before the other threads begin calculating values for grad. The single- and multi-thread gradients are now identical.

So, although I suspect the actual problem is something else, this example does seem to indicate that the sync_threads() calls on lines 151 and 239 are simply not working.

Is a ccall within a CUDA kernel guaranteed to be synchronous with respect to the kernel thread?
Do you have any suggestions for how to debug this further?

Thanks!


maleadt commented Jan 31, 2019

That sounds an awful lot like JuliaGPU/CUDAnative.jl#4 ... Is there shared memory involved?

Try running under cuda-memcheck --tool=synccheck to rule out other synchronization-related errors.

If that doesn't help, I'm afraid the only way forward is to minimize the kernel, decomposing any abstractions to their lowest-level llvmcalls along the way. It's a pretty painful process (do use Revise to make it somewhat easier), but the LLVM IR needs to be reasonably simple in order to try and test alternative IR rewrite passes to deal with the divergent control flow that breaks ptxas.


maleadt commented Jan 31, 2019

I see you're using a fairly old version of CUDAnative, though; did you encounter any warnings during compilation along the lines of unreachable control flow with ... predecessors?
Please try the latest version of CUDAnative, if possible. The CFG rewrite passes have changed quite a bit.


nrxszvo commented Jan 31, 2019

Oh wow, I didn't realize I was so far behind versions, thanks!

Well, updating to CUDAnative and CUDAdrv 1.0.1 seems to have changed the behavior but not fixed the problem; now one of the intermediate @cuprintfs seems to determine whether threads are properly sync'ed or not.

synccheck prints out one of these barrier errors for each thread:

========= Barrier error detected. Divergent thread(s) in warp
=========     at 0x000046b8 in /home/mhorg/.julia/packages/CUDAnative/Mdd3w/src/device/cuda_intrinsics/synchronization.jl:12:ptxcall_computeBetasAndGradKernel_3
=========     by thread (2,0,0) in block (0,0,0)
=========     Device Frame:/home/mhorg/.julia/packages/CUDAnative/Mdd3w/src/device/cuda_intrinsics/synchronization.jl:12:ptxcall_computeBetasAndGradKernel_3 (ptxcall_computeBetasAndGradKernel_3 : 0x46c0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:[0x7f25181a598a]
=========     Host Frame:[0x7f25181a5bb0]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:[0x7f25181a53f9]
=========     Host Frame:[0x7f25181a548a]
=========     Host Frame:[0x7f25181a50d6]
=========     Host Frame:[0x7f25181a51e7]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_f__apply + 0x246) [0x564f6]
=========     Host Frame:[0x7f25181a4856]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_f__apply + 0x246) [0x564f6]
=========     Host Frame:[0x7f256415bbf9]
=========     Host Frame:[0x7f256415bf9f]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b1740]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b1469]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b1dec]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b256f]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x5e5ec]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x1b303d]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x7dd9c]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 [0x5266e]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x7f0e3]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0xc1eee6]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0x6aa89d]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0xc27edb]
=========     Host Frame:/home/mhorg/julia-1.0.2/lib/julia/sys.so [0x6aa59d]
=========     Host Frame:/home/mhorg/julia-1.0.2/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x48176]
=========     Host Frame:julia [0x1ae9]
=========     Host Frame:julia [0x1514]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:julia [0x15b5]


maleadt commented Jan 31, 2019

OK, that might or might not be a sign of JuliaGPU/CUDAnative.jl#4. I haven't looked at your code in detail, but you know that you shouldn't be syncing threads in divergent branches, right?


nrxszvo commented Jan 31, 2019

I don't believe any of the syncs are within divergent branches; however, both of the failing sync_threads calls occur immediately after divergent branches (the if tid == 1 blocks).


maleadt commented Jan 31, 2019

I'll have to look at it closer then. I'd appreciate it if you could minimize the test case though, e.g. by getting rid of Flux. I mentioned Revise before; one way I typically tackle this is to define the two kernels, put the global code in a main function, and call that function after doing a Revise.includet("bug.jl"). Thanks.
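
For reference, that workflow looks roughly like this (file and function names are just placeholders):

using Revise
Revise.includet("bug.jl")    # track the file; subsequent edits are picked up automatically
main()                       # run the reproducer
# ...edit bug.jl, save, then simply call main() again instead of restarting Julia
main()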


maetshju commented Feb 4, 2019

This is a strange bug, because I don't recall it happening when I performed the initial testing of the loss function; perhaps I chalked it up to precision differences between the GPU version and the solutions I worked by hand.

In any case, I've put a version of ctc without Flux on Gist, and I've also moved the testing code into a function called main. I also took a look to see if I could help create a simpler test case, but was running into strange errors when indexing arrays in the kernel that I haven't yet figured out. I can keep trying though.

And I'm happy to help with understanding the kernels if anyone needs it!


maetshju commented Feb 5, 2019

After paring down a lot of the code, I may have pinpointed the source of the error. It involves a div function call. Here's a Gist of the pared down version.

On line 43, explicitly casting T to Float32 resolves the synchronization issue for me. Similarly, using a constant value of 3 (the value that T resolves to in this particular use case) prevents the synchronization error from occurring. So the line would read while idx <= div(length(grad), Float32(T)).

Making the same edit to the test case from the original post (on line 297, I believe) also prevents the synchronization issue from cropping up; using the constant value of 3 works there as well. I'm not sure why this works, though.
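
In other words (reconstructed from the description above, not copied verbatim from the Gist):

# Before: T is used directly in div(), which triggers the barrier error
while idx <= div(length(grad), T)
    # loop body unchanged
end

# After: the explicit cast (or the constant 3) avoids the error
while idx <= div(length(grad), Float32(T))
    # loop body unchanged
end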

Does this fix resolve the issue on your end, @nrxszvo?


nrxszvo commented Feb 5, 2019

@maetshju I can confirm that the cast to Float32 fixes the issue for me. I can also confirm that after removing all @cuprintfs from the original example and adding the cast, the issue is fixed. Nice detective work! And very bizarre behavior!

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda kernels Stuff about writing CUDA kernels. labels May 27, 2020

maleadt commented Aug 17, 2023

Going to assume this was a case of #1746, which is fixed now.

@maleadt maleadt closed this as completed Aug 17, 2023