sync_threads() appears to not be sync'ing threads #61
Comments
That sounds an awful lot like JuliaGPU/CUDAnative.jl#4... Is there shared memory involved? Try and run under synccheck. If that doesn't help, I'm afraid the only way forward is to minimize the kernel, decomposing any abstractions to their lowest-level operations.
I see you're using a fairly old version of CUDAnative, though; did you encounter any warnings during compilation?
Oh wow, I didn't realize I was so far behind on versions, thanks! Well, updating to CUDAnative and CUDAdrv 1.0.1 seems to have changed the behavior but not fixed the problem; now synccheck reports a barrier error for each thread at one of the intermediate sync_threads() calls.
OK, that might or might not be a sign of JuliaGPU/CUDAnative.jl#4. I haven't looked at your code in detail, but you know that you shouldn't be syncing threads in divergent branches, right?
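For context, here is a minimal sketch (illustrative only, not taken from ctc.jl) of the difference between a barrier placed inside a divergent branch and one placed after the branch has reconverged, written in the CUDAnative kernel style used in this thread:

```julia
using CUDAnative

# BAD: sync_threads() inside a branch that only some threads take.
# Threads that skip the branch never reach the barrier, so the threads
# that do reach it wait on a barrier with no matching arrival --
# exactly the kind of situation synccheck reports.
function bad_kernel(a)
    i = threadIdx().x
    if i % 2 == 0
        a[i] = 0f0
        sync_threads()      # divergent barrier: undefined behavior
    end
    return nothing
end

# OK: the branch reconverges first, and every thread in the block
# reaches the same barrier before any thread reads the results.
function ok_kernel(a)
    i = threadIdx().x
    if i % 2 == 0
        a[i] = 0f0
    end
    sync_threads()          # all threads hit this together
    a[i] += 1f0
    return nothing
end
```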
I don't believe any of the syncs are within divergent branches; however, both of the failing sync_threads calls occur immediately after divergent branches.
I'll have to look at it closer then. I'd appreciate it if you could minimize the test case though, e.g. getting rid of Flux. I mentioned Revise before; one way I typically tackle this is to define two kernels, put that global code in a function, and iterate on it from the REPL.
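A minimal sketch of that kind of workflow, with made-up file and function names (this is not the attached code): keep the kernel and a wrapper function in a file tracked by Revise, then re-call the wrapper from the REPL after each edit instead of restarting Julia.

```julia
# hypothetical minimize.jl -- a stripped-down harness with no Flux
using CUDAnative, CuArrays

function kernel_under_test(out)
    i = threadIdx().x
    out[i] = Float32(i)        # placeholder for the suspect code path
    return nothing
end

# Wrapper so Revise can track edits: after changing the kernel above,
# just call run_case() again from the REPL.
function run_case(n = 32)
    out = CuArray(zeros(Float32, n))
    @cuda threads=n kernel_under_test(out)
    return Array(out)
end

# Typical session:
#   julia> using Revise; includet("minimize.jl")
#   julia> run_case()
```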
This is a strange bug, because I don't recall it happening when I performed the initial testing of the loss function; perhaps I chalked it up to precision differences between the GPU version and the solutions I worked out by hand. In any case, I've put a version of ctc without Flux on Gist, and I've also moved the testing code into a function. And I'm happy to help with understanding the kernels if anyone needs it!
After paring down a lot of the code, I may have pinpointed the source of the error: it involves a missing Float32 cast. On line 43, explicitly casting the value to Float32 prevents the synchronization problem. Similarly, making the same explicit cast in the test case from the original post (on line 297, I believe) prevents the synchronization issue from cropping up. Does this fix resolve the issue on your end, @nrxszvo?
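The exact ctc.jl code isn't quoted in this thread, so the following is only a hypothetical illustration of the kind of change being described: an explicit cast so that a constant that would otherwise be Float64 stays Float32 inside the kernel.

```julia
using CUDAnative

# Hypothetical illustration -- not the actual ctc.jl code. A bare literal
# such as -Inf is a Float64 in Julia, so mixing it into Float32 kernel
# arithmetic promotes that arithmetic to Float64; the reported fix makes
# the Float32 cast explicit so everything stays in Float32.
function init_accum!(accum)
    i = threadIdx().x
    # accum[i] = -Inf             # implicit Float64 constant
    accum[i] = Float32(-Inf)      # explicit cast, analogous to the fix described above
    return nothing
end
```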
@maetshju I can confirm that the cast to Float32 fixes the issue for me. I can also confirm that it still works after removing all of the @cuprintf calls.
Going to assume this was a case of #1746, which is fixed now. |
My setup:
GTX 1080 Ti
CUDA v9, driver v396.26
CUDAnative v0.9.1
Julia v1.0.2
I have been experimenting with @maetshju's implementation of CTC loss from this Flux Pull Request. I noticed that when using multiple threads to execute one of the CUDA kernels, the results differ from the single-threaded results. After extensive debugging, I discovered that adding @cuprintf calls immediately after two of the sync_threads() calls causes the multi-threaded code to produce the same results as the single-threaded code. It appears that the sync_threads() calls are not actually sync'ing the threads in this particular instance (and @cuprintf seems to have the side effect of forcing thread synchronization).

Unfortunately, I haven't been able to reproduce this issue in a simple code example, so I have attached my version of the entire CTC implementation with a test case that reproduces the problem:
ctc.jl.txt
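The attached file is the real test case; as a self-contained stand-in, here is a minimal sketch (illustrative names only) of the pattern being described, where thread 1 initializes a buffer and the other threads must not read it until after sync_threads(). The commented-out @cuprintf stands in for the workaround mentioned above.

```julia
using CUDAnative

# Minimal sketch of the pattern under discussion (illustrative only):
# thread 1 initializes accum, and every other thread must wait at the
# barrier before reading it. If the barrier does not take effect,
# threads 2..n can read uninitialized values -- the symptom reported here.
function init_then_read!(accum, grad)
    i = threadIdx().x
    if i == 1
        for j in 1:length(accum)
            accum[j] = 0f0
        end
    end
    sync_threads()                        # every thread must wait here
    # @cuprintf("thread synced\n")        # the workaround: uncommenting a
    #                                     # print here forced correct ordering
    grad[i] = accum[i] + 1f0
    return nothing
end
```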
To reproduce, run the test case in the attached file.
The lines beginning with "thread" are printouts of intermediate results from the computeBetasAndGrads kernel. In the multi-thread case, threads 2 and 3 should not be printing out values for grad and accum (line 168 in ctc.jl) before thread 1 has finished initializing accum (line 147). The maximum difference between the single- and multi-thread gradients is about 0.9.

Then modify ctc.jl by uncommenting the two @cuprintf calls on lines 152 and 240 (the "sync 1" and "sync 2" printouts) and re-run. Now thread 1 finishes initializing accum before the other threads begin calculating values for grad, and the single- and multi-thread gradients are identical.

So, although I suspect the actual problem is something else, this example does seem to indicate that the sync_threads() calls on lines 151 and 239 are simply not working.

Is a ccall within a CUDA kernel guaranteed to be synchronous with respect to the kernel thread? Do you have any suggestions for how to debug this further?
Thanks!