Speed-up rand: Tausworthe RNG with shared random state. #788
Conversation
Using this in my binomial kernel improves its speed quite a bit (30-40%, see this comment), moving it closer to the speeds I'm getting in Julia 1.5.4. There is no way to use these new RNGs in Julia 1.5 to cross-check, is there? |
No. I expected greater improvements though, as you can see in my results, but you're probably not generating many random numbers (or the overhead is in other operations you're performing). |
BTRS draws 2-3 uniforms on average, and yes, the other operations are not negligible. That's why I found that up to count=17 it's still better to use the naive algorithm, which needs `count` uniform RVs. But a 30-40% improvement is not bad at all. |
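For reference, a minimal CPU-side sketch of the trade-off described above, with a hypothetical helper name (not the actual kernel code): the naive sampler consumes one uniform per trial, which is why it wins for small `count`, while a BTRS rejection sampler draws only 2-3 uniforms on average but does more work per sample.

```julia
# Naive binomial sampler: one uniform random draw per Bernoulli trial.
# `rand_fn` is any function returning a uniform number in [0, 1); inside a
# kernel this would be the device-side rand discussed in this PR.
function naive_binomial(rand_fn, count::Integer, p::Real)
    x = 0
    for _ in 1:count
        x += rand_fn() < p   # Bool promotes to Int when accumulated
    end
    return x
end

# CPU usage example; on the GPU the same per-thread loop structure applies.
naive_binomial(() -> rand(Float32), 10, 0.3f0)
```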
CI failure is JuliaLang/julia#40252. |
Force-pushed from a771850 to 75ba3ac.
Should be good to go, but will require Julia 1.6.1, so let's only merge when at least the required change has landed on the |
Did some more optimization, and added a host object with specialized
For reference:
|
Force-pushed from 6499593 to 8f675fe.
Well, that's an interesting (probably unrelated) segfault:
|
Hmm, I just realized the current design is probably not deterministic and depends on how warps are scheduled: the state is 32 bytes stored in shared memory, updated cooperatively by the threads in a warp, so depending on how warps execute you might get a different state to work with. I don't think we want that. EDIT: actually, with the additional syncs that isn't true anymore; instead we just don't generate unique numbers anymore...
|
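To illustrate the hazard being described, here is a stripped-down sketch (not the PR's actual code): if a block shares a handful of state words and every warp reads and advances them, the result depends on warp scheduling unless the accesses are fenced with `sync_threads()`; and once they are fenced, lanes in different warps see identical state and thus produce identical numbers, which is the second problem mentioned in the edit above.

```julia
using CUDA

# Hypothetical kernel: 32 words of block-shared RNG state, one per warp lane.
function advance_shared_state!(out, seed::UInt32)
    state = @cuStaticSharedMem(UInt32, 32)   # `CuStaticSharedArray` in newer CUDA.jl
    tid  = threadIdx().x
    lane = (tid - 1) % 32 + 1
    if tid <= 32
        @inbounds state[lane] = seed ⊻ (lane % UInt32)   # initialize each word once
    end
    sync_threads()     # make initialization visible to all warps
    @inbounds x = UInt32(1664525) * state[lane] + UInt32(1013904223)  # advance (LCG step, for illustration)
    sync_threads()     # make sure every warp has read the old state before it is overwritten
    if tid <= 32
        @inbounds state[lane] = x   # write back once per lane
    end
    @inbounds out[tid] = x          # note: equal across warps, i.e. duplicate numbers
    return
end

out = CUDA.zeros(UInt32, 64)
@cuda threads=64 advance_shared_state!(out, UInt32(42))
```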
OK, fixed that last one. I'm not sure if what I'm doing is acceptable though (from a RNG quality/robustness point of view):
The output looks OK. Maybe @rfourquet could chime in? The core logic is here: lines 167 to 207 in 17580e7.
|
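For readers without the permalink at hand, the generator is presumably along the lines of the combined ("hybrid") Tausworthe from GPU Gems 3, chapter 37, which the linked NVIDIA forum post builds on. A minimal CPU-side sketch follows; the actual device code at the permalinked lines may differ in parameters and in how the state is stored and shared.

```julia
# One Tausworthe component step; S1, S2, S3 and the mask M are per-component
# parameters taken from GPU Gems 3, chapter 37.
taus_step(z::UInt32, S1, S2, S3, M::UInt32) =
    ((z & M) << S3) ⊻ (((z << S1) ⊻ z) >> S2)

# Linear congruential step used as the fourth component.
lcg_step(z::UInt32) = UInt32(1664525) * z + UInt32(1013904223)

# Advance a 4-word state and produce a Float32 in the unit interval.
function hybrid_taus(state::NTuple{4,UInt32})
    z1 = taus_step(state[1], 13, 19, 12, 0xfffffffe)
    z2 = taus_step(state[2],  2, 25,  4, 0xfffffff8)
    z3 = taus_step(state[3],  3, 11, 17, 0xfffffff0)
    z4 = lcg_step(state[4])
    u  = Float32(z1 ⊻ z2 ⊻ z3 ⊻ z4) * 2.3283064f-10   # scale by ≈ 2^-32, as in GPU Gems 3
    return u, (z1, z2, z3, z4)
end

u, st = hybrid_taus((0x0000c0de, 0x12345678, 0xdeadbeef, 0x0badf00d))
```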
Interestingly, this RNG can be faster than Base's MersenneTwister, even including time to allocate and copy temporary buffers!
|
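A rough harness for this kind of comparison (a sketch, not the exact benchmark above): generate on the device, copy the buffer back to the host, and time the whole thing against MersenneTwister filling a host array. Note that which generator backs `CUDA.rand` (CURAND, GPUArrays, or the native one from this PR) depends on the CUDA.jl version and element type.

```julia
using CUDA, BenchmarkTools, Random

n = 2^24
host = Vector{Float32}(undef, n)

# Baseline: Base's MersenneTwister filling a host array in place.
@btime rand!($host)

# GPU path, including allocation of the device buffer and the copy back to
# the host (`Array` synchronizes before returning).
@btime Array(CUDA.rand(Float32, $n))
```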
What is the duplicate test you are using? |
|
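For illustration, one straightforward duplicate count over a sample vector (not necessarily the exact test used here) is the difference between the sample size and the number of distinct values:

```julia
x = rand(Float32, 2^20)                  # or Array(CUDA.rand(Float32, 2^20)) for the GPU generator
ndups = length(x) - length(unique(x))    # samples that collide with an earlier one
```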
Doesn't pass SmallCrush, so that doesn't look good:
|
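For reference, TestU01's SmallCrush battery can be driven from Julia through the RNGTest.jl wrapper; shown here against a CPU MersenneTwister for simplicity. Feeding the GPU generator would additionally require streaming device-generated numbers to the host, e.g. through a small buffering AbstractRNG shim, which is omitted from this sketch.

```julia
using RNGTest, Random

# Wrap an AbstractRNG so TestU01 can pull Float64 samples from it,
# then run the SmallCrush battery (prints a summary of the tests).
wrapped = RNGTest.wrap(MersenneTwister(1234), Float64)
RNGTest.smallcrushTestU01(wrapped)
```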
A quick calculation shows that for 2^20 samples of |
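The kind of estimate referred to here is the birthday approximation; the sketch below assumes `rand(Float32)` draws uniformly from the N = 2^24 multiples of 2f0^-24, which is the point that the later reference to JuliaLang/julia#40355 turns on.

```julia
n = 2^20                                     # number of samples
N = 2^24                                     # distinct values rand(Float32) can return (assumption above)
expected_duplicates = n * (n - 1) / (2 * N)  # birthday approximation, ≈ 3.3e4
```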
🤷 I just compared against Base's RNG, which produces similar results:
But that doesn't matter, as the crush failures are worrisome. |
Codecov Report
@@            Coverage Diff             @@
##           master     #788      +/-   ##
==========================================
+ Coverage   75.08%   78.00%   +2.91%
==========================================
  Files         120      120
  Lines        7266     7380     +114
==========================================
+ Hits         5456     5757     +301
+ Misses       1810     1623     -187
Continue to review full report at Codecov.
|
GPUArrays' RNG doesn't pass SmallCrush either, so let's merge this. Still, let's not enable the CUDA RNG before somebody who knows about random numbers has chimed in. So for now host-side rand is still using GPUArrays, although we do now have a device-side rand that's based on the (flawed) CUDA generator. I've opened an issue to track improvements: #803 |
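For context, a sketch of how the device-side rand mentioned above can be used from a kernel; treat the exact entry point as an assumption, since how the generator is selected has changed across CUDA.jl versions.

```julia
using CUDA

# Each thread draws its own Float32 using the in-kernel rand.
function fill_rand!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = rand(Float32)
    end
    return
end

a = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=4 fill_rand!(a)
```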
I think these are too many collisions by two to three orders of magnitude; see e.g. this article, which confirms the estimate I made above. So I should probably file an issue for Base as well. |
Actually the number of duplicates is as expected, as confirmed in issue #40355 here. There are actually just |
Continuation of #772, inspired by https://forums.developer.nvidia.com/t/random-numbers-inside-the-kernel/14222/10. This uses a combined Tausworthe generator whose global state lives in shared memory. Seeds are fed from the CPU RNG when the CuModule is loaded, and are stored in constant memory (alongside the other RNG parameters).
Performance is great:
Copying from #772:
So a 15x speed-up, getting close to CURAND, and no more restrictions with respect to launch sizes.
Still TODO: figure out whether we want to mix in the block ID, as identical threads will currently yield the same random numbers across blocks:
EDIT: fixed that; even though I don't really know what I'm doing, the results look superficially OK.
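One simple way to decorrelate blocks, sketched below under the assumption that each thread derives its initial state from a global seed (an illustration, not necessarily the PR's actual fix): fold the block and thread indices into the seed with an integer scramble before initializing the Tausworthe state.

```julia
# Hypothetical device-side helper (call from inside a kernel): mix block and
# thread IDs into the seed so that identical threads in different blocks
# start from different states.
function per_thread_seed(global_seed::UInt32)
    s = global_seed
    s ⊻= (blockIdx().x  % UInt32) * 0x9e3779b9   # golden-ratio (Weyl) scramble of the block ID
    s ⊻= (threadIdx().x % UInt32) * 0x85ebca6b   # and of the thread ID (MurmurHash3 constant)
    return s
end
```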
cc @simsurace
cc @S-D-R: this will now benefit a lot from #552, in case you'd want to incorporate that in your work.