
Speed-up rand: Tausworthe RNG with shared random state. #788

Merged · 5 commits · Apr 2, 2021

Conversation

maleadt (Member) commented Mar 25, 2021

Continuation of #772, inspired by https://forums.developer.nvidia.com/t/random-numbers-inside-the-kernel/14222/10. This uses a combined Tausworthe generator that keeps its global state in shared memory. Seeds are fed from the CPU RNG when the CuModule is loaded, and are stored in constant memory (alongside the other RNG parameters).
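
For context, the per-component step of such a generator looks roughly like this (a sketch following the GPU Gems 3 recipe the linked forum post builds on; the exact shift/mask constants and the way state is combined in this PR may differ):

# One component of a combined Tausworthe generator (GPU Gems 3 style).
# S1, S2, S3 are shift amounts and M is a mask; each component uses its own constants.
@inline function taus_step(z::UInt32, S1::Integer, S2::Integer, S3::Integer, M::UInt32)
    b = ((z << S1) ⊻ z) >> S2
    return ((z & M) << S3) ⊻ b
end

# A combined generator XORs several independently stepped components, e.g.:
#   u = taus_step(z1, 13, 19, 12, 0xfffffffe) ⊻
#       taus_step(z2,  2, 25,  4, 0xfffffff8) ⊻
#       taus_step(z3,  3, 11, 17, 0xfffffff0)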

Performance is great:

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  5
  --------------
  minimum time:     5.408 μs (0.00% GC)
  median time:      5.717 μs (0.00% GC)
  mean time:        5.721 μs (0.00% GC)
  maximum time:     11.573 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  5
  --------------
  minimum time:     25.503 μs (0.00% GC)
  median time:      26.029 μs (0.00% GC)
  mean time:        26.179 μs (0.00% GC)
  maximum time:     45.937 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  5
  --------------
  minimum time:     24.563 ms (0.00% GC)
  median time:      24.968 ms (0.00% GC)
  mean time:        25.033 ms (0.00% GC)
  maximum time:     25.364 ms (0.00% GC)
  --------------
  samples:          194
  evals/sample:     1

Copying from #772:

CURAND:

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.539 μs (0.00% GC)
  median time:      4.694 μs (0.00% GC)
  mean time:        4.700 μs (0.00% GC)
  maximum time:     9.496 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     15.074 μs (0.00% GC)
  median time:      15.798 μs (0.00% GC)
  mean time:        16.095 μs (0.00% GC)
  maximum time:     35.990 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  6
  --------------
  minimum time:     19.067 ms (0.00% GC)
  median time:      104.840 ms (0.33% GC)
  mean time:        99.918 ms (0.33% GC)
  maximum time:     105.639 ms (0.32% GC)
  --------------
  samples:          51
  evals/sample:     1

CUDA.jl:

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     8.752 μs (0.00% GC)
  median time:      8.967 μs (0.00% GC)
  mean time:        8.990 μs (0.00% GC)
  maximum time:     31.119 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     433.062 μs (0.00% GC)
  median time:      452.577 μs (0.00% GC)
  mean time:        452.522 μs (0.00% GC)
  maximum time:     568.405 μs (0.00% GC)
  --------------
  samples:          9469
  evals/sample:     1

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
ERROR: Out of GPU memory trying to allocate 4.000 GiB

julia> sum(sizeof, map(kernel->kernel.random_state, values(CUDA.cufunction_cache[device()]))) |> Base.format_bytes
"4.004 GiB"

So a 15x speed-up, getting close to CURAND, and no more restrictions with respect to launch sizes.

Still TODO: figure out whether we want to mix in the block ID, as threads with the same index currently yield the same random numbers across blocks:

function kernel()
    @cuprintln("thread $(threadIdx().x): $(rand())")
    return
end

@cuda threads=2 blocks=2 kernel()
thread 1: 0.605515
thread 2: 0.707695
thread 1: 0.605515
thread 2: 0.707695

EDIT: fixed that; even though I don't really know what I'm doing, the results look superficially OK:

julia> quantile(vec(Array(a)), [0.0, 0.25, 0.5, 0.75, 1.0])
5-element Vector{Float64}:
 2.384185791015625e-7
 0.24934574961662292
 0.49834924936294556
 0.7480988800525665
 0.9999868869781494

cc @simsurace
cc @S-D-R: this will now benefit a lot from #552, in case you'd want to incorporate that in your work.

maleadt added the "cuda kernels" and "performance" labels on Mar 25, 2021
simsurace commented Mar 26, 2021

Using this in my binomial kernel improves its speed quite a bit (30-40%), see this comment, moving it closer to the speeds I'm getting in Julia 1.5.4. There is no way to use these new RNGs in Julia 1.5 to cross-check, is there?

maleadt (Member, Author) commented Mar 26, 2021

There is no way to use these new RNGs in Julia 1.5 to cross-check, is there?

No.

I expected greater improvements though, as you can see in my results, but you're probably not generating many random numbers (or the overhead is in other operations you're performing).

simsurace:

I expected greater improvements though, as you can see in my results, but you're probably not generating many random numbers (or the overhead is in other operations you're performing).

BTRS generates 2-3 uniforms on average, and yes, the other operations are not negligible. That's why I found that up to count=17 it's still better to use the naive algorithm, which needs count uniform RVs. But a 30-40% improvement is not bad at all.
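
For reference, a minimal sketch of that naive algorithm (illustrative only; naive_binomial is not a function from the package): sum count Bernoulli(p) trials, consuming one uniform per trial.

function naive_binomial(count::Integer, p::Real)
    k = 0
    for _ in 1:count
        k += rand(Float32) < p   # one uniform random number per trial
    end
    return k
end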

maleadt (Member, Author) commented Mar 29, 2021

CI failure is JuliaLang/julia#40252.

maleadt (Member, Author) commented Mar 30, 2021

Should be good to go, but will require Julia 1.6.1, so let's only merge when at least the required change has landed on the release-1.6 branch (JuliaLang/julia#39160).

maleadt (Member, Author) commented Mar 31, 2021

Did some more optimization, and added a host object with a specialized rand! kernel, primarily for reproducibility (with seeds), but which also improves performance:
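
A usage sketch of that host object (assuming it can be seeded through the usual Random.seed! interface, which is what makes the streams reproducible; illustrative only):

using CUDA, Random

rng = CUDA.RNG()
Random.seed!(rng, 1234)                  # assumed seeding API for reproducible streams
A = CuArray{Float32}(undef, 1024, 1024)
rand!(rng, A)                            # fills A on the GPU via the specialized kernel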

julia> A = CuArray{Float32}(undef, 1024, 1024);

julia> @benchmark CUDA.@sync broadcast!(()->rand(Float32), A)
BenchmarkTools.Trial: 
  memory estimate:  1.08 KiB
  allocs estimate:  63
  --------------
  minimum time:     28.167 μs (0.00% GC)
  median time:      28.909 μs (0.00% GC)
  mean time:        28.998 μs (0.00% GC)
  maximum time:     57.838 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> rng = CUDA.RNG()
CUDA.RNG(UInt32[0x06266d22, 0x432c1fa9, 0x3501a365, 0x61a123c6, 0x88157d8c, 0x4e353607, 0x4259a90d, 0x4becba27, 0xeadb7dc9, 0x8d844e08  …  0x970f1b5f, 0xe7bb1ddb, 0x57430774, 0xd2647e1f, 0x9f30ff12, 0x1dc7a05b, 0x4ab8d5bb, 0x06041e3e, 0x88190e4a, 0x852676c1])

julia> @benchmark CUDA.@sync rand!($rng, $A)
BenchmarkTools.Trial: 
  memory estimate:  1.98 KiB
  allocs estimate:  67
  --------------
  minimum time:     22.624 μs (0.00% GC)
  median time:      23.305 μs (0.00% GC)
  mean time:        23.322 μs (0.00% GC)
  maximum time:     63.698 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

For reference:

julia> @benchmark CUDA.@sync rand!($(CUDA.gpuarrays_rng()), $A)
BenchmarkTools.Trial: 
  memory estimate:  3.77 KiB
  allocs estimate:  234
  --------------
  minimum time:     77.461 μs (0.00% GC)
  median time:      79.996 μs (0.00% GC)
  mean time:        80.277 μs (0.27% GC)
  maximum time:     2.237 ms (95.91% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync rand!($(CUDA.curand_rng()), $A)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  8
  --------------
  minimum time:     14.898 μs (0.00% GC)
  median time:      15.620 μs (0.00% GC)
  mean time:        15.777 μs (0.00% GC)
  maximum time:     40.332 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

maleadt (Member, Author) commented Apr 1, 2021

Well, that's an interesting (probably unrelated) segfault:

      From worker 6:    signal (11): Segmentation fault
      From worker 6:    in expression starting at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/codegen.jl:119
      From worker 6:    fl_isstring at /buildworker/worker/package_linux64/build/src/flisp/cvalues.c:213
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:520
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia at /buildworker/worker/package_linux64/build/src/ast.c:466
      From worker 6:    jl_expand_with_loc_warn at /buildworker/worker/package_linux64/build/src/ast.c:1171
      From worker 6:    jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:662
      From worker 6:    jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:825
      From worker 6:    jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:929
      From worker 6:    eval at ./boot.jl:360 [inlined]
      From worker 6:    include_string at ./loading.jl:1094
      From worker 6:    _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
      From worker 6:    _include at ./loading.jl:1148
      From worker 6:    include at ./client.jl:444 [inlined]
      From worker 6:    #9 at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/runtests.jl:79 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:57 [inlined]
      From worker 6:    macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:57 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/src/utilities.jl:28 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/src/pool.jl:565 [inlined]
      From worker 6:    top-level scope at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:56
      From worker 6:    jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:871
      From worker 6:    jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:929
      From worker 6:    eval at ./boot.jl:360 [inlined]
      From worker 6:    runtests at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:68
      From worker 6:    _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
      From worker 6:    jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
      From worker 6:    do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:670
      From worker 6:    #106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278
      From worker 6:    run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:63
      From worker 6:    macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278 [inlined]
      From worker 6:    #105 at ./task.jl:406
      From worker 6:    unknown function (ip: 0x7fadf16f570c)
      From worker 6:    _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
      From worker 6:    jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
      From worker 6:    start_task at /buildworker/worker/package_linux64/build/src/task.c:839
      From worker 6:    unknown function (ip: (nil))
      From worker 6:    Allocations: 71560588 (Pool: 71528025; Big: 32563); GC: 83
Worker 6 terminated.
codegen                              (6) |         failed at 2021-04-01T09:46:18.927

maleadt (Member, Author) commented Apr 1, 2021

Hmm, I just realized the current design is probably not deterministic, and depends on how warps are scheduled: the state is 32 bytes stored in shared memory, updated cooperatively by the threads in a warp. So depending on how warps execute, you might get a different state to work with. I don't think we want that.

EDIT: actually, now with the additional syncs that isn't true anymore, and we just don't generate unique numbers anymore...

thread 33: 0.595539
thread 34: 0.309210
thread 35: 0.302712
...
thread 1: 0.595539
thread 2: 0.309210
thread 3: 0.302712

maleadt (Member, Author) commented Apr 1, 2021

OK, fixed that last one. I'm not sure if what I'm doing is acceptable though (from an RNG quality/robustness point of view):

  • 32 bytes of seed, set during compilation of the kernel
  • 32 bytes of random state, per block, initialized when the first random number is generated (derived from the 32 bytes of seed, mixing in the block identifier using xorshift)
  • generation, per warp (groups of 32 threads in a block): read the block state, mix in the warp identifier (again using xorshift), and use that + 3 bytes of data from other threads to generate output (a sketch of the xorshift mixing follows this list)
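
As a rough illustration of that xorshift mixing (a minimal sketch; the constants actually used in the PR may differ):

# Marsaglia's 32-bit xorshift, used purely to decorrelate an identifier from the state.
@inline function xorshift(x::UInt32)
    x ⊻= x << 13
    x ⊻= x >> 17
    x ⊻= x << 5
    return x
end

# e.g. deriving a per-block state from the seed by mixing in the block index:
#   state = xorshift(seed ⊻ UInt32(blockIdx().x))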

The output looks OK, as judged by quantile, or by looking for duplicates (rand(Float32, 1024, 1024) generates about 5% duplicates, which is similar to Base's RNG; with Float64 it's 0.01%). I don't suppose this would pass BigCrush though, and it's annoying to set up and use (the Julia package assumes a scalar rand(), for one).

Maybe @rfourquet could chime in? The core logic is here:

function Random.rand(rng::SharedTauswortheGenerator, ::Type{UInt32})
    @inline pow2_mod1(x, y) = (x-1)&(y-1) + 1
    threadId = UInt32(threadIdx().x + (threadIdx().y - 1) * blockDim().x +
                      (threadIdx().z - 1) * blockDim().x * blockDim().y)
    warpId = (threadId-UInt32(1)) >> 5 + UInt32(1) # fld1
    i = pow2_mod1(threadId, 32)
    j = pow2_mod1(threadId, 4)
    @inbounds begin
        # get state
        z = rng.state[i]
        if z == 0
            z = initial_state(rng.seed)
        end
        # mix-in the warp id to ensure unique values across blocks.
        # we have max 1024 threads per block, so can safely shift by 16 bits.
        # XXX: see comment in `initial_state`
        z = xorshift(z ⊻ (warpId << 16))
        sync_threads()
        # advance & update state
        S1, S2, S3, M = TausShift1()[j], TausShift2()[j], TausShift3()[j], TausOffset()[j]
        state = TausStep(z, S1, S2, S3, M)
        if warpId == 1
            rng.state[i] = state
        end
        sync_threads()
        # generate
        # TODO: use shuffle to get the state from threads in this warp, because now we're
        #       re-using 3 states (that don't have the warp ID mixed in) across the block.
        #       that's tricky though, because it requires threads to be available.
        state ⊻ rng.state[pow2_mod1(threadId+1, 32)] ⊻
                rng.state[pow2_mod1(threadId+2, 32)] ⊻
                rng.state[pow2_mod1(threadId+3, 32)]
    end
end

maleadt (Member, Author) commented Apr 1, 2021

Interestingly, this RNG can be faster than Base's MersenneTwister, even including time to allocate and copy temporary buffers!
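
What rand!(rng, B) does for a host Array presumably amounts to something like the following (a hedged sketch, not the actual implementation; host_rand! is a made-up name): generate into a temporary device buffer, then copy back.

using CUDA, Random

function host_rand!(rng::CUDA.RNG, B::Array{Float32})
    tmp = CuArray{Float32}(undef, size(B))  # temporary device buffer
    rand!(rng, tmp)                          # fast device-side generation
    copyto!(B, tmp)                          # device-to-host copy
    return B
end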

julia> rng = CUDA.RNG();

julia> A = CuArray{Float64}(undef, 1024, 1024);

julia> @benchmark CUDA.@sync rand!($rng, $A)
BenchmarkTools.Trial: 
  memory estimate:  1.92 KiB
  allocs estimate:  63
  --------------
  minimum time:     28.651 μs (0.00% GC)
  median time:      29.998 μs (0.00% GC)
  mean time:        30.347 μs (0.00% GC)
  maximum time:     65.613 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> B = rand(Float32, 1024, 1024);

julia> @benchmark CUDA.@sync rand!($rng, $B)
BenchmarkTools.Trial: 
  memory estimate:  896 bytes
  allocs estimate:  45
  --------------
  minimum time:     477.436 μs (0.00% GC)
  median time:      510.928 μs (0.00% GC)
  mean time:        576.550 μs (0.13% GC)
  maximum time:     9.654 ms (35.86% GC)
  --------------
  samples:          8662
  evals/sample:     1

julia> @benchmark CUDA.@sync rand!($B)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     635.996 μs (0.00% GC)
  median time:      661.923 μs (0.00% GC)
  mean time:        799.737 μs (0.00% GC)
  maximum time:     11.801 ms (0.00% GC)
  --------------
  samples:          6248
  evals/sample:     1

Quite some variability though. (that was me recompiling for every invocation 🤦)

simsurace:

The output looks OK, as judged by quantile, or by looking for duplicates (rand(Float32, 1024, 1024) generates about 5% duplicates, which is similar to Base's RNG; with Float64 it's 0.01%). I don't suppose this would pass BigCrush though, and it's annoying to set up and use (the Julia package assumes a scalar rand(), for one).

What is the duplicate test you are using?

maleadt (Member, Author) commented Apr 1, 2021

What is the duplicate test you are using?

function nonunique(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    for i in x
        if(i in uniqueset)
            push!(duplicatedset, i)
        else
            push!(uniqueset, i)
        end
    end
    collect(duplicatedset)
end

@show 100 * length(nonunique(vec(A))) / length(A)

maleadt (Member, Author) commented Apr 1, 2021

Doesn't pass SmallCrush, so that doesn't look good:

julia> using CUDA, RNGTest

julia> rng = RNGTest.wrap(CUDA.RNG(), UInt32);

julia> RNGTest.smallcrushTestU01(rng)

========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        
 Number of statistics:  15
 Total CPU time:   00:00:06.60
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps  
  2  Collision                        eps  
  3  Gap                              eps  
  4  SimpPoker                        eps  
  5  CouponCollector                  eps  
  6  MaxOft                          7.2e-5
  6  MaxOft AD                      1 - 1.6e-12
  7  WeightDistrib                    eps  
  8  MatrixRank                       eps  
  9  HammingIndep                     eps  
 10  RandomWalk1 H                  2.6e-10
 10  RandomWalk1 M                  5.6e-11
 ----------------------------------------------
 All other tests were passed

simsurace:

function nonunique(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    for i in x
        if(i in uniqueset)
            push!(duplicatedset, i)
        else
            push!(uniqueset, i)
        end
    end
    collect(duplicatedset)
end

@show 100 * length(nonunique(vec(A))) / length(A)

A quick calculation shows that for 2^20 samples of Float32s there should be an estimated 2^32*(1-(1-1/2^32)^(2^20)) = 1048448 unique values, which means that just 128 should be duplicates. This is much less than 5%, and becomes even less if you don't count duplicates of duplicates (which nonunique above doesn't do). Or am I making an error here?
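
Spelling that estimate out (illustrative only):

n = 2^20                                   # number of samples drawn
N = 2^32                                   # assumed number of distinct values
expected_unique = N * (1 - (1 - 1/N)^n)    # ≈ 1_048_448
expected_duplicates = n - expected_unique  # ≈ 128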

maleadt (Member, Author) commented Apr 2, 2021

🤷 I just compared against Base's RNG, which produces similar results:

julia> A = rand(Float32, 1024, 1024);

julia> Adups = nonunique(A);

julia> @show 100 * length(nonunique(vec(A))) / length(A)
(100 * length(nonunique(vec(A)))) / length(A) = 5.735969543457031
5.735969543457031

But that doesn't matter, as the crush failures are worrisome.

maleadt changed the base branch from master to tb/improvements on April 2, 2021, 08:34
codecov bot commented Apr 2, 2021

Codecov Report

Merging #788 (51b07cd) into master (4721f60) will increase coverage by 2.91%.
The diff coverage is 12.24%.

❗ Current head 51b07cd differs from pull request most recent head a3792e6. Consider uploading reports for the commit a3792e6 to get more accurate results

@@            Coverage Diff             @@
##           master     #788      +/-   ##
==========================================
+ Coverage   75.08%   78.00%   +2.91%     
==========================================
  Files         120      120              
  Lines        7266     7380     +114     
==========================================
+ Hits         5456     5757     +301     
+ Misses       1810     1623     -187     
Impacted Files Coverage Δ
src/CUDA.jl 100.00% <ø> (ø)
src/random.jl 34.66% <0.00%> (-46.59%) ⬇️
src/compiler/execution.jl 91.24% <100.00%> (-0.13%) ⬇️
examples/vadd.jl 25.00% <0.00%> (-75.00%) ⬇️
examples/peakflops.jl 68.57% <0.00%> (-31.43%) ⬇️
examples/pairwise.jl 58.20% <0.00%> (-19.80%) ⬇️
examples/hello_world.jl 16.66% <0.00%> (-8.34%) ⬇️
src/sorting.jl 23.17% <0.00%> (-5.62%) ⬇️
lib/cupti/error.jl 29.41% <0.00%> (-1.84%) ⬇️
lib/cublas/wrappers.jl 90.50% <0.00%> (+0.12%) ⬆️
... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Base automatically changed from tb/improvements to master on April 2, 2021, 10:01
maleadt (Member, Author) commented Apr 2, 2021

GPUArrays' RNG doesn't pass SmallCrush either, so let's merge this. Still, let's not enable the CUDA RNG before somebody who knows about random numbers has chimed in. So for now host-side rand is still using GPUArrays, although we do now have a device-side rand that's based on the (flawed) CUDA generator. I've opened an issue to track improvements: #803
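
For completeness, the device-side rand mentioned above can be used directly inside kernels, roughly like this (sketch):

using CUDA

function fill_rand!(A)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(A)
        @inbounds A[i] = rand(Float32)   # device-side rand, backed by the new generator
    end
    return
end

A = CuArray{Float32}(undef, 1024)
@cuda threads=256 blocks=4 fill_rand!(A)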

maleadt merged commit 03b4c39 into master on Apr 2, 2021
maleadt deleted the tb/speedup_rand branch on April 2, 2021, 11:43
simsurace:

🤷 I just compared against Base's RNG, which produces similar results:

julia> A = rand(Float32, 1024, 1024);

julia> Adups = nonunique(A);

julia> @show 100 * length(nonunique(vec(A))) / length(A)
(100 * length(nonunique(vec(A)))) / length(A) = 5.735969543457031
5.735969543457031

But that doesn't matter, as the crush failures are worrisome.

I think these are too many collisions by two to three orders of magnitude; see e.g. this article, which confirms the estimate I made above. So I should probably file an issue for Base as well.

simsurace:

Actually, the number of duplicates is as expected, as confirmed in issue #40355: there are just 2^23, not 2^32, elements to choose from, so the formula above gives an answer in line with the number of duplicates seen both in Base's rand and in this RNG.
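
Redoing the same estimate with N = 2^23 possible values indeed lands in the right ballpark (sketch):

n, N = 2^20, 2^23
expected_unique = N * (1 - (1 - 1/N)^n)
100 * (n - expected_unique) / n            # ≈ 6%, in line with the ~5.7% observed above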

Labels: cuda kernels, performance
3 participants