
Speed-up rand: Tausworthe RNG with shared random state. #788

Merged · 5 commits · Apr 2, 2021

Conversation

maleadt (Member) commented Mar 25, 2021

Continuation of #772, inspired by https://forums.developer.nvidia.com/t/random-numbers-inside-the-kernel/14222/10. This uses a combined Tausworthe generator that keeps its global state in shared memory. Seeds are fed from the CPU RNG when the CuModule is loaded, and are stored in constant memory (alongside the other RNG parameters).
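
For context, the per-component step of such a generator looks roughly like this (a sketch following the GPU Gems 3 recipe the linked forum post builds on; the exact shift/mask constants and the way state is combined in this PR may differ):

# One component of a combined Tausworthe generator (GPU Gems 3 style).
# S1, S2, S3 are shift amounts and M is a mask; each component uses its own constants.
@inline function taus_step(z::UInt32, S1::Integer, S2::Integer, S3::Integer, M::UInt32)
    b = ((z << S1) ⊻ z) >> S2
    return ((z & M) << S3) ⊻ b
end

# A combined generator XORs several independently stepped components, e.g.:
#   u = taus_step(z1, 13, 19, 12, 0xfffffffe) ⊻
#       taus_step(z2,  2, 25,  4, 0xfffffff8) ⊻
#       taus_step(z3,  3, 11, 17, 0xfffffff0)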

Performance is great:

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  5
  --------------
  minimum time:     5.408 μs (0.00% GC)
  median time:      5.717 μs (0.00% GC)
  mean time:        5.721 μs (0.00% GC)
  maximum time:     11.573 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  5
  --------------
  minimum time:     25.503 μs (0.00% GC)
  median time:      26.029 μs (0.00% GC)
  mean time:        26.179 μs (0.00% GC)
  maximum time:     45.937 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  5
  --------------
  minimum time:     24.563 ms (0.00% GC)
  median time:      24.968 ms (0.00% GC)
  mean time:        25.033 ms (0.00% GC)
  maximum time:     25.364 ms (0.00% GC)
  --------------
  samples:          194
  evals/sample:     1

Copying from #772:

CURAND:

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.539 μs (0.00% GC)
  median time:      4.694 μs (0.00% GC)
  mean time:        4.700 μs (0.00% GC)
  maximum time:     9.496 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     15.074 μs (0.00% GC)
  median time:      15.798 μs (0.00% GC)
  mean time:        16.095 μs (0.00% GC)
  maximum time:     35.990 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  6
  --------------
  minimum time:     19.067 ms (0.00% GC)
  median time:      104.840 ms (0.33% GC)
  mean time:        99.918 ms (0.33% GC)
  maximum time:     105.639 ms (0.32% GC)
  --------------
  samples:          51
  evals/sample:     1

CUDA.jl:

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     8.752 μs (0.00% GC)
  median time:      8.967 μs (0.00% GC)
  mean time:        8.990 μs (0.00% GC)
  maximum time:     31.119 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     433.062 μs (0.00% GC)
  median time:      452.577 μs (0.00% GC)
  mean time:        452.522 μs (0.00% GC)
  maximum time:     568.405 μs (0.00% GC)
  --------------
  samples:          9469
  evals/sample:     1

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
ERROR: Out of GPU memory trying to allocate 4.000 GiB

julia> sum(sizeof, map(kernel->kernel.random_state, values(CUDA.cufunction_cache[device()]))) |> Base.format_bytes
"4.004 GiB"

So a 15x speed-up, getting close to CURAND, and no more restrictions with respect to launch sizes.

Still TODO: figure out whether we want to mix in the block ID, as threads with the same index currently yield the same random numbers across blocks:

function kernel()
    @cuprintln("thread $(threadIdx().x): $(rand())")
    return
end

@cuda threads=2 blocks=2 kernel()
thread 1: 0.605515
thread 2: 0.707695
thread 1: 0.605515
thread 2: 0.707695

EDIT: fixed that; even though I don't really know what I'm doing, the results look superficially OK:

julia> quantile(vec(Array(a)), [0.0, 0.25, 0.5, 0.75, 1.0])
5-element Vector{Float64}:
 2.384185791015625e-7
 0.24934574961662292
 0.49834924936294556
 0.7480988800525665
 0.9999868869781494

cc @simsurace
cc @S-D-R: this will now benefit a lot from #552, in case you'd want to incorporate that in your work.

maleadt added the "cuda kernels" and "performance" labels on Mar 25, 2021
simsurace commented Mar 26, 2021

Using this in my binomial kernel improves its speed quite a bit (30-40%), see this comment, moving it closer to the speeds I'm getting in Julia 1.5.4. There is no way to use these new RNGs in Julia 1.5 to cross-check, is there?

maleadt (Member, Author) commented Mar 26, 2021

There is no way to use these new RNGs in Julia 1.5 to cross-check, is there?

No.

I expected greater improvements though, as you can see in my results, but you're probably not generating many random numbers (or the overhead is in other operations you're performing).

simsurace:

I expected greater improvements though, as you can see in my results, but you're probably not generating many random numbers (or the overhead is in other operations you're performing).

BTRS generates 2-3 uniforms on average, and yes, the other operations are not negligible. That's why I found that up to count=17 it's still better to use the naive algorithm, which needs count uniform RVs. But a 30-40% improvement is not bad at all.
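
For reference, a minimal sketch of that naive algorithm (illustrative only; naive_binomial is not a function from the package): sum count Bernoulli(p) trials, consuming one uniform per trial.

function naive_binomial(count::Integer, p::Real)
    k = 0
    for _ in 1:count
        k += rand(Float32) < p   # one uniform random number per trial
    end
    return k
end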

maleadt (Member, Author) commented Mar 29, 2021

CI failure is JuliaLang/julia#40252.

maleadt (Member, Author) commented Mar 30, 2021

Should be good to go, but will require Julia 1.6.1, so let's only merge when at least the required change has landed on the release-1.6 branch (JuliaLang/julia#39160).

maleadt (Member, Author) commented Mar 31, 2021

Did some more optimization, and added a host object with a specialized rand! kernel, primarily for reproducibility (with seeds), but which also improves performance:
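
A usage sketch of that host object (assuming it can be seeded through the usual Random.seed! interface, which is what makes the streams reproducible; illustrative only):

using CUDA, Random

rng = CUDA.RNG()
Random.seed!(rng, 1234)                  # assumed seeding API for reproducible streams
A = CuArray{Float32}(undef, 1024, 1024)
rand!(rng, A)                            # fills A on the GPU via the specialized kernel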

julia> A = CuArray{Float32}(undef, 1024, 1024);

julia> @benchmark CUDA.@sync broadcast!(()->rand(Float32), A)
BenchmarkTools.Trial: 
  memory estimate:  1.08 KiB
  allocs estimate:  63
  --------------
  minimum time:     28.167 μs (0.00% GC)
  median time:      28.909 μs (0.00% GC)
  mean time:        28.998 μs (0.00% GC)
  maximum time:     57.838 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> rng = CUDA.RNG()
CUDA.RNG(UInt32[0x06266d22, 0x432c1fa9, 0x3501a365, 0x61a123c6, 0x88157d8c, 0x4e353607, 0x4259a90d, 0x4becba27, 0xeadb7dc9, 0x8d844e08  …  0x970f1b5f, 0xe7bb1ddb, 0x57430774, 0xd2647e1f, 0x9f30ff12, 0x1dc7a05b, 0x4ab8d5bb, 0x06041e3e, 0x88190e4a, 0x852676c1])

julia> @benchmark CUDA.@sync rand!($rng, $A)
BenchmarkTools.Trial: 
  memory estimate:  1.98 KiB
  allocs estimate:  67
  --------------
  minimum time:     22.624 μs (0.00% GC)
  median time:      23.305 μs (0.00% GC)
  mean time:        23.322 μs (0.00% GC)
  maximum time:     63.698 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

For reference:

julia> @benchmark CUDA.@sync rand!($(CUDA.gpuarrays_rng()), $A)
BenchmarkTools.Trial: 
  memory estimate:  3.77 KiB
  allocs estimate:  234
  --------------
  minimum time:     77.461 μs (0.00% GC)
  median time:      79.996 μs (0.00% GC)
  mean time:        80.277 μs (0.27% GC)
  maximum time:     2.237 ms (95.91% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync rand!($(CUDA.curand_rng()), $A)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  8
  --------------
  minimum time:     14.898 μs (0.00% GC)
  median time:      15.620 μs (0.00% GC)
  mean time:        15.777 μs (0.00% GC)
  maximum time:     40.332 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

maleadt (Member, Author) commented Apr 1, 2021

Well, that's an interesting (probably unrelated) segfault:

      From worker 6:    signal (11): Segmentation fault
      From worker 6:    in expression starting at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/codegen.jl:119
      From worker 6:    fl_isstring at /buildworker/worker/package_linux64/build/src/flisp/cvalues.c:213
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:520
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia_ at /buildworker/worker/package_linux64/build/src/ast.c:626
      From worker 6:    scm_to_julia at /buildworker/worker/package_linux64/build/src/ast.c:466
      From worker 6:    jl_expand_with_loc_warn at /buildworker/worker/package_linux64/build/src/ast.c:1171
      From worker 6:    jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:662
      From worker 6:    jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:825
      From worker 6:    jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:929
      From worker 6:    eval at ./boot.jl:360 [inlined]
      From worker 6:    include_string at ./loading.jl:1094
      From worker 6:    _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
      From worker 6:    _include at ./loading.jl:1148
      From worker 6:    include at ./client.jl:444 [inlined]
      From worker 6:    #9 at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/runtests.jl:79 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:57 [inlined]
      From worker 6:    macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:57 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/src/utilities.jl:28 [inlined]
      From worker 6:    macro expansion at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/src/pool.jl:565 [inlined]
      From worker 6:    top-level scope at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:56
      From worker 6:    jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:871
      From worker 6:    jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:929
      From worker 6:    eval at ./boot.jl:360 [inlined]
      From worker 6:    runtests at /var/lib/buildkite-agent/builds/p6000-gpuci3-julia-csail-mit-edu/julialang/cuda-dot-jl/test/setup.jl:68
      From worker 6:    _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
      From worker 6:    jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
      From worker 6:    do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:670
      From worker 6:    #106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278
      From worker 6:    run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:63
      From worker 6:    macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278 [inlined]
      From worker 6:    #105 at ./task.jl:406
      From worker 6:    unknown function (ip: 0x7fadf16f570c)
      From worker 6:    _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
      From worker 6:    jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
      From worker 6:    start_task at /buildworker/worker/package_linux64/build/src/task.c:839
      From worker 6:    unknown function (ip: (nil))
      From worker 6:    Allocations: 71560588 (Pool: 71528025; Big: 32563); GC: 83
Worker 6 terminated.
codegen                              (6) |         failed at 2021-04-01T09:46:18.927

maleadt (Member, Author) commented Apr 1, 2021

Hmm, I just realized the current design is probably not deterministic, and depends on how warps are scheduled: the state is 32 bytes stored in shared memory, updated cooperatively by the threads in a warp. So depending on how warps execute, you might get a different state to work with. I don't think we want that.

EDIT: actually, now with the additional syncs that isn't true anymore, and we just don't generate unique numbers anymore...

thread 33: 0.595539
thread 34: 0.309210
thread 35: 0.302712
...
thread 1: 0.595539
thread 2: 0.309210
thread 3: 0.302712

maleadt (Member, Author) commented Apr 1, 2021

OK, fixed that last one. I'm not sure if what I'm doing is acceptable though (from an RNG quality/robustness point of view):

  • 32 bytes of seed, set during compilation of the kernel
  • 32 bytes of random state, per block, initialized when the first random number is generated (derived from the 32 bytes of seed, mixing in the block identifier using xorshift)
  • generation, per warp (groups of 32 threads in a block): read the block state, mix in the warp identifier (again using xorshift), and use that + 3 bytes of data from other threads to generate output (a sketch of the xorshift mixing follows this list)
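
As a rough illustration of that xorshift mixing (a minimal sketch; the constants actually used in the PR may differ):

# Marsaglia's 32-bit xorshift, used purely to decorrelate an identifier from the state.
@inline function xorshift(x::UInt32)
    x ⊻= x << 13
    x ⊻= x >> 17
    x ⊻= x << 5
    return x
end

# e.g. deriving a per-block state from the seed by mixing in the block index:
#   state = xorshift(seed ⊻ UInt32(blockIdx().x))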

The output looks OK, as judged by quantile, or by looking for duplicates (rand(Float32, 1024, 1024) generates about 5% duplicates, which is similar to Base's RNG; with Float64 it's 0.01%). I don't suppose this would pass BigCrush though, and it's annoying to set up and use (the Julia package assumes a scalar rand(), for one).

Maybe @rfourquet could chime in? The core logic is here:

function Random.rand(rng::SharedTauswortheGenerator, ::Type{UInt32})
    @inline pow2_mod1(x, y) = (x-1)&(y-1) + 1
    threadId = UInt32(threadIdx().x + (threadIdx().y - 1) * blockDim().x +
                      (threadIdx().z - 1) * blockDim().x * blockDim().y)
    warpId = (threadId-UInt32(1)) >> 5 + UInt32(1) # fld1
    i = pow2_mod1(threadId, 32)
    j = pow2_mod1(threadId, 4)
    @inbounds begin
        # get state
        z = rng.state[i]
        if z == 0
            z = initial_state(rng.seed)
        end
        # mix-in the warp id to ensure unique values across blocks.
        # we have max 1024 threads per block, so can safely shift by 16 bits.
        # XXX: see comment in `initial_state`
        z = xorshift(z ⊻ (warpId << 16))
        sync_threads()
        # advance & update state
        S1, S2, S3, M = TausShift1()[j], TausShift2()[j], TausShift3()[j], TausOffset()[j]
        state = TausStep(z, S1, S2, S3, M)
        if warpId == 1
            rng.state[i] = state
        end
        sync_threads()
        # generate
        # TODO: use shuffle to get the state from threads in this warp, because now we're
        #       re-using 3 states (that don't have the warp ID mixed in) across the block.
        #       that's tricky though, because it requires threads to be available.
        state ⊻ rng.state[pow2_mod1(threadId+1, 32)] ⊻
                rng.state[pow2_mod1(threadId+2, 32)] ⊻
                rng.state[pow2_mod1(threadId+3, 32)]
    end
end

maleadt (Member, Author) commented Apr 1, 2021

Interestingly, this RNG can be faster than Base's MersenneTwister, even including time to allocate and copy temporary buffers!
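
What rand!(rng, B) does for a host Array presumably amounts to something like the following (a hedged sketch, not the actual implementation; host_rand! is a made-up name): generate into a temporary device buffer, then copy back.

using CUDA, Random

function host_rand!(rng::CUDA.RNG, B::Array{Float32})
    tmp = CuArray{Float32}(undef, size(B))  # temporary device buffer
    rand!(rng, tmp)                          # fast device-side generation
    copyto!(B, tmp)                          # device-to-host copy
    return B
end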

julia> rng = CUDA.RNG();

julia> A = CuArray{Float64}(undef, 1024, 1024);

julia> @benchmark CUDA.@sync rand!($rng, $A)
BenchmarkTools.Trial: 
  memory estimate:  1.92 KiB
  allocs estimate:  63
  --------------
  minimum time:     28.651 μs (0.00% GC)
  median time:      29.998 μs (0.00% GC)
  mean time:        30.347 μs (0.00% GC)
  maximum time:     65.613 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> B = rand(Float32, 1024, 1024);

julia> @benchmark CUDA.@sync rand!($rng, $B)
BenchmarkTools.Trial: 
  memory estimate:  896 bytes
  allocs estimate:  45
  --------------
  minimum time:     477.436 μs (0.00% GC)
  median time:      510.928 μs (0.00% GC)
  mean time:        576.550 μs (0.13% GC)
  maximum time:     9.654 ms (35.86% GC)
  --------------
  samples:          8662
  evals/sample:     1

julia> @benchmark CUDA.@sync rand!($B)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     635.996 μs (0.00% GC)
  median time:      661.923 μs (0.00% GC)
  mean time:        799.737 μs (0.00% GC)
  maximum time:     11.801 ms (0.00% GC)
  --------------
  samples:          6248
  evals/sample:     1

Quite some variability though. (that was me recompiling for every invocation 🤦)

simsurace:

The output looks OK, as judged by quantile, or by looking for duplicates (rand(Float32, 1024, 1024) generates about 5% duplicates, which is similar to Base's RNG; with Float64 it's 0.01%). I don't suppose this would pass BigCrush though, and it's annoying to set up and use (the Julia package assumes a scalar rand(), for one).

What is the duplicate test you are using?

maleadt (Member, Author) commented Apr 1, 2021

What is the duplicate test you are using?

function nonunique(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    for i in x
        if(i in uniqueset)
            push!(duplicatedset, i)
        else
            push!(uniqueset, i)
        end
    end
    collect(duplicatedset)
end

@show 100 * length(nonunique(vec(A))) / length(A)

maleadt (Member, Author) commented Apr 1, 2021

Doesn't pass SmallCrush, so that doesn't look good:

julia> using CUDA, RNGTest

julia> rng = RNGTest.wrap(CUDA.RNG(), UInt32);

julia> RNGTest.smallcrushTestU01(rng)

========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        
 Number of statistics:  15
 Total CPU time:   00:00:06.60
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps  
  2  Collision                        eps  
  3  Gap                              eps  
  4  SimpPoker                        eps  
  5  CouponCollector                  eps  
  6  MaxOft                          7.2e-5
  6  MaxOft AD                      1 - 1.6e-12
  7  WeightDistrib                    eps  
  8  MatrixRank                       eps  
  9  HammingIndep                     eps  
 10  RandomWalk1 H                  2.6e-10
 10  RandomWalk1 M                  5.6e-11
 ----------------------------------------------
 All other tests were passed

simsurace:

function nonunique(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    for i in x
        if(i in uniqueset)
            push!(duplicatedset, i)
        else
            push!(uniqueset, i)
        end
    end
    collect(duplicatedset)
end

@show 100 * length(nonunique(vec(A))) / length(A)

A quick calculation shows that for 2^20 samples of Float32s there should be an estimated 2^32*(1-(1-1/2^32)^(2^20)) = 1048448 unique values, which means that just 128 should be duplicates. This is much less than 5%, and becomes even less if you don't count duplicates of duplicates (which nonunique above doesn't do). Or am I making an error here?
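
Spelling that estimate out (illustrative only):

n = 2^20                                   # number of samples drawn
N = 2^32                                   # assumed number of distinct values
expected_unique = N * (1 - (1 - 1/N)^n)    # ≈ 1_048_448
expected_duplicates = n - expected_unique  # ≈ 128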

maleadt (Member, Author) commented Apr 2, 2021

🤷 I just compared against Base's RNG, which produces similar results:

julia> A = rand(Float32, 1024, 1024);

julia> Adups = nonunique(A);

julia> @show 100 * length(nonunique(vec(A))) / length(A)
(100 * length(nonunique(vec(A)))) / length(A) = 5.735969543457031
5.735969543457031

But that doesn't matter, as the crush failures are worrisome.

maleadt changed the base branch from master to tb/improvements on April 2, 2021, 08:34
codecov bot commented Apr 2, 2021

Codecov Report

Merging #788 (51b07cd) into master (4721f60) will increase coverage by 2.91%.
The diff coverage is 12.24%.

❗ Current head 51b07cd differs from pull request most recent head a3792e6. Consider uploading reports for the commit a3792e6 to get more accurate results

@@            Coverage Diff             @@
##           master     #788      +/-   ##
==========================================
+ Coverage   75.08%   78.00%   +2.91%     
==========================================
  Files         120      120              
  Lines        7266     7380     +114     
==========================================
+ Hits         5456     5757     +301     
+ Misses       1810     1623     -187     
Impacted Files Coverage Δ
src/CUDA.jl 100.00% <ø> (ø)
src/random.jl 34.66% <0.00%> (-46.59%) ⬇️
src/compiler/execution.jl 91.24% <100.00%> (-0.13%) ⬇️
examples/vadd.jl 25.00% <0.00%> (-75.00%) ⬇️
examples/peakflops.jl 68.57% <0.00%> (-31.43%) ⬇️
examples/pairwise.jl 58.20% <0.00%> (-19.80%) ⬇️
examples/hello_world.jl 16.66% <0.00%> (-8.34%) ⬇️
src/sorting.jl 23.17% <0.00%> (-5.62%) ⬇️
lib/cupti/error.jl 29.41% <0.00%> (-1.84%) ⬇️
lib/cublas/wrappers.jl 90.50% <0.00%> (+0.12%) ⬆️
... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Base automatically changed from tb/improvements to master on April 2, 2021, 10:01
maleadt (Member, Author) commented Apr 2, 2021

GPUArrays' RNG doesn't pass SmallCrush either, so let's merge this. Still, let's not enable the CUDA RNG before somebody who knows about random numbers has chimed in. So for now host-side rand is still using GPUArrays, although we do now have a device-side rand that's based on the (flawed) CUDA generator. I've opened an issue to track improvements: #803
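
For completeness, the device-side rand mentioned above can be used directly inside kernels, roughly like this (sketch):

using CUDA

function fill_rand!(A)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(A)
        @inbounds A[i] = rand(Float32)   # device-side rand, backed by the new generator
    end
    return
end

A = CuArray{Float32}(undef, 1024)
@cuda threads=256 blocks=4 fill_rand!(A)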

maleadt merged commit 03b4c39 into master on Apr 2, 2021
maleadt deleted the tb/speedup_rand branch on April 2, 2021, 11:43
simsurace:

🤷 I just compared against Base's RNG, which produces similar results:

julia> A = rand(Float32, 1024, 1024);

julia> Adups = nonunique(A);

julia> @show 100 * length(nonunique(vec(A))) / length(A)
(100 * length(nonunique(vec(A)))) / length(A) = 5.735969543457031
5.735969543457031

But that doesn't matter, as the crush failures are worrisome.

I think these are too many collisions by two to three orders of magnitude; see e.g. this article, which confirms the estimate I made above. So I should probably file an issue for Base as well.

simsurace:

Actually, the number of duplicates is as expected, as confirmed in issue #40355: there are just 2^23, not 2^32, elements to choose from, so the formula above gives an answer in line with the number of duplicates seen both in Base's rand and in this RNG.
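
Redoing the same estimate with N = 2^23 possible values indeed lands in the right ballpark (sketch):

n, N = 2^20, 2^23
expected_unique = N * (1 - (1 - 1/N)^n)
100 * (n - expected_unique) / n            # ≈ 6%, in line with the ~5.7% observed above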

Labels: cuda kernels, performance
3 participants