Significant performance regression when precompiled #50749

Closed
timholy opened this issue Aug 1, 2023 · 18 comments · Fixed by #50766
Labels
compiler:precompilation (Precompilation of modules) · performance (Must go faster) · regression (Regression in behavior compared to a previous version)

Comments

@timholy (Member) commented Aug 1, 2023

Using the teh/pc branch of my fork of VectorizedRNG, performance seems worse with a precompile workload than without one. The teh/pc branch ships with that workload commented out, and in that state I get (on an old CPU):

julia> using VectorizedRNG, Random

julia> drng = Random.default_rng(); lrng = local_rng();

julia> x64 = Vector{Float64}(undef, 255);

julia> using BenchmarkTools

julia> @bprofile rand!($lrng, $x64) samples=10000 evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  134.000 ns … 706.949 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     139.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   229.106 ns ±   7.303 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

The key thing is the 139 ns median run time. (I deleted the histogram plot since that doesn't transfer well to GitHub.) Whereas when I uncomment the precompile block and use the same commands, I get

julia> @bprofile rand!($lrng, $x64) samples=10000 evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  1.594 μs … 967.359 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.615 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.547 μs ±  24.092 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

i.e., seemingly more than an order of magnitude slower.

I say "seemingly" because there's something quite interesting about the profile. Here's a snippet:

Fast version:

 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:774; jl_interpret_toplevel_thunk
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 /home/tim/src/julia-branch/src/interpreter.c:543; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 /home/tim/src/julia-branch/src/interpreter.c:488; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 /home/tim/src/julia-branch/src/interpreter.c:222; eval_value
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 /home/tim/src/julia-branch/src/interpreter.c:125; do_call
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 @BenchmarkTools/src/execution.jl:117; run
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Fl...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 @BenchmarkTools/src/execution.jl:34; run_result
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 @BenchmarkTools/src/execution.jl:34; #run_result#45
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 @Base/essentials.jl:884; invokelatest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 @Base/essentials.jl:887; #invokelatest#2
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 /home/tim/src/julia-branch/src/builtins.c:812; jl_f__call_latest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 @BenchmarkTools/src/execution.jl:92; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 @BenchmarkTools/src/execution.jl:105; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::Stri...

whereas the slow version is

 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:774; jl_interpret_toplevel_thunk
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 /home/tim/src/julia-branch/src/interpreter.c:543; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 /home/tim/src/julia-branch/src/interpreter.c:488; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 /home/tim/src/julia-branch/src/interpreter.c:222; eval_value
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 /home/tim/src/julia-branch/src/interpreter.c:125; do_call
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 @BenchmarkTools/src/execution.jl:117; run
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Fl...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 @BenchmarkTools/src/execution.jl:34; run_result
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 @BenchmarkTools/src/execution.jl:34; #run_result#45
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 @Base/essentials.jl:884; invokelatest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 @Base/essentials.jl:887; #invokelatest#2
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 /home/tim/src/julia-branch/src/builtins.c:812; jl_f__call_latest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 @BenchmarkTools/src/execution.jl:92; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 @BenchmarkTools/src/execution.jl:105; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::Stri...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 25 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  25 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   25 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    18 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     16 @BenchmarkTools/src/execution.jl:495; var"##sample#236"(::Tuple{VectorizedRNG.Xoshiro{2}, Vector{Float64}}, __params::Benc...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 16 @BenchmarkTools/src/execution.jl:489; var"##core#235"(lrng#233::VectorizedRNG.Xoshiro{2}, x64#234::Vector{Float64})
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  16 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{0}, β::...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   14 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{0}, β:...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    12 @VectorizedRNG/src/api.jl:314; samplevector!
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     8  @VectorizedRNG/src/api.jl:216; random_uniform
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 8  @VectorizedRNG/src/api.jl:34; random_uniform
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  7  @VectorizedRNG/src/masks.jl:66; floatbitmask
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   6  @VectorizedRNG/src/masks.jl:34; setbits
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    6  @VectorizedRNG/src/masks.jl:33; setbits

There is a puzzle here: if the two differ by an order of magnitude, why is the number of profile samples approximately the same? (At top level, the slow one has 30 samples and the fast one 28.) Perhaps samples=10000 evals=1 is insufficient to ensure that they are running the same workload, but it's not entirely clear.

CC @chriselrod

timholy added the compiler:precompilation (Precompilation of modules) label Aug 1, 2023
@vchuravy (Member) commented Aug 1, 2023

Some slowdown is expected; an order of magnitude is not. I recommend a native profiler like Hotspot, perf, or VTune.
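
For example, a native-profiler run might look like this (a sketch assuming Linux perf; script.jl stands for a reproducer such as the one posted later in this thread):

perf record --call-graph dwarf julia --project script.jl
perf report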

@chriselrod (Contributor) commented Aug 1, 2023

With precompilation, I get the following (I removed an initial @pstats run that used fewer iterations and was only there to force compilation):

julia> using LinuxPerf

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> Base.donotdelete( f(args...)), Base.OneTo(N))
foreachf! (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(rand!, 10_000_000, lrng, x64)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.98e+10   60.0%  #  3.9 cycles per ns
┌ instructions             4.07e+10   60.0%  #  1.4 insns per cycle
│ branch-instructions      1.39e+10   60.0%  # 34.3% of insns
└ branch-misses            9.95e+06   60.0%  #  0.1% of branch insns
┌ task-clock               7.72e+09  100.0%  #  7.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    2.16e+05   20.0%  #  0.0% of dcache loads
│ L1-dcache-loads          1.56e+10   20.0%
└ L1-icache-load-misses    3.95e+05   20.0%
┌ dTLB-load-misses         1.50e+01   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.54e+10   20.0%
┌ iTLB-load-misses         1.66e+03   40.0%  #  7.5% of iTLB loads
└ iTLB-loads               2.20e+04   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Without precompilation:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(rand!, 10_000_000, lrng, x64)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.87e+09   59.9%  #  3.9 cycles per ns
┌ instructions             5.00e+09   60.0%  #  2.7 insns per cycle
│ branch-instructions      2.40e+08   60.0%  #  4.8% of insns
└ branch-misses            1.99e+03   60.0%  #  0.0% of branch insns
┌ task-clock               4.84e+08  100.0%  # 483.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    9.64e+03   20.0%  #  0.0% of dcache loads
│ L1-dcache-loads          2.51e+08   20.0%
└ L1-icache-load-misses    6.30e+03   20.0%
┌ dTLB-load-misses         0.00e+00   19.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.50e+08   19.9%
┌ iTLB-load-misses         5.01e+00   39.9%  #  0.4% of iTLB loads
└ iTLB-loads               1.36e+03   39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

So, with precompilation, we execute 2.78 times as many branch instructions as the total number of instructions executed without precompilation.

I get about a 16x performance difference (7.7 s vs 0.484 s for the 10 million repetitions).

Also, FWIW, default_rng() is >6x slower than VectorizedRNG, despite likewise being vectorized and using mostly the same algorithm.
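
Spelling out the arithmetic behind those ratios (counter values copied from the two @pstats dumps above):

1.39e10 / 5.00e9   # ≈ 2.78: branch instructions (precompiled) vs. total instructions (not precompiled)
2.98e10 / 1.87e9   # ≈ 15.9: cpu-cycles ratio
7.72 / 0.484       # ≈ 16.0: task-clock ratio (in seconds)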

Copy-pastable script, if your CPU supports the same performance counters as mine:

julia> using VectorizedRNG, Random, LinuxPerf
Precompiling VectorizedRNG
  1 dependency successfully precompiled in 2 seconds. 30 already precompiled.

julia> x64 = Vector{Float64}(undef, 255);

julia> drng = Random.default_rng(); lrng = local_rng();

julia> @benchmark rand!($lrng, $x64)
BenchmarkTools.Trial: 10000 samples with 986 evaluations.
 Range (min … max):  52.124 ns … 88.696 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     52.493 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.680 ns ±  1.211 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▆▃█▇                  ▁▁▁▁                               ▂
  ▃▁██████▅▃▁▅▅▃▃▃▄▄▃▅▅▆▇██████████▇▇▇▇█▇▅▅▅▅▅▄▄▅▁▃▄▅▄▆▆▅▄▅▆▅ █
  52.1 ns      Histogram: log(frequency) by time      55.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> Base.donotdelete( f(args...)), Base.OneTo(N))
foreachf! (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(rand!, 1000, lrng, x64)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.19e+07   57.4%  #  3.5 cycles per ns
┌ instructions             2.11e+07   68.3%  #  1.0 insns per cycle
│ branch-instructions      4.21e+06   68.3%  # 20.0% of insns
└ branch-misses            2.16e+05   68.3%  #  5.1% of branch insns
┌ task-clock               6.26e+06  100.0%  #  6.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              7.50e+01  100.0%
┌ L1-dcache-load-misses    2.21e+05   15.7%  #  4.5% of dcache loads
│ L1-dcache-loads          4.91e+06   15.7%
└ L1-icache-load-misses    1.41e+06   15.7%
┌ dTLB-load-misses         8.51e+03   16.0%  #  0.1% of dTLB loads
└ dTLB-loads               6.72e+06   16.0%
┌ iTLB-load-misses         8.24e+03   32.0%  # 22.7% of iTLB loads
└ iTLB-loads               3.63e+04   32.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(rand!, 10_000_000, lrng, x64)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.87e+09   59.9%  #  3.9 cycles per ns
┌ instructions             5.00e+09   60.0%  #  2.7 insns per cycle
│ branch-instructions      2.40e+08   60.0%  #  4.8% of insns
└ branch-misses            1.99e+03   60.0%  #  0.0% of branch insns
┌ task-clock               4.84e+08  100.0%  # 483.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    9.64e+03   20.0%  #  0.0% of dcache loads
│ L1-dcache-loads          2.51e+08   20.0%
└ L1-icache-load-misses    6.30e+03   20.0%
┌ dTLB-load-misses         0.00e+00   19.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.50e+08   19.9%
┌ iTLB-load-misses         5.01e+00   39.9%  #  0.4% of iTLB loads
└ iTLB-loads               1.36e+03   39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

@timholy (Member, Author) commented Aug 1, 2023

I can verify this with

function runmany!(rng, buf, n)
    for i = 1:n
        Base.donotdelete(rand!(rng, buf))
    end
    return buf
end

which checks that the specialization heuristics around foreach are not playing a role. I get about a 13x regression with precompilation.
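
For example, with the setup above (this mirrors the script.jl posted further down in this thread):

julia> runmany!(lrng, x64, 1);          # warm up / compile

julia> @time runmany!(lrng, x64, 10^6);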

timholy changed the title from "(Maybe) worse performance when precompiled" to "Significant performance regression when precompiled" Aug 1, 2023
@timholy (Member, Author) commented Aug 1, 2023

Using Cthulhu, I've checked the LLVM IR of samplevector!:

 Function Signature: samplevector!(typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Array{Float64, 1}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(Base.identity))
;  @ /home/tim/.julia/dev/VectorizedRNG/src/api.jl:293 within `samplevector!`
define swiftcc nonnull {}* @"julia_samplevector!_766"({}*** nonnull swiftself %pgcstack, [1 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(8) %"rng::Xoshiro", {}* noundef nonnull align 16 dereferenceable(40) %"x::Array") #0 {

and unless I've screwed up they seem the same. Might they differ in their native-code optimizations?

EDIT: moreover, their native code also looks nearly identical.
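
For anyone reproducing the inspection, it was along these lines (a sketch; Cthulhu's descend lets you toggle between typed, LLVM, and native views interactively):

julia> using Cthulhu, VectorizedRNG, Static

julia> descend(VectorizedRNG.samplevector!,
               Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64},
                     Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)})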

@pchintalapudi (Member)

Try with --image-codegen or @code_native dump_module=false; anything else will give you the JIT-compiled version of the native code rather than the image version.

@timholy (Member, Author) commented Aug 1, 2023

Thanks for the tip. Here's what I tried (both of them I think):

tim@flash:~/.julia/dev/VectorizedRNG$ ~/src/julia-branch/julia --project --image-codegen
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.203 (2023-07-31)
 _/ |\__'_|_|_|\__'_|  |  Commit ec8df3da35* (1 day old master)
|__/                   |

(VectorizedRNG) pkg> add Static
   Resolving package versions...
    Updating `~/.julia/dev/VectorizedRNG/Project.toml`
  [aedffcd0] + Static v0.8.8
  No Changes to `~/.julia/dev/VectorizedRNG/Manifest.toml`
Precompiling project...
  1 dependency successfully precompiled in 3 seconds. 34 already precompiled.

julia> include("script.jl")
  1.457765 seconds (982.10 k allocations: 65.922 MiB, 14.16% gc time, 100.00% compilation time)
  0.117277 seconds

julia> using Static

julia> code_native(VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
...

and then I copy/pasted the output in the terminal to a file. I did this with and without precompilation, then ran diff --color native_fast.log native_slow.log. Here's a screenshot to see the color:
[screenshot: colored diff of the two native-code dumps]

I assume that's inconsequential? Am I doing something wrong?

DilumAluthge added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels Aug 1, 2023
@pchintalapudi (Member) commented Aug 1, 2023

With precompilation, you'll want code_native(...; dump_module=true|false) without --image-codegen: true will give you what the JIT would have compiled, false will give you what's actually in the image. If it's not in an image, then just doing code_native with julia --imaging-mode will give the image version of the native code, while doing it without --imaging-mode will give you the JIT version (dump_module doesn't matter here, since we're compiling the code outright regardless).

Also, the second approach has the advantage that code_llvm will give the imaging-mode output as well, and I personally find LLVM IR more readable than assembly.
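
In other words, something along these lines (a sketch of the combinations described above; f and tt stand for the function and argument-type Tuple of interest, e.g. the samplevector! signature used elsewhere in this thread):

# with the package precompiled, in a plain julia session:
code_native(f, tt; dump_module=true)    # what the JIT would have compiled
code_native(f, tt; dump_module=false)   # what is actually stored in the package image

# without a package image: start julia with the imaging flag mentioned above,
# and code_native/code_llvm then show the image-style output directly:
code_llvm(f, tt)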

@timholy (Member, Author) commented Aug 2, 2023

Here is the script:

tim@flash:~/.julia/dev/VectorizedRNG$ cat script.jl
using VectorizedRNG, Random

drng = Random.default_rng(); lrng = local_rng();
x64 = Vector{Float64}(undef, 255);

function runmany!(rng, buf, n)
    for i = 1:n
        Base.donotdelete(rand!(rng, buf))
    end
    return buf
end

@time runmany!(lrng, x64, 1)
@time runmany!(lrng, x64, 10^6)
nothing

Here is a session from the (slow) version with precompilation:

tim@flash:~/.julia/dev/VectorizedRNG$ ~/src/julia-branch/julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.203 (2023-07-31)
 _/ |\__'_|_|_|\__'_|  |  Commit ec8df3da35* (1 day old master)
|__/                   |

julia> include("script.jl")
Precompiling VectorizedRNG
  1 dependency successfully precompiled in 3 seconds. 30 already precompiled.
  0.005778 seconds (2.63 k allocations: 184.414 KiB, 99.47% compilation time)
  1.580240 seconds

julia> using Static

julia> open("/tmp/tim/slowpc/slow_native.log", "w") do io
           code_native(io, VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
       end

julia> open("/tmp/tim/slowpc/slow_llvm.log", "w") do io
           code_llvm(io, VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
       end

The changes for the fast version (which comments out the @compile_workload block) should be obvious.
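
For context, the toggled block is a standard PrecompileTools workload, roughly of this shape (a hypothetical sketch; the real block lives in the teh/pc branch and its exact contents differ):

# inside the VectorizedRNG module
using PrecompileTools: @setup_workload, @compile_workload
@setup_workload begin
    x = Vector{Float64}(undef, 255)
    @compile_workload begin
        rand!(local_rng(), x)   # exercise the code path benchmarked above
    end
end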

And here are the complete diffs:
[screenshot: diff of the native-code dumps]

[screenshot: diff of the LLVM IR dumps]

@timholy (Member, Author) commented Aug 2, 2023

Oh interesting, if I profile the runmany! call, the slow version shows this (I trimmed off the boring part):

  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  1590 @VectorizedRNG/script.jl:8; runmany!(rng::VectorizedRNG.Xoshiro{2}, buf::Vector{Float64}, n::Int64)
 1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1589 @VectorizedRNG/src/api.jl:366; rand!
 1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    1322 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt...
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     811  @VectorizedRNG/src/api.jl:314; samplevector!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 443  @VectorizedRNG/src/api.jl:215; random_uniform
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  68   @VectorizedRNG/src/xoshiro.jl:588; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   68   @VectorizationBase/src/vecunroll/fmap.jl:194; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    25   @VectorizationBase/src/llvm_intrin/conversion.jl:357; Vec
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     25   @VectorizationBase/src/llvm_intrin/conversion.jl:188; vconvert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 25   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:122; _vbroadcast
23╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  25   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:95; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    43   @VectorizationBase/src/vecunroll/fmap.jl:16; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     29   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:888; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 29   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; funnel_shift_right
29╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  29   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     14   @VectorizationBase/src/vecunroll/fmap.jl:9; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 14   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:888; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  14   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; funnel_shift_right
14╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   14   ...torizationBase/src/llvm_intrin/intrin_funcs.jl:856; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  351  @VectorizedRNG/src/xoshiro.jl:589; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   54   @VectorizedRNG/src/xoshiro.jl:495; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    27   @VectorizationBase/src/base_defs.jl:98; <<
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     27   @VectorizationBase/src/promotion.jl:127; promote_div
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 27   @VectorizationBase/src/promotion.jl:139; promote_div
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  27   @VectorizationBase/src/base_defs.jl:209; rem
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   27   @VectorizationBase/src/llvm_intrin/conversion.jl:451; vrem
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    27   @VectorizationBase/src/base_defs.jl:199; convert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     27   @VectorizationBase/src/llvm_intrin/conversion.jl:232; vconvert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 27   @VectorizationBase/src/llvm_intrin/conversion.jl:188; vconvert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  27   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:122; _vbroadcast
20╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   27   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:95; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    27   @VectorizationBase/src/base_defs.jl:99; <<
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     27   @VectorizationBase/src/vecunroll/fmap.jl:159; vshl
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 27   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  18   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   18   @VectorizationBase/src/llvm_intrin/binary_ops.jl:25; vshl
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    18   @VectorizationBase/src/llvm_intrin/binary_ops.jl:31; vshl_fast
18╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     18   @VectorizationBase/src/llvm_intrin/binary_ops.jl:31; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   54   @VectorizedRNG/src/xoshiro.jl:497; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    54   @VectorizationBase/src/base_defs.jl:91; xor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     54   @VectorizationBase/src/vecunroll/fmap.jl:111; vxor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 54   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  26   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
26╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   26   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
28╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   37   @VectorizedRNG/src/xoshiro.jl:498; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    37   @VectorizationBase/src/base_defs.jl:91; xor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     37   @VectorizationBase/src/vecunroll/fmap.jl:111; vxor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 37   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  19   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
18╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   19   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  18   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   18   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
11╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    18   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   50   @VectorizedRNG/src/xoshiro.jl:499; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    50   @VectorizationBase/src/base_defs.jl:91; xor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     50   @VectorizationBase/src/vecunroll/fmap.jl:111; vxor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 50   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  31   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
31╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   31   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  19   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   19   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
 9╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    19   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
10╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     10   ...ia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2683
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   74   @VectorizedRNG/src/xoshiro.jl:500; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    74   @VectorizationBase/src/base_defs.jl:91; xor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     74   @VectorizationBase/src/vecunroll/fmap.jl:111; vxor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 74   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  37   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
37╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   37   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  37   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   37   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vxor
37╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    37   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   56   @VectorizedRNG/src/xoshiro.jl:501; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    56   @VectorizationBase/src/vecunroll/fmap.jl:194; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     56   @VectorizationBase/src/vecunroll/fmap.jl:16; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 24   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:888; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  24   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; funnel_shift_right
24╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   24   ...torizationBase/src/llvm_intrin/intrin_funcs.jl:856; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 32   @VectorizationBase/src/vecunroll/fmap.jl:9; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  32   @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:888; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   32   ...torizationBase/src/llvm_intrin/intrin_funcs.jl:856; funnel_shift_right
32╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    32   ...torizationBase/src/llvm_intrin/intrin_funcs.jl:856; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 368  @VectorizedRNG/src/api.jl:216; random_uniform
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  368  @VectorizedRNG/src/api.jl:34; random_uniform
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   253  @VectorizedRNG/src/masks.jl:66; floatbitmask
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    153  @VectorizedRNG/src/masks.jl:34; setbits
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     153  @VectorizedRNG/src/masks.jl:33; setbits
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 55   @VectorizationBase/src/base_defs.jl:99; &
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  55   @VectorizationBase/src/vecunroll/fmap.jl:111; vand
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   55   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    51   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     51   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vand
24╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 51   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
27╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  27   ...a/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2691
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 33   @VectorizationBase/src/base_defs.jl:98; |
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  33   @Base/promotion.jl:393; promote
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   33   @Base/promotion.jl:370; _promote
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    33   @VectorizationBase/src/base_defs.jl:199; convert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     33   @VectorizationBase/src/llvm_intrin/conversion.jl:232; vconvert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 33   @VectorizationBase/src/llvm_intrin/conversion.jl:188; vconvert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  33   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:122; _vbroadcast
29╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   33   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:95; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 52   @VectorizationBase/src/base_defs.jl:99; |
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  52   @VectorizationBase/src/vecunroll/fmap.jl:111; vor
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   52   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vor
20╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    24   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     24   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; vor
10╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 24   @VectorizationBase/src/llvm_intrin/binary_ops.jl:100; macro expansion
14╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  14   ...a/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2694
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    100  @VectorizationBase/src/base_defs.jl:201; reinterpret
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     100  @VectorizationBase/src/vecunroll/fmap.jl:82; vreinterpret
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 100  @VectorizationBase/src/vecunroll/fmap.jl:18; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  30   @VectorizationBase/src/llvm_intrin/conversion.jl:424; vreinterpret
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   30   @VectorizationBase/src/llvm_intrin/conversion.jl:424; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    30   @VectorizationBase/src/llvm_intrin/conversion.jl:435; vreinterpret
30╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     30   @VectorizationBase/src/llvm_intrin/conversion.jl:18; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  70   @VectorizationBase/src/vecunroll/fmap.jl:10; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   70   @VectorizationBase/src/llvm_intrin/conversion.jl:424; vreinterpret
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    70   @VectorizationBase/src/llvm_intrin/conversion.jl:424; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     70   @VectorizationBase/src/llvm_intrin/conversion.jl:435; vreinterpret
70╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 70   @VectorizationBase/src/llvm_intrin/conversion.jl:18; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   99   @VectorizationBase/src/base_defs.jl:99; -
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    99   @VectorizationBase/src/vecunroll/fmap.jl:111; vsub
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     99   @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 79   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; vsub
21╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  79   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; macro expansion
58╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   58   ...lia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2698
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 20   @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  20   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; vsub
 6╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   20   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; macro expansion
14╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    14   ...lia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2699
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     265  @VectorizedRNG/src/api.jl:320; samplevector!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 77   @Base/promotion.jl:423; *
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  19   @VectorizationBase/src/base_defs.jl:94; *
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   19   @Base/promotion.jl:393; promote
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    19   @Base/promotion.jl:370; _promote
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     19   @VectorizationBase/src/base_defs.jl:199; convert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 19   @VectorizationBase/src/llvm_intrin/conversion.jl:184; vconvert
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  19   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:151; vbroadcast
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   19   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:122; _vbroadcast
10╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    19   @VectorizationBase/src/llvm_intrin/vbroadcast.jl:95; macro expansion
 9╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     9    ...ia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2700
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  58   @VectorizationBase/src/base_defs.jl:95; *
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   58   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; vmul
 5╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    58   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; macro expansion
53╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     53   ...ulia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2701
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 93   @Base/promotion.jl:422; +
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  76   @VectorizationBase/src/base_defs.jl:95; +
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   76   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; vadd
29╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    76   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; macro expansion
47╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     47   ...ulia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2703
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 95   @VectorizationBase/src/llvm_intrin/memory_addr.jl:2094; vstore!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  27   ...ationBase/src/strided_pointers/stridedpointers.jl:198; _vstore!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   27   ...ationBase/src/strided_pointers/stridedpointers.jl:45; linear_index
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    27   ...onBase/src/strided_pointers/cartesian_indexing.jl:5; tdot
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     27   ...nBase/src/strided_pointers/cartesian_indexing.jl:9; tdot
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 27   @VectorizationBase/src/lazymul.jl:61; lazymul
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  27   @VectorizationBase/src/static.jl:53; vmul_nsw
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   27   @VectorizationBase/src/llvm_intrin/binary_ops.jl:49; vmul_nsw
27╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    27   @VectorizationBase/src/llvm_intrin/binary_ops.jl:49; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  68   ...ationBase/src/strided_pointers/stridedpointers.jl:199; _vstore!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   68   @VectorizationBase/src/llvm_intrin/memory_addr.jl:1482; __vstore!
19╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    68   @VectorizationBase/src/llvm_intrin/memory_addr.jl:1482; macro expansion
49╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     49   ...ulia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2705
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     149  @VectorizedRNG/src/api.jl:321; samplevector!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 54   @Base/promotion.jl:422; +
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  43   @VectorizationBase/src/base_defs.jl:95; +
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   43   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; vadd
 5╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    43   @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; macro expansion
38╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     38   ...ulia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2709
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 73   @VectorizationBase/src/llvm_intrin/memory_addr.jl:2094; vstore!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  61   ...ationBase/src/strided_pointers/stridedpointers.jl:199; _vstore!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   61   @VectorizationBase/src/llvm_intrin/memory_addr.jl:1482; __vstore!
28╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    61   @VectorizationBase/src/llvm_intrin/memory_addr.jl:1482; macro expansion
33╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     33   ...ulia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2711
 1╎1    /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/del_op.cc:48; operator delete
Total snapshots: 1592. Utilization: 100% across all threads and tasks. Use the `groupby` kwarg to break down by thread and/or task.

One bizarre feature: note the things that get listed with a file/line in the .so file. 🤔

For comparison, here's the profile of the fast one:

  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  108 @VectorizedRNG/script.jl:8; runmany!(rng::VectorizedRNG.Xoshiro{2}, buf::Vector{Float64}, n::Int64)
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   108 @VectorizedRNG/src/api.jl:366; rand!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    107 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{...
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     90  @VectorizedRNG/src/api.jl:314; samplevector!
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 69  @VectorizedRNG/src/api.jl:215; random_uniform
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  17  @VectorizedRNG/src/xoshiro.jl:588; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   17  @VectorizationBase/src/vecunroll/fmap.jl:194; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    17  @VectorizationBase/src/vecunroll/fmap.jl:16; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     11  @VectorizationBase/src/vecunroll/fmap.jl:9; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 11  @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:888; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  11  @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; funnel_shift_right
11╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   11  @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  45  @VectorizedRNG/src/xoshiro.jl:589; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   19  @VectorizedRNG/src/xoshiro.jl:501; nextstate
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    19  @VectorizationBase/src/vecunroll/fmap.jl:194; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     19  @VectorizationBase/src/vecunroll/fmap.jl:16; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 12  @VectorizationBase/src/vecunroll/fmap.jl:9; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  12  @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:888; rotate_right
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   12  @VectorizationBase/src/llvm_intrin/intrin_funcs.jl:856; funnel_shift_right
12╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    12  ...torizationBase/src/llvm_intrin/intrin_funcs.jl:856; macro expansion
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 21  @VectorizedRNG/src/api.jl:216; random_uniform
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  21  @VectorizedRNG/src/api.jl:34; random_uniform
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   17  @VectorizationBase/src/base_defs.jl:99; -
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    17  @VectorizationBase/src/vecunroll/fmap.jl:111; vsub
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     17  @VectorizationBase/src/vecunroll/fmap.jl:11; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 16  @VectorizationBase/src/vecunroll/fmap.jl:7; fmap
  ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  16  @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; vsub
15╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   16  @VectorizationBase/src/llvm_intrin/binary_ops.jl:112; macro expansion
Total snapshots: 110. Utilization: 100% across all threads and tasks. Use the `groupby` kwarg to break down by thread and/or task.

@gbaraldi (Member) commented Aug 2, 2023

Those look exactly the same 🤔

@timholy (Member, Author) commented Aug 2, 2023

The code looks the same, but the profile doesn't, which puzzles me.

@chriselrod (Contributor)

I imagine code_llvm and code_native are still lying.

@gbaraldi (Member) commented Aug 2, 2023

I'll see if perf can see through it.

@pchintalapudi (Member)

The other way to get the real, optimized LLVM is to run with JULIA_LLVM_ARGS="--print-before=AfterOptimization", which produces a whole lot more output but includes everything the JIT compiled (BeforeOptimization will give the unoptimized code, obviously). The same thing also works during precompilation, but you'll get a giant module with every function in the image as well as the JIT-ted modules created by the precompile worker. Pkg might swallow that output (it arrives on stderr), so some other changes may be needed to actually capture it for the image.
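
Concretely, that might look like the following (a sketch assuming a bash shell and the script.jl reproducer from above; the IR arrives on stderr, so capture it to a file):

JULIA_LLVM_ARGS="--print-before=AfterOptimization" julia --project script.jl 2> llvm_dump.ll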

@gbaraldi (Member) commented Aug 2, 2023

JIT: [screenshot of the optimized LLVM IR]

Image: [screenshot of the optimized LLVM IR]

There are lots of non-inlined calls and other weird things.

@timholy (Member, Author) commented Aug 2, 2023

Notice that's one of the things attributed to the .so file in the Julia profile (e.g., ...ulia/compiled/v1.11/VectorizedRNG/wqLDZ_BnW93.so:?; julia_randNOT._2660u2711).

@gbaraldi (Member) commented Aug 2, 2023

I'm trying to look at the generated LLVM, but for the sysimage module it just produces garbage. Is there a way for us to emit a package image to a .bc/.ll file?
Maybe it's multiple threads printing 🤔

@gbaraldi (Member) commented Aug 2, 2023

So, after talking a bit with @pchintalapudi, we realized what is happening: during precompilation, the llvmcall modules that are marked always_inline can get split from the main function body, so those calls end up not being inlined, which causes the slowdown reported here.
To confirm this, I ran the precompilation with export JULIA_IMAGE_THREADS=1, which fixes the issue.
The solution is to be more precise about how we split the modules so that we don't separate these functions.
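
For reference, the confirmation/workaround was along these lines (a sketch; forcing single-threaded image generation before re-precompiling the package):

export JULIA_IMAGE_THREADS=1                       # serialize image codegen so the llvmcall modules aren't split
julia --project -e 'using Pkg; Pkg.precompile()'   # re-precompile (may require clearing the existing cache first)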
