Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass arguments to _avx_! in registers. #261

Merged
merged 3 commits into from
May 13, 2021
Merged

Conversation

chriselrod
Copy link
Member

@chriselrod chriselrod commented May 12, 2021

On current master, they are passed in the stack.
This PR tries to pass them in registers instead.

Of course, this only matters when _avx_! isn't inlined.

@codecov
Copy link

codecov bot commented May 12, 2021

Codecov Report

Merging #261 (2486aaa) into master (3ef16fb) will decrease coverage by 0.04%.
The diff coverage is 86.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #261      +/-   ##
==========================================
- Coverage   90.25%   90.20%   -0.05%     
==========================================
  Files          36       36              
  Lines        7572     7650      +78     
==========================================
+ Hits         6834     6901      +67     
- Misses        738      749      +11     
Impacted Files Coverage Δ
src/LoopVectorization.jl 100.00% <ø> (ø)
src/parse/memory_ops_common.jl 80.60% <0.00%> (ø)
src/codegen/lowering.jl 89.67% <66.66%> (-0.31%) ⬇️
src/codegen/lower_threads.jl 59.94% <69.23%> (+0.05%) ⬆️
src/condense_loopset.jl 92.65% <85.00%> (-0.90%) ⬇️
src/modeling/determinestrategy.jl 97.18% <90.90%> (-0.02%) ⬇️
src/modeling/graphs.jl 88.82% <95.83%> (+0.23%) ⬆️
src/codegen/loopstartstopmanager.jl 89.11% <100.00%> (+0.02%) ⬆️
src/codegen/lower_compute.jl 93.84% <100.00%> (ø)
src/reconstruct_loopset.jl 93.41% <100.00%> (+0.01%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ef16fb...2486aaa. Read the comment docs.

@chriselrod
Copy link
Member Author

The bump in VectorizationBase requirement fixes #214:

julia> @btime fun_avxt($D,$QT,$μ,$σ,$m,$i);
  1.378 μs (0 allocations: 0 bytes)

julia> @btime fun_avx($D,$QT,$μ,$σ,$m,$i);
  7.565 μs (0 allocations: 0 bytes)

julia> @btime fun_simd($D,$QT,$μ,$σ,$m,$i);
  9.948 μs (0 allocations: 0 bytes)

julia> # redefine args

julia> eltype(D)
Float32

julia> @btime fun_avxt($D,$QT,$μ,$σ,$m,$i);
  905.000 ns (0 allocations: 0 bytes)

julia> @btime fun_avx($D,$QT,$μ,$σ,$m,$i);
  1.848 μs (0 allocations: 0 bytes)

julia> @btime fun_simd($D,$QT,$μ,$σ,$m,$i);
  1.873 μs (0 allocations: 0 bytes)

This allows using approximate inverses in fun_simd.

Should also fix the issue of passing in types causing type instabilities, and it should theoretically lower the function call overhead, and a few benchmarks seem to suggest that helps threading in cases where _avx_! isn't inlined (as then we do have smaller calls).

using LoopVectorization, Octavian, LinearAlgebra, StaticArrays, BenchmarkTools

@inline function AmulB!(C,A,B)
    @avx for n in indices((B,C),2), m in indices((A,C),1)
        Cmn = zero(eltype(C))
        for k in indices((A,B),(2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    C
end

M = K = N = 8; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B); Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);
@time(AmulB!(C0,A,B)); C0  C1 # compile time (a bunch of things were compiled already when I ran this)
@benchmark AmulB!($C0,$A,$B) # dynamic 
@benchmark matmul_serial!($C0,$A,$B) # dynamically sized Octavian, serial
@benchmark matmul!($C0,$A,$B) # dynamically sized Octavian, multithreaded

Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B);
Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);

@benchmark mul!($Cmutable, $Amutable, $Bmutable)
@benchmark $(Ref(Astatic))[] * $(Ref(Bstatic))[]

@benchmark AmulB!($Cmutable, $Amutable, $Bmutable)
@benchmark matmul!($Cmutable, $Amutable, $Bmutable)

Results:

julia> @time(AmulB!(C0,A,B)); C0  C1 # I redefined `AmulB!`, but that particular `_avx_!` was already compiled
  0.009071 seconds (37.50 k allocations: 1.917 MiB, 99.85% compilation time)
true

julia> @benchmark AmulB!($C0,$A,$B) # dynamic
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.575 ns (0.00% GC)
  median time:      24.313 ns (0.00% GC)
  mean time:        24.355 ns (0.00% GC)
  maximum time:     59.959 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark matmul_serial!($C0,$A,$B) # dynamically sized Octavian, serial
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.155 ns (0.00% GC)
  median time:      27.295 ns (0.00% GC)
  mean time:        27.356 ns (0.00% GC)
  maximum time:     76.102 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark matmul!($C0,$A,$B) # dynamically sized Octavian, multithreaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.087 ns (0.00% GC)
  median time:      27.390 ns (0.00% GC)
  mean time:        27.420 ns (0.00% GC)
  maximum time:     65.763 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B);

julia> Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);

julia> @benchmark mul!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     62.418 ns (0.00% GC)
  median time:      62.452 ns (0.00% GC)
  mean time:        62.533 ns (0.00% GC)
  maximum time:     93.819 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983

julia> @benchmark $(Ref(Astatic))[] * $(Ref(Bstatic))[]
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.092 ns (0.00% GC)
  median time:      37.280 ns (0.00% GC)
  mean time:        37.296 ns (0.00% GC)
  maximum time:     68.081 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark AmulB!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.097 ns (0.00% GC)
  median time:      12.146 ns (0.00% GC)
  mean time:        12.168 ns (0.00% GC)
  maximum time:     38.972 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark matmul!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.097 ns (0.00% GC)
  median time:      12.148 ns (0.00% GC)
  mean time:        12.168 ns (0.00% GC)
  maximum time:     38.316 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

So the dynamically sized version already yields a good improvement over StaticArrays here.

@chriselrod chriselrod merged commit 93d2be2 into master May 13, 2021
@chriselrod chriselrod deleted the avxargsinregisters branch June 1, 2021 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant