Pass arguments to _avx_! in registers. #261
Conversation
Codecov Report
@@ Coverage Diff @@
## master #261 +/- ##
==========================================
- Coverage 90.25% 90.20% -0.05%
==========================================
Files 36 36
Lines 7572 7650 +78
==========================================
+ Hits 6834 6901 +67
- Misses 738 749 +11
Continue to review full report at Codecov.
The bump in the VectorizationBase requirement fixes #214:

julia> @btime fun_avxt($D,$QT,$μ,$σ,$m,$i);
  1.378 μs (0 allocations: 0 bytes)

julia> @btime fun_avx($D,$QT,$μ,$σ,$m,$i);
  7.565 μs (0 allocations: 0 bytes)

julia> @btime fun_simd($D,$QT,$μ,$σ,$m,$i);
  9.948 μs (0 allocations: 0 bytes)

julia> # redefine args

julia> eltype(D)
Float32

julia> @btime fun_avxt($D,$QT,$μ,$σ,$m,$i);
  905.000 ns (0 allocations: 0 bytes)

julia> @btime fun_avx($D,$QT,$μ,$σ,$m,$i);
  1.848 μs (0 allocations: 0 bytes)

julia> @btime fun_simd($D,$QT,$μ,$σ,$m,$i);
  1.873 μs (0 allocations: 0 bytes)

This allows using approximate inverses in

It should also fix the type instabilities caused by passing in types, and it should theoretically lower the function-call overhead; a few benchmarks seem to suggest that helps threading in cases where

using LoopVectorization, Octavian, LinearAlgebra, StaticArrays, BenchmarkTools
@inline function AmulB!(C,A,B)
@avx for n in indices((B,C),2), m in indices((A,C),1)
Cmn = zero(eltype(C))
for k in indices((A,B),(2,1))
Cmn += A[m,k] * B[k,n]
end
C[m,n] = Cmn
end
C
end
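For comparison, here is a plain-Julia version of the same kernel without `@avx` (a sketch; `axes` stands in for LoopVectorization's `indices`, assuming the array shapes already agree):

```julia
# Naive triple-loop matmul kernel; hypothetical name for illustration only.
function AmulB_plain!(C, A, B)
    @inbounds for n in axes(B, 2), m in axes(A, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)          # shared inner dimension of A and B
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    C
end
```

The `@avx` version above differs only in the loop macro and in using `indices` to assert that the paired axes of the arrays match.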
M = K = N = 8; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
@time(AmulB!(C0,A,B)); C0 ≈ C1 # compile time (a bunch of things were compiled already when I ran this)
@benchmark AmulB!($C0,$A,$B) # dynamic
@benchmark matmul_serial!($C0,$A,$B) # dynamically sized Octavian, serial
@benchmark matmul!($C0,$A,$B) # dynamically sized Octavian, multithreaded
Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B);
Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);
@benchmark mul!($Cmutable, $Amutable, $Bmutable)
@benchmark $(Ref(Astatic))[] * $(Ref(Bstatic))[]
@benchmark AmulB!($Cmutable, $Amutable, $Bmutable)
@benchmark matmul!($Cmutable, $Amutable, $Bmutable)

Results:

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1 # I redefined `AmulB!`, but that particular `_avx_!` was already compiled
0.009071 seconds (37.50 k allocations: 1.917 MiB, 99.85% compilation time)
true
julia> @benchmark AmulB!($C0,$A,$B) # dynamic
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 23.575 ns (0.00% GC)
median time: 24.313 ns (0.00% GC)
mean time: 24.355 ns (0.00% GC)
maximum time: 59.959 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 996
julia> @benchmark matmul_serial!($C0,$A,$B) # dynamically sized Octavian, serial
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 27.155 ns (0.00% GC)
median time: 27.295 ns (0.00% GC)
mean time: 27.356 ns (0.00% GC)
maximum time: 76.102 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 995
julia> @benchmark matmul!($C0,$A,$B) # dynamically sized Octavian, multithreaded
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 27.087 ns (0.00% GC)
median time: 27.390 ns (0.00% GC)
mean time: 27.420 ns (0.00% GC)
maximum time: 65.763 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 995
julia> Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B);
julia> Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);
julia> @benchmark mul!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 62.418 ns (0.00% GC)
median time: 62.452 ns (0.00% GC)
mean time: 62.533 ns (0.00% GC)
maximum time: 93.819 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 983
julia> @benchmark $(Ref(Astatic))[] * $(Ref(Bstatic))[]
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 37.092 ns (0.00% GC)
median time: 37.280 ns (0.00% GC)
mean time: 37.296 ns (0.00% GC)
maximum time: 68.081 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 992
julia> @benchmark AmulB!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.097 ns (0.00% GC)
median time: 12.146 ns (0.00% GC)
mean time: 12.168 ns (0.00% GC)
maximum time: 38.972 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark matmul!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.097 ns (0.00% GC)
median time: 12.148 ns (0.00% GC)
mean time: 12.168 ns (0.00% GC)
maximum time: 38.316 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999

So the dynamically sized version already yields a good improvement over
On current master, they are passed on the stack. This PR passes them in registers instead. Of course, this only matters when _avx_! isn't inlined.
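One way to check how arguments actually arrive (a sketch; the exact IR is platform- and Julia-version-dependent, and the toy function here is purely illustrative) is to inspect the generated LLVM IR with `code_llvm` from InteractiveUtils:

```julia
using InteractiveUtils

# A small isbits argument like this tuple can be passed in registers;
# the printed IR shows whether it arrives by value or via a stack pointer.
f(x::NTuple{4,Float64}) = x[1] + x[2] + x[3] + x[4]

code_llvm(f, Tuple{NTuple{4,Float64}})
```

Larger or non-isbits arguments are more likely to be spilled to the stack, which is where the difference this PR targets shows up for non-inlined calls.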