Not seeing the performance gains of SIMD #20
Comments
There are several reasons why you don't see any speedup here, and each is worth checking. As a general remark, simply adding two arrays of numbers is these days always limited by memory bandwidth, so SIMD vectorization is not very important for such a loop.
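One quick way to test that remark (a hedged sketch, not from the thread; the names copy_only! and add2! are ours) is to compare a pure copy loop against the add loop on arrays far larger than the cache. If the add is no slower per byte moved than the copy, the kernel is memory-bound and wider SIMD cannot help:

using BenchmarkTools

function copy_only!(y, x)
    @inbounds for i in eachindex(x)
        y[i] = x[i]          # pure memory traffic, no arithmetic
    end
end

function add2!(y, x1, x2)
    @inbounds for i in eachindex(x1)
        y[i] = x1[i] + x2[i] # one flop per 24 bytes moved
    end
end

x1 = rand(Float64, 10_000_000); x2 = rand(Float64, 10_000_000); y = similar(x1)
@btime copy_only!($y, $x1)       # moves 2 x 8 bytes per element
@btime add2!($y, $x1, $x2)       # moves 3 x 8 bytes per element, plus one add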
Thanks! Following your advice, I did the following tests:

using SIMD
using BenchmarkTools
x1 = rand(Float64, 64)
x2 = rand(Float64, 64)
y = similar(x1)
function add!(y, x1,x2)
@inbounds for i=1:length(x1)
y[i] = x1[i] + x2[i]
end
end
function vadd!{N,T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}=Vec{8,T})
@inbounds for i in 1:N:length(xs)
xv = vload(Vec{N,T}, xs, i)
yv = vload(Vec{N,T}, ys, i)
xv += yv
vstore(xv, y, i)
end
end
function euclid!(y, x1,x2)
@inbounds for i=1:length(x1)
y[i] = sqrt(x1[i] * x1[i] + x2[i] * x2[i])
end
end
function veuclid!{N,T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}=Vec{8,T})
@inbounds for i in 1:N:length(xs)
xv = vload(Vec{N,T}, xs, i)
yv = vload(Vec{N,T}, ys, i)
xv = sqrt(xv*xv + yv*yv)
vstore(xv, y, i)
end
end

and got these results:

julia> @benchmark euclid!(y,x1,x2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 271.115 ns (0.00% GC)
median time: 279.992 ns (0.00% GC)
mean time: 289.619 ns (0.00% GC)
maximum time: 662.458 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 530
julia> @benchmark veuclid!(y,x1,x2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 187.463 ns (0.00% GC)
median time: 187.996 ns (0.00% GC)
mean time: 188.370 ns (0.00% GC)
maximum time: 463.867 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 803
julia> @benchmark add!(y,x1,x2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 30.363 ns (0.00% GC)
median time: 30.791 ns (0.00% GC)
mean time: 31.074 ns (0.00% GC)
maximum time: 103.919 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark vadd!(y,x1,x2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 54.311 ns (0.00% GC)
median time: 55.167 ns (0.00% GC)
mean time: 55.503 ns (0.00% GC)
maximum time: 125.729 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000

where I can now see that increasing the arithmetic intensity results in better SIMD performance.

julia> @code_native vadd!(y,x1,x2)
.text
Filename: runtests.jl
pushq %rbp
movq %rsp, %rbp
Source line: 49
subq $32, %rsp
movabsq $"vadd!", %rax
movl $2226597040, %r9d # imm = 0x84B72CB0
callq *%rax
addq $32, %rsp
popq %rbp
retq
What you are seeing here in the code_native output is only a small stub that calls the actual vadd! function (note the callq *%rax), so it tells you nothing about the loop body. Try hardcoding N instead of passing the vector type as a default argument, so that the specialized method is what gets compiled and shown.

There is another reason why your loop might be slow. If your architecture's preferred vector size for Float64 is 4, as it seems to be (with AVX, a 256-bit ymm register holds four Float64 values), then every Vec{8,Float64} operation has to be split into two hardware instructions.
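As an illustration of the first point, asking for the fully specified method (here with N = 4, matching the 256-bit register width) should make code_native show the actual loop body rather than the wrapper:

julia> @code_native vadd!(y, x1, x2, Vec{4,Float64})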
Ah, you're right. I removed the argument and hardcoded N, and the code_native output now shows something more sensible:

...
Source line: 1254
vmovupd 32(%rdi,%rbx), %ymm0
vmovupd (%rdi,%rbx), %ymm1
Source line: 745
movq (%r15), %rbx
Source line: 573
vaddpd (%rdi,%rbx), %ymm1, %ymm1
vaddpd 32(%rdi,%rbx), %ymm0, %ymm0
Source line: 745
movq (%rsi), %rbx
Source line: 1344
vmovupd %ymm0, 32(%rdi,%rbx)
vmovupd %ymm1, (%rdi,%rbx)
...

I just tried it with N=4, but that slowed it down some more.
The code above looks very reasonable.
I'm still seeing the SIMD version as significantly slower. Can you take a look at the complete disassemblies to see if you can spot any atrocities? Thanks!

julia> @code_native add!(y,x1,x2)
.text
Filename: runtests.jl
pushq %rbp
movq %rsp, %rbp
Source line: 46
pushq %r15
pushq %r14
pushq %rsi
pushq %rdi
pushq %rbx
movq 8(%rdx), %rax
testq %rax, %rax
jle L254
Source line: 47
movq (%rdx), %r9
movq (%r8), %r10
movq (%rcx), %rcx
movl $1, %esi
testq %rax, %rax
Source line: 46
je L197
leaq 1(%rax), %r8
movq %rax, %rdx
andq $-8, %rdx
leaq 1(%rdx), %r11
movl $1, %esi
cmpq $1, %r11
je L192
leaq -8(%rcx,%rax,8), %rsi
leaq -8(%r9,%rax,8), %rdi
leaq -8(%r10,%rax,8), %rbx
cmpq %rdi, %rcx
setbe %r14b
cmpq %rsi, %r9
setbe %r15b
cmpq %rbx, %rcx
setbe %bl
cmpq %rsi, %r10
setbe %dil
movl $1, %esi
testb %r15b, %r14b
jne L192
andb %dil, %bl
jne L192
leaq 32(%r9), %rsi
leaq 32(%r10), %rdi
leaq 32(%rcx), %rbx
nop
Source line: 47
L144:
vmovupd -32(%rsi), %ymm0
vmovupd (%rsi), %ymm1
vaddpd -32(%rdi), %ymm0, %ymm0
vaddpd (%rdi), %ymm1, %ymm1
vmovupd %ymm0, -32(%rbx)
vmovupd %ymm1, (%rbx)
Source line: 46
addq $64, %rsi
addq $64, %rdi
addq $64, %rbx
addq $-8, %rdx
jne L144
movq %r11, %rsi
L192:
cmpq %rsi, %r8
je L254
L197:
addq $1, %rax
subq %rsi, %rax
leaq -8(%rcx,%rsi,8), %rcx
leaq -8(%r10,%rsi,8), %rdx
leaq -8(%r9,%rsi,8), %rbx
nopl (%rax,%rax)
Source line: 47
L224:
vmovsd ZN4llvm16DAGTypeLegalizer17SplitVecOp_VSETCCEPNS_6SDNodeE(%rbx), %xmm0 # xmm0 = mem[0],zero
vaddsd (%rdx), %xmm0, %xmm0
vmovsd %xmm0, (%rcx)
Source line: 46
addq $8, %rcx
addq $8, %rdx
addq $8, %rbx
addq $-1, %rax
jne L224
Source line: 47
L254:
popq %rbx
popq %rdi
popq %rsi
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
nopw (%rax,%rax)

julia> @code_native vadd!(y,x1,x2)
.text
Filename: runtests.jl
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rsi
pushq %rdi
pushq %rbx
subq $64, %rsp
movq %r8, %r15
movq %rdx, %r12
movq %rcx, %rsi
movabsq $jl_get_ptls_states, %rax
callq *%rax
movq %rax, %r14
movq $0, -56(%rbp)
movq $2, -72(%rbp)
movq (%r14), %rax
movq %rax, -64(%rbp)
leaq -72(%rbp), %rax
movq %rax, (%r14)
Source line: 51
movq 8(%r12), %r8
Source line: 18
movabsq $steprange_last, %rax
movl $1, %ecx
movl $8, %edx
callq *%rax
Source line: 51
testq %rax, %rax
setle %cl
setg %r8b
cmpq $-7, %rax
setne %dl
andb %r8b, %dl
movzbl %dl, %edx
cmpl $1, %edx
jne L229
movl $1, %edx
xorl %edi, %edi
nopw %cs:(%rax,%rax)
Source line: 1355
L144:
testq %rdx, %rdx
jle L254
movq 8(%rsi), %rbx
addq $-7, %rbx
cmpq %rbx, %rdx
jg L254
Source line: 745
movq (%r12), %rbx
Source line: 1254
vmovupd 32(%rdi,%rbx), %ymm0
vmovupd (%rdi,%rbx), %ymm1
Source line: 745
movq (%r15), %rbx
Source line: 573
vaddpd (%rdi,%rbx), %ymm1, %ymm1
vaddpd 32(%rdi,%rbx), %ymm0, %ymm0
Source line: 745
movq (%rsi), %rbx
Source line: 1344
vmovupd %ymm0, 32(%rdi,%rbx)
vmovupd %ymm1, (%rdi,%rbx)
Source line: 51
cmpq %rdx, %rax
leaq 8(%rdx), %rdx
sete %bl
orb %cl, %bl
movzbl %bl, %ebx
addq $64, %rdi
cmpl $1, %ebx
jne L144
Source line: 55
L229:
movq -64(%rbp), %rax
movq %rax, (%r14)
addq $64, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %r12
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
Source line: 1355
L254:
movabsq $jl_gc_pool_alloc, %rax
movl $1512, %edx # imm = 0x5E8
movl $32, %r8d
movq %r14, %rcx
vzeroupper
callq *%rax
movl $2149694384, %ecx # imm = 0x8021BBB0
movq %rcx, -8(%rax)
movq %rax, -56(%rbp)
vxorps %xmm0, %xmm0, %xmm0
vmovups %xmm0, (%rax)
movabsq $jl_throw, %rdx
movq %rax, %rcx
callq *%rdx
ud2
nopw %cs:(%rax,%rax)

Looks like maybe SIMD is already applied in add! by the auto-vectorizer?
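One way to confirm that (a sketch using standard Julia reflection, not from the thread): inspect the LLVM IR for the scalar loop; vector types such as <4 x double> in the output mean LLVM's auto-vectorizer already kicked in, which matches the vaddpd/ymm instructions in the add! disassembly above.

julia> @code_llvm add!(y, x1, x2)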
You could try to work around this by having the loop step through the indices with explicit arithmetic instead of constructing a step range; the call to steprange_last in the vadd! disassembly above suggests the range setup is not being optimized away. If that is indeed the case, then I would submit a bug report to Julia itself, in particular for steprange_last.
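A minimal sketch of that workaround, assuming a manual index step is what was meant (the name vadd2! and the loop structure are ours; vload/vstore are used exactly as in the code above, with using SIMD in effect):

function vadd2!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N,T}
    i = 1
    @inbounds while i <= length(xs) - (N - 1)  # plain while loop: no StepRange object is built
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        vstore(xv + yv, y, i)
        i += N
    end
end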
Could SVML be used with SIMD.jl?
Yes, it could, although a brief look at Intel's SVML suggests that would take some work. There are a few open source alternatives, including one by the esteemed Agner Fog: http://www.agner.org/optimize/#vectorclass. One problem is that these libraries are usually distributed as C++ header-only libraries. Using them from Julia as an external C library requires some boilerplate code to be written, and it reduces performance because it introduces function call overhead, since the function cannot be inlined. What I am looking for is an LLVM-based version of such a library that could then be inlined by LLVM into code written in any language.
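For concreteness, the boilerplate looks roughly like this (a hedged sketch: libvecmath and vec_exp are hypothetical names standing in for whatever the C/C++ library actually exports):

# C shim, compiled into libvecmath.so (hypothetical):
#   void vec_exp(double* out, const double* in, size_t n);
function vec_exp!(out::Vector{Float64}, x::Vector{Float64})
    ccall((:vec_exp, "libvecmath"), Cvoid,
          (Ptr{Float64}, Ptr{Float64}, Csize_t),
          out, x, length(x))
    return out  # every call crosses the C boundary, so it cannot be inlined
end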
@eschnett, I think Numba utilizes SVML without needing the Intel compiler. Have a look at numba/numba#2264.
@RalphAS, did you verify their performance numbers? I must say I tried it on Windows, not using their test suite but just a simple loop.
A few things. Firstly, regarding the 1:N:length(xs) iteration: creating stepranges in Julia is slow (JuliaLang/julia#26770), so this will be a bottleneck. Secondly, loop unrolling. To be competitive with Base you probably need to write something like:

function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N, T}
@assert length(ys) == length(xs) == length(y)
@assert length(xs) % (N*4) == 0  # note: the i+4, i+8, i+12 offsets below assume N == 4
i = 1
@inbounds while i <= length(y)
xv1 = vload(Vec{N,T}, xs, i)
xv2 = vload(Vec{N,T}, xs, i+4)
xv3 = vload(Vec{N,T}, xs, i+8)
xv4 = vload(Vec{N,T}, xs, i+12)
yv1 = vload(Vec{N,T}, ys, i)
yv2 = vload(Vec{N,T}, ys, i+4)
yv3 = vload(Vec{N,T}, ys, i+8)
yv4 = vload(Vec{N,T}, ys, i+12)
xv1 += yv1
xv2 += yv2
xv3 += yv3
xv4 += yv4
vstore(xv1, y, i)
vstore(xv2, y, i+4)
vstore(xv3, y, i+8)
vstore(xv4, y, i+12)
i += 16
end
end Thirdly, it seems that the pointer lookup for the arrays are not hoisted out of the loop so doing that manually and the whole thing looks like: function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N, T}
# @assert length(ys) == length(xs) == length(y)
# @assert length(xs) % (N*4) == 0
ptr_offset = 0
i = 0
GC.@preserve y xs ys begin
pxs = pointer(xs)
pys = pointer(ys)
py = pointer(y)
while i < length(y)  # i starts at 0 here, so < (not <=) avoids reading past the end
pxs_offset = pxs + ptr_offset
pys_offset = pys + ptr_offset
py_offset = py + ptr_offset
xv1 = vload(Vec{N,T}, pxs_offset)
xv2 = vload(Vec{N,T}, pxs_offset + 4*sizeof(T))
xv3 = vload(Vec{N,T}, pxs_offset + 8*sizeof(T))
xv4 = vload(Vec{N,T}, pxs_offset + 12*sizeof(T))
yv1 = vload(Vec{N,T}, pys_offset)
yv2 = vload(Vec{N,T}, pys_offset + 4*sizeof(T))
yv3 = vload(Vec{N,T}, pys_offset + 8*sizeof(T))
yv4 = vload(Vec{N,T}, pys_offset + 12*sizeof(T))
xv1 += yv1
xv2 += yv2
xv3 += yv3
xv4 += yv4
vstore(xv1, py_offset)
vstore(xv2, py_offset + 4*sizeof(T))
vstore(xv3, py_offset + 8*sizeof(T))
vstore(xv4, py_offset + 12*sizeof(T))
ptr_offset += 4*4*sizeof(T)
i += 16
end
end
end

Now we can compare the body of the loop for the scalar Base case with the SIMD-written one. They are essentially the same, except that the Base case has an extra jump instruction because it needs to handle the case where the array length is not a multiple of unroll_factor * simd_width.
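A quick usage check for the pointer-based version above (a hedged sketch; it assumes N = 4 to match the hardcoded offsets, and an array length that is a multiple of 16):

using SIMD, BenchmarkTools
x1 = rand(Float64, 64); x2 = rand(Float64, 64); y = similar(x1)
vadd!(y, x1, x2, Vec{4,Float64})
@assert y ≈ x1 .+ x2   # sanity check against the scalar result
@btime vadd!($y, $x1, $x2, Vec{4,Float64})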
There isn't really any problem with this package, though. It is just that Base Julia (and LLVM) is really good at optimizations, so beating it on easy code like this will be very hard. The fact that the pointer lookup doesn't get hoisted is quite annoying, though.
Hello, I'm not sure if I'm using SIMD.jl correctly; I don't seem to be seeing any performance gains. Here is my test code:
and this is the result:
I rebuilt the system image using the build_sysimg.jl script, and also started julia with -O3.
Also, if I do

@code_native vadd!(y2,x1,x2,Vec{8,Float64})

I get:

Why don't I see any speedups? Any help would be appreciated! Thanks!
Ritchie