Not seeing the performance gains of SIMD #20

Closed
rcnlee opened this issue Apr 28, 2017 · 14 comments · Fixed by #63

rcnlee commented Apr 28, 2017

Hello, I'm not sure whether I'm using SIMD.jl correctly; I don't seem to be seeing any performance gains from it. Here is my test code:

using SIMD
using BenchmarkTools

function vadd!{N,T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}})
    @assert length(ys) == length(xs) == length(y)
    @assert length(xs) % N == 0
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv += yv
        vstore(xv, y, i)
    end
end
function add!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = x1[i] + x2[i]
    end
end
x1 = rand(Float64, 64)
x2 = rand(Float64, 64)
y1 = similar(x1)
y2 = similar(x1)

and these are the results:

julia> @benchmark vadd!(y1, x1,x2,Vec{8,Float64})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     71.707 ns (0.00% GC)
  median time:      72.572 ns (0.00% GC)
  mean time:        72.879 ns (0.00% GC)
  maximum time:     148.167 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark add!(y2, x1,x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.935 ns (0.00% GC)
  median time:      30.363 ns (0.00% GC)
  mean time:        30.384 ns (0.00% GC)
  maximum time:     83.820 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

I rebuilt the system image using the build_sysimg.jl script, and also started julia with -O3.
Also, if I do @code_native vadd!(y2,x1,x2,Vec{8,Float64}) I get:

...
Source line: 573
        vaddpd  (%rdi,%rbx), %ymm1, %ymm1
        vaddpd  32(%rdi,%rbx), %ymm0, %ymm0
...

Why don't I see any speedups? Any help would be appreciated! Thanks!

Ritchie

@eschnett (Owner)

There are several reasons why you don't see any speedup. The three most relevant here are:

  1. Your benchmark test case is too small, and timing is influenced by overhead, e.g. due to the additional @assert statements
  2. Your test case is limited by something else, such as L1 cache bandwidth, rather than by CPU floating-point performance (this is actually likely here)
  3. The compiler's optimizer is smart enough to optimize the second loop, and in doing so, applies other optimizations as well that it doesn't apply to the manually vectorized loop

To check for these:

  1. Remove the additional statements
  2. Choose a different test case that has many more floating point operations per memory access, or use a tool such as perf (https://perf.wiki.kernel.org/index.php/Main_Page) to analyze the code
  3. Look at the disassembled output of both routines

As a general remark: these days, simply adding two arrays of numbers will essentially always be limited by memory bandwidth, so SIMD vectorization is not very important there.
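As an illustration of point 2, here is a minimal sketch of a kernel with many more floating point operations per memory access (the poly! function and its coefficients are hypothetical, not from this issue):

function poly!(y, x)
    @inbounds for i in 1:length(x)
        t = x[i]
        # Horner evaluation: 4 multiplies + 4 adds per element loaded,
        # so the loop is far less likely to be memory-bound
        y[i] = 1 + t*(2 + t*(3 + t*(4 + t*5)))
    end
end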


rcnlee commented Apr 28, 2017

Thanks! Following your advice, I did the following tests:

using SIMD
using BenchmarkTools

x1 = rand(Float64, 64)
x2 = rand(Float64, 64)
y = similar(x1)

function add!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = x1[i] + x2[i] 
    end
end
function vadd!{N,T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}=Vec{8,T})
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv += yv 
        vstore(xv, y, i)
    end
end
function euclid!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = sqrt(x1[i] * x1[i] + x2[i] * x2[i])
    end
end
function veuclid!{N,T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}=Vec{8,T})
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv = sqrt(xv*xv + yv*yv)
        vstore(xv, y, i)
    end
end

and got these results:

julia> @benchmark euclid!(y,x1,x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     271.115 ns (0.00% GC)
  median time:      279.992 ns (0.00% GC)
  mean time:        289.619 ns (0.00% GC)
  maximum time:     662.458 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     530


julia> @benchmark veuclid!(y,x1,x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     187.463 ns (0.00% GC)
  median time:      187.996 ns (0.00% GC)
  mean time:        188.370 ns (0.00% GC)
  maximum time:     463.867 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     803

julia> @benchmark add!(y,x1,x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.363 ns (0.00% GC)
  median time:      30.791 ns (0.00% GC)
  mean time:        31.074 ns (0.00% GC)
  maximum time:     103.919 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark vadd!(y,x1,x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     54.311 ns (0.00% GC)
  median time:      55.167 ns (0.00% GC)
  mean time:        55.503 ns (0.00% GC)
  maximum time:     125.729 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

where I can now see that increasing the arithmetic intensity results in better SIMD performance.
I'm guessing there must be some added overhead in using SIMD, maybe in the loads, because even after removing the checks, the SIMD vadd! is still much slower than add!. The @code_native of vadd! is much simpler than that of add!; suspiciously, though, I don't see any vaddpd instructions.

julia> @code_native vadd!(y,x1,x2)
        .text
Filename: runtests.jl
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 49
        subq    $32, %rsp
        movabsq $"vadd!", %rax
        movl    $2226597040, %r9d       # imm = 0x84B72CB0
        callq   *%rax
        addq    $32, %rsp
        popq    %rbp
        retq

@eschnett (Owner)

What you are seeing here in vadd! is part of the dynamic dispatch on the type of the arguments. Maybe this is due to the default argument for the type, which requires constructing or referencing a type object? What happens if you omit this argument and/or hard-code N=8?

There is another reason why your loop might be slow. If your architecture's preferred vector size is 4, as it seems to be (%ymm), then sometimes LLVM does a bad job handling vectors of length 8. This would then be the fault of LLVM's optimizer, and the work-around would be to use a vector length of 4.
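For example, a minimal sketch of both suggestions combined, with the width hard-coded instead of passed as a type argument (vadd4! is a hypothetical name; one could try 8 as well as 4):

function vadd4!{T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T})
    # N is hard-coded, so there is no dispatch on a Vec type argument
    @inbounds for i in 1:4:length(xs)
        xv = vload(Vec{4,T}, xs, i)
        yv = vload(Vec{4,T}, ys, i)
        vstore(xv + yv, y, i)
    end
end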


rcnlee commented Apr 28, 2017

Ah, you're right. I removed the argument, hard-coded N, and the @code_native output now shows something more sensible:

...
Source line: 1254
        vmovupd 32(%rdi,%rbx), %ymm0
        vmovupd (%rdi,%rbx), %ymm1
Source line: 745
        movq    (%r15), %rbx
Source line: 573
        vaddpd  (%rdi,%rbx), %ymm1, %ymm1
        vaddpd  32(%rdi,%rbx), %ymm0, %ymm0
Source line: 745
        movq    (%rsi), %rbx
Source line: 1344
        vmovupd %ymm0, 32(%rdi,%rbx)
        vmovupd %ymm1, (%rdi,%rbx)
...

I also tried it with N=4, but that slowed it down some more.

@eschnett (Owner)

The code above looks very reasonable. If the ... parts don't hide an atrocity, then this should perform well. Without comparing it to the disassembly of the non-manually-vectorized code, I can't tell more.


rcnlee commented Apr 30, 2017

I'm still seeing the SIMD version be significantly slower. Can you take a look at the complete disassemblies and see whether you can spot any atrocities?

Thanks!

julia> @code_native add!(y,x1,x2)
        .text
Filename: runtests.jl
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 46
        pushq   %r15
        pushq   %r14
        pushq   %rsi
        pushq   %rdi
        pushq   %rbx
        movq    8(%rdx), %rax
        testq   %rax, %rax
        jle     L254
Source line: 47
        movq    (%rdx), %r9
        movq    (%r8), %r10
        movq    (%rcx), %rcx
        movl    $1, %esi
        testq   %rax, %rax
Source line: 46
        je      L197
        leaq    1(%rax), %r8
        movq    %rax, %rdx
        andq    $-8, %rdx
        leaq    1(%rdx), %r11
        movl    $1, %esi
        cmpq    $1, %r11
        je      L192
        leaq    -8(%rcx,%rax,8), %rsi
        leaq    -8(%r9,%rax,8), %rdi
        leaq    -8(%r10,%rax,8), %rbx
        cmpq    %rdi, %rcx
        setbe   %r14b
        cmpq    %rsi, %r9
        setbe   %r15b
        cmpq    %rbx, %rcx
        setbe   %bl
        cmpq    %rsi, %r10
        setbe   %dil
        movl    $1, %esi
        testb   %r15b, %r14b
        jne     L192
        andb    %dil, %bl
        jne     L192
        leaq    32(%r9), %rsi
        leaq    32(%r10), %rdi
        leaq    32(%rcx), %rbx
        nop
Source line: 47
L144:
        vmovupd -32(%rsi), %ymm0
        vmovupd (%rsi), %ymm1
        vaddpd  -32(%rdi), %ymm0, %ymm0
        vaddpd  (%rdi), %ymm1, %ymm1
        vmovupd %ymm0, -32(%rbx)
        vmovupd %ymm1, (%rbx)
Source line: 46
        addq    $64, %rsi
        addq    $64, %rdi
        addq    $64, %rbx
        addq    $-8, %rdx
        jne     L144
        movq    %r11, %rsi
L192:
        cmpq    %rsi, %r8
        je      L254
L197:
        addq    $1, %rax
        subq    %rsi, %rax
        leaq    -8(%rcx,%rsi,8), %rcx
        leaq    -8(%r10,%rsi,8), %rdx
        leaq    -8(%r9,%rsi,8), %rbx
        nopl    (%rax,%rax)
Source line: 47
L224:
        vmovsd  ZN4llvm16DAGTypeLegalizer17SplitVecOp_VSETCCEPNS_6SDNodeE(%rbx), %xmm0 # xmm0 = mem[0],zero
        vaddsd  (%rdx), %xmm0, %xmm0
        vmovsd  %xmm0, (%rcx)
Source line: 46
        addq    $8, %rcx
        addq    $8, %rdx
        addq    $8, %rbx
        addq    $-1, %rax
        jne     L224
Source line: 47
L254:
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
        nopw    (%rax,%rax)
julia> @code_native vadd!(y,x1,x2)
        .text
Filename: runtests.jl
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rsi
        pushq   %rdi
        pushq   %rbx
        subq    $64, %rsp
        movq    %r8, %r15
        movq    %rdx, %r12
        movq    %rcx, %rsi
        movabsq $jl_get_ptls_states, %rax
        callq   *%rax
        movq    %rax, %r14
        movq    $0, -56(%rbp)
        movq    $2, -72(%rbp)
        movq    (%r14), %rax
        movq    %rax, -64(%rbp)
        leaq    -72(%rbp), %rax
        movq    %rax, (%r14)
Source line: 51
        movq    8(%r12), %r8
Source line: 18
        movabsq $steprange_last, %rax
        movl    $1, %ecx
        movl    $8, %edx
        callq   *%rax
Source line: 51
        testq   %rax, %rax
        setle   %cl
        setg    %r8b
        cmpq    $-7, %rax
        setne   %dl
        andb    %r8b, %dl
        movzbl  %dl, %edx
        cmpl    $1, %edx
        jne     L229
        movl    $1, %edx
        xorl    %edi, %edi
        nopw    %cs:(%rax,%rax)
Source line: 1355
L144:
        testq   %rdx, %rdx
        jle     L254
        movq    8(%rsi), %rbx
        addq    $-7, %rbx
        cmpq    %rbx, %rdx
        jg      L254
Source line: 745
        movq    (%r12), %rbx
Source line: 1254
        vmovupd 32(%rdi,%rbx), %ymm0
        vmovupd (%rdi,%rbx), %ymm1
Source line: 745
        movq    (%r15), %rbx
Source line: 573
        vaddpd  (%rdi,%rbx), %ymm1, %ymm1
        vaddpd  32(%rdi,%rbx), %ymm0, %ymm0
Source line: 745
        movq    (%rsi), %rbx
Source line: 1344
        vmovupd %ymm0, 32(%rdi,%rbx)
        vmovupd %ymm1, (%rdi,%rbx)
Source line: 51
        cmpq    %rdx, %rax
        leaq    8(%rdx), %rdx
        sete    %bl
        orb     %cl, %bl
        movzbl  %bl, %ebx
        addq    $64, %rdi
        cmpl    $1, %ebx
        jne     L144
Source line: 55
L229:
        movq    -64(%rbp), %rax
        movq    %rax, (%r14)
        addq    $64, %rsp
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %r12
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
Source line: 1355
L254:
        movabsq $jl_gc_pool_alloc, %rax
        movl    $1512, %edx             # imm = 0x5E8
        movl    $32, %r8d
        movq    %r14, %rcx
        vzeroupper
        callq   *%rax
        movl    $2149694384, %ecx       # imm = 0x8021BBB0
        movq    %rcx, -8(%rax)
        movq    %rax, -56(%rbp)
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rax)
        movabsq $jl_throw, %rdx
        movq    %rax, %rcx
        callq   *%rdx
        ud2
        nopw    %cs:(%rax,%rax)

It looks like SIMD is already being applied in add!.


eschnett commented May 1, 2017

  1. Yes, it seems that LLVM's optimizer is these days clever enough to both vectorize and unroll the loop, so that you should expect identical performance.
  2. Something is off in Julia's looping construct with a step size of N, and it leads to bad code. This also seems to infect the array accesses later in the loop, which are worse as well.

You could try to work around this by having the loop go from 0 to length(xs)/N - 1 and then multiplying by N and adding 1 inside the loop; see the sketch below. I hope that this leads to simpler code. If LLVM's optimizer latches on, it will do an amazing job with loop indices and pointer arithmetic.
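A minimal sketch of that workaround, assuming length(xs) is a multiple of N (vadd_blocks! is a hypothetical name):

function vadd_blocks!{N,T}(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}})
    # unit-step loop over block indices; the strided element index is
    # reconstructed inside the loop instead of using 1:N:length(xs)
    @inbounds for j in 0:div(length(xs), N) - 1
        i = j*N + 1
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        vstore(xv + yv, y, i)
    end
end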

If that is indeed the case, then I would submit a bug report to Julia itself, in particular for the 1:N:length(xs) construct that apparently isn't handled well in this case.

In more detail: in the vadd! routine, the inner loop starts at L144 and goes until the statement jne L144. You can see that there are two additional branch statements at the beginning of the loop (jle and jg) that shouldn't be there. At the end, there are sete, orb, and cmpl instructions that are also more complicated than necessary; they calculate a "logical or" that should have been optimized away.

@RoyiAvital

Could SVML be used easily with SIMD.jl for the math operations?

@eschnett (Owner)

Yes, it could. A brief look at Intel's SVML library tells me that one needs their compiler, and thus a corresponding license, to do so. I thus assume that only a minority of Julia users will have access to this library.

There are a few open-source alternatives, including one by the esteemed Agner Fog: http://www.agner.org/optimize/#vectorclass.

One problem is that these libraries are usually distributed as C++ header-only libraries. Using them from Julia as an external C library requires writing some boilerplate code, and it reduces performance because the functions cannot be inlined, so every call carries overhead.

What I am looking for is an LLVM-based version of such a library that can then be inlined by LLVM into code written in any language.

@RoyiAvital

@eschnett, I think Numba utilizes SVML without needing the Intel compiler.

Have a look at - numba/numba#2264.


RalphAS commented Sep 22, 2018

@eschnett SLEEF now produces LLVM bitcode (when compiled with Clang), and has a permissive license.


RoyiAvital commented Sep 22, 2018

@RalphAS, did you verify their performance numbers? When I compare SLEEF to SVML (Intel Compiler 18 / Intel Compiler 19), I get results different from theirs.

I should say that I test on Windows, and not with their test suite, just a simple loop of sin / cos / exp / etc. over a large array (~60e6 elements). When I time it, SVML is much faster. Since Intel is now much more cooperative about letting open-source projects use its libraries, why not have both?

KristofferC (Collaborator) commented Feb 23, 2020

A few things. Firstly, regarding

1:N:length(xs)

creating StepRanges in Julia is slow (JuliaLang/julia#26770), so this will be a bottleneck.
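A quick sketch to see the range-construction cost in isolation (hypothetical micro-benchmark; exact numbers vary by machine and Julia version):

using BenchmarkTools
n = 64
@btime 1:8:$n   # StepRange construction calls steprange_last
@btime 1:$n     # UnitRange construction is essentially free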

Secondly, loop unrolling. To be competitive with Base you probably need to write something like:

function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N, T}
    @assert length(ys) == length(xs) == length(y)
    @assert length(xs) % (N*4) == 0
    i = 1
    @inbounds while i <= length(y)
        xv1 = vload(Vec{N,T}, xs, i)
        xv2 = vload(Vec{N,T}, xs, i + N)
        xv3 = vload(Vec{N,T}, xs, i + 2N)
        xv4 = vload(Vec{N,T}, xs, i + 3N)

        yv1 = vload(Vec{N,T}, ys, i)
        yv2 = vload(Vec{N,T}, ys, i + N)
        yv3 = vload(Vec{N,T}, ys, i + 2N)
        yv4 = vload(Vec{N,T}, ys, i + 3N)

        xv1 += yv1
        xv2 += yv2
        xv3 += yv3
        xv4 += yv4

        vstore(xv1, y, i)
        vstore(xv2, y, i + N)
        vstore(xv3, y, i + 2N)
        vstore(xv4, y, i + 3N)

        i += 4N
    end
end

Thirdly, it seems that the pointer lookups for the arrays are not hoisted out of the loop; doing that manually, the whole thing looks like:

function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N, T}
    #  @assert length(ys) == length(xs) == length(y)
    #  @assert length(xs) % (N*4) == 0
    ptr_offset = 0
    i = 0
    GC.@preserve y xs ys begin
        pxs = pointer(xs)
        pys = pointer(ys)
        py  = pointer(y)
        while i < length(y)
            pxs_offset = pxs + ptr_offset
            pys_offset = pys + ptr_offset
            py_offset  = py  + ptr_offset

            xv1 = vload(Vec{N,T}, pxs_offset)
            xv2 = vload(Vec{N,T}, pxs_offset + N*sizeof(T))
            xv3 = vload(Vec{N,T}, pxs_offset + 2N*sizeof(T))
            xv4 = vload(Vec{N,T}, pxs_offset + 3N*sizeof(T))

            yv1 = vload(Vec{N,T}, pys_offset)
            yv2 = vload(Vec{N,T}, pys_offset + N*sizeof(T))
            yv3 = vload(Vec{N,T}, pys_offset + 2N*sizeof(T))
            yv4 = vload(Vec{N,T}, pys_offset + 3N*sizeof(T))

            xv1 += yv1
            xv2 += yv2
            xv3 += yv3
            xv4 += yv4

            vstore(xv1, py_offset)
            vstore(xv2, py_offset + N*sizeof(T))
            vstore(xv3, py_offset + 2N*sizeof(T))
            vstore(xv4, py_offset + 3N*sizeof(T))

            ptr_offset += 4N*sizeof(T)
            i += 4N
        end
    end
end
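Hypothetical usage, assuming the array lengths are a multiple of 4*N:

x1 = rand(Float64, 64); x2 = rand(Float64, 64); y = similar(x1)
vadd!(y, x1, x2, Vec{4,Float64})   # unrolled by 4, so 16 elements per iteration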

Now we can compare the body of the loop for the scalar Base case:

        vmovupd (%rcx,%rax,8), %ymm0
        vmovupd 32(%rcx,%rax,8), %ymm1
        vmovupd 64(%rcx,%rax,8), %ymm2
        vmovupd 96(%rcx,%rax,8), %ymm3
        vaddpd  (%rdx,%rax,8), %ymm0, %ymm0
        vaddpd  32(%rdx,%rax,8), %ymm1, %ymm1
        vaddpd  64(%rdx,%rax,8), %ymm2, %ymm2
        vaddpd  96(%rdx,%rax,8), %ymm3, %ymm3
        vmovupd %ymm0, (%rsi,%rax,8)
        vmovupd %ymm1, 32(%rsi,%rax,8)
        vmovupd %ymm2, 64(%rsi,%rax,8)
        vmovupd %ymm3, 96(%rsi,%rax,8)
        addq    $16, %rax
        cmpq    %rax, %r8
        jne     L144
        cmpq    %r8, %r9
        je      L264

and the SIMD written one:

        vmovupd (%rcx,%rdi,8), %ymm0
        vmovupd 32(%rcx,%rdi,8), %ymm1
        vmovupd 64(%rcx,%rdi,8), %ymm2
        vmovupd 96(%rcx,%rdi,8), %ymm3
        vaddpd  (%rdx,%rdi,8), %ymm0, %ymm0
        vaddpd  32(%rdx,%rdi,8), %ymm1, %ymm1
        vaddpd  64(%rdx,%rdi,8), %ymm2, %ymm2
        vaddpd  96(%rdx,%rdi,8), %ymm3, %ymm3
        vmovupd %ymm0, (%rsi,%rdi,8)
        vmovupd %ymm1, 32(%rsi,%rdi,8)
        vmovupd %ymm2, 64(%rsi,%rdi,8)
        vmovupd %ymm3, 96(%rsi,%rdi,8)
        addq    $16, %rdi
        cmpq    8(%rax), %rdi
        jle     L48

where the Base case has an extra jump instruction because it needs to handle arrays whose length is not a multiple of unroll_factor * simd_width.

@KristofferC (Collaborator)

There isn't really a problem with this package, though. It's just that base Julia (and LLVM) is very good at these optimizations, so beating it on simple code like this will be very hard. That said, the fact that the pointer lookup doesn't get hoisted is quite annoying.
