
Abstraction penalty: Wrapper types lead to much less efficient code #13104

Closed
eschnett opened this issue Sep 13, 2015 · 10 comments
Labels
compiler:codegen (Generation of LLVM IR and native code)
performance (Must go faster)
potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@eschnett
Contributor

(This is Julia release-0.4 with LLVM 3.6.2; LLVM 3.3 is similar but slightly worse.)

I find that introducing a trivial wrapper type around Float64 leads to much less efficient code. Here is an example:

immutable F elt::Float64 end
import Base.+
+(a::F, b::F) = F(a.elt+b.elt)

I would expect this type F to be as efficient as Float64. Unfortunately it isn't, as this test shows:

function vadd!(rs,xs,ys)
    @inbounds @simd for i in eachindex(rs)
        rs[i] = xs[i] + ys[i]
    end
end

Julia produces the following code for this example:

julia> code_native(vadd!, (Vector{F}, Vector{F}, Vector{F}))
[...]
Source line: 3
L48:    movq    (%rdx), %rax
    vmovsd  (%rax,%rdi,8), %xmm0
    movq    (%rsi), %rax
    vaddsd  (%rax,%rdi,8), %xmm0, %xmm0
    movq    (%r8), %rax
    vmovsd  %xmm0, (%rax,%rdi,8)
Source line: 74
    incq    %rdi
Source line: 75
    cmpq    %rdi, %rcx
    jne L48
[...]

When I generate code for Vector{Float64} instead, the double indirections (reloading %rax on every loop iteration) are not present and the loop is vectorized. I wonder what causes this and how it can be addressed.

The presence of the double indirections suggests that this may be an aliasing problem: maybe LLVM cannot determine that the three arrays and the contents of the new type F don't alias, and thus cannot hoist the respective loads out of the loop?
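
For reference, the Float64 baseline can be inspected the same way. The setup below is only an illustrative sketch (sizes and values are arbitrary, not part of the original report):

n = 10^6                          # illustrative size
xs = [F(rand()) for i in 1:n]     # wrapped inputs
ys = [F(rand()) for i in 1:n]
rs = similar(xs)                  # output buffer of the same wrapper type
vadd!(rs, xs, ys)                 # force specialization before inspecting code

# Baseline for comparison: the same kernel specialized on plain Float64.
code_native(vadd!, (Vector{Float64}, Vector{Float64}, Vector{Float64}))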

@simonster
Member

With julia -O (which enables the LLVM basic AA pass) the array pointer loads appear to be hoisted, although the loop is still not vectorized (on LLVM 3.3).

@tkelman tkelman added the performance Must go faster label Sep 13, 2015
@mlubin
Member

mlubin commented Sep 13, 2015

CC @jrevels

@eschnett
Contributor Author

I confirm that julia -O optimizes the loads and stores, and the loop carries no obvious overhead. This is the case for both LLVM 3.3 and LLVM 3.6.2.

As you say, the loop is not vectorized. I also notice that LLVM chooses questionable addressing modes, as there are 4 integer arithmetic instructions alongside the 3 floating-point instructions, but I cannot tell whether this is actually slower on my CPU.

@jrevels
Member

jrevels commented Sep 14, 2015

This is something that caught us off guard when testing ForwardDiff.jl's wrapper number types as well.

@eschnett
Contributor Author

LLVM 3.7 doesn't help.

@jrevels ... how did you address it?

@jrevels
Member

jrevels commented Sep 14, 2015

@eschnett I mainly didn't address it, unfortunately - at least not the problems you're pointing out with additional loads. In our case, the performance hit due to some inlining-related problems we were facing dwarfed the additional loads, so while I took note of it, it wasn't a focus at the time.

@eschnett eschnett added the compiler:codegen Generation of LLVM IR and native code label Sep 14, 2015
@eschnett
Contributor Author

@jrevels Okay. Well, the native code generated by julia -O looks harmless enough, meaning that LLVM should be able to vectorize the bitcode that produced it. So either the annotations added by @simd somehow get lost, or all that's missing is running another LLVM pass (or running them in a different order).
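
One way to check (a sketch, using the standard reflection tools) is to dump the LLVM IR for this specialization and look for the loop metadata that @simd attaches; the exact metadata names depend on the LLVM version:

# Inspect the LLVM IR for the wrapper-type specialization. The @simd macro
# should leave llvm.loop / parallel memory-access metadata on the loop; if it
# is missing here, the annotation was lost before the vectorizer ran.
code_llvm(vadd!, (Vector{F}, Vector{F}, Vector{F}))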

@simonster
Member

The LLVM IR constructs an aggregate (%32 = insertvalue %F undef, double %31, 0) which may be what spooks the vectorizer. LLVM 3.7 seems to be capable of removing the aggregate, but the vectorizer seems to be broken in general (#13106).
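
As a side note, a possible workaround sketch (not from the original discussion; the function name is illustrative) is to reinterpret the wrapper arrays as their underlying Float64 data up front, so the hot loop never builds the F aggregate:

# Workaround sketch: F is an immutable wrapper around a single Float64, so a
# Vector{F} can be reinterpreted as a Vector{Float64} sharing the same memory,
# and the plain-Float64 specialization of vadd! is used instead.
function vadd_unwrapped!(rs::Vector{F}, xs::Vector{F}, ys::Vector{F})
    vadd!(reinterpret(Float64, rs),
          reinterpret(Float64, xs),
          reinterpret(Float64, ys))
end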

@yuyichao
Contributor

yuyichao commented May 6, 2016

Is this still a problem? (Somehow I don't see vectors in code_llvm, but I do see simd instructions in code_native...)

With code_native I get:

Source line: 3
L112:
        vmovupd -96(%rdi), %ymm0
        vmovupd -64(%rdi), %ymm1
        vmovupd -32(%rdi), %ymm2
        vmovupd (%rdi), %ymm3
Source line: 1
        vaddpd  -96(%rax), %ymm0, %ymm0
        vaddpd  -64(%rax), %ymm1, %ymm1
        vaddpd  -32(%rax), %ymm2, %ymm2
        vaddpd  (%rax), %ymm3, %ymm3
        vmovupd %ymm0, -96(%rcx)
        vmovupd %ymm1, -64(%rcx)
        vmovupd %ymm2, -32(%rcx)
        vmovupd %ymm3, (%rcx)
Source line: 74
        addq    $128, %rdi
        addq    $128, %rax
        addq    $128, %rcx
        addq    $-16, %r11
        jne     L112

@vtjnash
Member

vtjnash commented May 6, 2016

lgtm also

@vtjnash vtjnash closed this as completed May 6, 2016
@tkelman tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label May 7, 2016