Abstraction penalty: Wrapper types lead to much less efficient code #13104
Comments
With …
CC @jrevels
I confirm that, as you say, the loop is not vectorized. I also notice that LLVM chooses questionable addressing modes, as there are 4 integer arithmetic instructions alongside the 3 floating-point instructions, but I cannot tell whether this is actually slower on my CPU.
This is something that caught us off guard when testing ForwardDiff.jl's wrapper number types as well.
LLVM 3.7 doesn't help. @jrevels ... how did you address it?
@eschnett I mainly didn't address it, unfortunately - at least not the problems you're pointing out with additional loads. In our case, the performance hit due to some inlining-related problems we were facing dwarfed the additional loads, so while I took note of it, it wasn't a focus at the time.
@jrevels Okay. Well, the native code generated by …
The LLVM IR constructs an aggregate (…).
Is this still a problem? (Somehow I don't see vectors in code_llvm, but I do see SIMD instructions in code_native...) With code_native I get:

```asm
Source line: 3
L112:
vmovupd -96(%rdi), %ymm0
vmovupd -64(%rdi), %ymm1
vmovupd -32(%rdi), %ymm2
vmovupd (%rdi), %ymm3
Source line: 1
vaddpd -96(%rax), %ymm0, %ymm0
vaddpd -64(%rax), %ymm1, %ymm1
vaddpd -32(%rax), %ymm2, %ymm2
vaddpd (%rax), %ymm3, %ymm3
vmovupd %ymm0, -96(%rcx)
vmovupd %ymm1, -64(%rcx)
vmovupd %ymm2, -32(%rcx)
vmovupd %ymm3, (%rcx)
Source line: 74
addq $128, %rdi
addq $128, %rax
addq $128, %rcx
addq $-16, %r11
jne L112
```
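For reference, a minimal way to re-check this on a current Julia, sketched with plain Float64 arrays as the vectorizing baseline (the function name vadd! and the array size are assumptions, not taken from the report):

```julia
using InteractiveUtils  # for @code_llvm / @code_native outside the REPL

# If LLVM vectorizes this loop, @code_llvm shows <N x double> vector types and
# @code_native shows packed instructions such as vaddpd/vmovupd on ymm registers.
function vadd!(c::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    @inbounds for i in eachindex(a, b, c)
        c[i] = a[i] + b[i]
    end
    return c
end

a = rand(1024); b = rand(1024); c = similar(a)
@code_llvm vadd!(c, a, b)    # look for e.g. "<4 x double>"
@code_native vadd!(c, a, b)  # look for vaddpd / vmovupd
```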
lgtm also |
(This is Julia release-0.4 with LLVM 3.6.2; LLVM 3.3 is similar but slightly worse.)
I find that introducing a trivial wrapper type around Float64 leads to much less efficient code. This is an example:
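The original listing is elided in this extract; a minimal sketch of the kind of wrapper type being described might look as follows (the field name x and the use of current struct syntax rather than Julia 0.4's immutable are assumptions):

```julia
import Base: +

# A trivial immutable wrapper around Float64 that forwards addition.
struct F
    x::Float64
end

+(a::F, b::F) = F(a.x + b.x)
```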
I would expect this type F to be as efficient as Float64. Unfortunately it isn't, as visible in this test:
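The original test is likewise elided; continuing the sketch above, a test in this spirit adds two wrapped vectors elementwise and inspects the generated code (the function name add!, the vector length, and the use of @code_native are assumptions):

```julia
using InteractiveUtils  # for @code_native outside the REPL

# Elementwise addition over three same-length vectors, generic in the element type.
function add!(c::Vector{T}, a::Vector{T}, b::Vector{T}) where {T}
    @inbounds for i in eachindex(a, b, c)
        c[i] = a[i] + b[i]
    end
    return c
end

n  = 1_000
a  = [F(rand()) for _ in 1:n];  b  = [F(rand()) for _ in 1:n];  c  = similar(a)
af = rand(n);                   bf = rand(n);                   cf = similar(af)

@code_native add!(c, a, b)     # wrapper version: the code under discussion
@code_native add!(cf, af, bf)  # plain Float64 version, for comparison
```

If the report's observation holds, the Float64 call would be expected to show a packed vaddpd loop, while the wrapper call shows scalar code with the extra reloads described below.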
Julia produces the following code for this example:

When I generate code for Vector{Float64} instead, the double indirections (reloading %rax at every loop iteration) are not present, and the loop is vectorized. I wonder what causes this, and how it can be addressed.

The presence of the double indirections indicates that this may be an aliasing problem: maybe LLVM cannot determine that the three arrays and the contents of the new type F don't alias, and thus cannot hoist the respective loads out of the loop?
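One way to probe that hypothesis, reusing F, add!, and the arrays from the sketches above, is to assert iteration independence with @simd and see whether the wrapper loop then vectorizes; this is a diagnostic sketch, not a fix:

```julia
# Same loop, but @simd asserts that iterations are independent and may be
# reordered; if missing alias information is what blocks vectorization, this
# hint may let LLVM vectorize the Vector{F} version as well.
function add_simd!(c::Vector{T}, a::Vector{T}, b::Vector{T}) where {T}
    @inbounds @simd for i in eachindex(a, b, c)
        c[i] = a[i] + b[i]
    end
    return c
end

@code_native add_simd!(c, a, b)  # does the Vector{F} loop now use vaddpd?
```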