Fma #17
The background is here: [link elided]. Basically, there are some cases where, if you have a fast fma, […]. What is your output of @code_native fma(1., 2., 3.)?
julia> @code_native fma(1.,2.,3.)
.text
Filename: floatfuncs.jl
pushq %rbp
movq %rsp, %rbp
movabsq $fma, %rax
Source line: 281
popq %rbp
jmpq *%rax
nopw %cs:(%rax,%rax)
julia> @code_native muladd(1.,2.,3.)
.text
Filename: float.jl
pushq %rbp
movq %rsp, %rbp
Source line: 249
mulsd %xmm1, %xmm0
addsd %xmm2, %xmm0
popq %rbp
retq
nop
julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
System: NT (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
Yes, so it looks like it is using a software fma, in which case muladd will be faster.
Though you do have a Haswell processor, so if you recompile the sysimg (run build_sysimg.jl) it should pick up the hardware fma instruction.
Exactly, hence my surprise when I saw that it wasn't defaulting to the hardware implementation. I tried to run build_sysimg.jl, to no avail. Will post an issue.
Yeah, I think it requires a linker to be installed, which typically isn't available on Windows.
Fixed: I just had to manually delete the old dll before running build_sysimg, and now muladd correctly defaults to the hardware implementation (JuliaLang/julia#18377).
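(Sketch, not captured output: after rebuilding the sysimg for the native CPU target, the generated code for fma should collapse to a single fused multiply-add on a Haswell-class CPU, roughly as below, instead of the movabsq $fma / jmpq tail call shown earlier; exact registers and prologue may differ.)

julia> @code_native fma(1., 2., 3.)
.text
pushq %rbp
movq %rsp, %rbp
vfmadd213sd %xmm2, %xmm1, %xmm0
popq %rbp
retq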
FYI, this really has a considerable impact on benchmarks...
It looks like using two-prod-fma is still faster in both cases (software or hardware):

immutable Double{T<:Float64}
    hi::T
    lo::T
end
@inline trunclo(x::Float64) = reinterpret(Float64, reinterpret(UInt64, x) & 0xffff_ffff_f800_0000) # clear lower 27 bits (leave upper 26 bits)
@inline trunclo(x::Float32) = reinterpret(Float32, reinterpret(UInt32, x) & 0xffff_f000) # clear lowest 12 bits (leave upper 12 bits)
@inline function splitprec(x::Float64)
    hx = trunclo(x)
    hx, x - hx
end
# two-prod-fma
# HARDWARE
function ddmul1{T<:Float64}(x::T, y::T)
    z = x*y
    Double(z, Base.fma_llvm(x, y, -z))
end
# SOFTWARE
function ddmul2{T<:Float64}(x::T, y::T)
    z = x*y
    Double(z, Base.fma_libm(x, y, -z))
end
# two-prod x*y
function ddmul3{T<:Float64}(x::T, y::T)
    hx, lx = splitprec(x)
    hy, ly = splitprec(y)
    z = x*y
    Double(z, ((hx*hy - z) + lx*hy + hx*ly) + lx*ly)
end

You can look at the benchmarks:

julia> using BenchmarkTools
julia> x = zip(randn(1000000),randn(1000000))
julia> @benchmark for xi in x ddmul1(xi...) end
BenchmarkTools.Trial:
samples: 22
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 137.32 mb
allocs estimate: 5999490
minimum time: 203.72 ms (9.85% GC)
median time: 216.42 ms (9.90% GC)
mean time: 228.58 ms (9.62% GC)
maximum time: 335.36 ms (7.75% GC)
julia> @benchmark for xi in x ddmul2(xi...) end
BenchmarkTools.Trial:
samples: 16
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 137.32 mb
allocs estimate: 5999490
minimum time: 287.62 ms (6.49% GC)
median time: 297.99 ms (7.16% GC)
mean time: 319.22 ms (6.91% GC)
maximum time: 461.30 ms (6.47% GC)
julia> @benchmark for xi in x ddmul3(xi...) end
BenchmarkTools.Trial:
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 579.83 mb
allocs estimate: 27999490
minimum time: 8.93 s (1.07% GC)
median time: 8.93 s (1.07% GC)
mean time: 8.93 s (1.07% GC)
maximum time: 8.93 s (1.07% GC)
Yeah, that's one reason I really would like the system image to be rebuilt as part of the installation.
That is interesting; I never really benchmarked it (in fact I don't think […]).
We can provide a function to rebuild the system image optionally, rather than having it be part of the installation, to start with.
Yeah, I can't get my head around why ddmul3 is any slower than ddmul2, since ddmul2 is, under the hood, using the openlibm software implementation of Dekker's tricks...
You would probably have to look at the assembly output to see the difference. Does it make a difference if you start julia with -O3?
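For instance (a sketch, using the ddmul definitions from the snippet above):

julia> @code_native ddmul2(1.0, 2.0)   # routed through the libm fma call
julia> @code_native ddmul3(1.0, 2.0)   # pure splitting; plain mulsd/addsd/subsd-style code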
@ViralBShah is that really optimal? If/when we have a Julia libm, shouldn't the default be to always build the system image, given the considerable perf gains? On a relatively new CPU it only took a couple of minutes to build on my laptop. It might also be very confusing to newbies.
@simonbyrne That's it! -O3 nails it down to roughly the same speed.
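(A quick way to double-check which optimisation level the running session was started with; the JLOptions field name is an assumption based on the 0.5-era options struct and may differ across versions:)

$ julia -O3
julia> Base.JLOptions().opt_level
3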
On a related note, it would be fantastic if we could also force precompilation of all the math functions in Libm under -O3, even if the user has run Julia under the default optimization flag. I'm not sure if this is possible; otherwise I could open an issue to explore/discuss this possibility in more depth.
I meant that we should start with having it optional in the 0.6 release cycle and make it robust before we decide to make it the default as part of the installation. It is a pretty big thing to pull off and will need significant testing. @tkelman has thought about these issues in a lot more detail.
Agreed.
Actually, it turns out that if it's in the sysimg (and not subsequently inlined) it should be compiled at -O3 in the release, so this won't be a problem once it's in Base. But yes, it would be nice to have a way to control function- or module-level optimisation behaviour.
Perfect. The only dependency then is to remove the openlibm fma implementation. Would defining a pure-Julia fma as fma_julia(x, y, z) = Sleef.ddadd2(Sleef.ddmul(x, y), z) be sufficient? The code for this in openlibm looks quite messy.
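For concreteness, here is a minimal sketch of that idea written directly in terms of the splitprec/two-prod code from earlier in the thread rather than the Sleef helpers (the names two_sum and fma_sketch are made up for illustration, and, as noted below, this is not a correctly rounded fma in all edge cases):

# Knuth two-sum: returns s, e with s + e == a + b exactly (barring overflow)
@inline function two_sum(a::Float64, b::Float64)
    s = a + b
    v = s - a
    s, (a - (s - v)) + (b - v)
end

# Dekker two-prod using splitprec from above: returns p, e with p + e == a*b exactly
@inline function two_prod(a::Float64, b::Float64)
    ha, la = splitprec(a)
    hb, lb = splitprec(b)
    p = a * b
    p, ((ha*hb - p) + la*hb + ha*lb) + la*lb
end

# fma_sketch(x, y, z) ~ x*y + z, analogous to ddadd2(ddmul(x, y), z);
# not correctly rounded in every case (double rounding, extreme exponents)
function fma_sketch(x::Float64, y::Float64, z::Float64)
    p, e1 = two_prod(x, y)   # x*y   == p + e1 exactly
    s, e2 = two_sum(p, z)    # p + z == s + e2 exactly
    s + (e1 + e2)            # fold the error terms back in
end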
Not quite. There is the issue whereby a correctly rounded fma has to handle various edge cases, which is what all the extra checks in the openlibm implementation are for.
All those checks explain why the software fma is so much slower than muladd.
Shouldn't it be fma and not muladd in the following line?
https://github.com/JuliaMath/Libm.jl/blob/master/src/utils.jl#L79
On my PC muladd is just as fast as fma, but this test fails.