
Fma #17

Closed

musm opened this issue Sep 6, 2016 · 23 comments

Comments

@musm
Collaborator

musm commented Sep 6, 2016

Shouldn't it be fma and not muladd in the following line?
https://github.com/JuliaMath/Libm.jl/blob/master/src/utils.jl#L79
On my PC muladd is just as fast as fma, but this test fails.

@simonbyrne
Member

The background is here:
JuliaLang/julia#9855

Basically fma is guaranteed to have no intermediate rounding (but may be slow if not implemented in hardware). muladd can do either fma or a*b+c, but should do whichever is fastest.
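
To make the difference concrete, here is a small illustrative sketch (assuming a build where muladd compiles to a separate multiply and add; on a build with hardware fma the two results would agree):

julia> a = 1.0 + 2.0^-27;

julia> fma(a, a, -1.0) == 2.0^-26 + 2.0^-54    # single rounding: the 2^-54 term survives
true

julia> muladd(a, a, -1.0) == 2.0^-26           # a*a is rounded first, so the 2^-54 term is lost
true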

There are some cases where if you have a fast fma you want to use it, but if you don't you need to fall back on a different approach to maintain accuracy, e.g. here.

What is your @code_native for muladd and fma? You may also be hitting this issue.

@musm
Collaborator Author

musm commented Sep 6, 2016

julia> @code_native fma(1.,2.,3.)
        .text
Filename: floatfuncs.jl
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $fma, %rax
Source line: 281
        popq    %rbp
        jmpq    *%rax
        nopw    %cs:(%rax,%rax)

julia> @code_native muladd(1.,2.,3.)
        .text
Filename: float.jl
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 249
        mulsd   %xmm1, %xmm0
        addsd   %xmm2, %xmm0
        popq    %rbp
        retq
        nop

julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: NT (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@simonbyrne
Member

Yes, so it looks like it is using a software fma, in which case muladd is just giving you a standard multiply and add instead of a fused operation.

@simonbyrne
Member

You do have a Haswell processor, though, so if you recompile the sysimg (run contrib/build_sysimg.jl, though I'm not sure whether it works on Windows) you should be able to use the hardware fma.
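
For reference, the 0.5-era invocation looks roughly like this (a sketch from memory, so the exact path and keyword arguments may differ; it also assumes a working compiler/linker toolchain):

julia> include(joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "build_sysimg.jl"))

julia> build_sysimg(force=true)    # rebuild the sysimg targeting the native CPU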

@musm
Collaborator Author

musm commented Sep 6, 2016

Exactly, hence my surprise when I saw that it wasn't defaulting to the hardware implementation.

I tried to run build_sysimg.jl to no avail. Will post an issue.

@simonbyrne
Member

Yeah, I think it requires that a linker be installed, which typically isn't available on Windows.

@musm
Collaborator Author

musm commented Sep 6, 2016

Fixed: I just had to manually delete the old DLL before running build_sysimg, and now muladd correctly defaults to the hardware implementation (JuliaLang/julia#18377).

@musm
Collaborator Author

musm commented Sep 6, 2016

FYI, this really has a considerable impact on benchmarks....

@musm
Collaborator Author

musm commented Sep 6, 2016

It looks like using two-prod-fma is still faster in both cases (software or hardware fma)
than the manual two-prod below, which is very interesting, since the libm fma seems to employ the same trick as Dekker's algorithm:

immutable Double{T<:Float64}
    hi::T
    lo::T
end
@inline trunclo(x::Float64) = reinterpret(Float64, reinterpret(UInt64, x) & 0xffff_ffff_f800_0000) # clear lower 27 bits (leave upper 26 bits)
@inline trunclo(x::Float32) = reinterpret(Float32, reinterpret(UInt32, x) & 0xffff_f000) # clear lowest 12 bits (leave upper 12 bits)
@inline function splitprec(x::Float64)
    hx = trunclo(x)
    hx, x-hx
end

# two-prod-fma
#HARDWARE
function ddmul1{T<:Float64}(x::T, y::T)
     z = x*y
     Double(z, Base.fma_llvm(x, y, -z))
end

#SOFTWARE
function ddmul2{T<:Float64}(x::T, y::T)
     z = x*y
     Double(z, Base.fma_libm(x, y, -z))
end

# two-prod x*y
function ddmul3{T<:Float64}(x::T, y::T)
    hx, lx = splitprec(x)
    hy, ly = splitprec(y)
    z = x*y
    Double(z, ((hx*hy-z) + lx*hy + hx*ly) + lx*ly)
end

You can look at the code_native output, or benchmark:

julia> using BenchmarkTools
julia> x = zip(randn(1000000),randn(1000000))
julia> @benchmark  for xi in x ddmul1(xi...) end            
BenchmarkTools.Trial:                                       
  samples:          22                                      
  evals/sample:     1                                       
  time tolerance:   5.00%                                   
  memory tolerance: 1.00%                                   
  memory estimate:  137.32 mb                               
  allocs estimate:  5999490                                 
  minimum time:     203.72 ms (9.85% GC)                    
  median time:      216.42 ms (9.90% GC)                    
  mean time:        228.58 ms (9.62% GC)                    
  maximum time:     335.36 ms (7.75% GC)                    

julia> @benchmark  for xi in x ddmul2(xi...) end            
BenchmarkTools.Trial:                                       
  samples:          16                                      
  evals/sample:     1                                       
  time tolerance:   5.00%                                   
  memory tolerance: 1.00%                                   
  memory estimate:  137.32 mb                               
  allocs estimate:  5999490                                 
  minimum time:     287.62 ms (6.49% GC)                    
  median time:      297.99 ms (7.16% GC)                    
  mean time:        319.22 ms (6.91% GC)                    
  maximum time:     461.30 ms (6.47% GC)                    

julia> @benchmark  for xi in x ddmul3(xi...) end
BenchmarkTools.Trial:
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  579.83 mb
  allocs estimate:  27999490
  minimum time:     8.93 s (1.07% GC)
  median time:      8.93 s (1.07% GC)
  mean time:        8.93 s (1.07% GC)
  maximum time:     8.93 s (1.07% GC)                                         

@simonbyrne
Member

FYI, this really has a considerable impact on benchmarks....

Yeah, that's one reason I would really like build_sysimg to be part of the installation: it lets us do things that are basically impossible in C unless you recompile all your libraries for your machine.

@simonbyrne
Member

That is interesting; I never really benchmarked it (in fact, I don't think fma was available as a Julia function when I wrote DoubleDouble.jl).

@ViralBShah
Member

To start with, we can provide an optional function to rebuild the system image, rather than making it part of the installation.

@musm
Collaborator Author

musm commented Sep 6, 2016

Yeah, I can't get my head around why ddmul3 is any slower than ddmul2, since ddmul2 is, under the hood, using the openlibm software implementation of Dekker's trick...

@simonbyrne
Member

You would probably have to look at the assembly output to see the difference. Does it make a difference if you start julia with -O3?
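
That is, something like the following (hypothetical session; ddmul1/ddmul2/ddmul3 are the definitions from above):

$ julia -O3            # equivalently: julia --optimize=3
julia> using BenchmarkTools
julia> x = zip(randn(1000000), randn(1000000));
julia> @benchmark for xi in x ddmul3(xi...) end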

@musm
Collaborator Author

musm commented Sep 6, 2016

@ViralBShah is that really optimal? If/when we have a Julia libm, shouldn't the default be to always build the system image, given the considerable perf gains? On a relatively new CPU it only took a couple of minutes to build on my laptop. It might also be very confusing to newbies.

@musm
Collaborator Author

musm commented Sep 6, 2016

@simonbyrne That's it! With -O3 they come out at roughly the same speed.

@musm
Collaborator Author

musm commented Sep 6, 2016

On a related note: it would be fantastic if we could also force precompilation of all the math functions in Libm under -O3, even if the user runs Julia with the default optimization level. I'm not sure if this is possible; otherwise I could open an issue to explore/discuss it in more depth.

@ViralBShah
Member

I meant that we should start with having it optional in the 0.6 release cycle and make it robust before we decide to make it the default as part of the installation. It is a pretty big thing to pull off and will need significant testing. @tkelman has thought about these issues in a lot more detail.

@simonbyrne
Member

On a related note: it would be fantastic if we could also force precompilation of all the math functions in Libm under -O3; I'm not sure if that's possible.

Agreed.

@simonbyrne
Member

simonbyrne commented Sep 6, 2016

Actually, it turns out that if it's in the sysimg (and not subsequently inlined) it should be compiled at -O3 in the release, so this won't be a problem once it's in Base. But yes, it would be nice to have a way to control function- or module-level optimisation behaviour.

@musm
Collaborator Author

musm commented Sep 6, 2016

Perfect. The only remaining dependency to remove, then, is the openlibm fma implementation.

Would defining a pure Julia fma as

fma_julia(x,y,z) = Sleef.ddadd2(Sleef.ddmul(x,y),z) 

be sufficient?

The code for this in openlibm looks quite messy:
https://github.com/JuliaLang/openlibm/blob/master/src/s_fma.c

@simonbyrne
Member

Not quite. There is the issue whereby ddmul can lose the last bit, and it would also need to handle subnormals and intermediate overflow correctly.
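
One quick way to probe for such cases is a randomized comparison against a high-precision reference (a rough sketch: check_fma is a made-up helper, fma_julia is the definition proposed above, and the 256-bit BigFloat result is only an approximate correct-rounding oracle):

# Spot-check a candidate fma against a BigFloat reference on well-scaled inputs.
function check_fma(f, n=10^5)
    for _ in 1:n
        x = ldexp(rand() - 0.5, rand(-30:30))
        y = ldexp(rand() - 0.5, rand(-30:30))
        z = ldexp(rand() - 0.5, rand(-30:30))
        ref = Float64(big(x)*big(y) + big(z))
        f(x, y, z) == ref || println("mismatch: ", (x, y, z))
    end
end

check_fma(fma)         # the built-in fma should essentially never mismatch here
check_fma(fma_julia)   # the double-double version may be off in the last bit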

@musm
Collaborator Author

musm commented Sep 7, 2016

All those checks explain why the software fma is so much slower than muladd.

@musm musm closed this as completed Sep 20, 2016