
Fma #17

Closed

musm opened this issue Sep 6, 2016 · 23 comments

Comments

@musm
Collaborator

musm commented Sep 6, 2016

Shouldn't it be fma and not muladd in the following line?
https://github.com/JuliaMath/Libm.jl/blob/master/src/utils.jl#L79
On my PC muladd is just as fast as fma, but this test fails.

@simonbyrne
Member

The background is here:
JuliaLang/julia#9855

Basically fma is guaranteed to have no intermediate rounding (but may be slow if not implemented in hardware). muladd can do either fma or a*b+c, but should do whichever is fastest.
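
To make the difference concrete, here is a small illustrative sketch (assuming a build where muladd compiles to a separate multiply and add; on a build with hardware fma the two results would agree):

julia> a = 1.0 + 2.0^-27;

julia> fma(a, a, -1.0) == 2.0^-26 + 2.0^-54    # single rounding: the 2^-54 term survives
true

julia> muladd(a, a, -1.0) == 2.0^-26           # a*a is rounded first, so the 2^-54 term is lost
true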

There are some cases where if you have a fast fma you want to use it, but if you don't you need to fall back on a different approach to maintain accuracy, e.g. here.

What is your @code_native for muladd and fma? You may also be hitting this issue.

@musm
Collaborator Author

musm commented Sep 6, 2016

julia> @code_native fma(1.,2.,3.)
        .text
Filename: floatfuncs.jl
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $fma, %rax
Source line: 281
        popq    %rbp
        jmpq    *%rax
        nopw    %cs:(%rax,%rax)

julia> @code_native muladd(1.,2.,3.)
        .text
Filename: float.jl
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 249
        mulsd   %xmm1, %xmm0
        addsd   %xmm2, %xmm0
        popq    %rbp
        retq
        nop

julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: NT (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@simonbyrne
Member

Yes, so it looks like it is using a software fma, in which case muladd is just giving you a standard multiply and add instead of a fused operation.

@simonbyrne
Member

You do have a Haswell processor, though, so if you recompile the sysimg (run contrib/build_sysimg.jl, though I'm not sure whether it works on Windows) you should be able to use the hardware fma.
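
For reference, the 0.5-era invocation looks roughly like this (a sketch from memory, so the exact path and keyword arguments may differ; it also assumes a working compiler/linker toolchain):

julia> include(joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "build_sysimg.jl"))

julia> build_sysimg(force=true)    # rebuild the sysimg targeting the native CPU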

@musm
Collaborator Author

musm commented Sep 6, 2016

Exactly, hence my surprise when I saw that it wasn't defaulting to the hardware implementation.

I tried to run build_sysimg.jl to no avail. Will post an issue.

@simonbyrne
Member

Yeah, I think it requires that a linker be installed, which typically isn't available on Windows.

@musm
Collaborator Author

musm commented Sep 6, 2016

Fixed: I just had to manually delete the old DLL before running build_sysimg, and now muladd correctly defaults to the hardware implementation (JuliaLang/julia#18377).

@musm
Collaborator Author

musm commented Sep 6, 2016

FYI, this really has a considerable impact on benchmarks....

@musm
Collaborator Author

musm commented Sep 6, 2016

It looks like using two-prod-fma is still faster in both cases (software or hardware fma)
than the manual two-prod below, which is very interesting, since the libm fma seems to employ the same trick as Dekker's algorithm:

immutable Double{T<:Float64}
    hi::T
    lo::T
end
@inline trunclo(x::Float64) = reinterpret(Float64, reinterpret(UInt64, x) & 0xffff_ffff_f800_0000) # clear lower 27 bits (leave upper 26 bits)
@inline trunclo(x::Float32) = reinterpret(Float32, reinterpret(UInt32, x) & 0xffff_f000) # clear lowest 12 bits (leave upper 12 bits)
@inline function splitprec(x::Float64)
    hx = trunclo(x)
    hx, x-hx
end

# two-prod-fma
#HARDWARE
function ddmul1{T<:Float64}(x::T, y::T)
     z = x*y
     Double(z, Base.fma_llvm(x, y, -z))
end

#SOFTWARE
function ddmul2{T<:Float64}(x::T, y::T)
     z = x*y
     Double(z, Base.fma_libm(x, y, -z))
end

# two-prod x*y
function ddmul3{T<:Float64}(x::T, y::T)
    hx, lx = splitprec(x)
    hy, ly = splitprec(y)
    z = x*y
    Double(z, ((hx*hy-z) + lx*hy + hx*ly) + lx*ly)
end

You can look at the code_native output, or benchmark:

julia> using BenchmarkTools
julia> x = zip(randn(1000000),randn(1000000))
julia> @benchmark  for xi in x ddmul1(xi...) end            
BenchmarkTools.Trial:                                       
  samples:          22                                      
  evals/sample:     1                                       
  time tolerance:   5.00%                                   
  memory tolerance: 1.00%                                   
  memory estimate:  137.32 mb                               
  allocs estimate:  5999490                                 
  minimum time:     203.72 ms (9.85% GC)                    
  median time:      216.42 ms (9.90% GC)                    
  mean time:        228.58 ms (9.62% GC)                    
  maximum time:     335.36 ms (7.75% GC)                    

julia> @benchmark  for xi in x ddmul2(xi...) end            
BenchmarkTools.Trial:                                       
  samples:          16                                      
  evals/sample:     1                                       
  time tolerance:   5.00%                                   
  memory tolerance: 1.00%                                   
  memory estimate:  137.32 mb                               
  allocs estimate:  5999490                                 
  minimum time:     287.62 ms (6.49% GC)                    
  median time:      297.99 ms (7.16% GC)                    
  mean time:        319.22 ms (6.91% GC)                    
  maximum time:     461.30 ms (6.47% GC)                    

julia> @benchmark  for xi in x ddmul3(xi...) end
BenchmarkTools.Trial:
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  579.83 mb
  allocs estimate:  27999490
  minimum time:     8.93 s (1.07% GC)
  median time:      8.93 s (1.07% GC)
  mean time:        8.93 s (1.07% GC)
  maximum time:     8.93 s (1.07% GC)                                         

@simonbyrne
Member

FYI, this really has a considerable impact on benchmarks....

Yeah, that's one reason I would really like build_sysimg to be part of the installation: it lets us do things that are basically impossible in C unless you recompile all your libraries for your machine.

@simonbyrne
Member

That is interesting; I never really benchmarked it (in fact, I don't think fma was available as a Julia function when I wrote DoubleDouble.jl).

@ViralBShah
Member

To start with, we can provide an optional function to rebuild the system image, rather than making it part of the installation.

@musm
Collaborator Author

musm commented Sep 6, 2016

Yeah, I can't get my head around why ddmul3 is any slower than ddmul2, since ddmul2 is, under the hood, using the openlibm software implementation of Dekker's trick...

@simonbyrne
Member

You would probably have to look at the assembly output to see the difference. Does it make a difference if you start julia with -O3?
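
That is, something like the following (hypothetical session; ddmul1/ddmul2/ddmul3 are the definitions from above):

$ julia -O3            # equivalently: julia --optimize=3
julia> using BenchmarkTools
julia> x = zip(randn(1000000), randn(1000000));
julia> @benchmark for xi in x ddmul3(xi...) end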

@musm
Collaborator Author

musm commented Sep 6, 2016

@ViralBShah is that really optimal? If/when we have a Julia libm, shouldn't the default be to always build the system image, given the considerable perf gains? On a relatively new CPU it only took a couple of minutes to build on my laptop. It might also be very confusing to newbies.

@musm
Collaborator Author

musm commented Sep 6, 2016

@simonbyrne That's it! With -O3 they come out at roughly the same speed.

@musm
Collaborator Author

musm commented Sep 6, 2016

On a related note: it would be fantastic if we could also force precompilation of all the math functions in Libm under -O3, even if the user runs Julia with the default optimization level. I'm not sure if this is possible; otherwise I could open an issue to explore/discuss it in more depth.

@ViralBShah
Member

I meant that we should start with having it optional in the 0.6 release cycle and make it robust before we decide to make it the default as part of the installation. It is a pretty big thing to pull off and will need significant testing. @tkelman has thought about these issues in a lot more detail.

@simonbyrne
Member

On a related note: it would be fantastic if we could also force precompilation of all the math functions in Libm under -O3; I'm not sure if that's possible.

Agreed.

@simonbyrne
Member

simonbyrne commented Sep 6, 2016

Actually, it turns out that if it's in the sysimg (and not subsequently inlined) it should be compiled at -O3 in the release, so this won't be a problem once it's in Base. But yes, it would be nice to have a way to control function- or module-level optimisation behaviour.

@musm
Collaborator Author

musm commented Sep 6, 2016

Perfect. The only remaining dependency to remove, then, is the openlibm fma implementation.

Would defining a pure Julia fma as

fma_julia(x,y,z) = Sleef.ddadd2(Sleef.ddmul(x,y),z) 

be sufficient?

The code for this in openlibm looks quite messy:
https://github.com/JuliaLang/openlibm/blob/master/src/s_fma.c

@simonbyrne
Member

Not quite. There is the issue whereby ddmul can lose the last bit, and it would also need to handle subnormals and intermediate overflow correctly.
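
One quick way to probe for such cases is a randomized comparison against a high-precision reference (a rough sketch: check_fma is a made-up helper, fma_julia is the definition proposed above, and the 256-bit BigFloat result is only an approximate correct-rounding oracle):

# Spot-check a candidate fma against a BigFloat reference on well-scaled inputs.
function check_fma(f, n=10^5)
    for _ in 1:n
        x = ldexp(rand() - 0.5, rand(-30:30))
        y = ldexp(rand() - 0.5, rand(-30:30))
        z = ldexp(rand() - 0.5, rand(-30:30))
        ref = Float64(big(x)*big(y) + big(z))
        f(x, y, z) == ref || println("mismatch: ", (x, y, z))
    end
end

check_fma(fma)         # the built-in fma should essentially never mismatch here
check_fma(fma_julia)   # the double-double version may be off in the last bit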

@musm
Collaborator Author

musm commented Sep 7, 2016

All those checks explain why the software fma is so much slower than muladd.

@musm musm closed this as completed Sep 20, 2016