
Implement fma #8112

Merged 4 commits into JuliaLang:master on Jan 19, 2015
Conversation

@eschnett (Contributor)

I've implemented fma(x,y,z) and mad(x,y,z), as a follow-up to #6330. I'd be happy to receive comments.
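
For readers unfamiliar with the distinction: fma(x,y,z) computes x*y+z with a single rounding, while mad (later renamed muladd) computes it in whatever way is fastest. A small hypothetical illustration, not taken from the PR:

    # fma rounds once, so it can recover rounding error that x*y + z loses.
    x = 1.0 + 2.0^-27
    y = 1.0 - 2.0^-27      # true product is 1 - 2^-54
    fma(x, y, -1.0)        # -2.0^-54: the product kept full precision
    x*y - 1.0              # 0.0: x*y was first rounded to 1.0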

A reviewer (Member) commented on this excerpt of the diff:

    Type *tz = z->getType();
    Type *ts[3] = { tx, ty, tz };
    return builder.CreateCall3
        (jl_Module->getOrInsertFunction(tx==T_float64 ? "fma" : "fmaf",

Probably better to use the llvm intrinsic here.

@StefanKarpinski (Member)

I'd prefer to call mad muladd – the former is inscrutable and means "mean absolute deviation" in statistics while the latter is fairly clear but still pretty short.

@eschnett (Contributor, Author)

I have no preference as to the name; muladd seems better.

@simonbyrne (Contributor)

Fantastic! I made some general comments on #6330, but I think even if we do go down that direction, we will probably still want to expose these functions.

  • I don't think the fallback for fma should be fma{T<:Number}(x::T, y::T, z::T) = x*y+z. This is fine for integers (as we do modulo arithmetic), but not necessarily true for arbitrary numeric types.
  • For BigFloats you can call the fma function from MPFR (a sketch follows below).
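
A sketch of what such an MPFR-backed method could look like (hypothetical; the 0.3-era &x ccall syntax and the ROUNDING_MODE constant are assumptions modeled on Base.MPFR, not code from this patch):

    # Hypothetical sketch: forward to MPFR's correctly rounded mpfr_fma.
    function fma(x::BigFloat, y::BigFloat, z::BigFloat)
        r = BigFloat()
        ccall((:mpfr_fma, :libmpfr), Int32,
              (Ptr{BigFloat}, Ptr{BigFloat}, Ptr{BigFloat}, Ptr{BigFloat}, Int32),
              &r, &x, &y, &z, ROUNDING_MODE[end])  # ROUNDING_MODE is assumed
        return r
    end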

@ArchRobison (Contributor)

muladd_float appears to be defined, but not used anywhere. Also, should it expand to fma on platforms where fma is generally faster than separate + and *? (I say "generally" because there are corner cases, at least on Intel platforms, where the separate operations are faster because of pipelining effects.)

Thanks for doing this. I can cross it off my to-do list :-)

@eschnett (Contributor, Author) commented Oct 3, 2014

I don't see muladd_float; where is it defined?

Yes, muladd should expand to the fastest way a*b+c can be evaluated. In the future, this should be lowered to the respective LLVM intrinsic for types where it exists, which would automatically choose the appropriate instruction.

@ArchRobison (Contributor) commented on this excerpt of the diff:

    @@ -6,6 +6,7 @@ namespace JL_I {
         neg_int, add_int, sub_int, mul_int,
         sdiv_int, udiv_int, srem_int, urem_int, smod_int,
         neg_float, add_float, sub_float, mul_float, div_float, rem_float,
    +    fma_float, muladd_float,

muladd_float seems to be defined as an intrinsic, but the intrinsic appears to be unused by the rest of the patch.

@ArchRobison (Contributor)

I see now that LLVM 3.4.2 has both the mandatory fusion llvm.fma.* and optional fusion llvm.fmuladd intrinsics. (I had missed the latter.)

I marked the first occurrence of muladd_float in a line note.

@StefanKarpinski (Member)

I don't much like the idea of opening the door to simple sequential arithmetic code giving different answers on different machines based on their performance characteristics. I'm not convinced that the performance advantage of fma over separate mul and add is ever that great.

That leaves accuracy as a motivation for fma – and indeed, there are cases where you really need this. But in that case, it is incorrect to do the mul and add separately. So I guess what I'm saying is that I think we should only expose the mandatory fma operation.

@ArchRobison (Contributor)

If only the mandatory fma is exposed at first, then we could play around with its effect on benchmarks. I tried to build the branch, but ran into some unrelated problem:

    $ julia
    symbol could not be found jl_sysimg_gvars (-1): /localdisk/adrobiso/fma/julia/usr/bin/../lib/julia/sys.so: undefined symbol: jl_sysimg_gvars
    symbol could not be found jl_globalUnique (-1): /localdisk/adrobiso/fma/julia/usr/bin/../lib/julia/sys.so: undefined symbol: jl_globalUnique
    Segmentation fault (core dumped)

@simonster (Member)

@StefanKarpinski I can't comment on the performance advantage of muladd, but we already have a lot of functions that may give slightly different results depending on machine performance characteristics. This is true of many functions that use @simd (now including sum) as well as some matrix operations (which can also give different results depending on whether OpenBLAS threads are enabled). For the @simd cases we might be able to tell the loop vectorizer to always interleave n iterations of the loop, but I don't think we can do anything about OpenBLAS. Given the present situation, I don't think a muladd function that is explicitly documented to give different answers on different machines would be overly surprising.

@StefanKarpinski (Member)

It's true that we can't always avoid this, but explicitly introducing things that are unpredictable feels odd.

@eschnett (Contributor, Author) commented Oct 3, 2014

The case for muladd is very much the same as for @simd: improved performance, in exchange for accepting possible slight changes to the results. (Not necessarily less accurate -- muladd would produce more accurate results if it uses fma.) So are you suggesting we remove @simd?

Alternatively, we can introduce @muladd that would automatically find cases where it can be applied by rewriting expressions. Internally, it would probably generate function calls to a new function unsafe_muladd.
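
A minimal sketch of such a rewriting macro (entirely hypothetical: the name to_muladd and the whole implementation are invented here, it rewrites to plain muladd rather than the proposed unsafe_muladd, and it only handles the two-argument a*b + c pattern):

    # Hypothetical @muladd: recursively rewrite a*b + c into muladd(a, b, c).
    macro muladd(ex)
        esc(to_muladd(ex))
    end

    function to_muladd(ex)
        isa(ex, Expr) || return ex
        args = map(to_muladd, ex.args)   # rewrite subexpressions first
        if ex.head == :call && args[1] == :+ && length(args) == 3 &&
           isa(args[2], Expr) && args[2].head == :call &&
           args[2].args[1] == :* && length(args[2].args) == 3
            return :(muladd($(args[2].args[2]), $(args[2].args[3]), $(args[3])))
        end
        return Expr(ex.head, args...)
    end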

@simonbyrne (Contributor)

Would it also be possible to have a way of determining whether or not hardware fma is available? C has the FP_FAST_FMA macro: could we export this as a boolean constant?

@eschnett (Contributor, Author) commented Oct 3, 2014

We could. However, it is more elegant to let the programmer specify what the algorithm needs ("accurate", "fast"), and then let Julia choose what is right for a platform. Whether an actual fma instruction will be used for a particular construct depends on the code generator, and may not be known until machine code is generated.

For example, the optimizer may determine that the multiplication can be hoisted out of a loop, leaving only the addition inside the loop. If you check for such a feature, and then use an algorithm with an explicit fma, this optimization is not possible any more.
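
A hypothetical example of the hoisting point (not from the patch): in the loop below, muladd leaves the compiler free to compute the invariant product a*b once outside the loop, whereas an explicit fma(a, b, x) would pin the fused operation inside it.

    function sum_shifted(a, b, xs)
        s = zero(eltype(xs))
        for x in xs
            s += muladd(a, b, x)   # a*b + x, evaluated however is fastest;
        end                        # the product a*b may be hoisted
        return s
    end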

@simonbyrne (Contributor)

@eschnett It depends on what level you're programming at. For things like double-double arithmetic, you need the accuracy, but you don't want to fall back to using a software fma.

@jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f, on October 11, 2014
@simonbyrne (Contributor)

I've had a quick play around with this (thanks to @xianyi for access to the openblas Haswell machine). Two things I've noticed:

  1. I've been unable to get muladd to ever actually use an fma, despite it being empirically faster.
  2. For double-double multiplication, you often end up with sequences such as:

         hi = x*y
         lo = fma(x,y,-hi)

     The resulting code_native output indicates that the second line performs an explicit negation followed by a vfmadd213sd (multiply-add) instruction, rather than using a single vfmsubXXXsd (multiply-subtract) instruction.
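
For context, such sequences come from the standard error-free two-product transformation; here is a textbook-style sketch (illustrative, not code from this PR):

    # Split x*y exactly into hi + lo: hi is the rounded product and
    # lo recovers the rounding error with a single fma.
    function two_prod(x::Float64, y::Float64)
        hi = x * y
        lo = fma(x, y, -hi)   # ideally one fused multiply-subtract instruction
        return hi, lo
    end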

@ArchRobison (Contributor)

@simonbyrne: Are you using LLVM 3.3 or 3.5? My impression is that LLVM 3.5 has better support for Haswell instructions, though I have not played with fma.

@eschnett (Contributor, Author)

These look like LLVM issues to me. Can anyone confirm that the muladd implementation is correct, and "should" work? Or do we need to add some kind of annotation to the generated code to tell LLVM that muladd can be optimized to fma if fma is faster?

@simonbyrne (Contributor)

I just updated to 3.5, but that didn't help with either issue.

@ArchRobison (Contributor)

I found the problem with muladd. Our LLVM version symbols are a bit misleading: #ifdef LLVM33 really means "if LLVM 3.3 or higher". I changed the affected section to:

    HANDLE(muladd_float,3)
#ifdef LLVM34
    {
      assert(y->getType() == x->getType());
      assert(z->getType() == y->getType());
      return builder.CreateCall3
        (Intrinsic::getDeclaration(jl_Module, Intrinsic::fmuladd,
                                   ArrayRef<Type*>(x->getType())),
         FP(x), FP(y), FP(z));
    }
#else
      return builder.CreateFAdd(builder.CreateFMul(FP(x), FP(y)), FP(z));
#endif

and muladd got me a vfmadd213ss on a Haswell box.

@ArchRobison (Contributor) commented on this excerpt of the diff:

         FP(x), FP(y), FP(z));
    }
    HANDLE(muladd_float,3)
    #ifdef LLVM33

Per my note above, this needs to be #ifdef LLVM34 and the then/else parts swapped.

@eschnett (Contributor, Author)

Thanks for the pointers.

@simonbyrne (Contributor)

I just tried this branch again: both issues I raised above now seem fixed.

@simonbyrne (Contributor)

I would be in favour of keeping muladd: it is distinct from the full @fastmath, in that it typically doesn't involve a loss of accuracy. One example use case where you probably don't want the full @fastmath would be for use inside the @horner macro.
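
To make the @horner connection concrete, here is a hypothetical muladd-based Horner evaluation (a sketch of the pattern, not the macro's actual expansion):

    # Evaluate cs[1] + cs[2]*x + cs[3]*x^2 + ... by Horner's rule;
    # each step is a single multiply-add that may fuse to an fma.
    function horner(x, cs...)
        r = cs[end]
        for i in length(cs)-1:-1:1
            r = muladd(x, r, cs[i])
        end
        return r
    end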

@eschnett (Contributor, Author) commented Jan 7, 2015

One difference between muladd and @fastmath is that @fastmath tags all math operations that it transforms, and those can then be optimized. For example, in

    @fastmath y = a*b+x
    @fastmath z = c*d+y

there is no guarantee that a*b+x is evaluated before c*d+y. Instead, all * and + operators are tagged, so that this code is equivalent to @fastmath a*b+c*d+x. The fact that there are two separate @fastmath declarations is irrelevant.

I assume that there are cases where one may want a different behaviour. In my mind, muladd(a,b,x) is allowed to evaluate a*b+x in the most efficient manner (in particular, in a more exact manner), but further re-associations and optimizations are not permitted.

Since the cases of @fastmath, fma, and muladd are somewhat different, and in the hope that separating things helps with the code review and getting-this-thing-pulled-into-master, I've split out muladd. @simonbyrne, if you like muladd, could you open a separate issue? I will then revive my muladd code and create a pull request. After @fastmath and fma have been handled.

@ViralBShah changed the title from "Implement fma and mad" to "Implement fma and muladd" on Jan 7, 2015
@eschnett (Contributor, Author) commented Jan 7, 2015

The Travis failure seems unrelated.

@eschnett force-pushed the fma branch 2 times, most recently from abad2b6 to cf27a93, on January 13, 2015
@simonbyrne (Contributor)

Are there any objections to merging this?

@ViralBShah (Member)

This looks like it should be ready to merge.

@eschnett (Contributor, Author)

I'll do the final rebase and handle the conflicts.

@eschnett changed the title from "Implement fma and muladd" to "Implement fma" on Jan 19, 2015
@eschnett (Contributor, Author)

Rebased. Note that this only implements fma; since there was some objection to muladd I took it out. I'll open another pull request for muladd.

From the documentation added in the PR: `fma(x,y,z)` calculates `x*y+z` without rounding the intermediate result `x*y`.
@jakebolewski (Member)

@eschnett can you squash this again? It's best not to have test failures in the commit history, to help out git-bisect.

@simonbyrne (Contributor)

What is the best way to export TargetLowering::isFMAFasterThanFMulAndFAdd (from an earlier comment in this thread)?

@eschnett (Contributor, Author)

@simonbyrne We can always add a new intrinsic for this.

@eschnett (Contributor, Author)

Squashed.

jakebolewski added a commit that referenced this pull request on Jan 19, 2015
@jakebolewski merged commit 06e2137 into JuliaLang:master on Jan 19, 2015
@jakebolewski (Member)

Thanks again @eschnett!

@eschnett deleted the fma branch on January 19, 2015 at 20:16
@ViralBShah (Member)

👍
