Implement muladd
#9840
Conversation
Fantastic, thanks.
@@ -986,6 +986,20 @@ static Value *emit_intrinsic(intrinsic f, jl_value_t **args, size_t nargs,
                                         ArrayRef<Type*>(x->getType())),
                            FP(x), FP(y), FP(z));
    }
    HANDLE(muladd_float,3)
#ifdef LLVM34
The API changed for only v3.4? Will this work on v3.3 and versions >= v3.5?
`LLVM34` doesn't do what you probably think it does -- it means "LLVM 3.4 or later".
Ah yes, sorry for the noise.
Thanks again, it's good to have this.
These other variants are not exposed by LLVM as intrinsics. I assume that LLVM's optimizer will generate these automatically when it encounters nearby
Then why have
On the other hand, a more fine-grained set of options would be possible. For example, LLVM distinguishes between "assume there are no nans", "assume there are no infs", "don't care about the sign of zero", and "allow reciprocals instead of division". Another flag could be "flush subnormal numbers to zero". But such a large set of options would be quite confusing.
Last time I checked, LLVM was able to transform things like
into the relevant fused multiply-subtract instruction.
Okay, on a Haswell machine running LLVM 3.5:
and
So it does look like it can handle the transformations correctly.
@simonbyrne, good to know that
There is one downside, however: on a non-fma machine (using LLVM 3.3)
the code takes an explicit negation, rather than using a subtraction (the other case is fine).
Yikes, that's unfortunate.
Ah, it does appear to be fixed with LLVM 3.5, so I guess that's not too big of a problem.
Is there some way to put in a regression test for this, in case LLVM changes its mind again in the future?
That's difficult from Julia. We don't really care about what instructions LLVM generates; we care that it executes as fast as possible. Instruction selection (and register selection and instruction scheduling) isn't typically something that we worry about in Julia. I think this should be an LLVM test case instead, checking that the LLVM backend performs this transformation. I checked, and there doesn't seem to be such a check at the moment. This test
Currently Julia's own IR is represented as data, but as soon as we get to LLVM IR we just print a bunch of text. A great project would be to expose LLVM IR as data in Julia – that would make testing for this kind of thing possible and possibly even straightforward.
I've thought the same thing - #8275 (comment). Having Julia data structures for LLVM IR (and assembly too?) could be pretty useful. If there's not an existing up-for-grabs issue, should we open one?
It might be better to stop at LLVM IR. As long as we only expose LLVM IR in Julia, it's hard to write inherently non-portable code, which is kind of nice. I guess if we wanted to reify machine code, each platform could have a different platform-specific representation. Please do open an issue!
Oooh, LLVM passes written in Julia!
(This branch is based on the `fma` branch, which is likely to be merged soon (#8112). Since the implementations of `fma` and `muladd` are similar, there would otherwise be many conflicts.)

Implement `muladd(x,y,z)`, a fast way to calculate `x*y+z`.

This is very different from `fma(x,y,z)`, which also calculates `x*y+z`. `fma` is about accuracy; it guarantees that the intermediate result `x*y` is not rounded. This may be very slow on some platforms. `muladd` is guaranteed to be fast, and will use architecture-specific instructions if available. If `fma` happens to be the fastest way to perform this operation, then `muladd` will be equivalent to `fma`.