
Implement fma #8112

Merged 4 commits into JuliaLang:master on Jan 19, 2015
Conversation

@eschnett (Contributor)

I've implemented fma(x,y,z) and mad(x,y,z), as a follow-up to #6330. I'd be happy to receive comments.
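
For readers unfamiliar with the distinction: fma(x,y,z) computes x*y+z with a single rounding, while mad (later renamed muladd) computes it in whatever way is fastest. A small hypothetical illustration, not taken from the PR:

    # fma rounds once, so it can recover rounding error that x*y + z loses.
    x = 1.0 + 2.0^-27
    y = 1.0 - 2.0^-27      # true product is 1 - 2^-54
    fma(x, y, -1.0)        # -2.0^-54: the product kept full precision
    x*y - 1.0              # 0.0: x*y was first rounded to 1.0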

A reviewer (Member) commented on this excerpt of the diff:

    Type *tz = z->getType();
    Type *ts[3] = { tx, ty, tz };
    return builder.CreateCall3
        (jl_Module->getOrInsertFunction(tx==T_float64 ? "fma" : "fmaf",

Probably better to use the llvm intrinsic here.

@StefanKarpinski (Member)

I'd prefer to call mad muladd – the former is inscrutable and means "mean absolute deviation" in statistics while the latter is fairly clear but still pretty short.

@eschnett (Contributor, Author)

I have no preference as to the name; muladd seems better.

@simonbyrne (Contributor)

Fantastic! I made some general comments on #6330, but I think even if we do go down that direction, we will probably still want to expose these functions.

  • I don't think the fallback for fma should be fma{T<:Number}(x::T, y::T, z::T) = x*y+z. This is fine for integers (as we do modulo arithmetic), but not necessarily true for arbitrary numeric types.
  • For BigFloats you can call the fma function from MPFR (a sketch follows below).
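
A sketch of what such an MPFR-backed method could look like (hypothetical; the 0.3-era &x ccall syntax and the ROUNDING_MODE constant are assumptions modeled on Base.MPFR, not code from this patch):

    # Hypothetical sketch: forward to MPFR's correctly rounded mpfr_fma.
    function fma(x::BigFloat, y::BigFloat, z::BigFloat)
        r = BigFloat()
        ccall((:mpfr_fma, :libmpfr), Int32,
              (Ptr{BigFloat}, Ptr{BigFloat}, Ptr{BigFloat}, Ptr{BigFloat}, Int32),
              &r, &x, &y, &z, ROUNDING_MODE[end])  # ROUNDING_MODE is assumed
        return r
    end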

@ArchRobison (Contributor)

muladd_float appears to be defined, but not used anywhere. Also, should it expand to fma on platforms where fma is generally faster than separate + and *? (I say "generally" because there are corner cases, at least on Intel platforms, where the separate operations are faster because of pipelining effects.)

Thanks for doing this. I can cross it off my to-do list :-)

@eschnett (Contributor, Author) commented Oct 3, 2014

I don't see muladd_float; where is it defined?

Yes, muladd should expand to the fastest way a*b+c can be evaluated. In the future, this should be lowered to the respective LLVM intrinsic for types where it exists, which would automatically choose the appropriate instruction.

@ArchRobison (Contributor) commented on this excerpt of the diff:

    @@ -6,6 +6,7 @@ namespace JL_I {
         neg_int, add_int, sub_int, mul_int,
         sdiv_int, udiv_int, srem_int, urem_int, smod_int,
         neg_float, add_float, sub_float, mul_float, div_float, rem_float,
    +    fma_float, muladd_float,

muladd_float seems to be defined as an intrinsic, but the intrinsic appears to be unused by the rest of the patch.

@ArchRobison (Contributor)

I see now that LLVM 3.4.2 has both the mandatory fusion llvm.fma.* and optional fusion llvm.fmuladd intrinsics. (I had missed the latter.)

I marked the first occurrence of muladd_float in a line note.

@StefanKarpinski (Member)

I don't much like the idea of opening the door to simple sequential arithmetic code giving different answers on different machines based on their performance characteristics. I'm not convinced that the performance advantage of fma over separate mul and add is ever that great.

That leaves accuracy as a motivation for fma – and indeed, there are cases where you really need this. But in that case, it is incorrect to do the mul and add separately. So I guess what I'm saying is that I think we should only expose the mandatory fma operation.

@ArchRobison (Contributor)

If only the mandatory fma is exposed at first, then we could play around with its effect on benchmarks. I tried to build the branch, but ran into some unrelated problem:

    $ julia
    symbol could not be found jl_sysimg_gvars (-1): /localdisk/adrobiso/fma/julia/usr/bin/../lib/julia/sys.so: undefined symbol: jl_sysimg_gvars
    symbol could not be found jl_globalUnique (-1): /localdisk/adrobiso/fma/julia/usr/bin/../lib/julia/sys.so: undefined symbol: jl_globalUnique
    Segmentation fault (core dumped)

@simonster (Member)

@StefanKarpinski I can't comment on the performance advantage of muladd, but we already have a lot of functions that may give slightly different results depending on machine performance characteristics. This is true of many functions that use @simd (now including sum) as well as some matrix operations (which can also give different results depending on whether OpenBLAS threads are enabled). For the @simd cases we might be able to tell the loop vectorizer to always interleave n iterations of the loop, but I don't think we can do anything about OpenBLAS. Given the present situation, I don't think a muladd function that is explicitly documented to give different answers on different machines would be overly surprising.

@StefanKarpinski (Member)

It's true that we can't always avoid this, but explicitly introducing things that are unpredictable feels odd.

@eschnett (Contributor, Author) commented Oct 3, 2014

The case for muladd is very much the same as for @simd: improved performance, in exchange for accepting possible slight changes to the results. (Not necessarily less accurate -- muladd would produce more accurate results if it uses fma.) So are you suggesting we remove @simd?

Alternatively, we can introduce @muladd that would automatically find cases where it can be applied by rewriting expressions. Internally, it would probably generate function calls to a new function unsafe_muladd.
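
A minimal sketch of such a rewriting macro (entirely hypothetical: the name to_muladd and the whole implementation are invented here, it rewrites to plain muladd rather than the proposed unsafe_muladd, and it only handles the two-argument a*b + c pattern):

    # Hypothetical @muladd: recursively rewrite a*b + c into muladd(a, b, c).
    macro muladd(ex)
        esc(to_muladd(ex))
    end

    function to_muladd(ex)
        isa(ex, Expr) || return ex
        args = map(to_muladd, ex.args)   # rewrite subexpressions first
        if ex.head == :call && args[1] == :+ && length(args) == 3 &&
           isa(args[2], Expr) && args[2].head == :call &&
           args[2].args[1] == :* && length(args[2].args) == 3
            return :(muladd($(args[2].args[2]), $(args[2].args[3]), $(args[3])))
        end
        return Expr(ex.head, args...)
    end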

@simonbyrne (Contributor)

Would it also be possible to have a way of determining whether or not hardware fma is available? C has the FP_FAST_FMA macro: could we export this as a boolean constant?

@eschnett (Contributor, Author) commented Oct 3, 2014

We could. However, it is more elegant to let the programmer specify what the algorithm needs ("accurate", "fast"), and then let Julia choose what is right for a platform. Whether an actual fma instruction will be used for a particular construct depends on the code generator, and may not be known until machine code is generated.

For example, the optimizer may determine that the multiplication can be hoisted out of a loop, leaving only the addition inside the loop. If you check for such a feature, and then use an algorithm with an explicit fma, this optimization is not possible any more.
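
A hypothetical example of the hoisting point (not from the patch): in the loop below, muladd leaves the compiler free to compute the invariant product a*b once outside the loop, whereas an explicit fma(a, b, x) would pin the fused operation inside it.

    function sum_shifted(a, b, xs)
        s = zero(eltype(xs))
        for x in xs
            s += muladd(a, b, x)   # a*b + x, evaluated however is fastest;
        end                        # the product a*b may be hoisted
        return s
    end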

@simonbyrne (Contributor)

@eschnett It depends on what level you're programming at. For things like double-double arithmetic, you need the accuracy, but you don't want to fall back to using a software fma.

@jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f, on October 11, 2014
@simonbyrne (Contributor)

I've had a quick play around with this (thanks to @xianyi for access to the openblas Haswell machine). Two things I've noticed:

  1. I've been unable to get muladd to ever actually use an fma, despite it being empirically faster.
  2. For double-double multiplication, you often end up with sequences such as:

         hi = x*y
         lo = fma(x,y,-hi)

     The resulting code_native output indicates that the second line performs an explicit negation followed by a vfmadd213sd (multiply-add) instruction, rather than using a single vfmsubXXXsd (multiply-subtract) instruction.
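
For context, such sequences come from the standard error-free two-product transformation; here is a textbook-style sketch (illustrative, not code from this PR):

    # Split x*y exactly into hi + lo: hi is the rounded product and
    # lo recovers the rounding error with a single fma.
    function two_prod(x::Float64, y::Float64)
        hi = x * y
        lo = fma(x, y, -hi)   # ideally one fused multiply-subtract instruction
        return hi, lo
    end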

@ArchRobison (Contributor)

@simonbyrne: Are you using LLVM 3.3 or 3.5? My impression is that LLVM 3.5 has better support for Haswell instructions, though I have not played with fma.

@eschnett (Contributor, Author)

These look like LLVM issues to me. Can anyone confirm that the muladd implementation is correct, and "should" work? Or do we need to add some kind of annotation to the generated code to tell LLVM that muladd can be optimized to fma if fma is faster?

@simonbyrne (Contributor)

I just updated to 3.5, but that didn't help with either issue.

@ArchRobison (Contributor)

I found the problem with muladd. Our LLVM version symbols are a bit misleading: #ifdef LLVM33 really means "if LLVM 3.3 or higher". I changed the affected section to:

    HANDLE(muladd_float,3)
#ifdef LLVM34
    {
      assert(y->getType() == x->getType());
      assert(z->getType() == y->getType());
      return builder.CreateCall3
        (Intrinsic::getDeclaration(jl_Module, Intrinsic::fmuladd,
                                   ArrayRef<Type*>(x->getType())),
         FP(x), FP(y), FP(z));
    }
#else
      return builder.CreateFAdd(builder.CreateFMul(FP(x), FP(y)), FP(z));
#endif

and muladd got me a vfmadd213ss on a Haswell box.

@ArchRobison (Contributor) commented on this excerpt of the diff:

         FP(x), FP(y), FP(z));
    }
    HANDLE(muladd_float,3)
    #ifdef LLVM33

Per my note above, this needs to be #ifdef LLVM34 and the then/else parts swapped.

@eschnett (Contributor, Author)

Thanks for the pointers.

@simonbyrne (Contributor)

I just tried this branch again: both issues I raised above now seem fixed.

@simonbyrne (Contributor)

I would be in favour of keeping muladd: it is distinct from the full @fastmath, in that it typically doesn't involve a loss of accuracy. One example use case where you probably don't want the full @fastmath would be for use inside the @horner macro.
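
To make the @horner connection concrete, here is a hypothetical muladd-based Horner evaluation (a sketch of the pattern, not the macro's actual expansion):

    # Evaluate cs[1] + cs[2]*x + cs[3]*x^2 + ... by Horner's rule;
    # each step is a single multiply-add that may fuse to an fma.
    function horner(x, cs...)
        r = cs[end]
        for i in length(cs)-1:-1:1
            r = muladd(x, r, cs[i])
        end
        return r
    end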

@eschnett (Contributor, Author) commented Jan 7, 2015

One difference between muladd and @fastmath is that @fastmath tags all math operations that it transforms, and those can then be optimized. For example, in

    @fastmath y = a*b+x
    @fastmath z = c*d+y

there is no guarantee that a*b+x is evaluated before c*d+y. Instead, all * and + operators are tagged, so that this code is equivalent to @fastmath a*b+c*d+x. The fact that there are two separate @fastmath declarations is irrelevant.

I assume that there are cases where one may want a different behaviour. In my mind, muladd(a,b,x) is allowed to evaluate a*b+x in the most efficient manner (in particular, in a more exact manner), but further re-associations and optimizations are not permitted.

Since the cases of @fastmath, fma, and muladd are somewhat different, and in the hope that separating things helps with the code review and getting-this-thing-pulled-into-master, I've split out muladd. @simonbyrne, if you like muladd, could you open a separate issue? I will then revive my muladd code and create a pull request. After @fastmath and fma have been handled.

@ViralBShah changed the title from "Implement fma and mad" to "Implement fma and muladd" on Jan 7, 2015
@eschnett (Contributor, Author) commented Jan 7, 2015

The Travis failure seems unrelated.

@eschnett force-pushed the fma branch 2 times, most recently from abad2b6 to cf27a93, on January 13, 2015
@simonbyrne (Contributor)

Are there any objections to merging this?

@ViralBShah (Member)

This looks like it should be ready to merge.

@eschnett (Contributor, Author)

I'll do the final rebase and handle the conflicts.

@eschnett changed the title from "Implement fma and muladd" to "Implement fma" on Jan 19, 2015
@eschnett (Contributor, Author)

Rebased. Note that this only implements fma; since there was some objection to muladd I took it out. I'll open another pull request for muladd.

From the documentation added in the PR: `fma(x,y,z)` calculates `x*y+z` without rounding the intermediate result `x*y`.
@jakebolewski (Member)

@eschnett can you squash this again? It's best not to have test failures in the commit history, to help out git-bisect.

@simonbyrne (Contributor)

What is the best way to export TargetLowering::isFMAFasterThanFMulAndFAdd (from an earlier comment in this thread)?

@eschnett (Contributor, Author)

@simonbyrne We can always add a new intrinsic for this.

@eschnett (Contributor, Author)

Squashed.

jakebolewski added a commit that referenced this pull request on Jan 19, 2015
@jakebolewski merged commit 06e2137 into JuliaLang:master on Jan 19, 2015
@jakebolewski (Member)

Thanks again @eschnett!

@eschnett deleted the fma branch on January 19, 2015 at 20:16
@ViralBShah (Member)

👍
