FMA multiversioning. #43085

Merged
merged 7 commits into master from tb/fma_multiversioning on Nov 29, 2021

Conversation

maleadt
Member

@maleadt maleadt commented Nov 15, 2021

Adds a julia.cpu.have_fma intrinsic that gets detected by multiversioning (cloning the function) and lowered afterwards, replacing the calls with simple constants that can then be trivially optimized away. This is a very ad-hoc implementation as a starting point, and it also seems tricky to generalize so that CUDA.jl can reuse the mechanism (i.e. adding it to CodegenParams wouldn't work with how CPU multiversioning currently works).

Demo:

$ ./julia
julia> @code_native fma(1.,2.,3.)
        .text
        .file   "fma"
        .globl  julia_fma_84                    # -- Begin function julia_fma_84
        .p2align        4, 0x90
        .type   julia_fma_84,@function
julia_fma_84:                           # @julia_fma_84
; ┌ @ floatfuncs.jl:414 within `fma`
        .cfi_startproc
# %bb.0:                                # %L4
; │┌ @ floatfuncs.jl:408 within `fma_llvm`
        vfmadd213sd     %xmm2, %xmm1, %xmm0     # xmm0 = (xmm1 * xmm0) + xmm2
; │└
        retq
.Lfunc_end0:
        .size   julia_fma_84, .Lfunc_end0-julia_fma_84
        .cfi_endproc
; └
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
$ ./julia -C sandybridge
julia> @code_native fma(1.,2.,3.)
        .text
        .file   "fma"
        .globl  julia_fma_49                    # -- Begin function julia_fma_49
        .p2align        4, 0x90
        .type   julia_fma_49,@function
julia_fma_49:                           # @julia_fma_49
; ┌ @ floatfuncs.jl:414 within `fma`
        .cfi_startproc
# %bb.0:                                # %L6
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movabsq $j_fma_emulated_51, %rax
        callq   *%rax
        popq    %rax
        .cfi_def_cfa_offset 8
        retq
.Lfunc_end0:
        .size   julia_fma_49, .Lfunc_end0-julia_fma_49
        .cfi_endproc
; └
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

I've also included #42942, but somebody will have to fine-tune the condition when FMA emulation is used.
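
For reference, here is roughly what the lowering amounts to in LLVM terms. This is a minimal sketch, not the actual src/llvm-cpufeatures.cpp code: the helper name and the way the FMA decision is passed in are assumptions.

#include <llvm/ADT/STLExtras.h>
#include <llvm/IR/Constants.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/Module.h>

using namespace llvm;

// Replace every call to a julia.cpu.have_fma.* placeholder with a constant,
// so that instsimplify/simplifycfg can later delete the branch that is not taken.
static void lowerHaveFMA(Module &M, bool HaveFMA) {
    for (Function &F : M) {
        if (!F.getName().startswith("julia.cpu.have_fma"))
            continue;
        for (User *U : make_early_inc_range(F.users())) {
            if (auto *CI = dyn_cast<CallInst>(U)) {
                CI->replaceAllUsesWith(ConstantInt::get(CI->getType(), HaveFMA ? 1 : 0));
                CI->eraseFromParent();
            }
        }
    }
}

In the cloned (FMA-capable) copy the constant is true and the native path survives; in the base copy it is false and only the emulated path remains, which is what the demo above shows.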

@maleadt maleadt added the maths (Mathematical functions) and compiler:codegen (Generation of LLVM IR and native code) labels on Nov 15, 2021
Contributor

@yuyichao yuyichao left a comment

I feel like this should just be part of the cloning pass, which would give it more knowledge about whether something needs to be cloned; e.g. the cloning triggered by this on AArch64 would be completely unnecessary.

Note that this still won't give people the same API in C, i.e. you won't be using FMA if the sysimg was compiled without it even if it's available at runtime, but this should be a good enough starting point for sure.

src/llvm-cpufeatures.cpp (outdated review thread, resolved)
@oscardssmith
Member

oscardssmith commented Nov 15, 2021

Can you check that this fixes #43088?
The easiest way to do this is to remove if !(@static Sys.iswindows() && Int===Int64) (line 1291) from test/math.jl and see if CI passes.

@yuyichao
Contributor

Since this does not change any fma implementation, only which one is used in various situations, this would not fix #43088 and would at most potentially hide it. This is especially the case since which one to use still depends on build-time information (just not only on the default target anymore).

@maleadt
Member Author

maleadt commented Nov 15, 2021

Thanks for the review. I agree it would be better to just put this in the multiversioning pass so that it can do more fine-grained cloning. I'll update the PR.

Since this does not change any fma implementation, only which one is used in various situations, this would not fix #43088 and would at most potentially hide it.

If the goal is to work around bad or missing FMA implementations, the have_fma predicate could always be extended to include the platform so that we use the emulated implementation on Windows.

@yuyichao
Contributor

If the goal is to work around bad or missing FMA implementations, the have_fma predicate could always be extended to include the platform so that we use the emulated implementation on Windows.

Sure. I'm just saying that this mostly has nothing to do with the issue as it stands. And in any case, just running the test blindly isn't very useful, especially without any investigation into what the issue actually is.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch 2 times, most recently from 3b7e715 to 36a7f6f on November 16, 2021 12:00
@maleadt
Member Author

maleadt commented Nov 16, 2021

Improved the feature detection and made sure we don't clone when not needed (e.g. on AArch64).

@maleadt maleadt marked this pull request as draft November 16, 2021 16:43
src/llvm-cpufeatures.cpp (outdated review thread, resolved)
base/floatfuncs.jl (outdated review thread, resolved)
@maleadt
Member Author

maleadt commented Nov 17, 2021

Re. the added passes: they are only needed at O1 or below, where we don't really optimize enough to clean up the following pattern that comes from FMA multiversioning:

; ┌ @ floatfuncs.jl:412 within `have_fma`
; │┌ @ promotion.jl:473 within `==`
    %5 = icmp eq i64 1, 1
    %6 = zext i1 %5 to i8
; └└
  %7 = trunc i8 %6 to i1
  %8 = xor i1 %7, true
  br i1 %8, label %L6, label %L4

It requires both an instsimplify, to get to a simple br i1 false, label %L6, label %L4, and a subsequent simplifycfg to clean this up. But we don't run instcombine at O1, so moving the pass earlier doesn't help much. Having codegen lower the intrinsic won't help either, because we'd still end up with the above code pattern. I've reordered the passes to avoid an extra CFG simplification pass at O1, at least.

On normal optimization levels none of this matters, since we run enough passes after multiversioning anyway to clean up this code.
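
For context, a sketch of the kind of ordering being described, using the legacy pass manager. This is illustrative only; the actual pipeline setup in the Julia sources differs, and the multiversioning pass is Julia-specific (its name is assumed here).

#include <llvm/IR/LegacyPassManager.h>
#include <llvm/Transforms/Scalar.h>

using namespace llvm;

static void addLateLoweringCleanup(legacy::PassManager &PM) {
    // PM.add(createMultiVersioningPass());  // Julia's multiversioning pass (name assumed)
    PM.add(createInstSimplifyLegacyPass());  // folds e.g. `icmp eq i64 1, 1` into a constant
    PM.add(createCFGSimplificationPass());   // removes the now-dead branch and block
}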

@simonbyrne
Contributor

Nice! This can be used to get rid of FMA_NATIVE and fix #33011.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch from e759d3b to aa4321d on November 18, 2021 10:19
@maleadt
Member Author

maleadt commented Nov 18, 2021

OK, I've removed the checks on FMA_NATIVE, as we should be able to trust fma to be accurate now. However, since we were previously checking empirically whether fma worked, we may have to add additional entries to the 'blacklist' in the new LLVM pass that triggers use of emulated instead of native FMA.

@oscardssmith
Member

I think the better fix here is to just call the fma intrinsic. Using an emulated fma here will be much more expensive.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch 2 times, most recently from b117145 to 854eaa8 on November 18, 2021 15:20
JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)
{
    JL_TYPECHK(have_fma, datatype, typ);
    // TODO: run-time feature check?
Contributor

Runtime feature check here, if implemented, should match the global target being used. And it will essentially always be allowed to return false conservatively since it can't return the exact value unless it knows something about the code calling it.

Member

Check processor.cpp for this info.

Member Author

I know; let's just keep it returning false for now. The overhead of calling this runtime intrinsic is likely going to be larger than the cost of an emulated FMA anyway.

Member

I do think this should be implemented, e.g. to give us the option of running in the interpreter and getting the same answer.
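
For the record, a hypothetical sketch of what such a runtime check could look like. This is not what the PR does (it keeps the stub returning false), and it assumes the existing jl_test_cpu_feature helper and the JL_X86_fma constant from src/processor.h; as noted above, a real implementation should also match the global target in use, not just the raw host features.

JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)
{
    JL_TYPECHK(have_fma, datatype, typ);
#if defined(_CPU_X86_64_) || defined(_CPU_X86_)
    // Host CPUID check only; would need to be reconciled with processor.cpp's
    // notion of the active target before being used for real.
    return jl_test_cpu_feature(JL_X86_fma) ? jl_true : jl_false;
#else
    return jl_false; // conservative: always allowed to answer "no"
#endif
}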

@maleadt maleadt marked this pull request as ready for review November 19, 2021 06:45
@maleadt
Member Author

maleadt commented Nov 19, 2021

Removed the FMA_NATIVE changes; we can do that in a follow-up PR. Also included Windows in the list of platforms to use an emulated FMA on, to work around #43088.

@maleadt
Member Author

maleadt commented Nov 19, 2021

CI failure is #43124.

@simonbyrne
Contributor

Is it worth making this an intrinsic and upstreaming it to LLVM (as suggested here: https://twitter.com/stephentyrone/status/1461854303149305857)?

@oscardssmith
Member

Eventually, maybe. For now, I'd rather just get this working.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch from 659d193 to f8d5bdc on November 25, 2021 15:50
@maleadt
Member Author

maleadt commented Nov 25, 2021

Rebased and added a test. Since this looks stable and fixes actual issues, I propose we merge this and work on improvements/extensions later. Maybe after the US holidays to give people another chance at reviewing.

@JeffreySarnoff
Contributor

+1 for merge this and work on improvements/extensions later

@oscardssmith oscardssmith merged commit 5b4fd5f into master Nov 29, 2021
@oscardssmith oscardssmith deleted the tb/fma_multiversioning branch November 29, 2021 16:15
@oscardssmith
Member

Thanks so much for getting this working!

Comment on lines +479 to +481
        flag |= JL_TARGET_CLONE_CPU;
    } else {
        flag |= JL_TARGET_CLONE_CPU;
Member

FMA is in JL_TARGET_CLONE_MATH, so this may also need that flag, since the two work in tandem.

Member Author

But this only matches the have_fma intrinsic; presumably the containing function then also contains a call to fma which would have set JL_TARGET_CLONE_MATH.

Member

My question is why you would ever want to clone this call but not clone the fma call itself, since that seems like it would cause divergence in the results returned from the have_fma call vs the use of the fma algorithm, resulting in incorrect program behaviors.

Member Author

why you would ever want to clone this call but not clone the fma call itself

You wouldn't, but that also can't happen. The check for llvm.fma is right before that, and would set the necessary clone flag:

if (name.startswith("llvm.muladd.") || name.startswith("llvm.fma.")) {
    flag |= JL_TARGET_CLONE_MATH;
}

Member

The fma call and the have_fma call are probably in different functions, however.

    Attribute FSAttr = caller.getFnAttribute("target-features");
    StringRef FS =
        FSAttr.isValid() ? FSAttr.getValueAsString() : jl_TargetMachine->getTargetFeatureString();

    SmallVector<StringRef, 6> Features;
    FS.split(Features, ',');
Member

From talking to @oscardssmith earlier, this isn't quite sufficient to be correct, since we also need to remove from the list any feature that wasn't present in the particular sysimg copy that we are using. Otherwise we are setting have_fma here, but fma support may actually be disabled in the places where we are using it.

Member Author

I'm not sure I understand. This function will generally only use the feature string from the target-features function attribute, which will represent the features for this particular sysimg copy. Unless that attribute isn't set, in which case we can just use the native target's features. Doesn't that suffice?

Member

No, that would return incorrect answers at runtime for the have_fma function, since the attributes set on the function may be a superset of the attributes that were available to other compilation units.
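
For context, the per-function query under discussion boils down to roughly the following (simplified sketch; the helper name is made up, and the real check also covers the AArch64/ARM feature names):

#include <llvm/ADT/SmallVector.h>
#include <llvm/ADT/StringRef.h>
#include <llvm/IR/Function.h>
#include <llvm/Target/TargetMachine.h>

using namespace llvm;

// Does this particular (possibly cloned) function have scalar FMA enabled?
static bool functionHasFMA(const Function &Caller, const TargetMachine &TM) {
    Attribute FSAttr = Caller.getFnAttribute("target-features");
    StringRef FS = FSAttr.isValid() ? FSAttr.getValueAsString()
                                    : TM.getTargetFeatureString();
    SmallVector<StringRef, 6> Features;
    FS.split(Features, ',');
    for (StringRef Feature : Features)
        if (Feature == "+fma" || Feature == "+fma4") // the two x86 FMA flavors
            return true;
    return false;
}

As the comment above points out, this only describes the clone currently being compiled; it says nothing about copies of code in other compilation units that the caller may end up using.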


    Attribute FSAttr = caller.getFnAttribute("target-features");
    StringRef FS =
        FSAttr.isValid() ? FSAttr.getValueAsString() : jl_TargetMachine->getTargetFeatureString();
Member

The access to jl_TargetMachine here is problematic. It should instead use the TLI to query the TM for the features, so that this pass can be run through opt.

Member Author

It wasn't clear to me how to do getAnalysisUsage(TargetLibraryInfoWrapperPass) with the new pass manager infrastructure here.

Member

In conversation we decided that TargetLibraryInfo is actually the wrong thing, but TargetTransformInfo might be the right one. It already has things like https://llvm.org/doxygen/classllvm_1_1TargetTransformInfo.html#ac31bf22f119c5a99c36646d8e0eb2c0f, but it doesn't expose information about FMA, which we might have to add ourselves first.

Alternatively, this pass could take the TM on construction; still meh.

Member

So it turns out, in offline discussion, that this possibly wasn't even the correct variable to look at anyway. We instead probably want to query processor.cpp to ask whether the currently running system image was compiled with support for +fma in the architecture flags. Though we may want to now make it incompatible to load a "-fma" image onto a "+fma" processor (now similar to having a vector-size mismatch)?

Contributor

@yuyichao yuyichao Jan 20, 2022

Though we may want to now make it incompatible to load a "-fma" image onto a "+fma" processor (now similar to having a vector-size mismatch)?

Why should that be incompatible? Loading a non-fma image onto a fma processor should not cause any problems AFAICT.

There are two possible semantics for this intrinsic,

  1. it returns result for the features the code is compiled for
  2. it returns the result for the features available at runtime (which might also be different from the host processor one if the user disables it on the command line)

For the former, it's possible for different calls of this function to return different values in different parts of the code, but as long as the information is used more or less locally it should not cause much of a problem. It'll still guarantee that if it returns true, the feature will be available on the processor and also available for the code around it to actually use.
For the latter, it's not particularly useful when it returns a different result from the former, since it's not generally possible to make LLVM emit code in part of a function that uses features unavailable to the function. In other words, if this is the semantics of the intrinsic, then when has_fma() returns true it's not guaranteed that fma will be used for the code around it.

Taking the use of this intrinsic into account when making cloning decisions will help a little, but it is not a guarantee of consistency, since there's no guarantee that the matching fma target is provided. Adding an fma target for each non-fma target would fix this, and make the two semantics exactly the same, but that's not a scalable solution, especially when done globally. It's not helped by the fact that on x86 there are two fma flavors (so you need two fma targets per non-fma target). And it's definitely not scalable, as in it'll blow up exponentially, when feature detection for other features is added. Locally adding this isn't going to help very much either, though, since it'll rely too much on the inlining decision.

GCC avoids this scaling problem by having different notation for enabling features per function and for detecting them, and I think that's what we need to do as well, with some tweaks. The intrinsic should return what's available to the compiler, but the user should also mark the function that uses the result, as well as any indirect user of the result (e.g. some caller and callee of this function that uses the detection results), as "create additional clone for xyz features". (edit: and the intrinsic with this semantic will still be useful, and have behavior that is predictable even without a way to mark functions for additional cloning. It's just that it may require the right global cloning flags to be fully useful. The sysimg build flags for the binary build should already handle this quite well for fma)
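
For comparison, a rough sketch of the GCC/Clang mechanism being referred to (x86, illustrative only; the function names are made up). Per-function feature enabling and runtime detection are two separate notations:

#include <cmath>

// FMA is enabled for this one function only; the compiler may use vfmadd here
// even if the translation unit's baseline target lacks it.
__attribute__((target("fma")))
static double fma_native(double x, double y, double z) {
    return std::fma(x, y, z);
}

// Baseline-target fallback (libm / emulated path).
static double fma_fallback(double x, double y, double z) {
    return std::fma(x, y, z);
}

double fma_dispatch(double x, double y, double z) {
    // Runtime detection is independent of which features a function is built with.
    return __builtin_cpu_supports("fma") ? fma_native(x, y, z)
                                         : fma_fallback(x, y, z);
}

The point above is that without this kind of per-function marking, consistency between the detection result and the code consuming it has to come from cloning whole targets, which multiplies quickly as more features are detected.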

would it make sense to add this as an LLVM intrinsic?

Depending on which flavor you want. The former flavor (the one that returns the current compilation feature set) doesn't have a GCC extension correspondence, but might be generally useful for other frontends because it has well-defined semantics (though I haven't heard of any; not that I looked very hard). The latter flavor (the one that returns the runtime feature set) is what clang will need if/when they implement the target_clones attribute, but their semantics will be different from ours even if we implement this flavor here, and it will definitely use a different lowering (the info will come from different places), so it wouldn't help the pass very much.

Member

The intrinsic should return what's available to the compiler, but the user should also mark the function that uses the result, as well as any indirect user of the result (e.g. some caller and callee of this function that uses the detection results), as "create additional clone for xyz features".

This sounds similar to https://rust-lang.github.io/rfcs/2045-target-feature.html / https://doc.rust-lang.org/reference/attributes/codegen.html#the-target_feature-attribute

So the example that Jameson showed me of a current usage that is non-local is:

fma(x::Float32, y::Float32, z::Float32) = Core.Intrinsics.have_fma(Float32) ? fma_llvm(x,y,z) : fma_emulated(x,y,z)

@inline function two_mul(x::T, y::T) where T<: Union{Float16, Float32}
    if Core.Intrinsics.have_fma(T)
        xy = x*y
        return xy, fma(x, y, -xy)
    end
    xy = widen(x)*y
    Txy = T(xy)
    return Txy, T(xy-Txy)
end

If the function fma is baked into the sysimage and the function two_mul isn't, we have a corner case if the sysimage has been compiled with -fma. The runtime query would answer "yes, this processor has fast fma", and so using it to decide whether or not the copy of fma is fast is faulty.

Member

Right, this is not hypothetical; it is based specifically on the usages we have now.

as in it'll blow up exponentially

There are never going to be exponentially many processors produced. We can also gate the runtime feature based on the sysimage (similar to vector length).

The relevant question here is only whether we gate our have_fma pseudo-feature alone, or whether we guarantee that access to fma itself is also feature-gated identically to the value of have_fma.

Member

Currently we fail to feature gate have_fma to have any particular relationship to fma (it is neither conservative nor optimistic), except in the current function being compiled. That situation is not good.

Contributor

@yuyichao yuyichao Jan 20, 2022

So the example that Jameson showed me of a current usage that is non-local is:

But returning a different value here in the fma function is also bad, since the fma feature still won't be available in the fma function. The issue is what target the fma function is compiled for, not what value has_fma returns in either function. As long as the fma Julia function is compiled without the fma feature, it won't work correctly in this usage pattern.

In this particular case it might still have the expected behavior, but only because the libm implementation of fma might be doing the dispatch (and it can do the right thing in that dispatch not because it uses runtime information, but because it compiled a version of the function with the fma feature enabled). fma_llvm still won't use fma instructions. (And FWIW the openlibm version isn't...)

There are never going to be exponentially many processors produced. We can also gate the runtime feature based on the sysimage (similar to vector length).

Without a well-defined feature dependency there are exponentially many processor feature combinations that need to be supported. That exponential space can only be reduced if you know all the processors that exist. Even if it's possible to create a way to reduce things to a minimal set, it's still not feasible to ensure future compatibility for the processor feature-set database used to generate such a reduction map.

The relevant question here is only whether we gate our have_fma pseudo-feature alone, or whether we guarantee that access to fma itself is also feature-gated identically to the value of have_fma.

No, we should not gate anything like that. For now it's only about gating access to fma, but soon enough you'll want to gate access to all the features. That basically means the runtime feature set has to be identical to the sysimg feature set, and that's exactly what I was setting out to avoid when I wrote the dispatch logic.

As I said, the right fix for this is really for the user to clone the right functions (returning the runtime value doesn't actually work, and gating the feature set is way too restrictive). That's in general a call for the user to make, but in this case, and at least as a stopgap measure, it would be fine to make that decision in the compiler for fma. We could have local cloning and dispatch just for fma (or its different flavors) for functions that use have_fma or fma (at the LLVM level). The dispatch logic would basically need to write a flag to let the caller pick the right function from the sysimg.

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
* Add FMA multiversioning.
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
* Add FMA multiversioning.