FMA multiversioning. #43085

Merged
merged 7 commits into master from tb/fma_multiversioning on Nov 29, 2021

Conversation

maleadt
Member

@maleadt maleadt commented Nov 15, 2021

Adds a julia.cpu.have_fma intrinsic that gets detected by multiversioning (cloning the function) and lowered afterwards, replacing the calls with simple constants that can then be trivially optimized away. This is a very ad-hoc implementation as a starting point, and it also seems tricky to generalize so that CUDA.jl can reuse the mechanism (i.e. adding it to CodegenParams wouldn't work with how CPU multiversioning currently works).

Demo:

$ ./julia
julia> @code_native fma(1.,2.,3.)
        .text
        .file   "fma"
        .globl  julia_fma_84                    # -- Begin function julia_fma_84
        .p2align        4, 0x90
        .type   julia_fma_84,@function
julia_fma_84:                           # @julia_fma_84
; ┌ @ floatfuncs.jl:414 within `fma`
        .cfi_startproc
# %bb.0:                                # %L4
; │┌ @ floatfuncs.jl:408 within `fma_llvm`
        vfmadd213sd     %xmm2, %xmm1, %xmm0     # xmm0 = (xmm1 * xmm0) + xmm2
; │└
        retq
.Lfunc_end0:
        .size   julia_fma_84, .Lfunc_end0-julia_fma_84
        .cfi_endproc
; └
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
$ ./julia -C sandybridge
julia> @code_native fma(1.,2.,3.)
        .text
        .file   "fma"
        .globl  julia_fma_49                    # -- Begin function julia_fma_49
        .p2align        4, 0x90
        .type   julia_fma_49,@function
julia_fma_49:                           # @julia_fma_49
; ┌ @ floatfuncs.jl:414 within `fma`
        .cfi_startproc
# %bb.0:                                # %L6
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movabsq $j_fma_emulated_51, %rax
        callq   *%rax
        popq    %rax
        .cfi_def_cfa_offset 8
        retq
.Lfunc_end0:
        .size   julia_fma_49, .Lfunc_end0-julia_fma_49
        .cfi_endproc
; └
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

I've also included #42942, but somebody will have to fine-tune the condition when FMA emulation is used.
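
For reference, here is roughly what the lowering amounts to in LLVM terms. This is a minimal sketch, not the actual src/llvm-cpufeatures.cpp code: the helper name and the way the FMA decision is passed in are assumptions.

#include <llvm/ADT/STLExtras.h>
#include <llvm/IR/Constants.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/Module.h>

using namespace llvm;

// Replace every call to a julia.cpu.have_fma.* placeholder with a constant,
// so that instsimplify/simplifycfg can later delete the branch that is not taken.
static void lowerHaveFMA(Module &M, bool HaveFMA) {
    for (Function &F : M) {
        if (!F.getName().startswith("julia.cpu.have_fma"))
            continue;
        for (User *U : make_early_inc_range(F.users())) {
            if (auto *CI = dyn_cast<CallInst>(U)) {
                CI->replaceAllUsesWith(ConstantInt::get(CI->getType(), HaveFMA ? 1 : 0));
                CI->eraseFromParent();
            }
        }
    }
}

In the cloned (FMA-capable) copy the constant is true and the native path survives; in the base copy it is false and only the emulated path remains, which is what the demo above shows.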

@maleadt maleadt added the maths (Mathematical functions) and compiler:codegen (Generation of LLVM IR and native code) labels on Nov 15, 2021
Contributor

@yuyichao yuyichao left a comment

I feel like this should just be part of the cloning pass, which would give it more knowledge about whether something needs to be cloned; e.g. the cloning triggered by this on AArch64 would be completely unnecessary.

Note that this still won't give people the same API in C, i.e. you won't be using FMA if the sysimg was compiled without it even if it's available at runtime, but this should be a good enough starting point for sure.

src/llvm-cpufeatures.cpp (outdated review thread, resolved)
@oscardssmith
Member

oscardssmith commented Nov 15, 2021

Can you check that this fixes #43088?
The easiest way to do this is to remove if !(@static Sys.iswindows() && Int===Int64) (line 1291) from test/math.jl and see if CI passes.

@yuyichao
Contributor

Since this does not change any fma implementation, only which one is used in various situations, this would not fix #43088 and would at most potentially hide it. This is especially the case since which one to use still depends on build-time information (just not only on the default target anymore).

@maleadt
Member Author

maleadt commented Nov 15, 2021

Thanks for the review. I agree it would be better to just put this in the multiversioning pass so that it can do more fine-grained cloning. I'll update the PR.

Since this does not change any fma implementation, only which one is used in various situations, this would not fix #43088 and would at most potentially hide it.

If the goal is to work around bad or missing FMA implementations, the have_fma predicate could always be extended to include the platform so that we use the emulated implementation on Windows.

@yuyichao
Contributor

If the goal is to work around bad or missing FMA implementations, the have_fma predicate could always be extended to include the platform so that we use the emulated implementation on Windows.

Sure. I'm just saying that this mostly has nothing to do with the issue as it stands. And in any case, just running the test blindly isn't very useful, especially without any investigation into what the issue actually is.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch 2 times, most recently from 3b7e715 to 36a7f6f on November 16, 2021 12:00
@maleadt
Member Author

maleadt commented Nov 16, 2021

Improved the feature detection and made sure we don't clone when not needed (e.g. on AArch64).

@maleadt maleadt marked this pull request as draft November 16, 2021 16:43
src/llvm-cpufeatures.cpp (outdated review thread, resolved)
base/floatfuncs.jl (outdated review thread, resolved)
@maleadt
Member Author

maleadt commented Nov 17, 2021

Re. the added passes: they are only needed at O1 or below, where we don't really optimize enough to clean up the following pattern that comes from FMA multiversioning:

; ┌ @ floatfuncs.jl:412 within `have_fma`
; │┌ @ promotion.jl:473 within `==`
    %5 = icmp eq i64 1, 1
    %6 = zext i1 %5 to i8
; └└
  %7 = trunc i8 %6 to i1
  %8 = xor i1 %7, true
  br i1 %8, label %L6, label %L4

It requires both an instsimplify, to get to a simple br i1 false, label %L6, label %L4, and a subsequent simplifycfg to clean this up. But we don't run instcombine at O1, so moving the pass earlier doesn't help much. Having codegen lower the intrinsic won't help either, because we'd still end up with the above code pattern. I've reordered the passes to avoid an extra CFG simplification pass at O1, at least.

On normal optimization levels none of this matters, since we run enough passes after multiversioning anyway to clean up this code.
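
For context, a sketch of the kind of ordering being described, using the legacy pass manager. This is illustrative only; the actual pipeline setup in the Julia sources differs, and the multiversioning pass is Julia-specific (its name is assumed here).

#include <llvm/IR/LegacyPassManager.h>
#include <llvm/Transforms/Scalar.h>

using namespace llvm;

static void addLateLoweringCleanup(legacy::PassManager &PM) {
    // PM.add(createMultiVersioningPass());  // Julia's multiversioning pass (name assumed)
    PM.add(createInstSimplifyLegacyPass());  // folds e.g. `icmp eq i64 1, 1` into a constant
    PM.add(createCFGSimplificationPass());   // removes the now-dead branch and block
}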

@simonbyrne
Contributor

Nice! This can be used to get rid of FMA_NATIVE and fix #33011.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch from e759d3b to aa4321d on November 18, 2021 10:19
@maleadt
Member Author

maleadt commented Nov 18, 2021

OK, I've removed the checks on FMA_NATIVE, as we should be able to trust fma to be accurate now. However, since we were previously checking empirically whether fma worked, we may have to add additional entries to the 'blacklist' in the new LLVM pass that triggers use of emulated instead of native FMA.

@oscardssmith
Member

I think the better fix here is to just call the fma intrinsic. Using an emulated fma here will be much more expensive.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch 2 times, most recently from b117145 to 854eaa8 on November 18, 2021 15:20
JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)
{
    JL_TYPECHK(have_fma, datatype, typ);
    // TODO: run-time feature check?
Contributor

Runtime feature check here, if implemented, should match the global target being used. And it will essentially always be allowed to return false conservatively since it can't return the exact value unless it knows something about the code calling it.

Member

Check processor.cpp for this info.

Member Author

I know; let's just keep it returning false for now. The overhead of calling this runtime intrinsic is likely going to be larger than the cost of an emulated FMA anyway.

Member

I do think this should be implemented, e.g. to give us the option of running in the interpreter and getting the same answer.
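
For the record, a hypothetical sketch of what such a runtime check could look like. This is not what the PR does (it keeps the stub returning false), and it assumes the existing jl_test_cpu_feature helper and the JL_X86_fma constant from src/processor.h; as noted above, a real implementation should also match the global target in use, not just the raw host features.

JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)
{
    JL_TYPECHK(have_fma, datatype, typ);
#if defined(_CPU_X86_64_) || defined(_CPU_X86_)
    // Host CPUID check only; would need to be reconciled with processor.cpp's
    // notion of the active target before being used for real.
    return jl_test_cpu_feature(JL_X86_fma) ? jl_true : jl_false;
#else
    return jl_false; // conservative: always allowed to answer "no"
#endif
}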

@maleadt maleadt marked this pull request as ready for review November 19, 2021 06:45
@maleadt
Member Author

maleadt commented Nov 19, 2021

Removed the FMA_NATIVE changes; we can do that in a follow-up PR. Also included Windows in the list of platforms to use an emulated FMA on, to work around #43088.

@maleadt
Member Author

maleadt commented Nov 19, 2021

CI failure is #43124.

@simonbyrne
Contributor

Is it worth making this an intrinsic and upstreaming it to LLVM (as suggested here: https://twitter.com/stephentyrone/status/1461854303149305857)?

@oscardssmith
Member

Eventually, maybe. For now, I'd rather just get this working.

@maleadt maleadt force-pushed the tb/fma_multiversioning branch from 659d193 to f8d5bdc on November 25, 2021 15:50
@maleadt
Member Author

maleadt commented Nov 25, 2021

Rebased and added a test. Since this looks stable and fixes actual issues, I propose we merge this and work on improvements/extensions later. Maybe after the US holidays to give people another chance at reviewing.

@JeffreySarnoff
Contributor

+1 for merge this and work on improvements/extensions later

@oscardssmith oscardssmith merged commit 5b4fd5f into master Nov 29, 2021
@oscardssmith oscardssmith deleted the tb/fma_multiversioning branch November 29, 2021 16:15
@oscardssmith
Member

Thanks so much for getting this working!

Comment on lines +479 to +481
        flag |= JL_TARGET_CLONE_CPU;
    } else {
        flag |= JL_TARGET_CLONE_CPU;
Member

FMA is in JL_TARGET_CLONE_MATH, so this may also need that flag, since the two work in tandem.

Member Author

But this only matches the have_fma intrinsic; presumably the containing function then also contains a call to fma which would have set JL_TARGET_CLONE_MATH.

Member

My question is why you would ever want to clone this call but not clone the fma call itself, since that seems like it would cause divergence in the results returned from the have_fma call vs the use of the fma algorithm, resulting in incorrect program behaviors.

Member Author

why you would ever want to clone this call but not clone the fma call itself

You wouldn't, but that also can't happen. The check for llvm.fma is right before that, and would set the necessary clone flag:

if (name.startswith("llvm.muladd.") || name.startswith("llvm.fma.")) {
    flag |= JL_TARGET_CLONE_MATH;
}

Member

The fma call and the have_fma call are probably in different functions, however.

    Attribute FSAttr = caller.getFnAttribute("target-features");
    StringRef FS =
        FSAttr.isValid() ? FSAttr.getValueAsString() : jl_TargetMachine->getTargetFeatureString();

    SmallVector<StringRef, 6> Features;
    FS.split(Features, ',');
Member

From talking to @oscardssmith earlier, this isn't quite sufficient to be correct, since we also need to remove from the list any feature that wasn't present in the particular sysimg copy that we are using. Otherwise we are setting have_fma here, but fma support may actually be disabled in the places where we are using it.

Member Author

I'm not sure I understand. This function will generally only use the feature string from the target-features function attribute, which will represent the features for this particular sysimg copy. Unless that attribute isn't set, in which case we can just use the native target's features. Doesn't that suffice?

Member

No, that would return incorrect answers at runtime for the have_fma function, since the attributes set on the function may be a superset of the attributes that were available to other compilation units.
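
For context, the per-function query under discussion boils down to roughly the following (simplified sketch; the helper name is made up, and the real check also covers the AArch64/ARM feature names):

#include <llvm/ADT/SmallVector.h>
#include <llvm/ADT/StringRef.h>
#include <llvm/IR/Function.h>
#include <llvm/Target/TargetMachine.h>

using namespace llvm;

// Does this particular (possibly cloned) function have scalar FMA enabled?
static bool functionHasFMA(const Function &Caller, const TargetMachine &TM) {
    Attribute FSAttr = Caller.getFnAttribute("target-features");
    StringRef FS = FSAttr.isValid() ? FSAttr.getValueAsString()
                                    : TM.getTargetFeatureString();
    SmallVector<StringRef, 6> Features;
    FS.split(Features, ',');
    for (StringRef Feature : Features)
        if (Feature == "+fma" || Feature == "+fma4") // the two x86 FMA flavors
            return true;
    return false;
}

As the comment above points out, this only describes the clone currently being compiled; it says nothing about copies of code in other compilation units that the caller may end up using.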


    Attribute FSAttr = caller.getFnAttribute("target-features");
    StringRef FS =
        FSAttr.isValid() ? FSAttr.getValueAsString() : jl_TargetMachine->getTargetFeatureString();
Member

The access to jl_TargetMachine here is problematic. It should instead use the TLI to query the TM for the features, so that this pass can be run through opt.

Member Author

It wasn't clear to me how to do getAnalysisUsage(TargetLibraryInfoWrapperPass) with the new pass manager infrastructure here.

Member

In conversation we decided that TargetLibraryInfo is actually the wrong thing, but TargetTransformInfo might be the right one. It already has things like https://llvm.org/doxygen/classllvm_1_1TargetTransformInfo.html#ac31bf22f119c5a99c36646d8e0eb2c0f, but it doesn't expose information about FMA, which we might have to add ourselves first.

Alternatively, this pass could take the TM on construction; still meh.

Member

So it turns out, in offline discussion, that this possibly wasn't even the correct variable to look at anyway. We instead probably want to query processor.cpp to ask whether the currently running system image was compiled with support for +fma in the architecture flags. Though we may want to now make it incompatible to load a "-fma" image onto a "+fma" processor (now similar to having a vector-size mismatch)?

Contributor

@yuyichao yuyichao Jan 20, 2022

Though we may want to now make it incompatible to load a "-fma" image onto a "+fma" processor (now similar to having a vector-size mismatch)?

Why should that be incompatible? Loading a non-fma image onto a fma processor should not cause any problems AFAICT.

There are two possible semantics for this intrinsic,

  1. it returns result for the features the code is compiled for
  2. it returns the result for the features available at runtime (which might also be different from the host processor one if the user disables it on the command line)

For the former, it's possible for different calls of this function to return different values in different parts of the code, but as long as the information is used more or less locally it should not cause much of a problem. It'll still guarantee that if it returns true, the feature will be available on the processor and also available for the code around it to actually use.
For the latter, it's not particularly useful when it returns a different result from the former, since it's not generally possible to make LLVM emit code in part of a function that uses features unavailable to the function. In other words, if this is the semantics of the intrinsic, then when has_fma() returns true it's not guaranteed that fma will be used for the code around it.

Taking the use of this intrinsic into account when making cloning decisions will help a little, but it is not a guarantee of consistency, since there's no guarantee that the matching fma target is provided. Adding an fma target for each non-fma target would fix this, and make the two semantics exactly the same, but that's not a scalable solution, especially when done globally. It's not helped by the fact that on x86 there are two fma flavors (so you need two fma targets per non-fma target). And it's definitely not scalable, as in it'll blow up exponentially, when feature detection for other features is added. Locally adding this isn't going to help very much either, though, since it'll rely too much on the inlining decision.

GCC avoids this scaling problem by having different notation for enabling features per function and for detecting them, and I think that's what we need to do as well, with some tweaks. The intrinsic should return what's available to the compiler, but the user should also mark the function that uses the result, as well as any indirect user of the result (e.g. some caller and callee of this function that uses the detection results), as "create additional clone for xyz features". (edit: and the intrinsic with this semantic will still be useful, and have behavior that is predictable even without a way to mark functions for additional cloning. It's just that it may require the right global cloning flags to be fully useful. The sysimg build flags for the binary build should already handle this quite well for fma)
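
For comparison, a rough sketch of the GCC/Clang mechanism being referred to (x86, illustrative only; the function names are made up). Per-function feature enabling and runtime detection are two separate notations:

#include <cmath>

// FMA is enabled for this one function only; the compiler may use vfmadd here
// even if the translation unit's baseline target lacks it.
__attribute__((target("fma")))
static double fma_native(double x, double y, double z) {
    return std::fma(x, y, z);
}

// Baseline-target fallback (libm / emulated path).
static double fma_fallback(double x, double y, double z) {
    return std::fma(x, y, z);
}

double fma_dispatch(double x, double y, double z) {
    // Runtime detection is independent of which features a function is built with.
    return __builtin_cpu_supports("fma") ? fma_native(x, y, z)
                                         : fma_fallback(x, y, z);
}

The point above is that without this kind of per-function marking, consistency between the detection result and the code consuming it has to come from cloning whole targets, which multiplies quickly as more features are detected.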

would it make sense to add this as an LLVM intrinsic?

Depending on which flavor you want. The former flavor (the one that returns the current compilation feature set) doesn't have a GCC extension correspondence, but might be generally useful for other frontends because it has well-defined semantics (though I haven't heard of any; not that I looked very hard). The latter flavor (the one that returns the runtime feature set) is what clang will need if/when they implement the target_clones attribute, but their semantics will be different from ours even if we implement this flavor here, and it will definitely use a different lowering (the info will come from different places), so it wouldn't help the pass very much.

Member

The intrinsic should return what's available to the compiler, but the user should also mark the function that uses the result, as well as any indirect user of the result (e.g. some caller and callee of this function that uses the detection results), as "create additional clone for xyz features".

This sounds similar to https://rust-lang.github.io/rfcs/2045-target-feature.html / https://doc.rust-lang.org/reference/attributes/codegen.html#the-target_feature-attribute

So the example that Jameson showed me of a current usage that is non-local is:

fma(x::Float32, y::Float32, z::Float32) = Core.Intrinsics.have_fma(Float32) ? fma_llvm(x,y,z) : fma_emulated(x,y,z)

@inline function two_mul(x::T, y::T) where T<: Union{Float16, Float32}
    if Core.Intrinsics.have_fma(T)
        xy = x*y
        return xy, fma(x, y, -xy)
    end
    xy = widen(x)*y
    Txy = T(xy)
    return Txy, T(xy-Txy)
end

If the function fma is baked into the sysimage and the function two_mul isn't, we have a corner case if the sysimage has been compiled with -fma. The runtime query would answer "yes, this processor has fast fma", and so using it to decide whether or not the copy of fma is fast is faulty.

Member

Right, this is not hypothetical; it is based specifically on the usages we have now.

as in it'll blow up exponentially

There are never going to be exponentially many processors produced. We can also gate the runtime feature based on the sysimage (similar to vector length).

The relevant question here is only whether we gate our have_fma pseudo-feature alone, or whether we guarantee that access to fma itself is also feature-gated identically to the value of have_fma.

Member

Currently we fail to feature gate have_fma to have any particular relationship to fma (it is neither conservative nor optimistic), except in the current function being compiled. That situation is not good.

Contributor

@yuyichao yuyichao Jan 20, 2022

So the example that Jameson showed me of a current usage that is non-local is:

But returning a different value here in the fma function is also bad, since the fma feature still won't be available in the fma function. The issue is what target the fma function is compiled for, not what value has_fma returns in either function. As long as the fma Julia function is compiled without the fma feature, it won't work correctly in this usage pattern.

In this particular case it might still have the expected behavior, but only because the libm implementation of fma might be doing the dispatch (and it can do the right thing in that dispatch not because it uses runtime information, but because it compiled a version of the function with the fma feature enabled). fma_llvm still won't use fma instructions. (And FWIW the openlibm version isn't...)

There are never going to be exponentially many processors produced. We can also gate the runtime feature based on the sysimage (similar to vector length).

Without a well-defined feature dependency there are exponentially many processor feature combinations that need to be supported. That exponential space can only be reduced if you know all the processors that exist. Even if it's possible to create a way to reduce things to a minimal set, it's still not feasible to ensure future compatibility for the processor feature-set database used to generate such a reduction map.

The relevant question here is only whether we gate our have_fma pseudo-feature alone, or whether we guarantee that access to fma itself is also feature-gated identically to the value of have_fma.

No, we should not gate anything like that. For now it's only about gating access to fma, but soon enough you'll want to gate access to all the features. That basically means the runtime feature set has to be identical to the sysimg feature set, and that's exactly what I was setting out to avoid when I wrote the dispatch logic.

As I said, the right fix for this is really for the user to clone the right functions (returning the runtime value doesn't actually work, and gating the feature set is way too restrictive). That's in general a call for the user to make, but in this case, and at least as a stopgap measure, it would be fine to make that decision in the compiler for fma. We could have local cloning and dispatch just for fma (or its different flavors) for functions that use have_fma or fma (at the LLVM level). The dispatch logic would basically need to write a flag to let the caller pick the right function from the sysimg.

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
* Add FMA multiversioning.
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
* Add FMA multiversioning.