FMA multiversioning. #43085
Conversation
I feel like this should just be part of the cloning pass which will give it more knowledge about whether something needs to be cloned, e.g. the cloning trigger from this on aarch64 would be completely unnecessary....
Note that this still won't give people the same API in C, i.e. you won't be using FMA if the sysimg was compiled without it even if it's available at runtime but this should be a good enough starting point for sure.
Can you check that this fixes #43088?
Since this does not change any fma implementations, merely which one is used in various situations, this would not fix #43088 and would at most potentially hide it. This is especially the case since which one to use still depends on build-time information (just not only on the default target anymore).
Thanks for the review. I agree it would be better to just put this in the multiversioning pass so that it can do more fine-grained cloning. I'll update the PR.
If the goal is to work around bad or missing FMA implementations, the
Sure. I'm just saying that this mostly has nothing to do with the issue as it stands. And in any case just running the test blindly isn't very useful, especially without any investigation into what the issue actually is.
Force-pushed from 3b7e715 to 36a7f6f.
Improved the feature detection and made sure we don't clone when not needed (e.g. on AArch64).
Re. the added passes: they are only needed at lower optimization levels, where after lowering `have_fma` the code looks like:

```llvm
; ┌ @ floatfuncs.jl:412 within `have_fma`
; │┌ @ promotion.jl:473 within `==`
   %5 = icmp eq i64 1, 1
   %6 = zext i1 %5 to i8
; └└
   %7 = trunc i8 %6 to i1
   %8 = xor i1 %7, true
   br i1 %8, label %L6, label %L4
```

It requires both an instsimplify to get to a simple constant branch, and a CFG simplification to remove the dead path. On normal optimization levels this all doesn't matter, since we run enough passes after multiversioning anyway to clean up this code.
Force-pushed from 36a7f6f to e759d3b.
Nice! This can be used to get rid of `FMA_NATIVE`.
Force-pushed from e759d3b to aa4321d.
OK, I've removed the checks on `FMA_NATIVE`.
I think the better fix here is to just call the fma intrinsic. Using an emulated fma here will be much more expensive.
Force-pushed from b117145 to 854eaa8.
```c
JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)
{
    JL_TYPECHK(have_fma, datatype, typ);
    // TODO: run-time feature check?
```
Runtime feature check here, if implemented, should match the global target being used. And it will essentially always be allowed to return `false` conservatively, since it can't return the exact value unless it knows something about the code calling it.
check processor.cpp for this info.
I know; let's just keep it returning `false` for now. The overhead of calling this runtime intrinsic is likely going to be larger than the cost of an emulated FMA anyway.
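For reference, a minimal sketch of the conservative behaviour settled on above, assuming the signature from the diff (`JL_TYPECHK` and `jl_false` are existing runtime helpers); a real runtime check would consult processor.cpp as suggested:

```c
// Sketch only, not the PR's implementation: answering "no FMA" is always safe,
// since callers must treat a false answer as "take the emulated path".
JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)
{
    JL_TYPECHK(have_fma, datatype, typ);
    // TODO: run-time feature check against the global target (see processor.cpp);
    // until then, conservatively report that FMA is unavailable.
    return jl_false;
}
```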
I do think this should be implemented, e.g. to give us the option of running in the interpreter and getting the same answer.
Removed the `FMA_NATIVE` changes; we can do that in a follow-up PR. Also included Windows in the list of platforms to use an emulated FMA on, to work around #43088.
CI failure is #43124.
Is it worth making this an intrinsic and upstreaming it to LLVM (suggested here: https://twitter.com/stephentyrone/status/1461854303149305857)?
Eventually, maybe. For now, I'd rather just get this working.
Don't clone on platforms that always support it, and more accurately check CPU features.
Force-pushed from 659d193 to f8d5bdc.
Rebased and added a test. Since this looks stable and fixes actual issues, I propose we merge this and work on improvements/extensions later, maybe after the US holidays to give people another chance at reviewing.
+1 for merging this and working on improvements/extensions later.
Thanks so much for getting this working!
```cpp
    flag |= JL_TARGET_CLONE_CPU;
} else {
    flag |= JL_TARGET_CLONE_CPU;
```
FMA is in `JL_TARGET_CLONE_MATH`, so this may need that flag too, since they work in tandem.
But this only matches the `have_fma` intrinsic; presumably the containing function then also contains a call to `fma`, which would have set `JL_TARGET_CLONE_MATH`.
My question is why you would ever want to clone this call but not clone the `fma` call itself, since that seems like it would cause divergence between the result returned from the `have_fma` call and the behavior of the `fma` algorithm, resulting in incorrect program behavior.
> why you would ever want to clone this call but not clone the fma call itself

You wouldn't, but that also can't happen. The check for `llvm.fma` is right before that, and would set the necessary clone flag (julia/src/llvm-multiversioning.cpp, lines 467 to 469 at 5864e43):

```cpp
if (name.startswith("llvm.muladd.") || name.startswith("llvm.fma.")) {
    flag |= JL_TARGET_CLONE_MATH;
}
```
The `fma` call and the `have_fma` call are probably in different functions, however.
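For illustration, the variant being debated here, sketched in the style of the existing check (hedged: this is not what the PR does, and whether `JL_TARGET_CLONE_MATH` belongs here is exactly the open question; the `julia.cpu.have_fma.` prefix is assumed from the intrinsic name in the description):

```cpp
// Hypothetical variant: treat a have_fma query as math-sensitive as well, so
// the containing function is cloned for both CPU- and math-feature targets.
if (name.startswith("julia.cpu.have_fma.")) {
    flag |= JL_TARGET_CLONE_CPU | JL_TARGET_CLONE_MATH;
}
```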
```cpp
    FSAttr.isValid() ? FSAttr.getValueAsString() : jl_TargetMachine->getTargetFeatureString();

SmallVector<StringRef, 6> Features;
FS.split(Features, ',');
```
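For context, a hedged sketch of how the split list might then be consulted (the `FS`/`Features` names come from the snippet above; treating `"+fma"`/`"+fma4"` as the relevant x86 feature spellings is an assumption here, not necessarily the PR's exact logic):

```cpp
// Illustrative only: scan the per-function feature list for an FMA flavor and
// use the result to decide the constant that have_fma lowers to.
bool HasFMA = false;
for (StringRef Feature : Features)
    if (Feature == "+fma" || Feature == "+fma4")
        HasFMA = true;
```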
From talking to @oscardssmith earlier this isn't quite sufficient to be correct, since we next also need to remove from the list any feature that wasn't present in the particular sysimg copy that we are using. Otherwise we are setting have_fma here, but fma support may actually be disabled in the places we are using it.
I'm not sure I understand. This function will generally only use the feature string from the `target-features` function attribute, which will represent the features for this particular sysimg copy. Unless that attribute isn't set, in which case we can just use the native target's features. Doesn't that suffice?
No, that would return incorrect answers at runtime for the `have_fma` function, since the attributes set on the function may be a superset of the attributes that were available to other compilation units.
```cpp
Attribute FSAttr = caller.getFnAttribute("target-features");
StringRef FS =
    FSAttr.isValid() ? FSAttr.getValueAsString() : jl_TargetMachine->getTargetFeatureString();
```
The access to `jl_TargetMachine` here is problematic. It should instead use the TLI to query the TM for the features, so that this pass can be run through `opt`.
It wasn't clear to me how to do `getAnalysisUsage(TargetLibraryInfoWrapperPass)` with the new pass manager infrastructure here.
I think the canonical way is something like https://github.com/JuliaLang/julia/pull/42463/files#diff-8256bffe96cd66c411bbca57c488532cdba63fc13d346548181f7a303dec5e9aR2689-R2695
In conversation we decided that `TargetLibraryInfo` is actually the wrong thing, but `TargetTransformInfo` might be the right thing. It already has things like https://llvm.org/doxygen/classllvm_1_1TargetTransformInfo.html#ac31bf22f119c5a99c36646d8e0eb2c0f, but it doesn't expose information about FMA, which we might have to add ourselves first.
Alternatively this pass could take the TM on construction; still meh.
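A hedged sketch of the "take the TM on construction" alternative mentioned above, with illustrative names rather than the PR's actual pass:

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Target/TargetMachine.h"

using namespace llvm;

// Illustrative only: a new-PM module pass that is handed the TargetMachine at
// construction, so it never needs to reach for the global jl_TargetMachine.
struct CPUFeaturesSketch : PassInfoMixin<CPUFeaturesSketch> {
    TargetMachine &TM;
    CPUFeaturesSketch(TargetMachine &TM) : TM(TM) {}

    PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
        for (Function &F : M) {
            // Mirror the snippet under review: prefer the per-function
            // "target-features" attribute, fall back to the TM's string.
            Attribute FSAttr = F.getFnAttribute("target-features");
            StringRef FS = FSAttr.isValid() ? FSAttr.getValueAsString()
                                            : TM.getTargetFeatureString();
            (void)FS; // ... the actual have_fma lowering would use FS here
        }
        return PreservedAnalyses::all();
    }
};
```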
So, it turns out, in offline discussion, that this possibly wasn't even the correct variable to look at anyway. We instead probably wanted to query the processor.cpp file to ask whether the currently running system image was compiled with support for `+fma` in the architecture flags. Though we may want to now make it incompatible to load a "-fma" image onto a "+fma" processor (similar to having a vector-size mismatch)?
> Though we may want to now make it incompatible to load a "-fma" image onto a "+fma" processor (similar to having a vector-size mismatch)?

Why should that be incompatible? Loading a non-fma image onto an fma processor should not cause any problems AFAICT.

There are two possible semantics for this intrinsic:

- it returns the result for the features the code is compiled for
- it returns the result for the features available at runtime (which might also be different from the host processor's if the user disables it on the command line)

For the former, it's possible for different calls of this function to return different values in different parts of the code, but as long as the information is used more or less locally it should not cause much of a problem. It'll still guarantee that if it returns true, the feature will be available on the processor and also available for the code around it to actually use.

For the latter, it's not particularly useful when it returns a different result from the former, since it's not generally possible to make LLVM emit code in part of a function that uses features unavailable to the function as a whole. In other words, if this is the semantics of the intrinsic, then when `has_fma()` returns true it is not guaranteed that fma will be used for the code around it.

Taking the use of this intrinsic into account when making cloning decisions will help a little, but it is not a guarantee of consistency, since there's no guarantee that the matching fma target is provided. Adding an fma target for each non-fma target would fix this and make the two semantics exactly the same, but that's not a scalable solution, especially when done globally. It's not helped by the fact that on x86 there are two fma flavors (so you need two fma targets per non-fma target). And it's definitely not scalable, as in it'll blow up exponentially, when feature detection for other features is added. Locally adding this isn't going to help very much either, though, since it'll rely too much on inlining decisions.

GCC avoids this scaling problem by having different notations for enabling features per function and for detecting them, and I think that's what we need to do as well, with some tweaks. The intrinsic should return what's available to the compiler, but the user should also mark the function that uses the result, as well as any indirect user of the result (e.g. some caller or callee of this function that uses the detection results), as "create additional clones for xyz features". (Edit: the intrinsic with this semantic will still be useful, and have predictable behavior, even without a way to mark functions for additional cloning; it's just that it may require the right global cloning flags to be fully useful. The sysimg build flags for the binary builds should already handle this quite well for fma.)
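For comparison, the two GCC mechanisms referred to above look roughly like this (standard GCC extensions on x86/glibc, shown as plain C rather than anything Julia-specific):

```c
#include <stdio.h>

/* Runtime detection: what the hardware supports right now. */
int runtime_has_fma(void) { return __builtin_cpu_supports("fma"); }

/* Per-function cloning: GCC compiles an FMA clone and a default clone of this
   function and installs an ifunc resolver that picks one at load time. */
__attribute__((target_clones("fma", "default")))
double madd(double a, double b, double c) { return a * b + c; }

int main(void) {
    printf("cpu has fma: %d, madd: %f\n", runtime_has_fma(), madd(1.0, 2.0, 3.0));
    return 0;
}
```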
> would it make sense to add this as an LLVM intrinsic?

Depending on which flavor you want. For the former flavor (the one that returns the current compilation feature set), it doesn't have a GCC-extension correspondence, but it might be generally useful for other frontends because it has well-defined semantics (though I haven't heard of any, not that I looked very hard). For the latter flavor (the one that returns the runtime feature set), it's what clang will need if/when they implement the `target_clones` attribute, but their semantics will be different from ours even if we implement this flavor here, and it will definitely use different lowering (the info will come from different places), so it wouldn't help the pass very much.
> The intrinsic should return what's available to the compiler, but the user should also mark the function that uses the result, as well as any indirect user of the result (e.g. some caller or callee of this function that uses the detection results), as "create additional clones for xyz features".

This sounds similar to https://rust-lang.github.io/rfcs/2045-target-feature.html / https://doc.rust-lang.org/reference/attributes/codegen.html#the-target_feature-attribute
So the example that Jameson showed me of a current usage that is non-local is:
```julia
fma(x::Float32, y::Float32, z::Float32) = Core.Intrinsics.have_fma(Float32) ? fma_llvm(x,y,z) : fma_emulated(x,y,z)

@inline function two_mul(x::T, y::T) where T <: Union{Float16, Float32}
    if Core.Intrinsics.have_fma(T)
        xy = x*y
        return xy, fma(x, y, -xy)
    end
    xy = widen(x)*y
    Txy = T(xy)
    return Txy, T(xy-Txy)
end
```
If the function `fma` is baked into the sysimage and the function `two_mul` isn't, we have a corner case if the sysimage has been compiled with -fma: the runtime query would answer "yes, this processor has fast fma", and so using it to decide whether or not this copy of `fma` is fast is faulty.
Right, this is not a hypothetical; it is based specifically on the usages we have now.

> as in it'll blow up exponentially

There are never going to be exponentially many processors produced. We can also gate the runtime feature based on the sysimage (similar to vector length).

The relevant question here is only whether we gate our `have_fma` pseudo-feature only, or whether we guarantee that we will feature-gate access to `fma` itself also, to be identical to the value of `have_fma`.
Currently we fail to feature-gate `have_fma` to have any particular relationship to `fma` (it is neither conservative nor optimistic), except in the current function being compiled. That situation is not good.
> So the example that Jameson showed me of a current usage that is non-local is:

But returning a different value here in the `fma` function is also bad, since the fma feature still won't be available in the `fma` function. The issue is what target the `fma` function is compiled for, not what value `has_fma` returns in either function. As long as the `fma` Julia function is compiled without the fma feature, it won't work correctly in this usage pattern.

In this particular case it might still have the expected behavior, but only because the libm implementation of `fma` might be doing the dispatch (and it can do the right thing in that dispatch not because it uses runtime information, but because it compiled a version of the function with the fma feature enabled). `fma_llvm` still won't use fma instructions. (And FWIW the openlibm version isn't...)

> There are never going to be exponentially many processors produced. We can also gate the runtime feature based on the sysimage (similar to vector length).

Without a well-defined feature dependency there are exponentially many processor feature combinations that need to be supported. That exponential space can only be reduced if you know all the processors that exist. Even if it's possible to create a way to reduce things to a minimal set, it's still not feasible to ensure future compatibility for the processor feature-set database used to generate such a reduction map.

> The relevant question here is only whether we gate our `have_fma` pseudo-feature only, or whether we guarantee that we will feature-gate access to `fma` itself also, to be identical to the value of `have_fma`.

No, we should not gate anything like that. For now it's only about gating access to fma, but soon enough you'll want to gate access to all the features. That basically means that the runtime features have to be identical to the sysimg features, and that's exactly what I was setting out to avoid when I wrote the dispatch logic.

As I said, the right fix to this is really for the user to clone the right functions (returning the runtime value doesn't actually work, and gating the feature set is way too restrictive). That's in general a call for the user, but in this case, and at least as a stop-gap measure, it would be fine to make that decision in the compiler for fma. We can have local cloning and dispatch just for fma (or the different flavors of it) for functions that use `have_fma` or `fma` (at the LLVM level). The dispatch logic will basically need to write a flag to let the caller pick the right function from the sysimg.
* Add FMA multiversioning.
Adds a `julia.cpu.have_fma` intrinsic that gets detected by multiversioning (cloning the function) and lowered afterwards, replacing the calls with simple constants that can then be trivially optimized away. Very ad-hoc implementation as a starting point, but it also seems tricky to generalize this so that CUDA.jl can reuse the mechanism (i.e. adding it to `CodegenParams` wouldn't work with how CPU multiversioning currently works).

Demo:

I've also included #42942, but somebody will have to fine-tune the condition when FMA emulation is used.
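To make the description above concrete, here is a hedged sketch of the lowering step it refers to, with illustrative names rather than the PR's exact code: every call to the intrinsic is replaced by a constant answer, after which ordinary simplification folds the guarded branches away.

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Illustrative only: replace every call to a julia.cpu.have_fma.* declaration
// with a constant, assuming the intrinsic returns an integer type; later
// instsimplify/CFG passes then fold the surrounding branch away.
static void lowerHaveFMA(Function &IntrinsicFn, bool TargetHasFMA) {
    Constant *Answer =
        ConstantInt::get(IntrinsicFn.getReturnType(), TargetHasFMA ? 1 : 0);
    for (User *U : make_early_inc_range(IntrinsicFn.users())) {
        if (auto *Call = dyn_cast<CallInst>(U)) {
            Call->replaceAllUsesWith(Answer);
            Call->eraseFromParent();
        }
    }
}
```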