We need a way to specify FP denormal behavior on a per-instruction basis #30633

Open

andykaylor opened this issue Dec 6, 2016 · 12 comments

Labels: bugzilla (Issues migrated from bugzilla), floating-point (Floating-point math)

Comments

@andykaylor (Contributor)

Bugzilla Link: 31285
Version: unspecified
OS: All
CC: @andykaylor, @hfinkel, @arsenm, @pogo59, @rotateright, @yuanfang-chen

Extended Description

As per the discussion in the comments for https://reviews.llvm.org/D27028, we need a way to control the flush-to-zero behavior of floating-point operations on a per-instruction basis.

Currently there is a TargetMachine option and a set of function attributes for controlling denormal behavior. It isn't clear to me whether this approach is sufficient for the needs of all architectures.
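
For reference, a minimal sketch of the function-level control being described, using the "denormal-fp-math" attribute spelling LLVM uses today (the attribute name and values here are illustrative and may not match what existed when this was written):

    ; Function-wide denormal mode: every FP operation in @foo may assume
    ; denormal inputs and outputs are flushed to zero, preserving the sign.
    define float @foo(float %a, float %b) #0 {
      %r = fadd float %a, %b
      ret float %r
    }
    attributes #0 = { "denormal-fp-math"="preserve-sign,preserve-sign" }

An attribute like this applies to a whole function; the rest of this thread is about carrying the same information per instruction.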

@arsenm (Contributor) commented Dec 6, 2016

I think we should have .ftz variants of the various rounding modes, and then a function attribute for the default mode the standard operations will use.

The ftz mode is needed to decide how to correctly lower some operations on AMDGPU. Disabling denormals in these cases is not an optimization, although some other optimizations are allowed when denormals are disabled. Implementing some of the standard operations requires emitting intermediate code that modifies the FP environment; for now we are using custom target nodes for the side-effecting FP instructions.
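
One possible reading of the ".ftz variants" idea, shown purely as a hypothetical sketch (the ".ftz"-suffixed metadata string below does not exist; the real constrained intrinsics only accept rounding strings such as !"round.tonearest"):

    ; Hypothetical: fold flush-to-zero into the rounding-mode metadata of a
    ; constrained intrinsic via an ".ftz" suffix (not real LLVM syntax).
    %x = call float @llvm.experimental.constrained.fadd.f32(
             float %a, float %b,
             metadata !"round.tonearest.ftz",
             metadata !"fpexcept.strict")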

@llvmbot (Member) commented Jan 4, 2017

Matt, it sounds like you are requesting a mechanism for enforcing flush-to-zero behavior as opposed to merely permitting it. Is that accurate?

If we want to provide the same level of generality as

namespace FPDenormal {
  enum DenormalMode {
    IEEE,          // IEEE 754 denormal numbers
    PreserveSign,  // the sign of a flushed-to-zero number is preserved in
                   // the sign of 0
    PositiveZero   // denormals are flushed to positive zero
  };
}

then folding flush-to-zero behavior into the rounding mode argument of the constrained intrinsics seems impractical. Maybe we want to add a third metadata argument for it? Also, do we need to make a distinction between output denormals and input denormals? For x86, the MXCSR has separate FTZ and DAZ controls for output and input denormals, respectively. I don't know what other architectures provide.
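
For context, a constrained-intrinsic call today carries two metadata arguments, so the "third metadata argument" floated here might look like the second call below (the !"denormal.preservesign" string is purely hypothetical):

    ; Today: rounding mode + exception behavior.
    %x = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.dynamic", metadata !"fpexcept.strict")

    ; Hypothetical: a third metadata argument carrying the denormal mode.
    %y = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.dynamic", metadata !"fpexcept.strict",
             metadata !"denormal.preservesign")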

@hfinkel (Collaborator) commented Jan 4, 2017

> Matt, it sounds like you are requesting a mechanism for enforcing
> flush-to-zero behavior as opposed to merely permitting it. Is that accurate?
>
> If we want to provide the same level of generality as
>
> namespace FPDenormal {
>   enum DenormalMode {
>     IEEE,          // IEEE 754 denormal numbers
>     PreserveSign,  // the sign of a flushed-to-zero number is preserved in
>                    // the sign of 0
>     PositiveZero   // denormals are flushed to positive zero
>   };
> }
>
> then folding flush-to-zero behavior into the rounding mode argument of the
> constrained intrinsics seems impractical. Maybe we want to add a third
> metadata argument for it?

This was my recommendation, and I think we should mirror exactly the options we have in the backend. If we need any extensions to the set, then we should add them both to the set the backend understands and to the set the intrinsics accept.

> Also, do we need to make a distinction between
> output denormals and input denormals? For x86, the MXCSR has separate FTZ
> and DAZ controls for output and input denormals, respectively. I don't know
> what other architectures provide.

@arsenm (Contributor) commented Jan 5, 2017

> Matt, it sounds like you are requesting a mechanism for enforcing
> flush-to-zero behavior as opposed to merely permitting it. Is that accurate?

Yes. The lowerings for some operations (fdiv in particular) will change depending on whether denormals are enabled or not since they are expansions and not a single instruction like for example on x86. Additionally some of the OpenCL builtin library math function implementations also need to ensure that denormals are enabled or disabled for some subsequence of the function.

> If we want to provide the same level of generality as
>
> namespace FPDenormal {
>   enum DenormalMode {
>     IEEE,          // IEEE 754 denormal numbers
>     PreserveSign,  // the sign of a flushed-to-zero number is preserved in
>                    // the sign of 0
>     PositiveZero   // denormals are flushed to positive zero
>   };
> }
>
> then folding flush-to-zero behavior into the rounding mode argument of the
> constrained intrinsics seems impractical. Maybe we want to add a third
> metadata argument for it? Also, do we need to make a distinction between
> output denormals and input denormals? For x86, the MXCSR has separate FTZ
> and DAZ controls for output and input denormals, respectively. I don't know
> what other architectures provide.

AMDGPU's FP mode can be set to flush input or output denormals (although I am pretty sure we only ever set input/output to the same value). The subtarget feature we've been using for the default mode doesn't distinguish these.

@andykaylor (Contributor, Author)

> Yes. The lowerings for some operations (fdiv in particular) will change
> depending on whether denormals are enabled or not since they are expansions
> and not a single instruction like for example on x86. Additionally some of
> the OpenCL builtin library math function implementations also need to ensure
> that denormals are enabled or disabled for some subsequence of the function.

If I may drift toward the front end a bit here, how do you envision the denormal argument being set? Is there some language-specific syntax that you expect to trigger it?

In the case of the rounding mode argument as I've proposed it, I'm imagining the front end will always set it to dynamic unless a pragma is used to declare the rounding mode for a given scope. Then, hopefully, I'll be able to write a pass that looks for calls that specifically set the rounding mode and replaces the argument for instructions that we can prove will have a specific value.

I'm just trying to understand how ftz would work as an argument to the intrinsics.

@arsenm (Contributor) commented Jan 7, 2017

> > Yes. The lowerings for some operations (fdiv in particular) will change
> > depending on whether denormals are enabled or not since they are expansions
> > and not a single instruction like for example on x86. Additionally some of
> > the OpenCL builtin library math function implementations also need to ensure
> > that denormals are enabled or disabled for some subsequence of the function.
>
> If I may drift toward the front end a bit here, how do you envision the
> denormal argument being set? Is there some language-specific syntax that
> you expect to trigger it?
>
> In the case of the rounding mode argument as I've proposed it, I'm imagining
> the front end will always set it to dynamic unless a pragma is used to
> declare the rounding mode for a given scope. Then, hopefully, I'll be able
> to write a pass that looks for calls that specifically set the rounding mode
> and replaces the argument for instructions that we can prove will have a
> specific value.
>
> I'm just trying to understand how ftz would work as an argument to the
> intrinsics.

A pragma like the one for the rounding mode, I suppose. For the main use cases we have, writing the handful of wrapper functions necessary in IR would probably be good enough.

@andykaylor (Contributor, Author)

> Yes. The lowerings for some operations (fdiv in particular) will change
> depending on whether denormals are enabled or not since they are expansions
> and not a single instruction like for example on x86. Additionally some of
> the OpenCL builtin library math function implementations also need to ensure
> that denormals are enabled or disabled for some subsequence of the function.

If I understand correctly you want the semantics of the ftz argument to be such that the lowering somehow enforces or guarantees the ftz mode. This is different from what I have proposed for the rounding mode argument.

What I intended with my current proposal for constrained FP intrinsics is that the rounding mode argument will simply tell the optimizer what assumptions it can make about the rounding mode at any given time and if additional instructions are needed to set the rounding mode those will have been inserted separately (by the front end).

I'm not familiar with the AMD GPU architecture, but for X86 targets having the intrinsics guarantee either ftz mode or rounding mode could require, in a worst case scenario, that an extra LDMXCSR instruction (or the x87 equivalent) be inserted before every floating point operation. We could, theoretically, eliminate this instruction in cases where we could prove that the mode was already what we wanted it to be, but my preference would be to have the front end insert calls to set the FP environment where needed and simply allow CodeGen to rely on the fact that the mode was already set.

Does this present a conflict with the needs of the AMD GPU architecture?

@arsenm (Contributor) commented Jan 13, 2017

> > Yes. The lowerings for some operations (fdiv in particular) will change
> > depending on whether denormals are enabled or not since they are expansions
> > and not a single instruction like for example on x86. Additionally some of
> > the OpenCL builtin library math function implementations also need to ensure
> > that denormals are enabled or disabled for some subsequence of the function.
>
> If I understand correctly you want the semantics of the ftz argument to be
> such that the lowering somehow enforces or guarantees the ftz mode. This is
> different from what I have proposed for the rounding mode argument.
>
> What I intended with my current proposal for constrained FP intrinsics is
> that the rounding mode argument will simply tell the optimizer what
> assumptions it can make about the rounding mode at any given time and if
> additional instructions are needed to set the rounding mode those will have
> been inserted separately (by the front end).
>
> I'm not familiar with the AMD GPU architecture, but for X86 targets having
> the intrinsics guarantee either ftz mode or rounding mode could require, in
> a worst case scenario, that an extra LDMXCSR instruction (or the x87
> equivalent) be inserted before every floating point operation. We could,
> theoretically, eliminate this instruction in cases where we could prove that
> the mode was already what we wanted it to be, but my preference would be to
> have the front end insert calls to set the FP environment where needed and
> simply allow CodeGen to rely on the fact that the mode was already set.
>
> Does this present a conflict with the needs of the AMD GPU architecture?

This sounds like the same thing to me? The mode register could be reset for every instruction in the worst case. I thought the proposal for the rounding mode was to allow specifying the particular rounding mode that the operation would use, as well as whether it should use the dynamic rounding mode.

Are you saying that the lowering for an llvm.constrained.fadd, for example, with a known constant round.downward argument would not be responsible for inserting the mode set instruction?

@andykaylor (Contributor, Author)

> Are you saying that the lowering for an llvm.constrained.fadd, for example,
> with a known constant round.downward argument would not be responsible for
> inserting the mode set instruction?

Yes, that is what I was suggesting.

I have to admit that my thinking on this is very much coupled with how I imagine it being implemented by a C/C++ front end, but I think something analogous should be practical with other front ends as well.

The way I imagine it, in C, there are basically two ways that the rounding mode could be set:

  1. A pragma like "STDC FENV_ACCESS ON" is used to enable fenv access and an explicit call to fesetround() is used to set the rounding mode.

  2. A pragma like "STDC FENV_ROUND FE_UPWARD" is used to specify the rounding mode for a specific scope.

In case 1, I intend that the front end would translate all FP operations using the constrained intrinsics with the rounding argument set to "round.dynamic" and (eventually) we'd have a pass that would look for calls to fesetround() or equivalents thereof and change the rounding mode argument in instructions for which it could prove the rounding mode was known and constant.

In case 2, it is my understanding that the pragma itself is intended to have the effect of setting the rounding mode within the scope for which it applies, but (at least in the standard draft I've been using as reference) it suggests that a compiler could implement this by inserting explicit calls to fegetround()/fesetround() in appropriate locations to bracket FP operations affected by the rounding mode (changing back to the previous mode for calls out of the scope). This matches what I had in mind. I intend that in this case the front end would insert calls to set the rounding mode and use whatever constant rounding mode is specified for FP operations within the given scope.

In the general (front end agnostic) scenario, other means of specifying a rounding mode (such as a command line option) are possible of course, but I think they could be mapped to one of the two cases above -- either (1) the rounding mode is unknown to the front end and must be set using an explicit instruction/call or (2) the rounding mode is known to the front end and the front end can insert calls/instructions to set it if necessary.
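
For illustration, a minimal IR sketch of the two cases described above (the intrinsic names and metadata strings follow the constrained-intrinsics proposal; the rewrite in case 1 is the hypothetical result of the pass mentioned, not an existing transformation):

    ; Case 1: fenv access is on; the rounding mode is only known dynamically.
    call i32 @fesetround(i32 2048)            ; e.g. FE_UPWARD on x86 glibc
    %x = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.dynamic", metadata !"fpexcept.strict")

    ; If a (hypothetical) pass later proves the mode is constant here, the
    ; argument could be rewritten to the specific mode:
    %x2 = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.upward", metadata !"fpexcept.strict")

    ; Case 2: a scoped pragma lets the front end both emit the fesetround
    ; calls bracketing the scope and use the constant mode string directly.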

In my mind (with an admitted X86 bias), flush-to-zero can be implemented in the same way. Of course, this assumes that there is an instruction of some sort that is independent of the general FP instructions and can be executed to set the rounding mode and/or flush-to-zero mode (such as LDMXCSR in the SSE instruction set). I suppose an architecture that needed to encode these modes directly into an instruction sequence in some way could still do that during ISel even if we did not require such lowering in our language specification for the intrinsic.

By the way, here's the standard draft I mentioned above: http://www.open-std.org/JTC1/sc22/wg14/www/docs/n1778.pdf

@arsenm (Contributor) commented Jan 14, 2017

This all matches how I think ftz should also be handled

@andykaylor (Contributor, Author)

> This all matches how I think ftz should also be handled

Excellent.

The biggest potential problem I see remaining is how to handle the transition through ISel. The implementation in my current patch doesn't preserve the rounding mode and exception behavior arguments beyond DAG formation. It could easily do so, but since my implementation of MutateStrictFPToFP() was going to ignore them anyway, I just left them out.

My reasoning is that once the target registers that hold the actual FP environment (e.g. MXCSR) are modeled in the machine instructions there would be no need for the extra arguments. This assumes no constant folding is attempted beyond ISel, or at least that suppressing it when the FP env register is modeled is acceptable.

However, if you think you will need to know the FTZ mode for instruction selection, then we'll need to keep at least that argument, and we'll need a target specific way to hook in to the MutateStrictFPToFP() handling.

Can you please comment on the code review if you think something additional is needed for your case?

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
arsenm added the floating-point (Floating-point math) label on Aug 13, 2023
@DemiMarie

Is using LLVM inline assembly a sufficient workaround here? Of course, this disables lots of other optimizations, but IIUC programmers who write SIMD code are generally using the compiler for register allocation and would be surprised (in a bad way) by optimizations such as constant folding.
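
For reference, a hedged sketch of the kind of inline-assembly workaround being asked about: wrapping the operation in a sideeffect asm block keeps the optimizer from constant-folding or otherwise rewriting it, at the cost of all other optimizations on that operation (an x86 AVX scalar add is shown; the instruction and constraints are illustrative only):

    ; Opaque FP add: the "sideeffect" keyword and the asm block itself prevent
    ; constant folding, reassociation, etc. of this operation.
    define double @opaque_fadd(double %a, double %b) {
      %r = call double asm sideeffect "vaddsd $2, $1, $0", "=x,x,x"(double %a, double %b)
      ret double %r
    }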
