We need a way to specify FP denormal behavior on a per-instruction basis #30633

Open

andykaylor opened this issue Dec 6, 2016 · 12 comments

Labels: bugzilla (Issues migrated from bugzilla), floating-point (Floating-point math)

Comments

@andykaylor (Contributor)

Bugzilla Link: 31285
Version: unspecified
OS: All
CC: @andykaylor, @hfinkel, @arsenm, @pogo59, @rotateright, @yuanfang-chen

Extended Description

As per the discussion in the comments for https://reviews.llvm.org/D27028, we need a way to control the flush-to-zero behavior of floating-point operations on a per-instruction basis.

Currently there is a TargetMachine option and a set of function attributes for controlling denormal behavior. It isn't clear to me whether this approach is sufficient for the needs of all architectures.
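
For reference, a minimal sketch of the function-level control being described, using the "denormal-fp-math" attribute spelling LLVM uses today (the attribute name and values here are illustrative and may not match what existed when this was written):

    ; Function-wide denormal mode: every FP operation in @foo may assume
    ; denormal inputs and outputs are flushed to zero, preserving the sign.
    define float @foo(float %a, float %b) #0 {
      %r = fadd float %a, %b
      ret float %r
    }
    attributes #0 = { "denormal-fp-math"="preserve-sign,preserve-sign" }

An attribute like this applies to a whole function; the rest of this thread is about carrying the same information per instruction.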

@arsenm (Contributor) commented Dec 6, 2016

I think we should have .ftz variants of the various rounding modes, and then a function attribute for the default mode the standard operations will use.

The ftz mode is needed to decide how to correctly lower some operations on AMDGPU. Disabling denormals in these cases is not an optimization, although some other optimizations are allowed when denormals are disabled. Implementing some of the standard operations requires emitting intermediate code that modifies the FP environment; for now we are using custom target nodes for the side-effecting FP instructions.
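
One possible reading of the ".ftz variants" idea, shown purely as a hypothetical sketch (the ".ftz"-suffixed metadata string below does not exist; the real constrained intrinsics only accept rounding strings such as !"round.tonearest"):

    ; Hypothetical: fold flush-to-zero into the rounding-mode metadata of a
    ; constrained intrinsic via an ".ftz" suffix (not real LLVM syntax).
    %x = call float @llvm.experimental.constrained.fadd.f32(
             float %a, float %b,
             metadata !"round.tonearest.ftz",
             metadata !"fpexcept.strict")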

@llvmbot (Member) commented Jan 4, 2017

Matt, it sounds like you are requesting a mechanism for enforcing flush-to-zero behavior as opposed to merely permitting it. Is that accurate?

If we want to provide the same level of generality as

namespace FPDenormal {
  enum DenormalMode {
    IEEE,          // IEEE 754 denormal numbers
    PreserveSign,  // the sign of a flushed-to-zero number is preserved in
                   // the sign of 0
    PositiveZero   // denormals are flushed to positive zero
  };
}

then folding flush-to-zero behavior into the rounding mode argument of the constrained intrinsics seems impractical. Maybe we want to add a third metadata argument for it? Also, do we need to make a distinction between output denormals and input denormals? For x86, the MXCSR has separate FTZ and DAZ controls for output and input denormals, respectively. I don't know what other architectures provide.
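
For context, a constrained-intrinsic call today carries two metadata arguments, so the "third metadata argument" floated here might look like the second call below (the !"denormal.preservesign" string is purely hypothetical):

    ; Today: rounding mode + exception behavior.
    %x = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.dynamic", metadata !"fpexcept.strict")

    ; Hypothetical: a third metadata argument carrying the denormal mode.
    %y = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.dynamic", metadata !"fpexcept.strict",
             metadata !"denormal.preservesign")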

@hfinkel (Collaborator) commented Jan 4, 2017

> Matt, it sounds like you are requesting a mechanism for enforcing
> flush-to-zero behavior as opposed to merely permitting it. Is that accurate?
>
> If we want to provide the same level of generality as
>
> namespace FPDenormal {
>   enum DenormalMode {
>     IEEE,          // IEEE 754 denormal numbers
>     PreserveSign,  // the sign of a flushed-to-zero number is preserved in
>                    // the sign of 0
>     PositiveZero   // denormals are flushed to positive zero
>   };
> }
>
> then folding flush-to-zero behavior into the rounding mode argument of the
> constrained intrinsics seems impractical. Maybe we want to add a third
> metadata argument for it?

This was my recommendation, and I think we should mirror exactly the options we have in the backend. If we need any extensions to the set, then we should add them both to the set the backend understands and to the set the intrinsics accept.

> Also, do we need to make a distinction between
> output denormals and input denormals? For x86, the MXCSR has separate FTZ
> and DAZ controls for output and input denormals, respectively. I don't know
> what other architectures provide.

@arsenm (Contributor) commented Jan 5, 2017

> Matt, it sounds like you are requesting a mechanism for enforcing
> flush-to-zero behavior as opposed to merely permitting it. Is that accurate?

Yes. The lowerings for some operations (fdiv in particular) will change depending on whether denormals are enabled or not since they are expansions and not a single instruction like for example on x86. Additionally some of the OpenCL builtin library math function implementations also need to ensure that denormals are enabled or disabled for some subsequence of the function.

> If we want to provide the same level of generality as
>
> namespace FPDenormal {
>   enum DenormalMode {
>     IEEE,          // IEEE 754 denormal numbers
>     PreserveSign,  // the sign of a flushed-to-zero number is preserved in
>                    // the sign of 0
>     PositiveZero   // denormals are flushed to positive zero
>   };
> }
>
> then folding flush-to-zero behavior into the rounding mode argument of the
> constrained intrinsics seems impractical. Maybe we want to add a third
> metadata argument for it? Also, do we need to make a distinction between
> output denormals and input denormals? For x86, the MXCSR has separate FTZ
> and DAZ controls for output and input denormals, respectively. I don't know
> what other architectures provide.

AMDGPU's FP mode can be set to flush input or output denormals (although I am pretty sure we only ever set input/output to the same value). The subtarget feature we've been using for the default mode doesn't distinguish these.

@andykaylor (Contributor, Author)

> Yes. The lowerings for some operations (fdiv in particular) will change
> depending on whether denormals are enabled or not since they are expansions
> and not a single instruction like for example on x86. Additionally some of
> the OpenCL builtin library math function implementations also need to ensure
> that denormals are enabled or disabled for some subsequence of the function.

If I may drift toward the front end a bit here, how do you envision the denormal argument being set? Is there some language-specific syntax that you expect to trigger it?

In the case of the rounding mode argument as I've proposed it, I'm imagining the front end will always set it to dynamic unless a pragma is used to declare the rounding mode for a given scope. Then, hopefully, I'll be able to write a pass that looks for calls that specifically set the rounding mode and replaces the argument for instructions that we can prove will have a specific value.

I'm just trying to understand how ftz would work as an argument to the intrinsics.

@arsenm (Contributor) commented Jan 7, 2017

> > Yes. The lowerings for some operations (fdiv in particular) will change
> > depending on whether denormals are enabled or not since they are expansions
> > and not a single instruction like for example on x86. Additionally some of
> > the OpenCL builtin library math function implementations also need to ensure
> > that denormals are enabled or disabled for some subsequence of the function.
>
> If I may drift toward the front end a bit here, how do you envision the
> denormal argument being set? Is there some language-specific syntax that
> you expect to trigger it?
>
> In the case of the rounding mode argument as I've proposed it, I'm imagining
> the front end will always set it to dynamic unless a pragma is used to
> declare the rounding mode for a given scope. Then, hopefully, I'll be able
> to write a pass that looks for calls that specifically set the rounding mode
> and replaces the argument for instructions that we can prove will have a
> specific value.
>
> I'm just trying to understand how ftz would work as an argument to the
> intrinsics.

A pragma like the one for the rounding mode, I suppose. For the main use cases we have, writing the handful of wrapper functions necessary in IR would probably be good enough.

@andykaylor (Contributor, Author)

> Yes. The lowerings for some operations (fdiv in particular) will change
> depending on whether denormals are enabled or not since they are expansions
> and not a single instruction like for example on x86. Additionally some of
> the OpenCL builtin library math function implementations also need to ensure
> that denormals are enabled or disabled for some subsequence of the function.

If I understand correctly you want the semantics of the ftz argument to be such that the lowering somehow enforces or guarantees the ftz mode. This is different from what I have proposed for the rounding mode argument.

What I intended with my current proposal for constrained FP intrinsics is that the rounding mode argument will simply tell the optimizer what assumptions it can make about the rounding mode at any given time and if additional instructions are needed to set the rounding mode those will have been inserted separately (by the front end).

I'm not familiar with the AMD GPU architecture, but for X86 targets having the intrinsics guarantee either ftz mode or rounding mode could require, in a worst case scenario, that an extra LDMXCSR instruction (or the x87 equivalent) be inserted before every floating point operation. We could, theoretically, eliminate this instruction in cases where we could prove that the mode was already what we wanted it to be, but my preference would be to have the front end insert calls to set the FP environment where needed and simply allow CodeGen to rely on the fact that the mode was already set.

Does this present a conflict with the needs of the AMD GPU architecture?

@arsenm (Contributor) commented Jan 13, 2017

> > Yes. The lowerings for some operations (fdiv in particular) will change
> > depending on whether denormals are enabled or not since they are expansions
> > and not a single instruction like for example on x86. Additionally some of
> > the OpenCL builtin library math function implementations also need to ensure
> > that denormals are enabled or disabled for some subsequence of the function.
>
> If I understand correctly you want the semantics of the ftz argument to be
> such that the lowering somehow enforces or guarantees the ftz mode. This is
> different from what I have proposed for the rounding mode argument.
>
> What I intended with my current proposal for constrained FP intrinsics is
> that the rounding mode argument will simply tell the optimizer what
> assumptions it can make about the rounding mode at any given time and if
> additional instructions are needed to set the rounding mode those will have
> been inserted separately (by the front end).
>
> I'm not familiar with the AMD GPU architecture, but for X86 targets having
> the intrinsics guarantee either ftz mode or rounding mode could require, in
> a worst case scenario, that an extra LDMXCSR instruction (or the x87
> equivalent) be inserted before every floating point operation. We could,
> theoretically, eliminate this instruction in cases where we could prove that
> the mode was already what we wanted it to be, but my preference would be to
> have the front end insert calls to set the FP environment where needed and
> simply allow CodeGen to rely on the fact that the mode was already set.
>
> Does this present a conflict with the needs of the AMD GPU architecture?

This sounds like the same thing to me? The mode register could be reset for every instruction in the worst case. I thought the proposal for the rounding mode was to allow specifying the particular rounding mode that the operation would use, as well as whether it should use the dynamic rounding mode.

Are you saying that the lowering for an llvm.constrained.fadd, for example, with a known constant round.downward argument would not be responsible for inserting the mode set instruction?

@andykaylor (Contributor, Author)

> Are you saying that the lowering for an llvm.constrained.fadd, for example,
> with a known constant round.downward argument would not be responsible for
> inserting the mode set instruction?

Yes, that is what I was suggesting.

I have to admit that my thinking on this is very much coupled with how I imagine it being implemented by a C/C++ front end, but I think something analogous should be practical with other front ends as well.

The way I imagine it, in C, there are basically two ways that the rounding mode could be set:

  1. A pragma like "STDC FENV_ACCESS ON" is used to enable fenv access and an explicit call to fesetround() is used to set the rounding mode.

  2. A pragma like "STDC FENV_ROUND FE_UPWARD" is used to specify the rounding mode for a specific scope.

In case 1, I intend that the front end would translate all FP operations using the constrained intrinsics with the rounding argument set to "round.dynamic" and (eventually) we'd have a pass that would look for calls to fesetround() or equivalents thereof and change the rounding mode argument in instructions for which it could prove the rounding mode was known and constant.

In case 2, it is my understanding that the pragma itself is intended to have the effect of setting the rounding mode within the scope for which it applies, but (at least in the standard draft I've been using as reference) it suggests that a compiler could implement this by inserting explicit calls to fegetround()/fesetround() in appropriate locations to bracket FP operations affected by the rounding mode (changing back to the previous mode for calls out of the scope). This matches what I had in mind. I intend that in this case the front end would insert calls to set the rounding mode and use whatever constant rounding mode is specified for FP operations within the given scope.

In the general (front end agnostic) scenario, other means of specifying a rounding mode (such as a command line option) are possible of course, but I think they could be mapped to one of the two cases above -- either (1) the rounding mode is unknown to the front end and must be set using an explicit instruction/call or (2) the rounding mode is known to the front end and the front end can insert calls/instructions to set it if necessary.
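
For illustration, a minimal IR sketch of the two cases described above (the intrinsic names and metadata strings follow the constrained-intrinsics proposal; the rewrite in case 1 is the hypothetical result of the pass mentioned, not an existing transformation):

    ; Case 1: fenv access is on; the rounding mode is only known dynamically.
    call i32 @fesetround(i32 2048)            ; e.g. FE_UPWARD on x86 glibc
    %x = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.dynamic", metadata !"fpexcept.strict")

    ; If a (hypothetical) pass later proves the mode is constant here, the
    ; argument could be rewritten to the specific mode:
    %x2 = call double @llvm.experimental.constrained.fadd.f64(
             double %a, double %b,
             metadata !"round.upward", metadata !"fpexcept.strict")

    ; Case 2: a scoped pragma lets the front end both emit the fesetround
    ; calls bracketing the scope and use the constant mode string directly.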

In my mind (with an admitted X86 bias), flush-to-zero can be implemented in the same way. Of course, this assumes that there is an instruction of some sort that is independent of the general FP instructions and can be executed to set the rounding mode and/or flush-to-zero mode (such as LDMXCSR in the SSE instruction set). I suppose an architecture that needed to encode these modes directly into an instruction sequence in some way could still do that during ISel even if we did not require such lowering in our language specification for the intrinsic.

By the way, here's the standard draft I mentioned above: http://www.open-std.org/JTC1/sc22/wg14/www/docs/n1778.pdf

@arsenm (Contributor) commented Jan 14, 2017

This all matches how I think ftz should also be handled

@andykaylor (Contributor, Author)

> This all matches how I think ftz should also be handled

Excellent.

The biggest potential problem I see remaining is how to handle the transition through ISel. The implementation in my current patch doesn't preserve the rounding mode and exception behavior arguments beyond DAG formation. It could easily do so, but since my implementation of MutateStrictFPToFP() was going to ignore them anyway, I just left them out.

My reasoning is that once the target registers that hold the actual FP environment (e.g. MXCSR) are modeled in the machine instructions there would be no need for the extra arguments. This assumes no constant folding is attempted beyond ISel, or at least that suppressing it when the FP env register is modeled is acceptable.

However, if you think you will need to know the FTZ mode for instruction selection, then we'll need to keep at least that argument, and we'll need a target specific way to hook in to the MutateStrictFPToFP() handling.

Can you please comment on the code review if you think something additional is needed for your case?

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
arsenm added the floating-point (Floating-point math) label on Aug 13, 2023
@DemiMarie

Is using LLVM inline assembly a sufficient workaround here? Of course, this disables lots of other optimizations, but IIUC programmers who write SIMD code are generally using the compiler for register allocation and would be surprised (in a bad way) by optimizations such as constant folding.
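
For reference, a hedged sketch of the kind of inline-assembly workaround being asked about: wrapping the operation in a sideeffect asm block keeps the optimizer from constant-folding or otherwise rewriting it, at the cost of all other optimizations on that operation (an x86 AVX scalar add is shown; the instruction and constraints are illustrative only):

    ; Opaque FP add: the "sideeffect" keyword and the asm block itself prevent
    ; constant folding, reassociation, etc. of this operation.
    define double @opaque_fadd(double %a, double %b) {
      %r = call double asm sideeffect "vaddsd $2, $1, $0", "=x,x,x"(double %a, double %b)
      ret double %r
    }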
