JIT: Consume FMA intrinsic operands in right order #102914
The operands of the FMA intrinsic are permuted in a non-standard way during LSRA. Codegen already takes this into account, but the handling was missing when consuming the operands. Ideally we would permute these during lowering instead to avoid these hacks. Fix dotnet#102773
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
cc @dotnet/jit-contrib PTAL @tannergooding @kunalspathak This fixes the superpmi-replay blocking issue.
#ifdef DEBUG
// Use nums are assigned in LIR order but this node is special and doesn't
// actually use operands. Fix up the use nums here to avoid asserts.
unsigned useNum1 = op1->gtUseNum;
unsigned useNum2 = op2->gtUseNum;
unsigned useNum3 = op3->gtUseNum;
emitOp1->gtUseNum = useNum1;
emitOp2->gtUseNum = useNum2;
emitOp3->gtUseNum = useNum3;
#endif

genConsumeRegs(emitOp1);
genConsumeRegs(emitOp2);
genConsumeRegs(emitOp3);
Ideally we would permute these during lowering instead to avoid these hacks.
It's worth noting the reason we didn't do this in lowering (unlike most of the other cases that do swap operands) is because it would require introducing a lot of new synthetic intrinsics and it was believed to be overall more costly.
Each FusedMultiplyAdd intrinsic has three forms, where op3 is always the node that can be optionally contained:
- 132: op1 = (op1 * op3) + op2
- 213: op1 = (op2 * op1) + op3
- 231: op1 = (op2 * op3) + op1
The managed API we expose is the 213 form and there are 10 different FMA intrinsics, so we'd need to expose 10 more for the 132 and 10 more for the 231 form. Then we'd need to repeat this for the Avx512 specific variants, giving us at least 50 new synthetic intrinsics in lowering just to cover FMA.
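As a rough illustration of the three forms listed above, the following sketch models each one as a plain function and checks that all three can produce the same `a * b + c` once the values are placed in the right operand slots. The helper names (`fma132`, `fma213`, `fma231`, `formsAgree`) are made up for this example and are not JIT or managed API names; only `std::fma` from `<cmath>` is a real library call.

```cpp
#include <cassert>
#include <cmath>

// Each helper returns the value that the corresponding form would write to op1.
// std::fma(x, y, z) computes x * y + z with a single rounding.
double fma132(double op1, double op2, double op3) { return std::fma(op1, op3, op2); }
double fma213(double op1, double op2, double op3) { return std::fma(op2, op1, op3); }
double fma231(double op1, double op2, double op3) { return std::fma(op2, op3, op1); }

// Computes a * b + c three ways by permuting which value sits in which slot.
// (Hypothetical helper; intended for finite inputs, since NaN never compares equal.)
bool formsAgree(double a, double b, double c)
{
    double expected = std::fma(a, b, c);
    return fma213(a, b, c) == expected  // (b * a) + c
        && fma132(a, c, b) == expected  // (a * b) + c
        && fma231(c, a, b) == expected; // (a * b) + c
}
```

The point of the sketch is only that the choice of form is a choice of operand permutation: the same three values, placed in different slots, give the same result.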
We might be able to avoid new synthetic intrinsics if we had a way to track what permutation it was, but free bits are fairly sparse right now. So I think we'd need to get clever in how we tracked that.
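To make the idea of tracking the permutation concrete, here is a minimal sketch that packs a form selector into two bits of a flags word. Everything in it is an assumption for illustration: the `FmaForm` enum, the bit position, and the `setFmaForm`/`getFmaForm` helpers are hypothetical and do not correspond to any existing JIT flag or field.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical selector for which FMA form a node should be emitted as.
enum class FmaForm : uint8_t
{
    Form213 = 0, // the form the managed API exposes
    Form132 = 1,
    Form231 = 2,
};

// Hypothetical location of the two bits inside a 32-bit flags word.
constexpr uint32_t kFmaFormShift = 4;
constexpr uint32_t kFmaFormMask  = 0x3u << kFmaFormShift;

// Stores the form in the flags word without disturbing the other bits.
uint32_t setFmaForm(uint32_t flags, FmaForm form)
{
    return (flags & ~kFmaFormMask) | (static_cast<uint32_t>(form) << kFmaFormShift);
}

// Reads the form back out of the flags word.
FmaForm getFmaForm(uint32_t flags)
{
    return static_cast<FmaForm>((flags & kFmaFormMask) >> kFmaFormShift);
}
```

Three forms need two bits, which is exactly the cost the comment is worried about when free bits are scarce.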
Can we not use genConsumeMultiOpOperands() here instead?
Edit: I assume because we won't be using the same order as swapped operands?
Right, op1, op2, op3 (the order that genConsumeMultiOpOperands consumes in) is not the same as emitOp1, emitOp2, emitOp3, which as I understand it is the order that uses were built in by LSRA. We should consume in that order.
I don't have the context necessary to completely understand why we can't build and consume the operands in the op1, op2, op3 order even if we end up emitting different instructions using the registers in different orders in the instruction we emit.
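The ordering constraint discussed here can be sketched with a toy model. This is not JIT code: `Operand`, `consumedInOrder`, and `fixUpUseNums` are hypothetical stand-ins, and the rule that use numbers must increase as operands are consumed is an assumption about what the debug check looks for. The fix-up function mirrors the shape of the `#ifdef DEBUG` block in the diff, reading all three use numbers before writing any of them.

```cpp
#include <cassert>

// Hypothetical stand-in for an operand node; only the use number matters here.
struct Operand
{
    unsigned useNum;
};

// Toy version of the debug check: operands must be consumed in strictly
// increasing use-number order. Returns false where a debug build might assert.
bool consumedInOrder(const Operand& first, const Operand& second, const Operand& third)
{
    return first.useNum < second.useNum && second.useNum < third.useNum;
}

// Gives the use numbers that were assigned in op1, op2, op3 order to the
// operands in emit order. The emit operands alias the same three nodes,
// so all values are read before any is written.
void fixUpUseNums(Operand& op1, Operand& op2, Operand& op3,
                  Operand& emitOp1, Operand& emitOp2, Operand& emitOp3)
{
    unsigned useNum1 = op1.useNum;
    unsigned useNum2 = op2.useNum;
    unsigned useNum3 = op3.useNum;
    emitOp1.useNum   = useNum1;
    emitOp2.useNum   = useNum2;
    emitOp3.useNum   = useNum3;
}
```

In this model, consuming in emit order fails the check until the use numbers are reassigned, after which the emit order is the one that passes.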
@tannergooding - do you know?
That I don't. It's an area of the register allocator I'm not well versed in.
Hmm, I think it has to do with how we are consuming them in codegen as opposed to the LSRA ordering.
and consume the operands in the op1, op2, op3 order even if we end up emitting different instructions using the registers in different order