JIT: Consume FMA intrinsic operands in right order #102914

Merged
merged 1 commit into from
May 31, 2024
17 changes: 15 additions & 2 deletions src/coreclr/jit/hwintrinsiccodegenxarch.cpp
@@ -3053,8 +3053,6 @@ void CodeGen::genFMAIntrinsic(GenTreeHWIntrinsic* node, insOpts instOptions)

     regNumber targetReg = node->GetRegNum();

-    genConsumeMultiOpOperands(node);
-
     regNumber op1NodeReg = op1->GetRegNum();
     regNumber op2NodeReg = op2->GetRegNum();
     regNumber op3NodeReg = op3->GetRegNum();
@@ -3143,6 +3141,21 @@ void CodeGen::genFMAIntrinsic(GenTreeHWIntrinsic* node, insOpts instOptions)
         }
     }

+#ifdef DEBUG
+    // Use nums are assigned in LIR order but this node is special and doesn't
+    // actually use operands. Fix up the use nums here to avoid asserts.
+    unsigned useNum1 = op1->gtUseNum;
+    unsigned useNum2 = op2->gtUseNum;
+    unsigned useNum3 = op3->gtUseNum;
+    emitOp1->gtUseNum = useNum1;
+    emitOp2->gtUseNum = useNum2;
+    emitOp3->gtUseNum = useNum3;
+#endif
+
+    genConsumeRegs(emitOp1);
+    genConsumeRegs(emitOp2);
+    genConsumeRegs(emitOp3);
+
Comment on lines +3144 to +3158
Member

Ideally we would permute these during lowering instead to avoid these hacks.

It's worth noting the reason we didn't do this in lowering (unlike most of the other cases that do swap operands) is because it would require introducing a lot of new synthetic intrinsics and it was believed to be overall more costly.

Each FusedMultiplyAdd intrinsic has three forms, where op3 is always the node that can be optionally contained:

  • 132 - op1 = (op1 * op3) + op2
  • 213 - op1 = (op2 * op1) + op3
  • 231 - op1 = (op2 * op3) + op1

The managed API we expose is the 213 form and there are 10 different FMA intrinsics, so we'd need to expose 10 more for the 132 form and 10 more for the 231 form. Then we'd need to repeat this for the Avx512-specific variants, giving us at least 50 new synthetic intrinsics in lowering just to cover FMA.
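
To make the relationship between the three forms listed above concrete, here is a minimal scalar sketch. It is not JIT code: the function names (fma132, fma213, fma231, formsAgree) are hypothetical and std::fma stands in for the hardware instruction. It only shows that a single a * b + c can be expressed in any of the three forms by permuting which value is passed as op1, op2, and op3.

```cpp
#include <cassert>
#include <cmath>

// Scalar models of the three encodings; each returns the new value of op1.
// These names are illustrative only and do not exist in the JIT.
double fma132(double op1, double op2, double op3) { return std::fma(op1, op3, op2); }
double fma213(double op1, double op2, double op3) { return std::fma(op2, op1, op3); }
double fma231(double op1, double op2, double op3) { return std::fma(op2, op3, op1); }

// Hypothetical helper: checks that, with the operands permuted appropriately,
// all three forms produce the same a * b + c.
bool formsAgree(double a, double b, double c)
{
    const double expected = std::fma(a, b, c);
    return fma213(a, b, c) == expected  // (b * a) + c
        && fma132(a, c, b) == expected  // (a * b) + c
        && fma231(c, a, b) == expected; // (a * b) + c
}
```

Because the operand roles differ per form, the order in which registers appear in the emitted instruction can differ from the op1, op2, op3 order of the node, which is the situation this change has to handle.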

Member

We might be able to avoid new synthetic intrinsics if we had a way to track which permutation was chosen, but free bits are fairly scarce right now. So I think we'd need to get clever in how we tracked that.
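
As a rough illustration of what tracking the permutation in spare bits could look like, here is a sketch that packs the chosen form into two bits of a flags word. Everything in it is an assumption made for illustration: the FmaForm enum, the bit position, and the helper names are hypothetical and do not correspond to any existing JIT field or flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: which FMA encoding was selected for a node.
enum class FmaForm : uint8_t { Form213 = 0, Form132 = 1, Form231 = 2 };

// Hypothetical layout: two bits at an arbitrary position in a 32-bit flags word.
constexpr uint32_t kFmaFormShift = 4;
constexpr uint32_t kFmaFormMask  = 0x3u << kFmaFormShift;

// Stores the form in the two reserved bits, leaving all other bits untouched.
uint32_t setFmaForm(uint32_t flags, FmaForm form)
{
    return (flags & ~kFmaFormMask) | (static_cast<uint32_t>(form) << kFmaFormShift);
}

// Reads the form back out of the two reserved bits.
FmaForm getFmaForm(uint32_t flags)
{
    return static_cast<FmaForm>((flags & kFmaFormMask) >> kFmaFormShift);
}
```

Three forms fit in two bits, so the cost is small, but it still depends on two bits being available, which is the constraint raised in the comment.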

Member

@kunalspathak, May 31, 2024

Can't we use genConsumeMultiOpOperands() here instead?
Edit: I assume not, because it wouldn't consume in the same order as the swapped operands?

Member Author

Right, op1, op2, op3 (the order that genConsumeMultiOpOperands consumes in) is not the same as emitOp1, emitOp2, emitOp3, which as I understand it is the order that uses were built in by LSRA. We should consume in that order.

I don't have the context to fully understand why we can't build and consume the operands in op1, op2, op3 order, even if the instruction we end up emitting uses the registers in a different order.
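
The ordering problem can be illustrated with a toy model. This is a sketch under stated assumptions, not the JIT's implementation: Operand, Consumer, consumeInOrder, and fixUpUseNums are hypothetical names, and the rule "each consumed use number must be higher than the previous one" is assumed here as a stand-in for the debug check that the fix-up in the diff is meant to satisfy.

```cpp
#include <array>
#include <cassert>

// Toy operand: useNum records the position at which the use was numbered.
struct Operand
{
    unsigned useNum;
};

// Toy consumer: accepts a use only if its number is higher than the last one.
struct Consumer
{
    unsigned lastUseNum = 0;

    bool consume(const Operand& op)
    {
        if (op.useNum <= lastUseNum)
        {
            return false; // stands in for a debug assert firing
        }
        lastUseNum = op.useNum;
        return true;
    }
};

// Consumes three operands in the given order; true if every step was accepted.
bool consumeInOrder(const Operand& a, const Operand& b, const Operand& c)
{
    Consumer consumer;
    return consumer.consume(a) && consumer.consume(b) && consumer.consume(c);
}

// Mirrors the shape of the DEBUG fix-up in the diff: read the three use numbers
// first, then hand them to the emit-order operands positionally.
void fixUpUseNums(std::array<Operand*, 3> ops, std::array<Operand*, 3> emitOps)
{
    const unsigned useNum1 = ops[0]->useNum;
    const unsigned useNum2 = ops[1]->useNum;
    const unsigned useNum3 = ops[2]->useNum;
    emitOps[0]->useNum = useNum1;
    emitOps[1]->useNum = useNum2;
    emitOps[2]->useNum = useNum3;
}
```

In this model, numbering the uses as op1, op2, op3 and then consuming them in a permuted emit order is rejected, while renumbering positionally first makes the same consume order acceptable. Reading all three numbers before writing any of them matters because the emit-order operands are the same objects as op1, op2, op3, only permuted.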

Member

@tannergooding - do you know?

Member

That I don't. It's an area of the register allocator I'm not well versed in.

Member

> That I don't. It's an area of the register allocator I'm not well versed in.

Hmm, I think it has to do with how we are consuming them in codegen as opposed to the LSRA ordering.

> and consume the operands in the op1, op2, op3 order even if we end up emitting different instructions using the registers in different order

     assert(ins != INS_invalid);
     genHWIntrinsic_R_R_R_RM(ins, attr, targetReg, emitOp1->GetRegNum(), emitOp2->GetRegNum(), emitOp3, instOptions);
     genProduceReg(node);