JIT: Consume FMA intrinsic operands in right order #102914

Merged
merged 1 commit into from
May 31, 2024

Conversation

jakobbotsch
Member

@jakobbotsch jakobbotsch commented May 31, 2024

The operands of the FMA intrinsics are permuted in a non-standard way during LSRA. Codegen already takes this into account, but the handling was missing when consuming the operands.

Ideally we would permute these during lowering instead to avoid these hacks.

Fix #102773

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 31, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch jakobbotsch marked this pull request as ready for review May 31, 2024 15:08
@jakobbotsch
Member Author

cc @dotnet/jit-contrib PTAL @tannergooding @kunalspathak

No diffs

This fixes the superpmi-replay blocking issue.

Comment on lines +3144 to +3158
#ifdef DEBUG
// Use nums are assigned in LIR order but this node is special and doesn't
// actually use operands. Fix up the use nums here to avoid asserts.
unsigned useNum1 = op1->gtUseNum;
unsigned useNum2 = op2->gtUseNum;
unsigned useNum3 = op3->gtUseNum;
emitOp1->gtUseNum = useNum1;
emitOp2->gtUseNum = useNum2;
emitOp3->gtUseNum = useNum3;
#endif

genConsumeRegs(emitOp1);
genConsumeRegs(emitOp2);
genConsumeRegs(emitOp3);

Member

Ideally we would permute these during lowering instead to avoid these hacks.

It's worth noting that the reason we didn't do this in lowering (unlike most of the other cases that swap operands) is that it would require introducing a lot of new synthetic intrinsics, which was believed to be more costly overall.

Each FusedMultiplyAdd intrinsic has three forms, where op3 is always the node that can be optionally contained:

  • 132 - op1 = (op1 * op3) + op2
  • 213 - op1 = (op2 * op1) + op3
  • 231 - op1 = (op2 * op3) + op1

The managed API we expose is the 213 form, and there are 10 different FMA intrinsics, so we'd need to expose 10 more for the 132 form and 10 more for the 231 form. Then we'd need to repeat this for the Avx512-specific variants, giving us at least 50 new synthetic intrinsics in lowering just to cover FMA.
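
As a rough illustration of the three forms listed above, the sketch below models each one as a plain C++ function. The function names are hypothetical and are not JIT or hardware APIs, and the arithmetic is an ordinary multiply followed by an add rather than a true fused operation with a single rounding step. It only aims to show that the same (a * b) + c computation can be expressed in any of the three forms by permuting which value sits in which operand slot.

```cpp
#include <cassert>

// Hypothetical models of the three FMA forms. The digits name the operand
// slots: the first two are multiplied and the third is added. In every form
// the result is written back to op1. Not fused: this rounds twice.
inline double fma132(double op1, double op2, double op3) { return (op1 * op3) + op2; }
inline double fma213(double op1, double op2, double op3) { return (op2 * op1) + op3; }
inline double fma231(double op1, double op2, double op3) { return (op2 * op3) + op1; }

// The managed API shape described in the comment: (a * b) + c, i.e. the 213 form.
inline double managedFma(double a, double b, double c) { return fma213(a, b, c); }

// The same computation expressed in the other two forms by permuting operands.
inline double managedFmaVia132(double a, double b, double c) { return fma132(a, c, b); }
inline double managedFmaVia231(double a, double b, double c) { return fma231(c, a, b); }
```

With a = 2, b = 3, c = 4, all three wrappers compute (2 * 3) + 4. Which slot ends up holding the result (op1) differs per form, which is why the choice of form interacts with register allocation.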

Member

We might be able to avoid new synthetic intrinsics if we had a way to track which permutation was in use, but free bits are fairly scarce right now, so I think we'd need to get clever about how we track that.
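
Purely as a sketch of what "tracking the permutation" could look like, the snippet below packs one of the three forms into two bits of a 32-bit flags word. Everything here is an assumption for illustration: the enum, the helper names, and the bit position are hypothetical, and nothing in this thread says which bits (if any) would actually be available on the node.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding of the FMA form; three values fit in two bits.
enum class FmaForm : uint8_t { Form213 = 0, Form132 = 1, Form231 = 2 };

// Hypothetical bit position inside a flags word; chosen arbitrarily.
constexpr uint32_t kFmaFormShift = 4;
constexpr uint32_t kFmaFormMask  = 0x3u << kFmaFormShift;

// Store the form without disturbing any other flag bits.
constexpr uint32_t setFmaForm(uint32_t flags, FmaForm form)
{
    return (flags & ~kFmaFormMask) | (static_cast<uint32_t>(form) << kFmaFormShift);
}

// Read the form back out of the flags word.
constexpr FmaForm getFmaForm(uint32_t flags)
{
    return static_cast<FmaForm>((flags & kFmaFormMask) >> kFmaFormShift);
}
```

The cost in this sketch is two bits per node, which is the resource the comment describes as scarce.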

Member

@kunalspathak kunalspathak May 31, 2024

Can we not use genConsumeMultiOpOperands() here instead?
Edit: I assume not, because it would not consume the operands in the swapped order?

Member Author

Right, op1, op2, op3 (the order that genConsumeMultiOpOperands consumes in) is not the same as emitOp1, emitOp2, emitOp3, which as I understand it is the order that uses were built in by LSRA. We should consume in that order.

I don't have the context necessary to completely understand why we can't build and consume the operands in the op1, op2, op3 order, even if the instruction we end up emitting uses the registers in a different order.
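
To make the ordering issue concrete, here is a toy model of it. It is not the JIT's code: `Operand`, `ConsumeChecker`, and `consumePermuted` are hypothetical names, and the checker is only an assumed stand-in for the kind of debug-only check that the "avoid asserts" comment in the diff refers to (uses must be consumed in increasing use-number order). The fix-up step mirrors the shape of the snippet under review: read all three use numbers first, then assign them to the operands in emit order.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for an operand node; only the use number matters here.
struct Operand { unsigned useNum; };

// Toy debug check (an assumption): each consumed use number must be larger
// than the previous one.
struct ConsumeChecker
{
    unsigned lastUseNum = 0;
    bool     ok         = true;
    void consume(const Operand& op)
    {
        if (op.useNum <= lastUseNum) ok = false;
        lastUseNum = op.useNum;
    }
};

// ops holds op1, op2, op3 with use numbers assigned in that (LIR) order.
// perm[i] says which of them plays the role of emitOp(i+1).
// Returns true if consuming in emit order passes the toy check.
inline bool consumePermuted(std::array<Operand, 3> ops, const std::array<int, 3>& perm, bool fixUpUseNums)
{
    Operand* emitOps[3] = {&ops[perm[0]], &ops[perm[1]], &ops[perm[2]]};
    if (fixUpUseNums)
    {
        // Read all three numbers before writing, because emitOps alias ops.
        const unsigned useNums[3] = {ops[0].useNum, ops[1].useNum, ops[2].useNum};
        for (int i = 0; i < 3; i++) emitOps[i]->useNum = useNums[i];
    }
    ConsumeChecker checker;
    for (Operand* op : emitOps) checker.consume(*op);
    return checker.ok;
}
```

In this model, consuming a swapped pair without the fix-up trips the check, while renumbering first lets the emit order pass. The unpermuted case passes either way.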

Member

@tannergooding - do you know?

Member

That I don't. It's an area of the register allocator I'm not well versed in.

Member

That I don't. It's an area of the register allocator I'm not well versed in.

Hmm, I think it has to do with how we consume them in codegen as opposed to the LSRA ordering.

and consume the operands in the op1, op2, op3 order even if we end up emitting different instructions using the registers in different order

@jakobbotsch jakobbotsch merged commit e9cd3f1 into dotnet:main May 31, 2024
112 of 114 checks passed
@jakobbotsch jakobbotsch deleted the fix-102773 branch May 31, 2024 18:59
@github-actions github-actions bot locked and limited conversation to collaborators Jul 1, 2024

Successfully merging this pull request may close these issues.

SPMI Replay failing sporadically
3 participants