JIT: Reconsider how to represent defs/uses of CPU flags (GTF_SET_FLAGS) #74867

Open · jakobbotsch opened this issue Aug 31, 2022 · 9 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Milestone: Future

jakobbotsch (Member) commented Aug 31, 2022

When GTF_SET_FLAGS is set on a node, it indicates that a later node may consume the CPU flags set by this node. However, this dependency sits outside the JIT's modelling of values in the IR, since we do not track CPU flag dependencies at all. As a result, the interference checking we have today is not really sufficient, and it complicates things such as potentially implementing rematerialization, which could introduce flag-trashing nodes at arbitrary places.

Today this works out because we do very limited transformations on LIR and because we use the GTF_SET_FLAGS capability only in limited situations: for decomposed longs on 32-bit platforms and for conditional branching.
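
For illustration, here is a minimal toy sketch (C++, not dotnet/runtime code; the node shape and the scan are simplified assumptions) of the kind of ad-hoc interference check the out-of-band flags model forces on anything that wants to insert code into LIR, such as rematerialization:

```cpp
// Toy model, not the dotnet/runtime JIT code: nodes in execution (LIR) order,
// with an out-of-band "sets flags"/"uses flags" bit instead of an explicit
// def/use edge for the flags.
#include <cstdio>
#include <vector>

struct Node
{
    const char* name;
    bool setsFlags;    // analogous to GTF_SET_FLAGS
    bool usesFlags;    // e.g. a conditional branch consuming the flags
    bool trashesFlags; // most arithmetic on x86/x64 clobbers the flags
};

// Inserting 'newNode' in front of position 'pos' (as rematerialization might) is
// only safe if it does not land between a live flags def and its consumer -- a
// constraint the value def/use model knows nothing about, hence the ad-hoc scan.
bool SafeToInsertAt(const std::vector<Node>& lir, size_t pos, const Node& newNode)
{
    if (!newNode.trashesFlags)
    {
        return true;
    }

    bool flagsLive = false;
    for (size_t i = 0; i < lir.size(); i++)
    {
        if ((i == pos) && flagsLive)
        {
            return false; // would trash flags that a later node still needs
        }
        if (lir[i].usesFlags)
        {
            flagsLive = false;
        }
        if (lir[i].setsFlags)
        {
            flagsLive = true;
        }
    }
    return true;
}

int main()
{
    std::vector<Node> lir = {
        {"CMP x, y", true, false, true}, // flags def
        {"JCC",      false, true, false} // flags use
    };
    Node remat = {"ADD t, t, 1", false, false, true};

    printf("insert before JCC: %s\n", SafeToInsertAt(lir, 1, remat) ? "ok" : "unsafe");
    printf("insert at the end: %s\n", SafeToInsertAt(lir, 2, remat) ? "ok" : "unsafe");
    return 0;
}
```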

If we want rematerialization we are probably going to have to solve this in some way. Two solutions come to mind:

  • Actually model CPU flags at the IR level, in a way that lets us reason about them when doing interference checks and from LSRA. Probably quite a bit of work.
  • Reimplement it via containment, i.e. contain child nodes that may generate values into CPU flags in parent nodes that can consume values from CPU flags. This may not be as powerful as the above, but allows us to rely on the existing def/use model and thus seems much simpler. Compare chains on ARM64 work based on this model today.

One fairly simple step might be to move all non-decomposition uses of GTF_SET_FLAGS to the containment model, which should mean that rematerialization becomes possible in many contexts (e.g. always on 64-bit targets).
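
As a rough sketch of the containment-style option (a toy with invented types and helpers, not the real Lowering or codegen code): the flags-defining compare becomes a contained child of the node that consumes the flags, so the ordinary parent/child edge carries the dependency and the two instructions are emitted back to back.

```cpp
// Toy sketch of the containment model for flags (invented types, not the real
// Lowering/codegen code): the compare that defines the flags is a contained
// child of the branch that uses them, so the ordinary parent/child (def/use)
// edge carries the dependency and nothing can be scheduled between the flags
// def and the flags use.
#include <cstdio>

struct ToyNode
{
    const char* oper;
    ToyNode*    op1       = nullptr;
    bool        contained = false;
};

void GenCodeForNode(const ToyNode& node)
{
    if (node.contained)
    {
        return; // codegen is handled as part of the containing parent
    }

    if ((node.op1 != nullptr) && node.op1->contained)
    {
        // The parent emits both the flags def and the flags use, back to back.
        printf("cmp  rcx, rdx   ; contained '%s': flags def\n", node.op1->oper);
        printf("je   L_taken    ; '%s': flags use\n", node.oper);
        return;
    }

    printf("%s ...\n", node.oper);
}

int main()
{
    ToyNode cmp   = {"eq"};          // relop that would otherwise produce a value
    cmp.contained = true;            // marked contained during lowering
    ToyNode jtrue = {"jtrue", &cmp}; // the branch consumes the contained compare

    GenCodeForNode(cmp);   // emits nothing by itself
    GenCodeForNode(jtrue); // emits the cmp + jcc pair
    return 0;
}
```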

cc @dotnet/jit-contrib

category:proposal
theme:ir

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Aug 31, 2022
ghost added the untriaged label Aug 31, 2022

jakobbotsch added this to the 8.0.0 milestone Aug 31, 2022
jakobbotsch removed the untriaged label Aug 31, 2022
jakobbotsch (Member, Author) commented Aug 31, 2022

> One fairly simple step might be to move all non-decomposition uses of GTF_SET_FLAGS to the containment model

As part of this, GT_SETCC can probably also be removed, and the behavior of compare nodes can be made more in line with the rest of the IR (i.e. always producing values in registers unless contained).
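
In instruction terms, that could look roughly like the following (a hedged illustration, not actual JIT output): contained in its flag-consuming user, the compare only defines the flags; otherwise it materializes a value the usual setcc way.

```cpp
// Hedged illustration (toy output, not actual JIT codegen) of the two shapes an
// equality compare could take on x64 if it always produced a register value
// unless contained.
#include <cstdio>

void EmitEqualityCompare(bool containedInBranch)
{
    printf("cmp   rcx, rdx\n");
    if (containedInBranch)
    {
        // Contained: the flag-consuming user (a branch here) uses the flags directly.
        printf("je    L_taken\n");
    }
    else
    {
        // Uncontained: the compare materializes a value in a register, and a
        // later consumer re-tests that value.
        printf("sete  al\n");
        printf("movzx eax, al\n");
        printf("test  eax, eax\n");
        printf("jne   L_taken\n");
    }
}

int main()
{
    EmitEqualityCompare(true);
    printf("--\n");
    EmitEqualityCompare(false);
    return 0;
}
```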

BruceForstall (Member) commented:

Containment currently means that the "containee" doesn't generate any code, and that its effect is completely represented in the instruction generated by the "container" node. This would no longer be true. The "container" would be generating multiple instructions.

Also, no instructions could be placed between them (this probably doesn't matter much currently -- and even if it were used, any intervening instructions couldn't affect the flags).

jakobbotsch (Member, Author) commented:

> Containment currently means that the "containee" doesn't generate any code, and that its effect is completely represented in the instruction generated by the "container" node. This would no longer be true. The "container" would be generating multiple instructions.

Do you see any issues with this? To me it is an arbitrary limitation to put on containment, especially considering that many nodes today already need to generate more than one instruction.
I personally never really thought of containment as "its effect is completely represented in the instruction generated by the parent"; I always thought of containment as a general "codegen is handled as part of the parent". I think we already have many cases where a more flexible definition of containment is used (e.g. codegen for PUTARG_STK with a contained GT_OBJ comes to mind).

> Also, no instructions could be placed between them (this probably doesn't matter much currently -- and even if it were used, any intervening instructions couldn't affect the flags).

Indeed, that's what I meant when I said this might not be as powerful as a fully-fledged way to reason about CPU flags.

BruceForstall (Member) commented:

> I always thought of containment as a general "codegen is handled as part of the parent".

I think that's a reasonable way to think about it. Ideally, post-lower, an IR node generates exactly one machine instruction. It's a simple mental model. There are probably exceptions today as you note. As long as LSRA's register needs are exposed, it works even if more than one instruction is generated. But it does make any more general "scheduling" (which we don't do) potentially harder.

jakobbotsch (Member, Author) commented Jan 30, 2023

After spending a while trying to plumb some of our more 'exotic' optimizations through the GT_SELECT support in the xarch backend, I'm no longer so sure that doing this the containment way is the best approach.

Containment for normal relops is not so bad, but the xarch backend can also optimize JTRUE (EQ (AND x y) 0) (and others) into something that avoids the relop entirely and uses ZF after the AND. Doing this via containment for SELECT codegen (i.e. marking the AND contained in the SELECT) turns out to be quite hard due to the 'destructive' nature of instructions like and on xarch.
The containment makes it apparent that we don't actually need the value at all, just the flags, but unfortunately xarch does not have an equivalent of arm64's and xzr, xop1, xop2 -- so we still need an internal register to compute the result into. And we would like all the preferencing that comes with that, as well as the tricks that genCodeForBinary employs. This ends up being pretty hacky to implement, since internal registers follow a different lifetime model in LSRA than normal defs, so we end up needing multiple locations for GT_SELECT to get the same good codegen that we normally would.
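
To make the asymmetry concrete (a hedged illustration, not actual JIT output): arm64 can compute the flags of an AND without naming a destination, while a plain x64 and always overwrites one of its operands, which is what forces the internal register in the contained-SELECT case.

```cpp
// Hedged illustration of the 'destructive and' asymmetry (toy output, not JIT
// codegen): when only the flags of (x & y) are needed, arm64 can discard the
// result via the zero register, while x64's 'and' must write a register, so a
// scratch/internal register is needed if neither source may be clobbered.
#include <cstdio>

void EmitSelectOnFlagsOfAnd(bool isArm64)
{
    if (isArm64)
    {
        // 'tst' is an alias for 'ands xzr, x1, x2': flags only, result discarded.
        printf("tst    x1, x2\n");
        printf("csel   x0, x3, x4, ne\n");
    }
    else
    {
        // x64: 'and' overwrites its destination, so compute into a scratch register.
        printf("mov    r10, rcx\n");
        printf("and    r10, rdx\n");
        printf("mov    rax, r9\n");
        printf("cmovne rax, r8\n");
    }
}

int main()
{
    EmitSelectOnFlagsOfAnd(true);
    printf("--\n");
    EmitSelectOnFlagsOfAnd(false);
    return 0;
}
```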

I had a go at implementing it, and it does seem doable in this particular case, but I'm not sure the approach is really scalable. For reference, this is what the parts ended up looking like:

  • Containment
  • LSRA
  • Codegen (ideally we could just call genCodeForBinary, but this required some refactoring to get the target register threaded through and to emit the right instructions properly)

jakobbotsch (Member, Author) commented:

There is a prototype of the first approach, i.e. explicitly representing flags dependencies in LIR, in #82355. My thoughts on that work are in #82355 (comment).

I think the explicit representation would be the way forward, but I don't think the trade-off is worth it today, so I will move this to the Future milestone.

jakobbotsch modified the milestones: 8.0.0, Future Apr 13, 2023
pentp (Contributor) commented Apr 15, 2023

> Containment for normal relops is not so bad, but the xarch backend can also optimize JTRUE (EQ (AND x y) 0) (and others) into something that avoids the relop entirely and uses ZF after the AND. Doing this via containment for SELECT codegen (i.e. marking the AND contained in the SELECT) turns out to be quite hard due to the 'destructive' nature of instructions like and on xarch. The containment makes it apparent that we don't actually need the value at all, just the flags, but unfortunately xarch does not have an equivalent of arm64's and xzr, xop1, xop2 -- so we still need an internal register to compute the result into.

For AND specifically there's TEST, which sets the exact same flags. I assume the current use of AND in cases like this just means the optimization is incomplete?

jakobbotsch (Member, Author) commented Apr 19, 2023

> For AND specifically there's TEST, which sets the exact same flags. I assume the current use of AND in cases like this just means the optimization is incomplete?

This was just an example; EQ/NE(AND(...), 0) is indeed transformed into code that emits test instead, so it doesn't suffer from the problem described above. Regardless, the restrictions related to that comment were lifted in #82235.
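
For reference, this is the shape being referred to (a hedged illustration, not literal JIT output): for EQ/NE(AND(x, y), 0), the flags can come from a test, which writes no destination register, so the destructive-and problem does not arise for this pattern.

```cpp
// Hedged illustration (toy output, not JIT codegen): for EQ/NE(AND(x, y), 0) the
// flags of (x & y) can come from 'test', which computes the same flags as 'and'
// without writing a destination register, so no scratch register is needed.
#include <cstdio>

int main()
{
    // JTRUE(EQ(AND(x, y), 0)) -- branch taken when (x & y) == 0.
    printf("test rcx, rdx\n");
    printf("je   L_taken\n");
    return 0;
}
```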
