JIT: Reconsider how to represent defs/uses of CPU flags (GTF_SET_FLAGS) #74867

Open · jakobbotsch opened this issue Aug 31, 2022 · 9 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Milestone: Future

jakobbotsch (Member) commented Aug 31, 2022

When GTF_SET_FLAGS is set on a node, it indicates that a later node may consume the CPU flags set by this node. However, this dependency sits outside the JIT's modelling of values in the IR, since we do not track CPU flag dependencies at all. As a result, the interference checking we have today is not really sufficient, and it complicates things such as potentially implementing rematerialization, which could introduce flag-trashing nodes at arbitrary places.

Today this works out because we do very limited transformations on LIR and because we use the GTF_SET_FLAGS capability only in limited situations: for decomposed longs on 32-bit platforms and for conditional branching.
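
For illustration, here is a minimal toy sketch (C++, not dotnet/runtime code; the node shape and the scan are simplified assumptions) of the kind of ad-hoc interference check the out-of-band flags model forces on anything that wants to insert code into LIR, such as rematerialization:

```cpp
// Toy model, not the dotnet/runtime JIT code: nodes in execution (LIR) order,
// with an out-of-band "sets flags"/"uses flags" bit instead of an explicit
// def/use edge for the flags.
#include <cstdio>
#include <vector>

struct Node
{
    const char* name;
    bool setsFlags;    // analogous to GTF_SET_FLAGS
    bool usesFlags;    // e.g. a conditional branch consuming the flags
    bool trashesFlags; // most arithmetic on x86/x64 clobbers the flags
};

// Inserting 'newNode' in front of position 'pos' (as rematerialization might) is
// only safe if it does not land between a live flags def and its consumer -- a
// constraint the value def/use model knows nothing about, hence the ad-hoc scan.
bool SafeToInsertAt(const std::vector<Node>& lir, size_t pos, const Node& newNode)
{
    if (!newNode.trashesFlags)
    {
        return true;
    }

    bool flagsLive = false;
    for (size_t i = 0; i < lir.size(); i++)
    {
        if ((i == pos) && flagsLive)
        {
            return false; // would trash flags that a later node still needs
        }
        if (lir[i].usesFlags)
        {
            flagsLive = false;
        }
        if (lir[i].setsFlags)
        {
            flagsLive = true;
        }
    }
    return true;
}

int main()
{
    std::vector<Node> lir = {
        {"CMP x, y", true, false, true}, // flags def
        {"JCC",      false, true, false} // flags use
    };
    Node remat = {"ADD t, t, 1", false, false, true};

    printf("insert before JCC: %s\n", SafeToInsertAt(lir, 1, remat) ? "ok" : "unsafe");
    printf("insert at the end: %s\n", SafeToInsertAt(lir, 2, remat) ? "ok" : "unsafe");
    return 0;
}
```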

If we want rematerialization we are probably going to have to solve this in some way. Two solutions come to mind:

  • Actually model CPU flags at the IR level, in a way that lets us reason about them when doing interference checks and from LSRA. Probably quite a bit of work.
  • Reimplement it via containment, i.e. contain child nodes that may generate values into CPU flags in parent nodes that can consume values from CPU flags. This may not be as powerful as the above, but allows us to rely on the existing def/use model and thus seems much simpler. Compare chains on ARM64 work based on this model today.

One fairly simple step might be to move all non-decomposition uses of GTF_SET_FLAGS to the containment model, which should mean that rematerialization becomes possible in many contexts (e.g. always on 64-bit targets).
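
As a rough sketch of the containment-style option (a toy with invented types and helpers, not the real Lowering or codegen code): the flags-defining compare becomes a contained child of the node that consumes the flags, so the ordinary parent/child edge carries the dependency and the two instructions are emitted back to back.

```cpp
// Toy sketch of the containment model for flags (invented types, not the real
// Lowering/codegen code): the compare that defines the flags is a contained
// child of the branch that uses them, so the ordinary parent/child (def/use)
// edge carries the dependency and nothing can be scheduled between the flags
// def and the flags use.
#include <cstdio>

struct ToyNode
{
    const char* oper;
    ToyNode*    op1       = nullptr;
    bool        contained = false;
};

void GenCodeForNode(const ToyNode& node)
{
    if (node.contained)
    {
        return; // codegen is handled as part of the containing parent
    }

    if ((node.op1 != nullptr) && node.op1->contained)
    {
        // The parent emits both the flags def and the flags use, back to back.
        printf("cmp  rcx, rdx   ; contained '%s': flags def\n", node.op1->oper);
        printf("je   L_taken    ; '%s': flags use\n", node.oper);
        return;
    }

    printf("%s ...\n", node.oper);
}

int main()
{
    ToyNode cmp   = {"eq"};          // relop that would otherwise produce a value
    cmp.contained = true;            // marked contained during lowering
    ToyNode jtrue = {"jtrue", &cmp}; // the branch consumes the contained compare

    GenCodeForNode(cmp);   // emits nothing by itself
    GenCodeForNode(jtrue); // emits the cmp + jcc pair
    return 0;
}
```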

cc @dotnet/jit-contrib

category:proposal
theme:ir

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Aug 31, 2022
ghost added the untriaged label Aug 31, 2022

jakobbotsch added this to the 8.0.0 milestone Aug 31, 2022
jakobbotsch removed the untriaged label Aug 31, 2022
jakobbotsch (Member, Author) commented Aug 31, 2022

> One fairly simple step might be to move all non-decomposition uses of GTF_SET_FLAGS to the containment model

As part of this, GT_SETCC can probably also be removed, and the behavior of compare nodes can be made more in line with the rest of the IR (i.e. always producing values in registers unless contained).
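
In instruction terms, that could look roughly like the following (a hedged illustration, not actual JIT output): contained in its flag-consuming user, the compare only defines the flags; otherwise it materializes a value the usual setcc way.

```cpp
// Hedged illustration (toy output, not actual JIT codegen) of the two shapes an
// equality compare could take on x64 if it always produced a register value
// unless contained.
#include <cstdio>

void EmitEqualityCompare(bool containedInBranch)
{
    printf("cmp   rcx, rdx\n");
    if (containedInBranch)
    {
        // Contained: the flag-consuming user (a branch here) uses the flags directly.
        printf("je    L_taken\n");
    }
    else
    {
        // Uncontained: the compare materializes a value in a register, and a
        // later consumer re-tests that value.
        printf("sete  al\n");
        printf("movzx eax, al\n");
        printf("test  eax, eax\n");
        printf("jne   L_taken\n");
    }
}

int main()
{
    EmitEqualityCompare(true);
    printf("--\n");
    EmitEqualityCompare(false);
    return 0;
}
```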

BruceForstall (Member) commented:

Containment currently means that the "containee" doesn't generate any code, and that its effect is completely represented in the instruction generated by the "container" node. This would no longer be true. The "container" would be generating multiple instructions.

Also, no instructions could be placed between them (this probably doesn't matter much currently -- and even if it were used, any intervening instructions couldn't affect the flags).

jakobbotsch (Member, Author) commented:

> Containment currently means that the "containee" doesn't generate any code, and that its effect is completely represented in the instruction generated by the "container" node. This would no longer be true. The "container" would be generating multiple instructions.

Do you see any issues with this? To me it is an arbitrary limitation to put on containment, especially considering that many nodes today already need to generate more than one instruction.
I personally never really thought of containment as "its effect is completely represented in the instruction generated by the parent"; I always thought of containment as a general "codegen is handled as part of the parent". I think we already have many cases where a more flexible definition of containment is used (e.g. codegen for PUTARG_STK with a contained GT_OBJ comes to mind).

> Also, no instructions could be placed between them (this probably doesn't matter much currently -- and even if it were used, any intervening instructions couldn't affect the flags).

Indeed, that's what I meant when I said this might not be as powerful as a fully-fledged way to reason about CPU flags.

BruceForstall (Member) commented:

> I always thought of containment as a general "codegen is handled as part of the parent".

I think that's a reasonable way to think about it. Ideally, post-lower, an IR node generates exactly one machine instruction. It's a simple mental model. There are probably exceptions today as you note. As long as LSRA's register needs are exposed, it works even if more than one instruction is generated. But it does make any more general "scheduling" (which we don't do) potentially harder.

jakobbotsch (Member, Author) commented Jan 30, 2023

After spending a while trying to plumb some of our more 'exotic' optimizations through the GT_SELECT support in the xarch backend, I'm no longer so sure that doing this the containment way is the best approach.

Containment for normal relops is not so bad, but the xarch backend can also optimize JTRUE (EQ (AND x y) 0) (and others) into something that avoids the relop entirely and uses ZF after the AND. Doing this via containment for SELECT codegen (i.e. marking the AND contained in the SELECT) turns out to be quite hard due to the 'destructive' nature of instructions like and on xarch.
The containment makes it apparent that we don't actually need the value at all, just the flags, but unfortunately xarch does not have an equivalent of arm64's and xzr, xop1, xop2 -- so we still need an internal register to compute the result into. And we would like all the preferencing that comes with that, as well as the tricks that genCodeForBinary employs. This ends up being pretty hacky to implement, since internal registers follow a different lifetime model in LSRA than normal defs, so we end up needing multiple locations for GT_SELECT to get the same good codegen that we normally would.
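
To make the asymmetry concrete (a hedged illustration, not actual JIT output): arm64 can compute the flags of an AND without naming a destination, while a plain x64 and always overwrites one of its operands, which is what forces the internal register in the contained-SELECT case.

```cpp
// Hedged illustration of the 'destructive and' asymmetry (toy output, not JIT
// codegen): when only the flags of (x & y) are needed, arm64 can discard the
// result via the zero register, while x64's 'and' must write a register, so a
// scratch/internal register is needed if neither source may be clobbered.
#include <cstdio>

void EmitSelectOnFlagsOfAnd(bool isArm64)
{
    if (isArm64)
    {
        // 'tst' is an alias for 'ands xzr, x1, x2': flags only, result discarded.
        printf("tst    x1, x2\n");
        printf("csel   x0, x3, x4, ne\n");
    }
    else
    {
        // x64: 'and' overwrites its destination, so compute into a scratch register.
        printf("mov    r10, rcx\n");
        printf("and    r10, rdx\n");
        printf("mov    rax, r9\n");
        printf("cmovne rax, r8\n");
    }
}

int main()
{
    EmitSelectOnFlagsOfAnd(true);
    printf("--\n");
    EmitSelectOnFlagsOfAnd(false);
    return 0;
}
```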

I had a go at implementing it, and it does seem doable in this particular case, but I'm not sure the approach is really scalable. For reference, this is what the parts ended up looking like:

  • Containment
  • LSRA
  • Codegen (ideally we could just call genCodeForBinary, but this required some refactoring to get the target register threaded through and to emit the right instructions properly)

jakobbotsch (Member, Author) commented:

There is a prototype of the first approach, i.e. explicitly representing flags dependencies in LIR, in #82355. My thoughts on that work are in #82355 (comment).

I think the explicit representation would be the way forward, but I don't think the trade-off is worth it today, so I will move this to the Future milestone.

jakobbotsch modified the milestones: 8.0.0, Future Apr 13, 2023
pentp (Contributor) commented Apr 15, 2023

> Containment for normal relops is not so bad, but the xarch backend can also optimize JTRUE (EQ (AND x y) 0) (and others) into something that avoids the relop entirely and uses ZF after the AND. Doing this via containment for SELECT codegen (i.e. marking the AND contained in the SELECT) turns out to be quite hard due to the 'destructive' nature of instructions like and on xarch. The containment makes it apparent that we don't actually need the value at all, just the flags, but unfortunately xarch does not have an equivalent of arm64's and xzr, xop1, xop2 -- so we still need an internal register to compute the result into.

For AND specifically there's TEST, which sets the exact same flags. I assume the current use of AND in cases like this just means the optimization is incomplete?

jakobbotsch (Member, Author) commented Apr 19, 2023

> For AND specifically there's TEST, which sets the exact same flags. I assume the current use of AND in cases like this just means the optimization is incomplete?

This was just an example; EQ/NE(AND(...), 0) is indeed transformed into code that emits test instead, so it doesn't suffer from the problem described above. Regardless, the restrictions related to that comment were lifted in #82235.
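
For reference, this is the shape being referred to (a hedged illustration, not literal JIT output): for EQ/NE(AND(x, y), 0), the flags can come from a test, which writes no destination register, so the destructive-and problem does not arise for this pattern.

```cpp
// Hedged illustration (toy output, not JIT codegen): for EQ/NE(AND(x, y), 0) the
// flags of (x & y) can come from 'test', which computes the same flags as 'and'
// without writing a destination register, so no scratch register is needed.
#include <cstdio>

int main()
{
    // JTRUE(EQ(AND(x, y), 0)) -- branch taken when (x & y) == 0.
    printf("test rcx, rdx\n");
    printf("je   L_taken\n");
    return 0;
}
```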
