Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift/x64 backend: do not use one-way branches. #10086

Merged

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Jan 23, 2025

In #9980, we saw that code compiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, between two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend.

When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit two-target conditionals that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the MachBuffer: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences.

Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of two conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent.

I write "almost" because the register allocator is allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true.

The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with edges; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug.

This PR removes JmpIf and converts the br-of-fcmp lowering to use JmpCondOr instead, which is a pseudoinstruction that does jcc1; jcc2; jmp. This maintains the BB invariant and fixes the bug.

Note that Winch still uses JmpIf, so we cannot remove it entirely: this PR renames it to WinchJmpIf instead, and adds a mechanism to assert failure if it is ever added to VCode (rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch's jmp_if assembler function in terms of JmpCond with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can remove WinchJmpIf.

This is one of two instances of a "one-armed branch"; the other is s390x's OneWayCondBr, used in br_table lowerings, which we will address separately. Once we do, that will address #9980 entirely.

@cfallin cfallin requested a review from a team as a code owner January 23, 2025 00:07
@cfallin cfallin requested review from fitzgen and removed request for a team January 23, 2025 00:07
In bytecodealliance#9980, we saw that code copmiled with the single-pass register
allocator has incorrect behavior. We eventually narrowed this down to
the fact that the single-pass allocator is inserting code meant to be
at the end of a block, just before its terminator, *between* two
branches that form the terminator sequence. The allocator is correct;
the bug is with Cranelift's x64 backend.

When we produce instructions into a VCode container, we maintain basic
blocks, and we have the invariant (usual for basic block-based IR)
that only the last -- terminator -- instruction is a branch that can
leave the block. Even the conditional branches maintain this
invariant: though VCode is meant to be "almost machine code", we
emit *two-target conditionals* that are semantically like "jcond;
jmp". We then are able to optimize this inline during binary emission
in the `MachBuffer`: the buffer knows about unconditional and
conditional branches and will "chomp" branches off the tail of the
buffer whenever they target the fallthrough block. (We designed the
system this way because it is simpler to think about BBs that are
order-invariant, i.e., not bake the "fallthrough" concept into the
IR.) Thus we have a simpler abstraction but produce optimal terminator
sequences.

Unfortunately, when adding a branch-on-floating-point-compare
lowering, we had the need to branch to a target if either of *two*
conditions were true, and rather than add a new kind of terminator
instruction, we added a "one-armed branch": conditionally branch to
label or fall through. We emitted this in sequence right before the
actual terminator, so semantically it was almost equivalent.

I write "almost" because the register allocator *is* allowed to insert
spills/reloads/moves between any two instructions. Here the distinct
pieces of the terminator sequence matter: the allocator might insert
something just before the last instruction, assuming the basic-block
"single in, single out" invariant means this will always run with the
block. With one-armed branches this is no longer true.

The backtracking allocator (our original RA2 algorithm, and still the
default today) will never insert code at the end of a block when it
has multiple terminators, because it associates such block-start/end
insertions with *edges*; so in such conditions it inserts instructions
into the tops of successor blocks instead. But the single-pass
allocator needs to perform work at the end of every block, so it will
trigger this bug.

This PR removes `JmpIf` and converts the br-of-fcmp lowering to use
`JmpCondOr` instead, which is a pseudoinstruction that does `jcc1;
jcc2; jmp`. This maintains the BB invariant and fixes the bug.

Note that Winch still uses `JmpIf`, so we cannot remove it entirely:
this PR renames it to `WinchJmpIf` instead, and adds a mechanism to
assert failure if it is ever added to `VCode` (rather than emitted
directly, as Winch's macro-assembler does). We could instead write
Winch's `jmp_if` assembler function in terms of `JmpCond` with a
fallthrough label that is immediately bound, and let the MachBuffer
always chomp the jmp; I opted not to regress Winch compiler
performance by doing this. If one day we abstract out the assembler
further, we can remove `WinchJmpIf`.

This is one of two instances of a "one-armed branch"; the other is
s390x's `OneWayCondBr`, used in `br_table` lowerings, which we will
address separately. Once we do, that will address bytecodealliance#9980 entirely.
@cfallin cfallin force-pushed the lets-keep-our-basic-blocks-basic branch from 89a6136 to 82b833e Compare January 23, 2025 00:10
@cfallin
Copy link
Member Author

cfallin commented Jan 23, 2025

(Thanks to @alexcrichton and others for bottoming this out into the root cause; discussed earlier today in the Cranelift meeting.)

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Thanks for trying to be defensive with the low-level branching instruction bits as well. Kind of annoying that we are dealing with that. Hopefully a separate assembler library will help here...

cranelift/codegen/src/isa/x64/inst/emit.rs Show resolved Hide resolved
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen winch Winch issues or pull requests labels Jan 23, 2025
Copy link

Subscribe to Label Action

cc @saulecabrera

This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:area:x64", "winch"

Thus the following users have been cc'd because of the following labels:

  • saulecabrera: winch

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@cfallin cfallin force-pushed the lets-keep-our-basic-blocks-basic branch from 3a50e58 to 34ba441 Compare January 23, 2025 02:10
@cfallin cfallin enabled auto-merge January 23, 2025 02:14
@cfallin cfallin added this pull request to the merge queue Jan 23, 2025
Merged via the queue into bytecodealliance:main with commit 392c7a9 Jan 23, 2025
51 checks passed
@cfallin cfallin deleted the lets-keep-our-basic-blocks-basic branch January 23, 2025 02:50
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 23, 2025
This is a followup to bytecodealliance#10086, this time removing the one-armed branch
variant for s390x. This branch was only used as the default-target
branch in the `br_table` lowering. This PR incorporates the branch into
the `JTSequence` pseudo-instruction. Some care is needed to keep the
`ProducesBool` abstraction; it is unwrapped into its `ProducesFlags` and
the `JTSequence` becomes a `ConsumesFlags`, so the compare for the
jump-table bound (for default target) is not part of the pseudoinst.
(This is OK because regalloc-inserted moves never alter flags, by
explicit contract; the same reason allows cmp/branch terminators.)

Along the way I noticed that the comments on `JTSequence` claimed that
`targets` included the default, but this is (no longer?) the case, as
the targets are unwrapped by `jump_table_targets` which peels off the
first (default) separately. Aside from comments, this only affected
pretty-printing; codegen was correct.

With this, we have no more one-armed branches; hence, this fixes bytecodealliance#9980.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 23, 2025
This is a followup to bytecodealliance#10086, this time removing the one-armed branch
variant for s390x. This branch was only used as the default-target
branch in the `br_table` lowering. This PR incorporates the branch into
the `JTSequence` pseudo-instruction. Some care is needed to keep the
`ProducesBool` abstraction; it is unwrapped into its `ProducesFlags` and
the `JTSequence` becomes a `ConsumesFlags`, so the compare for the
jump-table bound (for default target) is not part of the pseudoinst.
(This is OK because regalloc-inserted moves never alter flags, by
explicit contract; the same reason allows cmp/branch terminators.)

Along the way I noticed that the comments on `JTSequence` claimed that
`targets` included the default, but this is (no longer?) the case, as
the targets are unwrapped by `jump_table_targets` which peels off the
first (default) separately. Aside from comments, this only affected
pretty-printing; codegen was correct.

With this, we have no more one-armed branches; hence, this fixes bytecodealliance#9980.
github-merge-queue bot pushed a commit that referenced this pull request Jan 23, 2025
* Cranelift/s390x: do not use one-way conditional branches.

This is a followup to #10086, this time removing the one-armed branch
variant for s390x. This branch was only used as the default-target
branch in the `br_table` lowering. This PR incorporates the branch into
the `JTSequence` pseudo-instruction. Some care is needed to keep the
`ProducesBool` abstraction; it is unwrapped into its `ProducesFlags` and
the `JTSequence` becomes a `ConsumesFlags`, so the compare for the
jump-table bound (for default target) is not part of the pseudoinst.
(This is OK because regalloc-inserted moves never alter flags, by
explicit contract; the same reason allows cmp/branch terminators.)

Along the way I noticed that the comments on `JTSequence` claimed that
`targets` included the default, but this is (no longer?) the case, as
the targets are unwrapped by `jump_table_targets` which peels off the
first (default) separately. Aside from comments, this only affected
pretty-printing; codegen was correct.

With this, we have no more one-armed branches; hence, this fixes #9980.

* Review feedback.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator winch Winch issues or pull requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants