Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x64: port select to ISLE #3682

Merged
merged 3 commits into from
Feb 23, 2022
Merged

x64: port select to ISLE #3682

merged 3 commits into from
Feb 23, 2022

Conversation

abrown
Copy link
Contributor

@abrown abrown commented Jan 11, 2022

This change ports CLIF's select instruction to ISLE for x64.

@abrown
Copy link
Contributor Author

abrown commented Jan 11, 2022

This is a complex lowering due to not only the tree-matching of the CLIF IR but also the sequences required by x64. Because of this the PR will need to go through a few more changes:

  • add a way to emit a sequence like the following for FloatCC.Equal and FloatCC.NotEqual: UNORD* + MOV + JNP + MOV + JZ + MOV--because this sequence must reuse the same flags (and avoid modifications to them, @cfallin suggested implementing a new Inst::XmmCmoveOr (but that also implies building a new Inst::CmoveOr as well)
  • add a way to emit conditional moves for multiple register values
  • implement the integer comparison side of these instructions (both the tree-matching version and the non-tree-matching versions)

I'm putting this PR up as a draft to see if I can get some help with the above and early feedback on any suggested refactoring to the rules.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Jan 11, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

size: *size,
consequent: consequent.clone(),
alternative: alternative.clone(),
dst: dst.clone(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overwrites dst from above, I think -- it seems we want the first cmove to write a tmp, and alternative in the second to be that tmp? So then we have the equivalent of cmove(cc1, a, cmove(cc2, b, alternative)).

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good to me!

I think with some additional run tests we can start landing this incremental work and then finish up with the rest of the select lowerings in follow up PRs (if you want; if you prefer doing everything in one big PR that's no problem too).

cranelift/codegen/src/isa/x64/inst/mod.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/inst/mod.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/lower.isle Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/inst.isle Outdated Show resolved Hide resolved
cranelift/filetests/filetests/isa/x64/cmp-mem-bug.clif Outdated Show resolved Hide resolved
@abrown
Copy link
Contributor Author

abrown commented Jan 15, 2022

@cfallin, @fitzgen: I am finding some very strange behavior with ISLE and I'm not sure what is going on. Take, e.g., the following runtest:

test run
target x86_64

function %select_le_i32(f32, f32) -> i32 {
block0(v0: f32, v1: f32):
    v2 = fcmp le v0, v1
    v3 = iconst.i32 1
    v4 = iconst.i32 0
    v5 = select v2, v3, v4
    return v5
}
; run: %select_le_i32(0x42.42, 0.0) == 0
; run: %select_le_i32(0.0, 0.0) == 1
; run: %select_le_i32(0x0.0, 0x42.42) == 1
; run: %select_le_i32(0x0.0, NaN) == 0

function %select_le_f32(f32, f32) -> f32 {
block0(v0: f32, v1: f32):
    v2 = fcmp le v0, v1
    v3 = f32const 0x1.0
    v4 = f32const 0x0.0
    v5 = select v2, v3, v4
    return v5
}
; run: %select_le_f32(0x42.42, 0.0) == 0x0.0
; run: %select_le_f32(0.0, 0.0) == 0x1.0
; run: %select_le_f32(0x0.0, 0x42.42) == 0x1.0
; run: %select_le_f32(0x0.0, NaN) == 0x0.0

If we run that (RUST_LOG=cranelift_codegen::machinst::compile=trace cargo run -p cranelift-tools -- test -dv scratch.clif), the second set of tests fail:

FAIL scratch.clif: run

Caused by:
    Failed test: run: %select_le_f32(0x1.090800p6, 0.0) == 0.0, actual: 0x1.000000p

The reason for this is that %xmm0 and %xmm1 are magically "flipped" when passed to the comparison instruction, ucomiss. The first function, select_le_i32, has the correct order (ucomiss %xmm0, %xmm1) whereas the second does not (ucomiss %xmm1, %xmm0). This is visible in the log statements for each (in order):

 TRACE cranelift_codegen::machinst::compile > vcode after regalloc: final version:
VCode_ShowWithRRU {{
  Entry block: 0
Block 0:
  (original IR block: block0)
  (instruction range: 0 .. 10)
  Inst 0:   pushq   %rbp
  Inst 1:   movq    %rsp, %rbp
  Inst 2:   xorl    %esi, %esi
  Inst 3:   movl    $1, %edi
  Inst 4:   ucomiss %xmm0, %xmm1
  Inst 5:   cmovnbl %edi, %esi
  Inst 6:   movq    %rsi, %rax
  Inst 7:   movq    %rbp, %rsp
  Inst 8:   popq    %rbp
  Inst 9:   ret
}}
 TRACE cranelift_codegen::machinst::compile > vcode after regalloc: final version:
VCode_ShowWithRRU {{
  Entry block: 0
Block 0:
  (original IR block: block0)
  (instruction range: 0 .. 11)
  Inst 0:   pushq   %rbp
  Inst 1:   movq    %rsp, %rbp
  Inst 2:   xorps   %xmm2, %xmm2
  Inst 3:   movl    $1065353216, %esi
  Inst 4:   movd    %esi, %xmm3
  Inst 5:   ucomiss %xmm1, %xmm0
  Inst 6:   jb $next; movss %xmm3, %xmm2; $next: 
  Inst 7:   movaps  %xmm2, %xmm0
  Inst 8:   movq    %rbp, %rsp
  Inst 9:   popq    %rbp
  Inst 10:   ret
}}

In the code on this branch, both lowerings use the fpcmp helper for the comparison and the cmove helper to emit either a true CMOV in the first case or a jump version in the second. I would expect both to have the same ucomiss ordering but they do not.

I then try to determine if in fact the "less than or equal" rule is firing in both cases. Using a well-placed panic! I now see that it is not firing in the first case but is in the second (this implies that I will have to rather unintuitively flip the operands in the rule to make the lowering come out right).

But why is the rule not firing in the first case? After debugging through generated_code.rs for a while, it becomes clear that it is bailing out in the fpcmp helper. Ok, I guess the type being passed in to the helper should be the comparison type, not the select type. [...spends time trying to figure out how to grab that type]. I guess first that has_type wraps def_inst--nope. Does it wrap the fcmp itself? Nope. How about wrapping one of fcmp's operands? That fails with an Unknown variable 'a'.

Any help?

@fitzgen
Copy link
Member

fitzgen commented Jan 18, 2022

Ok, I guess the type being passed in to the helper should be the comparison type, not the select type. [...spends time trying to figure out how to grab that type].

You can use value_type to extract the type of a value, e.g. an operand to the instruction you are lowering:

(rule (lower (select val @ (value_type val_ty) ...))
      ...)

@abrown
Copy link
Contributor Author

abrown commented Jan 29, 2022

@fitzgen, @cfallin: this should be ready for review. The lower.isle code is a lot more readable now but I suspect there are better ways to do things; concretely, to somehow avoid the "use values instead of regs" pattern and the hard-coding in the i128 versions of cmove.

I've also opened #3743 and #3744 as a result of this work. Work that is still TODO but could probably go in separate PRs is to translate the other side of the select, the integer side, to ISLE and upgrade the CLIF fuzzer to generate these kinds of instructions (it only emits simple numerics now).

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How confident are you in these lowerings? We talked before about getting the clif interpreter and fuzzer working for f32/f64 constants, fcmp, and select. Should we do that before landing these changes?

cranelift/codegen/src/isa/x64/inst.isle Outdated Show resolved Hide resolved
Comment on lines +919 to +1547
(let ((cons ValueRegs (put_in_regs consequent))
(alt ValueRegs (put_in_regs alternative))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that cmove_from_values should take these ValueRegs instead of Values. This lets us reuse it in places where we've already put the values into registers.

(decl cmove_from_values (Type CC ValueRegs ValueRegs) ConsumesFlags)
(rule (cmove_from_vlaue $I128 cc consequent alternative)
      ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside to that is that it bloats the rules in lower.isle; can we do this refactoring in a future PR or issue, e.g., the next time cmove is used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that works for me.

Comment on lines 925 to 927
;; TODO there must be some way to say that both of these
;; instructions consume the flags together.
(_ Unit (emit (MInst.Cmove size cc (RegMem.Reg (value_regs_get cons 0)) (value_regs_get alt 0) dst1))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing the emit here is actually unsafe because nothing is guaranteeing that

  • we just emitted the flags producer before this
  • nothing else is emitted after the flags producer and before this flags consumer that clobbers the flags

What we want to do is return two ConsumesFlags here that we can then pass into with_flags_2 along with a FlagsProducer.

ISLE doesn't have tuple/fixed-size-arrays support, so we will have to define an extern type for this.

But! The other rule for cmove_from_values for when it is a single-register value emits only one ConsumesFlags, so I think we actually want an extern type that is either one or two ConsumesFlags. I think we can just use a SmallVec<[ConsumesFlags; 2]>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we will want a with_flags_* variant that takes the SmallVec<[ConsumesFlags; 2]> as well, and it will probably have to be implemented in external Rust code so it can loop over each ConsumesFlags in the small vec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a lot of new infrastructure to build and this PR may already have enough of that. The current lowering rules perform the same unsafe emission as this PR. Can't this be done afterwards?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current lowering rules for select are written in Rust, so it isn't really comparable. It would be one thing if this emit was near the emit for the flags-producer instruction and the emit for the other flags-consuming instruction, but it isn't. It isn't even clear to me, reading this code, that the instructions will always be emitted in the correct order!

It seems to me that the order of evaluation will be something like

  • construct FlagsProducer (do not emit it)
  • call this constructor
    • emit first cmove
    • construct FlagsConsumer with the second cmove
  • call a with_flags variant with the constructed FlagsProducer and FlagsConsumer
    • emit the producer
    • emit the consumer

If that is actually how things end up evaluating, then we would get cmove; test/cmp; cmove instead of test/cmp; cmove; cmove.

It looks like we aren't testing select with i128s yet (since you have only ported floating point selects so far). But this is exactly the kind bug I want to avoid with a dedicated with_flags_* variant that takes SmallVec<[ConsumesFlags; 2]>, as described above.

I really think we should fix this issue before landing this PR, or change this PR to not define and use cmove_from_values and instead use plain cmove (which will work for all the single-register values for select that are are currently supported in ISLE).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, @cfallin and I (but mostly @cfallin), worked on this a bit. The unpolished results are in 667935f--@fitzgen, thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extensions to with_flags, ProducesFlags, and ConsumesFlags look good!

cranelift/codegen/src/isa/x64/inst.isle Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/inst.isle Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/lower.isle Show resolved Hide resolved
@abrown
Copy link
Contributor Author

abrown commented Feb 1, 2022

How confident are you in these lowerings? We talked before about getting the clif interpreter and fuzzer working for f32/f64 constants, fcmp, and select. Should we do that before landing these changes?

I would vote "no." ISLE basically creates a conflict for every rebased commit requiring an ISLE rebuild and Git twiddling to ensure things are right before moving on. For those of use who try to keep a clean history with atomic commits, this additional burden makes keeping a long-running PR like this pretty painful. As you can see, I'm now just committing review patches in hopes that I can do one final rebase before merging.

Also, the CLIF interpreter has not been extended for any of the other ISLE lowerings so doing it here is extra work that no other PRs have had to do. I don't mind doing this kind of thing but the "build up the infrastructure before merging" tasks are starting to pile up: bringing FloatCC to ISLE (which turned into bringing all enums to ISLE), create a new machinst variant for CmoveOr, create ISLE helpers for various things (cmove, FP comparison, register class checks), create CLIF run tests for select, find and report bugs with extractors, and (just recently added) figure out a way to express that two instructions consume the same flags. So I was hoping to not have to build up a fuzz target to get this merged.

@cfallin
Copy link
Member

cfallin commented Feb 1, 2022

@abrown @fitzgen from my own perspective at least, I think it's fair to merge this at its current state and take some followup PRs/tasks -- I agree with @abrown that this pretty much maintains our status-quo in the sense that it passes existing tests. Deferring to @fitzgen on the detailed review discussions above of course...

@github-actions github-actions bot added the cranelift:area:aarch64 Issues related to AArch64 backend. label Feb 15, 2022
@abrown abrown force-pushed the isle-select branch 2 times, most recently from bcc3249 to dff48f9 Compare February 18, 2022 21:15
@abrown
Copy link
Contributor Author

abrown commented Feb 18, 2022

Not sure why aarch64 and s390x CLIF tests are failing--they shouldn't be running!

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ from me at least, having been through this code while resolving some issues with @abrown last week, I think the basic approach is sound and this is more than enough improvement to merge at its current state, even if we still have further to go to complete the transition on this opcode. Also, as you mentioned above, the rebasing pain increases the longer we keep this pending, and there are other changes (e.g. implicit conversions) I want to land that should come after this.

It looks like there are some filetest failures, but I didn't look in-depth at what's causing them; perhaps some holes in the pattern coverage or some code that can't quite be deleted from the handwritten side yet?

It looks like @fitzgen is happy with the refactorings we did together but I'll defer to him to make sure he's happy with the final state of this.

Thanks again for all the patience here!

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, let's merge this and incrementally continue improving it in follow ups.

This change includes quite a few interlocking parts, required mainly by
the current x64 conventions in ISLE:
 - it adds a way to emit a `cmove` with multiple OR-ing conditions;
   because x64 ISLE cannot currently safely emit a comparison followed
   by several jumps, this adds `MachInst::CmoveOr` and
   `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro
   instructions hide the multi-instruction sequence in `lower.isle`
 - to properly keep track of what instructions consume and produce
   flags, @cfallin added a way to pass around variants of
   `ConsumesFlags` and `ProducesFlags`--these changes affect all
   backends
 - then, to lower the `fcmp + select` CLIF, this change adds several
   `cmove*_from_values` helpers that perform all of the awkward
   conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one
   upside is that now these lowerings have much-improved documentation
   explaining why the various `FloatCC` and `CC` choices are made the
   the way they are.

Co-authored-by: Chris Fallin <chris@cfallin.org>
@abrown abrown merged commit f87c611 into bytecodealliance:main Feb 23, 2022
@abrown abrown deleted the isle-select branch February 23, 2022 18:03
cfallin added a commit to cfallin/wasmtime that referenced this pull request Mar 16, 2022
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of
its operands, which allows the compiler to merge a load with the compare
instruction (`ucomiss` or `ucomisd`).

Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this
does not work with our lowering algorithm because compares can be
lowered more than once (unlike all other instructions) to reproduce the
flags where needed. Merging a load into an op that executes more than
once is invalid in general (the two loads may observe different values,
which violates the original program semantics because there was only one
load originally).

This does not result in a miscompilation, but instead will cause a panic
at regalloc time because the register that should have been defined by
the separate load is never written (the load is never emitted
separately).

I think this (very subtle, easy to miss) condition was unfortunately not
ported over when we moved the logic in bytecodealliance#3682.

The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not
seem to trigger it, for a reason I haven't fully deduced. I just added
the verbatim function body (happens to come from `clang.wasm`) that
triggers the bug as a test.

Discovered while bringing up regalloc2 support. It's pretty unlikely to
hit by chance, which is why I think none of our fuzzing has hit it yet.
abrown pushed a commit that referenced this pull request Mar 16, 2022
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of
its operands, which allows the compiler to merge a load with the compare
instruction (`ucomiss` or `ucomisd`).

Unfortunately, as we saw in #2576 for the integer-compare case, this
does not work with our lowering algorithm because compares can be
lowered more than once (unlike all other instructions) to reproduce the
flags where needed. Merging a load into an op that executes more than
once is invalid in general (the two loads may observe different values,
which violates the original program semantics because there was only one
load originally).

This does not result in a miscompilation, but instead will cause a panic
at regalloc time because the register that should have been defined by
the separate load is never written (the load is never emitted
separately).

I think this (very subtle, easy to miss) condition was unfortunately not
ported over when we moved the logic in #3682.

The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not
seem to trigger it, for a reason I haven't fully deduced. I just added
the verbatim function body (happens to come from `clang.wasm`) that
triggers the bug as a test.

Discovered while bringing up regalloc2 support. It's pretty unlikely to
hit by chance, which is why I think none of our fuzzing has hit it yet.
mpardesh pushed a commit to avanhatt/wasmtime that referenced this pull request Mar 17, 2022
* x64: port `select` using an FP comparison to ISLE

This change includes quite a few interlocking parts, required mainly by
the current x64 conventions in ISLE:
 - it adds a way to emit a `cmove` with multiple OR-ing conditions;
   because x64 ISLE cannot currently safely emit a comparison followed
   by several jumps, this adds `MachInst::CmoveOr` and
   `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro
   instructions hide the multi-instruction sequence in `lower.isle`
 - to properly keep track of what instructions consume and produce
   flags, @cfallin added a way to pass around variants of
   `ConsumesFlags` and `ProducesFlags`--these changes affect all
   backends
 - then, to lower the `fcmp + select` CLIF, this change adds several
   `cmove*_from_values` helpers that perform all of the awkward
   conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one
   upside is that now these lowerings have much-improved documentation
   explaining why the various `FloatCC` and `CC` choices are made the
   the way they are.

Co-authored-by: Chris Fallin <chris@cfallin.org>
cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 11, 2022
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of
its operands, which allows the compiler to merge a load with the compare
instruction (`ucomiss` or `ucomisd`).

Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this
does not work with our lowering algorithm because compares can be
lowered more than once (unlike all other instructions) to reproduce the
flags where needed. Merging a load into an op that executes more than
once is invalid in general (the two loads may observe different values,
which violates the original program semantics because there was only one
load originally).

This does not result in a miscompilation, but instead will cause a panic
at regalloc time because the register that should have been defined by
the separate load is never written (the load is never emitted
separately).

I think this (very subtle, easy to miss) condition was unfortunately not
ported over when we moved the logic in bytecodealliance#3682.

The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not
seem to trigger it, for a reason I haven't fully deduced. I just added
the verbatim function body (happens to come from `clang.wasm`) that
triggers the bug as a test.

Discovered while bringing up regalloc2 support. It's pretty unlikely to
hit by chance, which is why I think none of our fuzzing has hit it yet.
cfallin added a commit that referenced this pull request Apr 11, 2022
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of
its operands, which allows the compiler to merge a load with the compare
instruction (`ucomiss` or `ucomisd`).

Unfortunately, as we saw in #2576 for the integer-compare case, this
does not work with our lowering algorithm because compares can be
lowered more than once (unlike all other instructions) to reproduce the
flags where needed. Merging a load into an op that executes more than
once is invalid in general (the two loads may observe different values,
which violates the original program semantics because there was only one
load originally).

This does not result in a miscompilation, but instead will cause a panic
at regalloc time because the register that should have been defined by
the separate load is never written (the load is never emitted
separately).

I think this (very subtle, easy to miss) condition was unfortunately not
ported over when we moved the logic in #3682.

The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not
seem to trigger it, for a reason I haven't fully deduced. I just added
the verbatim function body (happens to come from `clang.wasm`) that
triggers the bug as a test.

Discovered while bringing up regalloc2 support. It's pretty unlikely to
hit by chance, which is why I think none of our fuzzing has hit it yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants