NaN-canonicalization without branching on x64 #8313
Modify the cranelift pass that performs NaN-canonicalization to avoid branches on x64. The current implementation uses two branches.
@@ -65,16 +65,23 @@ fn add_nan_canon_seq(pos: &mut FuncCursor, inst: Inst) {
    let new_res = pos.func.dfg.replace_result(val, val_type);
    let _next_inst = pos.next_inst().expect("block missing terminator!");

    // Insert a comparison instruction, to check if `inst_res` is NaN. Select
    // the canonical NaN value if `val` is NaN, assign the result to `inst`.
    let is_nan = pos.ins().fcmp(FloatCC::NotEqual, new_res, new_res);
Without any of the other changes, just changing this comparison from `NotEqual` to `Unordered` removes one of the two jumps, which is a significant improvement. With the other changes I don't think there's a difference between using `NotEqual` or `Unordered`, but `Unordered` seemed more precise.
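To illustrate why the two condition codes agree here, and what a branch-free select looks like, the following is a minimal Rust sketch on raw bits. It is not the Cranelift pass itself: `canonicalize_f32` and `CANON_NAN_F32` are names made up for this example, and the constant `0x7fc0_0000` is assumed to be the canonical quiet NaN pattern for `f32`. When a value is compared with itself, "not equal" can only be true if the comparison is unordered, so both conditions detect exactly the NaN case.

```rust
/// Assumed canonical quiet NaN bit pattern for f32
/// (sign 0, exponent all ones, top mantissa bit set).
const CANON_NAN_F32: u32 = 0x7fc0_0000;

/// Branch-free NaN canonicalization of one f32 (illustrative helper,
/// not a Cranelift API).
fn canonicalize_f32(x: f32) -> f32 {
    // Comparing a value with itself: `x != x` holds exactly when the
    // comparison is unordered, i.e. when `x` is NaN.
    let is_nan = x != x;
    // Widen the boolean into an all-ones (NaN) or all-zeros (not NaN) mask.
    let mask = (is_nan as u32).wrapping_neg();
    // Bit-select: canonical NaN bits where the mask is set,
    // the original bits everywhere else.
    let bits = (CANON_NAN_F32 & mask) | (x.to_bits() & !mask);
    f32::from_bits(bits)
}
```

Calling `canonicalize_f32` on a NaN with an arbitrary sign and payload returns the assumed canonical pattern, while every non-NaN input (including `-0.0` and infinities) passes through with its bits unchanged.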
@@ -1427,7 +1427,7 @@

(decl pure partial all_ones_or_all_zeros (Value) bool)
(rule (all_ones_or_all_zeros (and (icmp _ _ _) (value_type (multi_lane _ _)))) $true)
(rule (all_ones_or_all_zeros (and (fcmp _ _ _) (value_type (multi_lane _ _)))) $true)
(rule (all_ones_or_all_zeros (and (bitcast _ (fcmp _ _ _)) (value_type (multi_lane _ _)))) $true)
The original pattern was never triggered when doing NaN-canonicalization because `fcmp` will result in either an `I32X4` or `I64X2`, which always needs to be bitcast back to `F32X4` or `F64X2` before it can be passed to `bitselect`.
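The data flow described in this comment can be modeled lane by lane in plain Rust. This is only a sketch under assumptions: the helper names `fcmp_unordered_lanes`, `bitselect`, and `canonicalize_f32x4` are invented here to mirror the CLIF instruction names, arrays of `u32` stand in for an `I32X4` mask, and `0x7fc0_0000` is assumed to be the canonical `f32` NaN. The point it shows is that the comparison yields integer lanes that are all ones or all zeros, which is what makes a pure bitwise select valid.

```rust
/// Assumed canonical quiet NaN bit pattern for f32.
const CANON_NAN_F32: u32 = 0x7fc0_0000;

/// Lane-wise model of a vector `fcmp` with the unordered condition:
/// each result lane is all ones if either input lane is NaN, else all zeros.
fn fcmp_unordered_lanes(a: [f32; 4], b: [f32; 4]) -> [u32; 4] {
    let mut mask = [0u32; 4];
    for i in 0..4 {
        if a[i].is_nan() || b[i].is_nan() {
            mask[i] = u32::MAX;
        }
    }
    mask
}

/// Model of `bitselect`: for every bit, take `if_set` where the mask bit
/// is 1 and `if_clear` where it is 0.
fn bitselect(mask: [u32; 4], if_set: [u32; 4], if_clear: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = (mask[i] & if_set[i]) | (!mask[i] & if_clear[i]);
    }
    out
}

/// Canonicalize every NaN lane of a 4-lane f32 vector without branching
/// on the lane values in the select step.
fn canonicalize_f32x4(v: [f32; 4]) -> [f32; 4] {
    // Integer mask, like the I32X4 result of comparing the vector with itself.
    let mask = fcmp_unordered_lanes(v, v);
    // Reinterpret the float lanes as bits (the "bitcast" step in this model).
    let bits = v.map(f32::to_bits);
    let selected = bitselect(mask, [CANON_NAN_F32; 4], bits);
    selected.map(f32::from_bits)
}
```

Because every mask lane is uniformly ones or zeros, the per-bit select never mixes bits from the canonical NaN and the original value within one lane.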
This reverts commit 48c3712.
Also is there a way to enable NaN-canonicalization in a |
You should be able to add something along these lines, to test with nan canonicalization enabled:
|
This LGTM! Thanks! I don't know if @abrown wants to review it as well.
cranelift-fuzzgen unconditionally runs the NaN Canonicalization pass on all functions that it generates. This is so that we can ensure that when running in the interpreter vs natively we get the same bit pattern for all NaNs.

Until now we just picked a random ISA (the host ISA), disabled the verifier and ran the pass with that. This was because the ISA didn't really matter for the passes that we wanted to run.

In bytecodealliance#8313 the ISA now drives some codegen decisions for the NaN Canonicalization pass. Namely, if the ISA supports vectors, it tries to use them.

In bytecodealliance#8359 there was a fuzz bug reported where fuzzgen generated vector code for RISC-V without the `has_v` flag, something that should *never* happen, because we simply cannot compile that code.

It turns out that fuzzgen did not generate vector code itself. But since we were passing the host ISA to the NaN Canonicalization pass, the pass assumed that it could use vectors and did so, even though the actual target ISA did not support them. To fix this, we now correctly pass the target ISA that we are building a function for.
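The host-versus-target mix-up can be shown with a deliberately tiny Rust sketch. Everything in it is hypothetical: `IsaInfo`, `CanonStrategy`, and `pick_strategy` are stand-ins invented for this example and are not Cranelift types or functions. It only models the decision described above, where the pass chooses vector code when the ISA it is given reports vector support.

```rust
/// Hypothetical stand-in for the ISA information the pass consults;
/// this is not Cranelift's `TargetIsa`.
#[derive(Clone, Copy)]
struct IsaInfo {
    has_vectors: bool,
}

/// Which kind of canonicalization sequence the pass would emit.
#[derive(Debug, PartialEq)]
enum CanonStrategy {
    Scalar,
    Vector,
}

/// Model of the decision: use vector instructions only if the ISA
/// passed in claims to support them.
fn pick_strategy(isa: IsaInfo) -> CanonStrategy {
    if isa.has_vectors {
        CanonStrategy::Vector
    } else {
        CanonStrategy::Scalar
    }
}
```

With a vector-capable host and a target that lacks vectors (for instance RISC-V without `has_v`), `pick_strategy(host)` returns `CanonStrategy::Vector` while `pick_strategy(target)` returns `CanonStrategy::Scalar`. Passing the host description therefore produces code that the target cannot compile, which is the failure mode the fix removes.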
Modify the cranelift pass that performs NaN-canonicalization to avoid branches on x64. The current implementation uses two branches:
With these changes, NaN-canonicalization becomes:
Running both versions against a small image classification inference benchmark here resulted in a ~50% improvement:
As a side note, I didn't notice any sightglass benchmark that was performing mainly float arithmetic to test against. I'd be happy to add this image classification case if there's interest.