NaN-canonicalization without branching on x64 #8313

adambratschikaye

Modify the cranelift pass that performs NaN-canonicalization to avoid branches on x64. The current implementation uses two branches:

       8: be 00 00 c0 7f               	mov	esi, 0x7fc00000
       d: c5 f9 6e de                  	vmovd	xmm3, esi
      11: 0f 2e c0                     	ucomiss	xmm0, xmm0
      14: 0f 8b 04 00 00 00            	jnp	0x1e <wasm[0]::function[0]+0x1e>
      1a: f2 0f 10 c3                  	movsd	xmm0, xmm3              # xmm0 = xmm3[0],xmm0[1]
      1e: 0f 84 04 00 00 00            	je	0x28 <wasm[0]::function[0]+0x28>
      24: f2 0f 10 c3                  	movsd	xmm0, xmm3              # xmm0 = xmm3[0],xmm0[1]

With these changes, NaN-canonicalization becomes:

       8: c5 e8 c2 da 03               	vcmpunordps	xmm3, xmm2, xmm2
       d: be 00 00 c0 7f               	mov	esi, 0x7fc00000
      12: c5 f9 6e e6                  	vmovd	xmm4, esi
      16: c4 e3 69 4c c4 30            	vpblendvb	xmm0, xmm2, xmm4, xmm3
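
As a rough scalar model of the compare-and-blend sequence above, the following Rust sketch builds an all-ones or all-zeros mask from a NaN check and uses it to pick between the input bits and the canonical NaN. The names `CANON_NAN_F32` and `canonicalize_f32` are made up for illustration and are not part of Cranelift; only the constant `0x7fc00000` comes from the disassembly.

```rust
/// Canonical quiet-NaN bit pattern for f32 (the constant loaded by
/// `mov esi, 0x7fc00000` in the listing). The name is hypothetical.
const CANON_NAN_F32: u32 = 0x7fc0_0000;

/// Branch-free scalar model of the compare-and-blend idea.
/// This is an illustration, not Cranelift's implementation.
fn canonicalize_f32(x: f32) -> f32 {
    let bits = x.to_bits();
    // 1 for NaN, 0 otherwise; negating widens it to all-ones or all-zeros,
    // which is the shape of a `vcmpunordps` result lane.
    let mask = (x.is_nan() as u32).wrapping_neg();
    // Blend: canonical NaN where the mask is set, original bits elsewhere.
    f32::from_bits((CANON_NAN_F32 & mask) | (bits & !mask))
}
```

Calling `canonicalize_f32` on any NaN input yields the bit pattern `0x7fc00000`, while non-NaN inputs (including infinities and signed zeros) pass through unchanged.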

Running both versions against a small image classification inference benchmark showed run time dropping by roughly 44% at opt-level=0 and 51% at opt-level=1:

image_classification/opt-level=0
                        time:   [728.16 ms 730.00 ms 732.05 ms]
                        change: [-44.476% -44.251% -44.029%] (p = 0.00 < 0.05)
                        Performance has improved.

image_classification/opt-level=1
                        time:   [593.90 ms 595.51 ms 597.34 ms]
                        change: [-51.561% -51.396% -51.211%] (p = 0.00 < 0.05)
                        Performance has improved.

As a side note, I didn't find a sightglass benchmark that mainly performs float arithmetic to test against. I'd be happy to add this image classification case if there's interest.

@@ -65,16 +65,23 @@ fn add_nan_canon_seq(pos: &mut FuncCursor, inst: Inst) {
let new_res = pos.func.dfg.replace_result(val, val_type);
let _next_inst = pos.next_inst().expect("block missing terminator!");

// Insert a comparison instruction, to check if `inst_res` is NaN. Select
// the canonical NaN value if `val` is NaN, assign the result to `inst`.
let is_nan = pos.ins().fcmp(FloatCC::NotEqual, new_res, new_res);

Without any of the other changes, just changing this comparison from NotEqual to Unordered removes one of the two jumps, which is a significant improvement. With the other changes I don't think there's a difference between using NotEqual and Unordered, but Unordered seemed more precise.
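
One way to see why Unordered needs fewer jumps is to model the two flags that `ucomiss` sets. The sketch below is a simplified model (CF is left out, and `Flags`, `ucomiss`, `not_equal` and `unordered` are illustrative names, not Cranelift code): an unordered compare sets both ZF and PF, so NotEqual has to consult two flags while Unordered only needs PF. For a value compared with itself, both conditions agree.

```rust
use std::cmp::Ordering;

/// ZF and PF as set by `ucomiss a, b` (CF omitted; simplified model).
#[derive(Clone, Copy)]
struct Flags {
    zf: bool,
    pf: bool,
}

/// Models the flag results of an unordered scalar compare.
fn ucomiss(a: f32, b: f32) -> Flags {
    match a.partial_cmp(&b) {
        // Unordered (at least one NaN): ZF and PF are both set.
        None => Flags { zf: true, pf: true },
        // Equal: only ZF is set.
        Some(Ordering::Equal) => Flags { zf: true, pf: false },
        // Less or greater: neither ZF nor PF is set.
        Some(_) => Flags { zf: false, pf: false },
    }
}

/// NotEqual has to look at two flags: ZF clear, or PF set.
fn not_equal(f: Flags) -> bool {
    !f.zf || f.pf
}

/// Unordered only looks at PF.
fn unordered(f: Flags) -> bool {
    f.pf
}
```

Since `not_equal(ucomiss(x, x))` and `unordered(ucomiss(x, x))` return the same answer for every `x`, the single-flag condition is enough for a NaN self-check.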

@@ -1427,7 +1427,7 @@

(decl pure partial all_ones_or_all_zeros (Value) bool)
(rule (all_ones_or_all_zeros (and (icmp _ _ _) (value_type (multi_lane _ _)))) $true)
(rule (all_ones_or_all_zeros (and (fcmp _ _ _) (value_type (multi_lane _ _)))) $true)
(rule (all_ones_or_all_zeros (and (bitcast _ (fcmp _ _ _)) (value_type (multi_lane _ _)))) $true)

@adambratschikaye adambratschikaye Apr 8, 2024

The original pattern was never triggered when doing NaN-canonicalization because fcmp will result in either an I32X4 or I64X2 which always needs to be bitcast back to F32X4 or F64X2 before it can be passed to bitselect.
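
To illustrate why it matters that the condition is known to be all-ones or all-zeros per lane, here is a small sketch comparing a bitwise select with a byte-granular blend in the style of `pblendvb`, which picks each byte by the top bit of the matching mask byte. The two agree only for such masks. The function names are hypothetical and the model uses a single 32-bit lane rather than a full vector.

```rust
/// Bitwise select: each result bit comes from `x` where `c` is 1, else `y`.
fn bitselect(c: u32, x: u32, y: u32) -> u32 {
    (x & c) | (y & !c)
}

/// Byte-granular blend in the style of `pblendvb`: each result byte comes
/// from `x` when the top bit of the matching mask byte is set, else `y`.
fn blend_bytes(c: u32, x: u32, y: u32) -> u32 {
    let (c, x, y) = (c.to_le_bytes(), x.to_le_bytes(), y.to_le_bytes());
    let mut out = [0u8; 4];
    for i in 0..4 {
        out[i] = if c[i] & 0x80 != 0 { x[i] } else { y[i] };
    }
    u32::from_le_bytes(out)
}
```

With a mask of `0` or `u32::MAX` both functions return the same value, but a partial mask such as `0x0000_00f0` makes them diverge, which is why the cheaper blend is only safe when the mask comes from a comparison.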

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen labels Apr 8, 2024
@adambratschikaye adambratschikaye marked this pull request as ready for review April 8, 2024 09:06
@adambratschikaye adambratschikaye requested a review from a team as a code owner April 8, 2024 09:06
@adambratschikaye adambratschikaye requested review from abrown and removed request for a team April 8, 2024 09:06
@adambratschikaye

Also is there a way to enable NaN-canonicalization in a clif test to add a test for this?

@afonso360

You should be able to add something along these lines, to test with nan canonicalization enabled:

test {run,compile,etc...}
set enable_nan_canonicalization=true
target x86_64

@afonso360 afonso360 left a comment

This LGTM! Thanks! I don't know if @abrown wants to review it as well.

@abrown abrown added this pull request to the merge queue Apr 9, 2024
Merged via the queue into bytecodealliance:main with commit 72a3b8b Apr 9, 2024
21 checks passed
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Apr 13, 2024
cranelift-fuzzgen unconditionally runs the NaN Canonicalization pass on all functions that it generates. This is so that we can ensure that when running in the interpreter vs natively we get the same bit pattern for all NaNs.

Until now we just picked a random ISA (the host ISA), disabled the verifier and ran the pass with that. This was because the ISA didn't really matter for the passes that we wanted to run.

In bytecodealliance#8313 the ISA now drives some codegen decisions for the NaN Canonicalization pass. Namely, if the ISA supports vectors, it tries to use them.

In bytecodealliance#8359 there was a fuzz bug reported where fuzzgen generated vector code for RISC-V without the `has_v` flag, something that should *never* happen, because we simply cannot compile that code.

It turns out that fuzzgen did not generate vector code itself. But since we were passing the host ISA to the NaN Canonicalization pass, the pass assumed that it could use vectors and did so, even though the actual target ISA did not support them.

To fix this, we now correctly pass the target isa that we are building a function for.
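
The failure mode described above can be sketched with a toy model. Everything here (`Isa`, `CanonStrategy`, `pick_strategy`) is hypothetical and only stands in for the idea that the pass asks the ISA whether vectors are available; it is not the fuzzgen or Cranelift API.

```rust
/// Hypothetical stand-in for the ISA facts the pass consults.
#[derive(Clone, Copy)]
struct Isa {
    has_vectors: bool,
}

/// Hypothetical choice between the vector and scalar sequences.
#[derive(Debug, PartialEq)]
enum CanonStrategy {
    VectorBlend,
    ScalarSelect,
}

/// The strategy depends only on the ISA that is passed in, so passing the
/// host ISA instead of the target ISA can select an unsupported sequence.
fn pick_strategy(isa: Isa) -> CanonStrategy {
    if isa.has_vectors {
        CanonStrategy::VectorBlend
    } else {
        CanonStrategy::ScalarSelect
    }
}
```

In this model, a host with vectors and a target without them give different answers, so calling `pick_strategy` with the host description would produce vector code the target cannot compile.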
github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2024
…8360)