NaN-canonicalization without branching on x64 #8313

adambratschikaye · 2024-04-08T07:53:43Z

Modify the cranelift pass that performs NaN-canonicalization to avoid branches on x64. The current implementation uses two branches:

       8: be 00 00 c0 7f               	mov	esi, 0x7fc00000
       d: c5 f9 6e de                  	vmovd	xmm3, esi
      11: 0f 2e c0                     	ucomiss	xmm0, xmm0
      14: 0f 8b 04 00 00 00            	jnp	0x1e <wasm[0]::function[0]+0x1e>
      1a: f2 0f 10 c3                  	movsd	xmm0, xmm3              # xmm0 = xmm3[0],xmm0[1]
      1e: 0f 84 04 00 00 00            	je	0x28 <wasm[0]::function[0]+0x28>
      24: f2 0f 10 c3                  	movsd	xmm0, xmm3              # xmm0 = xmm3[0],xmm0[1]

With these changes, NaN-canonicalization becomes:

       8: c5 e8 c2 da 03               	vcmpunordps	xmm3, xmm2, xmm2
       d: be 00 00 c0 7f               	mov	esi, 0x7fc00000
      12: c5 f9 6e e6                  	vmovd	xmm4, esi
      16: c4 e3 69 4c c4 30            	vpblendvb	xmm0, xmm2, xmm4, xmm3
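
The blend-based sequence above amounts to a compare followed by a bitwise select. A minimal Rust sketch of that idea for a single `f32` follows, assuming the canonical NaN is the `0x7fc00000` constant from the listings. The name `canonicalize_nan_f32` is hypothetical, this is not the Cranelift implementation, and the compiler is free to lower it however it likes:

```rust
/// Canonical quiet NaN bit pattern for f32 (the constant loaded in the listings).
const CANON_NAN_F32: u32 = 0x7fc0_0000;

/// Hypothetical helper: replaces any NaN with the canonical NaN using a mask
/// and a bitwise select instead of an explicit branch. Non-NaN values pass
/// through bit-for-bit.
pub fn canonicalize_nan_f32(x: f32) -> f32 {
    // All-ones when `x` is NaN (unordered with itself), all-zeros otherwise.
    // This models the per-lane result of an unordered compare.
    let mask = 0u32.wrapping_sub(x.is_nan() as u32);
    let bits = x.to_bits();
    // Take the canonical NaN where the mask is set, the original bits elsewhere.
    f32::from_bits((CANON_NAN_F32 & mask) | (bits & !mask))
}
```

Any NaN input, whatever its sign or payload, comes back as `0x7fc00000`, while values such as `1.5` or `-0.0` are returned unchanged.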

Running both versions against a small image classification inference benchmark resulted in an improvement of roughly 44% at opt-level=0 and 51% at opt-level=1:

image_classification/opt-level=0
                        time:   [728.16 ms 730.00 ms 732.05 ms]
                        change: [-44.476% -44.251% -44.029%] (p = 0.00 < 0.05)
                        Performance has improved.

image_classification/opt-level=1
                        time:   [593.90 ms 595.51 ms 597.34 ms]
                        change: [-51.561% -51.396% -51.211%] (p = 0.00 < 0.05)
                        Performance has improved.

As a side note, I didn't notice any sightglass benchmark that was performing mainly float arithmetic to test against. I'd be happy to add this image classification case if there's interest.

adambratschikaye · 2024-04-08T07:57:21Z

cranelift/codegen/src/nan_canonicalization.rs

@@ -65,16 +65,23 @@ fn add_nan_canon_seq(pos: &mut FuncCursor, inst: Inst) {
 let new_res = pos.func.dfg.replace_result(val, val_type);
 let _next_inst = pos.next_inst().expect("block missing terminator!");

- // Insert a comparison instruction, to check if `inst_res` is NaN. Select
- // the canonical NaN value if `val` is NaN, assign the result to `inst`.
- let is_nan = pos.ins().fcmp(FloatCC::NotEqual, new_res, new_res);


Without any of the other changes, just changing this comparison from NotEqual to Unordered removes one of the two jumps, which is a significant improvement on its own. With the other changes I don't think there's a difference between using NotEqual or Unordered, but Unordered seemed more precise.
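
To illustrate why the two condition codes are interchangeable for this check, here is a small Rust sketch. The function names are hypothetical models of the two predicates, not Cranelift APIs. NotEqual is true for less-than, greater-than, or unordered operands, while Unordered is true only when an operand is NaN. When a value is compared with itself, the less-than and greater-than cases cannot occur, so both reduce to "is NaN":

```rust
/// Models a NotEqual float compare: true if a < b, a > b, or the pair is unordered.
pub fn fcmp_not_equal(a: f32, b: f32) -> bool {
    a != b
}

/// Models an Unordered float compare: true only if at least one operand is NaN.
pub fn fcmp_unordered(a: f32, b: f32) -> bool {
    a.partial_cmp(&b).is_none()
}
```

For distinct operands the predicates differ (`1.0` vs `2.0` is NotEqual but not Unordered), which is why Unordered states the intent of a NaN check more precisely.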

adambratschikaye · 2024-04-08T07:59:33Z

cranelift/codegen/src/isa/x64/lower.isle

@@ -1427,7 +1427,7 @@

 (decl pure partial all_ones_or_all_zeros (Value) bool)
 (rule (all_ones_or_all_zeros (and (icmp _ _ _) (value_type (multi_lane _ _)))) $true)
-(rule (all_ones_or_all_zeros (and (fcmp _ _ _) (value_type (multi_lane _ _)))) $true)
+(rule (all_ones_or_all_zeros (and (bitcast _ (fcmp _ _ _)) (value_type (multi_lane _ _)))) $true)


The original pattern was never triggered when doing NaN-canonicalization because fcmp will result in either an I32X4 or I64X2 which always needs to be bitcast back to F32X4 or F64X2 before it can be passed to bitselect.
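
As a rough model of why the all-ones-or-all-zeros property matters, the Rust sketch below treats a 4 x f32 vector as arrays. All names are hypothetical, and for simplicity it works on integer lanes rather than bitcasting the mask to a float vector as the CLIF does. Because every mask lane produced by the compare is either all ones or all zeros, a general bitwise select picks whole lanes, which is what lets it be lowered to a blend:

```rust
const CANON_NAN_F32: u32 = 0x7fc0_0000;

/// Lane-wise "unordered with itself" compare: the result lanes are integers,
/// all ones for a NaN lane and all zeros otherwise.
pub fn fcmp_uno_lanes(v: [f32; 4]) -> [u32; 4] {
    v.map(|x| if x.is_nan() { u32::MAX } else { 0 })
}

/// General bitwise select: takes bits from `a` where the mask bit is 1, else from `b`.
pub fn bitselect(mask: [u32; 4], a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    std::array::from_fn(|i| (mask[i] & a[i]) | (!mask[i] & b[i]))
}

/// Canonicalizes every NaN lane by selecting against the compare mask.
pub fn canonicalize_lanes(v: [f32; 4]) -> [f32; 4] {
    let mask = fcmp_uno_lanes(v);
    let bits = v.map(f32::to_bits); // reinterpret the lanes as integers
    bitselect(mask, [CANON_NAN_F32; 4], bits).map(f32::from_bits)
}
```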

adambratschikaye · 2024-04-08T09:07:48Z

Also, is there a way to enable NaN-canonicalization in a clif test so that I can add a test for this?

afonso360 · 2024-04-08T09:29:03Z

You should be able to add something along these lines, to test with nan canonicalization enabled:

test {run,compile,etc...}
set enable_nan_canonicalization=true
target x86_64

cranelift/codegen/src/nan_canonicalization.rs

afonso360

This LGTM! Thanks! I don't know if @abrown wants to review it as well.

cranelift-fuzzgen unconditionally runs the NaN canonicalization pass on all functions that it generates. This ensures that we get the same bit pattern for all NaNs whether running in the interpreter or natively.

Until now we just picked an arbitrary ISA (the host ISA), disabled the verifier, and ran the pass with that, because the ISA didn't really matter for the passes that we wanted to run.

In bytecodealliance#8313 the ISA now drives some codegen decisions for the NaN canonicalization pass: if the ISA supports vectors, the pass tries to use them.

In bytecodealliance#8359 a fuzz bug was reported where fuzzgen appeared to generate vector code for RISC-V without the `has_v` flag, something that should *never* happen, because we simply cannot compile that code. It turns out that fuzzgen did not generate vector code itself. But since we were passing the host ISA to the NaN canonicalization pass, the pass assumed that it could use vectors and did so, even though the actual target ISA did not support them.

To fix this, we now correctly pass the target ISA that we are building a function for.
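
The failure mode described above can be modelled in a few lines of Rust. Everything here is hypothetical (`IsaInfo`, `CanonSeq`, and `choose_canon_seq` are illustrative names, not Cranelift types); the sketch only shows that the choice of sequence must be driven by the ISA the function is compiled for, not by the host:

```rust
/// Hypothetical stand-in for the ISA information the pass consults.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct IsaInfo {
    pub name: &'static str,
    pub has_vectors: bool,
}

/// Hypothetical description of which canonicalization sequence gets emitted.
#[derive(Debug, PartialEq, Eq)]
pub enum CanonSeq {
    ScalarSelect,
    VectorBitselect,
}

/// Picks the sequence from the ISA it is given. Passing the host ISA here
/// while compiling for a different target yields code the target cannot run.
pub fn choose_canon_seq(isa: IsaInfo) -> CanonSeq {
    if isa.has_vectors {
        CanonSeq::VectorBitselect
    } else {
        CanonSeq::ScalarSelect
    }
}
```

With a vector-capable host and a RISC-V target lacking `has_v`, calling `choose_canon_seq` with the host description selects the vector sequence, while calling it with the target description selects the scalar one.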

adambratschikaye added 2 commits April 8, 2024 09:11

NaN-canonicalization without branching on x64

764979b

remove old fcmp case

48c3712

Revert "remove old fcmp case"

90b9e1a

This reverts commit 48c3712.

github-actions bot added the cranelift (Issues related to the Cranelift code generator) and cranelift:area:x64 (Issues related to x64 codegen) labels Apr 8, 2024

adambratschikaye marked this pull request as ready for review April 8, 2024 09:06

adambratschikaye requested a review from a team as a code owner April 8, 2024 09:06

adambratschikaye requested review from abrown and removed request for a team April 8, 2024 09:06

afonso360 reviewed Apr 8, 2024

cranelift/codegen/src/nan_canonicalization.rs (resolved)

adambratschikaye added 2 commits April 8, 2024 16:48

add filetests

f2e156e

use old version for riscv

302cd3c

afonso360 approved these changes Apr 9, 2024

abrown approved these changes Apr 9, 2024

abrown added this pull request to the merge queue Apr 9, 2024

Merged via the queue into bytecodealliance:main with commit 72a3b8b Apr 9, 2024
21 checks passed

alexcrichton mentioned this pull request Apr 13, 2024

riscv64: Panic on partial gen_extractlane rule #8359

Closed

afonso360 mentioned this pull request Apr 13, 2024

fuzzgen: Use the correct ISA when running NaN Canonicalization pass #8360

Merged

jlb6740 mentioned this pull request Apr 26, 2024

Add a pure Wasm image classification benchmark bytecodealliance/sightglass#270

Merged
