Naive fibonacci benchmark over 2x slower than native #972

alexcrichton · 2020-02-24T19:55:45Z

While a pretty awful benchmark "in the large" I was surprised playing around locally how the naive fibonacci program was so slow relative to native performance. Especially because fibonacci benchmarks don't touch linear memory much, this may at least be a decent benchmark of cranelift and/or the code generator in use.

Given an input file like:

fn main() {
    std::process::exit(run() as i32);
}

#[no_mangle]
pub extern "C" fn run() -> u32 {
    fib(43)
}

fn fib(n: u32) -> u32 {
    if n <= 1 {
        1
    } else {
        fib(n - 1) + fib(n - 2)
    }
}

native execution looks like (for me at least)

$ rustc -O fib.rs
$ time ./fib
./fib  0.90s user 0.00s system 99% cpu 0.902 total

Whereas the wasm execution looks like:

$ rustc -O fib.rs --target wasm32-wasi --crate-type cdylib
$ time wasmtime --disable-cache -O --invoke run fib.wasm
warning: using `--invoke` with a function that returns values is experimental and may break in the future
701408733
wasmtime --disable-cache -O --invoke run fib.wasm  2.20s user 0.02s system 104% cpu 2.123 total

Here the wasm is over 2x slower than native, which was a bit surprising to me!

Some other reference information below is...

Native assembly for the `fib` function

0000000000004d40 <_ZN3fib3fib17h2bacf53cb3845acfE>:
    4d40:	55                   	push   %rbp
    4d41:	53                   	push   %rbx
    4d42:	50                   	push   %rax
    4d43:	bd 01 00 00 00       	mov    $0x1,%ebp
    4d48:	83 ff 02             	cmp    $0x2,%edi
    4d4b:	72 25                	jb     4d72 <_ZN3fib3fib17h2bacf53cb3845acfE+0x32>
    4d4d:	89 fb                	mov    %edi,%ebx
    4d4f:	bd 01 00 00 00       	mov    $0x1,%ebp
    4d54:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    4d5b:	00 00 00 
    4d5e:	66 90                	xchg   %ax,%ax
    4d60:	8d 7b ff             	lea    -0x1(%rbx),%edi
    4d63:	e8 d8 ff ff ff       	callq  4d40 <_ZN3fib3fib17h2bacf53cb3845acfE>
    4d68:	83 c3 fe             	add    $0xfffffffe,%ebx
    4d6b:	01 c5                	add    %eax,%ebp
    4d6d:	83 fb 01             	cmp    $0x1,%ebx
    4d70:	77 ee                	ja     4d60 <_ZN3fib3fib17h2bacf53cb3845acfE+0x20>
    4d72:	89 e8                	mov    %ebp,%eax
    4d74:	48 83 c4 08          	add    $0x8,%rsp
    4d78:	5b                   	pop    %rbx
    4d79:	5d                   	pop    %rbp
    4d7a:	c3                   	retq   
    4d7b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

WebAssembly of the `fib` function

  (func $_ZN3fib3fib17hedcc9d2af68c6e00E (type 1) (param i32) (result i32)
    (local i32)
    i32.const 1
    local.set 1
    block  ;; label = @1
      local.get 0
      i32.const 2
      i32.lt_u
      br_if 0 (;@1;)
      i32.const 1
      local.set 1
      loop  ;; label = @2
        local.get 0
        i32.const -1
        i32.add
        call $_ZN3fib3fib17hedcc9d2af68c6e00E
        local.get 1
        i32.add
        local.set 1
        local.get 0
        i32.const -2
        i32.add
        local.tee 0
        i32.const 1
        i32.gt_u
        br_if 0 (;@2;)
      end
    end
    local.get 1)

At this time I don't know of a great way to get out the assembly generated by cranelift unfortunately, but I'm hoping others may know of an easy way to do so!

The text was updated successfully, but these errors were encountered:

rene-fonseca · 2020-02-24T20:01:04Z

It would be good to have a benchmark that isn't relying on function call - just loops - for comparison. This is because function calls can have their own overhead.

alexcrichton · 2020-02-24T20:14:15Z

Perhaps, yes, but that's sort of a different benchmark. I wouldn't expect anything used in the wasm here to be 2x slower than native, and while the function call may be the most expensive part then that's what we should fix in this benchmark.

rene-fonseca · 2020-02-24T20:29:07Z

Maybe you can try wasm2obj for now and disassembly the output to see if the code looks ok. This is not an apple to apple comparison but may it can give some hints.

alexcrichton · 2020-02-24T20:38:21Z

I think any comparison between wasm and native code needs to be closely examined, because yes it's very easy to compare apples to oranges. I do not, however, believe that is the case here, since this is an extremely simple function and it's just one function we're looking at, not entire programs or anything like that. While wasm has overhead almost everything about this benchmark is known statically and should have extremely little overhead, not 2x overhead.

…bytecodealliance#972)

kubkon · 2020-03-25T13:57:25Z

@alexcrichton @rene-fonseca did you guys manage to work out what the cause of the overhead is?

alexcrichton · 2020-03-25T14:03:56Z

No there hasn't been any discussion about the generated code itself. Using --jitdump, though, I've extracted both the rustc version of the compiled fib function as well as the cranelift-compiled version:

https://gist.github.com/alexcrichton/6d80d25ee6f64857c3388e130dad22ed

bjorn3 · 2020-03-25T14:24:25Z

There seems to be more spilling and more register-register moves in the Cranelift version. The work on a new regalloc may help with this.

kubkon · 2020-03-27T16:44:32Z

Not sure how useful this will be but I've done some more benchmarks including re-running fibonacci stripped of all recursion (sources and some numbers available at github.com/kubkon/wasmtime-bench). In that particular case, there is a massive speedup. Nonetheless, Cranelift still incurs 2x overhead as seen in a more complicated mandelbrot example.

bjorn3 · 2021-02-03T20:43:05Z

How is the performance with the new x64 backend?

cfallin · 2021-02-03T20:51:18Z

This is somewhat outdated; closing in favor of our upcoming more general perf-tracking efforts (bytecodealliance/rfcs#3 and bytecodealliance/rfcs#4).

alexcrichton added cranelift Issues related to the Cranelift code generator cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. labels Mar 2, 2020

arkpar pushed a commit to paritytech/wasmtime that referenced this issue Mar 4, 2020

Limit redundant jump folding to only fold parameterless target blocks (…

26b88ae

…bytecodealliance#972)

cfallin closed this as completed Feb 3, 2021

fitzgen mentioned this issue Sep 28, 2024

Cranelift: Consider taking trapnz and trapz all the way to the backends rather than legalizing them #6055

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Naive fibonacci benchmark over 2x slower than native #972

Naive fibonacci benchmark over 2x slower than native #972

alexcrichton commented Feb 24, 2020 •

edited

Loading

rene-fonseca commented Feb 24, 2020

alexcrichton commented Feb 24, 2020

rene-fonseca commented Feb 24, 2020

alexcrichton commented Feb 24, 2020

kubkon commented Mar 25, 2020

alexcrichton commented Mar 25, 2020

bjorn3 commented Mar 25, 2020

kubkon commented Mar 27, 2020

bjorn3 commented Feb 3, 2021

cfallin commented Feb 3, 2021

Naive fibonacci benchmark over 2x slower than native #972

Naive fibonacci benchmark over 2x slower than native #972

Comments

alexcrichton commented Feb 24, 2020 • edited Loading

rene-fonseca commented Feb 24, 2020

alexcrichton commented Feb 24, 2020

rene-fonseca commented Feb 24, 2020

alexcrichton commented Feb 24, 2020

kubkon commented Mar 25, 2020

alexcrichton commented Mar 25, 2020

bjorn3 commented Mar 25, 2020

kubkon commented Mar 27, 2020

bjorn3 commented Feb 3, 2021

cfallin commented Feb 3, 2021

alexcrichton commented Feb 24, 2020 •

edited

Loading