Mark a lot of functions inline #347

Closed · wants to merge 1 commit into from

Conversation

tavianator

This fixes a lot of the performance regression reported in #339.

@tavianator
Author

Right now with lto = false, codegen-units = 16, the benefit is:

execute/tiny_keccak/v1  time:   [1.2371 ms 1.2384 ms 1.2398 ms]                                    
                        change: [-67.706% -67.636% -67.583%] (p = 0.00 < 0.05)
                        Performance has improved.

The remaining regression (compared to lto = "fat", codegen-units = 1) is:

execute/tiny_keccak/v1  time:   [706.29 us 713.67 us 720.90 us]                                   
                        change: [-42.294% -42.015% -41.745%] (p = 0.00 < 0.05)
                        Performance has improved.
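For context, the first configuration mentioned above would roughly correspond to the following Cargo profile (a sketch; the exact profile settings used for these runs are not shown here):

[profile.release]
lto = false
codegen-units = 16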

@ggwpez

ggwpez commented Jan 26, 2022

We could also try to use opt-level 2 instead of 3.
Level 3 optimizes for binary size as well, and would therefore do less inlining than it could.
At least from C++ experience, I can say that 2 is sometimes faster than 3 if you take CPU cache size into account.
We should probably test this on reference hardware.

@Robbepop
Member

We could also try to use opt-level 2 instead of 3. Level 3 optimizes for binary size as well, and would therefore do less inlining than it could. At least from C++ experience, I can say that 2 is sometimes faster than 3 if you take CPU cache size into account. We should probably test this on reference hardware.

Good idea! I just tried that out on my computer and benchmarked it. The rest of the profile config (lto = "fat", codegen-units = 1) remained the same, and the result was roughly a 100% slowdown. So I don't think opt-level = 2 is something we should pursue further for wasmi.
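For reference, the experiment described above roughly corresponds to a profile like this (a sketch; the exact benchmark profile is not part of this thread):

[profile.release]
opt-level = 2
lto = "fat"
codegen-units = 1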

@Robbepop
Member

Robbepop commented Jan 27, 2022

@tavianator thanks a lot for this PR and for pasting the benchmark results!

Can you share more about how you selected the functions that you annotated with #[inline]?
Did you also check the performance of the new engine under the following profile?

[profile.release]
lto = "fat"
codegen-units = 1

I am asking because we should definitely make sure that the best possible performance does not get pessimized by any "optimization" we apply. We only want to close the gap between the two profiles without pessimizing the best case; if we are lucky, the optimizations even improve the best possible outcome. I am sure there is still a lot of room for improvement even over the best results we have seen so far. For example, from what I can tell, the Rust compiler currently does not optimize our switch-based instruction dispatch in the way that would be most optimal for it, even though in theory it could. That alone could yield a potential 30% performance gain on top of our best results so far.
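For readers unfamiliar with the pattern, a rough sketch of switch-based dispatch (illustrative only; the instruction set and stack handling below are hypothetical, not wasmi's actual code) looks like this:

// Sketch of a switch-based (match-based) interpreter loop.
enum Instr {
    I32Const(i32),
    I32Add,
    Return,
}

fn execute(instrs: &[Instr], stack: &mut Vec<i32>) {
    for instr in instrs {
        match instr {
            // Push an immediate value onto the value stack.
            Instr::I32Const(value) => stack.push(*value),
            // Pop two operands, add them with wrapping semantics, push the result.
            Instr::I32Add => {
                let rhs = stack.pop().expect("stack underflow");
                let lhs = stack.pop().expect("stack underflow");
                stack.push(lhs.wrapping_add(rhs));
            }
            Instr::Return => return,
        }
    }
}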

I also have a more in-depth answer here.

@Robbepop
Member

Robbepop commented Jan 27, 2022

I just compared this branch with master branch on the following profile

[profile.release]
lto = "fat"
codegen-units = 1

Unfortunately, we can see slowdowns in the range of 20-33%, especially for benchmarks that evaluate the performance of function dispatch and module instantiation. Instruction dispatch is also affected.

tl;dr: While this PR closes the gap between the best and worst performance, it also pessimizes the best performance.

compile_and_validate/v0 time:   [6.7932 ms 6.8118 ms 6.8340 ms]                                    
                        change: [-1.9011% -1.4463% -0.9943%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

compile_and_validate/v1 time:   [6.7415 ms 6.7567 ms 6.7735 ms]                                    
                        change: [-1.0009% -0.6186% -0.2596%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

instantiate/v0          time:   [454.12 us 455.82 us 457.68 us]                           
                        change: [-2.5762% -0.7318% +1.0146%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe

instantiate/v1          time:   [58.478 us 58.600 us 58.734 us]                           
                        change: [+8.8324% +9.6211% +10.495%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/tiny_keccak/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v0  time:   [1.2641 ms 1.2673 ms 1.2704 ms]                                    
                        change: [+1.0453% +1.5824% +2.0917%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high severe

Benchmarking execute/tiny_keccak/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v1  time:   [1.0314 ms 1.0341 ms 1.0373 ms]                                    
                        change: [+6.6106% +7.2196% +7.8061%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

Benchmarking execute/rev_complement/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v0                                                                             
                        time:   [1.5623 ms 1.5656 ms 1.5692 ms]
                        change: [-1.3037% -0.8696% -0.4748%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

Benchmarking execute/rev_complement/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v1                                                                             
                        time:   [1.4035 ms 1.4061 ms 1.4090 ms]
                        change: [+23.567% +24.388% +25.178%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Benchmarking execute/regex_redux/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v0  time:   [1.6797 ms 1.6821 ms 1.6848 ms]                                    
                        change: [-1.2953% -0.6148% +0.0253%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking execute/regex_redux/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v1  time:   [1.4787 ms 1.4811 ms 1.4836 ms]                                    
                        change: [+24.427% +24.963% +25.497%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/count_until/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v0  time:   [1.8691 ms 1.8716 ms 1.8740 ms]                                    
                        change: [+1.4873% +1.7939% +2.0877%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/count_until/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v1  time:   [1.7343 ms 1.7375 ms 1.7410 ms]                                    
                        change: [+1.0826% +1.4768% +1.8841%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

execute/factorial_recursive/v0                                                                             
                        time:   [25.236 us 25.280 us 25.326 us]
                        change: [+9.2538% +9.7995% +10.420%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

execute/factorial_recursive/v1                                                                             
                        time:   [1.5521 us 1.5540 us 1.5561 us]
                        change: [+32.842% +33.422% +33.966%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

execute/factorial_optimized/v0                                                                             
                        time:   [24.157 us 24.213 us 24.284 us]
                        change: [+8.3346% +8.8757% +9.4642%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

execute/factorial_optimized/v1                                                                             
                        time:   [727.84 ns 729.14 ns 730.44 ns]
                        change: [-1.6743% -1.1619% -0.7095%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

execute/recursive_ok/v0 time:   [514.24 us 515.16 us 516.13 us]                                    
                        change: [+4.9704% +5.4392% +5.8648%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

execute/recursive_ok/v1 time:   [408.60 us 409.55 us 410.46 us]                                    
                        change: [+26.512% +27.249% +27.946%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

execute/recursive_trap/v0                                                                            
                        time:   [71.507 us 71.625 us 71.750 us]
                        change: [+2.6568% +3.3007% +3.9599%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

execute/recursive_trap/v1                                                                             
                        time:   [40.497 us 40.576 us 40.663 us]
                        change: [+33.254% +33.990% +34.696%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

execute/host_calls/v0   time:   [81.444 us 81.584 us 81.727 us]                                  
                        change: [+10.842% +11.421% +12.033%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/host_calls/v1   time:   [57.899 us 58.016 us 58.147 us]                                  
                        change: [+26.063% +26.725% +27.390%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

@Robbepop
Member

Robbepop commented Jan 27, 2022

I did a test where I only added #[inline] annotations to a selected set of functions in the wasmi_core crate here.
Benchmarks show significant improvements for the best-case scenario; unfortunately, however, not as many improvements under the default release profile.

@tavianator
Author

@tavianator thanks a lot for this PR and for pasting the benchmark results!

You're welcome! :)

Can you share more about how you selected the functions that you annotated with #[inline]?

I got a perf profile of the fast and the slow configuration and used perf report and perf diff to compare them. The slow configuration showed quite a few functions that don't appear at all in the fast profile, indicating that those functions were likely inlined in the fast configuration. I sprinkled #[inline] on the ones that looked small enough, took another profile, and kept going until I had to step away, at which point I put this PR up.
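For illustration, the kind of change this PR makes looks like the following (a hypothetical helper, not an actual wasmi_core function); the attribute is what matters, since #[inline] makes a function a candidate for cross-crate inlining even when LTO is disabled:

// Hypothetical example of an #[inline]-annotated helper.
#[inline]
pub fn wrapping_add_i32(lhs: i32, rhs: i32) -> i32 {
    lhs.wrapping_add(rhs)
}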

Did you also check the performance of the new engine under the following profile?

[profile.release]
lto = "fat"
codegen-units = 1

That's the "remaining regression" in my post above. I should do a before-and-after comparison with that configuration, but I haven't done that yet.

I am asking because we should definitely make sure that the best possible performance does not get pessimized by any "optimization" we apply. We only want to close the gap between the two profiles without pessimizing the best case; if we are lucky, the optimizations even improve the best possible outcome. I am sure there is still a lot of room for improvement even over the best results we have seen so far. For example, from what I can tell, the Rust compiler currently does not optimize our switch-based instruction dispatch in the way that would be most optimal for it, even though in theory it could. That alone could yield a potential 30% performance gain on top of our best results so far.

Right, agreed. And this is all only looking at one benchmark (execute/tiny_keccak/v1); presumably I should look at a wider range of benchmarks to find the appropriate candidates for inlining.

I also have a more in-depth answer here.

Thanks, I'll read that now.

@Robbepop
Member

Robbepop commented Jan 27, 2022

@tavianator Would you like to revert the #[inline] annotations from source files other than the ones in the wasmi_core crate, to see whether that brings the performance on par with my PR? If that's the case we should merge yours, since it would introduce fewer #[inline] annotations (71 versus 114).

And yes, ideally you would test the performance of all the benchmarks we have using cargo bench.
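For reference, running the full suite or a filtered subset by name (these are Criterion benchmarks, so a substring filter can be passed after --; the filter below is just an example taken from the benchmark names above) looks like:

cargo bench
cargo bench -- execute/tiny_keccak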

@Robbepop
Member

Robbepop commented Feb 3, 2022

@tavianator Would you like to revert the #[inline] annotations from source files other than the ones in the wasmi_core crate, to see whether that brings the performance on par with my PR? If that's the case we should merge yours, since it would introduce fewer #[inline] annotations (71 versus 114).

And yes, ideally you would test the performance of all the benchmarks we have using cargo bench.

@tavianator would you be willing to do this? Otherwise I'd like to close this PR.

@Robbepop
Member

Robbepop commented Feb 4, 2022

Closing this due to inactivity, and also because we just merged #348, which does something similar but without the regressions for the v1 engine.

However, thank you a lot for researching in this direction again, even though it did not really lead to a solution of the original problem! :)

@Robbepop Robbepop closed this Feb 4, 2022
@tavianator tavianator deleted the inline-all-the-things branch February 4, 2022 15:27