Mark a lot of functions inline #347

Closed · wants to merge 1 commit into from

Conversation

tavianator

This fixes a lot of the performance regression reported in #339.

@tavianator
Author

Right now with lto = false, codegen-units = 16, the benefit is:

execute/tiny_keccak/v1  time:   [1.2371 ms 1.2384 ms 1.2398 ms]                                    
                        change: [-67.706% -67.636% -67.583%] (p = 0.00 < 0.05)
                        Performance has improved.

The remaining regression (compared to lto = "fat", codegen-units = 1) is:

execute/tiny_keccak/v1  time:   [706.29 us 713.67 us 720.90 us]                                   
                        change: [-42.294% -42.015% -41.745%] (p = 0.00 < 0.05)
                        Performance has improved.
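For context, the first configuration mentioned above would roughly correspond to the following Cargo profile (a sketch; the exact profile settings used for these runs are not shown here):

[profile.release]
lto = false
codegen-units = 16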

@ggwpez

ggwpez commented Jan 26, 2022

We could also try to use opt-level 2 instead of 3.
Level 3 optimizes for binary size as well, and would therefore do less inlining than it could.
At least from C++ experience, I can say that 2 is sometimes faster than 3 if you take CPU cache size into account.
We should probably test this on reference hardware.

@Robbepop
Member

We could also try to use opt-level 2 instead of 3. Level 3 optimizes for binary size as well, and would therefore do less inlining than it could. At least from C++ experience, I can say that 2 is sometimes faster than 3 if you take CPU cache size into account. We should probably test this on reference hardware.

Good idea! I just tried that out on my computer and benchmarked it. The rest of the profile config (lto = "fat", codegen-units = 1) remained the same, and the result was roughly a 100% slowdown. So I don't think opt-level = 2 is something we should pursue further for wasmi.
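For reference, the experiment described above roughly corresponds to a profile like this (a sketch; the exact benchmark profile is not part of this thread):

[profile.release]
opt-level = 2
lto = "fat"
codegen-units = 1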

@Robbepop
Member

Robbepop commented Jan 27, 2022

@tavianator thanks a lot for this PR and for pasting the benchmark results!

Can you share more about how you selected the functions that you annotated with #[inline]?
Did you also check the performance of the new engine under the following profile?

[profile.release]
lto = "fat"
codegen-units = 1

I am asking because we should definitely make sure that the best possible performance does not get pessimized by any "optimization" we apply. We only want to close the gap between the two profiles without pessimizing the best case; if we are lucky, the optimizations even improve the best possible outcome. I am sure there is still a lot of room for improvement even over the best results we have seen so far. For example, from what I can tell, the Rust compiler currently does not optimize our switch-based instruction dispatch in the way that would be most optimal for it, even though in theory it could. That alone could yield a potential 30% performance gain on top of our best results so far.
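For readers unfamiliar with the pattern, a rough sketch of switch-based dispatch (illustrative only; the instruction set and stack handling below are hypothetical, not wasmi's actual code) looks like this:

// Sketch of a switch-based (match-based) interpreter loop.
enum Instr {
    I32Const(i32),
    I32Add,
    Return,
}

fn execute(instrs: &[Instr], stack: &mut Vec<i32>) {
    for instr in instrs {
        match instr {
            // Push an immediate value onto the value stack.
            Instr::I32Const(value) => stack.push(*value),
            // Pop two operands, add them with wrapping semantics, push the result.
            Instr::I32Add => {
                let rhs = stack.pop().expect("stack underflow");
                let lhs = stack.pop().expect("stack underflow");
                stack.push(lhs.wrapping_add(rhs));
            }
            Instr::Return => return,
        }
    }
}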

I also have a more in-depth answer here.

@Robbepop
Member

Robbepop commented Jan 27, 2022

I just compared this branch with master branch on the following profile

[profile.release]
lto = "fat"
codegen-units = 1

Unfortunately, we can see slowdowns in the range of 20-33%, especially for benchmarks that evaluate the performance of function dispatch and module instantiation. Instruction dispatch is also affected.

tl;dr: While this PR closes the gap between the best and worst performance, it also pessimizes the best performance.

compile_and_validate/v0 time:   [6.7932 ms 6.8118 ms 6.8340 ms]                                    
                        change: [-1.9011% -1.4463% -0.9943%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

compile_and_validate/v1 time:   [6.7415 ms 6.7567 ms 6.7735 ms]                                    
                        change: [-1.0009% -0.6186% -0.2596%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

instantiate/v0          time:   [454.12 us 455.82 us 457.68 us]                           
                        change: [-2.5762% -0.7318% +1.0146%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe

instantiate/v1          time:   [58.478 us 58.600 us 58.734 us]                           
                        change: [+8.8324% +9.6211% +10.495%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/tiny_keccak/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v0  time:   [1.2641 ms 1.2673 ms 1.2704 ms]                                    
                        change: [+1.0453% +1.5824% +2.0917%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high severe

Benchmarking execute/tiny_keccak/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v1  time:   [1.0314 ms 1.0341 ms 1.0373 ms]                                    
                        change: [+6.6106% +7.2196% +7.8061%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

Benchmarking execute/rev_complement/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v0                                                                             
                        time:   [1.5623 ms 1.5656 ms 1.5692 ms]
                        change: [-1.3037% -0.8696% -0.4748%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

Benchmarking execute/rev_complement/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v1                                                                             
                        time:   [1.4035 ms 1.4061 ms 1.4090 ms]
                        change: [+23.567% +24.388% +25.178%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Benchmarking execute/regex_redux/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v0  time:   [1.6797 ms 1.6821 ms 1.6848 ms]                                    
                        change: [-1.2953% -0.6148% +0.0253%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking execute/regex_redux/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v1  time:   [1.4787 ms 1.4811 ms 1.4836 ms]                                    
                        change: [+24.427% +24.963% +25.497%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/count_until/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v0  time:   [1.8691 ms 1.8716 ms 1.8740 ms]                                    
                        change: [+1.4873% +1.7939% +2.0877%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/count_until/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v1  time:   [1.7343 ms 1.7375 ms 1.7410 ms]                                    
                        change: [+1.0826% +1.4768% +1.8841%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

execute/factorial_recursive/v0                                                                             
                        time:   [25.236 us 25.280 us 25.326 us]
                        change: [+9.2538% +9.7995% +10.420%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

execute/factorial_recursive/v1                                                                             
                        time:   [1.5521 us 1.5540 us 1.5561 us]
                        change: [+32.842% +33.422% +33.966%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

execute/factorial_optimized/v0                                                                             
                        time:   [24.157 us 24.213 us 24.284 us]
                        change: [+8.3346% +8.8757% +9.4642%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

execute/factorial_optimized/v1                                                                             
                        time:   [727.84 ns 729.14 ns 730.44 ns]
                        change: [-1.6743% -1.1619% -0.7095%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

execute/recursive_ok/v0 time:   [514.24 us 515.16 us 516.13 us]                                    
                        change: [+4.9704% +5.4392% +5.8648%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

execute/recursive_ok/v1 time:   [408.60 us 409.55 us 410.46 us]                                    
                        change: [+26.512% +27.249% +27.946%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

execute/recursive_trap/v0                                                                            
                        time:   [71.507 us 71.625 us 71.750 us]
                        change: [+2.6568% +3.3007% +3.9599%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

execute/recursive_trap/v1                                                                             
                        time:   [40.497 us 40.576 us 40.663 us]
                        change: [+33.254% +33.990% +34.696%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

execute/host_calls/v0   time:   [81.444 us 81.584 us 81.727 us]                                  
                        change: [+10.842% +11.421% +12.033%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/host_calls/v1   time:   [57.899 us 58.016 us 58.147 us]                                  
                        change: [+26.063% +26.725% +27.390%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

@Robbepop
Member

Robbepop commented Jan 27, 2022

I did a test where I only added #[inline] annotations to a selected set of functions in the wasmi_core crate here.
Benchmarks show significant improvements for the best-case scenario; unfortunately, however, not as many improvements under the default release profile.

@tavianator
Author

@tavianator thanks a lot for this PR and for pasting the benchmark results!

You're welcome! :)

Can you share more about how you selected the functions that you annotated with #[inline]?

I got a perf profile of the fast and the slow configuration and used perf report and perf diff to compare them. The slow configuration showed quite a few functions that don't appear at all in the fast profile, indicating that those functions were likely inlined in the fast configuration. I sprinkled #[inline] on the ones that looked small enough, took another profile, and kept going until I had to step away, at which point I put this PR up.
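For illustration, the kind of change this PR makes looks like the following (a hypothetical helper, not an actual wasmi_core function); the attribute is what matters, since #[inline] makes a function a candidate for cross-crate inlining even when LTO is disabled:

// Hypothetical example of an #[inline]-annotated helper.
#[inline]
pub fn wrapping_add_i32(lhs: i32, rhs: i32) -> i32 {
    lhs.wrapping_add(rhs)
}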

Did you also check the performance of the new engine under the following profile?

[profile.release]
lto = "fat"
codegen-units = 1

That's the "remaining regression" in my post above. I should do a before-and-after comparison with that configuration, but I haven't done that yet.

I am asking because we should definitely make sure that the best possible performance does not get pessimized by any "optimization" we apply. We only want to close the gap between the two profiles without pessimizing the best case; if we are lucky, the optimizations even improve the best possible outcome. I am sure there is still a lot of room for improvement even over the best results we have seen so far. For example, from what I can tell, the Rust compiler currently does not optimize our switch-based instruction dispatch in the way that would be most optimal for it, even though in theory it could. That alone could yield a potential 30% performance gain on top of our best results so far.

Right, agreed. And this is all only looking at one benchmark (execute/tiny_keccak/v1); presumably I should look at a wider range of benchmarks to find the appropriate candidates for inlining.

I also have a more in-depth answer here.

Thanks, I'll read that now.

@Robbepop
Member

Robbepop commented Jan 27, 2022

@tavianator Would you like to revert the #[inline] annotations from source files other than the ones in the wasmi_core crate, to see whether that brings the performance on par with my PR? If that's the case we should merge yours, since it would introduce fewer #[inline] annotations (71 versus 114).

And yes, ideally you would test the performance of all the benchmarks we have using cargo bench.
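For reference, running the full suite or a filtered subset by name (these are Criterion benchmarks, so a substring filter can be passed after --; the filter below is just an example taken from the benchmark names above) looks like:

cargo bench
cargo bench -- execute/tiny_keccak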

@Robbepop
Member

Robbepop commented Feb 3, 2022

@tavianator Would you like to revert the #[inline] annotations from source files other than the ones in the wasmi_core crate, to see whether that brings the performance on par with my PR? If that's the case we should merge yours, since it would introduce fewer #[inline] annotations (71 versus 114).

And yes, ideally you would test the performance of all the benchmarks we have using cargo bench.

@tavianator would you be willing to do this? Otherwise I'd like to close this PR.

@Robbepop
Member

Robbepop commented Feb 4, 2022

Closing this due to inactivity, and also because we just merged #348, which does something similar but without the regressions for the v1 engine.

However, thank you a lot for researching in this direction again, even though it did not really lead to a solution of the original problem! :)

@Robbepop Robbepop closed this Feb 4, 2022
@tavianator tavianator deleted the inline-all-the-things branch February 4, 2022 15:27