Add #[inline] annotations to small functions in wasmi_core crate #348

Merged
merged 2 commits into from
Feb 4, 2022


Robbepop
Member

@Robbepop Robbepop commented Jan 27, 2022

This PR refines an earlier PR.
The main difference is that I only annotated the relevant functions in the wasmi_core crate.
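For readers unfamiliar with the annotation, here is a minimal sketch of what adding `#[inline]` to small functions looks like. The `RawValue` type and its methods are hypothetical stand-ins and are not taken from the wasmi_core sources. `#[inline]` is only a hint to the compiler; among other things it makes the function body available for inlining in downstream crates even when LTO is disabled.

```rust
/// Hypothetical small wrapper type, used for illustration only.
/// It is NOT the actual wasmi_core definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RawValue {
    bits: u64,
}

impl RawValue {
    /// Stores an `i32` in the low 32 bits.
    /// Small enough that a call would cost more than the body itself.
    #[inline]
    pub fn from_i32(value: i32) -> Self {
        Self { bits: value as u32 as u64 }
    }

    /// Reads the low 32 bits back as an `i32`.
    #[inline]
    pub fn to_i32(self) -> i32 {
        self.bits as u32 as i32
    }

    /// Wrapping 32-bit addition, built from the two helpers above.
    #[inline]
    pub fn i32_add(self, rhs: Self) -> Self {
        Self::from_i32(self.to_i32().wrapping_add(rhs.to_i32()))
    }
}

fn main() {
    let a = RawValue::from_i32(40);
    let b = RawValue::from_i32(2);
    println!("{}", a.i32_add(b).to_i32());
}
```

Whether such an annotation actually helps depends on the build profile and the call sites, which is why the benchmarks below matter.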

I verified with benchmarks that best-case performance does not regress.
In fact, the benchmarks show some neat performance wins even with the current best-case profile settings:

[profile.release]
lto = "fat"
codegen-units = 1
compile_and_validate/v0 time:   [6.8221 ms 6.8453 ms 6.8721 ms]                                    
                        change: [-2.0524% -1.5869% -1.1425%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

compile_and_validate/v1 time:   [6.7297 ms 6.7463 ms 6.7650 ms]                                    
                        change: [-2.6581% -2.2644% -1.8682%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

instantiate/v0          time:   [460.08 us 461.96 us 464.18 us]                           
                        change: [+0.2618% +2.3480% +4.2720%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

instantiate/v1          time:   [54.018 us 54.107 us 54.208 us]                           
                        change: [-4.6674% -4.1457% -3.5764%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/tiny_keccak/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v0  time:   [1.2702 ms 1.2727 ms 1.2752 ms]                                    
                        change: [+0.5417% +1.2087% +1.8752%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

execute/tiny_keccak/v1  time:   [961.91 us 963.43 us 965.09 us]                                   
                        change: [-2.0734% -1.7034% -1.3231%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking execute/rev_complement/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v0                                                                             
                        time:   [1.5623 ms 1.5643 ms 1.5664 ms]
                        change: [+0.6921% +1.2161% +1.7325%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/rev_complement/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
execute/rev_complement/v1                                                                             
                        time:   [1.1290 ms 1.1311 ms 1.1333 ms]
                        change: [-2.6785% -1.9878% -1.2998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/regex_redux/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v0  time:   [1.6770 ms 1.6812 ms 1.6857 ms]                                    
                        change: [+1.3202% +1.8556% +2.3431%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe

Benchmarking execute/regex_redux/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
execute/regex_redux/v1  time:   [1.1888 ms 1.1910 ms 1.1934 ms]                                    
                        change: [-1.2959% -0.7429% -0.1354%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/count_until/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v0  time:   [1.8797 ms 1.8817 ms 1.8838 ms]                                    
                        change: [+1.5764% +2.0140% +2.4777%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/count_until/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v1  time:   [1.7268 ms 1.7289 ms 1.7312 ms]                                    
                        change: [-4.8373% -4.3642% -3.9103%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

execute/factorial_recursive/v0                                                                             
                        time:   [25.391 us 25.457 us 25.530 us]
                        change: [+10.521% +11.035% +11.545%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

execute/factorial_recursive/v1                                                                             
                        time:   [1.1499 us 1.1519 us 1.1543 us]
                        change: [-5.7141% -5.2811% -4.8554%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

execute/factorial_optimized/v0                                                                             
                        time:   [24.226 us 24.291 us 24.364 us]
                        change: [+9.9294% +10.795% +11.618%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

execute/factorial_optimized/v1                                                                             
                        time:   [734.22 ns 735.52 ns 736.90 ns]
                        change: [-0.7316% -0.3179% +0.0833%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/recursive_ok/v0 time:   [518.42 us 519.49 us 520.76 us]                                    
                        change: [+6.4530% +7.1026% +7.8196%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe

execute/recursive_ok/v1 time:   [301.69 us 302.45 us 303.27 us]                                    
                        change: [-6.0900% -5.6249% -5.1553%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/recursive_trap/v0                                                                            
                        time:   [71.429 us 71.619 us 71.832 us]
                        change: [+2.5093% +3.0868% +3.6646%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

execute/recursive_trap/v1                                                                             
                        time:   [28.415 us 28.458 us 28.503 us]
                        change: [-6.2794% -5.8939% -5.5115%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

execute/host_calls/v0   time:   [75.813 us 75.955 us 76.109 us]                                  
                        change: [+3.0887% +3.6202% +4.1416%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low mild
  6 (6.00%) high mild
  6 (6.00%) high severe

execute/host_calls/v1   time:   [46.744 us 46.836 us 46.933 us]                                   
                        change: [+2.6132% +3.0292% +3.4293%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

Compared to the original PR there are fewer gains on the default release profile:

[profile.release]
lto = false
codegen-units = 16
compile_and_validate/v0 time:   [8.8091 ms 8.8399 ms 8.8740 ms]                                    
                        change: [-3.4497% -2.8317% -2.2452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

compile_and_validate/v1 time:   [9.1008 ms 9.1259 ms 9.1530 ms]                                    
                        change: [-3.1303% -2.6654% -2.2118%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

instantiate/v0          time:   [516.68 us 519.71 us 523.97 us]                           
                        change: [-1.9549% -0.1458% +1.7433%] (p = 0.89 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

instantiate/v1          time:   [75.096 us 75.261 us 75.447 us]                           
                        change: [-4.7806% -3.9669% -3.1288%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/tiny_keccak/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v0  time:   [1.3225 ms 1.3257 ms 1.3296 ms]                                    
                        change: [-15.715% -15.047% -14.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

execute/tiny_keccak/v1  time:   [4.4582 ms 4.5235 ms 4.5867 ms]                                    
                        change: [-1.7800% -0.2564% +1.1385%] (p = 0.73 > 0.05)
                        No change in performance detected.

Benchmarking execute/rev_complement/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v0                                                                             
                        time:   [1.6442 ms 1.6473 ms 1.6504 ms]
                        change: [-24.252% -23.870% -23.474%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

execute/rev_complement/v1                                                                             
                        time:   [5.5133 ms 5.5247 ms 5.5369 ms]
                        change: [-11.425% -11.086% -10.744%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking execute/regex_redux/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v0  time:   [1.8321 ms 1.8355 ms 1.8392 ms]                                    
                        change: [-11.425% -10.620% -9.9913%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

execute/regex_redux/v1  time:   [5.2250 ms 5.2395 ms 5.2563 ms]                                    
                        change: [-10.833% -10.416% -10.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking execute/count_until/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v0  time:   [1.7288 ms 1.7322 ms 1.7359 ms]                                    
                        change: [-16.428% -16.072% -15.691%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

execute/count_until/v1  time:   [5.2632 ms 5.2786 ms 5.2973 ms]                                    
                        change: [-12.046% -11.713% -11.328%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

execute/factorial_recursive/v0                                                                             
                        time:   [23.622 us 23.680 us 23.745 us]
                        change: [-13.596% -13.063% -12.521%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

execute/factorial_recursive/v1                                                                             
                        time:   [3.3644 us 3.3738 us 3.3838 us]
                        change: [-2.9039% -2.3538% -1.7840%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

execute/factorial_optimized/v0                                                                             
                        time:   [22.001 us 22.059 us 22.118 us]
                        change: [-14.925% -14.352% -13.786%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

execute/factorial_optimized/v1                                                                             
                        time:   [2.1976 us 2.2023 us 2.2075 us]
                        change: [-26.559% -26.166% -25.777%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

execute/recursive_ok/v0 time:   [576.76 us 578.01 us 579.46 us]                                    
                        change: [-6.8714% -6.1659% -5.4949%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild
  4 (4.00%) high severe

execute/recursive_ok/v1 time:   [828.75 us 830.57 us 832.48 us]                                    
                        change: [-7.4671% -6.9124% -6.3596%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe

execute/recursive_trap/v0                                                                            
                        time:   [76.037 us 76.206 us 76.384 us]
                        change: [-10.528% -9.6716% -8.9099%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/recursive_trap/v1                                                                            
                        time:   [76.506 us 76.787 us 77.122 us]
                        change: [-9.6152% -7.9603% -6.4676%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

execute/host_calls/v0   time:   [96.371 us 96.596 us 96.845 us]                                  
                        change: [-14.304% -13.825% -13.346%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

execute/host_calls/v1   time:   [117.97 us 118.26 us 118.60 us]                                  
                        change: [-3.5932% -2.9390% -2.3392%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

@athei
Collaborator

athei commented Jan 28, 2022

I don't see a clear win in the first benchmarks you posted (the first box). Some regressed and some improved. I think using profile guided optimization would be a more systematic approach.

@Robbepop
Member Author

> I don't see a clear win in the first benchmarks you posted (the first box). Some regressed and some improved. I think using profile guided optimization would be a more systematic approach.

Yeah, let me explain. Most v0 benchmarks show slight regressions in the range of 2-4%, whereas for v1 we can see improvements of 3-5% across the board.
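As a rough cross-check of these ranges, the median change estimates of the execute/* benchmarks from the lto = "fat", codegen-units = 1 run above can be averaged per engine version. The numbers are copied from that output in the order tiny_keccak, rev_complement, regex_redux, count_until, factorial_recursive, factorial_optimized, recursive_ok, recursive_trap, host_calls. Using a plain arithmetic mean is an assumption made for this sketch; it weights every benchmark equally and ignores the confidence intervals.

```rust
/// Arithmetic mean of a slice of percentage changes.
fn mean(changes: &[f64]) -> f64 {
    changes.iter().sum::<f64>() / changes.len() as f64
}

/// Median change estimates (%) for execute/*/v0, copied from the
/// lto = "fat", codegen-units = 1 Criterion output.
const V0_EXECUTE: [f64; 9] = [
    1.2087, 1.2161, 1.8556, 2.0140, 11.035, 10.795, 7.1026, 3.0868, 3.6202,
];

/// Median change estimates (%) for execute/*/v1 from the same run.
const V1_EXECUTE: [f64; 9] = [
    -1.7034, -1.9878, -0.7429, -4.3642, -5.2811, -0.3179, -5.6249, -5.8939, 3.0292,
];

fn main() {
    // Positive values are slowdowns, negative values are speedups.
    println!("v0 mean change: {:+.2}%", mean(&V0_EXECUTE));
    println!("v1 mean change: {:+.2}%", mean(&V1_EXECUTE));
}
```

On these inputs the v0 mean works out to roughly +4.7% and the v1 mean to roughly -2.5%. The v0 figure is pulled up by the two factorial benchmarks and the v1 figure is pulled toward zero by host_calls, so the means only confirm the direction of the trend, not the exact ranges.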

@athei
Collaborator

athei commented Jan 28, 2022

We shouldn't merge it then, right? Because we will actually still be using v0 for some time.

@Robbepop
Member Author

Robbepop commented Jan 28, 2022

> We shouldn't merge it then, right? Cause we are actually still using v0 for some time now.

I currently do not plan to release another v0 version. As soon as the big tasks for wasmi_v1 are done, I will start working on the Substrate PR that uses wasmi_v1 for experimentation.
The big tasks that are required include:

  • Using wasmparser for parsing and validation: PR
    • Note that this PR will also allow for streaming module compilation.
    • In the future we might also be able to implement parallel module compilation.
  • Implementing wasmi bytecode fusion, since papers suggest 50-100% performance boosts from this optimization alone compared to our current stack-based bytecode.
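To illustrate the general idea behind bytecode fusion, here is a minimal sketch of a peephole pass over a toy stack machine. The `Instr` enum, the `LocalAddConst` fused instruction, and the `fuse` and `run` functions are all hypothetical and do not reflect wasmi's actual bytecode or planned design. The sketch also assumes straight-line code; a real pass would have to account for branch targets.

```rust
/// Hypothetical stack-machine instructions; NOT wasmi's actual bytecode.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Instr {
    LocalGet(usize),
    I32Const(i32),
    I32Add,
    /// Fused form of `LocalGet(n); I32Const(c); I32Add`.
    LocalAddConst(usize, i32),
}

/// Peephole pass: replaces every `LocalGet, I32Const, I32Add` triple
/// with a single fused instruction and copies everything else unchanged.
fn fuse(code: &[Instr]) -> Vec<Instr> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        match &code[i..] {
            [Instr::LocalGet(n), Instr::I32Const(c), Instr::I32Add, ..] => {
                out.push(Instr::LocalAddConst(*n, *c));
                i += 3;
            }
            _ => {
                out.push(code[i]);
                i += 1;
            }
        }
    }
    out
}

/// Executes the code and returns the top of the stack together with
/// the number of instructions that were dispatched.
fn run(code: &[Instr], locals: &[i32]) -> (Option<i32>, usize) {
    let mut stack: Vec<i32> = Vec::new();
    let mut dispatched = 0;
    for instr in code {
        dispatched += 1;
        match *instr {
            Instr::LocalGet(n) => stack.push(locals[n]),
            Instr::I32Const(c) => stack.push(c),
            Instr::I32Add => {
                let rhs = stack.pop().expect("stack underflow");
                let lhs = stack.pop().expect("stack underflow");
                stack.push(lhs.wrapping_add(rhs));
            }
            // One dispatch and one push instead of three dispatches,
            // two pushes and two pops.
            Instr::LocalAddConst(n, c) => stack.push(locals[n].wrapping_add(c)),
        }
    }
    (stack.pop(), dispatched)
}

fn main() {
    let code = [
        Instr::LocalGet(0),
        Instr::I32Const(5),
        Instr::I32Add,
        Instr::LocalGet(1),
        Instr::I32Add,
    ];
    let fused = fuse(&code);
    let locals = [10, 20];
    println!("unfused: {:?}", run(&code, &locals));
    println!("fused:   {:?}", run(&fused, &locals));
}
```

Both programs compute the same result, but the fused one dispatches fewer instructions and touches the value stack less often, which is where an interpreter would be expected to save time.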

@Robbepop
Member Author

> We shouldn't merge it then, right? Cause we are actually still using v0 for some time now.

Note that I have only benchmarked lto = "fat" with codegen-units = 1, so it could very well be that v0 sees some improvements under different profile settings. That would not surprise me at all, given how differently v0 and v1 have behaved in the past with respect to benchmarks and profiles.

@athei
Collaborator

athei commented Jan 28, 2022

I don't care about other profiles which are clearly inferior. We shouldn't just merge inlines on a hunch when they even pessimise the profile and version of the crate we are currently using (or will be using very soon).

@Robbepop
Member Author

I guess this PR then has to wait until wasmi_v1 is ready.

@Robbepop Robbepop added the blocked label Jan 28, 2022
@Robbepop
Member Author

Robbepop commented Feb 4, 2022

@athei can I merge this since we no longer really seem to be interested in merging any of the old wasmi v0 versions into Substrate?

@athei
Collaborator

athei commented Feb 4, 2022

Yeah sure.

@Robbepop Robbepop merged commit fbb556f into master Feb 4, 2022
@athei athei deleted the rf-inline-wasmi-core branch February 4, 2022 10:00