Add #[inline] annotations to small functions in wasmi_core crate #348

Robbepop · 2022-01-27T13:31:38Z

This PR is a refinement of #347.
The main difference from the former PR is that this time I only annotated relevant functions in the wasmi_core crate.
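Below is a minimal sketch of the kind of annotation this PR adds. The names (F32Bits, wrapping_add_i32) are hypothetical stand-ins, not the actual wasmi_core items. #[inline] is only a hint. For small non-generic functions it mainly matters across crate boundaries: without it (and without LTO), a downstream crate usually cannot inline such a function, because its body is only compiled inside the defining crate. That is one plausible reason why the effect differs between LTO settings.

```rust
/// Hypothetical newtype over the raw bits of an `f32`, standing in for the
/// kind of small value helpers a core crate exposes to the interpreter.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub struct F32Bits(u32);

impl F32Bits {
    /// Tiny constructors and accessors like these are typical `#[inline]`
    /// candidates: the call overhead can exceed the work they actually do.
    #[inline]
    pub fn from_float(value: f32) -> Self {
        Self(value.to_bits())
    }

    /// Reinterprets the stored bits as an `f32` again.
    #[inline]
    pub fn to_float(self) -> f32 {
        f32::from_bits(self.0)
    }

    /// Returns the raw bit pattern.
    #[inline]
    pub fn to_bits(self) -> u32 {
        self.0
    }
}

/// A small free function with Wasm-style wrapping semantics for `i32.add`.
#[inline]
pub fn wrapping_add_i32(lhs: i32, rhs: i32) -> i32 {
    lhs.wrapping_add(rhs)
}
```

For example, F32Bits::from_float(1.5).to_bits() yields 0x3FC0_0000, and wrapping_add_i32(i32::MAX, 1) wraps around to i32::MIN.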

I verified with benchmarks that performance in the best case is not regressed.
In fact, the benchmarks show some neat performance wins even with the current best-case profile settings:

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```

compile_and_validate/v0 time:   [6.8221 ms 6.8453 ms 6.8721 ms]                                    
                        change: [-2.0524% -1.5869% -1.1425%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

compile_and_validate/v1 time:   [6.7297 ms 6.7463 ms 6.7650 ms]                                    
                        change: [-2.6581% -2.2644% -1.8682%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

instantiate/v0          time:   [460.08 us 461.96 us 464.18 us]                           
                        change: [+0.2618% +2.3480% +4.2720%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

instantiate/v1          time:   [54.018 us 54.107 us 54.208 us]                           
                        change: [-4.6674% -4.1457% -3.5764%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/tiny_keccak/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v0  time:   [1.2702 ms 1.2727 ms 1.2752 ms]                                    
                        change: [+0.5417% +1.2087% +1.8752%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

execute/tiny_keccak/v1  time:   [961.91 us 963.43 us 965.09 us]                                   
                        change: [-2.0734% -1.7034% -1.3231%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking execute/rev_complement/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v0                                                                             
                        time:   [1.5623 ms 1.5643 ms 1.5664 ms]
                        change: [+0.6921% +1.2161% +1.7325%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/rev_complement/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
execute/rev_complement/v1                                                                             
                        time:   [1.1290 ms 1.1311 ms 1.1333 ms]
                        change: [-2.6785% -1.9878% -1.2998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/regex_redux/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v0  time:   [1.6770 ms 1.6812 ms 1.6857 ms]                                    
                        change: [+1.3202% +1.8556% +2.3431%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe

Benchmarking execute/regex_redux/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
execute/regex_redux/v1  time:   [1.1888 ms 1.1910 ms 1.1934 ms]                                    
                        change: [-1.2959% -0.7429% -0.1354%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/count_until/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v0  time:   [1.8797 ms 1.8817 ms 1.8838 ms]                                    
                        change: [+1.5764% +2.0140% +2.4777%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

Benchmarking execute/count_until/v1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v1  time:   [1.7268 ms 1.7289 ms 1.7312 ms]                                    
                        change: [-4.8373% -4.3642% -3.9103%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

execute/factorial_recursive/v0                                                                             
                        time:   [25.391 us 25.457 us 25.530 us]
                        change: [+10.521% +11.035% +11.545%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

execute/factorial_recursive/v1                                                                             
                        time:   [1.1499 us 1.1519 us 1.1543 us]
                        change: [-5.7141% -5.2811% -4.8554%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

execute/factorial_optimized/v0                                                                             
                        time:   [24.226 us 24.291 us 24.364 us]
                        change: [+9.9294% +10.795% +11.618%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

execute/factorial_optimized/v1                                                                             
                        time:   [734.22 ns 735.52 ns 736.90 ns]
                        change: [-0.7316% -0.3179% +0.0833%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/recursive_ok/v0 time:   [518.42 us 519.49 us 520.76 us]                                    
                        change: [+6.4530% +7.1026% +7.8196%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe

execute/recursive_ok/v1 time:   [301.69 us 302.45 us 303.27 us]                                    
                        change: [-6.0900% -5.6249% -5.1553%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/recursive_trap/v0                                                                            
                        time:   [71.429 us 71.619 us 71.832 us]
                        change: [+2.5093% +3.0868% +3.6646%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

execute/recursive_trap/v1                                                                             
                        time:   [28.415 us 28.458 us 28.503 us]
                        change: [-6.2794% -5.8939% -5.5115%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

execute/host_calls/v0   time:   [75.813 us 75.955 us 76.109 us]                                  
                        change: [+3.0887% +3.6202% +4.1416%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low mild
  6 (6.00%) high mild
  6 (6.00%) high severe

execute/host_calls/v1   time:   [46.744 us 46.836 us 46.933 us]                                   
                        change: [+2.6132% +3.0292% +3.4293%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe
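Each verdict line above ("Performance has improved.", "Change within noise threshold.", and so on) follows from the p-value and the confidence interval printed on the "change:" line before it. The sketch below reproduces that rule as I understand criterion's default report: a p-value at or above the significance level (0.05 by default) means no change was detected; otherwise the whole confidence interval of the relative change must lie beyond the noise threshold (±1% by default) to count as an improvement or a regression. Verdict and classify are illustrative names, not criterion's API.

```rust
/// The four verdicts that appear under the "change:" lines.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// Classifies one benchmark comparison.
///
/// `change_ci` is the (lower, upper) bound of the relative change as a
/// fraction (0.01 == 1%), `significance` the p-value cutoff and `noise`
/// the half-width of the band treated as noise.
pub fn classify(change_ci: (f64, f64), p_value: f64, significance: f64, noise: f64) -> Verdict {
    let (lb, ub) = change_ci;
    if p_value >= significance {
        // Not statistically significant at all.
        Verdict::NoChange
    } else if lb < -noise && ub < -noise {
        // Entire interval is faster than the noise band.
        Verdict::Improved
    } else if lb > noise && ub > noise {
        // Entire interval is slower than the noise band.
        Verdict::Regressed
    } else {
        // Significant, but the interval overlaps the noise band.
        Verdict::WithinNoise
    }
}
```

For example, instantiate/v0 above ([+0.2618% +2.3480% +4.2720%], p = 0.02) is significant, but its lower bound lies inside the ±1% band, so it is reported as within noise rather than as a regression.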

Compared to the original PR, the gains with the default release profile are smaller:

```toml
[profile.release]
lto = false
codegen-units = 16
```

compile_and_validate/v0 time:   [8.8091 ms 8.8399 ms 8.8740 ms]                                    
                        change: [-3.4497% -2.8317% -2.2452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

compile_and_validate/v1 time:   [9.1008 ms 9.1259 ms 9.1530 ms]                                    
                        change: [-3.1303% -2.6654% -2.2118%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

instantiate/v0          time:   [516.68 us 519.71 us 523.97 us]                           
                        change: [-1.9549% -0.1458% +1.7433%] (p = 0.89 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

instantiate/v1          time:   [75.096 us 75.261 us 75.447 us]                           
                        change: [-4.7806% -3.9669% -3.1288%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

Benchmarking execute/tiny_keccak/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60.
execute/tiny_keccak/v0  time:   [1.3225 ms 1.3257 ms 1.3296 ms]                                    
                        change: [-15.715% -15.047% -14.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

execute/tiny_keccak/v1  time:   [4.4582 ms 4.5235 ms 4.5867 ms]                                    
                        change: [-1.7800% -0.2564% +1.1385%] (p = 0.73 > 0.05)
                        No change in performance detected.

Benchmarking execute/rev_complement/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
execute/rev_complement/v0                                                                             
                        time:   [1.6442 ms 1.6473 ms 1.6504 ms]
                        change: [-24.252% -23.870% -23.474%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

execute/rev_complement/v1                                                                             
                        time:   [5.5133 ms 5.5247 ms 5.5369 ms]
                        change: [-11.425% -11.086% -10.744%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking execute/regex_redux/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, enable flat sampling, or reduce sample count to 50.
execute/regex_redux/v0  time:   [1.8321 ms 1.8355 ms 1.8392 ms]                                    
                        change: [-11.425% -10.620% -9.9913%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

execute/regex_redux/v1  time:   [5.2250 ms 5.2395 ms 5.2563 ms]                                    
                        change: [-10.833% -10.416% -10.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking execute/count_until/v0: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
execute/count_until/v0  time:   [1.7288 ms 1.7322 ms 1.7359 ms]                                    
                        change: [-16.428% -16.072% -15.691%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

execute/count_until/v1  time:   [5.2632 ms 5.2786 ms 5.2973 ms]                                    
                        change: [-12.046% -11.713% -11.328%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

execute/factorial_recursive/v0                                                                             
                        time:   [23.622 us 23.680 us 23.745 us]
                        change: [-13.596% -13.063% -12.521%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

execute/factorial_recursive/v1                                                                             
                        time:   [3.3644 us 3.3738 us 3.3838 us]
                        change: [-2.9039% -2.3538% -1.7840%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

execute/factorial_optimized/v0                                                                             
                        time:   [22.001 us 22.059 us 22.118 us]
                        change: [-14.925% -14.352% -13.786%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

execute/factorial_optimized/v1                                                                             
                        time:   [2.1976 us 2.2023 us 2.2075 us]
                        change: [-26.559% -26.166% -25.777%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

execute/recursive_ok/v0 time:   [576.76 us 578.01 us 579.46 us]                                    
                        change: [-6.8714% -6.1659% -5.4949%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild
  4 (4.00%) high severe

execute/recursive_ok/v1 time:   [828.75 us 830.57 us 832.48 us]                                    
                        change: [-7.4671% -6.9124% -6.3596%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe

execute/recursive_trap/v0                                                                            
                        time:   [76.037 us 76.206 us 76.384 us]
                        change: [-10.528% -9.6716% -8.9099%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

execute/recursive_trap/v1                                                                            
                        time:   [76.506 us 76.787 us 77.122 us]
                        change: [-9.6152% -7.9603% -6.4676%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

execute/host_calls/v0   time:   [96.371 us 96.596 us 96.845 us]                                  
                        change: [-14.304% -13.825% -13.346%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

execute/host_calls/v1   time:   [117.97 us 118.26 us 118.60 us]                                  
                        change: [-3.5932% -2.9390% -2.3392%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

athei · 2022-01-28T12:29:46Z

I don't see a clear win in the first benchmarks you posted (the first box). Some regressed and some improved. I think using profile guided optimization would be a more systematic approach.

Robbepop · 2022-01-28T13:01:07Z

> I don't see a clear win in the first benchmarks you posted (the first box). Some regressed and some improved. I think using profile guided optimization would be a more systematic approach.

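Yeah, I will explain. In these benchmarks most v0 results show slight regressions, mostly in the range of 1-4% (recursive_ok regresses by about 7% and the factorial benchmarks by about 11%), whereas most v1 results improve by up to about 6% (host_calls/v1 being the exception at +3%).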
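To condense the lto = "fat", codegen-units = 1 results from the PR description into one number per version, one option is the geometric mean of the time ratios (1 + change) over all eleven benchmarks. The sketch below does this with the printed point estimates; using the geometric mean as the summary is an assumption, not something criterion reports itself.

```rust
/// Geometric mean of relative changes given in percent (e.g. 2.3480 for +2.3480%).
/// Each change becomes a time ratio (new / old); the ratios are averaged in
/// log space and the result is converted back to a percentage.
pub fn geomean_change_percent(changes_percent: &[f64]) -> f64 {
    assert!(!changes_percent.is_empty(), "need at least one benchmark");
    let mean_log = changes_percent
        .iter()
        .map(|c| (1.0 + c / 100.0).ln())
        .sum::<f64>()
        / changes_percent.len() as f64;
    (mean_log.exp() - 1.0) * 100.0
}

/// Point estimates of the "change:" lines for the lto = "fat" runs, in order:
/// compile_and_validate, instantiate, tiny_keccak, rev_complement, regex_redux,
/// count_until, factorial_recursive, factorial_optimized, recursive_ok,
/// recursive_trap, host_calls.
pub const V0_FAT_LTO: [f64; 11] = [
    -1.5869, 2.3480, 1.2087, 1.2161, 1.8556, 2.0140, 11.035, 10.795, 7.1026, 3.0868, 3.6202,
];
pub const V1_FAT_LTO: [f64; 11] = [
    -2.2644, -4.1457, -1.7034, -1.9878, -0.7429, -4.3642, -5.2811, -0.3179, -5.6249, -5.8939,
    3.0292,
];
```

With these inputs the summary comes out at roughly +3.8% for v0 and roughly -2.7% for v1.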

athei · 2022-01-28T13:05:48Z

We shouldn't merge it then, right? After all, we will still be using v0 for some time.

Robbepop · 2022-01-28T13:08:16Z

> We shouldn't merge it then, right? After all, we will still be using v0 for some time.

I currently do not plan to release another v0 version. As soon as the big tasks for wasmi_v1 are done, I will start working on the Substrate PR that uses wasmi_v1 for experimentation.
The big tasks that are required include:

- Using wasmparser for parsing and validation: PR
  - Note that this PR will also allow for streaming module compilation.
  - In the future we might also be able to implement parallel module compilation.
- Implementing wasmi bytecode fusion, since papers suggest 50-100% performance boosts from this optimization alone compared to our current stack-based bytecode.
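To make "bytecode fusion" a little more concrete, here is a toy sketch with a made-up instruction set; Instr, fuse and run are hypothetical and are not wasmi's bytecode or API. A peephole pass replaces one common four-instruction sequence with a single fused instruction that bypasses the operand stack, and a tiny interpreter checks that both versions compute the same result.

```rust
/// Toy stack-machine instructions (hypothetical, not wasmi's real bytecode).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instr {
    LocalGet(u32),
    I32Const(i32),
    I32Add,
    LocalSet(u32),
    /// Fused form: locals[dst] = locals[src] + imm, without touching the stack.
    I32AddImmToLocal { src: u32, imm: i32, dst: u32 },
}

/// Peephole pass: rewrites `local.get src; i32.const imm; i32.add; local.set dst`
/// into one fused instruction and copies every other instruction unchanged.
pub fn fuse(code: &[Instr]) -> Vec<Instr> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        if let [Instr::LocalGet(src), Instr::I32Const(imm), Instr::I32Add, Instr::LocalSet(dst), ..] =
            &code[i..]
        {
            out.push(Instr::I32AddImmToLocal { src: *src, imm: *imm, dst: *dst });
            i += 4;
        } else {
            out.push(code[i]);
            i += 1;
        }
    }
    out
}

/// Minimal interpreter, used here only to check that fusion preserves behavior.
pub fn run(code: &[Instr], locals: &mut [i32]) {
    let mut stack: Vec<i32> = Vec::new();
    for instr in code {
        match *instr {
            Instr::LocalGet(idx) => stack.push(locals[idx as usize]),
            Instr::I32Const(c) => stack.push(c),
            Instr::I32Add => {
                let rhs = stack.pop().expect("stack underflow");
                let lhs = stack.pop().expect("stack underflow");
                stack.push(lhs.wrapping_add(rhs));
            }
            Instr::LocalSet(idx) => locals[idx as usize] = stack.pop().expect("stack underflow"),
            Instr::I32AddImmToLocal { src, imm, dst } => {
                locals[dst as usize] = locals[src as usize].wrapping_add(imm);
            }
        }
    }
}
```

The saving in this toy comes from dispatching one instruction instead of four and from skipping three pushes and three pops; a real design would have to cover many such patterns.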

Robbepop · 2022-01-28T13:16:17Z

> We shouldn't merge it then, right? After all, we will still be using v0 for some time.

Note that I have only benchmarked with lto = "fat" and codegen-units = 1. So it could very well be that v0 sees some improvements under different profile settings. I would not be surprised by this at all, given that in the past v0 and v1 behaved so differently with respect to benchmarks and profiles.

athei · 2022-01-28T13:48:43Z

I don't care about other profiles, which are clearly inferior. We shouldn't merge inline annotations on a hunch when they even pessimise the profile and the version of the crate we are currently using (or will be using very soon).

Robbepop · 2022-01-28T13:51:29Z

I guess this PR then has to wait until wasmi_v1 is ready.

Robbepop · 2022-02-04T09:08:26Z

@athei, can I merge this, since we no longer really seem to be interested in merging any of the old wasmi v0 versions into Substrate?

athei · 2022-02-04T09:49:09Z

Yeah sure.

put #[inline] onto small functions in wasmi_core crate

54b6d42

Robbepop mentioned this pull request Jan 27, 2022

Mark a lot of functions inline #347

Closed

Robbepop added the "blocked" label (The issue or PR is currently blocked.) Jan 28, 2022

Merge branch 'master' into rf-inline-wasmi-core

2a1b20b

athei approved these changes Feb 4, 2022


Robbepop merged commit fbb556f into master Feb 4, 2022

athei deleted the rf-inline-wasmi-core branch February 4, 2022 10:00
