Small stack optimization #265

Merged: chfast merged 2 commits into master from stack_optimization_2 on Jun 4, 2020

Conversation

@chfast (Collaborator) commented Apr 9, 2020

Requires #247

This implements the "small vector" / "small string" optimization in OperandStack. When the max stack height is smaller than a static limit, we omit the dynamic allocation and use static storage within OperandStack.

The max stack height is usually very small. E.g. for LLVM-generated code it is usually 3, except for calls (as they naturally require the stack to hold all call arguments).
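
As a rough illustration of the approach, here is a minimal sketch (assumptions: member names follow the diff discussed below, but the constructor and push/pop logic are simplified for illustration and are not the exact fizzy implementation):

#include <cstddef>
#include <cstdint>
#include <memory>

class OperandStack
{
    /// The size of the pre-allocated internal storage: 128 bytes.
    static constexpr auto small_storage_size = 128 / sizeof(uint64_t);

    /// The pre-allocated internal storage, used when the max stack height fits.
    uint64_t m_small_storage[small_storage_size];

    /// The unbounded storage for items, allocated only when needed.
    std::unique_ptr<uint64_t[]> m_large_storage;

    uint64_t* m_bottom;  // start of the selected storage
    uint64_t* m_top;     // one past the last pushed item

public:
    explicit OperandStack(std::size_t max_stack_height)
    {
        if (max_stack_height > small_storage_size)
        {
            // Too big for the static buffer: fall back to heap allocation.
            m_large_storage = std::make_unique<uint64_t[]>(max_stack_height);
            m_bottom = m_large_storage.get();
        }
        else
            m_bottom = m_small_storage;  // no dynamic allocation
        m_top = m_bottom;
    }

    void push(uint64_t item) noexcept { *m_top++ = item; }
    uint64_t pop() noexcept { return *--m_top; }
    std::size_t size() const noexcept
    {
        return static_cast<std::size_t>(m_top - m_bottom);
    }
};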

fizzy/execute/blake2b/rounds_1_mean                      -0.1445         -0.1445        265267        226946        265240        226922
fizzy/execute/blake2b/rounds_16_mean                     -0.1447         -0.1447       3975578       3400162       3975531       3400087
fizzy/execute/ecpairing/multipoint_mean                  -0.1318         -0.1318    2871590597    2493111309    2871538432    2493063825
fizzy/execute/memset/256_bytes_mean                      -0.1257         -0.1258         24064         21039         24029         21006
fizzy/execute/memset/60000_bytes_mean                    -0.1379         -0.1379       5059587       4361997       5059539       4361958
fizzy/execute/mul256_opt0/input0_mean                    -0.0618         -0.0619         75348         70693         75314         70656
fizzy/execute/mul256_opt0/input1_mean                    -0.0676         -0.0672         75570         70461         75502         70425
fizzy/execute/sha1/rounds_1_mean                         -0.1733         -0.1733        276025        228194        275996        228169
fizzy/execute/sha1/rounds_16_mean                        -0.1726         -0.1726       3808532       3151106       3808477       3151040
fizzy/execute/micro/factorial/10_mean                    -0.1620         -0.1644          2887          2420          2893          2418
fizzy/execute/micro/factorial/20_mean                    -0.1897         -0.1921          4858          3936          4873          3937
fizzy/execute/micro/fibonacci/24_mean                    -0.1509         -0.1509      34522229      29313074      34521804      29312302
fizzy/execute/micro/spinner/1_mean                       -0.0545         -0.0571           882           834           881           830
fizzy/execute/micro/spinner/1000_mean                    -0.0711         -0.0712         39449         36644         39455         36646

@codecov-io commented Apr 9, 2020

Codecov Report

Merging #265 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #265   +/-   ##
=======================================
  Coverage   98.84%   98.85%           
=======================================
  Files          42       42           
  Lines       12133    12184   +51     
=======================================
+ Hits        11993    12044   +51     
  Misses        140      140           

@axic (Member) commented Jun 1, 2020

Rebased (and hopefully resolved the conflicts correctly).

@axic (Member) commented Jun 3, 2020

@chfast @gumb0 can we perhaps merge this? It showed good results for me. Span has regressions.

uint64_t m_small_storage[small_storage_size];

/// The unbounded storage for items.
std::unique_ptr<uint64_t[]> m_large_storage;

Collaborator:

could be std::array

Collaborator Author (chfast):

unique_ptr<std::array>?

Collaborator:

Oh, I put the comment on the wrong line, sorry; I meant it for the stack storage above.

Collaborator Author (chfast):

I guessed that later.

We only take the pointer to m_small_storage once, in the constructor, so I don't think it is even worth including the <array> header, as we are not using any std::array features.
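
For reference, a hypothetical sketch of what the std::array variant would have looked like (the struct name is illustrative; this was not adopted in the PR):

#include <array>
#include <cstdint>

struct OperandStackArrayVariant
{
    static constexpr auto small_storage_size = 128 / sizeof(uint64_t);

    // std::array instead of a plain C array for the fixed-size buffer.
    std::array<uint64_t, small_storage_size> m_small_storage;

    // The only use of std::array's interface would be taking the data pointer,
    // which a raw array provides directly via array-to-pointer decay.
    uint64_t* m_bottom = m_small_storage.data();
};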

@chfast (Collaborator Author) commented Jun 3, 2020

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0777         -0.0777            79            73            79            73
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.0744         -0.0744          1174          1087          1174          1087
fizzy/execute/ecpairing/onepoint_mean                             -0.1531         -0.1531        466278        394909        466281        394913
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.1019         -0.1019            92            83            92            83
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.1124         -0.1124          1341          1190          1341          1190
fizzy/execute/memset/256_bytes_mean                               -0.0922         -0.0925             7             6             7             6
fizzy/execute/memset/60000_bytes_mean                             -0.0918         -0.0918          1418          1288          1418          1288
fizzy/execute/mul256_opt0/input0_mean                             +0.0598         +0.0593            25            26            25            26
fizzy/execute/mul256_opt0/input1_mean                             +0.0653         +0.0647            25            26            25            26
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0716         -0.0717            85            79            85            79
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0616         -0.0616          1164          1092          1164          1092
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0147         -0.0148            76            75            76            75
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0150         -0.0150          1033          1018          1033          1018
fizzy/execute/micro/factorial/10_mean                             -0.1454         -0.1452             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             -0.1699         -0.1697             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.2199         -0.2199         11653          9091         11654          9091
fizzy/execute/micro/host_adler32/1_mean                           -0.0456         -0.0470             1             1             1             1
fizzy/execute/micro/host_adler32/100_mean                         -0.0030         -0.0030             6             6             6             6
fizzy/execute/micro/host_adler32/1000_mean                        +0.0067         +0.0067            58            58            58            58
fizzy/execute/micro/spinner/1_mean                                -0.0471         -0.0482             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0976         -0.0983            10             9            10             9

The mul benchmark is starting to get annoying...

@axic (Member) commented Jun 3, 2020

> The mul benchmark is starting to get annoying...

Though in absolute numbers it is a change from 25 to 26, I'd risk saying it is within the measurement error range. It does give quite a large benefit to the big benchmarks (ecpairing).

@chfast (Collaborator Author) commented Jun 3, 2020

> Though in absolute numbers it is a change from 25 to 26, I'd risk saying it is within the measurement error range. It does give quite a large benefit to the big benchmarks (ecpairing).

They are statistically significant (the p-value is 0, the lowest possible).

@chfast (Collaborator Author) commented Jun 3, 2020

I have run the mul cases in isolation and it looks like the execution time is essentially unchanged.
Here is another full round:

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0748         -0.0749            79            73            79            73
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.0860         -0.0860          1174          1073          1174          1073
fizzy/execute/ecpairing/onepoint_mean                             -0.1481         -0.1481        466278        397213        466281        397216
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.0982         -0.0982            92            83            92            83
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.1124         -0.1124          1341          1190          1341          1190
fizzy/execute/memset/256_bytes_mean                               -0.0881         -0.0883             7             6             7             6
fizzy/execute/memset/60000_bytes_mean                             -0.0909         -0.0909          1418          1289          1418          1289
fizzy/execute/mul256_opt0/input0_mean                             +0.0109         +0.0105            25            25            25            25
fizzy/execute/mul256_opt0/input1_mean                             +0.0139         +0.0134            25            25            25            25
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0721         -0.0721            85            79            85            79
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0619         -0.0619          1164          1092          1164          1092
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0065         -0.0066            76            76            76            76
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0015         -0.0015          1033          1032          1033          1032
fizzy/execute/micro/factorial/10_mean                             -0.1439         -0.1433             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             -0.1699         -0.1699             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.2157         -0.2157         11653          9140         11654          9140
fizzy/execute/micro/host_adler32/1_mean                           -0.0449         -0.0464             1             1             1             1
fizzy/execute/micro/host_adler32/100_mean                         -0.0045         -0.0045             6             6             6             6
fizzy/execute/micro/host_adler32/1000_mean                        +0.0014         +0.0014            58            58            58            58
fizzy/execute/micro/spinner/1_mean                                -0.0466         -0.0477             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0973         -0.0980            10             9            10             9

chfast marked this pull request as ready for review on June 3, 2020, 20:40
chfast requested a review from axic on June 3, 2020, 20:40

@axic (Member) commented Jun 3, 2020

> They are statistically significant (the p-value is 0, the lowest possible).

Where do you see the p-value?

@@ -58,32 +58,54 @@ class Stack : public std::vector<T>

class OperandStack
{
/// The size of the pre-allocated internal storage: 128 bytes.
static constexpr auto small_storage_size = 128 / sizeof(uint64_t);

Member:

Do we want to fine-tune this, e.g. benchmark it with 64, 256, 512? Or should we postpone that until we have a better set of inputs?

Collaborator Author (chfast):

I tuned it to the current benchmark set: the minimal value that fits all benchmarks (and, by coincidence, the unit tests) in the small storage.
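
For reference, the arithmetic behind the tuning question (assuming the usual 8-byte uint64_t; the alternative sizes are only the ones mentioned above):

#include <cstdint>

// 128 bytes / sizeof(uint64_t) = 128 / 8 = 16 stack items in the small storage.
static constexpr auto small_storage_size = 128 / sizeof(uint64_t);
// The alternatives would give:  64 bytes ->  8 items,
//                              256 bytes -> 32 items,
//                              512 bytes -> 64 items.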

std::unique_ptr<uint64_t[]> m_storage;
/// The bottom of the stack. Set in the constructor and never modified.
///
/// TODO: This pointer is rarely used and may be removed.

Member:

Is this comment still valid?

Collaborator Author (chfast):

Yes, I just added it. Some experimentation is needed for it, with probably no performance implications, but it is worth it just for sanity and a "simpler" design.
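
A hypothetical sketch of that experiment (not part of this PR; the class name and helper are illustrative, member names follow the diff above): instead of storing the bottom pointer, recompute it on demand from whichever storage is in use.

#include <cstdint>
#include <memory>

class OperandStackSketch
{
    static constexpr auto small_storage_size = 128 / sizeof(uint64_t);
    uint64_t m_small_storage[small_storage_size];
    std::unique_ptr<uint64_t[]> m_large_storage;

public:
    /// The bottom is just the start of whichever storage is in use,
    /// so a dedicated m_bottom member could potentially be dropped.
    uint64_t* bottom() noexcept
    {
        return m_large_storage ? m_large_storage.get() : m_small_storage;
    }
};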

@axic (Member) left a comment:

Looks good to me, but it would be nice to have answers to those two questions.

chfast merged commit 5db0431 into master on Jun 4, 2020
chfast deleted the stack_optimization_2 branch on June 4, 2020, 10:41