Small stack optimization #265

Merged: chfast merged 2 commits into master from stack_optimization_2 on Jun 4, 2020

Conversation

@chfast (Collaborator) commented Apr 9, 2020

Requires #247

This implements the "small vector" / "small string" optimization in OperandStack. When the max stack height is smaller than a static limit, we omit the dynamic allocation and use static storage within OperandStack.

The max stack height is usually very small. E.g. for LLVM-generated code it is usually 3, except for calls (as they naturally require the stack to hold all call arguments).
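
As a rough illustration of the approach, here is a minimal sketch (assumptions: member names follow the diff discussed below, but the constructor and push/pop logic are simplified for illustration and are not the exact fizzy implementation):

#include <cstddef>
#include <cstdint>
#include <memory>

class OperandStack
{
    /// The size of the pre-allocated internal storage: 128 bytes.
    static constexpr auto small_storage_size = 128 / sizeof(uint64_t);

    /// The pre-allocated internal storage, used when the max stack height fits.
    uint64_t m_small_storage[small_storage_size];

    /// The unbounded storage for items, allocated only when needed.
    std::unique_ptr<uint64_t[]> m_large_storage;

    uint64_t* m_bottom;  // start of the selected storage
    uint64_t* m_top;     // one past the last pushed item

public:
    explicit OperandStack(std::size_t max_stack_height)
    {
        if (max_stack_height > small_storage_size)
        {
            // Too big for the static buffer: fall back to heap allocation.
            m_large_storage = std::make_unique<uint64_t[]>(max_stack_height);
            m_bottom = m_large_storage.get();
        }
        else
            m_bottom = m_small_storage;  // no dynamic allocation
        m_top = m_bottom;
    }

    void push(uint64_t item) noexcept { *m_top++ = item; }
    uint64_t pop() noexcept { return *--m_top; }
    std::size_t size() const noexcept
    {
        return static_cast<std::size_t>(m_top - m_bottom);
    }
};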

fizzy/execute/blake2b/rounds_1_mean                      -0.1445         -0.1445        265267        226946        265240        226922
fizzy/execute/blake2b/rounds_16_mean                     -0.1447         -0.1447       3975578       3400162       3975531       3400087
fizzy/execute/ecpairing/multipoint_mean                  -0.1318         -0.1318    2871590597    2493111309    2871538432    2493063825
fizzy/execute/memset/256_bytes_mean                      -0.1257         -0.1258         24064         21039         24029         21006
fizzy/execute/memset/60000_bytes_mean                    -0.1379         -0.1379       5059587       4361997       5059539       4361958
fizzy/execute/mul256_opt0/input0_mean                    -0.0618         -0.0619         75348         70693         75314         70656
fizzy/execute/mul256_opt0/input1_mean                    -0.0676         -0.0672         75570         70461         75502         70425
fizzy/execute/sha1/rounds_1_mean                         -0.1733         -0.1733        276025        228194        275996        228169
fizzy/execute/sha1/rounds_16_mean                        -0.1726         -0.1726       3808532       3151106       3808477       3151040
fizzy/execute/micro/factorial/10_mean                    -0.1620         -0.1644          2887          2420          2893          2418
fizzy/execute/micro/factorial/20_mean                    -0.1897         -0.1921          4858          3936          4873          3937
fizzy/execute/micro/fibonacci/24_mean                    -0.1509         -0.1509      34522229      29313074      34521804      29312302
fizzy/execute/micro/spinner/1_mean                       -0.0545         -0.0571           882           834           881           830
fizzy/execute/micro/spinner/1000_mean                    -0.0711         -0.0712         39449         36644         39455         36646

@codecov-io commented Apr 9, 2020

Codecov Report

Merging #265 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #265   +/-   ##
=======================================
  Coverage   98.84%   98.85%           
=======================================
  Files          42       42           
  Lines       12133    12184   +51     
=======================================
+ Hits        11993    12044   +51     
  Misses        140      140           

@axic (Member) commented Jun 1, 2020

Rebased (and hopefully resolved the conflicts correctly).

@axic (Member) commented Jun 3, 2020

@chfast @gumb0 can we perhaps merge this? It showed good results for me. Span has regressions.

uint64_t m_small_storage[small_storage_size];

/// The unbounded storage for items.
std::unique_ptr<uint64_t[]> m_large_storage;

Collaborator:

could be std::array

Collaborator Author (chfast):

unique_ptr<std::array>?

Collaborator:

Oh, I put the comment on the wrong line, sorry; I meant it for the stack storage above.

Collaborator Author (chfast):

I guessed that later.

We only take the pointer to m_small_storage once, in the constructor, so I don't think it is even worth including the <array> header, as we are not using any std::array features.
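
For reference, a hypothetical sketch of what the std::array variant would have looked like (the struct name is illustrative; this was not adopted in the PR):

#include <array>
#include <cstdint>

struct OperandStackArrayVariant
{
    static constexpr auto small_storage_size = 128 / sizeof(uint64_t);

    // std::array instead of a plain C array for the fixed-size buffer.
    std::array<uint64_t, small_storage_size> m_small_storage;

    // The only use of std::array's interface would be taking the data pointer,
    // which a raw array provides directly via array-to-pointer decay.
    uint64_t* m_bottom = m_small_storage.data();
};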

@chfast (Collaborator Author) commented Jun 3, 2020

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0777         -0.0777            79            73            79            73
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.0744         -0.0744          1174          1087          1174          1087
fizzy/execute/ecpairing/onepoint_mean                             -0.1531         -0.1531        466278        394909        466281        394913
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.1019         -0.1019            92            83            92            83
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.1124         -0.1124          1341          1190          1341          1190
fizzy/execute/memset/256_bytes_mean                               -0.0922         -0.0925             7             6             7             6
fizzy/execute/memset/60000_bytes_mean                             -0.0918         -0.0918          1418          1288          1418          1288
fizzy/execute/mul256_opt0/input0_mean                             +0.0598         +0.0593            25            26            25            26
fizzy/execute/mul256_opt0/input1_mean                             +0.0653         +0.0647            25            26            25            26
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0716         -0.0717            85            79            85            79
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0616         -0.0616          1164          1092          1164          1092
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0147         -0.0148            76            75            76            75
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0150         -0.0150          1033          1018          1033          1018
fizzy/execute/micro/factorial/10_mean                             -0.1454         -0.1452             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             -0.1699         -0.1697             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.2199         -0.2199         11653          9091         11654          9091
fizzy/execute/micro/host_adler32/1_mean                           -0.0456         -0.0470             1             1             1             1
fizzy/execute/micro/host_adler32/100_mean                         -0.0030         -0.0030             6             6             6             6
fizzy/execute/micro/host_adler32/1000_mean                        +0.0067         +0.0067            58            58            58            58
fizzy/execute/micro/spinner/1_mean                                -0.0471         -0.0482             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0976         -0.0983            10             9            10             9

The mul benchmark is starting to get annoying...

@axic (Member) commented Jun 3, 2020

> The mul benchmark is starting to get annoying...

Though in absolute numbers it is a change from 25 to 26, I'd risk saying it is within the measurement error range. It does give quite a large benefit to the big benchmarks (ecpairing).

@chfast (Collaborator Author) commented Jun 3, 2020

> Though in absolute numbers it is a change from 25 to 26, I'd risk saying it is within the measurement error range. It does give quite a large benefit to the big benchmarks (ecpairing).

They are statistically significant (the p-value is 0, the lowest possible).

@chfast (Collaborator Author) commented Jun 3, 2020

I have run the mul cases in isolation and it looks like the execution time is essentially unchanged.
Here is another full round:

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0748         -0.0749            79            73            79            73
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.0860         -0.0860          1174          1073          1174          1073
fizzy/execute/ecpairing/onepoint_mean                             -0.1481         -0.1481        466278        397213        466281        397216
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.0982         -0.0982            92            83            92            83
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.1124         -0.1124          1341          1190          1341          1190
fizzy/execute/memset/256_bytes_mean                               -0.0881         -0.0883             7             6             7             6
fizzy/execute/memset/60000_bytes_mean                             -0.0909         -0.0909          1418          1289          1418          1289
fizzy/execute/mul256_opt0/input0_mean                             +0.0109         +0.0105            25            25            25            25
fizzy/execute/mul256_opt0/input1_mean                             +0.0139         +0.0134            25            25            25            25
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0721         -0.0721            85            79            85            79
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0619         -0.0619          1164          1092          1164          1092
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0065         -0.0066            76            76            76            76
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0015         -0.0015          1033          1032          1033          1032
fizzy/execute/micro/factorial/10_mean                             -0.1439         -0.1433             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             -0.1699         -0.1699             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.2157         -0.2157         11653          9140         11654          9140
fizzy/execute/micro/host_adler32/1_mean                           -0.0449         -0.0464             1             1             1             1
fizzy/execute/micro/host_adler32/100_mean                         -0.0045         -0.0045             6             6             6             6
fizzy/execute/micro/host_adler32/1000_mean                        +0.0014         +0.0014            58            58            58            58
fizzy/execute/micro/spinner/1_mean                                -0.0466         -0.0477             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0973         -0.0980            10             9            10             9

chfast marked this pull request as ready for review on June 3, 2020, 20:40
chfast requested a review from axic on June 3, 2020, 20:40

@axic (Member) commented Jun 3, 2020

> They are statistically significant (the p-value is 0, the lowest possible).

Where do you see the p-value?

@@ -58,32 +58,54 @@ class Stack : public std::vector<T>

class OperandStack
{
/// The size of the pre-allocated internal storage: 128 bytes.
static constexpr auto small_storage_size = 128 / sizeof(uint64_t);

Member:

Do we want to fine-tune this, e.g. benchmark it with 64, 256, 512? Or should we postpone that until we have a better set of inputs?

Collaborator Author (chfast):

I tuned it to the current benchmark set: the minimal value that fits all benchmarks (and, by coincidence, the unit tests) in the small storage.
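
For reference, the arithmetic behind the tuning question (assuming the usual 8-byte uint64_t; the alternative sizes are only the ones mentioned above):

#include <cstdint>

// 128 bytes / sizeof(uint64_t) = 128 / 8 = 16 stack items in the small storage.
static constexpr auto small_storage_size = 128 / sizeof(uint64_t);
// The alternatives would give:  64 bytes ->  8 items,
//                              256 bytes -> 32 items,
//                              512 bytes -> 64 items.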

std::unique_ptr<uint64_t[]> m_storage;
/// The bottom of the stack. Set in the constructor and never modified.
///
/// TODO: This pointer is rarely used and may be removed.

Member:

Is this comment still valid?

Collaborator Author (chfast):

Yes, I just added it. Some experimentation is needed for it, with probably no performance implications, but it is worth it just for sanity and a "simpler" design.
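
A hypothetical sketch of that experiment (not part of this PR; the class name and helper are illustrative, member names follow the diff above): instead of storing the bottom pointer, recompute it on demand from whichever storage is in use.

#include <cstdint>
#include <memory>

class OperandStackSketch
{
    static constexpr auto small_storage_size = 128 / sizeof(uint64_t);
    uint64_t m_small_storage[small_storage_size];
    std::unique_ptr<uint64_t[]> m_large_storage;

public:
    /// The bottom is just the start of whichever storage is in use,
    /// so a dedicated m_bottom member could potentially be dropped.
    uint64_t* bottom() noexcept
    {
        return m_large_storage ? m_large_storage.get() : m_small_storage;
    }
};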

@axic (Member) left a comment:

Looks good to me, but it would be nice to have answers to those two questions.

chfast merged commit 5db0431 into master on Jun 4, 2020
chfast deleted the stack_optimization_2 branch on June 4, 2020, 10:41