
Speed up finding jumpdests #80

Merged
merged 3 commits into from
Aug 3, 2019

Conversation

Member

@chfast chfast commented Jul 2, 2019

This replaces the linear search for jumpdests with binary search. It also applies a data-driven approach where the jumpdest "map" is not a vector of pairs but two vectors of offsets and targets. We also shrank the element size from int to int16_t.

There is a small trade-off here. The analysis takes longer because it requires 2x more vector resizes. And for contracts with a small number of jumpdests (like blake2b_huff, which has only 3 of them) the increase in analysis time might hide the gain in execution. But I still believe it's worth the 33% speed increase in blake2b_shifts, which has a lot of jumpdests.
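A minimal sketch of the described layout (hypothetical names, not the actual evmone code): the analysis produces two parallel `int16_t` vectors, and the jump lookup binary-searches the offsets vector only.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the data-driven jumpdest "map": two parallel
// vectors of int16_t instead of a single vector of pairs.
struct JumpdestMap
{
    std::vector<int16_t> offsets;  // sorted code offsets of JUMPDEST opcodes
    std::vector<int16_t> targets;  // corresponding jump targets
};

// Binary search traverses only the offsets vector; targets is touched on a hit.
int find_jumpdest(const JumpdestMap& map, int offset) noexcept
{
    const auto it = std::lower_bound(map.offsets.begin(), map.offsets.end(), offset);
    if (it != map.offsets.end() && *it == offset)
        return map.targets[it - map.offsets.begin()];
    return -1;  // invalid jump destination
}
```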

Comparing bin/evmone-bench-master to bin/evmone-bench
Benchmark                            Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------
sha1_shifts/analysis              +0.0501         +0.0501             5             5             5             5
sha1_shifts/empty                 -0.0379         -0.0379            50            48            50            48
sha1_shifts/1351                  -0.0488         -0.0488           938           893           938           893
sha1_shifts/2737                  -0.0496         -0.0497          1825          1734          1825          1734
sha1_shifts/5311                  -0.0486         -0.0486          3554          3381          3554          3381
sha1_shifts/65536                 -0.0494         -0.0495         43275         41135         43274         41132
stop/analysis                     +0.0131         +0.0131             0             0             0             0
stop                              +0.0007         +0.0007             1             1             1             1
blake2b_huff/analysis             +0.0354         +0.0355            52            54            52            54
blake2b_huff/empty                +0.0291         +0.0291            73            75            73            75
blake2b_huff/abc                  +0.0307         +0.0307            72            75            72            75
blake2b_huff/2805nulls            -0.0033         -0.0033           485           483           485           483
blake2b_huff/2805aa               +0.0025         +0.0025           482           483           482           483
blake2b_huff/5610nulls            -0.0065         -0.0065           896           890           896           890
blake2b_huff/8415nulls            -0.0075         -0.0075          1285          1276          1285          1276
blake2b_huff/65536nulls           -0.0094         -0.0094          9609          9519          9609          9519
sha1_divs/analysis                +0.0529         +0.0529             5             5             5             5
sha1_divs/empty                   -0.0164         -0.0164            98            96            98            96
sha1_divs/1351                    -0.0209         -0.0209          1913          1873          1913          1873
sha1_divs/2737                    -0.0207         -0.0207          3728          3651          3728          3651
sha1_divs/5311                    -0.0211         -0.0211          7272          7118          7272          7118
sha1_divs/65536                   -0.0196         -0.0196         88532         86797         88531         86798
weierstrudel/analysis             +0.0591         +0.0591            65            69            65            69
weierstrudel/0                    +0.0005         +0.0005           337           337           337           337
weierstrudel/1                    -0.0192         -0.0192           660           648           660           648
weierstrudel/2                    -0.0192         -0.0192           824           809           824           809
weierstrudel/3                    -0.0194         -0.0194           989           970           989           970
weierstrudel/8                    -0.0231         -0.0231          1806          1764          1806          1764
weierstrudel/9                    -0.0227         -0.0227          1971          1926          1971          1926
weierstrudel/14                   -0.0240         -0.0240          2790          2723          2790          2723
blake2b_shifts/analysis           +0.1496         +0.1496            26            30            26            30
blake2b_shifts/empty              +0.0000         +0.0000             0             0             0             0
blake2b_shifts/2805nulls          -0.3398         -0.3398          6568          4336          6568          4336
blake2b_shifts/5610nulls          -0.3313         -0.3313         12930          8646         12929          8646
blake2b_shifts/8415nulls          -0.3276         -0.3276         19254         12947         19254         12947
blake2b_shifts/65536nulls         -0.2887         -0.2887        143245        101887        143243        101885

@chfast chfast force-pushed the jump branch 2 times, most recently from a199687 to 18f421e Compare July 26, 2019 10:38

codecov-io commented Jul 26, 2019

Codecov Report

Merging #80 into master will increase coverage by 4.38%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   83.27%   87.66%   +4.38%     
==========================================
  Files          20       20              
  Lines        1985     1986       +1     
  Branches      218      216       -2     
==========================================
+ Hits         1653     1741      +88     
+ Misses        307      220      -87     
  Partials       25       25

@chfast chfast force-pushed the jump branch 3 times, most recently from 110f662 to 9ca8739 Compare July 26, 2019 11:03
@chfast chfast requested a review from gumb0 July 26, 2019 11:09
Member

gumb0 commented Jul 26, 2019

How does having two vectors instead of a vector of pairs help? Shorter data better fits into the cache line?

Member

gumb0 commented Jul 26, 2019

Does this beat unordered_map?

Review comment on test/utils/dump.cpp (outdated, resolved).
Member Author

chfast commented Jul 26, 2019

How does having two vectors instead of a vector of pairs help? Shorter data better fits into the cache line?

Only the first vector is traversed. In this PR we lowered the amount of memory the search touches by 4x. In the worst case (2 * 0x6000 bytes) this is still 1.5x the size of the L1 cache. Also, the last checks in the binary search are close to one another, so we are likely to hit the same cache line (64 bytes = 32 items).
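The cache arithmetic can be checked mechanically (assuming a typical 32 KiB L1 data cache and 64-byte cache lines; 0x6000 is the EVM code size limit, which bounds the number of jumpdests):

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t max_jumpdests = 0x6000;  // bounded by the EVM code size limit
constexpr std::size_t offsets_bytes = max_jumpdests * sizeof(int16_t);
constexpr std::size_t l1_size = 32 * 1024;     // typical L1 data cache
constexpr std::size_t cache_line = 64;

static_assert(offsets_bytes == 48 * 1024, "worst-case offsets vector is 48 KiB");
static_assert(2 * offsets_bytes == 3 * l1_size, "i.e. 1.5x the L1 cache size");
static_assert(cache_line / sizeof(int16_t) == 32, "32 int16_t items per cache line");
```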

Anyway, different variants were benchmarked in https://github.com/ethereum/evmone/tree/internal_benchmarks (to be merged independently).

One "easy" missing optimization is to pack both vectors into a single memory allocation and not to over-allocate. I was thinking about first using a stack buffer of 0x6000 * 2 * 2 bytes and then copying it to a single heap-allocated memory buffer.
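A sketch of that idea (hypothetical, not this PR's code): after analysis has collected the entries, make one exact-sized heap allocation holding both arrays back to back.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical packed layout: one allocation, targets stored right after offsets.
struct PackedJumpdestMap
{
    std::unique_ptr<int16_t[]> storage;
    std::size_t count = 0;

    const int16_t* offsets() const noexcept { return storage.get(); }
    const int16_t* targets() const noexcept { return storage.get() + count; }
};

// Copy the two arrays (e.g. from a stack buffer filled during analysis)
// into a single exact-sized heap block, avoiding over-allocation.
PackedJumpdestMap pack(const int16_t* offsets, const int16_t* targets, std::size_t count)
{
    PackedJumpdestMap map;
    map.count = count;
    map.storage = std::make_unique<int16_t[]>(2 * count);
    std::memcpy(map.storage.get(), offsets, count * sizeof(int16_t));
    std::memcpy(map.storage.get() + count, targets, count * sizeof(int16_t));
    return map;
}
```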

There is also the possibility of using SIMD to compare 16 or 32 items at a time, or even using "k-ary search": https://event.cwi.nl/damon2009/DaMoN09-KarySearch.pdf

Does this beat unordered_map?

I will have to check. I hope it does, because unordered_map is not ideal: there is creation overhead (many allocations, not friendly if we'd like to cache the analysis results), and the default hash function is the identity, so one can easily create a contract that makes the search linear. It would also ignore the fact that the data is already sorted.
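The identity-hash point is easy to demonstrate on common standard libraries (libstdc++ and libc++; this is an implementation detail, not mandated by the standard). The bucket index is hash % bucket_count, so choosing keys that are multiples of the bucket count piles everything into one bucket:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// On libstdc++/libc++, std::hash for integral types is the identity.
// An attacker controlling the keys (here: jumpdest offsets) can make
// them all collide in one bucket, degrading lookups to a linear scan.
std::size_t adversarial_bucket_size(int num_keys)
{
    std::unordered_map<int, int> m;
    m.reserve(64);  // avoid rehashing while we insert
    const auto n = m.bucket_count();
    for (int i = 0; i < num_keys; ++i)
        m[static_cast<int>(i * n)] = i;  // identity hash % n == 0 for all keys
    return m.bucket_size(0);
}
```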

Member Author

chfast commented Aug 2, 2019

Using unordered_map makes it ~4x faster. Benchmark added in cea415e. I will have to rerun it later because I'm currently testing on a laptop running on battery.

I think we will have to revisit using a hash map here later on. I've seen some interesting recent work on the subject, including hash maps that use contiguous memory.

@chfast chfast force-pushed the jump branch 2 times, most recently from a058908 to d3cfba8 Compare August 3, 2019 15:48
@chfast chfast merged commit 597e658 into master Aug 3, 2019
@chfast chfast deleted the jump branch August 3, 2019 16:11