Single arity immediate per br_table instruction and make arity immediate 4-byte wide #354

Merged
merged 3 commits into from
Jul 29, 2020

gumb0
Collaborator

@gumb0 gumb0 commented May 27, 2020

Get rid of duplicated arity in br_table immediates.
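
A minimal sketch of the idea, assuming a hypothetical immediates layout in which every br_table target is a 4-byte label index and the arity is a 1-byte value. The helper names (`push`, `encode_duplicated`, `encode_single`) and the field order are illustrative assumptions, not taken from the fizzy source.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Appends the raw bytes of a trivially copyable value to the immediates buffer.
template <typename T>
void push(std::vector<uint8_t>& immediates, T value)
{
    uint8_t bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    immediates.insert(immediates.end(), bytes, bytes + sizeof(T));
}

// Hypothetical "before" layout: the same arity is repeated next to every target.
// `targets` holds all label indices, with the default target last.
std::vector<uint8_t> encode_duplicated(const std::vector<uint32_t>& targets, uint8_t arity)
{
    assert(!targets.empty());  // br_table always has at least the default target.
    std::vector<uint8_t> immediates;
    push(immediates, static_cast<uint32_t>(targets.size() - 1));  // table size without default
    for (const auto target : targets)
    {
        push(immediates, target);
        push(immediates, arity);
    }
    return immediates;
}

// Hypothetical "after" layout: the arity is stored once, right after the table size.
std::vector<uint8_t> encode_single(const std::vector<uint32_t>& targets, uint8_t arity)
{
    assert(!targets.empty());
    std::vector<uint8_t> immediates;
    push(immediates, static_cast<uint32_t>(targets.size() - 1));
    push(immediates, arity);
    for (const auto target : targets)
        push(immediates, target);
    return immediates;
}
```

With three targets, the first layout takes 4 + 3 * 5 = 19 bytes and the second 4 + 1 + 3 * 4 = 17 bytes under these assumptions.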

@gumb0 gumb0 marked this pull request as ready for review May 29, 2020 11:02
@gumb0 gumb0 requested review from chfast and axic May 29, 2020 11:15
@codecov

codecov bot commented May 29, 2020

Codecov Report

Merging #354 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #354   +/-   ##
=======================================
  Coverage   99.55%   99.55%           
=======================================
  Files          49       49           
  Lines       14529    14535    +6     
=======================================
+ Hits        14464    14470    +6     
  Misses         65       65           

@axic
Member

axic commented May 29, 2020

This should be benchmarked.

@chfast
Collaborator

chfast commented Jun 1, 2020

There is a 5% regression on the mul256 benchmark:

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0061         -0.0061            73            72            73            72
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.0096         -0.0096          1091          1080          1091          1080
fizzy/execute/ecpairing/onepoint_mean                             +0.0052         +0.0052        424579        426772        424583        426774
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.0053         -0.0053            88            87            88            87
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.0085         -0.0085          1269          1258          1269          1258
fizzy/execute/memset/256_bytes_mean                               -0.0221         -0.0217             6             6             6             6
fizzy/execute/memset/60000_bytes_mean                             -0.0204         -0.0204          1291          1265          1291          1265
fizzy/execute/mul256_opt0/input0_mean                             +0.0407         +0.0408            24            25            24            25
fizzy/execute/mul256_opt0/input1_mean                             +0.0514         +0.0515            24            25            24            25
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0095         -0.0095            77            77            77            77
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0044         -0.0044          1065          1061          1065          1061
fizzy/execute/sha256/512_bytes_rounds_1_mean                      +0.0105         +0.0106            75            76            75            76
fizzy/execute/sha256/512_bytes_rounds_16_mean                     +0.0104         +0.0104          1014          1025          1014          1025
fizzy/execute/micro/factorial/10_mean                             +0.0002         -0.0001             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             +0.0047         +0.0047             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.0091         -0.0091         11269         11167         11270         11168
fizzy/execute/micro/host_adler32/1_mean                           -0.0058         -0.0056             1             1             1             1
fizzy/execute/micro/host_adler32/100_mean                         +0.0202         +0.0203             6             6             6             6
fizzy/execute/micro/host_adler32/1000_mean                        +0.0152         +0.0152            58            59            58            59
fizzy/execute/micro/spinner/1_mean                                -0.0024         -0.0024             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0484         -0.0483             8             8             8             8

Collaborator

@chfast chfast left a comment

This only works provided br_table is validated (i.e. all its targets have the same type). Otherwise, it can cause a stack underflow (regression tests attached).

(I have not investigated whether the situation was better before.)
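
The validation requirement can be sketched as a check that every target, including the default one, expects the same number of values. This is a simplified illustration that compares only arities rather than full label types; `br_table_arities_match` is a hypothetical name, not a fizzy function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true if all br_table targets (default target included) have the same arity.
// Only when this holds can a single arity immediate describe every possible branch.
bool br_table_arities_match(const std::vector<uint8_t>& target_arities)
{
    assert(!target_arities.empty());  // The default target is always present.
    const auto expected = target_arities.front();
    return std::all_of(target_arities.begin(), target_arities.end(),
        [expected](uint8_t arity) { return arity == expected; });
}
```

If the check fails and execution proceeds anyway, a branch to a target with a larger arity than the stored one would try to keep more values than the stack holds, which is one way the underflow mentioned above could arise.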

@gumb0 gumb0 force-pushed the br-table-imm-arity branch 2 times, most recently from 58baadd to cdcf9af Compare June 4, 2020 19:33
@gumb0
Collaborator Author

gumb0 commented Jun 4, 2020

@chfast's commit with tests was included in #368

@axic
Member

axic commented Jun 4, 2020

Basically this change gives no speed benefit, just penalty on the mul256 benchmark. Should we merge this?

@chfast
Collaborator

chfast commented Jun 4, 2020

Basically this change gives no speed benefit, just penalty on the mul256 benchmark. Should we merge this?

  1. I had a similar issue with the mul benchmark in another PR (false regression). I will recheck it.
  2. I doubt any benchmark cases use br_table. Is a C switch converted to br_table? Maybe add a micro benchmark case?

@axic
Member

axic commented Jun 4, 2020

I don't think LLVM uses br_table much, but we definitely don't have a jump table benchmark; it would be worth adding one.
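
A possible shape for such a micro benchmark source, as a sketch only: a dense switch over consecutive case values is the usual candidate for jump table lowering. Whether a given toolchain actually emits br_table for it is an assumption that would need to be verified by inspecting the generated wasm. The names `dispatch` and `run` are hypothetical.

```cpp
#include <cstdint>

// Dense switch over consecutive values 0..5 with a default case.
// Compilers may lower this to a jump table (br_table in wasm), but this is not guaranteed.
uint32_t dispatch(uint32_t opcode, uint32_t x)
{
    switch (opcode)
    {
    case 0: return x + 1;
    case 1: return x * 3;
    case 2: return x ^ 0x55u;
    case 3: return x >> 2;
    case 4: return x << 1;
    case 5: return ~x;
    default: return 0;
    }
}

// Drives the switch with a cycling selector so that every case, including default, is taken.
uint32_t run(uint32_t iterations)
{
    uint32_t acc = 1;
    for (uint32_t i = 0; i < iterations; ++i)
        acc = dispatch(i % 7, acc);
    return acc;
}
```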

@chfast
Collaborator

chfast commented Jun 5, 2020

I don't have good news. These are the execution benchmark results. I removed statistically insignificant results, except for eli_interpreter (as it is relevant for br_table). I also repeated the keccak and eli_interpreter cases separately, but the results were the same or worse.

The issue is with keccak, and it actually uses br_table.

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0006         -0.0006            72            72            72            72
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.0004         -0.0004          1078          1078          1078          1078
fizzy/execute/ecpairing/onepoint_mean                             +0.0066         +0.0066        396089        398720        396091        398724
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   +0.1008         +0.1008            83            91            83            91
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  +0.1205         +0.1205          1185          1328          1185          1328
fizzy/execute/memset/256_bytes_mean                               +0.0107         +0.0111             6             6             6             6
fizzy/execute/memset/60000_bytes_mean                             +0.0077         +0.0077          1304          1314          1304          1314
fizzy/execute/mul256_opt0/input0_mean                             -0.0125         -0.0123            25            25            25            25
fizzy/execute/mul256_opt0/input1_mean                             -0.0079         -0.0074            25            25            25            25
fizzy/execute/sha1/512_bytes_rounds_1_mean                        +0.0240         +0.0240            80            81            80            81
fizzy/execute/sha1/512_bytes_rounds_16_mean                       +0.0294         +0.0294          1097          1129          1097          1129
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0069         -0.0069            76            75            76            75
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0079         -0.0079          1032          1024          1032          1024
fizzy/execute/micro/eli_interpreter/halt_mean                     -0.0010         -0.0015             0             0             0             0
fizzy/execute/micro/eli_interpreter/exec105_mean                  +0.0006         +0.0006             4             4             4             4
fizzy/execute/micro/factorial/10_mean                             +0.0081         +0.0093             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             +0.0050         +0.0061             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             +0.0273         +0.0273          9069          9317          9069          9317
fizzy/execute/micro/host_adler32/1_mean                           -0.0004         +0.0010             1             1             1             1
fizzy/execute/micro/host_adler32/100_mean                         +0.0137         +0.0134             6             6             6             6
fizzy/execute/micro/host_adler32/1000_mean                        +0.0125         +0.0125            58            59            58            59
fizzy/execute/micro/spinner/1_mean                                -0.0022         -0.0007             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0500         -0.0498             9             8             9             8
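
For reading these tables: the first two columns appear to be the relative change of time and CPU time, followed by the old and new absolute values (the column header is visible in the later keccak-only comparison). A minimal sketch of that calculation, assuming the change is computed as (new - old) / old; `relative_change` is a hypothetical helper, not part of the benchmark tooling.

```cpp
// Relative change between two measurements; positive means the new value is slower.
// Assumes old_value is non-zero.
double relative_change(double old_value, double new_value)
{
    return (new_value - old_value) / old_value;
}
```

For keccak256/512_bytes_rounds_16 this gives roughly +0.12 for 1185 vs 1328, in line with the reported +0.1205 (the displayed times are rounded, so the last digits differ slightly).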

@chfast
Collaborator

chfast commented Jun 5, 2020

On a laptop (Skylake) there seems to be no regression.

no_turbo:

Comparing master-keccak to brtable-keccak
Benchmark                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------
fizzy/execute/keccak256/512_bytes_rounds_16_pvalue                 0.0211          0.0211      U Test, Repetitions: 10 vs 10
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  +0.0082         +0.0082          3310          3337          3309          3337
fizzy/execute/keccak256/512_bytes_rounds_16_median                +0.0085         +0.0086          3310          3338          3310          3338
fizzy/execute/keccak256/512_bytes_rounds_16_stddev                +1.3173         +1.3364            11            25            11            25

turbo (unstable results):

Comparing master-keccak to brtable-keccak-min
Benchmark                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------
fizzy/execute/keccak256/512_bytes_rounds_16_pvalue                 0.8501          0.8501      U Test, Repetitions: 10 vs 10
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.0021         -0.0020          1355          1352          1355          1352
fizzy/execute/keccak256/512_bytes_rounds_16_median                -0.0045         -0.0045          1354          1348          1354          1348
fizzy/execute/keccak256/512_bytes_rounds_16_stddev                +0.1226         +0.1224            22            24            22            24

Maybe the change in the immediates layout affects the timing of memory loads (e.g. some immediate accesses cross a cache line). I can also see branch prediction misses increase from 2% to 3% on Haswell (the first hardware setup).

I'm recommending creating a separate commit where only arity immediate value order is changed.

@gumb0
Collaborator Author

gumb0 commented Jun 5, 2020

I'm recommending creating a separate commit where only arity immediate value order is changed.

The first commit here does only reordering.

@chfast
Collaborator

chfast commented Jun 5, 2020

The reordering commit does not change anything.

The "Tweak performance 1" commit lowers the regression by ~2x.

fizzy/execute/keccak256/512_bytes_rounds_1_mean                   +0.0389         +0.0389            83            86            83            86
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  +0.0483         +0.0483          1188          1246          1188          1246

This PR should really not touch any instruction except br_table.

@chfast
Collaborator

chfast commented Jun 5, 2020

So I also dumped the alignments of the 4-byte reads in branch().
The first line is master; the second is the same for both the commit before "Tweak performance 1" and "Tweak performance 1" itself.

333333333333333333333333133331022312200000000000000000000000200223122000000000000000000000002002231220000000000000000000000033101000001222100131220000000000000000000000032102220

000000000000000000000000221123200232311111111111111111111111022002323111111111111111111111110220023231111111111111111111111112322111112330222312331111111111111111111111112213332
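
A sketch of how such a dump could be produced, assuming each digit is the address of a 4-byte read modulo 4 and that immediates are read through a memcpy-based helper that advances a byte pointer. `read_logged` is a hypothetical instrumented variant of the reader, not the actual fizzy code.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Reads a value of type T from a possibly unaligned position and advances the pointer.
// For 4-byte reads, appends the read address modulo 4 ('0'..'3') to alignment_log.
template <typename T>
T read_logged(const uint8_t*& pos, std::string& alignment_log)
{
    if (sizeof(T) == 4)
        alignment_log += static_cast<char>('0' + reinterpret_cast<uintptr_t>(pos) % 4);

    T value;
    std::memcpy(&value, pos, sizeof(T));  // memcpy keeps the unaligned read well-defined.
    pos += sizeof(T);
    return value;
}
```

Under these assumptions, a 1-byte read placed between two 4-byte reads shifts the alignment digit of the second one by one.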

@chfast
Collaborator

chfast commented Jun 5, 2020

The "Tweak performance 3" commit makes all reads aligned, but there is still a 3-4% performance regression.

@chfast
Collaborator

chfast commented Jun 8, 2020

I benchmarked every commit once again on the Keccak workload and compared the results with the alignment of the immediates, but alignment does not really explain the timings:

 0% 33333333333333333333333311^a_3331^2_311^3_0221^e_311^5_2200000000000000000000000201^3_0221^e_311^5_2200000000000000000000000201^3_0221^e_311^5_220000000000000000000000031^6_311^7_010000012221^8_101^b_0131^c_1220000000000000000000000031^d_2102220
 1% 00000000000000000000000021^a_0001^2_021^3_1331^e_021^5_3311111111111111111111111311^3_1331^e_021^5_3311111111111111111111111311^3_1331^e_021^5_331111111111111111111111101^6_021^7_121111123331^8_211^b_1201^c_2331111111111111111111111101^d_3213331
10% 00000000000000000000000021^a_2111^2_231^3_2001^e_231^5_2311111111111111111111111021^3_2001^e_231^5_2311111111111111111111111021^3_2001^e_231^5_231111111111111111111111111^6_231^7_221111123301^8_221^b_2311^c_2331111111111111111111111111^d_2213332
 5% 00000000000000000000000021^a_2111^2_231^3_2001^e_231^5_2311111111111111111111111021^3_2001^e_231^5_2311111111111111111111111021^3_2001^e_231^5_231111111111111111111111111^6_231^7_221111123301^8_221^b_2311^c_2331111111111111111111111111^d_2213332
 5% 00000000000000000000000021^a_1001^2_121^3_1331^e_121^5_1311111111111111111111111311^3_1331^e_121^5_1311111111111111111111111311^3_1331^e_121^5_131111111111111111111111101^6_121^7_121111123331^8_111^b_1201^c_1331111111111111111111111101^d_1213331
 4% 00000000000000000000000000^a_0000^2_000^3_0000^e_000^5_0000000000000000000000000000^3_0000^e_000^5_0000000000000000000000000000^3_0000^e_000^5_000000000000000000000000000^6_000^7_000000000000^8_000^b_0000^c_0000000000000000000000000000^d_0000000

The ^X_ is the br_table index.

@chfast
Collaborator

chfast commented Jun 8, 2020

In #376 I checked the impact of only changing the arity immediate value from uint8_t to uint32_t.
There is a 13% performance regression on Keccak, while other cases stay similar.
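
For reference, a sketch of how the arity width changes the offsets of the fields that follow it, assuming a hypothetical branch immediate made of an arity followed by three 4-byte fields. `count_misaligned_u32` is an illustrative helper. As the measurements in this thread show, alignment alone does not explain the timings, so this only illustrates the layout difference.

```cpp
#include <cstddef>
#include <vector>

// Given field sizes in layout order (starting at an aligned offset 0),
// counts the 4-byte fields whose offset is not a multiple of 4.
std::size_t count_misaligned_u32(const std::vector<std::size_t>& field_sizes)
{
    std::size_t offset = 0;
    std::size_t misaligned = 0;
    for (const auto size : field_sizes)
    {
        if (size == 4 && offset % 4 != 0)
            ++misaligned;
        offset += size;
    }
    return misaligned;
}
```

With a 1-byte arity, the layout {1, 4, 4, 4} leaves all three following fields misaligned; with a 4-byte arity, {4, 4, 4, 4} has none.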

@chfast
Collaborator

chfast commented Jun 8, 2020

I removed the last commit because it was too invasive (uint8_t to uint32_t for the arity immediate value).

This can land in current form.

@chfast
Collaborator

chfast commented Jun 25, 2020

I checked the assembly before and after. It looks like the issue with GCC is that the arity argument occupies a register longer, causing an additional register spill, probably in the main loop. I think clang is not affected, as shown by other benchmarks.

Collaborator

@chfast chfast left a comment

This should be put on hold because of the performance issues found.

I have some other changes planned around branching, so it may make the decision easier to make later.

@axic axic added the refactoring Refactors a part of the codebase label Jul 16, 2020
@gumb0
Collaborator Author

gumb0 commented Jul 28, 2020

Rebased.

@axic
Member

axic commented Jul 28, 2020

@chfast can you run the benchmarks again? Perhaps numbers are different now after a month's worth of changes.

@@ -670,27 +669,31 @@ ExecutionResult execute(Instance& instance, FuncIdx func_idx, span<const uint64_
case Instr::br_if:
case Instr::return_:
{
const auto arity = read<uint8_t>(immediates);
Member

@axic axic Jul 28, 2020

Could also check if making this uint32_t makes any difference, as attempted in #376.

Collaborator

I need a commit for it.

Collaborator Author

Pushed variant with uint32_t arity.

break;
}
case Instr::br_table:
{
const auto br_table_size = read<uint32_t>(immediates);
const auto arity = read<uint8_t>(immediates);
Member

And here.

@chfast
Collaborator

chfast commented Jul 28, 2020

fizzy/execute/blake2b/512_bytes_rounds_1_mean                     +0.0092         +0.0092            84            84            84            84
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    +0.0047         +0.0047          1269          1275          1269          1275
fizzy/execute/ecpairing/onepoint_mean                             -0.0284         -0.0284        431244        419014        431248        419018
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   +0.0098         +0.0098            99           100            99           100
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  +0.0159         +0.0159          1457          1480          1457          1480
fizzy/execute/memset/256_bytes_mean                               +0.0283         +0.0283             7             7             7             7
fizzy/execute/memset/60000_bytes_mean                             +0.0287         +0.0287          1548          1592          1548          1592
fizzy/execute/mul256_opt0/input0_mean                             +0.0604         +0.0604            26            28            26            28
fizzy/execute/mul256_opt0/input1_mean                             +0.0680         +0.0680            26            28            26            28
fizzy/execute/sha1/512_bytes_rounds_1_mean                        +0.0170         +0.0170            91            92            91            92
fizzy/execute/sha1/512_bytes_rounds_16_mean                       +0.0188         +0.0188          1257          1281          1257          1281
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0237         -0.0237            97            95            97            95
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0268         -0.0268          1338          1302          1338          1302
fizzy/execute/micro/eli_interpreter/halt_mean                     -0.0027         -0.0027             0             0             0             0
fizzy/execute/micro/eli_interpreter/exec105_mean                  -0.0064         -0.0064             5             5             5             5
fizzy/execute/micro/factorial/10_mean                             -0.0232         -0.0232             0             0             0             0
fizzy/execute/micro/factorial/20_mean                             +0.0051         +0.0051             1             1             1             1
fizzy/execute/micro/fibonacci/24_mean                             -0.0065         -0.0065          7541          7492          7542          7492
fizzy/execute/micro/host_adler32/1_mean                           +0.0119         +0.0119             0             0             0             0
fizzy/execute/micro/host_adler32/100_mean                         -0.0063         -0.0063             3             3             3             3
fizzy/execute/micro/host_adler32/1000_mean                        -0.0005         -0.0005            29            29            29            29
fizzy/execute/micro/spinner/1_mean                                +0.0338         +0.0338             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             +0.0001         +0.0001            10            10            10            10

@chfast
Collaborator

chfast commented Jul 29, 2020

With arity immediate being 4 bytes, this is nice.

master vs PR
fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.0734         -0.0734            83            77            83            77
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.1052         -0.1052          1300          1163          1300          1163
fizzy/execute/ecpairing/onepoint_mean                             -0.0517         -0.0517        421433        399647        421436        399651
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.0434         -0.0434           100            95           100            95
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.0400         -0.0400          1451          1393          1451          1393
fizzy/execute/memset/256_bytes_mean                               -0.0929         -0.0929             7             6             7             6
fizzy/execute/memset/60000_bytes_mean                             -0.0977         -0.0977          1553          1401          1553          1401
fizzy/execute/mul256_opt0/input0_mean                             -0.0469         -0.0469            26            24            26            24
fizzy/execute/mul256_opt0/input1_mean                             -0.0459         -0.0459            26            24            26            24
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0705         -0.0705            90            84            90            84
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0695         -0.0695          1259          1171          1259          1171
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.0879         -0.0879            94            86            94            86
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.0886         -0.0886          1299          1184          1299          1184
fizzy/execute/micro/eli_interpreter/halt_mean                     -0.0742         -0.0742             0             0             0             0
fizzy/execute/micro/eli_interpreter/exec105_mean                  -0.1594         -0.1594             5             4             5             4
fizzy/execute/micro/factorial/10_mean                             -0.0479         -0.0479             0             0             0             0
fizzy/execute/micro/factorial/20_mean                             -0.0466         -0.0466             1             1             1             1
fizzy/execute/micro/fibonacci/24_mean                             -0.0090         -0.0090          7494          7427          7494          7427
fizzy/execute/micro/host_adler32/1_mean                           -0.0272         -0.0272             0             0             0             0
fizzy/execute/micro/host_adler32/100_mean                         +0.0278         +0.0278             3             3             3             3
fizzy/execute/micro/host_adler32/1000_mean                        +0.0304         +0.0304            30            30            30            30
fizzy/execute/micro/spinner/1_mean                                +0.0305         +0.0305             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.1223         -0.1223            10             9            10             9

Collaborator

@chfast chfast left a comment

Not ideal, because it mixes two different optimization changes. But it is good enough to merge.

}

TEST(parser_expr, instr_br_table)
TEST(parser_expr, DISABLED_instr_br_table)
Collaborator Author

This test is annoying to update, should I just delete it?

Collaborator Author

Updated it anyway.

@gumb0
Collaborator Author

gumb0 commented Jul 29, 2020

Changed get_branch_arity helper in parser_expr.cpp to return uint32_t instead of casting on the caller's side.
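
Roughly what this kind of change looks like, as a sketch: the helper returns the immediate's type directly so callers no longer cast. The `LabelKind` and `ControlFrame` types below are hypothetical stand-ins, and the body assumes Wasm 1.0 rules (a branch to a loop carries no values, a branch to a block carries at most one); the real helper in parser_expr.cpp may differ.

```cpp
#include <cstdint>

// Hypothetical, simplified control frame description.
enum class LabelKind : uint8_t
{
    block,
    loop
};

struct ControlFrame
{
    LabelKind kind;
    bool has_result;
};

// Returns the arity as uint32_t, matching the 4-byte arity immediate,
// so call sites can store it without a static_cast.
uint32_t get_branch_arity(const ControlFrame& frame) noexcept
{
    if (frame.kind == LabelKind::loop)
        return 0;  // Branching to a loop jumps to its start and carries no values.
    return frame.has_result ? 1u : 0u;
}
```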

@axic axic merged commit 17e499f into master Jul 29, 2020
@axic axic deleted the br-table-imm-arity branch July 29, 2020 15:54
@gumb0 gumb0 changed the title Single arity immediate per br_table instruction Single arity immediate per br_table instruction and make arity immediate 4-byte wide Jul 29, 2020
@gumb0 gumb0 mentioned this pull request Jul 29, 2020