
Results from benchmarking pairing libraries #80

Open
jon-chuang opened this issue Mar 31, 2020 · 29 comments
Labels
T-performance Type: performance improvements

Comments

@jon-chuang
Contributor

Hi all, I would like to share some benchmarks I ran on zexe, gnark/goff, and zkcrypto's BLS12-381. I believe this is a good starting point for identifying the key avenues for optimisation.

From these results, it seems that the major areas for optimisation are:

  • Fq operations, including add, sub, mul, inverse, negate, and conversion to and from Montgomery form. Analogous results hold for the field extensions, but we should re-check those after optimising the base field. From pure optimisations in Fq ops, one can expect an overall improvement in other parts of the library of approximately 120%.
  • G1 ops: this can be partly explained by the Fq performance. However, scalar mul seems significantly slower; we should check what differences exist between gnark and zexe.

More benchmarking should be done once Fq has been sufficiently optimised.

Yet to benchmark:
zk-swap-libff/libff
barretenberg

Results (screenshots of the raw numbers omitted):
BLS12-381
Fq: goff, Zexe
G1: gnark, Zexe, zkcrypto
Pairing: Zexe, zkcrypto, gnark

@Pratyush
Member

Thanks for the comprehensive comparison! I agree that there seem to be some nice avenues for optimization here! Let's collect here the things that goff is doing to optimize arithmetic.

@kobigurk
Contributor

Nice! Interesting. It seems that for G1 additions, gnark < zexe < zkcrypto, while gnark is slowest in the full pairing.

@jon-chuang
Contributor Author

jon-chuang commented Apr 1, 2020

Dear all, after some more exploration, here are my findings:

  • First of all, I tried to keep extraneous applications to a minimum this time, which explains my overall lower timings.
  • The way that goff and other libraries benchmark differs from how Zexe benchmarks: Zexe makes one function call per b.iter loop. As I suspected, the function call overhead severely affects the timing of functions that run very quickly, i.e. on the order of nanoseconds. Hence, my suggestion is that for such functions, one should run a 1000-iteration for loop inside the b.iter closure instead.
  • Applying this to fields of size 384 and 256 bits, I got Fq mult timings of 46-51ns and 25-26ns respectively, and Fq squarings of 44-47ns and 23-24ns. This is on par with or faster than goff (I got new timings for goff: 384: 50-53 ns mult/48-49 ns square and 256: 25-26 ns mult/23-24 ns square). This is actually close to the Barretenberg 256-bit timings of 20/22ns, written in optimised assembly.
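For concreteness, the batched-timing idea can be sketched with plain std::time (a toy harness, not Zexe's actual bench code; `bench_batched` and the LCG stand-in workload are made up purely for illustration):

```rust
use std::time::Instant;

/// Toy harness: times `f` over `outer * batch` calls and returns ns per call.
/// Batching many calls per timed loop amortizes the per-call overhead that
/// otherwise dominates for operations running in a few nanoseconds.
fn bench_batched<F: FnMut()>(mut f: F, outer: u32, batch: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..outer {
        for _ in 0..batch {
            f();
        }
    }
    start.elapsed().as_nanos() as f64 / (outer as f64 * batch as f64)
}

fn main() {
    // Stand-in workload: a 64-bit LCG step instead of an Fq multiplication.
    let mut x: u64 = 1;
    let ns = bench_batched(
        || x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407),
        1_000,
        1_000,
    );
    println!("~{:.2} ns per op (final state {})", ns, x);
}
```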

I applied goff's suggested no-carry optimisation, and got a speedup of 3-5ns (down from 52-54ns); less than advertised, but still significant. The full pairing was about 6% faster with this optimisation.
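For context, the no-carry trick applies when the modulus leaves spare high bits in its top limb, so the intermediate sum in the Montgomery reduction cannot overflow a double-width word. A single-limb toy (using the prime 2^61 - 1; purely illustrative, since Zexe's fields use 4-6 limbs) shows the shape of the computation:

```rust
const P: u64 = (1 << 61) - 1; // 61-bit prime: the top bits of the limb are spare

/// -P^{-1} mod 2^64, via Newton iteration (valid for any odd P).
fn neg_inv() -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(P.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// One-limb Montgomery multiplication: returns a*b*R^{-1} mod P, with R = 2^64.
/// Because P < 2^63, the sum t + m*P always fits in a u128, so no carry
/// handling is needed -- the single-limb analogue of goff's no-carry condition.
fn mont_mul(a: u64, b: u64, pinv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(pinv);
    let u = ((t + (m as u128) * (P as u128)) >> 64) as u64; // exact: low 64 bits are 0
    if u >= P { u - P } else { u }
}

fn main() {
    let pinv = neg_inv();
    let p = P as u128;
    let r2 = ((u128::MAX % p + 1) % p) as u64; // R^2 mod P = 2^128 mod P
    let to_mont = |x: u64| mont_mul(x, r2, pinv);
    let (a, b) = (123_456_789u64, 987_654_321u64);
    // Multiply in Montgomery form, then convert back with a multiply by 1.
    let prod = mont_mul(mont_mul(to_mont(a), to_mont(b), pinv), 1, pinv);
    assert_eq!(prod as u128, (a as u128 * b as u128) % p);
    println!("{} * {} mod P = {}", a, b, prod);
}
```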

I rewrote the multiplication as a compact loop and used the unroll crate to make sure it was unrolled. Doing this did not seem to affect performance at all compared with writing the unrolled loops out explicitly. I suggest keeping this style, as it is much more readable and compact.

With regards to G1 timings, the gap was narrowed by about half, but I should check back once squaring is optimised too.

My suggestions:

  • Benchmark all functions running at < 500 ns by iterating thousand-fold
  • Macro-unroll all field muls, squarings and montgomery reductions for more compact and readable form
  • Implement no-carry optimisation for squaring

@Pratyush
Member

Pratyush commented Apr 1, 2020

Thanks for the investigation @jon-chuang ! If you could make a PR with the Goff changes and with the unroll! macro that would be awesome!

Re: benchmarking it’s a slightly more involved process, as there’s a lot of code that would have to be changed :/

@jon-chuang
Contributor Author

jon-chuang commented Apr 2, 2020

@Pratyush Actually, I don't think so. I will write some macros for benches for the algebra-benches crate. I will iterate 1000-fold over everything except G1/G2 muls, pairing-related ops, and sqrt.

Not all of the curves have the same extension fields; I will figure out how I want to tackle this. I will split up the macros by type.

By the way, is it normal that one needs +nightly for #[feature(test)]?

I believe more accurate benchmarking is mission critical and also makes the Zexe library more attractive.

@Pratyush
Member

Pratyush commented Apr 2, 2020

If you'd like to update the benchmarks anyway, you might want to use something like criterion.

This would also solve the issue with requiring nightly for benches.
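For reference, switching a bench target to criterion on stable looks roughly like this (a sketch based on criterion's documented setup; the bench name `fq` and file are hypothetical):

```toml
# Cargo.toml (sketch)
[dev-dependencies]
criterion = "0.3"

[[bench]]
name = "fq"          # hypothetical bench file at benches/fq.rs
harness = false      # disable libtest, so no nightly #[feature(test)] is needed
```

Run time per benchmark can then be tuned via `Criterion::default().sample_size(..)` and `.measurement_time(..)` in the bench's configuration.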

@jon-chuang
Contributor Author

Criterion is a lot slower and more verbose, however; we may not want it for minor benchmarks. It takes ages to run the full benchmark as it is.

That said, I could try to find a way to tweak the sampling time. Further, I should look into whether there is functionality similar to cargo benchcmp.

@jon-chuang
Contributor Author

jon-chuang commented Apr 2, 2020

I noticed a troubling phenomenon. G1 doubling seems to be slower after changing to no-carry mult (about 5%). This seems to affect G1 scalar mult as well. @Pratyush , can you replicate this?

G1 Benchmark
name                                          old-ec ns/iter  no_carry_ec ns/iter  diff ns/iter  diff %  speedup
 bls12_377::ec::g1::bench_g1_add_assign        1,178           1,158                         -20  -1.70%   x 1.02
 bls12_377::ec::g1::bench_g1_add_assign_mixed  820             819                            -1  -0.12%   x 1.00
 bls12_377::ec::g1::bench_g1_double            591             602                            11   1.86%   x 0.98
 bls12_377::ec::g1::bench_g1_mul_assign        295,849         297,099                     1,250   0.42%   x 1.00
 bls12_377::ec::g1::bench_g1_rand              294,117         294,953                       836   0.28%   x 1.00
 bls12_377::ec::g2::bench_g2_add_assign        5,121           5,018                        -103  -2.01%   x 1.02
 bls12_377::ec::g2::bench_g2_add_assign_mixed  3,546           3,837                         291   8.21%   x 0.92
 bls12_377::ec::g2::bench_g2_double            2,332           2,279                         -53  -2.27%   x 1.02
 bls12_377::ec::g2::bench_g2_mul_assign        1,237,827       1,196,713                 -41,114  -3.32%   x 1.03
 bls12_377::ec::g2::bench_g2_rand              1,193,239       1,180,192                 -13,047  -1.09%   x 1.01
 bls12_381::ec::g1::bench_g1_add_assign        1,172           1,170                          -2  -0.17%   x 1.00
 bls12_381::ec::g1::bench_g1_add_assign_mixed  817             859                            42   5.14%   x 0.95
 bls12_381::ec::g1::bench_g1_double            588             614                            26   4.42%   x 0.96
 bls12_381::ec::g1::bench_g1_mul_assign        297,426         305,844                     8,418   2.83%   x 0.97
 bls12_381::ec::g1::bench_g1_rand              299,963         303,203                     3,240   1.08%   x 0.99
 bls12_381::ec::g2::bench_g2_add_assign        4,229           4,122                        -107  -2.53%   x 1.03
 bls12_381::ec::g2::bench_g2_add_assign_mixed  2,902           3,158                         256   8.82%   x 0.92
 bls12_381::ec::g2::bench_g2_double            1,815           1,788                         -27  -1.49%   x 1.02
 bls12_381::ec::g2::bench_g2_mul_assign        980,179         963,952                   -16,227  -1.66%   x 1.02
 bls12_381::ec::g2::bench_g2_rand              1,003,275       951,161                   -52,114  -5.19%   x 1.05
 sw6::ec::g1::bench_g1_add_assign              4,416           4,403                         -13  -0.29%   x 1.00
 sw6::ec::g1::bench_g1_add_assign_mixed        3,228           3,094                        -134  -4.15%   x 1.04
 sw6::ec::g1::bench_g1_double                  2,765           2,927                         162   5.86%   x 0.94
 sw6::ec::g1::bench_g1_mul_assign              1,898,836       1,928,669                  29,833   1.57%   x 0.98
 sw6::ec::g1::bench_g1_rand                    1,916,670       1,931,870                  15,200   0.79%   x 0.99
 sw6::ec::g2::bench_g2_add_assign              35,544          33,616                     -1,928  -5.42%   x 1.06
 sw6::ec::g2::bench_g2_add_assign_mixed        23,615          22,878                       -737  -3.12%   x 1.03
 sw6::ec::g2::bench_g2_double                  19,324          19,530                        206   1.07%   x 0.99
 sw6::ec::g2::bench_g2_mul_assign              13,692,588      13,392,381               -300,207  -2.19%   x 1.02
 sw6::ec::g2::bench_g2_rand                    13,614,042      13,254,145               -359,897  -2.64%   x 1.03

I believe this would help us diagnose why G1 ops are much slower than in gnark. Could it be something to do with inlining or memory usage (e.g. cache-friendliness)? It seems odd that, even though our Fp ops are faster than or on par with goff's, and even though we both use the exact same formulas, the Go implementation would be faster than the Rust one. It's a bit of an enigma. I don't know if careful profiling will help.

@jon-chuang
Contributor Author

jon-chuang commented Apr 3, 2020

I did a final careful benchmarking of the recent PR arkworks-rs/snark#163. My first recommendation is to try adding RUSTFLAGS="--emit=asm", as this can improve speed significantly.

Here is a graphic summarising my results :) Regressions are largely noise from small functions.
(summary graphic omitted: screenshot from 2020-04-03)

Full Benchmark
 name                                                             old-asm ns/iter  new ns/iter  diff ns/iter   diff %  speedup 
 bls12_377::ec::g1::bench_g1_add_assign                           1,104,221        1,048,201         -56,020   -5.07%   x 1.05 
 bls12_377::ec::g1::bench_g1_add_assign_mixed                     769,072          749,362           -19,710   -2.56%   x 1.03 
 bls12_377::ec::g1::bench_g1_double                               546,359          535,695           -10,664   -1.95%   x 1.02 
 bls12_377::ec::g1::bench_g1_mul_assign                           280,718          269,918           -10,800   -3.85%   x 1.04 
 bls12_377::ec::g1::bench_g1_rand                                 283,331          267,739           -15,592   -5.50%   x 1.06 
 bls12_377::ec::g2::bench_g2_add_assign                           4,907            4,612                -295   -6.01%   x 1.06 
 bls12_377::ec::g2::bench_g2_add_assign_mixed                     3,416            3,232                -184   -5.39%   x 1.06 
 bls12_377::ec::g2::bench_g2_double                               2,270            2,158                -112   -4.93%   x 1.05 
 bls12_377::ec::g2::bench_g2_mul_assign                           1,165,130        1,102,020         -63,110   -5.42%   x 1.06 
 bls12_377::ec::g2::bench_g2_rand                                 1,130,329        1,080,859         -49,470   -4.38%   x 1.05 
 bls12_377::fq12::bench_fq12_add_assign                           132              133                     1    0.76%   x 0.99 
 bls12_377::fq12::bench_fq12_double                               130              126                    -4   -3.08%   x 1.03 
 bls12_377::fq12::bench_fq12_inverse                              26,231           25,742               -489   -1.86%   x 1.02 
 bls12_377::fq12::bench_fq12_mul_assign                           7,111            6,874                -237   -3.33%   x 1.03 
 bls12_377::fq12::bench_fq12_square                               4,764            4,723                 -41   -0.86%   x 1.01 
 bls12_377::fq12::bench_fq12_sub_assign                           125              119                    -6   -4.80%   x 1.05 
 bls12_377::fq2::bench_fq2_add_assign                             22               17                     -5  -22.73%   x 1.29 
 bls12_377::fq2::bench_fq2_double                                 18               17                     -1   -5.56%   x 1.06 
 bls12_377::fq2::bench_fq2_inverse                                14,282           13,930               -352   -2.46%   x 1.03 
 bls12_377::fq2::bench_fq2_mul_assign                             278              249                   -29  -10.43%   x 1.12 
 bls12_377::fq2::bench_fq2_sqrt                                   132,251          127,995            -4,256   -3.22%   x 1.03 
 bls12_377::fq2::bench_fq2_square                                 252              231                   -21   -8.33%   x 1.09 
 bls12_377::fq2::bench_fq2_sub_assign                             19               19                      0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_add_assign                               11               10                     -1   -9.09%   x 1.10 
 bls12_377::fq::bench_fq_double                                   9,326            9,108                -218   -2.34%   x 1.02 
 bls12_377::fq::bench_fq_from_repr                                64               59                     -5   -7.81%   x 1.08 
 bls12_377::fq::bench_fq_into_repr                                35               32                     -3   -8.57%   x 1.09 
 bls12_377::fq::bench_fq_inverse                                  14,236           13,708               -528   -3.71%   x 1.04 
 bls12_377::fq::bench_fq_mul_assign                               59,431           51,780             -7,651  -12.87%   x 1.15 
 bls12_377::fq::bench_fq_negate                                   15               13                     -2  -13.33%   x 1.15 
 bls12_377::fq::bench_fq_repr_add_nocarry                         8                8                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_repr_div2                                6                6                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_repr_mul2                                5                5                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_repr_num_bits                            4                4                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_repr_sub_noborrow                        9                9                       0    0.00%   x 1.00 
 bls12_377::fq::bench_fq_sqrt                                     81,512           77,501             -4,011   -4.92%   x 1.05 
 bls12_377::fq::bench_fq_square                                   46,611           43,704             -2,907   -6.24%   x 1.07 
 bls12_377::fq::bench_fq_sub_assign                               11               11                      0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_add_assign                               7                7                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_double                                   10,267           10,316                 49    0.48%   x 1.00 
 bls12_377::fr::bench_fr_from_repr                                35               31                     -4  -11.43%   x 1.13 
 bls12_377::fr::bench_fr_into_repr                                21               20                     -1   -4.76%   x 1.05 
 bls12_377::fr::bench_fr_inverse                                  7,623            7,516                -107   -1.40%   x 1.01 
 bls12_377::fr::bench_fr_mul_assign                               28,718           25,583             -3,135  -10.92%   x 1.12 
 bls12_377::fr::bench_fr_negate                                   10               10                      0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_add_nocarry                         6                6                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_div2                                4                4                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_mul2                                4                4                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_num_bits                            4                4                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_repr_sub_noborrow                        6                6                       0    0.00%   x 1.00 
 bls12_377::fr::bench_fr_sqrt                                     32,498           30,712             -1,786   -5.50%   x 1.06 
 bls12_377::fr::bench_fr_square                                   23,078           21,929             -1,149   -4.98%   x 1.05 
 bls12_377::fr::bench_fr_sub_assign                               8                8                       0    0.00%   x 1.00 
 bls12_377::pairing::pairing::bench_pairing_final_exponentiation  1,311,746        1,292,296         -19,450   -1.48%   x 1.02 
 bls12_377::pairing::pairing::bench_pairing_full                  2,270,597        2,235,430         -35,167   -1.55%   x 1.02 
 bls12_377::pairing::pairing::bench_pairing_miller_loop           638,257          631,072            -7,185   -1.13%   x 1.01 
 bls12_381::ec::g1::bench_g1_add_assign                           1,098,227        1,059,324         -38,903   -3.54%   x 1.04 
 bls12_381::ec::g1::bench_g1_add_assign_mixed                     778,427          758,426           -20,001   -2.57%   x 1.03 
 bls12_381::ec::g1::bench_g1_double                               552,411          539,424           -12,987   -2.35%   x 1.02 
 bls12_381::ec::g1::bench_g1_mul_assign                           287,407          274,129           -13,278   -4.62%   x 1.05 
 bls12_381::ec::g1::bench_g1_rand                                 286,258          274,444           -11,814   -4.13%   x 1.04 
 bls12_381::ec::g2::bench_g2_add_assign                           4,909            4,660                -249   -5.07%   x 1.05 
 bls12_381::ec::g2::bench_g2_add_assign_mixed                     3,401            3,267                -134   -3.94%   x 1.04 
 bls12_381::ec::g2::bench_g2_double                               2,262            2,175                 -87   -3.85%   x 1.04 
 bls12_381::ec::g2::bench_g2_mul_assign                           1,169,377        1,116,571         -52,806   -4.52%   x 1.05 
 bls12_381::ec::g2::bench_g2_rand                                 1,139,891        1,084,845         -55,046   -4.83%   x 1.05 
 bls12_381::fq12::bench_fq12_add_assign                           142              130                   -12   -8.45%   x 1.09 
 bls12_381::fq12::bench_fq12_double                               126              126                     0    0.00%   x 1.00 
 bls12_381::fq12::bench_fq12_inverse                              24,593           24,139               -454   -1.85%   x 1.02 
 bls12_381::fq12::bench_fq12_mul_assign                           6,445            6,199                -246   -3.82%   x 1.04 
 bls12_381::fq12::bench_fq12_square                               4,256            4,118                -138   -3.24%   x 1.03 
 bls12_381::fq12::bench_fq12_sub_assign                           122              119                    -3   -2.46%   x 1.03 
 bls12_381::fq2::bench_fq2_add_assign                             22               18                     -4  -18.18%   x 1.22 
 bls12_381::fq2::bench_fq2_double                                 17               17                      0    0.00%   x 1.00 
 bls12_381::fq2::bench_fq2_inverse                                14,223           14,351                128    0.90%   x 0.99 
 bls12_381::fq2::bench_fq2_mul_assign                             238              225                   -13   -5.46%   x 1.06 
 bls12_381::fq2::bench_fq2_sqrt                                   121,519          124,951             3,432    2.82%   x 0.97 
 bls12_381::fq2::bench_fq2_square                                 187              177                   -10   -5.35%   x 1.06 
 bls12_381::fq2::bench_fq2_sub_assign                             19               19                      0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_add_assign                               16               10                     -6  -37.50%   x 1.60 
 bls12_381::fq::bench_fq_double                                   9,660            9,240                -420   -4.35%   x 1.05 
 bls12_381::fq::bench_fq_from_repr                                63               59                     -4   -6.35%   x 1.07 
 bls12_381::fq::bench_fq_into_repr                                35               32                     -3   -8.57%   x 1.09 
 bls12_381::fq::bench_fq_inverse                                  13,813           14,949              1,136    8.22%   x 0.92 
 bls12_381::fq::bench_fq_mul_assign                               56,248           52,143             -4,105   -7.30%   x 1.08 
 bls12_381::fq::bench_fq_negate                                   13               14                      1    7.69%   x 0.93 
 bls12_381::fq::bench_fq_repr_add_nocarry                         8                8                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_div2                                6                6                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_mul2                                5                6                       1   20.00%   x 0.83 
 bls12_381::fq::bench_fq_repr_num_bits                            4                4                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_repr_sub_noborrow                        9                9                       0    0.00%   x 1.00 
 bls12_381::fq::bench_fq_sqrt                                     61,254           61,857                603    0.98%   x 0.99 
 bls12_381::fq::bench_fq_square                                   44,887           44,295               -592   -1.32%   x 1.01 
 bls12_381::fq::bench_fq_sub_assign                               11               11                      0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_add_assign                               7                7                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_double                                   10,176           10,284                108    1.06%   x 0.99 
 bls12_381::fr::bench_fr_from_repr                                34               31                     -3   -8.82%   x 1.10 
 bls12_381::fr::bench_fr_into_repr                                21               20                     -1   -4.76%   x 1.05 
 bls12_381::fr::bench_fr_inverse                                  7,485            7,600                 115    1.54%   x 0.98 
 bls12_381::fr::bench_fr_mul_assign                               28,764           27,166             -1,598   -5.56%   x 1.06 
 bls12_381::fr::bench_fr_negate                                   10               10                      0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_repr_add_nocarry                         6                6                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_repr_div2                                4                4                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_repr_mul2                                4                4                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_repr_num_bits                            4                4                       0    0.00%   x 1.00 
 bls12_381::fr::bench_fr_repr_sub_noborrow                        6                7                       1   16.67%   x 0.86 
 bls12_381::fr::bench_fr_sqrt                                     28,676           27,321             -1,355   -4.73%   x 1.05 
 bls12_381::fr::bench_fr_square                                   25,588           24,908               -680   -2.66%   x 1.03 
 bls12_381::fr::bench_fr_sub_assign                               8                8                       0    0.00%   x 1.00 
 bls12_381::pairing::pairing::bench_pairing_final_exponentiation  1,118,687        1,072,725         -45,962   -4.11%   x 1.04 
 bls12_381::pairing::pairing::bench_pairing_full                  1,958,878        1,871,067         -87,811   -4.48%   x 1.05 
 bls12_381::pairing::pairing::bench_pairing_miller_loop           572,674          546,691           -25,983   -4.54%   x 1.05 
 sw6::ec::g1::bench_g1_add_assign                                 4,553,369        4,185,546        -367,823   -8.08%   x 1.09 
 sw6::ec::g1::bench_g1_add_assign_mixed                           3,180,244        2,963,674        -216,570   -6.81%   x 1.07 
 sw6::ec::g1::bench_g1_double                                     2,797,084        2,624,621        -172,463   -6.17%   x 1.07 
 sw6::ec::g1::bench_g1_mul_assign                                 1,920,092        1,795,032        -125,060   -6.51%   x 1.07 
 sw6::ec::g1::bench_g1_rand                                       1,921,952        1,798,136        -123,816   -6.44%   x 1.07 
 sw6::ec::g2::bench_g2_add_assign                                 4,897            4,600                -297   -6.06%   x 1.06 
 sw6::ec::g2::bench_g2_add_assign_mixed                           3,411            3,206                -205   -6.01%   x 1.06 
 sw6::ec::g2::bench_g2_double                                     2,264            2,140                -124   -5.48%   x 1.06 
 sw6::ec::g2::bench_g2_mul_assign                                 1,164,381        1,098,494         -65,887   -5.66%   x 1.06 
 sw6::ec::g2::bench_g2_rand                                       1,136,139        1,076,092         -60,047   -5.29%   x 1.06 
 sw6::fq3::bench_fq3_add_assign                                   67               75                      8   11.94%   x 0.89 
 sw6::fq3::bench_fq3_double                                       52               50                     -2   -3.85%   x 1.04 
 sw6::fq3::bench_fq3_inverse                                      51,711           50,737               -974   -1.88%   x 1.02 
 sw6::fq3::bench_fq3_mul_assign                                   2,315            2,121                -194   -8.38%   x 1.09 
 sw6::fq3::bench_fq3_sqrt                                         3,562,638        3,321,726        -240,912   -6.76%   x 1.07 
 sw6::fq3::bench_fq3_square                                       1,788            1,696                 -92   -5.15%   x 1.05 
 sw6::fq3::bench_fq3_sub_assign                                   67               67                      0    0.00%   x 1.00 
 sw6::fq6::bench_fq6_add_assign                                   129              144                    15   11.63%   x 0.90 
 sw6::fq6::bench_fq6_double                                       97               96                     -1   -1.03%   x 1.01 
 sw6::fq6::bench_fq6_inverse                                      60,789           58,863             -1,926   -3.17%   x 1.03 
 sw6::fq6::bench_fq6_mul_assign                                   7,593            7,118                -475   -6.26%   x 1.07 
 sw6::fq6::bench_fq6_square                                       5,546            5,248                -298   -5.37%   x 1.06 
 sw6::fq6::bench_fq6_sub_assign                                   134              131                    -3   -2.24%   x 1.02 
 sw6::fq::bench_fq_add_assign                                     24               22                     -2   -8.33%   x 1.09 
 sw6::fq::bench_fq_double                                         16,264           16,932                668    4.11%   x 0.96 
 sw6::fq::bench_fq_from_repr                                      270              245                   -25   -9.26%   x 1.10 
 sw6::fq::bench_fq_into_repr                                      148              138                   -10   -6.76%   x 1.07 
 sw6::fq::bench_fq_inverse                                        49,467           50,218                751    1.52%   x 0.99 
 sw6::fq::bench_fq_mul_assign                                     262,063          242,531           -19,532   -7.45%   x 1.08 
 sw6::fq::bench_fq_negate                                         25               24                     -1   -4.00%   x 1.04 
 sw6::fq::bench_fq_repr_add_nocarry                               15               14                     -1   -6.67%   x 1.07 
 sw6::fq::bench_fq_repr_div2                                      15               15                      0    0.00%   x 1.00 
 sw6::fq::bench_fq_repr_mul2                                      9                9                       0    0.00%   x 1.00 
 sw6::fq::bench_fq_repr_num_bits                                  4                4                       0    0.00%   x 1.00 
 sw6::fq::bench_fq_repr_sub_noborrow                              17               17                      0    0.00%   x 1.00 
 sw6::fq::bench_fq_sqrt                                           565,773          526,690           -39,083   -6.91%   x 1.07 
 sw6::fq::bench_fq_square                                         220,551          208,952           -11,599   -5.26%   x 1.06 
 sw6::fq::bench_fq_sub_assign                                     20               19                     -1   -5.00%   x 1.05 
 sw6::fr::bench_fr_add_assign                                     10               10                      0    0.00%   x 1.00 
 sw6::fr::bench_fr_double                                         9,619            9,174                -445   -4.63%   x 1.05 
 sw6::fr::bench_fr_from_repr                                      62               58                     -4   -6.45%   x 1.07 
 sw6::fr::bench_fr_into_repr                                      36               32                     -4  -11.11%   x 1.12 
 sw6::fr::bench_fr_inverse                                        14,024           13,712               -312   -2.22%   x 1.02 
 sw6::fr::bench_fr_mul_assign                                     56,543           50,664             -5,879  -10.40%   x 1.12 
 sw6::fr::bench_fr_negate                                         15               13                     -2  -13.33%   x 1.15 
 sw6::fr::bench_fr_repr_add_nocarry                               8                8                       0    0.00%   x 1.00 
 sw6::fr::bench_fr_repr_div2                                      6                6                       0    0.00%   x 1.00 
 sw6::fr::bench_fr_repr_mul2                                      6                5                      -1  -16.67%   x 1.20 
 sw6::fr::bench_fr_repr_num_bits                                  4                4                       0    0.00%   x 1.00 
 sw6::fr::bench_fr_repr_sub_noborrow                              9                9                       0    0.00%   x 1.00 
 sw6::fr::bench_fr_sqrt                                           80,292           77,926             -2,366   -2.95%   x 1.03 
 sw6::fr::bench_fr_square                                         44,921           42,764             -2,157   -4.80%   x 1.05 
 sw6::fr::bench_fr_sub_assign                                     11               11                      0    0.00%   x 1.00 
 sw6::pairing::pairing::bench_pairing_final_exponentiation        9,430,418        8,881,217        -549,201   -5.82%   x 1.06 
 sw6::pairing::pairing::bench_pairing_full                        99,900,745       97,824,713     -2,076,032   -2.08%   x 1.02 
 sw6::pairing::pairing::bench_pairing_miller_loop                 89,274,175       88,591,942       -682,233   -0.76%   x 1.01

I also find that Fq mul_assign is about 7-15% faster, square_in_place is about 5% faster, into_repr is about 5-12% faster, and pairing computations are about 1.5-6% faster.

The timings I got relative to goff/gnark are essentially the same as in my intermediate report. G1 is the only area where we are still lagging behind.

New avenues of optimisation I would like to look into are G1 `add_assign_mixed`, which I believe is part of the bottleneck for G1 `mul_assign`. However, the majority of the gap seems to come from the Pippenger exponentiation used in gnark, where a bucket table is computed beforehand. So this is something I will work on for now, in parallel with my intention to add WebGPU support for the parallel form of multiexponentiation.

By the way, there are no benchmarks for MSM, but this is a very critical number for provers.
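For readers unfamiliar with the bucket method mentioned above, here is a minimal sketch of Pippenger-style MSM over a toy additive group (integers mod a small modulus stand in for curve points; the modulus `M`, the window size, and the function names are illustrative assumptions, not gnark's or zexe's actual code):

```rust
// Hypothetical sketch of Pippenger's bucket method for multi-scalar
// multiplication, over a toy additive group: integers mod M under addition.
const M: u64 = 1_000_003; // illustrative toy modulus, not a curve parameter

fn group_add(a: u64, b: u64) -> u64 {
    (a + b) % M
}

/// Computes sum_i scalars[i] * points[i] using c-bit windows and 2^c - 1 buckets.
fn pippenger_msm(scalars: &[u64], points: &[u64], c: usize) -> u64 {
    let num_windows = (64 + c - 1) / c;
    let mut window_sums = Vec::with_capacity(num_windows);
    for w in 0..num_windows {
        // Scatter each point into the bucket named by its scalar's w-th c-bit digit.
        let mut buckets = vec![0u64; (1 << c) - 1];
        for (&s, &p) in scalars.iter().zip(points) {
            let digit = ((s >> (w * c)) & ((1u64 << c) - 1)) as usize;
            if digit != 0 {
                buckets[digit - 1] = group_add(buckets[digit - 1], p);
            }
        }
        // Running-sum trick: sum_i (i+1)*buckets[i] with only ~2*2^c group additions.
        let (mut running, mut acc) = (0u64, 0u64);
        for &b in buckets.iter().rev() {
            running = group_add(running, b);
            acc = group_add(acc, running);
        }
        window_sums.push(acc);
    }
    // Horner combination across windows, from most to least significant.
    let mut result = 0u64;
    for &s in window_sums.iter().rev() {
        for _ in 0..c {
            result = group_add(result, result); // "doubling" in the toy group
        }
        result = group_add(result, s);
    }
    result
}
```

The precomputed-table variant gnark uses amortizes the bucket setup across calls; the sketch above shows only the single-shot structure.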

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

I think the issue with scalar multiplication should be gone now? From local benchmarks mixed_add, add, and double are all faster

Btw, the scalar multiplication benchmarks are noisy because the scalars are random and have varying Hamming weight, so that’s something to keep in mind
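To make the noise concrete, here is a tiny illustrative cost model (toy code, not curve code): in MSB-first double-and-add, the doubling count is fixed by the scalar's bit length, but the addition count tracks its Hamming weight, which differs from sample to sample.

```rust
// Returns (doublings, additions) performed by plain MSB-first double-and-add
// for a nonzero 64-bit scalar. Illustrative model only.
fn double_and_add_cost(scalar: u64) -> (u32, u32) {
    assert!(scalar != 0);
    let doublings = 63 - scalar.leading_zeros(); // bit length - 1
    let additions = scalar.count_ones() - 1;     // one add per set bit after the top one
    (doublings, additions)
}
```

Two random 64-bit scalars of the same bit length can thus differ by dozens of group additions, which shows up as run-to-run variance in the benchmarks.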

@jon-chuang
Copy link
Contributor Author

@Pratyush nonetheless, the scalar_mul timings are significantly slower (1.22x) than gnark. Further, the mixed_add timings are 1.06x slower. This is a finding I repeatedly arrive at.

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 3, 2020

I also find that inverse is significantly slower than goff (2.2x 13us vs goff’s 6us). Further, sqrt is also very slow (1.5x slower, 75/54us vs goff's ~25/54us). How performance critical are these @Pratyush?

I think I will run more detailed 1000-fold benchmarks to get to the bottom of which primitive operations (e.g. add/div2) are responsible. This could also be another benchmarking-methodology difference in terms of the randomness. Perhaps I should check whether I'm looping through random elements.

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

Hmm, inversion isn't strictly performance critical because the elliptic curve code does try to minimize it. For the sqrt code, it would be helpful to optimize it, as it is critical for point compression.

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

(That being said, having faster algorithms is always welcomed =D)

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

Also, thanks again for the thorough benchmarks!

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 3, 2020

I see. Is there any scenario in which batch inversion actually helps performance? Perhaps for something like Pippenger scalar mul/MSM?

I am thinking of affine coordinates + batch inversion.

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

Currently our MSM code takes in affine coordinates and uses mixed addition.

@jon-chuang
Copy link
Contributor Author

How about batch inversion in particular?

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

Hmm since batch_inversion performs O(N) multiplications but only one inversion, the cost of a single inversion is usually drowned out for large N. However, using batch inversion with a small N might make the additional cost of inversion significant again.
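For context, the trick being described is Montgomery's batch inversion: one field inversion plus about 3(N-1) multiplications replaces N inversions. A minimal sketch over a toy prime field (the modulus, helper names, and the Fermat-based single inversion are illustrative assumptions, not arkworks' actual `batch_inversion`):

```rust
// Toy prime field Z/P; illustrative, not a real curve's base field.
const P: u64 = 1_000_000_007;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

// Square-and-multiply exponentiation mod P.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

// Single inversion via Fermat's little theorem: a^(P-2) mod P.
fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

// Montgomery's trick: invert all elements (assumed nonzero) in place,
// paying for only ONE field inversion total.
fn batch_inverse(elems: &mut [u64]) {
    // prefix[i] = elems[0] * ... * elems[i-1]
    let mut prefix = Vec::with_capacity(elems.len());
    let mut acc = 1u64;
    for &e in elems.iter() {
        prefix.push(acc);
        acc = mul(acc, e);
    }
    // One inversion of the total product, then walk backwards
    // peeling off one element at a time.
    let mut inv_acc = inv(acc);
    for i in (0..elems.len()).rev() {
        let e = elems[i];
        elems[i] = mul(inv_acc, prefix[i]);
        inv_acc = mul(inv_acc, e);
    }
}
```

This is why the single inversion's cost is drowned out for large N, as noted above: its O(log p) multiplications are amortized across the whole batch.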

@jon-chuang
Copy link
Contributor Author

I see. But where is batch inversion used? You mentioned it is used for mixed addition? Or is that standard addition?

@mratsim
Copy link

mratsim commented Apr 3, 2020

I also find that inverse is significantly slower than goff (2.2x 13us vs goff’s 6us). Further, sqrt is also very slow (1.5x slower, 75/54us vs goff's ~25/54us). How performance critical are these @Pratyush?

Goff's inversion is not constant-time. If constant time is desired, it cannot be fairly compared with the usual inversion based on Fermat's little theorem.

Also, as mentioned by @Pratyush, for pairings in an Fp12 tower the setup passes inversion down the tower, so the ratio compared to an Fp operation stays acceptable, and we can use Jacobian/projective coordinates to avoid the inversion tax (at the price of more additions/multiplications).

More discussion on fast constant-time modular inversion here: randombit/botan#1479 (comment)

Also, AFAIK the fastest open-source pairing library is MCL, if your CPU is at least Broadwell and supports the MULX/ADOX/ADCX instructions. The gain is about 15% on my machine. See benchmarks at mratsim/constantine#18. (Note that in those benchmarks I was still using inversion via Fermat's little theorem exponentiation; my new constant-time inversion algorithm is about 30% faster, though 10x slower than goff's non-constant-time version or GMP.)

@jon-chuang
Copy link
Contributor Author

Since we are on par with your library, that would make us about 22-30 clock cycles behind MCL, roughly 25% overhead. That's remarkable.

@Pratyush, given this amazing result, I may find some time to read the MCL code and implement the ADX instructions in a maintainable way, since x86 intrinsics do not support them due to lack of support from LLVM.

@Pratyush
Copy link
Member

Pratyush commented Apr 3, 2020

Hmm I'm pretty sure we don't have a constant time inversion algorithm; we use the binary euclidean algorithm which isn't constant time: https://github.com/scipr-lab/zexe/blob/master/algebra-core/src/fields/macros.rs
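For reference, the binary extended Euclidean algorithm mentioned here looks roughly like the following sketch over a toy u64 field (the modulus and function name are illustrative; this is the algorithm family, not the actual macro-generated arkworks code). Its running time depends on the input's bit pattern, which is why it is not constant-time:

```rust
// Variable-time binary extended Euclidean inversion mod an odd prime P.
// Toy parameters for illustration only.
const P: u64 = 1_000_000_007;

fn mod_inverse(a: u64) -> Option<u64> {
    if a == 0 {
        return None;
    }
    // Invariants: b*a ≡ u (mod P) and c*a ≡ v (mod P).
    let (mut u, mut v) = (a, P);
    let (mut b, mut c) = (1u64, 0u64);
    while u != 1 && v != 1 {
        while u % 2 == 0 {
            u /= 2;
            // Halve b mod P: if b is odd, use (b + P) / 2 (P is odd, so b + P is even).
            b = if b % 2 == 0 { b / 2 } else { b / 2 + P / 2 + 1 };
        }
        while v % 2 == 0 {
            v /= 2;
            c = if c % 2 == 0 { c / 2 } else { c / 2 + P / 2 + 1 };
        }
        // Subtract the smaller of (u, v) from the larger, updating cofactors mod P.
        if u >= v {
            u -= v;
            b = if b >= c { b - c } else { b + P - c };
        } else {
            v -= u;
            c = if c >= b { c - b } else { c + P - b };
        }
    }
    Some(if u == 1 { b } else { c })
}
```

The data-dependent inner loops (how many times each value halves, which branch of the subtraction fires) are exactly what leaks timing information, in contrast to a fixed-schedule constant-time inversion.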

@Pratyush Pratyush transferred this issue from arkworks-rs/snark Nov 20, 2020
@Pratyush Pratyush added the T-performance Type: performance improvements label Nov 20, 2020
@dignifiedquire
Copy link

You can find Filecoin's new implementation using assembly-based optimizations in https://github.com/filecoin-project/blstrs, would be interesting to see how this compares.

@dignifiedquire
Copy link

dignifiedquire commented Nov 23, 2020

Numbers on my AMD Ryzen for blstrs:

Screenshot 2020-11-23 at 21 31 42

Comparison curve-benches bls12-381 from arkworks

Screenshot 2020-11-23 at 22 16 02

@ValarDragon
Copy link
Member

Wow blstrs' fp12 multiplications are significantly faster! (The fp multiplication times are the same in both, but the fp12 multiplication in blstrs is half the time)

One of the things blst does is take advantage of #121 within its quadratic extension arithmetic (see: https://github.com/supranational/blst/blob/master/src/fp12_tower.c#L309). We may need to actually return an unreduced form in #121 to make the additions feeding into it worthwhile, though.

@mratsim
Copy link

mratsim commented Feb 5, 2021

The latest cross-pairing libraries benchmarks were gathered by @yelhousni in https://hackmd.io/@zkteam/eccbench

I also noted all optimizations I am aware of for pairing libraries here: https://github.com/mratsim/constantine/blob/2c5e12d/docs/optimizations.md

@ValarDragon, BLST, MCL and Gurvy do indeed use unreduced multiplication on higher field during the towering.

The technique is described in the following paper by @herumi (MCL author)

  • High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves
    Jean-Luc Beuchat and Jorge Enrique González Díaz and Shigeo Mitsunari and Eiji Okamoto and Francisco Rodríguez-Henríquez and Tadanori Teruya, 2010
    https://eprint.iacr.org/2010/354

image
image

And is generalized to all extension fields by @dfaranha (Relic author)

image
image
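To illustrate the lazy-reduction idea from the papers above: keep the double-width products of an Fp2 multiplication unreduced, combine them, and reduce once per output coefficient. A toy u64 sketch with u² = -1 (the modulus and function names are illustrative, not BLS12-381 parameters or any library's actual API):

```rust
// Toy prime; real implementations use multi-limb Montgomery arithmetic.
const P: u64 = 1_000_000_007;

// Eager Fp2 mul: reduce after every base-field product.
// (a0 + a1*u)(b0 + b1*u) with u^2 = -1:
//   c0 = a0*b0 - a1*b1,  c1 = a0*b1 + a1*b0
fn fp2_mul_eager(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let m = |x: u64, y: u64| ((x as u128 * y as u128) % P as u128) as u64;
    let c0 = (m(a.0, b.0) + P - m(a.1, b.1)) % P;
    let c1 = (m(a.0, b.1) + m(a.1, b.0)) % P;
    (c0, c1)
}

// Lazy Fp2 mul: accumulate double-width products, reduce ONCE per coefficient.
fn fp2_mul_lazy(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (a0, a1, b0, b1) = (a.0 as u128, a.1 as u128, b.0 as u128, b.1 as u128);
    // Adding P^2 keeps the subtraction nonnegative without an early reduction.
    let p2 = (P as u128) * (P as u128);
    let c0 = (a0 * b0 + p2 - a1 * b1) % (P as u128);
    let c1 = (a0 * b1 + a1 * b0) % (P as u128);
    (c0 as u64, c1 as u64)
}
```

In multi-limb Montgomery form the saving is larger than this toy suggests, because each skipped reduction is a full Montgomery reduction pass, which is why blst's Fp12 multiplication can run at roughly half the eager cost despite identical Fp multiplication times.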

@dfaranha
Copy link

dfaranha commented Feb 5, 2021

Incredible how much faster pairing implementations have gotten in the past few years. Keep up the good work, you all! :)

@jon-chuang
Copy link
Contributor Author

@mratsim I think these are really great options to look into, as our field towers are very slow vis-à-vis other libraries, even though we have improved on the Fp side. Thanks!
