Results from benchmarking pairing libraries #80
Comments
Thanks for the comprehensive comparison! I agree that there seem to be some nice avenues for optimization here! Let's collect here the things that …
Nice, interesting! It seems that for G1 additions gnark < zexe < zkcrypto, while gnark is slowest in the full pairing.
Dear all, after some more exploration, here are my findings:
I applied goff's suggested no-carry optimisation, and I got a speedup of 3-5 ns (down from 52-54 ns), less than advertised, but still significant. The full pairing was about 6% faster with this optimisation. I rewrote the multiplication as a compact loop and used the crate … With regards to G1 timings, the gap was narrowed by about half, but I should check back once squaring is optimised too. My suggestions: …
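For concreteness, here is a sketch of the no-carry CIOS Montgomery multiplication being described, reconstructed from the goff write-up rather than taken from the PR (the limb count, names, and final subtraction are illustrative). It applies only when the top bit of the modulus's most significant word is unused, which holds for BLS12-381's 381-bit modulus in six 64-bit limbs:

```rust
const N: usize = 6; // number of 64-bit limbs

/// (lo, hi) of a + b*c + carry; the sum always fits in 128 bits
#[inline(always)]
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let t = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (t as u64, (t >> 64) as u64)
}

/// a - b - borrow, returning (difference, new borrow)
#[inline(always)]
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let t = (a as u128).wrapping_sub((b as u128) + (borrow as u128));
    (t as u64, ((t >> 64) as u64) & 1)
}

/// Montgomery multiplication, goff's "no-carry" CIOS variant.
/// `q` is the modulus and `inv` is -q^{-1} mod 2^64.
fn mont_mul_no_carry(a: &[u64; N], b: &[u64; N], q: &[u64; N], inv: u64) -> [u64; N] {
    let mut t = [0u64; N];
    for i in 0..N {
        // one interleaved round of multiplication and Montgomery reduction
        let (lo, mut mul_carry) = mac(t[0], a[0], b[i], 0);
        let m = lo.wrapping_mul(inv);
        let (_, mut red_carry) = mac(lo, m, q[0], 0); // low word is zero by choice of m
        for j in 1..N {
            let (lo, hi) = mac(t[j], a[j], b[i], mul_carry);
            mul_carry = hi;
            let (lo2, hi2) = mac(lo, m, q[j], red_carry);
            red_carry = hi2;
            t[j - 1] = lo2;
        }
        // the modulus-shape condition guarantees this addition cannot overflow,
        // which is exactly the carry propagation the trick removes
        t[N - 1] = mul_carry + red_carry;
    }
    // final conditional subtraction: return t - q if t >= q, else t
    let mut r = [0u64; N];
    let mut borrow = 0;
    for k in 0..N {
        let (d, br) = sbb(t[k], q[k], borrow);
        r[k] = d;
        borrow = br;
    }
    if borrow == 0 { r } else { t }
}
```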
Thanks for the investigation @jon-chuang! If you could make a PR with the goff changes and with the … Re: benchmarking, it's a slightly more involved process, as there's a lot of code that would have to be changed :/
@Pratyush Actually, I don't think so. I will write some macros for benches for the algebra-benches crate. I will iterate by 1000 everything except G1/G2 muls, pairing-related ops, and sqrt. Not all of the curves have the same extension fields, so I will figure out how I want to tackle this; I will split up the macros by type. By the way, is it normal that one needs …? I believe more accurate benchmarking is mission-critical and also makes the Zexe library more attractive.
If you'd like to update the benchmarks anyway, you might want to use something like criterion. This would also solve the issue of requiring nightly for benches.
Criterion is a lot slower and more verbose, however, so we may not want it for minor benchmarks. It takes ages to run the full benchmark as it is, although I could try to find a way to tweak the sampling time. Further, I should look into whether there is similar functionality to …
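For what it's worth, criterion's sampling can be tuned per group, which addresses the runtime concern. A minimal sketch (the bench name and workload are placeholders, not the real algebra-benches code):

```rust
// benches/fq_ops.rs (requires criterion in [dev-dependencies])
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::time::Duration;

fn bench_mul(c: &mut Criterion) {
    // placeholder workload; a real bench would multiply random field elements
    let (a, b) = (0x1234_5678_9abc_def0u64, 0x0fed_cba9_8765_4321u64);
    c.bench_function("fq_mul", |bench| {
        bench.iter(|| black_box(a).wrapping_mul(black_box(b)))
    });
}

criterion_group! {
    name = benches;
    // trade some statistical confidence for wall-clock time
    config = Criterion::default()
        .sample_size(10)
        .measurement_time(Duration::from_secs(1));
    targets = bench_mul
}
criterion_main!(benches);
```

Unlike `#[bench]`, this runs on stable Rust, which is the nightly issue mentioned above.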
I noticed a troubling phenomenon: G1 doubling seems to be slower after changing to no-carry multiplication (by about 5%), and this seems to affect G1 scalar multiplication as well. @Pratyush, can you replicate this?
G1 Benchmark: …
I believe this would help us in our diagnosis of why G1 ops are much slower than in gnark. Could it be something to do with inlining or memory usage (e.g. cache-friendliness)? It seems odd that even though our Fp ops are faster than or on par with goff's, and even though we both use the exact same formulas, the Go implementation would be faster than the Rust one. It's a bit of an enigma; I don't know if careful profiling will help.
I did a final careful benchmarking of the recent PR arkworks-rs/snark#163. My first recommendation is to try adding … Here is a graphic summarising my results :) Regressions are largely noise from small functions.
Full Benchmark: …
I also find that Fq … The timings I got vis-à-vis goff/gnark are essentially the same as in my intermediate report. G1 is the only area where we are still lagging behind. New avenues of optimisation I would like to look into are G1 … By the way, there are no benchmarks for MSM, but this is a very critical number for provers.
I think the issue with scalar multiplication should be gone now? From local benchmarks, mixed_add, add, and double are all faster. Btw, the scalar multiplication benchmarks are noisy because the scalars are random and have varying Hamming weight, so that's something to keep in mind.
@Pratyush nonetheless, the …
I also find that inverse is significantly slower than goff's (2.2x: 13µs vs goff's 6µs). Further, sqrt is also very slow (1.5x slower: 75/54µs vs goff's ~25/54µs). How performance-critical are these, @Pratyush? I think I will run more detailed 1000-fold benchmarks to get to the bottom of which primitive operations (e.g. add/div2) are responsible. This could also be another benchmarking difference arising from the randomness; perhaps I should check whether I'm looping through random elements.
Hmm, inversion isn't strictly performance-critical, because the elliptic curve code tries to minimize it. As for the sqrt code, it would be helpful to optimize it, since it is critical for point compression.
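For reference, when p ≡ 3 (mod 4), as for the BLS12-381 base field, the square root needed for point decompression reduces to a single fixed exponentiation. A sketch against a minimal stand-in trait (not zexe's actual `Field` trait):

```rust
use core::ops::Mul;

/// Minimal stand-in for the library's field interface.
trait FieldLike: Copy + PartialEq + Mul<Output = Self> {
    /// square-and-multiply exponentiation; exponent limbs little-endian
    fn pow(self, exp: &[u64]) -> Self;
}

/// Square root for p ≡ 3 (mod 4): a^((p+1)/4) is a root iff `a` is a square.
/// `p_plus_one_div_four` is the precomputed limb representation of (p+1)/4.
fn sqrt_3_mod_4<F: FieldLike>(a: F, p_plus_one_div_four: &[u64]) -> Option<F> {
    let candidate = a.pow(p_plus_one_div_four); // one fixed-exponent pow
    if candidate * candidate == a { Some(candidate) } else { None }
}
```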
(That being said, having faster algorithms is always welcome =D)
Also, thanks again for the thorough benchmarks!
I see. Is there any scenario in which batch inversion actually helps performance? Perhaps for something like Pippenger scalar mul / MSM? I am thinking of affine coordinates + batch inversion.
Currently our MSM code takes in affine coordinates and uses mixed addition.
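For context, "mixed addition" here means adding an affine point into a Jacobian accumulator, which avoids any inversion in the hot loop. A sketch using the madd-2007-bl formulas from the Explicit-Formulas Database, over a stand-in field trait (not zexe's actual types; the identity and P = ±Q cases are omitted):

```rust
use core::ops::{Add, Mul, Sub};

/// Minimal stand-in field interface for the illustration.
trait Field: Copy + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> {
    fn double(self) -> Self { self + self }
    fn square(self) -> Self { self * self }
}

/// Mixed addition: Jacobian (X1, Y1, Z1) + affine (x2, y2) -> Jacobian.
/// Assumes the points are distinct, not negatives, and not the identity.
fn mixed_add<F: Field>(x1: F, y1: F, z1: F, x2: F, y2: F) -> (F, F, F) {
    let z1z1 = z1.square();
    let u2 = x2 * z1z1;           // bring the affine x into P1's frame
    let s2 = y2 * z1 * z1z1;      // ... and the affine y
    let h = u2 - x1;
    let hh = h.square();
    let i = hh.double().double(); // I = 4*HH
    let j = h * i;
    let r = (s2 - y1).double();
    let v = x1 * i;
    let x3 = r.square() - j - v.double();
    let y3 = r * (v - x3) - (y1 * j).double();
    let z3 = (z1 + h).square() - z1z1 - hh;
    (x3, y3, z3)
}
```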
How about batch inversion in particular?
Hmm, since batch_inversion performs O(N) multiplications but only one inversion, the cost of the single inversion is usually drowned out for large N. However, using batch inversion with a small N might make the additional cost of the inversion significant again.
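To make the trade-off concrete, here is a sketch of Montgomery's batch-inversion trick: n inversions for the price of one inversion plus roughly 3n multiplications. The `inverse` closure stands in for the library's single-element inversion, and all inputs are assumed nonzero:

```rust
use core::ops::Mul;

/// Replace every element of `elems` with its inverse using a single
/// field inversion (Montgomery's trick).
fn batch_inverse<F: Copy + Mul<Output = F>>(elems: &mut [F], one: F, inverse: impl Fn(F) -> F) {
    if elems.is_empty() {
        return;
    }
    // forward pass: prefix[i] = elems[0] * ... * elems[i]
    let mut prefix = Vec::with_capacity(elems.len());
    let mut acc = one;
    for e in elems.iter() {
        acc = acc * *e;
        prefix.push(acc);
    }
    // the single inversion, of the product of everything
    let mut inv = inverse(acc);
    // backward pass: peel one element off at a time
    for i in (1..elems.len()).rev() {
        let inv_prev = inv * elems[i];   // inverse of prefix[i-1]
        elems[i] = inv * prefix[i - 1];  // elems[i]^{-1}
        inv = inv_prev;
    }
    elems[0] = inv;
}
```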
I see. But where is batch inversion used? You mentioned it is used for mixed addition? Or is that standard addition?
goff's inversion is not constant-time; if constant time is desired, you cannot compare it with the usual inversion based on Fermat's little theorem. Also, as mentioned by @Pratyush, for pairings in an Fp12 tower, the tower setup passes the inversion down the tower, so the ratio compared to an Fp operation stays acceptable, and we can use Jacobian/projective coordinates to avoid the inversion tax (at the price of more additions/multiplications). More discussion on fast constant-time modular inversion here: randombit/botan#1479 (comment). Also, AFAIK the fastest open-source pairing library is MCL, provided your CPU is at least Broadwell and supports the MULX/ADOX/ADCX instructions. The gain is about 15% on my machine; see the benchmarks at mratsim/constantine#18. (Note that in those benchmarks I was still using inversion via Fermat's little theorem exponentiation; my new constant-time inversion algorithm is about 30% faster, though 10x slower than goff's non-constant-time version or GMP.)
Since we are on par with your library, that would make us about 22-30 clock cycles behind MCL, i.e. roughly 25% extra overhead. That's remarkable. @Pratyush, given this result, I may find some time to read the MCL code and implement the ADX instructions in a maintainable way, since x86 intrinsics do not expose them owing to lack of support in LLVM.
Hmm, I'm pretty sure we don't have a constant-time inversion algorithm; we use the binary Euclidean algorithm, which isn't constant-time: https://github.com/scipr-lab/zexe/blob/master/algebra-core/src/fields/macros.rs
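For reference, this is the shape of that variable-time binary extended Euclidean inversion, shown here on single u64 words for clarity rather than on the library's multi-limb representation. The data-dependent branching and shifting is exactly what makes it non-constant-time:

```rust
/// Returns a^{-1} mod p. Requires p an odd prime and a != 0 mod p.
/// Invariants maintained: b*a ≡ u (mod p) and c*a ≡ v (mod p).
fn binary_inverse(a: u64, p: u64) -> u64 {
    let (mut u, mut v) = (a % p, p);
    let (mut b, mut c) = (1u64, 0u64);
    let pw = p as u128;
    while u != 1 && v != 1 {
        while u & 1 == 0 {
            u >>= 1;
            // halve b mod p: p is odd, so b + p is even whenever b is odd
            b = if b & 1 == 0 { b >> 1 } else { ((b as u128 + pw) >> 1) as u64 };
        }
        while v & 1 == 0 {
            v >>= 1;
            c = if c & 1 == 0 { c >> 1 } else { ((c as u128 + pw) >> 1) as u64 };
        }
        if u >= v {
            u -= v;
            b = ((b as u128 + pw - c as u128) % pw) as u64; // b - c mod p
        } else {
            v -= u;
            c = ((c as u128 + pw - b as u128) % pw) as u64; // c - b mod p
        }
    }
    if u == 1 { b } else { c }
}
```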
You can find Filecoin's new implementation using assembly-based optimizations at https://github.com/filecoin-project/blstrs; it would be interesting to see how this compares.
Wow, blstrs' Fp12 multiplications are significantly faster! (The Fp multiplication times are the same in both, but the Fp12 multiplication in blstrs takes half the time.) One of the things blst does is take advantage of #121 within its quadratic extension arithmetic (see https://github.com/supranational/blst/blob/master/src/fp12_tower.c#L309). We may need to actually return an unreduced form in #121 to make these additions into it worth it, though.
The latest cross-library pairing benchmarks were gathered by @yelhousni at https://hackmd.io/@zkteam/eccbench. I also noted all the optimizations I am aware of for pairing libraries here: https://github.com/mratsim/constantine/blob/2c5e12d/docs/optimizations.md. @ValarDragon: BLST, MCL, and Gurvy do indeed use unreduced multiplication in the higher fields during towering. The technique is described in a paper by @herumi (the MCL author), and it is generalized to all extension fields by @dfaranha (the Relic author).
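To illustrate the "unreduced" trick at toy scale: combine the wide products first and reduce each output coefficient only once, instead of reducing all four products individually. The sketch below uses the small prime 2^32 - 5 with u² = -1 (the same non-residue as BLS12-381's Fp2); the real libraries do the same dance with Montgomery arithmetic and multi-limb accumulators:

```rust
/// Toy prime: 2^32 - 5, which is ≡ 3 (mod 4), so u^2 = -1 defines Fp2.
const P: u64 = 0xffff_fffb;

/// Fp2 multiplication with lazy reduction:
/// (a0 + a1*u)(b0 + b1*u) = (a0*b0 - a1*b1) + (a0*b1 + a1*b0)*u.
/// Inputs are assumed reduced; outputs are reduced once per coefficient.
fn fp2_mul_lazy(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    // wide, unreduced products (they fit comfortably in u128 for 32-bit values)
    let v0 = (a.0 as u128) * (b.0 as u128); // a0*b0
    let v1 = (a.1 as u128) * (b.1 as u128); // a1*b1
    let v2 = (a.0 as u128) * (b.1 as u128); // a0*b1
    let v3 = (a.1 as u128) * (b.0 as u128); // a1*b0
    // combine in the wide domain; add p^2 so the subtraction cannot underflow
    let p2 = (P as u128) * (P as u128);
    let c0 = ((v0 + p2 - v1) % (P as u128)) as u64; // one reduction for c0
    let c1 = ((v2 + v3) % (P as u128)) as u64;      // one reduction for c1
    (c0, c1)
}
```

The saving is that each output coefficient costs one reduction instead of two, and the same pattern compounds up the tower (Fp2 → Fp6 → Fp12).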
Incredible how much faster pairing implementations have gotten in the past few years. Keep up the good work, you all! :)
@mratsim I think these are really great options to look into, as our field towers are very slow vis-à-vis other libraries, even though we have improved on the Fp side. Thanks!
Hi all, I would like to share with you some benchmarks I ran on zexe, gnark/goff, and zkcrypto's BLS12-381. I believe this is a good place to start in identifying the key avenues for optimisation.
From these results, it seems that the major areas of optimisation are in …
More benchmarking should be done once Fq has been sufficiently optimised.
Yet to benchmark:
zk-swap-libff/libff
barretenberg
Results:
BLS12-381
- Fq: goff, Zexe
- G1: gnark, Zexe, zkcrypto
- Pairing: Zexe, zkcrypto, gnark