Results from benchmarking pairing libraries #80
Comments
Thanks for the comprehensive comparison! I agree that there seem to be some nice avenues for optimization here! Let's collect here the things that …
Nice, interesting! It seems that for G1 additions gnark < zexe < zkcrypto, while gnark is slowest in the full pairing.
Dear all, after some more exploration, here are my findings:
I applied goff's suggested no-carry optimisation, and I got a speedup of 3-5 ns (down from 52-54 ns), less than advertised, but still significant. The full pairing was about 6% faster with this optimisation. I rewrote the multiplication as a compact loop and used the crate … With regards to G1 timings, the gap was narrowed by about half, but I should check back once squaring is optimised too. My suggestions: …
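For concreteness, here is a sketch of the no-carry CIOS Montgomery multiplication being described, reconstructed from the goff write-up rather than taken from the PR (the limb count, names, and final subtraction are illustrative). It applies only when the top bit of the modulus's most significant word is unused, which holds for BLS12-381's 381-bit modulus in six 64-bit limbs:

```rust
const N: usize = 6; // number of 64-bit limbs

/// (lo, hi) of a + b*c + carry; the sum always fits in 128 bits
#[inline(always)]
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let t = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (t as u64, (t >> 64) as u64)
}

/// a - b - borrow, returning (difference, new borrow)
#[inline(always)]
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let t = (a as u128).wrapping_sub((b as u128) + (borrow as u128));
    (t as u64, ((t >> 64) as u64) & 1)
}

/// Montgomery multiplication, goff's "no-carry" CIOS variant.
/// `q` is the modulus and `inv` is -q^{-1} mod 2^64.
fn mont_mul_no_carry(a: &[u64; N], b: &[u64; N], q: &[u64; N], inv: u64) -> [u64; N] {
    let mut t = [0u64; N];
    for i in 0..N {
        // one interleaved round of multiplication and Montgomery reduction
        let (lo, mut mul_carry) = mac(t[0], a[0], b[i], 0);
        let m = lo.wrapping_mul(inv);
        let (_, mut red_carry) = mac(lo, m, q[0], 0); // low word is zero by choice of m
        for j in 1..N {
            let (lo, hi) = mac(t[j], a[j], b[i], mul_carry);
            mul_carry = hi;
            let (lo2, hi2) = mac(lo, m, q[j], red_carry);
            red_carry = hi2;
            t[j - 1] = lo2;
        }
        // the modulus-shape condition guarantees this addition cannot overflow,
        // which is exactly the carry propagation the trick removes
        t[N - 1] = mul_carry + red_carry;
    }
    // final conditional subtraction: return t - q if t >= q, else t
    let mut r = [0u64; N];
    let mut borrow = 0;
    for k in 0..N {
        let (d, br) = sbb(t[k], q[k], borrow);
        r[k] = d;
        borrow = br;
    }
    if borrow == 0 { r } else { t }
}
```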
Thanks for the investigation @jon-chuang! If you could make a PR with the goff changes and with the … Re: benchmarking, it's a slightly more involved process, as there's a lot of code that would have to be changed :/
@Pratyush Actually, I don't think so. I will write some macros for benches for the algebra-benches crate. I will iterate by 1000 everything except G1/G2 muls, pairing-related ops, and sqrt. Not all of the curves have the same extension fields, so I will figure out how I want to tackle this; I will split up the macros by type. By the way, is it normal that one needs …? I believe more accurate benchmarking is mission-critical and also makes the Zexe library more attractive.
If you'd like to update the benchmarks anyway, you might want to use something like criterion. This would also solve the issue of requiring nightly for benches.
Criterion is a lot slower and more verbose, however, so we may not want it for minor benchmarks. It takes ages to run the full benchmark as it is, although I could try to find a way to tweak the sampling time. Further, I should look into whether there is similar functionality to …
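For what it's worth, criterion's sampling can be tuned per group, which addresses the runtime concern. A minimal sketch (the bench name and workload are placeholders, not the real algebra-benches code):

```rust
// benches/fq_ops.rs (requires criterion in [dev-dependencies])
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::time::Duration;

fn bench_mul(c: &mut Criterion) {
    // placeholder workload; a real bench would multiply random field elements
    let (a, b) = (0x1234_5678_9abc_def0u64, 0x0fed_cba9_8765_4321u64);
    c.bench_function("fq_mul", |bench| {
        bench.iter(|| black_box(a).wrapping_mul(black_box(b)))
    });
}

criterion_group! {
    name = benches;
    // trade some statistical confidence for wall-clock time
    config = Criterion::default()
        .sample_size(10)
        .measurement_time(Duration::from_secs(1));
    targets = bench_mul
}
criterion_main!(benches);
```

Unlike `#[bench]`, this runs on stable Rust, which is the nightly issue mentioned above.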
I noticed a troubling phenomenon: G1 doubling seems to be slower after changing to no-carry multiplication (by about 5%), and this seems to affect G1 scalar multiplication as well. @Pratyush, can you replicate this?
G1 Benchmark: …
I believe this would help us in our diagnosis of why G1 ops are much slower than in gnark. Could it be something to do with inlining or memory usage (e.g. cache-friendliness)? It seems odd that even though our Fp ops are faster than or on par with goff's, and even though we both use the exact same formulas, the Go implementation would be faster than the Rust one. It's a bit of an enigma; I don't know if careful profiling will help.
I did a final careful benchmarking of the recent PR arkworks-rs/snark#163. My first recommendation is to try adding … Here is a graphic summarising my results :) Regressions are largely noise from small functions.
Full Benchmark: …
I also find that Fq … The timings I got vis-à-vis goff/gnark are essentially the same as in my intermediate report. G1 is the only area where we are still lagging behind. New avenues of optimisation I would like to look into are G1 … By the way, there are no benchmarks for MSM, but this is a very critical number for provers.
I think the issue with scalar multiplication should be gone now? From local benchmarks, mixed_add, add, and double are all faster. Btw, the scalar multiplication benchmarks are noisy because the scalars are random and have varying Hamming weight, so that's something to keep in mind.
@Pratyush nonetheless, the …
I also find that inverse is significantly slower than goff's (2.2x: 13µs vs goff's 6µs). Further, sqrt is also very slow (1.5x slower: 75/54µs vs goff's ~25/54µs). How performance-critical are these, @Pratyush? I think I will run more detailed 1000-fold benchmarks to get to the bottom of which primitive operations (e.g. add/div2) are responsible. This could also be another benchmarking difference arising from the randomness; perhaps I should check whether I'm looping through random elements.
Hmm, inversion isn't strictly performance-critical, because the elliptic curve code tries to minimize it. As for the sqrt code, it would be helpful to optimize it, since it is critical for point compression.
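For reference, when p ≡ 3 (mod 4), as for the BLS12-381 base field, the square root needed for point decompression reduces to a single fixed exponentiation. A sketch against a minimal stand-in trait (not zexe's actual `Field` trait):

```rust
use core::ops::Mul;

/// Minimal stand-in for the library's field interface.
trait FieldLike: Copy + PartialEq + Mul<Output = Self> {
    /// square-and-multiply exponentiation; exponent limbs little-endian
    fn pow(self, exp: &[u64]) -> Self;
}

/// Square root for p ≡ 3 (mod 4): a^((p+1)/4) is a root iff `a` is a square.
/// `p_plus_one_div_four` is the precomputed limb representation of (p+1)/4.
fn sqrt_3_mod_4<F: FieldLike>(a: F, p_plus_one_div_four: &[u64]) -> Option<F> {
    let candidate = a.pow(p_plus_one_div_four); // one fixed-exponent pow
    if candidate * candidate == a { Some(candidate) } else { None }
}
```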
(That being said, having faster algorithms is always welcome =D)
Also, thanks again for the thorough benchmarks!
I see. Is there any scenario in which batch inversion actually helps performance? Perhaps for something like Pippenger scalar mul / MSM? I am thinking of affine coordinates + batch inversion.
Currently our MSM code takes in affine coordinates and uses mixed addition.
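For context, "mixed addition" here means adding an affine point into a Jacobian accumulator, which avoids any inversion in the hot loop. A sketch using the madd-2007-bl formulas from the Explicit-Formulas Database, over a stand-in field trait (not zexe's actual types; the identity and P = ±Q cases are omitted):

```rust
use core::ops::{Add, Mul, Sub};

/// Minimal stand-in field interface for the illustration.
trait Field: Copy + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> {
    fn double(self) -> Self { self + self }
    fn square(self) -> Self { self * self }
}

/// Mixed addition: Jacobian (X1, Y1, Z1) + affine (x2, y2) -> Jacobian.
/// Assumes the points are distinct, not negatives, and not the identity.
fn mixed_add<F: Field>(x1: F, y1: F, z1: F, x2: F, y2: F) -> (F, F, F) {
    let z1z1 = z1.square();
    let u2 = x2 * z1z1;           // bring the affine x into P1's frame
    let s2 = y2 * z1 * z1z1;      // ... and the affine y
    let h = u2 - x1;
    let hh = h.square();
    let i = hh.double().double(); // I = 4*HH
    let j = h * i;
    let r = (s2 - y1).double();
    let v = x1 * i;
    let x3 = r.square() - j - v.double();
    let y3 = r * (v - x3) - (y1 * j).double();
    let z3 = (z1 + h).square() - z1z1 - hh;
    (x3, y3, z3)
}
```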
How about batch inversion in particular?
Hmm, since batch_inversion performs O(N) multiplications but only one inversion, the cost of the single inversion is usually drowned out for large N. However, using batch inversion with a small N might make the additional cost of the inversion significant again.
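To make the trade-off concrete, here is a sketch of Montgomery's batch-inversion trick: n inversions for the price of one inversion plus roughly 3n multiplications. The `inverse` closure stands in for the library's single-element inversion, and all inputs are assumed nonzero:

```rust
use core::ops::Mul;

/// Replace every element of `elems` with its inverse using a single
/// field inversion (Montgomery's trick).
fn batch_inverse<F: Copy + Mul<Output = F>>(elems: &mut [F], one: F, inverse: impl Fn(F) -> F) {
    if elems.is_empty() {
        return;
    }
    // forward pass: prefix[i] = elems[0] * ... * elems[i]
    let mut prefix = Vec::with_capacity(elems.len());
    let mut acc = one;
    for e in elems.iter() {
        acc = acc * *e;
        prefix.push(acc);
    }
    // the single inversion, of the product of everything
    let mut inv = inverse(acc);
    // backward pass: peel one element off at a time
    for i in (1..elems.len()).rev() {
        let inv_prev = inv * elems[i];   // inverse of prefix[i-1]
        elems[i] = inv * prefix[i - 1];  // elems[i]^{-1}
        inv = inv_prev;
    }
    elems[0] = inv;
}
```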
I see. But where is batch inversion used? You mentioned it is used for mixed addition? Or is that standard addition?
goff's inversion is not constant-time; if constant time is desired, you cannot compare it with the usual inversion based on Fermat's little theorem. Also, as mentioned by @Pratyush, for pairings in an Fp12 tower, the tower setup passes the inversion down the tower, so the ratio compared to an Fp operation stays acceptable, and we can use Jacobian/projective coordinates to avoid the inversion tax (at the price of more additions/multiplications). More discussion on fast constant-time modular inversion here: randombit/botan#1479 (comment). Also, AFAIK the fastest open-source pairing library is MCL, provided your CPU is at least Broadwell and supports the MULX/ADOX/ADCX instructions. The gain is about 15% on my machine; see the benchmarks at mratsim/constantine#18. (Note that in those benchmarks I was still using inversion via Fermat's little theorem exponentiation; my new constant-time inversion algorithm is about 30% faster, though 10x slower than goff's non-constant-time version or GMP.)
Since we are on par with your library, that would make us about 22-30 clock cycles behind MCL, i.e. roughly 25% extra overhead. That's remarkable. @Pratyush, given this result, I may find some time to read the MCL code and implement the ADX instructions in a maintainable way, since x86 intrinsics do not expose them owing to lack of support in LLVM.
Hmm, I'm pretty sure we don't have a constant-time inversion algorithm; we use the binary Euclidean algorithm, which isn't constant-time: https://github.com/scipr-lab/zexe/blob/master/algebra-core/src/fields/macros.rs
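For reference, this is the shape of that variable-time binary extended Euclidean inversion, shown here on single u64 words for clarity rather than on the library's multi-limb representation. The data-dependent branching and shifting is exactly what makes it non-constant-time:

```rust
/// Returns a^{-1} mod p. Requires p an odd prime and a != 0 mod p.
/// Invariants maintained: b*a ≡ u (mod p) and c*a ≡ v (mod p).
fn binary_inverse(a: u64, p: u64) -> u64 {
    let (mut u, mut v) = (a % p, p);
    let (mut b, mut c) = (1u64, 0u64);
    let pw = p as u128;
    while u != 1 && v != 1 {
        while u & 1 == 0 {
            u >>= 1;
            // halve b mod p: p is odd, so b + p is even whenever b is odd
            b = if b & 1 == 0 { b >> 1 } else { ((b as u128 + pw) >> 1) as u64 };
        }
        while v & 1 == 0 {
            v >>= 1;
            c = if c & 1 == 0 { c >> 1 } else { ((c as u128 + pw) >> 1) as u64 };
        }
        if u >= v {
            u -= v;
            b = ((b as u128 + pw - c as u128) % pw) as u64; // b - c mod p
        } else {
            v -= u;
            c = ((c as u128 + pw - b as u128) % pw) as u64; // c - b mod p
        }
    }
    if u == 1 { b } else { c }
}
```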
You can find Filecoin's new implementation using assembly-based optimizations at https://github.com/filecoin-project/blstrs; it would be interesting to see how this compares.
Wow, blstrs' Fp12 multiplications are significantly faster! (The Fp multiplication times are the same in both, but the Fp12 multiplication in blstrs takes half the time.) One of the things blst does is take advantage of #121 within its quadratic extension arithmetic (see https://github.com/supranational/blst/blob/master/src/fp12_tower.c#L309). We may need to actually return an unreduced form in #121 to make these additions into it worth it, though.
The latest cross-library pairing benchmarks were gathered by @yelhousni at https://hackmd.io/@zkteam/eccbench. I also noted all the optimizations I am aware of for pairing libraries here: https://github.com/mratsim/constantine/blob/2c5e12d/docs/optimizations.md. @ValarDragon: BLST, MCL, and Gurvy do indeed use unreduced multiplication in the higher fields during towering. The technique is described in a paper by @herumi (the MCL author), and it is generalized to all extension fields by @dfaranha (the Relic author).
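To illustrate the "unreduced" trick at toy scale: combine the wide products first and reduce each output coefficient only once, instead of reducing all four products individually. The sketch below uses the small prime 2^32 - 5 with u² = -1 (the same non-residue as BLS12-381's Fp2); the real libraries do the same dance with Montgomery arithmetic and multi-limb accumulators:

```rust
/// Toy prime: 2^32 - 5, which is ≡ 3 (mod 4), so u^2 = -1 defines Fp2.
const P: u64 = 0xffff_fffb;

/// Fp2 multiplication with lazy reduction:
/// (a0 + a1*u)(b0 + b1*u) = (a0*b0 - a1*b1) + (a0*b1 + a1*b0)*u.
/// Inputs are assumed reduced; outputs are reduced once per coefficient.
fn fp2_mul_lazy(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    // wide, unreduced products (they fit comfortably in u128 for 32-bit values)
    let v0 = (a.0 as u128) * (b.0 as u128); // a0*b0
    let v1 = (a.1 as u128) * (b.1 as u128); // a1*b1
    let v2 = (a.0 as u128) * (b.1 as u128); // a0*b1
    let v3 = (a.1 as u128) * (b.0 as u128); // a1*b0
    // combine in the wide domain; add p^2 so the subtraction cannot underflow
    let p2 = (P as u128) * (P as u128);
    let c0 = ((v0 + p2 - v1) % (P as u128)) as u64; // one reduction for c0
    let c1 = ((v2 + v3) % (P as u128)) as u64;      // one reduction for c1
    (c0, c1)
}
```

The saving is that each output coefficient costs one reduction instead of two, and the same pattern compounds up the tower (Fp2 → Fp6 → Fp12).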
Incredible how much faster pairing implementations have gotten in the past few years. Keep up the good work, you all! :)
@mratsim I think these are really great options to look into, as our field towers are very slow vis-à-vis other libraries, even though we have improved on the Fp side. Thanks!
Hi all, I would like to share with you some benchmarks I ran on zexe, gnark/goff, and zkcrypto's BLS12-381. I believe this is a good place to start in identifying the key avenues for optimisation.
From these results, it seems that the major areas of optimisation are in …
More benchmarking should be done once Fq has been sufficiently optimised.
Yet to benchmark:
zk-swap-libff/libff
barretenberg
Results:
BLS12-381
- Fq: goff, Zexe
- G1: gnark, Zexe, zkcrypto
- Pairing: Zexe, zkcrypto, gnark