Field arithmetic benchmark + reworked assembly (+40% perf) #49
Conversation
Analysis

The x86 architecture has several advantages over others for bigint arithmetic:
The previous assembly did not use the 2 carry chains with ADCX and ADOX, hence the computation required 4 instructions instead of 3 per word; there is an extra
See also the Intel whitepaper comparing:
Literature

With MULX/ADCX/ADOX, the CIOS algorithm for field multiplication is the fastest. Additionally we use the "no-carry" optimization from the Gnark authors, which is applicable to both Fp and Fr:
Coarsely Integrated Operand Scanning:
No-carry optimization
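As a portable illustration of the CIOS flow with the no-carry optimization (a plain-Rust sketch, not the PR's assembly; the modulus, `INV = -p^-1 mod 2^64`, and `R = 2^256 mod p` constants are assumed to be those of BN254 Fq):

```rust
// Sketch of CIOS Montgomery multiplication with the gnark "no-carry"
// optimization, in portable Rust using u128 arithmetic (the PR itself
// uses x86-64 assembly with MULX/ADCX/ADOX).
const MODULUS: [u64; 4] = [
    0x3c208c16d87cfd47, 0x97816a916871ca8d,
    0xb85045b68181585d, 0x30644e72e131a029,
];
const INV: u64 = 0x87d20782e4866389; // assumed: -MODULUS^-1 mod 2^64
const R: [u64; 4] = [
    0xd35d438dc58f0d9d, 0x0a78eb28f5c70b3d,
    0x666ea36f7879462c, 0x0e0a77c19a07df2f,
]; // assumed: 2^256 mod p, i.e. 1 in Montgomery form

/// (a + b*c + carry) as (low, high) words.
#[inline]
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let t = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (t as u64, (t >> 64) as u64)
}

/// a - b - borrow, returning (difference, borrow-out).
#[inline]
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let t = (a as u128).wrapping_sub((b as u128) + (borrow as u128));
    (t as u64, ((t >> 64) as u64) & 1)
}

/// CIOS: returns a*b*R^-1 mod p for Montgomery-form inputs.
fn mont_mul(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    let mut t = [0u64; 4];
    for i in 0..4 {
        // Multiplication chain: t += a * b[i]
        let (lo, mut carry_mul) = mac(t[0], a[0], b[i], 0);
        t[0] = lo;
        // Reduction chain: fold in m*p so limb 0 becomes zero, then
        // shift right by one word as we go (t[j] -> t[j-1]).
        let m = t[0].wrapping_mul(INV);
        let (_, mut carry_red) = mac(t[0], m, MODULUS[0], 0);
        for j in 1..4 {
            let (lo, hi) = mac(t[j], a[j], b[i], carry_mul);
            t[j] = lo;
            carry_mul = hi;
            let (lo, hi) = mac(t[j], m, MODULUS[j], carry_red);
            t[j - 1] = lo;
            carry_red = hi;
        }
        // No-carry optimization: since the top modulus word is small
        // enough, this sum never overflows a single word.
        t[3] = carry_mul + carry_red;
    }
    // Final conditional subtraction to bring the result below p.
    let mut borrow = 0;
    let mut r = [0u64; 4];
    for j in 0..4 {
        let (d, b) = sbb(t[j], MODULUS[j], borrow);
        r[j] = d;
        borrow = b;
    }
    if borrow == 0 { r } else { t }
}

fn main() {
    // mont_mul(aR, bR) = abR: with a = b = 1 the result is R itself,
    // and multiplying by a plain (non-Montgomery) 1 converts back out
    // of Montgomery form.
    assert_eq!(mont_mul(&R, &R), R);
    assert_eq!(mont_mul(&R, &[1, 0, 0, 0]), [1, 0, 0, 0]);
    println!("ok");
}
```

The two carry variables `carry_mul` and `carry_red` are exactly the two independent chains that ADCX (carry flag) and ADOX (overflow flag) keep in flight simultaneously in the assembly version.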
LGTM! Thank you for improving the assembly that much! Also thanks for adding the benchmark output and documentation for reference!
I have checked that it's the same as the evmmax assembly (though some registers are different, I guess that doesn't matter), and also checked it against the algorithm in the gnark team's document line by line; no difference was spotted.
Another thing I realized is that we could also replace `inv = const $inv` with `inv = in(reg) $inv` in `montgomery_reduce`, just like `mul` in this PR; then we can get rid of the nightly requirement for the `asm` feature. I just benched locally and no performance difference was spotted.
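For illustration, the difference is just the operand class on the `asm!` line. A minimal sketch (x86-64 only; the function name and instruction choice are hypothetical, not the PR's code) of computing the Montgomery factor `m = t0 * inv mod 2^64` with `inv` passed as a register operand, which compiles on stable Rust:

```rust
use std::arch::asm;

// Hypothetical sketch: the first step of a Montgomery reduction round,
// m = t0 * inv mod 2^64. Passing `inv` as `in(reg)` works on stable
// Rust; `inv = const $inv` required the nightly `asm_const` feature
// at the time of this PR.
fn mont_m(t0: u64, inv: u64) -> u64 {
    let m: u64;
    unsafe {
        asm!(
            "mov {m}, {t0}",
            "imul {m}, {inv}", // wrapping multiply, low 64 bits
            m = out(reg) m,
            t0 = in(reg) t0,
            inv = in(reg) inv, // was: inv = const $inv
            options(pure, nomem, nostack),
        );
    }
    m
}

fn main() {
    // Same result as a plain wrapping multiply.
    assert_eq!(mont_m(3, 5), 15);
    assert_eq!(mont_m(u64::MAX, 2), u64::MAX.wrapping_mul(2));
    println!("ok");
}
```

The cost of a register operand is one register reserved for the constant, which the local benchmark above suggests is not measurable here.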
For some reason Rust reserves the
It is probably related to these LLVM internal issues:
Apparently LLVM uses
And some instructions like
I see, I see. Thanks for the detailed explanation!
Done. Is Montgomery reduce a bottleneck? I have seen it used only here: https://github.com/privacy-scaling-explorations/halo2curves/blob/21def8d/src/bn256/fq.rs#L263-L276
Ah, I don't think so; it's indeed only used for converting back to canonical form (and people who care about serialization performance would tend to use the
First of all, this is a really exciting PR! Because I saw mention of Montgomery reduce, I just wanted to note that it is used a bunch for witness generation in halo2 (maybe not so much after Zhenfei's PR with the table). More specifically, in halo2-ecc there is a lot of conversion from field element to

I don't know if this is the right place for it, but I found that it does help to have a

Unfortunately I can't find where my bench was, but just thought I'd make a note. I can try to make a separate PR later.
I was indeed asking because

Regarding
If there is interest in fast Montgomery reduction, I can make a separate PR. Otherwise I'll create an issue with the relevant links. Note that to get the fastest pairings, and hence verifiers, you need to delay reductions in extension fields, for a ~20% perf improvement minimum:
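To illustrate the delayed-reduction idea at toy scale (the prime and extension here are illustrative assumptions, not the pairing-friendly field): in an Fp2 multiplication, the cross products can be accumulated at double width and each output coefficient reduced once, instead of reducing after every base-field multiplication:

```rust
// Toy sketch of lazy (delayed) reduction in Fp2 with u^2 = -1, over the
// Mersenne prime 2^61 - 1 (chosen so that sums of products always fit
// in u128; an illustrative stand-in for the real extension tower).
const P: u64 = (1u64 << 61) - 1;

fn fp2_mul_lazy(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    // (a0 + a1*u)(b0 + b1*u) with u^2 = -1:
    //   c0 = a0*b0 - a1*b1, computed as a0*b0 + a1*(P - b1) to stay positive
    //   c1 = a0*b1 + a1*b0
    // Each coefficient is reduced ONCE at the end, not per product.
    let c0 = (a.0 as u128) * (b.0 as u128) + (a.1 as u128) * ((P - b.1) as u128);
    let c1 = (a.0 as u128) * (b.1 as u128) + (a.1 as u128) * (b.0 as u128);
    ((c0 % P as u128) as u64, (c1 % P as u128) as u64)
}

fn main() {
    // (3 + 5u)(7 + 11u) = (21 - 55) + (33 + 35)u = -34 + 68u
    let (c0, c1) = fp2_mul_lazy((3, 5), (7, 11));
    assert_eq!(c0, P - 34);
    assert_eq!(c1, 68);
    println!("ok");
}
```

With 256-bit limbs the same trick means keeping the 512-bit product unreduced through the tower's additions and subtractions, paying the Montgomery reduction only once per final coefficient.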
I for one would be interested in faster Montgomery reduction!
That's an awesome PR! Thank you so much @mratsim not only for the code, but also for all of the documentation surrounding the PR.
LGTM!
…caling-explorations#49)

* add benchmarks for BN256
* small assembly changes: adcx -> adc
* rework mul assembly to use 2 carry chains
* remove need for nightly for asm by using register instead of constant
* remove unused regs
* run cargo fmt
This PR does 3 things:
Benchmarks
Benchmark on a laptop: Arch Linux, i9-11980HK (8 cores, with hyperthreading and turbo, minimal number of other programs running)
vs Halo2curves
vs other libraries
vs Constantine: same speed (assembly is taken from Constantine)
Reproduce via (after installing the Nim programming language: https://nim-lang.org/):
```
git clone https://github.com/mratsim/constantine
cd constantine
CC=clang nimble bench_fp
```
vs BLST: from @chfast benches at https://github.com/ethereum/evmone/tree/d006d81/lib/evmmax, Constantine assembly is faster than BLST
vs Gnark: 27% less time taken
Reproduce via (after installing the Go programming language):
vs MCL: 35% less time taken
Reproduce via (yes .exe even on Unix):
Note this benches BN254 (from Nogami paper), BN_SNARKS_1 (our curve of interest), P256