
Field arithmetic benchmark + reworked assembly (+40% perf) #49

Merged: 6 commits, Jun 20, 2023

Conversation

@mratsim (Contributor) commented Jun 9, 2023

This PR does 3 things:

  1. Introduce a benchmark for field arithmetic. Ideally we would also report throughput in ops/s, but Criterion does not support that directly, so as a workaround we report "elements" processed per second with the element count set to 1. Feature request upstream: [Feat request] Add `Throughput::Operations` to report throughput in operations per second, bheisler/criterion.rs#692
  2. Remove ADCX where a plain ADC suffices.
  3. Rework field multiplication to use 2 carry chains via ADOX + ADCX. The assembly is extracted from Constantine at https://github.com/mratsim/constantine/blob/151f284/constantine/math/arithmetic/assembly/limbs_asm_mul_mont_x86_adx_bmi2.nim (see also the .S files for evmone / evmmax in this branch/folder: https://github.com/ethereum/evmone/tree/d006d81/lib/evmmax)
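The "elements per second" workaround from point 1 can be sketched as below. This is an illustrative sketch, not the PR's exact benchmark file: the `Fq` import path and the use of the `ff` and `rand_core` crates are assumptions, and only the `Throughput::Elements(1)` trick is the point being shown.

```rust
// Sketch of the Criterion workaround: Criterion has no Throughput::Operations,
// so we declare one "element" per iteration and read elem/s as ops/s.
// Crate/type names (halo2curves::bn256::Fq, ff::Field) are assumptions.
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use ff::Field;
use halo2curves::bn256::Fq;

fn bench_fq_mul(c: &mut Criterion) {
    let mut group = c.benchmark_group("fq");
    // One "element" processed per iteration, so the elem/s column in the
    // Criterion report is exactly field multiplications per second.
    group.throughput(Throughput::Elements(1));
    let a = Fq::random(rand_core::OsRng);
    let b = Fq::random(rand_core::OsRng);
    group.bench_function("mul", |bencher| bencher.iter(|| a * b));
    group.finish();
}

criterion_group!(benches, bench_fq_mul);
criterion_main!(benches);
```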

Benchmarks

Benchmark on a laptop: archlinux, i9-11980HK (8 cores, with hyperthreading and turbo, a minimal number of programs running)

vs Halo2curves

  • no assembly: 60M field mul per second
  • old assembly: 67M field mul per second
  • new assembly: 95M field mul per second, a 40% improvement

vs other libraries

  • vs Constantine: same speed (the assembly is taken from Constantine)
    Reproduce via (after installing the Nim programming language: https://nim-lang.org/):

    git clone https://github.com/mratsim/constantine
    cd constantine
    CC=clang nimble bench_fp

  • vs BLST: from @chfast's benches at https://github.com/ethereum/evmone/tree/d006d81/lib/evmmax, Constantine's assembly is faster than BLST

  • vs Gnark: 27% less time taken
    Reproduce via (after installing the Go programming language):

    git clone https://github.com/ConsenSys/gnark-crypto
    cd gnark-crypto/ecc/bn254/fp
    go test -bench=. --cpu 1 --run=^#

  • vs MCL: 35% less time taken
    Reproduce via (yes, .exe even on Unix):

    git clone https://github.com/herumi/mcl
    make bin/bn_test.exe
    bin/bn_test.exe

    Note this benches BN254 (from the Nogami paper), BN_SNARKS_1 (our curve of interest), and P256

@mratsim (Contributor, author) commented Jun 9, 2023

Analysis

The x86 architecture has several advantages over others for bigint arithmetic:

  • add-with-carry (which WASM, MIPS and RISC-V do not have)
  • 64x64 -> 128-bit multiplication in a single instruction (ARM64 needs a separate MUL/UMULH pair)
  • the ability to run 2 carry chains simultaneously (which no other architecture provides)
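The first two primitives above can be expressed in portable Rust using only the standard library; this sketch shows exactly what the assembly accelerates (one MULX per `mul_wide`, one ADC/ADCX/ADOX per `adc`). Function names are illustrative, not halo2curves' internal helpers.

```rust
/// Full 64x64 -> 128-bit product, split into (low, high) words.
/// On x86-64 this is a single MULX; via u128 it is portable everywhere.
fn mul_wide(a: u64, b: u64) -> (u64, u64) {
    let p = (a as u128) * (b as u128);
    (p as u64, (p >> 64) as u64)
}

/// Add with carry-in, returning (sum, carry-out); one ADC on x86-64.
fn adc(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let s = (a as u128) + (b as u128) + (carry as u128);
    (s as u64, (s >> 64) as u64)
}

fn main() {
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so low = 1 and high = 2^64 - 2.
    assert_eq!(mul_wide(u64::MAX, u64::MAX), (1, u64::MAX - 1));
    // u64::MAX + 1 + carry-in 1 wraps to 1 with carry-out 1.
    assert_eq!(adc(u64::MAX, 1, 1), (1, 1));
    println!("ok");
}
```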

The previous assembly did not use the 2 carry chains available through ADCX and ADOX, so the computation required 4 instructions instead of 3 per word: there is an extra adc with 0, an immediate 25% perf disadvantage. (Note that xor-ing a register with itself is free; see section 3.5.1.8 of the Intel optimization manual: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)

  • Old assembly
    // a1 * b0
    "mulx rcx, rax, r13",
    "add r9, rax",
    "adcx r10, rcx",
    "adc r11, 0",
    // a1 * b1
    "mulx rcx, rax, r14",
    "add r10, rax",
    "adcx r11, rcx",
    "adc r12, 0",
    "xor r13, r13",
    // a1 * b2
    "mulx rcx, rax, r15",
    "add r11, rax",
    "adcx r12, rcx",
    "adc r13, 0",
    "xor r14, r14",
    // a1 * b3
    "mulx rcx, rax, qword ptr [{a_ptr} + 24]",
    "add r12, rax",
    "adcx r13, rcx",
    "adc r14, 0",
  • New assembly
    (screenshot of the reworked dual-carry-chain code; see the assembly diff in this PR)
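For illustration, the dual-chain pattern looks roughly like the fragment below, written in the same asm-string style as the old code above. This is a sketch, not the PR's exact code, and the register choices are arbitrary: one xor clears both CF and OF for free, then each word costs one MULX, one ADCX (accumulating through the carry flag) and one ADOX (accumulating through the overflow flag), with no trailing `adc reg, 0`.

```
// Sketch only, not the PR's exact code: rdx holds the current a-limb;
// CF and OF carry two independent chains that never interfere.
"xor rax, rax",        // zeroes rax and clears both CF and OF for free
// a1 * b0
"mulx rcx, rax, r13",  // rcx:rax = rdx * r13
"adox r9, rax",        // low word accumulates through the OF chain
"adcx r10, rcx",       // high word accumulates through the CF chain
// a1 * b1
"mulx rcx, rax, r14",
"adox r10, rax",
"adcx r11, rcx",
// 3 instructions per word, no trailing "adc reg, 0"
```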

See also Intel whitepaper comparing:

  • MUL/ADD/ADC
  • MULX/ADD/ADC (requires Haswell, 2013)
  • MULX/ADCX/ADOX (requires Broadwell, 2014)


Literature

With MULX/ADCX/ADOX, the CIOS algorithm for field multiplication is the fastest. Additionally we use the "no-carry" optimization from the gnark authors, which is applicable to both Fp and Fr.

Coarsely Integrated Operand Scanning:

No-carry optimization
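For reference, the textbook CIOS algorithm can be sketched in portable Rust as below (the "no-carry" variant additionally merges the two inner loops when the modulus' top word leaves a spare bit; this plain version keeps them separate for clarity). The limb layout and helper names are illustrative, not halo2curves' API; `to_mont`/`add_mod` are deliberately naive helpers used only for checking.

```rust
const N: usize = 4;
// BN254 Fq modulus, little-endian 64-bit limbs.
const P: [u64; N] = [
    0x3c20_8c16_d87c_fd47,
    0x9781_6a91_6871_ca8d,
    0xb850_45b6_8181_585d,
    0x3064_4e72_e131_a029,
];

/// -p^(-1) mod 2^64, derived by Newton iteration instead of hard-coding it.
fn neg_inv(p0: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        // Each step doubles the number of correct low bits (1 -> 64).
        inv = inv.wrapping_mul(2u64.wrapping_sub(p0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// t + a*b + c as (low, high); the sum cannot overflow 128 bits.
fn mac(t: u64, a: u64, b: u64, c: u64) -> (u64, u64) {
    let r = (t as u128) + (a as u128) * (b as u128) + (c as u128);
    (r as u64, (r >> 64) as u64)
}

/// CIOS Montgomery multiplication: returns a*b*R^(-1) mod p, with R = 2^256.
fn mont_mul(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let inv = neg_inv(P[0]);
    let mut t = [0u64; N + 2];
    for i in 0..N {
        // Multiplication step: t += a * b[i].
        let mut c = 0u64;
        for j in 0..N {
            let (lo, hi) = mac(t[j], a[j], b[i], c);
            t[j] = lo;
            c = hi;
        }
        let (lo, hi) = mac(t[N], 0, 0, c);
        t[N] = lo;
        t[N + 1] = hi;
        // Reduction step: add m*p so the lowest word cancels, then shift down.
        let m = t[0].wrapping_mul(inv);
        let (_, mut c) = mac(t[0], m, P[0], 0);
        for j in 1..N {
            let (lo, hi) = mac(t[j], m, P[j], c);
            t[j - 1] = lo;
            c = hi;
        }
        let (lo, hi) = mac(t[N], 0, 0, c);
        t[N - 1] = lo;
        t[N] = t[N + 1] + hi;
    }
    // Conditional final subtraction of p.
    let mut r = [0u64; N];
    let mut borrow = 0u64;
    for j in 0..N {
        let (d, b1) = t[j].overflowing_sub(P[j]);
        let (d, b2) = d.overflowing_sub(borrow);
        r[j] = d;
        borrow = (b1 | b2) as u64;
    }
    if t[N] >= borrow { r } else { t[..N].try_into().unwrap() }
}

/// a + b mod p for reduced inputs (test-only helper).
fn add_mod(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let (mut s, mut carry) = ([0u64; N], 0u64);
    for j in 0..N {
        let (x, c1) = a[j].overflowing_add(b[j]);
        let (x, c2) = x.overflowing_add(carry);
        s[j] = x;
        carry = (c1 | c2) as u64;
    }
    let mut r = [0u64; N];
    let mut borrow = 0u64;
    for j in 0..N {
        let (d, b1) = s[j].overflowing_sub(P[j]);
        let (d, b2) = d.overflowing_sub(borrow);
        r[j] = d;
        borrow = (b1 | b2) as u64;
    }
    if carry >= borrow { r } else { s }
}

/// a * 2^256 mod p by 256 modular doublings (test-only, deliberately slow).
fn to_mont(mut a: [u64; N]) -> [u64; N] {
    for _ in 0..256 {
        let d = a;
        a = add_mod(&d, &d);
    }
    a
}

fn main() {
    // mont_mul(3R, 5R) = 3*5*R^2*R^(-1) = 15R mod p = to_mont(15).
    assert_eq!(
        mont_mul(&to_mont([3, 0, 0, 0]), &to_mont([5, 0, 0, 0])),
        to_mont([15, 0, 0, 0])
    );
    println!("ok");
}
```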

@han0110 han0110 self-requested a review June 11, 2023 14:40
@han0110 (Contributor) left a comment:

LGTM! Thank you for improving the assembly that much! Also thanks for adding the benchmark output with documents for reference!

I have checked that it's the same as the evmmax assembly (though some registers are different, but I guess that doesn't matter), and I also checked it line by line against the algorithm in the gnark team's document; no difference was spotted.

Another thing I realized is we could also replace the inv = const $inv with inv = in(reg) $inv in montgomery_reduce, just like mul in this PR; then we can get rid of the nightly requirement for the asm feature. Just benched locally and no performance difference was spotted.
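The `const` -> `in(reg)` change being suggested can be illustrated minimally as below. This is a hypothetical sketch, not the PR's code: at the time, `const` operands in `asm!` required nightly's asm_const feature, while `in(reg)` operands work on stable. The `mov` mnemonic below happens to be valid register-to-register syntax on both x86-64 and AArch64.

```rust
use std::arch::asm;

// Passing a value as a register operand instead of an inline constant:
// `i = in(reg) inv` needs no nightly feature, unlike `i = const INV`.
fn pass_through_reg(inv: u64) -> u64 {
    let out: u64;
    unsafe {
        asm!("mov {o}, {i}", o = out(reg) out, i = in(reg) inv);
    }
    out
}

fn main() {
    assert_eq!(pass_through_reg(42), 42);
    println!("ok");
}
```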

(review thread on src/bn256/assembly.rs, since resolved)
@han0110 han0110 requested a review from kilic June 12, 2023 15:20
@mratsim (Contributor, author) commented Jun 12, 2023

> I have checked that it's the same as the evmmax assembly (though some registers are different, but I guess that doesn't matter)

For some reason Rust reserves the rbx register, so I cannot use it.

It is probably related to these LLVM internal issues:

Apparently LLVM uses rbx as the base pointer for complex stack operations (I spotted "dynamic stack" a couple of times in those issues, and I assume AddressSanitizer might need that as well). But x86 already has a register for the base pointer, RBP, so ¯\_(ツ)_/¯

Also, some instructions like cpuid (used for CPU feature detection) need rbx, so they are broken in Rust.
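The standard workaround for the reserved rbx, when an instruction such as CPUID must write it, is to stash rbx in another register around the instruction; a pattern along these lines also appears in Rust's inline assembly documentation. The sketch below is guarded so it still runs on non-x86 targets (the fallback value is a placeholder, not real CPUID output).

```rust
// Save/restore rbx around CPUID, since Rust/LLVM reserves rbx in asm!.
#[cfg(target_arch = "x86_64")]
fn cpuid_ebx(leaf: u32) -> u32 {
    let ebx: u64;
    unsafe {
        std::arch::asm!(
            "mov {tmp}, rbx",  // save rbx: we may not clobber it
            "cpuid",           // writes eax, ebx, ecx, edx
            "xchg {tmp}, rbx", // restore rbx, keep cpuid's ebx in tmp
            tmp = out(reg) ebx,
            inout("eax") leaf => _,
            out("ecx") _,
            out("edx") _,
        );
    }
    ebx as u32
}

#[cfg(not(target_arch = "x86_64"))]
fn cpuid_ebx(_leaf: u32) -> u32 {
    0x756e_6547 // placeholder ("Genu") so the demo runs anywhere
}

fn main() {
    // Leaf 0's EBX holds the first 4 vendor-string bytes; never zero.
    assert_ne!(cpuid_ebx(0), 0);
    println!("ok");
}
```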

@han0110 (Contributor) commented Jun 13, 2023

> For some reason Rust reserves the rbx register, so I cannot use it

I see, I see. Thanks for the detailed explanation!

@mratsim (Contributor, author) commented Jun 13, 2023

> Another thing I realized is we could also replace the inv = const $inv with inv = in(reg) $inv in montgomery_reduce, just like mul in this PR; then we can get rid of the nightly requirement for the asm feature. Just benched locally and no performance difference was spotted.

Done

Is Montgomery reduce a bottleneck? I have seen it used only here: https://github.com/privacy-scaling-explorations/halo2curves/blob/21def8d/src/bn256/fq.rs#L263-L276

@han0110 (Contributor) commented Jun 13, 2023

> Is Montgomery reduce a bottleneck?

Ah, I don't think so; it's indeed only used for converting back to canonical form (and people who care about serialization performance tend to use the SerdeObject methods). Just wanted to make sure I didn't mess things up.

@jonathanpwang (Contributor) commented Jun 13, 2023

First of all, this is a really exciting PR!

Because I saw the mention of Montgomery reduce, I just wanted to note that it is used a bunch for witness generation in halo2 (maybe not so much after Zhenfei's PR with the table). More specifically, in halo2-ecc there is a lot of conversion from field element to BigUint that uses to_repr.

I don't know if this is the right place for it, but I found that it does help to have a montgomery_reduce_short function for the conversions: https://github.com/axiom-crypto/halo2/blob/bc03964c779c0c014dc08ae8b6c483c57e82a73c/arithmetic/curves/src/derive/field.rs#L435

Unfortunately I can't find where my bench was, but I just thought I'd make a note. I can try to make a separate PR later.

@mratsim (Contributor, author) commented Jun 13, 2023

@jonathanpwang

I was indeed asking because montgomery_reduce does not use the double carry chain; a similar optimization can be applied that brings reduction down to 7ns from 9ns on my machine.

Regarding montgomery_reduce_short: yes, I do use something similar, though I just propagated 1 through the classic Montgomery multiplication algorithm; I don't know if that makes a difference.

If there is interest in fast Montgomery reduction, I can make a separate PR. Otherwise I'll create an issue with the relevant links.

Note that to have the fastest pairings, and hence verifiers, you need to delay reductions in extension fields, for a ~20% perf improvement minimum.

@jonathanpwang (Contributor) commented:

I for one would be interested in faster Montgomery reduction!

@CPerezz (Member) left a comment:

That's an awesome PR! Thank you so much @mratsim not only for the code, but also for all of the documentation surrounding the PR.

LGTM!

@han0110 han0110 merged commit d8e4276 into privacy-scaling-explorations:main Jun 20, 2023
mratsim added a commit to taikoxyz/halo2curves that referenced this pull request Oct 25, 2023
…caling-explorations#49)

* add benchmarks for BN256

* small assembly changes: adcx -> adc

* rework mul assembly to use 2 carry chains

* remove need for nightly for asm by using register instead of constant

* remove unused regs

* run cargo fmt
@mratsim mentioned this pull request Oct 25, 2023
5 participants