
Field arithmetic benchmark + reworked assembly (+40% perf) #49

Merged: 6 commits, Jun 20, 2023

Conversation

@mratsim (Contributor) commented Jun 9, 2023

This PR does 3 things:

  1. Introduce a benchmark for field arithmetic. Ideally we would also report throughput in ops/s, but Criterion does not support that directly, so as a workaround we report "elements" processed per second with the element count set to 1. Feature request upstream: [Feat request] Add `Throughput::Operations` to report throughput in operations per second, bheisler/criterion.rs#692
  2. Remove ADCX where a plain ADC suffices.
  3. Rework field multiplication to use 2 carry chains via ADOX + ADCX. The assembly is extracted from Constantine at https://github.com/mratsim/constantine/blob/151f284/constantine/math/arithmetic/assembly/limbs_asm_mul_mont_x86_adx_bmi2.nim (see also the .S files for evmone / evmmax in this branch/folder: https://github.com/ethereum/evmone/tree/d006d81/lib/evmmax)
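The "elements per second" workaround from point 1 can be sketched as below. This is an illustrative sketch, not the PR's exact benchmark file: the `Fq` import path and the use of the `ff` and `rand_core` crates are assumptions, and only the `Throughput::Elements(1)` trick is the point being shown.

```rust
// Sketch of the Criterion workaround: Criterion has no Throughput::Operations,
// so we declare one "element" per iteration and read elem/s as ops/s.
// Crate/type names (halo2curves::bn256::Fq, ff::Field) are assumptions.
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use ff::Field;
use halo2curves::bn256::Fq;

fn bench_fq_mul(c: &mut Criterion) {
    let mut group = c.benchmark_group("fq");
    // One "element" processed per iteration, so the elem/s column in the
    // Criterion report is exactly field multiplications per second.
    group.throughput(Throughput::Elements(1));
    let a = Fq::random(rand_core::OsRng);
    let b = Fq::random(rand_core::OsRng);
    group.bench_function("mul", |bencher| bencher.iter(|| a * b));
    group.finish();
}

criterion_group!(benches, bench_fq_mul);
criterion_main!(benches);
```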

Benchmarks

Benchmark on a laptop: archlinux, i9-11980HK (8 cores, with hyperthreading and turbo, a minimal number of programs running)

vs Halo2curves

  • no assembly: 60M field mul per second
  • old assembly: 67M field mul per second
  • new assembly: 95M field mul per second, a 40% improvement

vs other libraries

  • vs Constantine: same speed (the assembly is taken from Constantine)
    Reproduce via (after installing the Nim programming language: https://nim-lang.org/):

    git clone https://github.com/mratsim/constantine
    cd constantine
    CC=clang nimble bench_fp

  • vs BLST: from @chfast's benches at https://github.com/ethereum/evmone/tree/d006d81/lib/evmmax, Constantine's assembly is faster than BLST

  • vs Gnark: 27% less time taken
    Reproduce via (after installing the Go programming language):

    git clone https://github.com/ConsenSys/gnark-crypto
    cd gnark-crypto/ecc/bn254/fp
    go test -bench=. --cpu 1 --run=^#

  • vs MCL: 35% less time taken
    Reproduce via (yes, .exe even on Unix):

    git clone https://github.com/herumi/mcl
    make bin/bn_test.exe
    bin/bn_test.exe

    Note this benches BN254 (from the Nogami paper), BN_SNARKS_1 (our curve of interest), and P256

@mratsim (Contributor, author) commented Jun 9, 2023

Analysis

The x86 architecture has several advantages over others for bigint arithmetic:

  • add-with-carry (which WASM, MIPS and RISC-V do not have)
  • 64x64 -> 128-bit multiplication in a single instruction (ARM64 needs a separate MUL/UMULH pair)
  • the ability to run 2 carry chains simultaneously (which no other architecture provides)
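The first two primitives above can be expressed in portable Rust using only the standard library; this sketch shows exactly what the assembly accelerates (one MULX per `mul_wide`, one ADC/ADCX/ADOX per `adc`). Function names are illustrative, not halo2curves' internal helpers.

```rust
/// Full 64x64 -> 128-bit product, split into (low, high) words.
/// On x86-64 this is a single MULX; via u128 it is portable everywhere.
fn mul_wide(a: u64, b: u64) -> (u64, u64) {
    let p = (a as u128) * (b as u128);
    (p as u64, (p >> 64) as u64)
}

/// Add with carry-in, returning (sum, carry-out); one ADC on x86-64.
fn adc(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let s = (a as u128) + (b as u128) + (carry as u128);
    (s as u64, (s >> 64) as u64)
}

fn main() {
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so low = 1 and high = 2^64 - 2.
    assert_eq!(mul_wide(u64::MAX, u64::MAX), (1, u64::MAX - 1));
    // u64::MAX + 1 + carry-in 1 wraps to 1 with carry-out 1.
    assert_eq!(adc(u64::MAX, 1, 1), (1, 1));
    println!("ok");
}
```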

The previous assembly did not use the 2 carry chains available through ADCX and ADOX, so the computation required 4 instructions instead of 3 per word: there is an extra adc with 0, an immediate 25% perf disadvantage. (Note that xor-ing a register with itself is free; see section 3.5.1.8 of the Intel optimization manual: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)

  • Old assembly
    // a1 * b0
    "mulx rcx, rax, r13",
    "add r9, rax",
    "adcx r10, rcx",
    "adc r11, 0",
    // a1 * b1
    "mulx rcx, rax, r14",
    "add r10, rax",
    "adcx r11, rcx",
    "adc r12, 0",
    "xor r13, r13",
    // a1 * b2
    "mulx rcx, rax, r15",
    "add r11, rax",
    "adcx r12, rcx",
    "adc r13, 0",
    "xor r14, r14",
    // a1 * b3
    "mulx rcx, rax, qword ptr [{a_ptr} + 24]",
    "add r12, rax",
    "adcx r13, rcx",
    "adc r14, 0",
  • New assembly
    (screenshot of the reworked dual-carry-chain code; see the assembly diff in this PR)
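For illustration, the dual-chain pattern looks roughly like the fragment below, written in the same asm-string style as the old code above. This is a sketch, not the PR's exact code, and the register choices are arbitrary: one xor clears both CF and OF for free, then each word costs one MULX, one ADCX (accumulating through the carry flag) and one ADOX (accumulating through the overflow flag), with no trailing `adc reg, 0`.

```
// Sketch only, not the PR's exact code: rdx holds the current a-limb;
// CF and OF carry two independent chains that never interfere.
"xor rax, rax",        // zeroes rax and clears both CF and OF for free
// a1 * b0
"mulx rcx, rax, r13",  // rcx:rax = rdx * r13
"adox r9, rax",        // low word accumulates through the OF chain
"adcx r10, rcx",       // high word accumulates through the CF chain
// a1 * b1
"mulx rcx, rax, r14",
"adox r10, rax",
"adcx r11, rcx",
// 3 instructions per word, no trailing "adc reg, 0"
```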

See also Intel whitepaper comparing:

  • MUL/ADD/ADC
  • MULX/ADD/ADC (requires Haswell, 2013)
  • MULX/ADCX/ADOX (requires Broadwell, 2014)


Literature

With MULX/ADCX/ADOX, the CIOS algorithm for field multiplication is the fastest. Additionally we use the "no-carry" optimization from the gnark authors, which is applicable to both Fp and Fr.

Coarsely Integrated Operand Scanning:

No-carry optimization
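For reference, the textbook CIOS algorithm can be sketched in portable Rust as below (the "no-carry" variant additionally merges the two inner loops when the modulus' top word leaves a spare bit; this plain version keeps them separate for clarity). The limb layout and helper names are illustrative, not halo2curves' API; `to_mont`/`add_mod` are deliberately naive helpers used only for checking.

```rust
const N: usize = 4;
// BN254 Fq modulus, little-endian 64-bit limbs.
const P: [u64; N] = [
    0x3c20_8c16_d87c_fd47,
    0x9781_6a91_6871_ca8d,
    0xb850_45b6_8181_585d,
    0x3064_4e72_e131_a029,
];

/// -p^(-1) mod 2^64, derived by Newton iteration instead of hard-coding it.
fn neg_inv(p0: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        // Each step doubles the number of correct low bits (1 -> 64).
        inv = inv.wrapping_mul(2u64.wrapping_sub(p0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// t + a*b + c as (low, high); the sum cannot overflow 128 bits.
fn mac(t: u64, a: u64, b: u64, c: u64) -> (u64, u64) {
    let r = (t as u128) + (a as u128) * (b as u128) + (c as u128);
    (r as u64, (r >> 64) as u64)
}

/// CIOS Montgomery multiplication: returns a*b*R^(-1) mod p, with R = 2^256.
fn mont_mul(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let inv = neg_inv(P[0]);
    let mut t = [0u64; N + 2];
    for i in 0..N {
        // Multiplication step: t += a * b[i].
        let mut c = 0u64;
        for j in 0..N {
            let (lo, hi) = mac(t[j], a[j], b[i], c);
            t[j] = lo;
            c = hi;
        }
        let (lo, hi) = mac(t[N], 0, 0, c);
        t[N] = lo;
        t[N + 1] = hi;
        // Reduction step: add m*p so the lowest word cancels, then shift down.
        let m = t[0].wrapping_mul(inv);
        let (_, mut c) = mac(t[0], m, P[0], 0);
        for j in 1..N {
            let (lo, hi) = mac(t[j], m, P[j], c);
            t[j - 1] = lo;
            c = hi;
        }
        let (lo, hi) = mac(t[N], 0, 0, c);
        t[N - 1] = lo;
        t[N] = t[N + 1] + hi;
    }
    // Conditional final subtraction of p.
    let mut r = [0u64; N];
    let mut borrow = 0u64;
    for j in 0..N {
        let (d, b1) = t[j].overflowing_sub(P[j]);
        let (d, b2) = d.overflowing_sub(borrow);
        r[j] = d;
        borrow = (b1 | b2) as u64;
    }
    if t[N] >= borrow { r } else { t[..N].try_into().unwrap() }
}

/// a + b mod p for reduced inputs (test-only helper).
fn add_mod(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let (mut s, mut carry) = ([0u64; N], 0u64);
    for j in 0..N {
        let (x, c1) = a[j].overflowing_add(b[j]);
        let (x, c2) = x.overflowing_add(carry);
        s[j] = x;
        carry = (c1 | c2) as u64;
    }
    let mut r = [0u64; N];
    let mut borrow = 0u64;
    for j in 0..N {
        let (d, b1) = s[j].overflowing_sub(P[j]);
        let (d, b2) = d.overflowing_sub(borrow);
        r[j] = d;
        borrow = (b1 | b2) as u64;
    }
    if carry >= borrow { r } else { s }
}

/// a * 2^256 mod p by 256 modular doublings (test-only, deliberately slow).
fn to_mont(mut a: [u64; N]) -> [u64; N] {
    for _ in 0..256 {
        let d = a;
        a = add_mod(&d, &d);
    }
    a
}

fn main() {
    // mont_mul(3R, 5R) = 3*5*R^2*R^(-1) = 15R mod p = to_mont(15).
    assert_eq!(
        mont_mul(&to_mont([3, 0, 0, 0]), &to_mont([5, 0, 0, 0])),
        to_mont([15, 0, 0, 0])
    );
    println!("ok");
}
```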

@han0110 han0110 self-requested a review June 11, 2023 14:40
@han0110 (Contributor) left a comment:

LGTM! Thank you for improving the assembly that much! Also thanks for adding the benchmark output with documents for reference!

I have checked that it's the same as the evmmax assembly (though some registers are different, but I guess that doesn't matter), and I also checked it line by line against the algorithm in the gnark team's document; no difference was spotted.

Another thing I realized is we could also replace the inv = const $inv with inv = in(reg) $inv in montgomery_reduce, just like mul in this PR; then we can get rid of the nightly requirement for the asm feature. Just benched locally and no performance difference was spotted.
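The `const` -> `in(reg)` change being suggested can be illustrated minimally as below. This is a hypothetical sketch, not the PR's code: at the time, `const` operands in `asm!` required nightly's asm_const feature, while `in(reg)` operands work on stable. The `mov` mnemonic below happens to be valid register-to-register syntax on both x86-64 and AArch64.

```rust
use std::arch::asm;

// Passing a value as a register operand instead of an inline constant:
// `i = in(reg) inv` needs no nightly feature, unlike `i = const INV`.
fn pass_through_reg(inv: u64) -> u64 {
    let out: u64;
    unsafe {
        asm!("mov {o}, {i}", o = out(reg) out, i = in(reg) inv);
    }
    out
}

fn main() {
    assert_eq!(pass_through_reg(42), 42);
    println!("ok");
}
```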

(review thread on src/bn256/assembly.rs, since resolved)
@han0110 han0110 requested a review from kilic June 12, 2023 15:20
@mratsim (Contributor, author) commented Jun 12, 2023

> I have checked that it's the same as the evmmax assembly (though some registers are different, but I guess that doesn't matter)

For some reason Rust reserves the rbx register, so I cannot use it.

It is probably related to these LLVM internal issues:

Apparently LLVM uses rbx as the base pointer for complex stack operations (I spotted "dynamic stack" a couple of times in those issues, and I assume AddressSanitizer might need that as well). But x86 already has a register for the base pointer, RBP, so ¯\_(ツ)_/¯

Also, some instructions like cpuid (used for CPU feature detection) need rbx, so they are broken in Rust.
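The standard workaround for the reserved rbx, when an instruction such as CPUID must write it, is to stash rbx in another register around the instruction; a pattern along these lines also appears in Rust's inline assembly documentation. The sketch below is guarded so it still runs on non-x86 targets (the fallback value is a placeholder, not real CPUID output).

```rust
// Save/restore rbx around CPUID, since Rust/LLVM reserves rbx in asm!.
#[cfg(target_arch = "x86_64")]
fn cpuid_ebx(leaf: u32) -> u32 {
    let ebx: u64;
    unsafe {
        std::arch::asm!(
            "mov {tmp}, rbx",  // save rbx: we may not clobber it
            "cpuid",           // writes eax, ebx, ecx, edx
            "xchg {tmp}, rbx", // restore rbx, keep cpuid's ebx in tmp
            tmp = out(reg) ebx,
            inout("eax") leaf => _,
            out("ecx") _,
            out("edx") _,
        );
    }
    ebx as u32
}

#[cfg(not(target_arch = "x86_64"))]
fn cpuid_ebx(_leaf: u32) -> u32 {
    0x756e_6547 // placeholder ("Genu") so the demo runs anywhere
}

fn main() {
    // Leaf 0's EBX holds the first 4 vendor-string bytes; never zero.
    assert_ne!(cpuid_ebx(0), 0);
    println!("ok");
}
```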

@han0110 (Contributor) commented Jun 13, 2023

> For some reason Rust reserves the rbx register, so I cannot use it

I see, I see. Thanks for the detailed explanation!

@mratsim (Contributor, author) commented Jun 13, 2023

> Another thing I realized is we could also replace the inv = const $inv with inv = in(reg) $inv in montgomery_reduce, just like mul in this PR; then we can get rid of the nightly requirement for the asm feature. Just benched locally and no performance difference was spotted.

Done

Is Montgomery reduce a bottleneck? I have seen it used only here: https://github.com/privacy-scaling-explorations/halo2curves/blob/21def8d/src/bn256/fq.rs#L263-L276

@han0110 (Contributor) commented Jun 13, 2023

> Is Montgomery reduce a bottleneck?

Ah, I don't think so; it's indeed only used for converting back to canonical form (and people who care about serialization performance tend to use the SerdeObject methods). Just wanted to make sure I didn't mess things up.

@jonathanpwang (Contributor) commented Jun 13, 2023

First of all, this is a really exciting PR!

Because I saw the mention of Montgomery reduce, I just wanted to note that it is used a bunch for witness generation in halo2 (maybe not so much after Zhenfei's PR with the table). More specifically, in halo2-ecc there is a lot of conversion from field element to BigUint that uses to_repr.

I don't know if this is the right place for it, but I found that it does help to have a montgomery_reduce_short function for the conversions: https://github.com/axiom-crypto/halo2/blob/bc03964c779c0c014dc08ae8b6c483c57e82a73c/arithmetic/curves/src/derive/field.rs#L435

Unfortunately I can't find where my bench was, but I just thought I'd make a note. I can try to make a separate PR later.

@mratsim (Contributor, author) commented Jun 13, 2023

@jonathanpwang

I was indeed asking because montgomery_reduce does not use the double carry chain; a similar optimization can be applied that brings reduction down to 7ns from 9ns on my machine.

Regarding montgomery_reduce_short: yes, I do use something similar, though I just propagated 1 through the classic Montgomery multiplication algorithm; I don't know if that makes a difference.

If there is interest in fast Montgomery reduction, I can make a separate PR. Otherwise I'll create an issue with the relevant links.

Note that to have the fastest pairings, and hence verifiers, you need to delay reductions in extension fields, for a ~20% perf improvement minimum.

@jonathanpwang (Contributor) commented:

I for one would be interested in faster Montgomery reduction!

@CPerezz (Member) left a comment:

That's an awesome PR! Thank you so much @mratsim not only for the code, but also for all of the documentation surrounding the PR.

LGTM!

@han0110 han0110 merged commit d8e4276 into privacy-scaling-explorations:main Jun 20, 2023
mratsim added a commit to taikoxyz/halo2curves that referenced this pull request Oct 25, 2023
…caling-explorations#49)

* add benchmarks for BN256

* small assembly changes: adcx -> adc

* rework mul assembly to use 2 carry chains

* remove need for nightly for asm by using register instead of constant

* remove unused regs

* run cargo fmt
@mratsim mentioned this pull request Oct 25, 2023
5 participants