Rework assembly to be compatible with LTO (#231)
* rework assembler register/mem and constraint declarations

* Introduce constraint UnmutatedPointerToWriteMem

* Create individual memory cell operands

* [Assembly] fully support indirect memory addressing

* fix calling convention for exported procs

* Prepare for switch to intel syntax to avoid clang constant propagation asm symbol name interfering OR pointer+offset addressing

* use modifiers to prevent bad string mixin fo assembler to linker of propagated consts

* Assembly: switch to intel syntax

* with working memory operand - now works with LTO on both GCC and clang and constant folding

* use memory operand in more places

* remove some inline now that we have lto

* cleanup compiler config and benches

* tracer shouldn't force dependencies when unused

* fix cc on linux

* nimble fixes

* update README [skip CI]

* update MacOS CI with Homebrew Clang

* oops nimble bindings disappeared

* more nimble fixes

* fix sha256 exported symbol

* improve constraints on modular addition

* Add extra constraint to force reloading of pointer in reg inputs

* Fix LLVM gold linker running out of registers

* workaround MinGW64 GCC 12.2 bad codegen in t_pairing_cyclotomic_subgroup with LTO
mratsim authored Apr 26, 2023
1 parent 9a71374 commit c6d9a21
Showing 49 changed files with 1,355 additions and 1,566 deletions.
32 changes: 29 additions & 3 deletions .github/workflows/ci.yml
@@ -25,6 +25,10 @@ jobs:
cpu: amd64
TEST_LANG: c
BACKEND: NO_ASM
- os: windows
cpu: amd64
TEST_LANG: c
BACKEND: ASM
- os: macos
cpu: amd64
TEST_LANG: c
@@ -172,7 +176,19 @@ jobs:
- name: Install test dependencies (macOS)
if: runner.os == 'macOS'
run: brew install gmp
run: |
brew install gmp
mkdir -p external/bin
cat << EOF > external/bin/clang
#!/bin/bash
exec $(brew --prefix llvm@15)/bin/clang "\$@"
EOF
cat << EOF > external/bin/clang++
#!/bin/bash
exec $(brew --prefix llvm@15)/bin/clang++ "\$@"
EOF
chmod 755 external/bin/{clang,clang++}
echo '${{ github.workspace }}/external/bin' >> $GITHUB_PATH
- name: Setup MSYS2 (Windows)
if: runner.os == 'Windows'
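The macOS step above writes wrapper scripts with an unquoted heredoc: `$(brew --prefix llvm@15)` is expanded when the wrapper is written, while the escaped `\$@` stays literal so the wrapper forwards its own arguments at run time. A minimal sketch of the same technique follows; the function name `make_wrapper` and the use of `/bin/echo` as a stand-in target are assumptions for illustration, not part of the workflow.

```shell
#!/bin/bash
# make_wrapper DIR NAME TARGET
# Write DIR/NAME, a script that execs TARGET with all of its arguments.
# (Hypothetical helper; the workflow inlines these commands instead.)
make_wrapper() {
  local dir="$1" name="$2" target="$3"
  mkdir -p "$dir"
  # Unquoted heredoc: $target expands now, \$@ is written as a literal "$@".
  cat << EOF > "$dir/$name"
#!/bin/bash
exec "$target" "\$@"
EOF
  chmod 755 "$dir/$name"
}

# Demo with /bin/echo standing in for the Homebrew compiler.
demo_dir=$(mktemp -d)
make_wrapper "$demo_dir" myecho /bin/echo
"$demo_dir/myecho" hello from wrapper
```

In the workflow, the directory holding the wrappers is then appended to `$GITHUB_PATH` so later steps resolve `clang` and `clang++` to the Homebrew binaries.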
@@ -210,16 +226,26 @@ jobs:
shell: bash
run: |
cd constantine
nimble bindings --verbose
nimble bindings_no_asm --verbose
nimble test_bindings --verbose
nimble test_parallel_no_asm --verbose
- name: Run Constantine tests (Windows with Assembly)
# So "test_bindings" uses C and can find GMP
# but nim-gmp cannot find GMP on Windows CI
if: runner.os == 'Windows' && matrix.target.BACKEND == 'ASM'
shell: msys2 {0}
run: |
cd constantine
nimble bindings --verbose
nimble test_bindings --verbose
nimble test_parallel_no_gmp --verbose
- name: Run Constantine tests (Windows no Assembly)
# So "test_bindings" uses C and can find GMP
# but nim-gmp cannot find GMP on Windows CI
if: runner.os == 'Windows' && matrix.target.BACKEND == 'NO_ASM'
shell: msys2 {0}
run: |
cd constantine
nimble bindings --verbose
nimble bindings_no_asm --verbose
nimble test_bindings --verbose
nimble test_parallel_no_gmp_no_asm --verbose
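The two Windows steps differ only in which nimble tasks they run for each `BACKEND` value. As a rough sketch, that mapping can be written as a single shell function; the function name `windows_tasks` is hypothetical, and the task lists are read from the diff above on the assumption that the `bindings_no_asm` line replaces the plain `bindings` line in the no-assembly step.

```shell
#!/bin/bash
# windows_tasks BACKEND
# Print, one per line, the nimble tasks the Windows CI steps run for a backend.
# (Hypothetical helper; the workflow spells these out as separate steps.)
windows_tasks() {
  case "$1" in
    ASM)
      printf '%s\n' bindings test_bindings test_parallel_no_gmp
      ;;
    NO_ASM)
      printf '%s\n' bindings_no_asm test_bindings test_parallel_no_gmp_no_asm
      ;;
    *)
      echo "unknown BACKEND: $1" >&2
      return 1
      ;;
  esac
}

# List the tasks for the assembly backend.
windows_tasks ASM
```

A driver loop could then run `nimble "$task" --verbose` for each printed task from inside the `constantine` directory.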
241 changes: 164 additions & 77 deletions README.md
@@ -25,9 +25,11 @@ The implementations are accompanied with SAGE code used as reference implementat
- [Table of Contents](#table-of-contents)
- [Target audience](#target-audience)
- [Protocols](#protocols)
- [Curves supported in the backend](#curves-supported-in-the-backend)
- [Installation](#installation)
- [Dependencies](#dependencies)
- [From C](#from-c)
- [From Nim](#from-nim)
- [Dependencies & Requirements](#dependencies--requirements)
- [Curves supported in the backend](#curves-supported-in-the-backend)
- [Security](#security)
- [Disclaimer](#disclaimer)
- [Security disclosure](#security-disclosure)
@@ -36,6 +38,7 @@ The implementations are accompanied with SAGE code used as reference implementat
- [In zero-knowledge proofs](#in-zero-knowledge-proofs)
- [Measuring performance](#measuring-performance)
- [BLS12_381 Clang + inline Assembly](#bls12_381-clang--inline-assembly)
- [Parallelism](#parallelism)
- [Why Nim](#why-nim)
- [Compiler caveats](#compiler-caveats)
- [Inline assembly](#inline-assembly)
@@ -67,69 +70,97 @@ Protocols to address these goals, (authenticated) encryption, signature, traitor
are designed.\
Note: some goals might be mutually exclusive, for example "plausible deniability" and "non-repudiation".

After [installation](#installation), the available high-level protocols are:
## Installation

- [x] Ethereum EVM precompiles on BN254_Snarks (also called alt_bn128 or bn256 in Ethereum)
### From C

`import constantine/ethereum_evm_precompiles`
- [x] BLS signature on BLS12-381 G2 as used in Ethereum 2.
Cryptographic suite: `BLS_SIG_BLS12381G2_XMD:SHA-256_SSWU_RO_POP_`
1. Install a C compiler, for example:
- Debian/Ubuntu `sudo apt update && sudo apt install build-essential`
- Archlinux `pacman -S base-devel`

This scheme is also used in the following blockchains:
Algorand, Chia, Dfinity, Filecoin, Tezos, Zcash.
They may have their pubkeys on G1 and signatures on G2 like Ethereum or the other way around.
2. Install Nim. It is available through most Linux distribution package managers and through Homebrew on MacOS.
Windows binaries are on the official website: https://nim-lang.org/install_unix.html
- Debian/Ubuntu `sudo apt install nim`
- Archlinux `pacman -S nim`

> Parameter discussion:
>
> As Ethereum validators' pubkeys are duplicated, stored and transmitted over and over in the protocol,
having them be as small as possible was important.
On the other hand, BLS signatures were first popularized due to their succinctness.
And having signatures on G1 is useful when short signatures are desired, in embedded systems for example.
- [x] SHA256 hash
- ...
3. Compile the bindings.
- Recommended: \
`CC=clang nimble bindings`
- or `nimble bindings_no_asm`\
to compile without assembly (otherwise it autodetects support)
- or with default compiler\
`nimble bindings`

## Curves supported in the backend
4. Ensure bindings work
- `nimble test_bindings`

_The backend, unlike protocols, is not public. Here be dragons._
5. Bindings location
- The bindings are put in `constantine/lib`
- The headers are in [constantine/include](./include) for example [Ethereum BLS signatures](./include/constantine_ethereum_bls_signatures.h)

At the moment the following curves are implemented, adding a new curve only requires adding the prime modulus
and its bitsize in [constantine/config/curves.nim](constantine/math/config/curves_declaration.nim).
6. Read the examples in [examples_c](./examples_c):
- Using the [Ethereum BLS signatures bindings from C](./examples_c/ethereum_bls_signatures.c)
- Testing Constantine BLS12-381 vs GMP [./examples_c/t_libctt_bls12_381.c](./examples_c/t_libctt_bls12_381.c)

The following curves are configured:
The bindings currently provided are:

- Pairing-Friendly curves
- BN254_Nogami
- BN254_Snarks (Zero-Knowledge Proofs, Snarks, Starks, Zcash, Ethereum 1)
- BLS12-377 (Zexe)
- BLS12-381 (Algorand, Chia Networks, Dfinity, Ethereum 2, Filecoin, Zcash Sapling)
- BW6-761 (Celo, EY Blockchain) (Pairings are WIP)\
BLS12-377 is embedded in BW6-761 for one layer proof composition in zk-SNARKS.
- Embedded curves
- Jubjub, a curve embedded in BLS12-381 scalar field to be used in zk-SNARKS circuits.
- Bandersnatch, a more efficient curve embedded in BLS12-381 scalar field to be used in zk-SNARKS circuits.
- Other curves
- Edwards25519, used in ed25519 and X25519 from TLS 1.3 protocol and the Signal protocol.

With Ristretto, it can be used in bulletproofs.
- The Pasta curves (Pallas and Vesta) for the Halo 2 proof system (Zcash).
- Ethereum BLS signatures on BLS12-381 G2
Cryptographic suite: `BLS_SIG_BLS12381G2_XMD:SHA-256_SSWU_RO_POP_`

This scheme is also used in the following blockchains:
Algorand, Chia, Dfinity, Filecoin, Tezos, Zcash.
They may have their pubkeys on G1 and signatures on G2 like Ethereum or the other way around.

## Installation
- BLS12-381 arithmetic:
- field arithmetic
- on Fr (i.e. modulo the 255-bit curve order)
- on Fp (i.e. modulo the 381-bit prime modulus)
- on Fp2
- elliptic curve arithmetic:
- on elliptic curve over Fp (EC G1) with affine, jacobian and homogeneous projective coordinates
- on elliptic curve over Fp2 (EC G2) with affine, jacobian and homogeneous projective coordinates
- currently not exposed: \
scalar multiplication, multi-scalar multiplications \
pairings and multi-pairings \
are implemented but not exposed
- _All operations are constant-time unless explicitly mentioned_ vartime

- The Pasta curves: Pallas and Vesta
- field arithmetic
- on Fr (i.e. modulo the 255-bit curve order)
- on Fp (i.e. modulo the 255-bit prime modulus)
- elliptic curve arithmetic:
- on elliptic curve over Fp (EC G1) with affine, jacobian and homogeneous projective coordinates
- currently not exposed: \
scalar multiplication, multi-scalar multiplications \
are implemented but not exposed
- _All operations are constant-time unless explicitly mentioned_ vartime

### From Nim

You can install the development version of the library through nimble with the following command:
```
nimble install https://github.com/mratsim/constantine@#master
```

For speed it is recommended to prefer Clang, MSVC or ICC over GCC (see [Compiler-caveats](#Compiler-caveats)).
## Dependencies & Requirements

Further if using GCC, GCC 7 at minimum is required, previous versions
generated incorrect add-with-carry code.
For speed it is recommended to use Clang (see [Compiler-caveats](#Compiler-caveats)).
In particular GCC generates inefficient add-with-carry code.

On x86-64, inline assembly is used to workaround compilers having issues optimizing large integer arithmetic,
and also ensure constant-time code.
Constantine requires at least:
- GCC 7 \
Previous versions generated incorrect add-with-carry code.
- Clang 14 \
On x86-64, inline assembly is used to workaround compilers having issues optimizing large integer arithmetic,
and also ensure constant-time code. \
Constantine uses the intel assembly syntax to address issues with the default AT&T syntax and constants propagated in Clang. \
Clang 14 added support for `-masm=intel`. \
\
On MacOS, Apple Clang does not support Intel assembly syntax; use Homebrew Clang instead or compile without assembly.\
_Note that Apple is discontinuing Intel CPUs throughout their product line, so this will only impact older models and the Mac Pro._

## Dependencies
On Windows, Constantine is tested with MinGW. The Microsoft Visual C++ Compiler is not configured.
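The compiler requirements above (Clang 14 or later, and not Apple Clang) can be checked before building with assembly. The following is a minimal sketch, not part of Constantine: the function name `check_clang` is hypothetical, and it assumes the first line of `clang --version` contains `clang version N...` and starts with `Apple` for Apple Clang. It only inspects the version string and does not test whether inline assembly actually compiles.

```shell
#!/bin/bash
# check_clang VERSION_LINE
# Classify the first line of `clang --version` against the stated requirement.
# Prints one of: ok, too-old, apple, unknown. (Hypothetical helper.)
check_clang() {
  local line="$1" major
  case "$line" in
    Apple*) echo "apple"; return 0 ;;
  esac
  # Extract the major version number that follows "clang version ".
  major=$(printf '%s\n' "$line" | sed -n 's/.*clang version \([0-9][0-9]*\).*/\1/p')
  if [ -z "$major" ]; then
    echo "unknown"
  elif [ "$major" -ge 14 ]; then
    echo "ok"
  else
    echo "too-old"
  fi
}

# Example with a literal version line.
check_clang "Homebrew clang version 15.0.7"
```

On a real machine the argument would be `"$(clang --version | head -n 1)"`; any result other than `ok` suggests falling back to `nimble bindings_no_asm` or switching compiler.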

Constantine has no dependencies, even on Nim standard library except:
- for testing
@@ -144,6 +175,30 @@ Constantine has no dependencies, even on Nim standard library except:
- at compile-time
- we need the std/macros library to generate Nim code.

## Curves supported in the backend

_The backend, unlike protocols, is not public. Here be dragons._

At the moment the following curves are implemented, adding a new curve only requires adding the prime modulus
and its bitsize in [constantine/config/curves.nim](constantine/math/config/curves_declaration.nim).

The following curves are configured:

- Pairing-Friendly curves
- BN254_Nogami
- BN254_Snarks (Zero-Knowledge Proofs, Snarks, Starks, Zcash, Ethereum 1)
- BLS12-377 (Zexe)
- BLS12-381 (Algorand, Chia Networks, Dfinity, Ethereum 2, Filecoin, Zcash Sapling)
- BW6-761 (Celo, EY Blockchain) (Pairings are WIP)\
BLS12-377 is embedded in BW6-761 for one layer proof composition in zk-SNARKS.
- Embedded curves
- Jubjub, a curve embedded in BLS12-381 scalar field to be used in zk-SNARKS circuits.
- Bandersnatch, a more efficient curve embedded in BLS12-381 scalar field to be used in zk-SNARKS circuits.
- Other curves
- Edwards25519, used in ed25519 and X25519 from TLS 1.3 protocol and the Signal protocol. \
With Ristretto, it can be used in bulletproofs.
- The Pasta curves (Pallas and Vesta) for the Halo 2 proof system (Zcash).

## Security

Hardening an implementation against all existing and upcoming attack vectors is an extremely complex task.
@@ -217,47 +272,79 @@ To measure the performance of Constantine

```bash
git clone https://github.com/mratsim/constantine
nimble bench_fp # Using default compiler + Assembly
nimble bench_fp_clang # Using Clang + Assembly (recommended)
nimble bench_fp_gcc # Using GCC + Assembly (decent)
nimble bench_fp_clang_noasm # Using Clang only (acceptable)
nimble bench_fp_gcc # Using GCC only (slowest)
nimble bench_fp2
# ...
nimble bench_ec_g1_clang
nimble bench_ec_g2_clang
nimble bench_pairing_bn254_nogami_clang
nimble bench_pairing_bn254_snarks_clang
nimble bench_pairing_bls12_377_clang
nimble bench_pairing_bls12_381_clang

# Default compiler
nimble bench_fp

# Arithmetic
CC=clang nimble bench_fp # Using Clang + Assembly (recommended)
CC=clang nimble bench_fp2
CC=clang nimble bench_fp12

# Scalar multiplication and pairings
CC=clang nimble bench_ec_g1_scalar_mul
CC=clang nimble bench_ec_g2_scalar_mul
CC=clang nimble bench_pairing_bls12_381

# And per-curve summaries
nimble bench_summary_bn254_nogami_clang
nimble bench_summary_bn254_snarks_clang
nimble bench_summary_bls12_377_clang
nimble bench_summary_bls12_381_clang
CC=clang nimble bench_summary_bn254_nogami
CC=clang nimble bench_summary_bn254_snarks
CC=clang nimble bench_summary_bls12_377
CC=clang nimble bench_summary_bls12_381

# The Ethereum BLS signature protocol
CC=clang nimble bench_ethereum_bls_signatures

# Multi-scalar multiplication
CC=clang nimble bench_ec_g1_msm_bls12_381
CC=clang nimble bench_ec_g1_msm_bn256_snarks
```

The full list of benchmarks is available in the [`benchmarks`](./benchmarks) folder.

As mentioned in the [Compiler caveats](#compiler-caveats) section, GCC is up to 2x slower than Clang due to mishandling of carries and register usage.

#### BLS12_381 (Clang + inline Assembly)

On my machine i9-11980HK (8 cores 2.6GHz, turbo 5GHz), for Clang + Assembly, **all being constant-time** (including scalar multiplication, square root and inversion).

#### BLS12_381 (Clang + inline Assembly)
![BLS12-381 perf summary](./media/bls12_381_perf_summary_i9-11980HK.png)

```
--------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul 255-bit G1 ECP_ShortW_Prj[Fp[BLS12_381]] 16086.740 ops/s 62163 ns/op 205288 CPU cycles (approx)
EC ScalarMul 255-bit G1 ECP_ShortW_Jac[Fp[BLS12_381]] 16670.834 ops/s 59985 ns/op 198097 CPU cycles (approx)
EC ScalarMul 255-bit G2 ECP_ShortW_Prj[Fp2[BLS12_381]] 8333.403 ops/s 119999 ns/op 396284 CPU cycles (approx)
EC ScalarMul 255-bit G2 ECP_ShortW_Jac[Fp2[BLS12_381]] 9300.682 ops/s 107519 ns/op 355071 CPU cycles (approx)
--------------------------------------------------------------------------------------------------------------------------------------------------------
Miller Loop BLS12 BLS12_381 5102.223 ops/s 195993 ns/op 647251 CPU cycles (approx)
Final Exponentiation BLS12 BLS12_381 4209.109 ops/s 237580 ns/op 784588 CPU cycles (approx)
Pairing BLS12 BLS12_381 2343.045 ops/s 426795 ns/op 1409453 CPU cycles (approx)
--------------------------------------------------------------------------------------------------------------------------------------------------------
Hash to G2 (Draft #11) BLS12_381 6558.495 ops/s 152474 ns/op 503531 CPU cycles (approx)
--------------------------------------------------------------------------------------------------------------------------------------------------------
```
![BLS12-381 Multi-Scalar multiplication 1](./media/bls12_381_msm_i9-11980HK-8cores_1.png)
![BLS12-381 Multi-Scalar multiplication 2](./media/bls12_381_msm_i9-11980HK-8cores_2.png)
![BLS12-381 Multi-Scalar multiplication 3](./media/bls12_381_msm_i9-11980HK-8cores_3.png)

On an i9-9980XE (18 cores, watercooled, overclocked, 4.1GHz all-core turbo)

![BN254-Snarks multi-scalar multiplication](./media/bn254_snarks_msm-i9-9980XE-18cores.png)

#### Parallelism

Constantine multithreaded primitives are powered by a highly tuned threadpool and stress-tested for:
- scheduler overhead
- load balancing with extreme imbalance
- nested data parallelism
- contention
- speculative/conditional parallelism

and provides the following paradigms:
- Future-based task-parallelism
- Data parallelism (nestable and awaitable for loops)
- including arbitrary parallel reductions
- Dataflow parallelism / Stream parallelism / Graph Parallelism / Pipeline parallelism
- Structured Parallelism

The threadpool parallel-for loops use lazy loop splitting and are fully adaptive to the workload being scheduled, the threads' in-flight load and the hardware speed, unlike most (all?) runtimes, see:
- OpenMP woes depending on hardware and workload: https://github.com/zy97140/omp-benchmark-for-pytorch
- Raytracing ideal runtime, adapt to pixel compute load: ![load distribution](./media/parallel_load_distribution.png)\
Most (all?) production runtimes use scheduling A (split on number of threads like GCC OpenMP) or B (eager splitting, unable to adapt to actual work like LLVM/Intel OpenMP or Intel TBB) while Constantine uses C.

The threadpool provides an efficient backoff strategy to conserve power, based on:
- eventcounts / futexes, for low overhead backoff
- log-log iterated backoff, a provably optimal backoff strategy used for wireless communication to minimize communication in parallel for-loops

The research papers on high performance multithreading are available in the Weave repo: https://github.com/mratsim/weave/tree/7682784/research.\
_Note: The threadpool is not backed by Weave but by an inspired runtime that has been significantly simplified for ease of auditing. In particular it uses shared-memory based work-stealing instead of channel-based work-requesting for load balancing as distributed computing is not a target, ..., yet._

## Why Nim

2 changes: 1 addition & 1 deletion benchmarks/bench_blueprint.nim
@@ -60,7 +60,7 @@ echo " release: ", defined(release)
echo " danger: ", defined(danger)
echo " inline assembly: ", UseASM_X86_64

when (sizeof(int) == 4) or defined(Constantine32):
when (sizeof(int) == 4) or defined(Ctt32):
echo "⚠️ Warning: using Constantine with 32-bit limbs"
else:
echo "Using Constantine with 64-bit limbs"
2 changes: 1 addition & 1 deletion benchmarks/bench_fp_double_precision.nim
@@ -61,7 +61,7 @@ echo " release: ", defined(release)
echo " danger: ", defined(danger)
echo " inline assembly: ", UseASM_X86_64

when (sizeof(int) == 4) or defined(Constantine32):
when (sizeof(int) == 4) or defined(Ctt32):
echo "⚠️ Warning: using Constantine with 32-bit limbs"
else:
echo "Using Constantine with 64-bit limbs"
2 changes: 1 addition & 1 deletion benchmarks/bench_sha256.nim
@@ -33,7 +33,7 @@ else:
proc SHA256[T: byte|char](
msg: openarray[T],
digest: ptr array[32, byte] = nil
): ptr array[32, byte] {.cdecl, dynlib: DLLSSLName, importc.}
): ptr array[32, byte] {.noconv, dynlib: DLLSSLName, importc.}

proc SHA256_OpenSSL[T: byte|char](
digest: var array[32, byte],