## Overview of Constantine assembly backend

Constantine is now complete in terms of elliptic curve cryptography primitives.
It provides constant-time:
- scalar field (Fr) and prime field (Fp) arithmetic
- extension fields
- elliptic curve arithmetic over prime and extension fields
  - for short Weierstrass curves in affine, projective and Jacobian coordinates
  - for twisted Edwards curves in projective coordinates
- pairings
- hashing to elliptic curves
For high performance, and also to guarantee the absence of branches depending on secret data, Constantine uses a domain-specific language (DSL) that emits ISA-specific inline assembly, for example for x86 constant-time conditional copy (constantine/constantine/math/arithmetic/assembly/limbs_asm_x86.nim, lines 25 to 57 in c6d9a21):
```nim
let
  a = asmArray(a_PIR, N, PointerInReg, asmInputOutputEarlyClobber, memIndirect = memReadWrite) # MemOffsettable is the better constraint but compilers say it is impossible. Use early clobber to ensure it is not affected by constant propagation at slight pessimization (reloading it).
  b = asmArray(b_MEM, N, MemOffsettable, asmInput)

  control = asmValue(ctl, Reg, asmInput)

  t0Sym = ident"t0"
  t1Sym = ident"t1"

var # Swappable registers to break dependency chains
  t0 = asmValue(t0Sym, Reg, asmOutputEarlyClobber)
  t1 = asmValue(t1Sym, Reg, asmOutputEarlyClobber)

# Prologue
result.add quote do:
  var `t0Sym`{.noinit.}, `t1Sym`{.noinit.}: BaseType

# Algorithm
ctx.test control, control
for i in 0 ..< N:
  ctx.mov t0, a[i]
  ctx.cmovnz t0, b[i]
  ctx.mov a[i], t0
  swap(t0, t1)

# Codegen
result.add ctx.generate()
```
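For readers less familiar with GCC extended inline assembly, here is a hedged, simplified C sketch of roughly the kind of code such a generator emits for N = 2. It is not the DSL's literal output: in particular, `b` is passed here as a plain pointer in a register, whereas the DSL uses an offsettable memory constraint for it, and the function name `ccopy2` is ours.

```c
#include <stdint.h>

// Constant-time conditional copy: a <- b if ctl != 0, for 2 limbs.
// Sketch only; mirrors the test/cmov/mov pattern generated by the DSL above.
static inline void ccopy2(uint64_t a[2], const uint64_t b[2], uint64_t ctl) {
  uint64_t t0, t1;
  __asm__ volatile(
    "test %[ctl], %[ctl]    \n\t"   // ZF = (ctl == 0)
    "movq 0(%[pa]), %[t0]   \n\t"
    "cmovnz 0(%[pb]), %[t0] \n\t"   // t0 = b[0] if ctl != 0
    "movq %[t0], 0(%[pa])   \n\t"
    "movq 8(%[pa]), %[t1]   \n\t"
    "cmovnz 8(%[pb]), %[t1] \n\t"   // t1 = b[1] if ctl != 0
    "movq %[t1], 8(%[pa])   \n\t"
    : [t0] "=&r"(t0), [t1] "=&r"(t1),
      [pa] "+&r"(a)                 // early-clobber read-write pointer, as in the DSL
    : [pb] "r"(b), [ctl] "r"(ctl)
    : "cc", "memory");
}
```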
The DSL solves the pitfalls of https://gcc.gnu.org/wiki/DontUseInlineAsm (it deals with constraints, clobbers, size suffixes, memory addressing and register reuse, declares arrays, can be used in loops and can even insert comments). Not using inline assembly on x86 leaves up to 70% of the performance on the table, even on something as simple as multi-precision addition with dedicated intrinsics, something the GMP team raised with GCC ages ago.
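As a point of comparison, here is a minimal C sketch of the intrinsics route mentioned above (the function name `add4` and the fixed 4-limb width are illustrative, not Constantine code). Even with the carry made explicit through `_addcarry_u64`, compilers have historically struggled to keep the carry in the flags register across limbs, which is where the gap versus a handwritten add/adc chain comes from.

```c
#include <immintrin.h>   // _addcarry_u64 on GCC/Clang (MSVC: <intrin.h>)

// 4-limb addition with the dedicated add-with-carry intrinsic.
// An optimal x86-64 sequence is a single add followed by three adc.
void add4(unsigned long long r[4],
          const unsigned long long a[4],
          const unsigned long long b[4]) {
  unsigned char carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &r[i]);
  (void)carry;  // the final carry out is ignored in this sketch
}
```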
## Current limitations

However, while building #228, we started to see cracks, especially with LTO, that required:

- tagging assembly procedures noInline (but they are in the hot path, and some, like conditional copy or addition, are smaller than the push/pop of registers plus the function call)
- or disabling LTO
The key issues were:
- When tagging an input with a register constraint, constant propagation can hardcode it. This can happen for example with the address of the prime modulus for r1 = a + b (mod M) and r2 = c + d (mod M). (GCC miscompiles Fp6 Frobenius with -flto flag #230 (comment))
- When using a memory constraint instead, in particular for written memory, GCC throws "asm has impossible constraint" and LLVM throws "register allocation recoloring failure", whether a plain memory constraint or a dummy memory output is used.
- Clang only supports the regular x86 displacement syntax 8%[M], which may be correctly instantiated to 8(%rax) if the compiler decides to use a pointer in rax, or incorrectly to 8MyModulus(%rip) if the compiler wants to tell the linker to use RIP-relative addressing. GCC also supports the 8+%[M] syntax, which dodges the issue. (Rework assembly to be compatible with LTO #231)
- We could switch to Intel syntax, but Apple Clang doesn't support it (it was introduced in Clang 14).
  - https://github.com/llvm/llvm-project/blob/faf8407aecd15125261787bc9b9b4d448174b5d4/clang/test/CodeGen/ms-inline-asm.c#L432-L439
  - https://github.com/llvm/llvm-project/blob/faf8407aecd15125261787bc9b9b4d448174b5d4/clang/test/CodeGen/ms-inline-asm-variables.c#L19
- Due to the constraint issues, output operands were kept as register constraints despite being pointers, but they were overconstrained: we lied to the compiler, telling it that the value would be changed, so that it does not constant-propagate them (at the price of extra useless reloads). A minimal sketch of this workaround is shown after this list.
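The sketch below (plain C, not Constantine's DSL output; `addmod4` is an illustrative name and the asm body is elided because only the constraints matter here) shows the workaround from the last bullet: the pointer to the modulus is declared as an early-clobber read-write operand even though it is only read, so constant propagation cannot hardcode it, at the price of reloading it around the asm block.

```c
#include <stdint.h>

// r = a + b (mod M), 4 limbs. Only the operand constraints are of interest.
static inline void addmod4(uint64_t r[4], const uint64_t a[4],
                           const uint64_t b[4], const uint64_t M[4]) {
  const uint64_t* m = M;
  __asm__ volatile(
    "" /* ... add the limbs, then conditionally subtract the modulus at (%[m]) ... */
    : [m] "+&r"(m)               // overconstrained on purpose: pretend the modulus
                                 // pointer may be written so it cannot be const-propagated
    : [r] "r"(r), [a] "r"(a), [b] "r"(b)
    : "cc", "memory");
}
```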
## Using assembly files: handwritten vs autogenerated
Another approach, discussed with @etan-status in #230 (comment), would be to write or auto-generate assembly files.
However:
- This means learning the calling convention / ABI for each ISA and OS combination. Just for x86 there are the MS-COFF, Apple Mach-O and AMD64 SysV ABIs.
- If handwritten, it becomes hard to maintain or audit an algorithm that involves loops or register reuse. For example, fast Montgomery reduction of 12 words into 6, with only 15 usable registers (plus the stack pointer), involves rotating temporary registers and concatenating "register arrays" (constantine/constantine/math/arithmetic/assembly/limbs_asm_redc_mont_x86_adx_bmi2.nim, lines 35 to 139 in c6d9a21).
## Using LLVM IR

There is yet another approach. For non-CPU backends, like WASM, NVPTX (Nvidia GPUs), AMDGPU (AMD GPUs) or SPIR-V (Intel GPUs), we could use LLVM IR, augmented with ISA-specific inline assembly (constantine/constantine/platforms/gpu/nvidia_inlineasm.nim, lines 332 to 356 in c6d9a21), for example this modular addition built with the IR builder (constantine/constantine/math_gpu/fields_nvidia.nim, lines 96 to 107 in c6d9a21):
```nim
let r = bld.asArray(addModKernel.getParam(0), fieldTy)
let a = bld.asArray(addModKernel.getParam(1), fieldTy)
let b = bld.asArray(addModKernel.getParam(2), fieldTy)

let t = bld.makeArray(fieldTy)
let N = cm.getNumWords(field)

t[0] = bld.add_co(a[0], b[0])
for i in 1 ..< N:
  t[i] = bld.add_cio(a[i], b[i])
```
That LLVM IR can then be used to generate the assembly files that are checked into the repo. This avoids dealing with ABIs and registers, to focus only on the instructions for each platform. Also, LLVM would be free to use a different, more efficient calling convention for functions tagged private. We also wouldn't have to handle register spills for large curves like BW6-761.
## x86 and CPU backends

For the x86 backend, the codegen of bigint arithmetic is pretty decent when using i256 or i384 operands, in particular thanks to the bug reports by @chfast to LLVM (https://github.com/llvm/llvm-project/issues/created_by/chfast) for EVM-C and https://github.com/chfast/intx. And there are anti-regression suites to guarantee the outputs: https://github.com/llvm/llvm-project/blob/ddfee6d0b6979fc6e61fa5ac7424096c358746fb/llvm/test/CodeGen/X86/i128-mul.ll#L77-L95

It is likely that on RISC-like CPU ISAs (ARM, RISC-V) we can even use LLVM IR without assembly and reach top performance.
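As a hedged C-level illustration of that "decent codegen" (the helper `mul64x64` is ours, not Constantine's): the widening 64x64 -> 128 multiplication below is lowered by LLVM to a single MUL (or MULX) instruction, in the spirit of what the linked anti-regression test pins down.

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;

// Widening 64x64 -> 128 multiplication; LLVM lowers this to one MUL/MULX on x86-64.
static inline void mul64x64(uint64_t* hi, uint64_t* lo, uint64_t a, uint64_t b) {
  uint128_t p = (uint128_t)a * b;
  *lo = (uint64_t)p;
  *hi = (uint64_t)(p >> 64);
}
```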
On x86, we do need to use MULX/ADCX/ADOX, which compilers do not generate, by design: they don't model carry chains, and they certainly don't model the two independent carry chains that ADOX/ADCX need.
One issue with MULX is that it has an implicit multiplicand in the RDX register, and LLVM IR does not allow fixed registers. It is unclear yet whether we can use inline assembly to move the operand to RDX without LLVM undoing it later.
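For concreteness, here is a hedged C sketch (not Constantine code; `mul_row` and the 2-limb shape are illustrative) of the MULX/ADCX/ADOX pattern the two paragraphs above refer to: the multiplicand `b` is pinned to RDX with the "d" constraint, and ADCX and ADOX accumulate on the CF and OF carry chains independently.

```c
#include <stdint.h>

// One partial-product row: acc[0..2] += a[0..1] * b, on x86-64 with BMI2/ADX.
// The final carry out of acc[2] is ignored in this sketch.
static inline void mul_row(uint64_t acc[3], const uint64_t a[2], uint64_t b) {
  uint64_t lo0, hi0, lo1, hi1;
  __asm__(
    "xorq  %%r10, %%r10          \n\t"  // r10 = 0, also clears CF and OF
    "mulxq %[a0], %[lo0], %[hi0] \n\t"  // (hi0:lo0) = a[0] * rdx
    "mulxq %[a1], %[lo1], %[hi1] \n\t"  // (hi1:lo1) = a[1] * rdx
    "adcx  %[lo0], %[acc0]       \n\t"  // CF chain: acc0 += lo0
    "adcx  %[lo1], %[acc1]       \n\t"  // CF chain: acc1 += lo1 + CF
    "adcx  %%r10,  %[acc2]       \n\t"  // CF chain: acc2 += CF
    "adox  %[hi0], %[acc1]       \n\t"  // OF chain: acc1 += hi0
    "adox  %[hi1], %[acc2]       \n\t"  // OF chain: acc2 += hi1 + OF
    : [acc0] "+r"(acc[0]), [acc1] "+r"(acc[1]), [acc2] "+r"(acc[2]),
      [lo0] "=&r"(lo0), [hi0] "=&r"(hi0), [lo1] "=&r"(lo1), [hi1] "=&r"(hi1)
    : [a0] "m"(a[0]), [a1] "m"(a[1]), "d"(b)  // "d" pins b to RDX for MULX
    : "r10", "cc");
}
```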
## NVPTX and GPU backends
If the backend is seldom used, the generated code can be very poor; see the Nvidia code generated with no add-carry.
Unfortunately, compilers cannot generate ADOX/ADCX and so leave at least 25% performance on the table compared to optimal, so assembly somewhere (within C or LLVM IR) is still needed for x86.