Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCC miscompiles Fp6 Frobenius with -flto flag #230

Closed
mratsim opened this issue Apr 18, 2023 · 9 comments · Fixed by #231
Closed

GCC miscompiles Fp6 Frobenius with -flto flag #230

mratsim opened this issue Apr 18, 2023 · 9 comments · Fixed by #231
Labels

Comments

@mratsim
Copy link
Owner

mratsim commented Apr 18, 2023

After applying fixes for #229 in #228 548cf20

nim c --cc:gcc -r -d:release --passC:-flto=auto --passL:-flto=auto --outdir:build tests/math/t_fp6_frobenius.nim

fails with

test_fp6_frobenius xoshiro512** seed: 1681811972

[Suite] 𝔽p6 Frobenius map: Frobenius(a, k) = a^(pᵏ) (mod p⁶) [64-bit words]
    Frobenius(a) for Fp6[BN254_Nogami]
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    Frobenius(a) for Fp6[BN254_Snarks]
    Frobenius(a) for Fp6[BLS12_377]
    Frobenius(a) for Fp6[BLS12_381]
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(80, 21): Check failed: bool(a == fa)
    Frobenius(a) for Fp6[BW6_761]
  [FAILED] Frobenius(a) = a^p (mod p^6)
    Frobenius(a, 2) for Fp6[BN254_Nogami]
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    Frobenius(a, 2) for Fp6[BN254_Snarks]
    Frobenius(a, 2) for Fp6[BLS12_377]
    Frobenius(a, 2) for Fp6[BLS12_381]
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(99, 16): Check failed: bool(a == fa)
    Frobenius(a, 2) for Fp6[BW6_761]
  [FAILED] Frobenius(a, 2) = a^(p^2) (mod p^6)
    Frobenius(a, 3) for Fp6[BN254_Nogami]
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    Frobenius(a, 3) for Fp6[BN254_Snarks]
    Frobenius(a, 3) for Fp6[BLS12_377]
    Frobenius(a, 3) for Fp6[BLS12_381]
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(117, 21): Check failed: bool(a == fa)
    Frobenius(a, 3) for Fp6[BW6_761]
  [FAILED] Frobenius(a, 3) = a^(p^3) (mod p^6)

Note that this time this does not depend on --opt:size flag.

This likely explains the BLS verification failure on windows.

This does not happen with Clang or without LTO

@mratsim
Copy link
Owner Author

mratsim commented Apr 18, 2023

It seems like the reference exponentiation algorithm gets miscompiled and zero out some data

[Suite] 𝔽p6 Frobenius map: Frobenius(a, k) = a^(pᵏ) (mod p⁶) [64-bit words]
    Frobenius(a) for Fp6[BN254_Nogami]
frob test ( a  in): Fp6(
  c0: Fp2(
    c0: 0x10c399f626ceb62186cefecb6b3666da8db348e4fd28bb7f60c74a7ee5aea24a, 
    c1: 0x02cbc4df3bf4fdbed6be0cd9436a484b7e3294270ec0cc9f8d1378999d112334), 
  c1: Fp2(
    c0: 0x073c6d67ce314a11b78c642631efd635efdbf4f4e436e00f7a7fdd9020bbe642, 
    c1: 0x089bb97f312d37833ba8b0c1ca2c60bd29c22e852b5b571398a5f1994fd9016d), 
  c2: Fp2(
    c0: 0x040ff098efd9b9c8f0cd9f2925f2e2b869fcaa4cf50156192662ac606d1800ff, 
    c1: 0x0a4d8e2b6db81ef92a16d822e855cd38bc6fd8c9161c2b6206990c04bcc40d99))
frob test ( a out): Fp6(
  c0: Fp2(
    c0: 0x1d0240477bb4c3bb13c9e4f44c606fe9d730fc8f00630c2f0316b49beb606b5c, 
    c1: 0x0000000000000000000000000000000000000000000000000000000000000000), 
  c1: Fp2(
    c0: 0x0000000000000000000000000000000000000000000000000000000000000000, 
    c1: 0x0000000000000000000000000000000000000000000000000000000000000000), 
  c2: Fp2(
    c0: 0x0000000000000000000000000000000000000000000000000000000000000000, 
    c1: 0x0000000000000000000000000000000000000000000000000000000000000000))
frob test (fa out): Fp6(
  c0: Fp2(
    c0: 0x10c399f626ceb62186cefecb6b3666da8db348e4fd28bb7f60c74a7ee5aea24a, 
    c1: 0x22579fa3040b0242e37640a6bc95b7bce2ee6bd8f13f337419ec876662eedcdf), 
  c1: Fp2(
    c0: 0x0d118e8c6ba1c6953de028c405b7ffd9b842d85366413f1ec1c49669f3bdc70b, 
    c1: 0x1ad1021b795ca7fec204540f55887869a937057ec588dbf9650ea71fac0be0bd), 
  c2: Fp2(
    c0: 0x2098e7de1289bf414f9bbe13c7337da8c7cf89742a33eeadc8600a4c43f19de6, 
    c1: 0x172c6a37e2db93bd813bcd878a1ddc7d6f6f6b6272ad3596310c740cc17c9a5c))
    /[...]/constantine/tests/math/t_fp_tower_frobenius_template.nim(89, 21): Check failed: bool(a == fa)
  [FAILED] Frobenius(a) = a^p (mod p^6)

And compiling with ubsan makes the problem disappear obviously ...:

nim c --cc:gcc -r -d:release --passC:-fsanitize=undefined --passL:-fsanitize=undefined --passC:-flto=auto --passL:-flto=auto --outdir:build tests/math/t_fp6_frobenius.nim
test_fp6_frobenius xoshiro512** seed: 1681812781

[Suite] 𝔽p6 Frobenius map: Frobenius(a, k) = a^(pᵏ) (mod p⁶) [64-bit words]
    Frobenius(a) for Fp6[BN254_Nogami]
frob test ( a  in): Fp6(
  c0: Fp2(
    c0: 0x10c399f626ceb62186cefecb6b3666da8db348e4fd28bb7f60c74a7ee5aea24a, 
    c1: 0x02cbc4df3bf4fdbed6be0cd9436a484b7e3294270ec0cc9f8d1378999d112334), 
  c1: Fp2(
    c0: 0x073c6d67ce314a11b78c642631efd635efdbf4f4e436e00f7a7fdd9020bbe642, 
    c1: 0x089bb97f312d37833ba8b0c1ca2c60bd29c22e852b5b571398a5f1994fd9016d), 
  c2: Fp2(
    c0: 0x040ff098efd9b9c8f0cd9f2925f2e2b869fcaa4cf50156192662ac606d1800ff, 
    c1: 0x0a4d8e2b6db81ef92a16d822e855cd38bc6fd8c9161c2b6206990c04bcc40d99))
frob test ( a out): Fp6(
  c0: Fp2(
    c0: 0x10c399f626ceb62186cefecb6b3666da8db348e4fd28bb7f60c74a7ee5aea24a, 
    c1: 0x22579fa3040b0242e37640a6bc95b7bce2ee6bd8f13f337419ec876662eedcdf), 
  c1: Fp2(
    c0: 0x0d118e8c6ba1c6953de028c405b7ffd9b842d85366413f1ec1c49669f3bdc70b, 
    c1: 0x1ad1021b795ca7fec204540f55887869a937057ec588dbf9650ea71fac0be0bd), 
  c2: Fp2(
    c0: 0x2098e7de1289bf414f9bbe13c7337da8c7cf89742a33eeadc8600a4c43f19de6, 
    c1: 0x172c6a37e2db93bd813bcd878a1ddc7d6f6f6b6272ad3596310c740cc17c9a5c))
frob test (fa out): Fp6(
  c0: Fp2(
    c0: 0x10c399f626ceb62186cefecb6b3666da8db348e4fd28bb7f60c74a7ee5aea24a, 
    c1: 0x22579fa3040b0242e37640a6bc95b7bce2ee6bd8f13f337419ec876662eedcdf), 
  c1: Fp2(
    c0: 0x0d118e8c6ba1c6953de028c405b7ffd9b842d85366413f1ec1c49669f3bdc70b, 
    c1: 0x1ad1021b795ca7fec204540f55887869a937057ec588dbf9650ea71fac0be0bd), 
  c2: Fp2(
    c0: 0x2098e7de1289bf414f9bbe13c7337da8c7cf89742a33eeadc8600a4c43f19de6, 
    c1: 0x172c6a37e2db93bd813bcd878a1ddc7d6f6f6b6272ad3596310c740cc17c9a5c))
  [OK] Frobenius(a) = a^p (mod p^6)

@mratsim
Copy link
Owner Author

mratsim commented Apr 18, 2023

Further investigation shows that

  • Works: nim c --cc:gcc -r --passC:-flto --passL:-flto -d:release --outdir:build tests/math/t_fp6_bn254_snarks.nim
  • Works: nim c --cc:gcc -r --passC:-flto --passL:-flto -d:release --outdir:build tests/math/t_fp6_bls12_377.nim
  • Fails: nim c --cc:gcc -r --passC:-flto --passL:-flto -d:release --outdir:build tests/math/t_fp6_bn254_nogami.nim
  • Fails: nim c --cc:gcc -r --passC:-flto --passL:-flto -d:release --outdir:build tests/math/t_fp6_bls12_381.nim

The "obvious" difference between the working (BN254_Snarks, BLS12_377) and failing (BN254_Nogami, BLS12_381) set is that the failing set uses (1 + i) as a non-residue to build 𝔽p6 on top of 𝔽p2.

Hence the bug is likely in

func prod*[C: static Curve](
r: var Fp2[C],
a: Fp2[C],
_: type NonResidue) =
## Multiply an element of 𝔽p2 by the non-residue
## chosen to construct the next extension or the twist:
## - if quadratic non-residue: 𝔽p4
## - if cubic non-residue: 𝔽p6
## - if sextic non-residue: 𝔽p4, 𝔽p6 or 𝔽p12
# Yet another const tuple unpacking bug
const u = C.getNonResidueFp2()[0]
const v = C.getNonResidueFp2()[1]
const Beta {.used.} = C.getNonResidueFp()
# ξ = u + v x
# and x² = β
#
# (c0 + c1 x) (u + v x) => u c0 + (u c0 + u c1)x + v c1 x²
# => u c0 + β v c1 + (v c0 + u c1) x
when a.fromComplexExtension() and u == 1 and v == 1:
let t {.noInit.} = a.c0
r.c0.diff(t, a.c1)
r.c1.sum(t, a.c1)
else:
# Case:
# - BN254_Snarks, QNR_Fp: -1, SNR_Fp2: 9+1𝑖 (𝑖 = √-1)
# - BLS12_377, QNR_Fp: -5, SNR_Fp2: 0+1j (j = √-5)
# - BW6_761, SNR_Fp: -4, CNR_Fp2: 0+1j (j = √-4)
when u == 0:
# BLS12_377 and BW6_761, use small addition chain
r.mul_sparse_by_0y(a, v)
else:
# BN254_Snarks, u = 9, v = 1, β = -1
# Even with u = 9, the 2x9 addition chains (8 additions total)
# are cheaper than full Fp2 multiplication
var t {.noInit.}: typeof(a.c0)
t.prod(a.c0, u)
when v == 1 and Beta == -1: # Case BN254_Snarks
t -= a.c1 # r0 = u c0 + β v c1
else:
{.error: "Unimplemented".}
r.c1.prod(a.c1, u)
when v == 1: # r1 = v c0 + u c1
r.c1 += a.c0
# aliasing: a.c0 is unused
r.c0 = t
else:
{.error: "Unimplemented".}

Even multiplying by 0 or 1 or squaring wrong gets wrong

test_fp6_BN254_Nogami xoshiro512** seed: 1681816001

[Suite] 𝔽p6 = 𝔽p2[w] BN254_Nogami [64-bit words]
  [OK] Comparison sanity checks
  [OK] Addition, substraction negation are consistent
  [OK] Division by 2
    //[...]//constantine/tests/math/t_fp_tower_template.nim(149, 21): Check failed: bool(r == One)
  [FAILED] Squaring 1 returns 1
    //[...]//constantine/tests/math/t_fp_tower_template.nim(176, 21): Check failed: bool(r == Four)
  [FAILED] Squaring 2 returns 4
    //[...]//constantine/tests/math/t_fp_tower_template.nim(205, 21): Check failed: bool(u == Nine)
  [FAILED] Squaring 3 returns 9
    //[...]//constantine/tests/math/t_fp_tower_template.nim(234, 21): Check failed: bool(u == Nine)
  [FAILED] Squaring -3 returns 9
fatal.nim(54)            sysFatal

    Unhandled exception: t_fp_tower_template.nim(262, 20) `bool(r == Z)` 
Expected zero but got 
(Fp6[BN254_Nogami]): Fp6(
  c0: Fp2(
    c0: 0x1bd2327ab3dce8c41b4e55237009615f5aefc528e3623dd2fb98dd56fa5beaa2, 
    c1: 0x20fd5f09fc035ab6ba1090a72a5e91a22b13058690761ab6f047f66115ffbd90), 
  c1: Fp2(
    c0: 0x0d1d6f126f6ac9a158f67541e580591d14d5d12d47f0302e9f01e8c31743722f, 
    c1: 0x0250f912118422fca544a186ac502b314f388713905525bc42593672eeff6c39), 
  c2: Fp2(
    c0: 0x1f56de2990d171dbd1062e7f17d2096c120234af97168498a19f8b592a3e2e35, 
    c1: 0x16b952c577acd4af661e7994168797bfebc9852189aa57368e54051e37e9794e)) [AssertionDefect]
  [FAILED] Multiplication by 0 and 1

@mratsim
Copy link
Owner Author

mratsim commented Apr 18, 2023

One possible bugfix

image

It's likely that GCC is trying to aggressively inline the caller of sumprodMont. Note that sumprodMont_CIOS_spare2bits_asm_adx and sumprodMont_CIOS_spare2bits_asm are tagged noInline already.

@mratsim
Copy link
Owner Author

mratsim commented Apr 18, 2023

Just trying to display r or a variable before removes the bug :/

image
image

@mratsim
Copy link
Owner Author

mratsim commented Apr 18, 2023

Another potential bugfix, seems like marking the register as early clobber helps so a bug in GCC inline ASM allocation?

image

Rechecking the code, M isn't overwritten so it's not clobbered.

@mratsim
Copy link
Owner Author

mratsim commented Apr 18, 2023

So reverse engineering the code with Ghidra

The code has been constant-folded and only has 3 inputs, in particular M, the statically known modulus address has been folded
image
image

It's possible that the code has been relocated and the address referred to is garbage.

Potential fixes:

  1. Write .s files instead of inline ASM, meaning learning ASM including calling conventions for x86, x86-64, ARM, ARM64, MIPS, RISC, PowerPC, S390, ...., on Linux, Windows, Darwin, ... instead of just learning the instructions we are interested in. And also learning an assembler between nasm, yasm and other ISA/OS specific assemblers ... (cc @etan-status)
  2. Declare the modulus as an output so that GCC doesn't constant fold it. There will be additional unnecessary instructions to reload its address. Also we might run afoul of the 30 input/output of inline ASM
  3. Encode the modulus as immediate. It's unsure whether the code will be faster (the modulus is hot in cache) but the code will be bigger if there are multiple curves that can share the same code, for example BLS12_377 and BLS12_381, or BN254_Snarks and BanderSnatch.
  4. Rework the inline asm constraint to use m see:

image

@mratsim
Copy link
Owner Author

mratsim commented Apr 20, 2023

Unfortunately swithcing back to memory constraint (m or o) allows GCC to generate correct code, but then LLVM generates incorrect code.
This can be tested on t_fp6_bn254_nogami test suite by experimenting on the register/mem and constraint of mulx_gen and also toggling mul_asm_adx_inline or mul_asm_adx in sqrx2x_complex_asm_adx.

Intuition is that there is miscompilation when the input is on the stack and we use the inline function:

func sqrx2x_complex_asm_adx*(
r: var array[2, FpDbl],
a: array[2, Fp]) =
## Complex squaring on 𝔽p2
# This specialized proc inlines all calls and avoids many ADX support checks.
# and push/pop for paramater passing.
var t0 {.noInit.}, t1 {.noInit.}: typeof(a.c0)
when Fp.has1extraBit():
t0.sumUnr(a.c1, a.c1)
t1.sumUnr(a.c0, a.c1)
else:
t0.double(a.c1)
t1.sum(a.c0, a.c1)
r.c1.limbs2x.mul_asm_adx_inline(t0.mres.limbs, a.c0.mres.limbs)
t0.diff(a.c0, a.c1)
r.c0.limbs2x.mul_asm_adx_inline(t0.mres.limbs, t1.mres.limbs)

Trying to force a pointer input on the assembly function doesn't seem to help LLVM to generate appropriate code.

@mratsim
Copy link
Owner Author

mratsim commented Apr 20, 2023

Now, if the array is "on the stack" the compiler is allowed to store it in registers. GCC probably sees the memory constraint and forces that array on the stack but possibly LLVM does not and so and so the pointers to memory are invalid when the function is inlined.

Now the question becomes, if this is true, why does it work when the constraint is "PointerInReg" here:

macro mulx_gen[rLen, aLen, bLen: static int](r_PIR: var Limbs[rLen], a_PIR: Limbs[aLen], b_PIR: Limbs[bLen]) =
## `a`, `b`, `r` can have a different number of limbs
## if `r`.limbs.len < a.limbs.len + b.limbs.len
## The result will be truncated, i.e. it will be
## a * b (mod (2^WordBitWidth)^r.limbs.len)
##
## Assumes r doesn't aliases a or b
result = newStmtList()
var ctx = init(Assembler_x86, BaseType)
let
r = init(OperandArray, nimSymbol = r_PIR, rLen, PointerInReg, InputOutput_EnsureClobber)
a = init(OperandArray, nimSymbol = a_PIR, aLen, PointerInReg, Input)
b = init(OperandArray, nimSymbol = b_PIR, bLen, PointerInReg, Input)

@mratsim
Copy link
Owner Author

mratsim commented Apr 20, 2023

Seems like LLVM is broken with memory constraint and Rust removed them altogether:

And pointers make GCC miscompile with LTO (due to folding "constant pointers to constants") unless function is noInline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant