-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crypto: import fiat-crypto implementations #40171
Comments
fiat-crypto's 64-bit P-224 implementation is about 2x as fast as Go's existing portable, constant-time P-224 implementation on amd64. Their 32-bit implementation is about the same speed on 386. I expect P-256 will be similar to Curve25519 (i.e., existing assembly implementations are faster, but still worth measuring), but using fiat-crypto for P-384 and P-521 should be both much faster and provide a constant-time implementation (unlike the current, generic math/big code). |
@agl @FiloSottile Ping. While here, I'll point out the fiat-crypto implementation also speeds up curve25519 on ppc64le:
(I don't have any ARM workstations to readily benchmark on.) |
I do have some concerns about adding new license notice requirements in the libraries, because those transitively apply to every Go binary anyone builds (that imports net/http at least). I would feel much more comfortable about this if we could get the code contributed under CLA so that the Go license notice would cover it. |
@JasonGross Do you think we can get fiat-crypto's Go code contributed under Google's CLA? The normal process is documented at https://golang.org/doc/contribute.html#cla. |
Ping @JasonGross. We'd be happy to use this code but don't want to impose new notice requirements on every Go binary. |
Ah, sorry, I meant to follow up on this earlier. As discussed on openssl/openssl#12201 (comment), MIT unfortunately doesn't permit signing CLAs on projects that it holds copyright to. :-/ |
@JasonGross, thanks for replying. I certainly understand MIT not wanting to complete CLAs. An alternative solution to our problem of imposing new notice requirements on every Go binary would be if the generator outputs could be licensed under a non-attribution license such as MIT-0 or a source-code-attribution-only license such as BSD-1-Clause. Do you think that is a possibility? |
That seems quite likely. Let me chat with me colleagues and see if it's feasible. |
Thanks very much. |
@rsc We're in the process of re-licensing under user's choice, MIT OR BSD-1-Clause. However, it seems that BSD-1-Clause is not listed under https://pkg.go.dev/license-policy, even though MIT-0 and BSD-0-Clause are. Is this an oversight? Will BSD-1-Clause in fact be sufficient? |
BSD-1-clause should be fine for our purposes. Thanks very much for tackling this. I don't know why it's not listed in pkg.go.dev. Maybe it's just not very common. CC @jba . |
Looking into it. |
BSD-1-Clause will be fine, thanks. |
Based on the discussion above, this seems like a likely accept. |
I've gotten approval from everyone and have prepared mit-plv/fiat-crypto#881. Hopefully we'll get it merged in the next couple of days. |
The code has now been relicensed under MIT OR BSD-1-Clause OR Apache-2.0 |
Thanks! |
Thanks so much @JasonGross! Accepted. |
Thanks to the work by @JasonGross, naively integrating fiat-crypto into the edwards25519 code scheduled for use in 1.17 now looks like this:
Note that the baseline numbers use assembly language implementations for multiply and square, so the comparison vs fiat isn't "fair" or "direct". The fiat-crypto code outperforms the existing code when assembly is disabled, and comes within spitting distance (+7~16% increased runtime), when compared to code that cheats and uses assembly. If people are determined to squeeze out everything they can from the fiat backend, then there's some more manual inlining/refactoring that could be done, but the low hanging fruit gets performance to "competitive with existing code", at least from my perspective. |
I'll aim to add a flag that allows this shortly. If it's helpful, I can also have the cmovznz calls inlined, though I expect that'll take me a little bit more work |
No, you're right, it's better not to make the caller worry about alias-safety. Thanks for pointing that out. Fortunately, you can get a similar effect by tweaking the method signature a bit to return a value rather than write to func cmovznzU64(arg1 uint1, arg2 uint64, arg3 uint64) uint64 {
x1 := (uint64(arg1) * 0xffffffffffffffff)
return ((x1 & arg3) | ((^x1) & arg2))
}
func Selectznz(out1 *[5]uint64, arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) {
x1 := cmovznzU64(arg1, arg2[0], arg3[0])
x2 := cmovznzU64(arg1, arg2[1], arg3[1])
x3 := cmovznzU64(arg1, arg2[2], arg3[2])
x4 := cmovznzU64(arg1, arg2[3], arg3[3])
x5 := cmovznzU64(arg1, arg2[4], arg3[4])
out1[0] = x1
out1[1] = x2
out1[2] = x3
out1[3] = x4
out1[4] = x5
}
// and the obvious change to ToBytes This shrinks the generated code for You'll get benefits throughout from returning values, possibly multiple values, from functions rather than passing pointers. And as a bonus, the code will be much more idiomatic. Another example: func subborrowxU51(arg1 uint1, arg2 uint64, arg3 uint64) (out1 uint64, out2 uint1) {
x1 := ((int64(arg2) - int64(arg1)) - int64(arg3))
x2 := int1((x1 >> 51))
x3 := (uint64(x1) & 0x7ffffffffffff)
return x3, (0x0 - uint1(x2))
}
func ToBytes(out1 *[32]uint8, arg1 *TightFieldElement) {
x1, x2 := subborrowxU51(0x0, arg1[0], 0x7ffffffffffed)
x3, x4 := subborrowxU51(x2, arg1[1], 0x7ffffffffffff)
x5, x6 := subborrowxU51(x4, arg1[2], 0x7ffffffffffff)
x7, x8 := subborrowxU51(x6, arg1[3], 0x7ffffffffffff)
x9, x10 := subborrowxU51(x8, arg1[4], 0x7ffffffffffff)
// ...
} This cuts about 25% of the instructions in ToBytes.
cmovznz is already inlined. Selectznz is not, but getting it inlined is going to be hard to achieve. I messed around with it for a bit, and the only way I saw to do it was to introduce a loop and alter the function signature to return a [5]uint64, which finesses the aliasing issue but will itself be slow. func Selectznz(arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) (out1 [5]uint64) {
mask := -uint64(arg1)
for i := 0; i < len(out1); i++ {
out1[i] = cmovznzU64(mask, arg2[i], arg3[i])
}
return
}
func cmovznzU64(mask uint64, arg2 uint64, arg3 uint64) uint64 {
return ((mask & arg3) | ((^mask) & arg2))
} I'm pretty sure that is not a net win, even assuming that the changed function signature is acceptable. |
(Back from vacation! Pardon the lag.) This is excellent, thanks @Yawning, @JasonGross and @josharian for working on fiat's performance! @Yawning makes good points in #40171 (comment), however I am still inclined not to switch the 25519 backends to fiat, even if I agree the performance difference is acceptable now. First, the current code already shipped in Go 1.17, so changing it has the marginal risk involved in making any change: unexpected bugs, edge case performance changes, differences in unspecified behavior. We already incurred that risk in Go 1.17, and doing it again in Go 1.18 is a cost. Second, fiat is leaps and bounds better than the assembly we had before or big.Int, which is what made it acceptable to import code that can't really be reviewed in its output. However, here we have a lot of confidence in the current code, as it's well tested and documented. (I am obviously biased on that, having written the code, but I am also the one that has to maintain it, so my money is where my mouth is.) fiat is generated from a formal model, which rules out one important class of bugs, but it's not impossible for the code generator to have bugs or for unspecified behavior to be subtly different. Third, the current code has two additional non-functional purposes: it's the only example in the tree of how to apply https://golang.org/wiki/AssemblyPolicy, and is moderately educational for people learning about elliptic curve implementations. The latter is definitely not dispositive, but it's been historically true that the Go standard library was a good place to learn about cryptography engineering, and I am keen on preserving that when we can. However, the performance work is still extremely valuable, and maybe eventually we'll even get to switch P-256, which would be a major win. |
The rationale makes sense to me, since I'm currently still trying to decide between similar tradeoffs with the library I maintain as well. A nice thing is that, while my efforts were based around 25519 performance, the changes that @JasonGross was kind enough to make will help performance for the places that currently do use fiat (the NIST curves). It's not as if switching the 25519 code would be expensive if there ends up being a more compelling reason to do so later in the future as well. |
Hey, just checking in. The 1.18 window will close pretty soon. |
…udflare In the least graceful way possible. Smash the code in until it compiles. And runs, too. From https://github.com/cloudflare/circl We can remove this when Go switches to the fiat implementation. See golang#40171 (comment). Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
…udflare In the least graceful way possible. Smash the code in until it compiles. And runs, too. From https://github.com/cloudflare/circl We can remove this when Go switches to the fiat implementation. See golang#40171 (comment). Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
…udflare In the least graceful way possible. Smash the code in until it compiles. And runs, too. From https://github.com/cloudflare/circl We can remove this when Go switches to the fiat implementation. See golang#40171 (comment). Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
…udflare In the least graceful way possible. Smash the code in until it compiles. And runs, too. From https://github.com/cloudflare/circl We can remove this when Go switches to the fiat implementation. See golang#40171 (comment). Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
Change https://golang.org/cl/360014 mentions this issue: |
https://golang.org/cl/360014 brings in P-224 and P-384, too. I was hoping to also replace the generic 32-bit P-256 implementation, but the performance of 32-bit fiat seems to be particularly bad, to the point that the 64-bit variants perform better than the 32-bit ones even on 32-bit architectures. See the benchmarks in https://golang.org/cl/351373. @josharian suggested it might be due to #40171 (comment). |
An additional reason could be #15925. tl;dr; arrays don't optimize as well as structs. |
Also, adopt addchain code generation for field inversion, and switch P-521 to Montgomery multiplication, which is significantly slower but allows us to reuse the P-224/P-256/P-384 wrapper code. No one uses P-521 anyway, and it's still faster than it was in Go 1.16. Removed a portion of tests that ran the P-224 vectors against P-256, for some reason. Sadly, fiat-crypto is not fast enough to replace the generic 32-bit P-256 implementation (just yet?). A change in visible behavior is that we literally can't internally operate on invalid curve points anymore (yay!) but the crypto/elliptic API locked us into accepting any pair of integers for Add/Double/ScalarMult and return no error (sigh), although of course that's undefined behavior. Panics are always regretted. Returning nil leads to panics. A fixed point might be exploited. The most reasonable solution felt to return a made up random point, which is not that different from an off-curve point but leaks less. name old time/op new time/op delta pkg:crypto/elliptic goos:darwin goarch:arm64 ScalarBaseMult/P224-8 573µs ± 0% 146µs ± 0% -74.56% (p=0.000 n=7+9) ScalarMult/P224-8 574µs ± 0% 152µs ± 5% -73.58% (p=0.000 n=7+10) MarshalUnmarshal/P224/Uncompressed-8 664ns ± 0% 481ns ± 1% -27.64% (p=0.000 n=8+10) MarshalUnmarshal/P224/Compressed-8 666ns ± 1% 480ns ± 0% -27.92% (p=0.000 n=10+10) pkg:crypto/ecdsa goos:darwin goarch:arm64 Sign/P224-8 597µs ± 0% 169µs ± 2% -71.71% (p=0.000 n=10+9) Verify/P224-8 1.18ms ± 1% 0.32ms ± 5% -72.81% (p=0.000 n=10+10) GenerateKey/P224-8 577µs ± 0% 147µs ± 0% -74.51% (p=0.000 n=8+8) name old time/op new time/op delta pkg:crypto/elliptic goos:darwin goarch:arm64 ScalarBaseMult/P384-8 2.01ms ± 2% 0.50ms ± 0% -75.00% (p=0.000 n=10+8) ScalarMult/P384-8 2.02ms ± 3% 0.51ms ± 3% -74.64% (p=0.000 n=10+10) MarshalUnmarshal/P384/Uncompressed-8 1.09µs ± 1% 0.76µs ± 0% -30.27% (p=0.000 n=10+9) MarshalUnmarshal/P384/Compressed-8 1.08µs ± 0% 0.76µs ± 1% -29.86% (p=0.000 n=8+10) pkg:crypto/ecdsa goos:darwin goarch:arm64 Sign/P384-8 2.06ms ± 1% 0.56ms ± 2% -72.76% (p=0.000 n=10+10) Verify/P384-8 4.06ms ± 2% 1.08ms ± 0% -73.49% (p=0.000 n=10+8) GenerateKey/P384-8 2.01ms ± 1% 0.51ms ± 3% -74.65% (p=0.000 n=10+10) name old time/op new time/op delta pkg:crypto/elliptic goos:darwin goarch:arm64 ScalarBaseMult/P521-8 715µs ± 6% 1525µs ± 4% +113.39% (p=0.000 n=10+10) ScalarMult/P521-8 698µs ± 1% 1543µs ± 1% +120.99% (p=0.000 n=9+9) MarshalUnmarshal/P521/Uncompressed-8 797ns ± 0% 1296ns ± 0% +62.65% (p=0.000 n=10+9) MarshalUnmarshal/P521/Compressed-8 798ns ± 0% 1299ns ± 1% +62.82% (p=0.000 n=8+10) pkg:crypto/ecdsa goos:darwin goarch:arm64 Sign/P521-8 810µs ± 3% 1645µs ± 0% +103.03% (p=0.000 n=10+10) Verify/P521-8 1.42ms ± 1% 3.19ms ± 1% +125.28% (p=0.000 n=10+8) GenerateKey/P521-8 698µs ± 1% 1549µs ± 0% +121.87% (p=0.000 n=10+7) Updates #40171 Change-Id: I34edf5002b5e9fad0ebb6c1e2119fb123ea6d18f Reviewed-on: https://go-review.googlesource.com/c/go/+/360014 Run-TryBot: Filippo Valsorda <filippo@golang.org> Trust: Filippo Valsorda <filippo@golang.org> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Julie Qiu <julie@golang.org> TryBot-Result: Go Bot <gobot@golang.org>
This is in the Go 1.18 milestone. Is it likely to happen for 1.18? Thanks. |
P-224 and P-384 landed in Go 1.18. Moving to Backlog for P-256, 32-bit implementations, and scalar fields, which are not as immediate to bring in. P-256 and 32-bit depend on performance improvements. Scalar fields depend on better APIs. |
Change https://go.dev/cl/396216 mentions this issue: |
Change https://go.dev/cl/393656 mentions this issue: |
This results in a 1.7-2.4x improvement in native go crypto/elliptic multiplication operations on PPC64, and similar improvements might be possible on other architectures which use flags or similar to represent the carry bit in SSA form. If it is possible, schedule carry chains independently of each other to avoid clobbering the carry flag. This is very expensive. This is done by: 1. Identifying carry bit using, but not creating ops, and lowering their priority below all other ops which do not need to be placed at the top of a block. This effectively ensures only one carry chain will be placed at a time in most important cases (crypto/elliptic/internal/fiat contains most of them). 2. Raising the priority of carry bit generating ops to schedule later in a block to ensure they are placed as soon as they are ready. Likewise, tuple ops which separate carrying ops are scored similar to 2 above. This prevents unrelated ops from being scheduled between carry-dependent operations. This occurs when unrelated ops are ready to schedule alongside such tuple ops. This reduces the chances a flag clobbering op might be placed between two carry-dependent operations. With PPC64 Add64/Sub64 lowering into SSA and this patch, the net performance difference in crypto/elliptic benchmarks on P9/ppc64le are: name old time/op new time/op delta ScalarBaseMult/P256 46.3µs ± 0% 46.9µs ± 0% +1.34% ScalarBaseMult/P224 356µs ± 0% 209µs ± 0% -41.14% ScalarBaseMult/P384 1.20ms ± 0% 0.57ms ± 0% -52.14% ScalarBaseMult/P521 3.38ms ± 0% 1.44ms ± 0% -57.27% ScalarMult/P256 199µs ± 0% 199µs ± 0% -0.17% ScalarMult/P224 357µs ± 0% 212µs ± 0% -40.56% ScalarMult/P384 1.20ms ± 0% 0.58ms ± 0% -51.86% ScalarMult/P521 3.37ms ± 0% 1.44ms ± 0% -57.32% MarshalUnmarshal/P256/Uncompressed 2.59µs ± 0% 2.52µs ± 0% -2.63% MarshalUnmarshal/P256/Compressed 2.58µs ± 0% 2.52µs ± 0% -2.06% MarshalUnmarshal/P224/Uncompressed 1.54µs ± 0% 1.40µs ± 0% -9.42% MarshalUnmarshal/P224/Compressed 1.54µs ± 0% 1.39µs ± 0% -9.87% MarshalUnmarshal/P384/Uncompressed 2.40µs ± 0% 1.80µs ± 0% -24.93% MarshalUnmarshal/P384/Compressed 2.35µs ± 0% 1.81µs ± 0% -23.03% MarshalUnmarshal/P521/Uncompressed 3.79µs ± 0% 2.58µs ± 0% -31.81% MarshalUnmarshal/P521/Compressed 3.80µs ± 0% 2.60µs ± 0% -31.67% Note, P256 uses an asm implementation, thus, little variation is expected. Updates #40171 Change-Id: I810850e8ff429505424c92d6fe37f99aaa0c6e84 Reviewed-on: https://go-review.googlesource.com/c/go/+/393656 Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Run-TryBot: Paul Murphy <murp@ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Filippo Valsorda <valsorda@google.com>
The fiat-crypto project (https://github.com/mit-plv/fiat-crypto) generates formally-verified, high-performance modular arithmetic implementations, useful for crypto primitives like Curve25519, Poly1305, and the NIST ECC curves that are used within the Go standard library. They're currently working on a Go backend.
BoringSSL has imported their implementations for Curve25519 and P-256: https://boringssl.googlesource.com/boringssl/+/master/third_party/fiat/
At https://go-review.googlesource.com/c/crypto/+/242177, I've uploaded a WIP CL that imports their Curve25519 implementation (w/ minor tweaks), and demonstrates a significant performance improvement over the current "generic" implementation. (The existing amd64 assembly implementation is still considerably faster though.)
This proposal is to import and make use of those implementations.
Open questions:
Which algorithms should be imported? BoringSSL only imports two. Should we import more?
Do we import both 32-bit and 64-bit implementations? We could import just one implementation and still get a performance speedup (e.g., 386 sees a -10% performance boost with curve25519_fiat_64.go, and amd64 sees a -30% boost with curve25519_fiat_32.go), but they do better still with the CPU-appropriate implementations (-30% for 386 w/ 32-bit, and -61% for amd64 w/ 64-bit).
How should the code be imported? E.g., should it be separated into a third_party or vendor directory with its own LICENSE file like how BoringSSL does it?
/cc @agl @FiloSottile
The text was updated successfully, but these errors were encountered: