Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto: import fiat-crypto implementations #40171

Open
Tracked by #52182
mdempsky opened this issue Jul 12, 2020 · 76 comments
Open
Tracked by #52182

crypto: import fiat-crypto implementations #40171

mdempsky opened this issue Jul 12, 2020 · 76 comments
Assignees
Labels
Proposal Proposal-Accepted Proposal-Crypto Proposal related to crypto packages or other security issues
Milestone

Comments

@mdempsky
Copy link
Contributor

mdempsky commented Jul 12, 2020

The fiat-crypto project (https://github.com/mit-plv/fiat-crypto) generates formally-verified, high-performance modular arithmetic implementations, useful for crypto primitives like Curve25519, Poly1305, and the NIST ECC curves that are used within the Go standard library. They're currently working on a Go backend.

BoringSSL has imported their implementations for Curve25519 and P-256: https://boringssl.googlesource.com/boringssl/+/master/third_party/fiat/

At https://go-review.googlesource.com/c/crypto/+/242177, I've uploaded a WIP CL that imports their Curve25519 implementation (w/ minor tweaks), and demonstrates a significant performance improvement over the current "generic" implementation. (The existing amd64 assembly implementation is still considerably faster though.)

This proposal is to import and make use of those implementations.

Open questions:

  1. Which algorithms should be imported? BoringSSL only imports two. Should we import more?

  2. Do we import both 32-bit and 64-bit implementations? We could import just one implementation and still get a performance speedup (e.g., 386 sees a -10% performance boost with curve25519_fiat_64.go, and amd64 sees a -30% boost with curve25519_fiat_32.go), but they do better still with the CPU-appropriate implementations (-30% for 386 w/ 32-bit, and -61% for amd64 w/ 64-bit).

  3. How should the code be imported? E.g., should it be separated into a third_party or vendor directory with its own LICENSE file like how BoringSSL does it?

/cc @agl @FiloSottile

@gopherbot gopherbot added this to the Proposal milestone Jul 12, 2020
@mdempsky
Copy link
Contributor Author

fiat-crypto's 64-bit P-224 implementation is about 2x as fast as Go's existing portable, constant-time P-224 implementation on amd64. Their 32-bit implementation is about the same speed on 386.

I expect P-256 will be similar to Curve25519 (i.e., existing assembly implementations are faster, but still worth measuring), but using fiat-crypto for P-384 and P-521 should be both much faster and provide a constant-time implementation (unlike the current, generic math/big code).

@mdempsky mdempsky added the Proposal-Crypto Proposal related to crypto packages or other security issues label Jul 30, 2020
@mdempsky
Copy link
Contributor Author

@agl @FiloSottile Ping.

While here, I'll point out the fiat-crypto implementation also speeds up curve25519 on ppc64le:

name               old time/op   new time/op   delta
ScalarBaseMult-32    152µs ± 1%     92µs ± 4%  -39.82%  (p=0.000 n=17+20)

name               old speed     new speed     delta
ScalarBaseMult-32  210kB/s ± 0%  347kB/s ± 2%  +65.27%  (p=0.000 n=17+17)

(I don't have any ARM workstations to readily benchmark on.)

@rsc
Copy link
Contributor

rsc commented Aug 26, 2020

I do have some concerns about adding new license notice requirements in the libraries, because those transitively apply to every Go binary anyone builds (that imports net/http at least).

I would feel much more comfortable about this if we could get the code contributed under CLA so that the Go license notice would cover it.

@mdempsky
Copy link
Contributor Author

@JasonGross Do you think we can get fiat-crypto's Go code contributed under Google's CLA? The normal process is documented at https://golang.org/doc/contribute.html#cla.

@rsc
Copy link
Contributor

rsc commented Sep 16, 2020

Ping @JasonGross. We'd be happy to use this code but don't want to impose new notice requirements on every Go binary.

@JasonGross
Copy link

Ah, sorry, I meant to follow up on this earlier. As discussed on openssl/openssl#12201 (comment), MIT unfortunately doesn't permit signing CLAs on projects that it holds copyright to. :-/

@rsc
Copy link
Contributor

rsc commented Sep 18, 2020

@JasonGross, thanks for replying. I certainly understand MIT not wanting to complete CLAs.

An alternative solution to our problem of imposing new notice requirements on every Go binary would be if the generator outputs could be licensed under a non-attribution license such as MIT-0 or a source-code-attribution-only license such as BSD-1-Clause.

Do you think that is a possibility?

@JasonGross
Copy link

That seems quite likely. Let me chat with me colleagues and see if it's feasible.

@rsc
Copy link
Contributor

rsc commented Sep 18, 2020

Thanks very much.

@JasonGross
Copy link

@rsc We're in the process of re-licensing under user's choice, MIT OR BSD-1-Clause. However, it seems that BSD-1-Clause is not listed under https://pkg.go.dev/license-policy, even though MIT-0 and BSD-0-Clause are. Is this an oversight? Will BSD-1-Clause in fact be sufficient?

@ianlancetaylor
Copy link
Member

ianlancetaylor commented Sep 22, 2020

BSD-1-clause should be fine for our purposes. Thanks very much for tackling this.

I don't know why it's not listed in pkg.go.dev. Maybe it's just not very common. CC @jba .

@jba
Copy link
Contributor

jba commented Sep 22, 2020

Looking into it.

@rsc
Copy link
Contributor

rsc commented Sep 23, 2020

BSD-1-Clause will be fine, thanks.

@rsc
Copy link
Contributor

rsc commented Sep 23, 2020

Based on the discussion above, this seems like a likely accept.

@JasonGross
Copy link

I've gotten approval from everyone and have prepared mit-plv/fiat-crypto#881. Hopefully we'll get it merged in the next couple of days.

@JasonGross
Copy link

The code has now been relicensed under MIT OR BSD-1-Clause OR Apache-2.0

@ianlancetaylor
Copy link
Member

Thanks!

@rsc
Copy link
Contributor

rsc commented Sep 30, 2020

Thanks so much @JasonGross!

Accepted.

@rsc rsc modified the milestones: Proposal, Backlog Sep 30, 2020
@Yawning
Copy link

Yawning commented Aug 8, 2021

Thanks to the work by @JasonGross, naively integrating fiat-crypto into the edwards25519 code scheduled for use in 1.17 now looks like this:

name \ time/op                 baseline     baseline-purego  fiat
MultiScalarMultSize8-4          410µs ± 0%       543µs ± 0%   476µs ± 0%
ScalarBaseMult-4               34.6µs ± 0%      44.7µs ± 0%  40.2µs ± 0%
ScalarMult-4                    115µs ± 0%       160µs ± 0%   127µs ± 0%
VarTimeDoubleScalarBaseMult-4   109µs ± 0%       155µs ± 0%   117µs ± 0%

Note that the baseline numbers use assembly language implementations for multiply and square, so the comparison vs fiat isn't "fair" or "direct". The fiat-crypto code outperforms the existing code when assembly is disabled, and comes within spitting distance (+7~16% increased runtime), when compared to code that cheats and uses assembly.

If people are determined to squeeze out everything they can from the fiat backend, then there's some more manual inlining/refactoring that could be done, but the low hanging fruit gets performance to "competitive with existing code", at least from my perspective.

@JasonGross
Copy link

As a start, looking at the Selectznz in curve25519.go, shortening the lifetimes of the variables will help.

I'll aim to add a flag that allows this shortly. If it's helpful, I can also have the cmovznz calls inlined, though I expect that'll take me a little bit more work

@josharian
Copy link
Contributor

As a start, looking at the Selectznz in curve25519.go, shortening the lifetimes of the variables will help.

I'll aim to add a flag that allows this shortly. If it's helpful, I can also have the cmovznz calls inlined, though I expect that'll take me a little bit more work

No, you're right, it's better not to make the caller worry about alias-safety. Thanks for pointing that out.

Fortunately, you can get a similar effect by tweaking the method signature a bit to return a value rather than write to *arg0:

func cmovznzU64(arg1 uint1, arg2 uint64, arg3 uint64) uint64 {
	x1 := (uint64(arg1) * 0xffffffffffffffff)
	return ((x1 & arg3) | ((^x1) & arg2))
}

func Selectznz(out1 *[5]uint64, arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) {
	x1 := cmovznzU64(arg1, arg2[0], arg3[0])
	x2 := cmovznzU64(arg1, arg2[1], arg3[1])
	x3 := cmovznzU64(arg1, arg2[2], arg3[2])
	x4 := cmovznzU64(arg1, arg2[3], arg3[3])
	x5 := cmovznzU64(arg1, arg2[4], arg3[4])
	out1[0] = x1
	out1[1] = x2
	out1[2] = x3
	out1[3] = x4
	out1[4] = x5
}

// and the obvious change to ToBytes

This shrinks the generated code for Selectznz on amd64 from 61 to 46 instructions, and preserving aliasing safety.

You'll get benefits throughout from returning values, possibly multiple values, from functions rather than passing pointers. And as a bonus, the code will be much more idiomatic. Another example:

func subborrowxU51(arg1 uint1, arg2 uint64, arg3 uint64) (out1 uint64, out2 uint1) {
	x1 := ((int64(arg2) - int64(arg1)) - int64(arg3))
	x2 := int1((x1 >> 51))
	x3 := (uint64(x1) & 0x7ffffffffffff)
	return x3, (0x0 - uint1(x2))
}

func ToBytes(out1 *[32]uint8, arg1 *TightFieldElement) {
	x1, x2 := subborrowxU51(0x0, arg1[0], 0x7ffffffffffed)
	x3, x4 := subborrowxU51(x2, arg1[1], 0x7ffffffffffff)
	x5, x6 := subborrowxU51(x4, arg1[2], 0x7ffffffffffff)
	x7, x8 := subborrowxU51(x6, arg1[3], 0x7ffffffffffff)
	x9, x10 := subborrowxU51(x8, arg1[4], 0x7ffffffffffff)
        // ...
}

This cuts about 25% of the instructions in ToBytes.

I can also have the cmovznz calls inlined

cmovznz is already inlined. Selectznz is not, but getting it inlined is going to be hard to achieve. I messed around with it for a bit, and the only way I saw to do it was to introduce a loop and alter the function signature to return a [5]uint64, which finesses the aliasing issue but will itself be slow.

func Selectznz(arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) (out1 [5]uint64) {
	mask := -uint64(arg1)
	for i := 0; i < len(out1); i++ {
		out1[i] = cmovznzU64(mask, arg2[i], arg3[i])
	}
	return
}

func cmovznzU64(mask uint64, arg2 uint64, arg3 uint64) uint64 {
	return ((mask & arg3) | ((^mask) & arg2))
}

I'm pretty sure that is not a net win, even assuming that the changed function signature is acceptable.

@FiloSottile
Copy link
Contributor

(Back from vacation! Pardon the lag.)

This is excellent, thanks @Yawning, @JasonGross and @josharian for working on fiat's performance!

@Yawning makes good points in #40171 (comment), however I am still inclined not to switch the 25519 backends to fiat, even if I agree the performance difference is acceptable now.

First, the current code already shipped in Go 1.17, so changing it has the marginal risk involved in making any change: unexpected bugs, edge case performance changes, differences in unspecified behavior. We already incurred that risk in Go 1.17, and doing it again in Go 1.18 is a cost.

Second, fiat is leaps and bounds better than the assembly we had before or big.Int, which is what made it acceptable to import code that can't really be reviewed in its output. However, here we have a lot of confidence in the current code, as it's well tested and documented. (I am obviously biased on that, having written the code, but I am also the one that has to maintain it, so my money is where my mouth is.) fiat is generated from a formal model, which rules out one important class of bugs, but it's not impossible for the code generator to have bugs or for unspecified behavior to be subtly different.

Third, the current code has two additional non-functional purposes: it's the only example in the tree of how to apply https://golang.org/wiki/AssemblyPolicy, and is moderately educational for people learning about elliptic curve implementations. The latter is definitely not dispositive, but it's been historically true that the Go standard library was a good place to learn about cryptography engineering, and I am keen on preserving that when we can.

However, the performance work is still extremely valuable, and maybe eventually we'll even get to switch P-256, which would be a major win.

@Yawning
Copy link

Yawning commented Aug 26, 2021

@Yawning makes good points in #40171 (comment), however I am still inclined not to switch the 25519 backends to fiat, even if I agree the performance difference is acceptable now.

[snip]

However, the performance work is still extremely valuable, and maybe eventually we'll even get to switch P-256, which would be a major win.

The rationale makes sense to me, since I'm currently still trying to decide between similar tradeoffs with the library I maintain as well.

A nice thing is that, while my efforts were based around 25519 performance, the changes that @JasonGross was kind enough to make will help performance for the places that currently do use fiat (the NIST curves).

It's not as if switching the 25519 code would be expensive if there ends up being a more compelling reason to do so later in the future as well.

@josharian
Copy link
Contributor

Hey, just checking in. The 1.18 window will close pretty soon.

josharian added a commit to tailscale/go that referenced this issue Oct 26, 2021
…udflare

In the least graceful way possible. Smash the code in
until it compiles. And runs, too.

From https://github.com/cloudflare/circl

We can remove this when Go switches to the fiat implementation.
See golang#40171 (comment).

Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
josharian added a commit to tailscale/go that referenced this issue Oct 26, 2021
…udflare

In the least graceful way possible. Smash the code in
until it compiles. And runs, too.

From https://github.com/cloudflare/circl

We can remove this when Go switches to the fiat implementation.
See golang#40171 (comment).

Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
josharian added a commit to tailscale/go that referenced this issue Oct 26, 2021
…udflare

In the least graceful way possible. Smash the code in
until it compiles. And runs, too.

From https://github.com/cloudflare/circl

We can remove this when Go switches to the fiat implementation.
See golang#40171 (comment).

Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
josharian added a commit to tailscale/go that referenced this issue Oct 27, 2021
…udflare

In the least graceful way possible. Smash the code in
until it compiles. And runs, too.

From https://github.com/cloudflare/circl

We can remove this when Go switches to the fiat implementation.
See golang#40171 (comment).

Change-Id: I1d544f8ee065476f30817bc950246c148cf50234
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/360014 mentions this issue: crypto/elliptic: port P-224 and P-384 to fiat-crypto

@FiloSottile
Copy link
Contributor

https://golang.org/cl/360014 brings in P-224 and P-384, too.

I was hoping to also replace the generic 32-bit P-256 implementation, but the performance of 32-bit fiat seems to be particularly bad, to the point that the 64-bit variants perform better than the 32-bit ones even on 32-bit architectures.

See the benchmarks in https://golang.org/cl/351373.

@josharian suggested it might be due to #40171 (comment).

@egonelbre
Copy link
Contributor

An additional reason could be #15925. tl;dr; arrays don't optimize as well as structs.

gopherbot pushed a commit that referenced this issue Nov 5, 2021
Also, adopt addchain code generation for field inversion, and switch
P-521 to Montgomery multiplication, which is significantly slower but
allows us to reuse the P-224/P-256/P-384 wrapper code. No one uses P-521
anyway, and it's still faster than it was in Go 1.16.

Removed a portion of tests that ran the P-224 vectors against P-256,
for some reason.

Sadly, fiat-crypto is not fast enough to replace the generic 32-bit
P-256 implementation (just yet?).

A change in visible behavior is that we literally can't internally
operate on invalid curve points anymore (yay!) but the crypto/elliptic
API locked us into accepting any pair of integers for
Add/Double/ScalarMult and return no error (sigh), although of course
that's undefined behavior. Panics are always regretted. Returning nil
leads to panics. A fixed point might be exploited. The most reasonable
solution felt to return a made up random point, which is not that
different from an off-curve point but leaks less.

name                                  old time/op    new time/op    delta
pkg:crypto/elliptic goos:darwin goarch:arm64
ScalarBaseMult/P224-8                    573µs ± 0%     146µs ± 0%   -74.56%  (p=0.000 n=7+9)
ScalarMult/P224-8                        574µs ± 0%     152µs ± 5%   -73.58%  (p=0.000 n=7+10)
MarshalUnmarshal/P224/Uncompressed-8     664ns ± 0%     481ns ± 1%   -27.64%  (p=0.000 n=8+10)
MarshalUnmarshal/P224/Compressed-8       666ns ± 1%     480ns ± 0%   -27.92%  (p=0.000 n=10+10)
pkg:crypto/ecdsa goos:darwin goarch:arm64
Sign/P224-8                              597µs ± 0%     169µs ± 2%   -71.71%  (p=0.000 n=10+9)
Verify/P224-8                           1.18ms ± 1%    0.32ms ± 5%   -72.81%  (p=0.000 n=10+10)
GenerateKey/P224-8                       577µs ± 0%     147µs ± 0%   -74.51%  (p=0.000 n=8+8)

name                                  old time/op    new time/op    delta
pkg:crypto/elliptic goos:darwin goarch:arm64
ScalarBaseMult/P384-8                   2.01ms ± 2%    0.50ms ± 0%  -75.00%  (p=0.000 n=10+8)
ScalarMult/P384-8                       2.02ms ± 3%    0.51ms ± 3%  -74.64%  (p=0.000 n=10+10)
MarshalUnmarshal/P384/Uncompressed-8    1.09µs ± 1%    0.76µs ± 0%  -30.27%  (p=0.000 n=10+9)
MarshalUnmarshal/P384/Compressed-8      1.08µs ± 0%    0.76µs ± 1%  -29.86%  (p=0.000 n=8+10)
pkg:crypto/ecdsa goos:darwin goarch:arm64
Sign/P384-8                             2.06ms ± 1%    0.56ms ± 2%  -72.76%  (p=0.000 n=10+10)
Verify/P384-8                           4.06ms ± 2%    1.08ms ± 0%  -73.49%  (p=0.000 n=10+8)
GenerateKey/P384-8                      2.01ms ± 1%    0.51ms ± 3%  -74.65%  (p=0.000 n=10+10)

name                                  old time/op    new time/op    delta
pkg:crypto/elliptic goos:darwin goarch:arm64
ScalarBaseMult/P521-8                    715µs ± 6%    1525µs ± 4%  +113.39%  (p=0.000 n=10+10)
ScalarMult/P521-8                        698µs ± 1%    1543µs ± 1%  +120.99%  (p=0.000 n=9+9)
MarshalUnmarshal/P521/Uncompressed-8     797ns ± 0%    1296ns ± 0%   +62.65%  (p=0.000 n=10+9)
MarshalUnmarshal/P521/Compressed-8       798ns ± 0%    1299ns ± 1%   +62.82%  (p=0.000 n=8+10)
pkg:crypto/ecdsa goos:darwin goarch:arm64
Sign/P521-8                              810µs ± 3%    1645µs ± 0%  +103.03%  (p=0.000 n=10+10)
Verify/P521-8                           1.42ms ± 1%    3.19ms ± 1%  +125.28%  (p=0.000 n=10+8)
GenerateKey/P521-8                       698µs ± 1%    1549µs ± 0%  +121.87%  (p=0.000 n=10+7)

Updates #40171

Change-Id: I34edf5002b5e9fad0ebb6c1e2119fb123ea6d18f
Reviewed-on: https://go-review.googlesource.com/c/go/+/360014
Run-TryBot: Filippo Valsorda <filippo@golang.org>
Trust: Filippo Valsorda <filippo@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Julie Qiu <julie@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
@ianlancetaylor
Copy link
Member

This is in the Go 1.18 milestone. Is it likely to happen for 1.18? Thanks.

@FiloSottile
Copy link
Contributor

P-224 and P-384 landed in Go 1.18. Moving to Backlog for P-256, 32-bit implementations, and scalar fields, which are not as immediate to bring in. P-256 and 32-bit depend on performance improvements. Scalar fields depend on better APIs.

@FiloSottile FiloSottile modified the milestones: Go1.18, Backlog Nov 18, 2021
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/396216 mentions this issue: WIP: cmd/compile: enable carry chain scheduling for AMD64

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/393656 mentions this issue: cmd/compile: schedule carry chain arithmetic disjointly

gopherbot pushed a commit that referenced this issue May 8, 2022
This results in a 1.7-2.4x improvement in native go crypto/elliptic
multiplication operations on PPC64, and similar improvements might
be possible on other architectures which use flags or similar to
represent the carry bit in SSA form.

If it is possible, schedule carry chains independently of each
other to avoid clobbering the carry flag. This is very expensive.

This is done by:

1. Identifying carry bit using, but not creating ops, and lowering
   their priority below all other ops which do not need to be
   placed at the top of a block. This effectively ensures only
   one carry chain will be placed at a time in most important
   cases (crypto/elliptic/internal/fiat contains most of them).

2. Raising the priority of carry bit generating ops to schedule
   later in a block to ensure they are placed as soon as they
   are ready.

Likewise, tuple ops which separate carrying ops are scored
similar to 2 above. This prevents unrelated ops from being
scheduled between carry-dependent operations. This occurs
when unrelated ops are ready to schedule alongside such
tuple ops. This reduces the chances a flag clobbering op
might be placed between two carry-dependent operations.

With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
are:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.3µs ± 0%    46.9µs ± 0%   +1.34%
ScalarBaseMult/P224                    356µs ± 0%     209µs ± 0%  -41.14%
ScalarBaseMult/P384                   1.20ms ± 0%    0.57ms ± 0%  -52.14%
ScalarBaseMult/P521                   3.38ms ± 0%    1.44ms ± 0%  -57.27%
ScalarMult/P256                        199µs ± 0%     199µs ± 0%   -0.17%
ScalarMult/P224                        357µs ± 0%     212µs ± 0%  -40.56%
ScalarMult/P384                       1.20ms ± 0%    0.58ms ± 0%  -51.86%
ScalarMult/P521                       3.37ms ± 0%    1.44ms ± 0%  -57.32%
MarshalUnmarshal/P256/Uncompressed    2.59µs ± 0%    2.52µs ± 0%   -2.63%
MarshalUnmarshal/P256/Compressed      2.58µs ± 0%    2.52µs ± 0%   -2.06%
MarshalUnmarshal/P224/Uncompressed    1.54µs ± 0%    1.40µs ± 0%   -9.42%
MarshalUnmarshal/P224/Compressed      1.54µs ± 0%    1.39µs ± 0%   -9.87%
MarshalUnmarshal/P384/Uncompressed    2.40µs ± 0%    1.80µs ± 0%  -24.93%
MarshalUnmarshal/P384/Compressed      2.35µs ± 0%    1.81µs ± 0%  -23.03%
MarshalUnmarshal/P521/Uncompressed    3.79µs ± 0%    2.58µs ± 0%  -31.81%
MarshalUnmarshal/P521/Compressed      3.80µs ± 0%    2.60µs ± 0%  -31.67%

Note, P256 uses an asm implementation, thus, little variation is expected.

Updates #40171

Change-Id: I810850e8ff429505424c92d6fe37f99aaa0c6e84
Reviewed-on: https://go-review.googlesource.com/c/go/+/393656
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Filippo Valsorda <valsorda@google.com>
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@julieqiu julieqiu removed this from Go Security Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Proposal Proposal-Accepted Proposal-Crypto Proposal related to crypto packages or other security issues
Projects
Status: Accepted
Development

No branches or pull requests