Clang does not recognize portable add-with-carry patterns #73847

Open
davidben opened this issue Nov 29, 2023 · 5 comments
Labels
clang:headers Headers provided by Clang, e.g. for intrinsics

Comments

@davidben
Contributor

When writing code for cryptographic primitives, or big integers in general, one often needs to use the ISA's add-with-carry instructions and chain carry flags up a bignum.

Although Clang does provide Clang-specific intrinsics like __builtin_addc, they're not portable across compilers. I made several attempts to write a portable add-with-carry here, and Clang seems unable to recognize any of them. Here's a godbolt link with a bunch of them:
https://godbolt.org/z/WTns6M8E6
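
For readers who cannot open the link, here is a minimal sketch of one common portable pattern, chained up a bignum. It is not necessarily one of the variants in the godbolt link; the names `addc_u64` and `add_words` and the 64-bit little-endian limb layout are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns x + y + carry_in (mod 2^64) and stores the
 * carry-out (0 or 1) in *carry_out. carry_in is assumed to be 0 or 1. */
static inline uint64_t addc_u64(uint64_t x, uint64_t y, uint64_t carry_in,
                                uint64_t *carry_out) {
  uint64_t sum = x + y;             /* unsigned addition wraps on overflow */
  uint64_t c1 = sum < x;            /* 1 if the first addition wrapped */
  uint64_t result = sum + carry_in;
  uint64_t c2 = result < sum;       /* 1 if adding the carry wrapped */
  *carry_out = c1 | c2;             /* at most one of c1, c2 can be set */
  return result;
}

/* Hypothetical bignum add: r = a + b over n limbs, least significant limb
 * first. Returns the carry out of the top limb. */
static uint64_t add_words(uint64_t *r, const uint64_t *a, const uint64_t *b,
                          size_t n) {
  uint64_t carry = 0;
  for (size_t i = 0; i < n; i++) {
    r[i] = addc_u64(a[i], b[i], carry, &carry);
  }
  return carry;
}
```

Ideally the loop body would compile down to a single add-with-carry instruction per limb; the complaint in this issue is that it does not.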

(CC @andres-erbsen, do you remember if there were other patterns we'd tried?)

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Nov 29, 2023
@EugeneZelenko EugeneZelenko added clang:headers Headers provided by Clang, e.g. for intrinsics and removed clang Clang issues not falling into any other category labels Nov 29, 2023
@davidben
Contributor Author

@EugeneZelenko are you sure clang:headers is the right label? This isn't an issue with the intrinsics; the issue is that very small snippets of portable code are not optimized in the same way as the intrinsics.

@shafik
Collaborator

shafik commented Nov 29, 2023

@AaronBallman @erichkeane any advice here?

@erichkeane
Collaborator

This is likely something that the opt folks have to take a look at. @topperc was particularly good at bit pattern recognition at one point, so he might be able to help out.

davidben added a commit to google/boringssl that referenced this issue Nov 29, 2023
I'm getting tired of having to rederive the best way to convince the
compiler to emit addc and subb functions. Do it once and use the Clang
builtins when available, because compilers seem to generally be terrible
at this. (See llvm/llvm-project#73847.)

The immediate trigger was the FIPS 186-2 PRF, which completely doesn't
matter, but reminded me of this mess.

As far as naming and calling conventions go, I just mimicked the Clang
ones. In doing so, also use the Clang builtins when available, which
helps Clang x86_64 no-asm builds a bit:

Before:
Did 704 ECDH P-384 operations in 1018920us (690.9 ops/sec)
Did 1353 ECDSA P-384 signing operations in 1077927us (1255.2 ops/sec)
Did 1190 ECDSA P-384 verify operations in 1020788us (1165.8 ops/sec)
Did 784 RSA 2048 signing operations in 1058644us (740.6 ops/sec)
Did 34000 RSA 2048 verify (same key) operations in 1011854us (33601.7 ops/sec)
Did 30000 RSA 2048 verify (fresh key) operations in 1005974us (29821.8 ops/sec)
Did 7799 RSA 2048 private key parse operations in 1061203us (7349.2 ops/sec)
Did 130 RSA 4096 signing operations in 1082617us (120.1 ops/sec)
Did 10472 RSA 4096 verify (same key) operations in 1082857us (9670.7 ops/sec)
Did 9086 RSA 4096 verify (fresh key) operations in 1039164us (8743.6 ops/sec)
Did 2574 RSA 4096 private key parse operations in 1078946us (2385.7 ops/sec)

After:
Did 775 ECDH P-384 operations in 1008465us (768.5 ops/sec)
Did 1474 ECDSA P-384 signing operations in 1062096us (1387.8 ops/sec)
Did 1485 ECDSA P-384 verify operations in 1086574us (1366.7 ops/sec)
Did 812 RSA 2048 signing operations in 1043705us (778.0 ops/sec)
Did 36000 RSA 2048 verify (same key) operations in 1005643us (35798.0 ops/sec)
Did 33000 RSA 2048 verify (fresh key) operations in 1028256us (32093.2 ops/sec)
Did 10087 RSA 2048 private key parse operations in 1018067us (9908.0 ops/sec)
Did 132 RSA 4096 signing operations in 1033049us (127.8 ops/sec)
Did 11000 RSA 4096 verify (same key) operations in 1070502us (10275.6 ops/sec)
Did 9812 RSA 4096 verify (fresh key) operations in 1047618us (9366.0 ops/sec)
Did 3245 RSA 4096 private key parse operations in 1083247us (2995.6 ops/sec)

But this is also a no-asm build, so we don't really care. Builds with
assembly, broadly, do not use these codepaths. The exception is the
generic ECC code on 32-bit Arm, which has a few mod-add functions, and
we don't have 32-bit Arm bn_add_words assembly:

Before:
Did 168 ECDH P-384 operations in 1003229us (167.5 ops/sec)
Did 330 ECDSA P-384 signing operations in 1076600us (306.5 ops/sec)
Did 319 ECDSA P-384 verify operations in 1080750us (295.2 ops/sec)
After:
Did 195 ECDH P-384 operations in 1026458us (190.0 ops/sec)
Did 350 ECDSA P-384 signing operations in 1005392us (348.1 ops/sec)
Did 341 ECDSA P-384 verify operations in 1008486us (338.1 ops/sec)

Change-Id: Ia3fa51e59398224b9c39180e1d856bb412aa1246
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/64309
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
@topperc
Collaborator

topperc commented Nov 29, 2023

FYI gcc does support __builtin_addc so it isn't clang specific anymore.

Do other compilers optimize any of your sequences?

@davidben
Contributor Author

davidben commented Dec 5, 2023

> FYI gcc does support __builtin_addc so it isn't clang specific anymore.

Yeah, I realized that after I filed this. We're using that when available, annoying as the ifdef soup is. :-) But it'd be nice if more portable sequences were recognized so not every project needs to discover this on their own.

> Do other compilers optimize any of your sequences?

Not that I'm aware of. The compilers we've used seem to be uniformly pretty bad at handling carry flags, alas. This was filed less as a feature parity thing and more as a missed optimization. (This sort of code, when really perf-sensitive, often needs to dip all the way into assembly to avoid compiler mishaps. Compilers have a long way to go here.)
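
To illustrate the kind of ifdef soup mentioned above, here is a rough sketch that uses the builtin when the compiler reports it and otherwise falls back to portable C. The wrapper name `addc_ull` and the macro `HAVE_BUILTIN_ADDCLL` are hypothetical, and using the `unsigned long long` variant `__builtin_addcll` is an assumption for the example, not a description of what any particular project ships.

```c
/* Detect the builtin without tripping compilers that lack __has_builtin:
 * the nested #if keeps __has_builtin(...) out of sight unless it exists. */
#if defined(__has_builtin)
#if __has_builtin(__builtin_addcll)
#define HAVE_BUILTIN_ADDCLL 1 /* hypothetical feature macro */
#endif
#endif

/* Hypothetical wrapper mimicking the builtin's calling convention:
 * returns x + y + carry_in and stores the carry-out (0 or 1). */
static inline unsigned long long addc_ull(unsigned long long x,
                                          unsigned long long y,
                                          unsigned long long carry_in,
                                          unsigned long long *carry_out) {
#if defined(HAVE_BUILTIN_ADDCLL)
  return __builtin_addcll(x, y, carry_in, carry_out);
#else
  /* Portable fallback; assumes carry_in is 0 or 1. */
  unsigned long long sum = x + y;
  unsigned long long c1 = sum < x;
  unsigned long long result = sum + carry_in;
  unsigned long long c2 = result < sum;
  *carry_out = c1 | c2;
  return result;
#endif
}
```

Callers see one function either way, so the choice between builtin and fallback stays confined to a single header.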

samuel40791765 pushed a commit to samuel40791765/aws-lc that referenced this issue Apr 11, 2024
(cherry picked from commit 70ca6bc24be103dabd68e448cd3af29b929b771d)
samuel40791765 pushed a commit to samuel40791765/aws-lc that referenced this issue Apr 11, 2024
(cherry picked from commit 70ca6bc24be103dabd68e448cd3af29b929b771d)
samuel40791765 pushed a commit to samuel40791765/aws-lc that referenced this issue Apr 12, 2024
(cherry picked from commit 70ca6bc24be103dabd68e448cd3af29b929b771d)
samuel40791765 pushed a commit to samuel40791765/aws-lc that referenced this issue Apr 16, 2024
(cherry picked from commit 70ca6bc24be103dabd68e448cd3af29b929b771d)
samuel40791765 pushed a commit to samuel40791765/aws-lc that referenced this issue Apr 16, 2024
(cherry picked from commit 70ca6bc24be103dabd68e448cd3af29b929b771d)
samuel40791765 pushed a commit to samuel40791765/aws-lc that referenced this issue Apr 16, 2024
(cherry picked from commit 70ca6bc24be103dabd68e448cd3af29b929b771d)