crypto/rsa: new key generation prohibitively slow under race detector #70644
Comments
Here is a more direct measurement, using crypto/rsa BenchmarkGenerateKey:
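One way to produce a comparison like this (the exact invocation isn't spelled out above) is to run the benchmark with and without the race detector and feed the results to benchstat from golang.org/x/perf, along these lines:

    go test -run='^$' -bench=GenerateKey -count=10 crypto/rsa > base.txt
    go test -run='^$' -bench=GenerateKey -count=10 -race crypto/rsa > race.txt
    benchstat base.txt race.txt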
With the improved trial divisions, GenerateKey takes 25% longer than before in non-race mode, but in race mode it takes 300% longer (4X the time). Is the issue that bigmod has more loops not in assembly, and so more instrumented race loops?
Profiling suggests the addMulVVW loop not being in assembly may be biting us here (for the non-optimized sizes, > 2048).
The GenerateKey benchmark is only 2048-bit, so addMulVVW should not be the issue. But if it is, maybe we should add a //go:norace to it.
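For reference, //go:norace asks the compiler not to insert race-detector instrumentation in the function it annotates. A minimal sketch of what that could look like on a generic multiply-accumulate loop follows; the package name, identifiers, and signature here only approximate the shape of the bigmod internals and are not the actual standard-library code:

    package bigmod

    import "math/bits"

    // addMulVVW computes z += x*y (assuming len(x) == len(z)) and returns the
    // final carry. The //go:norace directive suppresses race-detector
    // instrumentation for this function, trading race coverage of these
    // word-level memory accesses for speed under -race.
    //
    //go:norace
    func addMulVVW(z, x []uint, y uint) (carry uint) {
        for i := range z {
            hi, lo := bits.Mul(x[i], y)
            lo, c := bits.Add(lo, z[i], 0)
            hi += c
            lo, c = bits.Add(lo, carry, 0)
            hi += c
            z[i] = lo
            carry = hi
        }
        return carry
    }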
I updated the benchmark to test larger key sizes as well, as it seems like the slowdown is non-linear, but for really large keys maybe we just don't care.
Regardless, it seems like we're spending most of the new time in InverseVarTime, which makes sense. I can't particularly see why that would be so much worse than the old approach, but my understanding of the race detector is rather rudimentary. Funnily enough, I see significantly worse numbers (with the race detector enabled for both runs), but this might just be an example of Apple chips being extremely weird to benchmark on:
I poked at this some more and have a pair of CLs:
https://go.dev/cl/633995 also addressed this. |
I will close this issue once I have an analysis ready to post. It looks like the problem is fixed. |
I wrote a short RSA key generation benchmark:
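The benchmark itself isn't reproduced here; a minimal sketch of the shape described just below (sub-benchmarks over randomness source and key size, with a wrapped reader to force the non-BoringCrypto path) might look like this:

    package rsabench

    import (
        "crypto/rand"
        "crypto/rsa"
        "fmt"
        "io"
        "testing"
    )

    // nonStdReader wraps crypto/rand.Reader in a distinct type so that the
    // BoringCrypto fast path (taken only for crypto/rand.Reader at certain
    // key sizes) is not used.
    type nonStdReader struct{ r io.Reader }

    func (r nonStdReader) Read(p []byte) (int, error) { return r.r.Read(p) }

    func BenchmarkGenerateKey(b *testing.B) {
        sources := []struct {
            name string
            r    io.Reader
        }{
            {"Std", rand.Reader},
            {"NonStd", nonStdReader{rand.Reader}},
        }
        for _, src := range sources {
            for _, bits := range []int{1024, 2048, 3072, 4096} { // illustrative sizes
                b.Run(fmt.Sprintf("Rand=%s/%d", src.name, bits), func(b *testing.B) {
                    for i := 0; i < b.N; i++ {
                        if _, err := rsa.GenerateKey(src.r, bits); err != nil {
                            b.Fatal(err)
                        }
                    }
                })
            }
        }
    }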
I ran this against Go toolchains built at four versions, using GOEXPERIMENT=boringcrypto. Rand=Std and Rand=NonStd should behave identically, except that Rand=Std/2048, /3072, and /4096 use BoringCrypto, while other key sizes or non-crypto/rand.Reader randomness sources fall back to non-BoringCrypto paths. (I made sure to test BoringCrypto because I wanted to see how our new FIPS code compared to the old FIPS "solution".)
Here is a sequence of tables showing each version separately, measuring the base time and then the time under each sanitizer (race, asan, msan). I only ran key sizes up to 2048 because the intermediate versions were so slow. The Rand=Std/N and Rand=NonStd/N lines should measure identically, except that (as just noted) Rand=Std/2048 is a special case that dispatches to BoringCrypto in base, race, and asan builds (but not msan). BoringCrypto is invisible to base, race, and asan, so you don't see that line slow down in those modes. You can see in these tables that those two intermediate versions were extremely slow, especially for asan and msan! It's surprising to me how much slower asan and msan are than race. I'd always thought race had the hardest job of the three, but maybe @dvyukov did a better job on the implementation.
And here it is flipped so that the columns compare versions and the tables show different modes. You can see in this form that although the base RSA operations have gotten slower, as expected because of the use of constant-time bignum routines, the asan and msan modes have actually gotten faster, because we've hidden more from them than we did when using math/big. Again race is somehow an exception, as is the Std/2048 BoringCrypto line (but at least we understand that one).
(The two benchstat outputs are very wide. You have to scroll to see the full text, even on a monitor that is plenty wide enough. I don't know why GitHub insists on capping the width at something so narrow!)
I was going to ask why base non-BoringCrypto gets slower after CL 632775, but it looks like it's just within the very wide margins of error. By the way, probably not worth it, but to make it less noisy we could make key generation report the number of Miller-Rabin rejections and normalize the benchmark based on that. Or try to make the sequence of tested primes fixed; that part doesn't need to change often. (This is yet another reason I like rigid key generation from a seed: you can benchmark apples to apples because everyone needs to arrive at the same output.)
Even better, it would help to have a benchmark for a single Miller-Rabin round at important sizes.
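As a rough illustration, such a benchmark could be written against math/big; this only approximates the work (the real code path uses the unexported constant-time bigmod routines), and the sizes below are the prime sizes behind 2048-, 3072-, and 4096-bit keys:

    package mrbench

    import (
        "crypto/rand"
        "fmt"
        "io"
        "math/big"
        "testing"
    )

    // BenchmarkMillerRabinRound times the work of a single Miller-Rabin round
    // (one modular exponentiation plus the squaring chain) at a given candidate
    // size, using math/big as a stand-in for the internal bigmod routines.
    func BenchmarkMillerRabinRound(b *testing.B) {
        for _, bits := range []int{1024, 1536, 2048} {
            b.Run(fmt.Sprint(bits), func(b *testing.B) {
                // A random odd candidate of the requested size; the per-round
                // cost depends on the size, not on whether it is prime.
                buf := make([]byte, bits/8)
                if _, err := io.ReadFull(rand.Reader, buf); err != nil {
                    b.Fatal(err)
                }
                buf[0] |= 0x80       // full bit length
                buf[len(buf)-1] |= 1 // odd
                w := new(big.Int).SetBytes(buf)

                // Write w-1 = 2^a * m with m odd.
                m := new(big.Int).Sub(w, big.NewInt(1))
                a := 0
                for m.Bit(0) == 0 {
                    m.Rsh(m, 1)
                    a++
                }

                base, err := rand.Int(rand.Reader, w)
                if err != nil {
                    b.Fatal(err)
                }

                z := new(big.Int)
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                    z.Exp(base, m, w)        // z = base^m mod w
                    for j := 0; j < a; j++ { // worst-case squaring chain
                        z.Mul(z, z)
                        z.Mod(z, w)
                    }
                }
            })
        }
    }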
Although recent changes to crypto/rsa made key generation 3-4X slower and then improved that to only 50% slower, when run under the race detector the code still runs about 10X slower than before. Using
go test golang.org/x/crypto/openpgp/clearsign
as a test, here are the timings on my M3 MacBook Pro using different crypto commits:
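A plain run and a race-detector run of that test would be invoked along these lines:

    go test golang.org/x/crypto/openpgp/clearsign
    go test -race golang.org/x/crypto/openpgp/clearsign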