crypto/rsa: RSA key generation is unreasonably slow on MIPS architecture #33224

Open
ignatk opened this issue Jul 22, 2019 · 6 comments
Labels: NeedsInvestigation, Performance

Comments


ignatk commented Jul 22, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.7 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="mipsle"
GOOS="linux"

What did you do?

Created a simple benchmark for generating RSA keys. Made some adjustments to make the algorithm deterministic, so the benchmark time is independent of the value of the input random seed.

rsa_test.go

package rsa_test

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rsa"
	"io"
	"runtime/debug"
	"strings"
	"testing"
)

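// devZero is an io.Reader that always returns zero bytes; it serves as the
// base reader for the deterministic AES-CTR stream constructed in newDRNG.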
type devZero struct{}

func (dz devZero) Read(p []byte) (n int, err error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// reader which tries to negate the effect of crypto/internal/randutil/randutil.MaybeReadByte
// https://github.com/golang/go/blob/6269dcdc24d74379d8a609ce886149811020b2cc/src/crypto/internal/randutil/randutil.go#L25
// useful to regain deterministic output of some crypto routines with respect to a fixed pseudo-random stream
// such as https://github.com/golang/go/blob/6269dcdc24d74379d8a609ce886149811020b2cc/src/crypto/rsa/rsa.go#L222
type maybenotReader struct {
	io.Reader
}

func (r maybenotReader) Read(p []byte) (n int, err error) {
	if strings.Contains(string(debug.Stack()), "crypto/internal/randutil.MaybeReadByte") {
		// feed MaybeReadByte with dummy zeroes
		for i := range p {
			p[i] = 0
		}
		return len(p), nil
	}

	return r.Reader.Read(p)
}

var seed = []byte{0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}

func newDRNG() io.Reader {
	block, _ := aes.NewCipher(seed)
	stream := cipher.NewCTR(block, make([]byte, 16))

	return cipher.StreamReader{S: stream, R: devZero{}}
}

func benchmarkRsaGenerateKey(bits int, b *testing.B) {
	for n := 0; n < b.N; n++ {
		_, e := rsa.GenerateKey(maybenotReader{newDRNG()}, bits)
		if e != nil {
			b.Fatal(e)
		}
	}
}

func BenchmarkRsaGenerateKey1024(b *testing.B) {
	benchmarkRsaGenerateKey(1024, b)
}

func BenchmarkRsaGenerateKey2048(b *testing.B) {
	benchmarkRsaGenerateKey(2048, b)
}

func BenchmarkRsaGenerateKey4096(b *testing.B) {
	benchmarkRsaGenerateKey(4096, b)
}

Compile the above and run on a target platform:

$ GOARCH=mipsle go test -c -o testmips rsa_test.go
# copy ./testmips to MIPS Linux - I used Debian Stretch in QEMU
$ uname -a
Linux debmips 4.9.0-9-4kc-malta #1 Debian 4.9.168-1 (2019-04-12) mips GNU/Linux

What did you expect to see?

Seconds to generate RSA keys

What did you see instead?

Minutes (closer to hours) to generate 2048- and 4096-bit RSA keys

$ time ./testmips -test.bench=. -test.count 3
goos: linux
goarch: mipsle
BenchmarkRsaGenerateKey1024            1        35798014775 ns/op
BenchmarkRsaGenerateKey1024            1        31147393335 ns/op
BenchmarkRsaGenerateKey1024            1        6627440839 ns/op
BenchmarkRsaGenerateKey2048            1        168812137075 ns/op
BenchmarkRsaGenerateKey2048            1        134741347265 ns/op
BenchmarkRsaGenerateKey2048            1        147304597939 ns/op
BenchmarkRsaGenerateKey4096            1        1039797721543 ns/op
BenchmarkRsaGenerateKey4096            1        823377806776 ns/op
BenchmarkRsaGenerateKey4096            1        623262088403 ns/op
PASS

real    50m11.506s
user    33m49.000s
sys     15m33.744s

For comparison, generating a 4096-bit key with openssl 3 times on the same QEMU instance:

$ time for i in 1 2 3; do openssl genpkey -algorithm rsa -pkeyopt rsa_keygen_bits:4096 1>/dev/null; done
........................++++
........................................................................................................++++
.................................++++
....++++
.........................................................................................................................................................................................................................................................................................................................................................................................................................................++++
.....................................++++

real	2m10.133s
user	2m8.800s
sys	0m0.308s

Additional info

It seems the key generation time on MIPS grows disproportionately with the requested key size. Fun fact: for 4096-bit keys it is ~1.75x faster to emulate ARM on MIPS than to run MIPS code directly.

NOTE: for the simulations below I used Go 1.10, because later versions of Go are not qemu-user friendly (probably because of #24656). The host OS in question: Debian Stretch x86_64.

Compile the above test and run it under the qemu-mipsel-static emulator:

$ GOARCH=mipsle go test -c -o testmips rsa_test.go
$ time qemu-mipsel-static ./testmips -test.bench=. -test.count 3
goos: linux
goarch: mipsle
BenchmarkRsaGenerateKey1024-32                 1        5689779179 ns/op
BenchmarkRsaGenerateKey1024-32                 1        5625519786 ns/op
BenchmarkRsaGenerateKey1024-32                 1        5634661337 ns/op
BenchmarkRsaGenerateKey2048-32                 1        89522431701 ns/op
BenchmarkRsaGenerateKey2048-32                 1        92388147744 ns/op
BenchmarkRsaGenerateKey2048-32                 1        91639958105 ns/op
BenchmarkRsaGenerateKey4096-32                 1        589441375945 ns/op
BenchmarkRsaGenerateKey4096-32                 1        583990222852 ns/op
BenchmarkRsaGenerateKey4096-32                 1        589008197818 ns/op
PASS

real    34m14.872s
user    34m8.239s
sys     0m22.022s

Compile the same code, but for arm, and run it on a qemu-arm-static emulator built for mipsle (which I just downloaded from the official Debian MIPS stretch repositories), itself running on the qemu-mipsel-static emulator (double emulation):

$ GOARCH=arm go test -c -o testarm rsa_test.go
$ time qemu-mipsel-static ../emuemu/usr/bin/qemu-arm-static ./testarm -test.bench=. -test.count 3
goos: linux
goarch: arm
BenchmarkRsaGenerateKey1024-32                 1        8405353927 ns/op
BenchmarkRsaGenerateKey1024-32                 1        7890613791 ns/op
BenchmarkRsaGenerateKey1024-32                 1        7742888355 ns/op
BenchmarkRsaGenerateKey2048-32                 1        84443846429 ns/op
BenchmarkRsaGenerateKey2048-32                 1        85332574758 ns/op
BenchmarkRsaGenerateKey2048-32                 1        86418695881 ns/op
BenchmarkRsaGenerateKey4096-32                 1        332455626256 ns/op
BenchmarkRsaGenerateKey4096-32                 1        348751908420 ns/op
BenchmarkRsaGenerateKey4096-32                 1        350252575613 ns/op
PASS

real    25m46.399s
user    23m43.629s
sys     8m41.457s

As we can see, double emulation through arm makes almost no difference for 2048-bit keys, and 4096-bit keys are even faster when emulated through arm.

ALTree added the NeedsInvestigation label Jul 22, 2019
yechongbo commented

same issue here


ignatk commented Jul 23, 2019

Dumped a CPU profile:

ignat@debmips:~$ ./testmips -test.bench=BenchmarkRsaGenerateKey2048 -test.cpuprofile cpu.prof
goos: linux
goarch: mipsle
BenchmarkRsaGenerateKey2048            1        112770767524 ns/op
PASS
ignat@debmips:~$ go tool pprof cpu.prof
File: testmips
Type: cpu
Time: Jul 23, 2019 at 3:42am (PDT)
Duration: 1.18mins, Total samples = 1.17mins (99.43%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 69.19s, 98.34% of 70.36s total
Dropped 132 nodes (cum <= 0.35s)
Showing top 10 nodes out of 24
      flat  flat%   sum%        cum   cum%
    21.88s 31.10% 31.10%     26.21s 37.25%  math/big.mulAddWWW_g
    18.22s 25.90% 56.99%     45.70s 64.95%  math/big.addMulVVW_g
    14.27s 20.28% 77.27%     14.27s 20.28%  runtime.usleep
     6.35s  9.03% 86.30%      6.35s  9.03%  runtime.nanotime
     4.33s  6.15% 92.45%      4.33s  6.15%  math/big.mulWW_g
     1.34s  1.90% 94.36%     48.17s 68.46%  math/big.nat.montgomery
     1.30s  1.85% 96.21%      1.30s  1.85%  math/big.addWW_g
     0.88s  1.25% 97.46%      0.88s  1.25%  math/big.addMulVVW
     0.55s  0.78% 98.24%      0.55s  0.78%  runtime._LostSIGPROFDuringAtomic64
     0.07s 0.099% 98.34%     20.68s 29.39%  runtime.sysmon
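
The hot spots are the generic (pure-Go) multi-precision multiply-accumulate routines, which on a 32-bit platform such as mipsle process the big integers one 32-bit limb at a time. As a rough illustration of the per-limb work such an inner loop performs, here is a minimal standalone sketch (illustrative names, not the actual math/big source):

package main

import "fmt"

// addMulVVW32 computes z += x*y over 32-bit limbs and returns the final carry,
// in the spirit of the generic addMulVVW_g/mulAddWWW_g routines in the profile.
func addMulVVW32(z, x []uint32, y uint32) uint32 {
	var carry uint32
	for i := range z {
		// 32x32 -> 64-bit multiply, plus the current limb and the running carry.
		t := uint64(x[i])*uint64(y) + uint64(z[i]) + uint64(carry)
		z[i] = uint32(t)        // low 32 bits stay in place
		carry = uint32(t >> 32) // high 32 bits propagate into the next limb
	}
	return carry
}

func main() {
	z := []uint32{1, 2, 3}
	x := []uint32{0xffffffff, 0xffffffff, 0xffffffff}
	fmt.Println(addMulVVW32(z, x, 5), z)
}

math/big.nat.montgomery in the profile above spends almost all of its time in loops of this shape, so the per-limb cost feeds directly into key generation time.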

The Go math/big benchmarks do not look great either:

ignat@debmips:~$ go test -bench=BenchmarkAdd math/big
goos: linux
goarch: mipsle
pkg: math/big
BenchmarkAddVV/1         2000000               815 ns/op          39.23 MB/s
BenchmarkAddVV/2         2000000               816 ns/op          78.36 MB/s
BenchmarkAddVV/3         2000000               858 ns/op         111.86 MB/s
BenchmarkAddVV/4         2000000               876 ns/op         146.04 MB/s
BenchmarkAddVV/5         2000000               944 ns/op         169.40 MB/s
BenchmarkAddVV/10        1000000              1211 ns/op         264.13 MB/s
BenchmarkAddVV/100        200000              8770 ns/op         364.87 MB/s
BenchmarkAddVV/1000                20000             61477 ns/op         520.51 MB/s
BenchmarkAddVV/10000                3000            723396 ns/op         442.36 MB/s
BenchmarkAddVV/100000                200           7465979 ns/op         428.61 MB/s
BenchmarkAddVW/1                 2000000               642 ns/op           6.22 MB/s
BenchmarkAddVW/2                 2000000               731 ns/op          10.94 MB/s
BenchmarkAddVW/3                 2000000               688 ns/op          17.42 MB/s
BenchmarkAddVW/4                 2000000               724 ns/op          22.07 MB/s
BenchmarkAddVW/5                 2000000               740 ns/op          27.00 MB/s
BenchmarkAddVW/10                2000000               911 ns/op          43.87 MB/s
BenchmarkAddVW/100               1000000              4049 ns/op          98.77 MB/s
BenchmarkAddVW/1000                50000             33805 ns/op         118.32 MB/s
BenchmarkAddVW/10000               10000            368634 ns/op         108.51 MB/s
BenchmarkAddVW/100000                300           4493432 ns/op          89.02 MB/s
BenchmarkAddMulVVW/1             1000000              1664 ns/op          19.23 MB/s
BenchmarkAddMulVVW/2             1000000              2429 ns/op          26.34 MB/s
BenchmarkAddMulVVW/3             1000000              3026 ns/op          31.72 MB/s
BenchmarkAddMulVVW/4             1000000              3880 ns/op          32.99 MB/s
BenchmarkAddMulVVW/5              300000              4553 ns/op          35.14 MB/s
BenchmarkAddMulVVW/10             200000              8320 ns/op          38.46 MB/s
BenchmarkAddMulVVW/100             20000             73056 ns/op          43.80 MB/s
BenchmarkAddMulVVW/1000             3000            662475 ns/op          48.30 MB/s
BenchmarkAddMulVVW/10000             200           7183981 ns/op          44.54 MB/s
BenchmarkAddMulVVW/100000             20          76961522 ns/op          41.58 MB/s
PASS
ok      math/big        157.044s

Compared to an x86_64 VirtualBox VM on my laptop:

$ go test -bench=BenchmarkAdd math/big
goos: linux
goarch: amd64
pkg: math/big
BenchmarkAddVV/1-2     	100000000	        13.6 ns/op	4716.96 MB/s
BenchmarkAddVV/2-2     	100000000	        15.8 ns/op	8124.96 MB/s
BenchmarkAddVV/3-2     	100000000	        19.2 ns/op	9982.14 MB/s
BenchmarkAddVV/4-2     	100000000	        21.7 ns/op	11787.30 MB/s
BenchmarkAddVV/5-2     	100000000	        23.8 ns/op	13461.37 MB/s
BenchmarkAddVV/10-2    	30000000	        33.6 ns/op	19039.51 MB/s
BenchmarkAddVV/100-2   	10000000	       157 ns/op	40675.88 MB/s
BenchmarkAddVV/1000-2  	 1000000	      1511 ns/op	42348.24 MB/s
BenchmarkAddVV/10000-2 	  100000	     18488 ns/op	34616.53 MB/s
BenchmarkAddVV/100000-2         	   10000	    203620 ns/op	31431.00 MB/s
BenchmarkAddVW/1-2              	100000000	        10.7 ns/op	 745.27 MB/s
BenchmarkAddVW/2-2              	100000000	        12.2 ns/op	1314.85 MB/s
BenchmarkAddVW/3-2              	100000000	        13.9 ns/op	1723.45 MB/s
BenchmarkAddVW/4-2              	100000000	        13.7 ns/op	2331.84 MB/s
BenchmarkAddVW/5-2              	100000000	        17.2 ns/op	2330.50 MB/s
BenchmarkAddVW/10-2             	100000000	        27.3 ns/op	2929.60 MB/s
BenchmarkAddVW/100-2            	10000000	       158 ns/op	5049.63 MB/s
BenchmarkAddVW/1000-2           	 1000000	      1433 ns/op	5579.77 MB/s
BenchmarkAddVW/10000-2          	  100000	     13921 ns/op	5746.39 MB/s
BenchmarkAddVW/100000-2         	   10000	    156556 ns/op	5109.97 MB/s
BenchmarkAddMulVVW/1-2          	100000000	        12.5 ns/op	5119.07 MB/s
BenchmarkAddMulVVW/2-2          	100000000	        16.2 ns/op	7918.74 MB/s
BenchmarkAddMulVVW/3-2          	100000000	        20.4 ns/op	9412.36 MB/s
BenchmarkAddMulVVW/4-2          	50000000	        23.9 ns/op	10717.71 MB/s
BenchmarkAddMulVVW/5-2          	100000000	        25.2 ns/op	12716.54 MB/s
BenchmarkAddMulVVW/10-2         	30000000	        42.6 ns/op	15024.40 MB/s
BenchmarkAddMulVVW/100-2        	 5000000	       263 ns/op	24292.58 MB/s
BenchmarkAddMulVVW/1000-2       	  500000	      2450 ns/op	26118.28 MB/s
BenchmarkAddMulVVW/10000-2      	   50000	     26845 ns/op	23840.25 MB/s
BenchmarkAddMulVVW/100000-2     	    5000	    290007 ns/op	22068.36 MB/s
PASS
ok  	math/big	58.857s

To make the comparison fair, I disabled asm optimisations for the benchmarks:

$ go test -tags math_big_pure_go -bench=BenchmarkAdd math/big
goos: linux
goarch: amd64
pkg: math/big
BenchmarkAddVV/1-2     	100000000	        13.4 ns/op	4787.60 MB/s
BenchmarkAddVV/2-2     	100000000	        16.2 ns/op	7899.31 MB/s
BenchmarkAddVV/3-2     	100000000	        18.8 ns/op	10221.68 MB/s
BenchmarkAddVV/4-2     	50000000	        21.7 ns/op	11812.51 MB/s
BenchmarkAddVV/5-2     	50000000	        26.7 ns/op	11988.00 MB/s
BenchmarkAddVV/10-2    	50000000	        41.6 ns/op	15366.30 MB/s
BenchmarkAddVV/100-2   	 5000000	       355 ns/op	18007.78 MB/s
BenchmarkAddVV/1000-2  	  300000	      3550 ns/op	18026.66 MB/s
BenchmarkAddVV/10000-2 	   50000	     35846 ns/op	17853.94 MB/s
BenchmarkAddVV/100000-2         	    5000	    358935 ns/op	17830.51 MB/s
BenchmarkAddVW/1-2              	100000000	        10.1 ns/op	 789.12 MB/s
BenchmarkAddVW/2-2              	100000000	        11.3 ns/op	1411.17 MB/s
BenchmarkAddVW/3-2              	100000000	        14.0 ns/op	1711.23 MB/s
BenchmarkAddVW/4-2              	100000000	        15.9 ns/op	2013.34 MB/s
BenchmarkAddVW/5-2              	100000000	        17.0 ns/op	2351.14 MB/s
BenchmarkAddVW/10-2             	50000000	        26.0 ns/op	3072.35 MB/s
BenchmarkAddVW/100-2            	 5000000	       293 ns/op	2725.41 MB/s
BenchmarkAddVW/1000-2           	  500000	      2893 ns/op	2764.39 MB/s
BenchmarkAddVW/10000-2          	   50000	     27187 ns/op	2942.52 MB/s
BenchmarkAddVW/100000-2         	    5000	    285340 ns/op	2803.67 MB/s
BenchmarkAddMulVVW/1-2          	50000000	        21.6 ns/op	2956.66 MB/s
BenchmarkAddMulVVW/2-2          	50000000	        37.2 ns/op	3440.53 MB/s
BenchmarkAddMulVVW/3-2          	30000000	        47.5 ns/op	4039.75 MB/s
BenchmarkAddMulVVW/4-2          	20000000	        70.1 ns/op	3650.21 MB/s
BenchmarkAddMulVVW/5-2          	20000000	        78.4 ns/op	4080.71 MB/s
BenchmarkAddMulVVW/10-2         	10000000	       221 ns/op	2894.02 MB/s
BenchmarkAddMulVVW/100-2        	 1000000	      2352 ns/op	2721.05 MB/s
BenchmarkAddMulVVW/1000-2       	  100000	     13434 ns/op	4763.87 MB/s
BenchmarkAddMulVVW/10000-2      	   10000	    212209 ns/op	3015.88 MB/s
BenchmarkAddMulVVW/100000-2     	     500	   2667943 ns/op	2398.85 MB/s
PASS
ok  	math/big	55.719s
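
To quantify the asm-vs-pure-Go gap more systematically, one could collect several runs of each configuration and compare them with benchstat (golang.org/x/perf/cmd/benchstat); a possible invocation, assuming benchstat is installed:

$ go test -bench=BenchmarkAddMulVVW -count=10 math/big > asm.txt
$ go test -tags math_big_pure_go -bench=BenchmarkAddMulVVW -count=10 math/big > purego.txt
$ benchstat asm.txt purego.txt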

bcmills added this to the Unplanned milestone Aug 22, 2019

bcmills commented Aug 22, 2019

Note that we currently have no MIPS builders (#31217). Builders are required per the porting policy.

I don't think anyone is planning to work on this, and without builders to check for regressions I don't see how we could reasonably accept MIPS-specific changes at the moment.


ignatk commented Aug 22, 2019

Is it possible to run builders on amd64 in QEMU?


bcmills commented Aug 22, 2019

See #31217 (comment).

Is qemu particularly useful for optimization anyway? (I would expect its instruction timings to differ from real hardware...)


ignatk commented Aug 22, 2019

It depends on the optimisations, but generally, if you take qemu numbers as a baseline and compare them against other qemu numbers, relative optimisations are possible.

Also, before even considering optimisations it would be nice to make the code work at least reasonably well, for which qemu is probably more than enough. In this example it easily reproduces the issue described here, which, according to the Debian people, was also encountered on real hardware.

So I would say that, in the absence of real hardware, qemu is better than nothing.
