-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213
Comments
If we are going to change this package, it has more severe problems. https://go-review.googlesource.com/#/c/10161/ isn't ready but it is based on a much better generator, and we should use the Go 2 opportunity, should it arise, to replace the core algorithm. |
Here's the obvious fastrandn for the runtime package func fastrandn(n uint32) uint32 {
// See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/.
// Note that we use >> 31 instead of >> 32 because fastrand only produces 31 bit numbers.
return uint32((uint64(n) * uint64(fastrand())) >> 31)
} Unfortunately, the additional function call (instead of an inline Results on amd64, with an assembly implementation: name old time/op new time/op delta Worth doing for the non-constant case. This might get delayed going in, though, since I lack hardware (ppc, mips, s390x) to test asm implementations on other platforms. |
TensorFlow adopted this technique for better hashing performance : tensorflow/tensorflow@a47a300 |
@josharian Well, I could prioritize the replacement package that's already started and we can put it somewhere supported and encourage people to go there instead. |
Using this technique in the runtime will be a lot nicer once CL 34782 lands in 1.9. |
CL https://golang.org/cl/36932 mentions this issue. |
This occurs a fair amount in the runtime for non-power-of-two n. Use an alternative, faster formulation. name old time/op new time/op delta Fastrandn/2-8 4.45ns ± 2% 2.09ns ± 3% -53.12% (p=0.000 n=14+14) Fastrandn/3-8 4.78ns ±11% 2.06ns ± 2% -56.94% (p=0.000 n=15+15) Fastrandn/4-8 4.76ns ± 9% 1.99ns ± 3% -58.28% (p=0.000 n=15+13) Fastrandn/5-8 4.96ns ±13% 2.03ns ± 6% -59.14% (p=0.000 n=15+15) name old time/op new time/op delta SelectUncontended-8 33.7ns ± 2% 33.9ns ± 2% +0.70% (p=0.000 n=49+50) SelectSyncContended-8 1.68µs ± 4% 1.65µs ± 4% -1.54% (p=0.000 n=50+45) SelectAsyncContended-8 282ns ± 1% 277ns ± 1% -1.50% (p=0.000 n=48+43) SelectNonblock-8 5.31ns ± 1% 5.32ns ± 1% ~ (p=0.275 n=45+44) SelectProdCons-8 585ns ± 3% 577ns ± 2% -1.35% (p=0.000 n=50+50) GoroutineSelect-8 1.59ms ± 2% 1.59ms ± 1% ~ (p=0.084 n=49+48) Updates #16213 Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7 Reviewed-on: https://go-review.googlesource.com/36932 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
This occurs a fair amount in the runtime for non-power-of-two n. Use an alternative, faster formulation. name old time/op new time/op delta Fastrandn/2-8 4.45ns ± 2% 2.09ns ± 3% -53.12% (p=0.000 n=14+14) Fastrandn/3-8 4.78ns ±11% 2.06ns ± 2% -56.94% (p=0.000 n=15+15) Fastrandn/4-8 4.76ns ± 9% 1.99ns ± 3% -58.28% (p=0.000 n=15+13) Fastrandn/5-8 4.96ns ±13% 2.03ns ± 6% -59.14% (p=0.000 n=15+15) name old time/op new time/op delta SelectUncontended-8 33.7ns ± 2% 33.9ns ± 2% +0.70% (p=0.000 n=49+50) SelectSyncContended-8 1.68µs ± 4% 1.65µs ± 4% -1.54% (p=0.000 n=50+45) SelectAsyncContended-8 282ns ± 1% 277ns ± 1% -1.50% (p=0.000 n=48+43) SelectNonblock-8 5.31ns ± 1% 5.32ns ± 1% ~ (p=0.275 n=45+44) SelectProdCons-8 585ns ± 3% 577ns ± 2% -1.35% (p=0.000 n=50+50) GoroutineSelect-8 1.59ms ± 2% 1.59ms ± 1% ~ (p=0.084 n=49+48) Updates golang#16213 Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7
Change https://golang.org/cl/51891 mentions this issue: |
Shuffle uses the Fisher-Yates algorithm. Since this is new API, it affords us the opportunity to use a much faster Int31n implementation that mostly avoids division. As a result, BenchmarkPerm30ViaShuffle is about 30% faster than BenchmarkPerm30, despite requiring a separate initialization loop and using function calls to swap elements. Fixes golang#20480 Updates golang#16213 Updates golang#21211 Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
Change https://golang.org/cl/55972 mentions this issue: |
Shuffle uses the Fisher-Yates algorithm. Since this is new API, it affords us the opportunity to use a much faster Int31n implementation that mostly avoids division. As a result, BenchmarkPerm30ViaShuffle is about 30% faster than BenchmarkPerm30, despite requiring a separate initialization loop and using function calls to swap elements. Fixes #20480 Updates #16213 Updates #21211 Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5 Reviewed-on: https://go-review.googlesource.com/51891 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Perm and Shuffle are fundamentally doing the same work. This change makes Perm's algorithm match Shuffle's. In addition to allowing developers to switch more easily between the two methods, it affords a nice speed-up: name old time/op new time/op delta Perm3-8 75.7ns ± 1% 51.8ns ± 1% -31.59% (p=0.000 n=9+8) Perm30-8 610ns ± 1% 405ns ± 1% -33.67% (p=0.000 n=9+9) This change alters the output from Perm, given the same Source and seed. This is a change from Go 1.0 behavior. This necessitates updating the regression test. This also changes the number of calls made to the Source during Perm, which changes the output of the math/rand examples. This also slightly perturbs the output of Perm, nudging it out of the range currently accepted by TestUniformFactorial. However, it is complete unclear that the helpers relied on by TestUniformFactorial are correct. That is #21211. This change updates checkSimilarDistribution to respect closeEnough for standard deviations, which makes the test pass. The whole situation is muddy; see #21211 for details. There is an alternative implementation of Perm that avoids initializing m, which is more similar to the existing implementation, plus some optimizations: func (r *Rand) Perm(n int) []int { m := make([]int, n) max31 := n if n > 1<<31-1-1 { max31 = 1<<31 - 1 - 1 } i := 1 for ; i < max31; i++ { j := r.int31n(int32(i + 1)) m[i] = m[j] m[j] = i } for ; i < n; i++ { j := r.Int63n(int64(i + 1)) m[i] = m[j] m[j] = i } return m } This is a tiny bit faster than the implementation actually used in this change: name old time/op new time/op delta Perm3-8 51.8ns ± 1% 50.3ns ± 1% -2.83% (p=0.000 n=8+9) Perm30-8 405ns ± 1% 394ns ± 1% -2.66% (p=0.000 n=9+8) However, 3% in performance doesn't seem worth having the two algorithms diverge, nor the reduced readability of this alternative. Updates #16213. Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781 Reviewed-on: https://go-review.googlesource.com/55972 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rob Pike <r@golang.org>
Shuffle uses the Fisher-Yates algorithm. Since this is new API, it affords us the opportunity to use a much faster Int31n implementation that mostly avoids division. As a result, BenchmarkPerm30ViaShuffle is about 30% faster than BenchmarkPerm30, despite requiring a separate initialization loop and using function calls to swap elements. Fixes golang#20480 Updates golang#16213 Updates golang#21211 Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
Perm and Shuffle are fundamentally doing the same work. This change makes Perm's algorithm match Shuffle's. In addition to allowing developers to switch more easily between the two methods, it affords a nice speed-up: name old time/op new time/op delta Perm3-8 75.7ns ± 1% 51.8ns ± 1% -31.59% (p=0.000 n=9+8) Perm30-8 610ns ± 1% 405ns ± 1% -33.67% (p=0.000 n=9+9) This change alters the output from Perm, given the same Source and seed. This is a change from Go 1.0 behavior. This necessitates updating regress test. This also changes the number of calls made to the Source during Perm, which changes the output of the math/rand examples. This also slightly perturbs the output of Perm, nudging it out of the range currently accepted by TestUniformFactorial. However, it is complete unclear that the helpers relied on by TestUniformFactorial are correct. That is golang#21211. This change updates checkSimilarDistribution to respect closeEnough for standard deviations, which makes the test pass. The whole situation is muddy; see golang#21211 for details. There is an alternative implementation of Perm that avoids initializing m, which is more similar to the existing implementation, plus some optimizations: func (r *Rand) Perm(n int) []int { m := make([]int, n) max31 := n if n > 1<<31-1-1 { max31 = 1<<31 - 1 - 1 } i := 1 for ; i < max31; i++ { j := r.int31n(int32(i + 1)) m[i] = m[j] m[j] = i } for ; i < n; i++ { j := r.Int63n(int64(i + 1)) m[i] = m[j] m[j] = i } return m } This is a tiny bit faster than the implementation actually used in this change: name old time/op new time/op delta Perm3-8 51.8ns ± 1% 50.3ns ± 1% -2.83% (p=0.000 n=8+9) Perm30-8 405ns ± 1% 394ns ± 1% -2.66% (p=0.000 n=9+8) However, 3% in performance doesn't seem worth having the two algorithms diverge, nor the reduced readability of this alternative. Updates golang#16213. Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781
Closing this proposal in favor of #21835. |
See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ for details.
Marking as Go2, since the compatibility guarantee prevents us from changing the output of Int31n given a seed.
We can still use this in the runtime, though; there are several places where we do
fastrand1() % n
. I will send a CL for this for Go 1.8 and have it update this issue.cc @dgryski
The text was updated successfully, but these errors were encountered: