proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

josharian · 2016-06-29T16:15:47Z

See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ for details.

Marking as Go2, since the compatibility guarantee prevents us from changing the output of Int31n given a seed.

We can still use this in the runtime, though; there are several places where we do fastrand1() % n. I will send a CL for this for Go 1.8 and have it update this issue.

cc @dgryski

The text was updated successfully, but these errors were encountered:

robpike · 2016-06-29T18:02:10Z

If we are going to change this package, it has more severe problems. https://go-review.googlesource.com/#/c/10161/ isn't ready but it is based on a much better generator, and we should use the Go 2 opportunity, should it arise, to replace the core algorithm.

josharian · 2016-06-29T18:05:52Z

Here's the obvious fastrandn for the runtime package

func fastrandn(n uint32) uint32 {
    // See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/.
    // Note that we use >> 31 instead of >> 32 because fastrand only produces 31 bit numbers.
    return uint32((uint64(n) * uint64(fastrand())) >> 31)
}

Unfortunately, the additional function call (instead of an inline %) more than obviates the performance gains. This means implementing fastrandn in assembly and copy/pasting the core of fastrand, which is a pity.

Results on amd64, with an assembly implementation:

name old time/op new time/op delta
FastrandnNonConstant-8 5.90ns ±18% 4.62ns ±13% -21.63% (p=0.008 n=5+5)
FastrandnConstantNonPowerOfTwo-8 5.24ns ± 8% 5.14ns ±25% ~ (p=0.052 n=24+25)
FastrandnConstantPowerOfTwo-8 3.49ns ±11% 4.65ns ± 4% +33.47% (p=0.000 n=22+23)

Worth doing for the non-constant case. This might get delayed going in, though, since I lack hardware (ppc, mips, s390x) to test asm implementations on other platforms.

josharian · 2016-06-29T18:08:45Z

@robpike yes. The number of "can't fix this because compatibility" issues for math/rand is sad. (And I still don't really understand why it was ok to fix #16124.) I wish there was a way to fix them all in the stdlib without waiting for Go2. Oh well.

lemire · 2016-09-09T13:10:53Z

TensorFlow adopted this technique for better hashing performance : tensorflow/tensorflow@a47a300

robpike · 2016-09-10T00:20:24Z

@josharian Well, I could prioritize the replacement package that's already started and we can put it somewhere supported and encourage people to go there instead.

josharian · 2016-09-10T02:43:25Z

@robpike I think getting the replacement package wrapped up would be a service to the community. As for where to put it, how about go4.org? cc @bradfitz

My CL to use this approach in the runtime is still in my queue.

josharian · 2017-01-05T22:46:02Z

Using this technique in the runtime will be a lot nicer once CL 34782 lands in 1.9.

gopherbot · 2017-02-13T23:25:42Z

CL https://golang.org/cl/36932 mentions this issue.

This occurs a fair amount in the runtime for non-power-of-two n. Use an alternative, faster formulation. name old time/op new time/op delta Fastrandn/2-8 4.45ns ± 2% 2.09ns ± 3% -53.12% (p=0.000 n=14+14) Fastrandn/3-8 4.78ns ±11% 2.06ns ± 2% -56.94% (p=0.000 n=15+15) Fastrandn/4-8 4.76ns ± 9% 1.99ns ± 3% -58.28% (p=0.000 n=15+13) Fastrandn/5-8 4.96ns ±13% 2.03ns ± 6% -59.14% (p=0.000 n=15+15) name old time/op new time/op delta SelectUncontended-8 33.7ns ± 2% 33.9ns ± 2% +0.70% (p=0.000 n=49+50) SelectSyncContended-8 1.68µs ± 4% 1.65µs ± 4% -1.54% (p=0.000 n=50+45) SelectAsyncContended-8 282ns ± 1% 277ns ± 1% -1.50% (p=0.000 n=48+43) SelectNonblock-8 5.31ns ± 1% 5.32ns ± 1% ~ (p=0.275 n=45+44) SelectProdCons-8 585ns ± 3% 577ns ± 2% -1.35% (p=0.000 n=50+50) GoroutineSelect-8 1.59ms ± 2% 1.59ms ± 1% ~ (p=0.084 n=49+48) Updates #16213 Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7 Reviewed-on: https://go-review.googlesource.com/36932 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

This occurs a fair amount in the runtime for non-power-of-two n. Use an alternative, faster formulation. name old time/op new time/op delta Fastrandn/2-8 4.45ns ± 2% 2.09ns ± 3% -53.12% (p=0.000 n=14+14) Fastrandn/3-8 4.78ns ±11% 2.06ns ± 2% -56.94% (p=0.000 n=15+15) Fastrandn/4-8 4.76ns ± 9% 1.99ns ± 3% -58.28% (p=0.000 n=15+13) Fastrandn/5-8 4.96ns ±13% 2.03ns ± 6% -59.14% (p=0.000 n=15+15) name old time/op new time/op delta SelectUncontended-8 33.7ns ± 2% 33.9ns ± 2% +0.70% (p=0.000 n=49+50) SelectSyncContended-8 1.68µs ± 4% 1.65µs ± 4% -1.54% (p=0.000 n=50+45) SelectAsyncContended-8 282ns ± 1% 277ns ± 1% -1.50% (p=0.000 n=48+43) SelectNonblock-8 5.31ns ± 1% 5.32ns ± 1% ~ (p=0.275 n=45+44) SelectProdCons-8 585ns ± 3% 577ns ± 2% -1.35% (p=0.000 n=50+50) GoroutineSelect-8 1.59ms ± 2% 1.59ms ± 1% ~ (p=0.084 n=49+48) Updates golang#16213 Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7

gopherbot · 2017-07-29T15:16:44Z

Change https://golang.org/cl/51891 mentions this issue: math/rand: add Shuffle

Shuffle uses the Fisher-Yates algorithm. Since this is new API, it affords us the opportunity to use a much faster Int31n implementation that mostly avoids division. As a result, BenchmarkPerm30ViaShuffle is about 30% faster than BenchmarkPerm30, despite requiring a separate initialization loop and using function calls to swap elements. Fixes golang#20480 Updates golang#16213 Updates golang#21211 Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5

gopherbot · 2017-08-15T23:34:19Z

Change https://golang.org/cl/55972 mentions this issue: math/rand: make Perm match Shuffle

Shuffle uses the Fisher-Yates algorithm. Since this is new API, it affords us the opportunity to use a much faster Int31n implementation that mostly avoids division. As a result, BenchmarkPerm30ViaShuffle is about 30% faster than BenchmarkPerm30, despite requiring a separate initialization loop and using function calls to swap elements. Fixes #20480 Updates #16213 Updates #21211 Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5 Reviewed-on: https://go-review.googlesource.com/51891 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>

Perm and Shuffle are fundamentally doing the same work. This change makes Perm's algorithm match Shuffle's. In addition to allowing developers to switch more easily between the two methods, it affords a nice speed-up: name old time/op new time/op delta Perm3-8 75.7ns ± 1% 51.8ns ± 1% -31.59% (p=0.000 n=9+8) Perm30-8 610ns ± 1% 405ns ± 1% -33.67% (p=0.000 n=9+9) This change alters the output from Perm, given the same Source and seed. This is a change from Go 1.0 behavior. This necessitates updating the regression test. This also changes the number of calls made to the Source during Perm, which changes the output of the math/rand examples. This also slightly perturbs the output of Perm, nudging it out of the range currently accepted by TestUniformFactorial. However, it is complete unclear that the helpers relied on by TestUniformFactorial are correct. That is #21211. This change updates checkSimilarDistribution to respect closeEnough for standard deviations, which makes the test pass. The whole situation is muddy; see #21211 for details. There is an alternative implementation of Perm that avoids initializing m, which is more similar to the existing implementation, plus some optimizations: func (r *Rand) Perm(n int) []int { m := make([]int, n) max31 := n if n > 1<<31-1-1 { max31 = 1<<31 - 1 - 1 } i := 1 for ; i < max31; i++ { j := r.int31n(int32(i + 1)) m[i] = m[j] m[j] = i } for ; i < n; i++ { j := r.Int63n(int64(i + 1)) m[i] = m[j] m[j] = i } return m } This is a tiny bit faster than the implementation actually used in this change: name old time/op new time/op delta Perm3-8 51.8ns ± 1% 50.3ns ± 1% -2.83% (p=0.000 n=8+9) Perm30-8 405ns ± 1% 394ns ± 1% -2.66% (p=0.000 n=9+8) However, 3% in performance doesn't seem worth having the two algorithms diverge, nor the reduced readability of this alternative. Updates #16213. Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781 Reviewed-on: https://go-review.googlesource.com/55972 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rob Pike <r@golang.org>

Shuffle uses the Fisher-Yates algorithm. Since this is new API, it affords us the opportunity to use a much faster Int31n implementation that mostly avoids division. As a result, BenchmarkPerm30ViaShuffle is about 30% faster than BenchmarkPerm30, despite requiring a separate initialization loop and using function calls to swap elements. Fixes golang#20480 Updates golang#16213 Updates golang#21211 Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5

Perm and Shuffle are fundamentally doing the same work. This change makes Perm's algorithm match Shuffle's. In addition to allowing developers to switch more easily between the two methods, it affords a nice speed-up: name old time/op new time/op delta Perm3-8 75.7ns ± 1% 51.8ns ± 1% -31.59% (p=0.000 n=9+8) Perm30-8 610ns ± 1% 405ns ± 1% -33.67% (p=0.000 n=9+9) This change alters the output from Perm, given the same Source and seed. This is a change from Go 1.0 behavior. This necessitates updating regress test. This also changes the number of calls made to the Source during Perm, which changes the output of the math/rand examples. This also slightly perturbs the output of Perm, nudging it out of the range currently accepted by TestUniformFactorial. However, it is complete unclear that the helpers relied on by TestUniformFactorial are correct. That is golang#21211. This change updates checkSimilarDistribution to respect closeEnough for standard deviations, which makes the test pass. The whole situation is muddy; see golang#21211 for details. There is an alternative implementation of Perm that avoids initializing m, which is more similar to the existing implementation, plus some optimizations: func (r *Rand) Perm(n int) []int { m := make([]int, n) max31 := n if n > 1<<31-1-1 { max31 = 1<<31 - 1 - 1 } i := 1 for ; i < max31; i++ { j := r.int31n(int32(i + 1)) m[i] = m[j] m[j] = i } for ; i < n; i++ { j := r.Int63n(int64(i + 1)) m[i] = m[j] m[j] = i } return m } This is a tiny bit faster than the implementation actually used in this change: name old time/op new time/op delta Perm3-8 51.8ns ± 1% 50.3ns ± 1% -2.83% (p=0.000 n=8+9) Perm30-8 405ns ± 1% 394ns ± 1% -2.66% (p=0.000 n=9+8) However, 3% in performance doesn't seem worth having the two algorithms diverge, nor the reduced readability of this alternative. Updates golang#16213. Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781

ianlancetaylor · 2018-01-09T22:59:44Z

Closing this proposal in favor of #21835.

josharian added the v2 An incompatible library change label Jun 29, 2016

josharian added this to the Unplanned milestone Jun 29, 2016

josharian self-assigned this Jun 29, 2016

josharian mentioned this issue Jan 5, 2017

proposal: non-locking FastRand #18514

Closed

rsc changed the title ~~math/rand: speed up Int31n with multiply/shift instead of modulo~~ proposal: math/rand: speed up Int31n with multiply/shift instead of modulo Jun 17, 2017

gopherbot added the Proposal label Jun 17, 2017

ianlancetaylor closed this as completed Jan 9, 2018

josharian mentioned this issue Jan 11, 2018

proposal: use PCG Source in math/rand for Go 2 #21835

Closed

josharian mentioned this issue Jul 7, 2018

proposal: math/rand: rework for Go 2 #26263

Closed

golang locked and limited conversation to collaborators Jan 9, 2019

gopherbot added the FrozenDueToAge label Jan 9, 2019

rsc unassigned josharian Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

josharian commented Jun 29, 2016

robpike commented Jun 29, 2016

josharian commented Jun 29, 2016

josharian commented Jun 29, 2016

lemire commented Sep 9, 2016

robpike commented Sep 10, 2016

josharian commented Sep 10, 2016

josharian commented Jan 5, 2017

gopherbot commented Feb 13, 2017

gopherbot commented Jul 29, 2017

gopherbot commented Aug 15, 2017

ianlancetaylor commented Jan 9, 2018

proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

Comments

josharian commented Jun 29, 2016

robpike commented Jun 29, 2016

josharian commented Jun 29, 2016

josharian commented Jun 29, 2016

lemire commented Sep 9, 2016

robpike commented Sep 10, 2016

josharian commented Sep 10, 2016

josharian commented Jan 5, 2017

gopherbot commented Feb 13, 2017

gopherbot commented Jul 29, 2017

gopherbot commented Aug 15, 2017

ianlancetaylor commented Jan 9, 2018