[stdlib] Lemire’s Nearly Divisionless Random Integer Generation #25286
Conversation
Implementation of Daniel Lemire's "Fast Random Integer Generation in an Interval". See https://arxiv.org/pdf/1805.10941.pdf
I have attempted a Swift port of @lemire's algorithm:

```c
uint64_t nearlydivisionless(uint64_t s, uint64_t (*random64)(void)) {
    uint64_t x = random64();
    __uint128_t m = (__uint128_t)x * (__uint128_t)s;
    uint64_t l = (uint64_t)m;
    if (l < s) {
        uint64_t t = -s % s;  // 2^64 mod s, via unsigned wrap-around
        while (l < t) {
            x = random64();
            m = (__uint128_t)x * (__uint128_t)s;
            l = (uint64_t)m;
        }
    }
    return m >> 64;
}
```

I'm not sure if I have correctly interpreted the negation of an unsigned integer; I had to rely on https://stackoverflow.com/questions/8026694/c-unary-minus-operator-behavior-with-unsigned-operands

```swift
@inlinable
public mutating func next<T: FixedWidthInteger & UnsignedInteger>(
  upperBound: T
) -> T {
  _precondition(upperBound != 0, "upperBound cannot be zero.")
  var random: T = next()
  var m = random.multipliedFullWidth(by: upperBound)
  if m.low < upperBound {
    // (T.max - upperBound + 1) % upperBound == 2^bitWidth mod upperBound
    let t = (T.max - upperBound + 1) % upperBound
    while m.low < t {
      random = next()
      m = random.multipliedFullWidth(by: upperBound)
    }
  }
  return m.high
}
```

Is there a better way to express this in Swift than …? In my local test with …

Open Questions
@swift-ci please benchmark

I'm expecting to see improvement in …
We are able to change the implementation of …
cc: @stephentyrone |
Out of instinct, I would replace … by … or …, depending on how clever you feel. It might be useful to point to the reference, as otherwise this function might look like mysterious magic to some: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
I expect to see some performance wins with the floating point random API as well. |
Something's messed up. I need to investigate locally why the benchmark has crashed… but it's late over here, so I'll come back to this tomorrow.
@lemire I have, of course, linked to your paper in the PR's description right at the top 😉 I liked the …
@palimondo Sorry, I missed the first comment. |
Hmm… reading through Melissa O'Neill's post, I see there is still a "Bitmask with Rejection (Unbiased) — Apple's Method" that performed even slightly better…
@palimondo |
I thought that Melissa's conclusion was that "The fastest (unbiased) method is Lemire's with an extra tweak from me [Melissa]." Let us ping her: @imneme
I'm sure you're right, I did not finish reading before posting here… 😇 All this tweaking and performance fine-tuning still hinges on the important question of:
…and I'm guessing that when considering ARM performance, Swift would decide based on the specific performance of these algorithms on the A family of processors, not the reference implementations.
@lemire, that is what I said. Not sure I can add much beyond that… Benchmarks are always something of a rabbit hole, with more data the deeper you go, but typically additional data reveals complexity rather than simple clarity. I really like @lemire's method in general. Frankly, I'd like it if the mod instruction itself were smart enough to realize when it has no work to do and short-circuit, rather than needing any tricks. As it is (at least on x86), mod is really just a side effect of div, and so it needs to do the full amount of work, which is why my optimization is helpful. (Possibly there is a general compiler optimization for mod, perhaps triggered by profiling!)
New blog post where I include ARM benchmarks: Nearly Divisionless Random Integer Generation On Various Systems. I include benchmarks on an ARM box. (I had the numbers inverted in an early version of the blog post: short story is that avoiding division is always good in my tests, using floats is slower.) |
@swift-ci please benchmark |
[swift-ci benchmark report: performance (-O, -Osize, -Onone) and code size tables, plus hardware overview. How to read the data: the tables contain differences in performance larger than 8% and differences in code size larger than 1%. Noise: the performance results (not code size!) sometimes contain false positives.]
@swift-ci please benchmark |
The optimized results seem quite nice. It even looks like the … I'm re-running benchmarks with @imneme's modulo optimization, to see how it fares in Swift.
[swift-ci benchmark report for the re-run: performance (-O, -Osize, -Onone) and code size tables, in the same format as the previous report.]
All results were identical, with the exception of a regressed … Looking at the generated assembly for the modulo optimization on Compiler Explorer, it looks like the Swift version is not as tight as the C version for some reason… Thoughts?
@palimondo One difference for sure is that Swift is careful regarding the case where the bound is zero (which leads to a division by zero) whereas the C version assumes the programmer checked. |
Swift, compared to C, seems unable to generate tightly fused instructions here for some reason (probably the division-by-zero check?)… so I'm removing it.
I realize that I have forgotten to use the masking subtraction assignment operator (&-=). So I'm removing the modulo optimization until someone more clever than me figures out how to make it work in Swift.
@swift-ci benchmark |
@swift-ci test |
Please create two Swift bugs if you haven't already:
stephentyrone left a comment:
Two things:
a. This method works well for small (sub-word) integer types, but is bad for larger types--we want to switch to rejection sampling (what Melissa calls "Apple's method", though it's so blindingly obvious that Ian and I can't have invented it) for types bigger than a word (this takes us from O(M(n)) down to O(n) complexity). That improvement doesn't need to be part of this PR, but is worth a bug report.
b. multipliedFullWidth currently fatalErrors for Int64 on 32b systems (https://github.com/apple/swift/blob/master/stdlib/public/core/IntegerTypes.swift.gyb#L1521). That's a bug, but it either needs to be fixed before this can go in (I can do that sometime this week), or you need to work around it here for now, probably by conditionally using the old version on 32b systems.
@stephentyrone I'm confused by what you mean in point a). Isn't the word size 64 bits on a 64-bit system, and Swift does not yet support any bigger types? I'm also unsure about the bug report for the missed optimization. If the reason why O'Neill's optimization didn't work in Swift is the division-by-zero check, as Lemire hypothesized, how would this be expressible without compromising safety? Introduction of a new operator? Anyway… I've filed SR-10909 Missed optimization for modulo and will add the architecture fork in the next commit to keep the old implementation around for 32-bit platforms.
In the standard library today, yes. In third party code or in the standard library in the future, there may be arbitrary types conforming to FixedWidthInteger. Now that we have binary and module stability, future third-party types may even be back-deployed against this version of the standard library, so it is good not to let clear inefficiencies be introduced. I'm fine with this one in the short-term, because it's no worse than the current situation, but we should resolve it when we can.
Given … Even without the precondition, …
Thanks! |
@stephentyrone I hope you don't mind that I've created a reminder for you: Support multipliedFullWidth on 32-bit systems (SR-10910)… 😇
…systems until https://bugs.swift.org/browse/SR-10910 is resolved.
Remove the old OpenBSD generation method, once 32-bit systems support multipliedFullWidth on UInt64.
@swift-ci please test |
Do we have a way to benchmark this on 32-bit systems once SR-10910 gets resolved? |
I'm going to review this once more tomorrow morning for numerical correctness specifically and then merge it assuming it looks good. @airspeedswift and I are discussing cherry-picking for 5.1 as well. |
No, because we'll want to have a rejection sampling path in the long run anyway for bigger-than-word types, which would also cover this case, and the existing modulus-based system is already pretty painful for 64b int on 32b. |
Ended up having some time tonight. This is fine to merge. |
n.b. @lemire @imneme thinking about this, I believe that we can actually do somewhat better still by using straightforward rejection sampling for large ranges and this algorithm for small ranges. This would be an extra branch, but it will be perfectly predicted in most uses (and could be hoisted or specialized for algorithms that produce values in bulk). I'll write up a demo sometime soon.
@stephentyrone Didn't @imneme try that here already? That version was not in the charts of the corresponding article for some reason... |
I would expect the perf tradeoff to be somewhat workload- (and hence benchmark-) dependent. It may be that it wasn't a substantial win for the use case that Melissa was testing, or there may be some issue we still need to understand, but for the top half of upperBound values, the actual selection looks the same with both approaches, and a comparison is certainly cheaper than a full-width multiply followed by a comparison. It likely doesn't matter for int64 due to the fact that the rejection branch is necessarily "unpredictable" with a good RNG, but for custom types where multiply can be much more costly than a branch mispredict (as this algorithm supports in Swift), it will matter. |
I meant it more like… "Is this the algorithm you had in mind? Maybe I could also transcribe that one…" 😜 …but I have overlooked my clue:
FWIW, I suspect I have misinterpreted the situation with the modulo optimization. The Swift vs. C assembly comparison wasn't fair, as I took too narrow a code snippet that I suspected to be problematic and left out code that included important hints for the Swift compiler to perform effective optimization. Furthermore, the original code was C++! I re-did the Godbolt Compiler Explorer comparison: I believe the generated instructions are largely equivalent. Looking again at the original article by @imneme, the modulo optimization provides a visible improvement in the Large-Shuffle benchmark, but in the Small-Shuffle benchmark the version with the modulo optimization …

Our benchmarks in SBS most closely correspond to the small shuffle, which would explain why I saw no improvement from the modulo optimization in our benchmarks. @stephentyrone I guess I should come up with a synthetic benchmark to mimic O'Neill's Large-Shuffle, and we could re-introduce the modulo optimization together with your rejection sampling (when you get around to it).
Here's a little critique, and a suggestion at the end. I'm not actively involved in Swift development, not even using it, so this is the best place I could find to add comments.

I am worried that a mistake was made when designing this class, and sadly it's too late now to correct it. The mistake is that the docs missed out on explaining how to create a custom next() function, especially if one wants to create a PRNG with reproducible output (using a seed value).

The problem is that the original implementation of upperBound: was apparently using the lower bits of the UInt64, whereas the new one uses the higher bits. That leads to the ugly effect that when someone implemented a PRNG that only delivered a smaller range, e.g. in the lower 32 bits, it used to work before this change as long as they only pulled small ranges from it. But with this change, the PRNG now fails, always returning zero for randomized subranges. The issue is shown in a code example with sample output here: https://gist.github.com/nicklockwood/cf1fde3eb24602883c17f94e63e490bc

This potential problem should have been known about - it's clearly documented in any theory of PRNGs. For instance, the Wikipedia article on PRNGs conveniently lists examples and shows that many of them do not return good randomized values in their lower bits. As one comment above says, the API docs only warned that PRNGs might return different values after an implementation change, but they did not consider that it might break entirely - yet this change did. Agreed?

To prevent this, there could have been an extra function that returns the range of significantly randomized values. This could be done by returning a divisor and a max value, which in the gist example above would have been divisor=1 and max=233280, whereas for the system RNG it's possibly 2^32 and 2^32 (assuming only the upper 32 bits contain good random values). With that, any subrange function, like the upperBound: one, would use this information to first reduce the value by this range before applying the upperBound reduction. In fact, adding such a function might still be valuable, as it would (a) make future implementations of custom PRNGs via a custom next() function easier, and (b) help other code that uses the raw next() result to create valid values from it.
The method …
Yes, that would be better than the current affair. Still, people might struggle with coming up with 64-bit random generators, and the need for reproducible RNGs is not uncommon, so please provide some assistance with that if you can. Also, has it been proven (and postulated) that every bit of next()'s result properly appears to be random and not repetitive? That is, does even the lowest bit have wide randomness, not a repeating pattern with a short cycle? I'm thinking of random generators that often keep giving something like 0, 1, 0, 1, … (cycle length 2) or 0, 0, 1, 1, … (cycle length 4) in the lowest bit. Because the fact that next(upperBound:) has been rewritten to prefer the upper bits suggests that the lower bits are, in fact, not as well randomized. Otherwise, I wonder whether that code couldn't be made faster by staying with the lower bits. It might save a few cycles because you could save the down-shift, right?
I agree with @xwu's answer and I think it is the correct one, though I would point out that the documentation already states: "Returns a value from a uniform, independent distribution of binary data." If you have non-trivial cycles or you are offering a limited amount of randomness, you are clearly violating the "uniform + independent" ideal.

It is true that, historically, people (and standard libraries) have relied on random number generators that would produce, for example, 32-bit words with only 31 bits of randomness (or less). In fact, the history of pseudo-random number generators is filled with really bad generators. However, today we have many really fast pseudo-random generators that produce high-quality random words: it is quite easy to have a really fast generator that will produce 64-bit words that pass hard statistical tests (e.g., consider the PCG family). If you really need it, you can rather easily go for a cryptographically strong generator. The sky is the limit! Twenty or thirty years ago, these were difficult to find and difficult to write, but it is now much easier. There is no good reason to default to whatever bad generator your standard C library offers if you care about quality.

I think that trying to accommodate obviously flawed pseudo-random number generators, like the ones you describe, is unnecessary, because we now have (in 2020) a large collection of high-quality generators. It seems that you are asking for support for legacy generators.
That's a fair and probably correct point. Some of us have offered some implementations... I think that the standard library could be extended with a few options. I, for one, would be willing to provide a couple if the core team called for it. |
Thanks for your clarifications. I must admit I'm a bit behind on this topic, having worked on and with PRNGs mostly in the '80s. Also, thanks for taking this seriously. I only came into this because I had been interested in why someone's attempt at subclassing Swift's RNG worked in older Xcode versions but not in recent ones. I tried to figure this out by reading the docs and found that they didn't provide answers. So I think adding some clarifying docs is the right way to resolve this for others in the future.