Improve ISAAC performance (take 2) #45
Conversation
This also replaces `core::num::Wrapping` with a few `wrapping_add`'s. There were about 30 conversions to and from `Wrapping`, while there are only 9 wrapping operations. Because `fill_via_u32_chunks` expects a `[u32]`, converting away was just easier.
Also uses a different solution to index without bounds checking, to recover a little of the lost performance.
Force-pushed c56c7f2 to c34d0bb
This does some crazy things with indexing, but is 45% faster. We are no longer throwing away half of the results.
Force-pushed e4cc6fa to 415ef6f
Definitely looks like a win!
But that bug with the index makes me think it would be worth adding a mixed-size extraction test for Isaac64: u32, u64, u32.
src/prng/isaac64.rs
Outdated
self.rsl[self.cnt as usize % RAND_SIZE].0

let value = self.rsl[index];
self.index += 2;
I think this is wrong: `self.index` should map 1→3 and 2→5 (same as 3→5). So you want `self.index = 2 * index + 1`, I think.
(I don't see why you insist on starting `self.index` at 1; it should work if it starts at 0 too, just +1 on access.)
It seems we are thinking a bit differently about how this indexing stuff works. But I'm not sold on how it works now, so let's see if I can get my head around something else ;-)
src/prng/isaac64.rs
Outdated
// Works always, also on big-endian systems, but is slower.
let tmp = self.rsl[index >> 1];
value = tmp as u32;
self.rsl[index >> 1] = tmp.rotate_right(32);
Why not `tmp >> 32`? We've already used the other bits and won't reuse them (I hope)!
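For reference, a sketch of the variant being suggested, in the same context as the snippet above (`rsl` holding the `u64` results, `index` counting in `u32` units); since the low 32 bits have already been handed out, a plain shift is enough:

let tmp = self.rsl[index >> 1];
value = tmp as u32;
// The high half moves down for the next u32 read; `tmp >> 32` instead of
// `rotate_right(32)`, because the low bits will not be reused.
self.rsl[index >> 1] = tmp >> 32;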
src/prng/isaac64.rs
Outdated
// Index as if this is a u32 slice.
let rsl = unsafe { &*(&mut self.rsl as *mut [u64; RAND_SIZE]
                      as *mut [u32; RAND_SIZE * 2]) };
value = rsl[index];
Surely for big-endian the index you want is `index ^ 1`?
Good idea!
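A rough sketch of how the `index ^ 1` adjustment could combine with the u32 view above; the function name and signature are made up for illustration, not taken from the PR:

const RAND_SIZE: usize = 256;

// Read the u32 half at `index` (counting in u32 units) out of the u64
// results buffer, picking the intended half on either endianness.
fn read_u32_half(rsl: &[u64; RAND_SIZE], index: usize) -> u32 {
    // View the [u64; RAND_SIZE] buffer as [u32; RAND_SIZE * 2].
    let rsl32 = unsafe {
        &*(rsl as *const [u64; RAND_SIZE] as *const [u32; RAND_SIZE * 2])
    };
    // On little-endian the low half of each u64 comes first, so `index`
    // works directly; on big-endian the two halves within each u64 are
    // swapped, so flipping the lowest bit selects the intended half.
    if cfg!(target_endian = "little") {
        rsl32[index]
    } else {
        rsl32[index ^ 1]
    }
}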
// it optimises to a bitwise mask).
self.rsl[self.cnt as usize % RAND_SIZE].0

let value = self.rsl[index];
Does the `index % RAND_SIZE` trick used elsewhere improve benchmarks? It may not, since the optimiser may already be able to infer the bounds on `index`.
I did a lot of measuring of this code, and also glanced over the assembly.

We already have to check the size of `index` to see whether a new block of random numbers has to be generated. If that comparison uses `>=` instead of `==`, the bounds check later can be optimised out. It has to use a local variable for `index` though, because when `self.isaac*()` is not inlined the optimiser cannot see that the index gets reset.

So the bounds check is already optimised out, and the mask makes it slower (a couple of percent, if I remember right).
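A minimal sketch of the pattern described above, with struct and method names assumed rather than taken from the PR; the point is that the `>=` comparison plus a local copy of the index lets the optimiser prove `index < RAND_SIZE` and drop the bounds check on the array access:

const RAND_SIZE: usize = 256;

struct Isaac64Rng {
    rsl: [u64; RAND_SIZE], // results buffer
    index: usize,          // next unread position
}

impl Isaac64Rng {
    fn isaac64(&mut self) {
        // placeholder for the real block-generation step
    }

    fn next_u64(&mut self) -> u64 {
        let mut index = self.index;   // local copy, not re-read from `self`
        if index >= RAND_SIZE {       // `>=` rather than `==`
            self.isaac64();           // refill the results buffer
            index = 0;
        }
        let value = self.rsl[index];  // bounds check can be optimised out here
        self.index = index + 1;
        value
    }
}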
use {Rng, CryptoRng, SeedFromRng, SeedableRng, Error};

#[allow(bad_style)]
type w32 = w<u32>;
I'm not sure that removing this is actually a win... I mean, now you have `.wrapping_add` in a few places and can't just think "I know this algorithm uses wrapping arithmetic everywhere".
In the commit message for ChaCha I added this note:

"This also replaces `core::num::Wrapping` with a few `wrapping_add`'s. There were about 30 conversions to and from `Wrapping`, while there are only 9 wrapping operations. Because `fill_via_u32_chunks` expects a `[u32]`, converting away was just easier."

I agree that "I know this algorithm uses wrapping arithmetic everywhere" is an advantage. Not all operations are available on wrapping types though, like `rotate_*`. You can maybe consider that a bug in the standard library.

While working with ISAAC, XorShift* and PCG it happened too many times that I had to ask myself whether I was working with the wrapped or the normal type, and whether an operation was available.
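A toy illustration of the trade-off (not ISAAC code): with `Wrapping` the arithmetic is implicit, but an operation like `rotate_left` still requires reaching into `.0`, while plain integers only need explicit `wrapping_*` calls in the few places that can overflow.

use core::num::Wrapping;

fn main() {
    // With `Wrapping`, wrapping arithmetic is the default behaviour...
    let a = Wrapping(0xffff_fffd_u32);
    let b = Wrapping(0x0000_0005_u32);
    let sum = a + b;                     // wraps to 2 instead of panicking
    // ...but `rotate_left` lives on the plain integer type, so the code
    // still has to unwrap with `.0`.
    let rotated = sum.0.rotate_left(13);

    // The alternative used in this PR: plain integers with explicit
    // wrapping operations where needed.
    let sum2 = 0xffff_fffd_u32.wrapping_add(5);
    let rotated2 = sum2.rotate_left(13);
    assert_eq!(rotated, rotated2);
}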
/// Implement `fill_bytes` by reading chunks from the output buffer of a block
/// based RNG.
///
/// The return values are `(consumed_u32, filled_u8)`.
I'm not sure these names are sufficiently clear without explanation.
I will try to write some better documentation.
rand_core/src/impls.rs
Outdated
/// The return values are `(consumed_u32, filled_u8)`.
///
/// Note that on big-endian systems values in the output buffer `src` are
/// mutated: they get converted to little-endian before copying.
Actually I think only `src[0..consumed_u32]` are affected, not the whole of `src`.
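To make the two return values concrete, here is a rough sketch of how a block RNG's `fill_bytes` might use them; the field names (`rsl`, `index`), the refill method, and the exact helper signature are assumptions for illustration, not the actual rand_core code — only the `(consumed_u32, filled_u8)` contract is taken from the documentation above.

fn fill_bytes(&mut self, dest: &mut [u8]) {
    let mut read_len = 0;
    while read_len < dest.len() {
        if self.index >= self.rsl.len() {
            self.isaac();   // generate a new block of results
            self.index = 0;
        }
        // `consumed_u32`: how many u32 values were drained from the source
        // buffer; `filled_u8`: how many bytes of `dest` were written.
        let (consumed_u32, filled_u8) = impls::fill_via_u32_chunks(
            &mut self.rsl[self.index..],
            &mut dest[read_len..],
        );
        self.index += consumed_u32;
        read_len += filled_u8;
    }
}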
Can you try adding this test for Isaac64? I calculated the numbers externally; hopefully I got it right.

#[test]
fn test_isaac64_true_mixed_values() {
    let seed: &[_] = &[1, 23, 456, 7890, 12345];
    let mut rng1 = Isaac64Rng::from_seed(seed);
    // Sequence is mostly to check correct interaction between 32- and
    // 64-bit value retrieval.
    assert_eq!(rng1.next_u64(), 547121783600835980);
    assert_eq!(rng1.next_u32(), 1058730652);
    assert_eq!(rng1.next_u64(), 3657128015525553718);
    assert_eq!(rng1.next_u64(), 11565188192941196660);
    assert_eq!(rng1.next_u32(), 288449107);
    assert_eq!(rng1.next_u32(), 646103879);
    assert_eq!(rng1.next_u64(), 18020149022502685743);
    assert_eq!(rng1.next_u32(), 3252674613);
    assert_eq!(rng1.next_u64(), 4469761996653280935);
}
It took some time to see the bug... Sharp :-)

I have removed all the ugly indexing stuff, and added a […]. It is curious to see how adding just one extra operation changes the benchmarks by 5~10%. Using an Option instead of indexing tricks is almost as slow as just truncating […].

Before:

After:
It seems our comments have crossed. I also hand-calculated the results. Is the one in the commit ok for you?
This is a new try instead of #36. Now fills the output buffer in reverse, so `fill_bytes` keeps working exactly as before.

It was now possible to move code shared between `IsaacRng` and `ChaChaRng` to a separate `rand_core::impls::fill_via_u*_chunks` function. All the unsafe code is contained therein.

The trick with 32-bit indexing to make `isaac64::next_u32` faster worked best with reading the results backwards. Now I had to think of something else, and it isn't pretty...
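For completeness, a minimal sketch of what a `fill_via_u32_chunks`-style helper boils down to; this is not the actual rand_core implementation (in particular it converts via `to_le_bytes` instead of mutating `src` in place, and it ignores the reverse-filling detail mentioned above):

fn fill_via_u32_chunks(src: &[u32], dest: &mut [u8]) -> (usize, usize) {
    // How many bytes can actually be copied, limited by both buffers.
    let filled_u8 = core::cmp::min(src.len() * 4, dest.len());
    // Number of u32 values touched; a partially used one counts as consumed.
    let consumed_u32 = (filled_u8 + 3) / 4;
    for (chunk, &word) in dest[..filled_u8].chunks_mut(4).zip(src) {
        // Copy the little-endian bytes of each u32 into the destination;
        // the last chunk may be shorter than 4 bytes.
        let bytes = word.to_le_bytes();
        chunk.copy_from_slice(&bytes[..chunk.len()]);
    }
    (consumed_u32, filled_u8)
}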