Improve ISAAC performance (take 2) #45

Merged · 8 commits · Nov 11, 2017

Conversation

@pitdicker (Author)

This is a new attempt, replacing #36. It now fills the output buffer in reverse, so fill_bytes keeps working exactly as before.

This made it possible to move the code shared between IsaacRng and ChaChaRng into a separate rand_core::impl::fill_via_u*_chunks function. All the unsafe code is contained therein.
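
As a rough illustration of what such a shared helper could look like (a minimal sketch under my own assumptions, not the exact code from this PR; the real helper works on raw bytes for speed, while this version uses safe per-chunk copies):

    /// Fill `dest` with bytes taken from `src`, treating each `u32` as four
    /// little-endian bytes. Returns `(consumed_u32, filled_u8)`.
    fn fill_via_u32_chunks(src: &[u32], dest: &mut [u8]) -> (usize, usize) {
        // Never read more u32 values than `dest` can hold (rounding up for a
        // partial final chunk) and never write past the end of `dest`.
        let filled_u8 = core::cmp::min(src.len() * 4, dest.len());
        let consumed_u32 = (filled_u8 + 3) / 4;
        for (out, &word) in dest[..filled_u8].chunks_mut(4).zip(src) {
            let bytes = word.to_le_bytes();
            out.copy_from_slice(&bytes[..out.len()]);
        }
        (consumed_u32, filled_u8)
    }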

The trick with 32-bit indexing to make isaac64::next_u32 faster worked best with reading the results backwards. Now I had to think of something else, and it isn't pretty...

This also replaces `core::num::Wrapping` with a few `wrapping_add`'s.
There were about 30 conversions to and from `Wrapping`, while there are only
9 wrapping operations.

Because `fill_via_u32_chunks` expects a `[u32]`, converting away was just
easier.
Also uses a different solution to index without bounds checking, to recover a
tiny bit of the lost performance.
This does some crazy things with indexing, but is 45% faster. We are no longer
throwing away half of the results.
@pitdicker force-pushed the isaac_optim branch 2 times, most recently from e4cc6fa to 415ef6f on November 11, 2017 08:01

@pitdicker (Author)

Sorry for all the pushes.

I temporarily merged #44, and big-endian tests are green (not on first try though).

@dhardy (Owner) left a comment

Definitely looks like a win!

But that bug with the index makes me think it would be worth adding a mixed-size extraction test for Isaac64: u32, u64, u32.

self.rsl[self.cnt as usize % RAND_SIZE].0

let value = self.rsl[index];
self.index += 2;

@dhardy (Owner)

I think this is wrong: self.index should map 1→3 and 2→5 (same as 3→5). So you want self.index = 2 * index + 1 I think.

(I don't see why you insist on starting self.index at 1; it should work if it starts at 0 too, just +1 on access.)

@pitdicker (Author)

It seems we are thinking a bit differently about how this indexing stuff works. But I'm not sold on how it works now, so let's see if I can get my head around something else ;-)

// Works always, also on big-endian systems, but is slower.
let tmp = self.rsl[index >> 1];
value = tmp as u32;
self.rsl[index >> 1] = tmp.rotate_right(32);

@dhardy (Owner)

Why not tmp >> 32? We've already used the other bits and won't reuse them (I hope)!

// Index as if this is a u32 slice.
let rsl = unsafe { &*(&mut self.rsl as *mut [u64; RAND_SIZE]
as *mut [u32; RAND_SIZE * 2]) };
value = rsl[index];

@dhardy (Owner)

Surely for big-endian the index you want is index ^ 1?

@pitdicker (Author)

Good idea!
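
To make the index ^ 1 idea concrete, here is a self-contained sketch (my own names, not the PR's code): when a `[u64]` buffer is viewed as `[u32]`, the low half of `results[i]` sits at u32 index `2 * i` on little-endian targets but at `2 * i + 1` on big-endian, so flipping the lowest bit of the index on big-endian selects the same logical half.

    fn read_u32_half(results: &[u64; 8], index: usize) -> u32 {
        // On big-endian the two 32-bit halves of each u64 are stored swapped,
        // so flip the lowest index bit to keep "even index == low half".
        let index = if cfg!(target_endian = "big") { index ^ 1 } else { index };
        // View the u64 buffer as twice as many u32 values (alignment is fine:
        // u64 is at least as strictly aligned as u32).
        let view = unsafe { &*(results as *const [u64; 8] as *const [u32; 16]) };
        view[index]
    }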

// it optimises to a bitwise mask).
self.rsl[self.cnt as usize % RAND_SIZE].0

let value = self.rsl[index];

@dhardy (Owner)

Does the index % RAND_SIZE trick used elsewhere improve benchmarks?

It may not since the optimiser may already be able to infer the bounds on index.

@pitdicker (Author)

I did a lot of measuring of this code, and also glanced over the assembly.

We already have to check whether index has reached the end of the results buffer, to know whether a new block of random numbers has to be generated. If that comparison uses >= instead of ==, the bounds check later can be optimised out. It has to use a local variable for index though, because when self.isaac*() is not inlined the optimiser cannot see that the index gets reset.

So the bounds check is already optimised out, and the mask makes it slower (a couple of percent if I remember right).
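
A small sketch of the pattern described above (field and method names are my own, not the PR's): the >= comparison on a local copy of the index is what lets the compiler prove the later slice access is in bounds.

    const RAND_SIZE: usize = 256;

    struct BlockRng {
        results: [u32; RAND_SIZE],
        index: usize,
    }

    impl BlockRng {
        fn next_u32(&mut self) -> u32 {
            // Work on a local copy: if the check went through `self.index` and
            // the refill call below is not inlined, the optimiser could not
            // assume the index was reset, and the bounds check would come back.
            let mut index = self.index;
            if index >= RAND_SIZE {
                self.generate_block();
                index = 0;
            }
            // `index < RAND_SIZE` is now provable, so no bounds check is emitted.
            let value = self.results[index];
            self.index = index + 1;
            value
        }

        fn generate_block(&mut self) {
            // Details omitted; a real implementation refills `results` here.
            self.index = 0;
        }
    }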

use {Rng, CryptoRng, SeedFromRng, SeedableRng, Error};

#[allow(bad_style)]
type w32 = w<u32>;

@dhardy (Owner)

I'm not sure that removing this is actually a win... I mean, now you have .wrapping_add in a few places and can't just think "I know this algorithm uses wrapping arithmetic everywhere".

@pitdicker (Author)

In the commit message for ChaCha I added this note:

This also replaces core::num::Wrapping with a few wrapping_add's.
There were about 30 conversions to and from Wrapping, while there are only
9 wrapping operations.

Because fill_via_u32_chunks expects a [u32], converting away was just
easier.

I agree that "I know this algorithm uses wrapping arithmetic everywhere" is an advantage. Not all operations are available on wrapping types though, like rotate_*. You can maybe consider this to be a bug in the standard library.

While working with ISAAC, XorShift* and PCG, it happened too many times that I had to ask myself whether I was working with the wrapped or the normal type, and whether an operation was available.
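
For example (my own illustration): at least on stable Rust at the time, Wrapping<u32> provides the wrapping arithmetic operators, but an operation like rotate_left is only available on the plain integer, so mixing the two forces conversions back and forth.

    use core::num::Wrapping;

    // With `Wrapping` you still have to unwrap for the rotation.
    fn mix_wrapping(a: Wrapping<u32>, b: Wrapping<u32>) -> Wrapping<u32> {
        Wrapping((a + b).0.rotate_left(13))
    }

    // With plain integers a single `wrapping_add` does the job, no conversions.
    fn mix_plain(a: u32, b: u32) -> u32 {
        a.wrapping_add(b).rotate_left(13)
    }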

/// Implement `fill_bytes` by reading chunks from the output buffer of a block
/// based RNG.
///
/// The return values are `(consumed_u32, filled_u8)`.

@dhardy (Owner)

I'm not sure these names are sufficiently clear without explanation.

@pitdicker (Author)

I will try to write some better documentation.

/// The return values are `(consumed_u32, filled_u8)`.
///
/// Note that on big-endian systems values in the output buffer `src` are
/// mutated: they get converted to little-endian before copying.

@dhardy (Owner)

Actually I think only src[0..consumed_u32] is affected, not the whole of src.
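
As a usage sketch (my own names, building on the helper sketched earlier): a fill_bytes built on top of it loops over the destination and advances by the two returned counts, so only the part of the results buffer that is actually copied, src[0..consumed_u32], ever needs the endian conversion.

    // Assumes `results` is non-empty; a real RNG would generate a new block of
    // results instead of simply wrapping the index around.
    fn fill_bytes_via_chunks(results: &[u32], index: &mut usize, dest: &mut [u8]) {
        let mut read_len = 0;
        while read_len < dest.len() {
            if *index >= results.len() {
                *index = 0; // placeholder for "generate a new block"
            }
            let (consumed_u32, filled_u8) =
                fill_via_u32_chunks(&results[*index..], &mut dest[read_len..]);
            *index += consumed_u32;
            read_len += filled_u8;
        }
    }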

@dhardy (Owner) commented Nov 11, 2017

Can you try adding this test for Isaac 64? I calculated the numbers externally; hopefully I got it right.

    #[test]
    fn test_isaac64_true_mixed_values() {
        let seed: &[_] = &[1, 23, 456, 7890, 12345];
        let mut rng1 = Isaac64Rng::from_seed(seed);
        // Sequence is mostly to check correct interaction between 32- and
        // 64-bit value retrieval.
        assert_eq!(rng1.next_u64(), 547121783600835980);
        assert_eq!(rng1.next_u32(), 1058730652);
        assert_eq!(rng1.next_u64(), 3657128015525553718);
        assert_eq!(rng1.next_u64(), 11565188192941196660);
        assert_eq!(rng1.next_u32(), 288449107);
        assert_eq!(rng1.next_u32(), 646103879);
        assert_eq!(rng1.next_u64(), 18020149022502685743);
        assert_eq!(rng1.next_u32(), 3252674613);
        assert_eq!(rng1.next_u64(), 4469761996653280935);
    }

@pitdicker (Author)

It took some time to see the bug... Sharp :-)

I have removed all the ugly indexing stuff, and added a half_used bool. It is a little slower for next_u32, and a little faster for next_u64 and fill_bytes.

It is curious to see how adding just one extra operation changes the benchmarks by 5~10%. Using an Option instead of indexing tricks is almost as slow as just truncating next_u64.
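
A minimal sketch of the half_used idea (the field names and the cached `last` value are my own simplification, not necessarily the final code): every u64 pulled from the results buffer serves two next_u32 calls.

    const RAND_SIZE: usize = 256;

    struct Isaac64Sketch {
        results: [u64; RAND_SIZE],
        index: usize,
        half_used: bool, // true if the upper half of `last` is still unused
        last: u64,
    }

    impl Isaac64Sketch {
        fn next_u32(&mut self) -> u32 {
            if self.half_used {
                // Hand out the other half of the previously read u64.
                self.half_used = false;
                return (self.last >> 32) as u32;
            }
            if self.index >= RAND_SIZE {
                self.generate();
            }
            self.last = self.results[self.index];
            self.index += 1;
            self.half_used = true;
            self.last as u32
        }

        fn generate(&mut self) {
            // Details omitted; a real implementation runs the ISAAC-64 rounds.
            self.index = 0;
        }
    }

In this sketch a matching next_u64 would simply clear half_used (discarding any leftover half) before reading a full u64; the PR may handle that differently.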

Before:

test gen_bytes_isaac    ... bench:   1,091,355 ns/iter (+/- 8,751) = 938 MB/s
test gen_u32_isaac      ... bench:       4,205 ns/iter (+/- 33) = 951 MB/s
test gen_u64_isaac      ... bench:       8,096 ns/iter (+/- 64) = 988 MB/s

test gen_bytes_isaac64  ... bench:     563,099 ns/iter (+/- 2,616) = 1818 MB/s
test gen_u32_isaac64    ... bench:       4,237 ns/iter (+/- 26) = 944 MB/s
test gen_u64_isaac64    ... bench:       4,245 ns/iter (+/- 29) = 1884 MB/s

test gen_bytes_chacha   ... bench:   2,551,234 ns/iter (+/- 20,947) = 401 MB/s
test gen_u32_chacha     ... bench:      11,476 ns/iter (+/- 819) = 348 MB/s
test gen_u64_chacha     ... bench:      21,432 ns/iter (+/- 320) = 373 MB/s

After:

test gen_bytes_isaac    ... bench:     713,938 ns/iter (+/- 4,371) = 1434 MB/s (+53%)
test gen_u32_isaac      ... bench:       4,078 ns/iter (+/- 19) = 980 MB/s (+3%)
test gen_u64_isaac      ... bench:       7,152 ns/iter (+/- 41) = 1118 MB/s (+13%)

test gen_bytes_isaac64  ... bench:     387,755 ns/iter (+/- 2,040) = 2640 MB/s (+45%)
test gen_u32_isaac64    ... bench:       3,106 ns/iter (+/- 23) = 1287 MB/s (+36%)
test gen_u64_isaac64    ... bench:       4,143 ns/iter (+/- 17) = 1930 MB/s (+2%)

test gen_bytes_chacha   ... bench:   2,567,562 ns/iter (+/- 65,782) = 398 MB/s
test gen_u32_chacha     ... bench:      11,469 ns/iter (+/- 141) = 348 MB/s
test gen_u64_chacha     ... bench:      21,466 ns/iter (+/- 544) = 372 MB/s

@pitdicker (Author)

It seems our comments have crossed. I also hand-calculated the results. Is the one in the commit ok for you?

@dhardy merged commit fe822c0 into dhardy:master on Nov 11, 2017
@pitdicker deleted the isaac_optim branch on November 11, 2017 14:47