Improve ISAAC performance (take 2) #45

Merged · 8 commits · Nov 11, 2017

Conversation

@pitdicker (Author)

This is a new attempt, replacing #36. It now fills the output buffer in reverse, so fill_bytes keeps working exactly as before.

This made it possible to move the code shared between IsaacRng and ChaChaRng into a separate rand_core::impl::fill_via_u*_chunks function. All the unsafe code is contained therein.
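
As a rough illustration of what such a shared helper could look like (a minimal sketch under my own assumptions, not the exact code from this PR; the real helper works on raw bytes for speed, while this version uses safe per-chunk copies):

    /// Fill `dest` with bytes taken from `src`, treating each `u32` as four
    /// little-endian bytes. Returns `(consumed_u32, filled_u8)`.
    fn fill_via_u32_chunks(src: &[u32], dest: &mut [u8]) -> (usize, usize) {
        // Never read more u32 values than `dest` can hold (rounding up for a
        // partial final chunk) and never write past the end of `dest`.
        let filled_u8 = core::cmp::min(src.len() * 4, dest.len());
        let consumed_u32 = (filled_u8 + 3) / 4;
        for (out, &word) in dest[..filled_u8].chunks_mut(4).zip(src) {
            let bytes = word.to_le_bytes();
            out.copy_from_slice(&bytes[..out.len()]);
        }
        (consumed_u32, filled_u8)
    }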

The trick with 32-bit indexing to make isaac64::next_u32 faster worked best with reading the results backwards. Now I had to think of something else, and it isn't pretty...

This also replaces `core::num::Wrapping` with a few `wrapping_add`'s.
There were about 30 conversions to and from `Wrapping`, while there are only
9 wrapping operations.

Because `fill_via_u32_chunks` expects a `[u32]`, converting away was just
easier.
Also uses a different solution to index without bounds checking, to recover a
tiny bit of the lost performance.
This does some crazy things with indexing, but is 45% faster. We are no longer
throwing away half of the results.
@pitdicker force-pushed the isaac_optim branch 2 times, most recently from e4cc6fa to 415ef6f on November 11, 2017 08:01

@pitdicker (Author)

Sorry for all the pushes.

I temporarily merged #44, and big-endian tests are green (not on first try though).

@dhardy (Owner) left a comment

Definitely looks like a win!

But that bug with the index makes me think it would be worth adding a mixed-size extraction test for Isaac64: u32, u64, u32.

self.rsl[self.cnt as usize % RAND_SIZE].0

let value = self.rsl[index];
self.index += 2;

@dhardy (Owner)

I think this is wrong: self.index should map 1→3 and 2→5 (same as 3→5). So you want self.index = 2 * index + 1 I think.

(I don't see why you insist on starting self.index at 1; it should work if it starts at 0 too, just +1 on access.)

@pitdicker (Author)

It seems we are thinking a bit differently about how this indexing stuff works. But I'm not sold on how it works now, so let's see if I can get my head around something else ;-)

// Works always, also on big-endian systems, but is slower.
let tmp = self.rsl[index >> 1];
value = tmp as u32;
self.rsl[index >> 1] = tmp.rotate_right(32);

@dhardy (Owner)

Why not tmp >> 32? We've already used the other bits and won't reuse them (I hope)!

// Index as if this is a u32 slice.
let rsl = unsafe { &*(&mut self.rsl as *mut [u64; RAND_SIZE]
as *mut [u32; RAND_SIZE * 2]) };
value = rsl[index];

@dhardy (Owner)

Surely for big-endian the index you want is index ^ 1?

@pitdicker (Author)

Good idea!
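
To make the index ^ 1 idea concrete, here is a self-contained sketch (my own names, not the PR's code): when a `[u64]` buffer is viewed as `[u32]`, the low half of `results[i]` sits at u32 index `2 * i` on little-endian targets but at `2 * i + 1` on big-endian, so flipping the lowest bit of the index on big-endian selects the same logical half.

    fn read_u32_half(results: &[u64; 8], index: usize) -> u32 {
        // On big-endian the two 32-bit halves of each u64 are stored swapped,
        // so flip the lowest index bit to keep "even index == low half".
        let index = if cfg!(target_endian = "big") { index ^ 1 } else { index };
        // View the u64 buffer as twice as many u32 values (alignment is fine:
        // u64 is at least as strictly aligned as u32).
        let view = unsafe { &*(results as *const [u64; 8] as *const [u32; 16]) };
        view[index]
    }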

// it optimises to a bitwise mask).
self.rsl[self.cnt as usize % RAND_SIZE].0

let value = self.rsl[index];

@dhardy (Owner)

Does the index % RAND_SIZE trick used elsewhere improve benchmarks?

It may not since the optimiser may already be able to infer the bounds on index.

@pitdicker (Author)

I did a lot of measuring of this code, and also glanced over the assembly.

We already have to check whether index has reached the end of the results buffer, to know whether a new block of random numbers has to be generated. If that comparison uses >= instead of ==, the bounds check later can be optimised out. It has to use a local variable for index though, because when self.isaac*() is not inlined the optimiser cannot see that the index gets reset.

So the bounds check is already optimised out, and the mask makes it slower (a couple of percent if I remember right).
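
A small sketch of the pattern described above (field and method names are my own, not the PR's): the >= comparison on a local copy of the index is what lets the compiler prove the later slice access is in bounds.

    const RAND_SIZE: usize = 256;

    struct BlockRng {
        results: [u32; RAND_SIZE],
        index: usize,
    }

    impl BlockRng {
        fn next_u32(&mut self) -> u32 {
            // Work on a local copy: if the check went through `self.index` and
            // the refill call below is not inlined, the optimiser could not
            // assume the index was reset, and the bounds check would come back.
            let mut index = self.index;
            if index >= RAND_SIZE {
                self.generate_block();
                index = 0;
            }
            // `index < RAND_SIZE` is now provable, so no bounds check is emitted.
            let value = self.results[index];
            self.index = index + 1;
            value
        }

        fn generate_block(&mut self) {
            // Details omitted; a real implementation refills `results` here.
            self.index = 0;
        }
    }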

use {Rng, CryptoRng, SeedFromRng, SeedableRng, Error};

#[allow(bad_style)]
type w32 = w<u32>;

@dhardy (Owner)

I'm not sure that removing this is actually a win... I mean, now you have .wrapping_add in a few places and can't just think "I know this algorithm uses wrapping arithmetic everywhere".

@pitdicker (Author)

In the commit message for ChaCha I added this note:

This also replaces core::num::Wrapping with a few wrapping_add's.
There were about 30 conversions to and from Wrapping, while there are only
9 wrapping operations.

Because fill_via_u32_chunks expects a [u32], converting away was just
easier.

I agree that "I know this algorithm uses wrapping arithmetic everywhere" is an advantage. Not all operations are available on wrapping types though, like rotate_*. You can maybe consider this to be a bug in the standard library.

While working with ISAAC, XorShift* and PCG, it happened too many times that I had to ask myself whether I was working with the wrapped or the normal type, and whether an operation was available.
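
For example (my own illustration): at least on stable Rust at the time, Wrapping<u32> provides the wrapping arithmetic operators, but an operation like rotate_left is only available on the plain integer, so mixing the two forces conversions back and forth.

    use core::num::Wrapping;

    // With `Wrapping` you still have to unwrap for the rotation.
    fn mix_wrapping(a: Wrapping<u32>, b: Wrapping<u32>) -> Wrapping<u32> {
        Wrapping((a + b).0.rotate_left(13))
    }

    // With plain integers a single `wrapping_add` does the job, no conversions.
    fn mix_plain(a: u32, b: u32) -> u32 {
        a.wrapping_add(b).rotate_left(13)
    }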

/// Implement `fill_bytes` by reading chunks from the output buffer of a block
/// based RNG.
///
/// The return values are `(consumed_u32, filled_u8)`.

@dhardy (Owner)

I'm not sure these names are sufficiently clear without explanation.

@pitdicker (Author)

I will try to write some better documentation.

/// The return values are `(consumed_u32, filled_u8)`.
///
/// Note that on big-endian systems values in the output buffer `src` are
/// mutated: they get converted to little-endian before copying.

@dhardy (Owner)

Actually I think only src[0..consumed_u32] is affected, not the whole of src.
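
As a usage sketch (my own names, building on the helper sketched earlier): a fill_bytes built on top of it loops over the destination and advances by the two returned counts, so only the part of the results buffer that is actually copied, src[0..consumed_u32], ever needs the endian conversion.

    // Assumes `results` is non-empty; a real RNG would generate a new block of
    // results instead of simply wrapping the index around.
    fn fill_bytes_via_chunks(results: &[u32], index: &mut usize, dest: &mut [u8]) {
        let mut read_len = 0;
        while read_len < dest.len() {
            if *index >= results.len() {
                *index = 0; // placeholder for "generate a new block"
            }
            let (consumed_u32, filled_u8) =
                fill_via_u32_chunks(&results[*index..], &mut dest[read_len..]);
            *index += consumed_u32;
            read_len += filled_u8;
        }
    }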

@dhardy (Owner) commented Nov 11, 2017

Can you try adding this test for Isaac 64? I calculated the numbers externally; hopefully I got it right.

    #[test]
    fn test_isaac64_true_mixed_values() {
        let seed: &[_] = &[1, 23, 456, 7890, 12345];
        let mut rng1 = Isaac64Rng::from_seed(seed);
        // Sequence is mostly to check correct interaction between 32- and
        // 64-bit value retrieval.
        assert_eq!(rng1.next_u64(), 547121783600835980);
        assert_eq!(rng1.next_u32(), 1058730652);
        assert_eq!(rng1.next_u64(), 3657128015525553718);
        assert_eq!(rng1.next_u64(), 11565188192941196660);
        assert_eq!(rng1.next_u32(), 288449107);
        assert_eq!(rng1.next_u32(), 646103879);
        assert_eq!(rng1.next_u64(), 18020149022502685743);
        assert_eq!(rng1.next_u32(), 3252674613);
        assert_eq!(rng1.next_u64(), 4469761996653280935);
    }

@pitdicker (Author)

It took some time to see the bug... Sharp :-)

I have removed all the ugly indexing stuff, and added a half_used bool. It is a little slower for next_u32, and a little faster for next_u64 and fill_bytes.

It is curious to see how adding just one extra operation changes the benchmarks by 5~10%. Using an Option instead of indexing tricks is almost as slow as just truncating next_u64.
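
A minimal sketch of the half_used idea (the field names and the cached `last` value are my own simplification, not necessarily the final code): every u64 pulled from the results buffer serves two next_u32 calls.

    const RAND_SIZE: usize = 256;

    struct Isaac64Sketch {
        results: [u64; RAND_SIZE],
        index: usize,
        half_used: bool, // true if the upper half of `last` is still unused
        last: u64,
    }

    impl Isaac64Sketch {
        fn next_u32(&mut self) -> u32 {
            if self.half_used {
                // Hand out the other half of the previously read u64.
                self.half_used = false;
                return (self.last >> 32) as u32;
            }
            if self.index >= RAND_SIZE {
                self.generate();
            }
            self.last = self.results[self.index];
            self.index += 1;
            self.half_used = true;
            self.last as u32
        }

        fn generate(&mut self) {
            // Details omitted; a real implementation runs the ISAAC-64 rounds.
            self.index = 0;
        }
    }

In this sketch a matching next_u64 would simply clear half_used (discarding any leftover half) before reading a full u64; the PR may handle that differently.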

Before:

test gen_bytes_isaac    ... bench:   1,091,355 ns/iter (+/- 8,751) = 938 MB/s
test gen_u32_isaac      ... bench:       4,205 ns/iter (+/- 33) = 951 MB/s
test gen_u64_isaac      ... bench:       8,096 ns/iter (+/- 64) = 988 MB/s

test gen_bytes_isaac64  ... bench:     563,099 ns/iter (+/- 2,616) = 1818 MB/s
test gen_u32_isaac64    ... bench:       4,237 ns/iter (+/- 26) = 944 MB/s
test gen_u64_isaac64    ... bench:       4,245 ns/iter (+/- 29) = 1884 MB/s

test gen_bytes_chacha   ... bench:   2,551,234 ns/iter (+/- 20,947) = 401 MB/s
test gen_u32_chacha     ... bench:      11,476 ns/iter (+/- 819) = 348 MB/s
test gen_u64_chacha     ... bench:      21,432 ns/iter (+/- 320) = 373 MB/s

After:

test gen_bytes_isaac    ... bench:     713,938 ns/iter (+/- 4,371) = 1434 MB/s (+53%)
test gen_u32_isaac      ... bench:       4,078 ns/iter (+/- 19) = 980 MB/s (+3%)
test gen_u64_isaac      ... bench:       7,152 ns/iter (+/- 41) = 1118 MB/s (+13%)

test gen_bytes_isaac64  ... bench:     387,755 ns/iter (+/- 2,040) = 2640 MB/s (+45%)
test gen_u32_isaac64    ... bench:       3,106 ns/iter (+/- 23) = 1287 MB/s (+36%)
test gen_u64_isaac64    ... bench:       4,143 ns/iter (+/- 17) = 1930 MB/s (+2%)

test gen_bytes_chacha   ... bench:   2,567,562 ns/iter (+/- 65,782) = 398 MB/s
test gen_u32_chacha     ... bench:      11,469 ns/iter (+/- 141) = 348 MB/s
test gen_u64_chacha     ... bench:      21,466 ns/iter (+/- 544) = 372 MB/s

@pitdicker (Author)

It seems our comments have crossed. I also hand-calculated the results. Is the one in the commit ok for you?

@dhardy merged commit fe822c0 into dhardy:master on Nov 11, 2017
@pitdicker deleted the isaac_optim branch on November 11, 2017 14:47