More efficient implementation for random() in a loop ? #8

debrouxl · 2022-02-28T15:39:18Z

debrouxl
Feb 28, 2022
Maintainer

In tests/mov_inv_random.c::test_mov_inv_random(), tests/test_helper.c::random() is being called in a loop, and itself calls tests/test_helper.c::prsg(), which contains a software implementation of a Fibonacci LFSR: on every CPU, a memory read + multiple shifts + a memory write. That's 20+ instructions (more on 32-bit CPUs), with a control flow change, for generating a random value, dwarfing the memory write caused by the test's loop.

I think that RNG can be made faster at low positive or negative size cost:

in prsg(), a Galois LFSR is usually both faster and smaller in software than a Fibonacci LFSR; the Fibonacci form is the more efficient one for a hardware implementation;
in test_mov_inv_random(), the LFSR could be moved inline inside the loop, which would allow keeping its state in a CPU register and directly taking advantage of the carry flag (for a Galois LFSR, at least).
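To make the comparison concrete, here is a minimal sketch of the two LFSR forms in C. It uses the common 16-bit textbook polynomial x^16 + x^14 + x^13 + x^11 + 1 rather than whatever prsg() actually implements; the function names are made up for illustration. The Fibonacci step needs several shifts and XORs to compute one feedback bit, while the Galois step is one shift plus a conditional XOR.

```c
#include <stdint.h>

/* Fibonacci form: XOR the tapped bits together, then shift the result in
 * at the top. Taps 16, 14, 13, 11 (illustrative 16-bit polynomial). */
static uint16_t fibonacci_step(uint16_t s)
{
    uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (bit << 15));
}

/* Galois form of the same polynomial: shift once, then XOR in the tap
 * mask only if the bit that was shifted out was 1. */
static uint16_t galois_step(uint16_t s)
{
    uint16_t lsb = s & 1u;
    s >>= 1;
    if (lsb)
        s ^= 0xB400u;
    return s;
}

/* Hypothetical helper: count steps until the state returns to the seed. */
static uint32_t lfsr_period(uint16_t (*step)(uint16_t), uint16_t seed)
{
    uint16_t s = seed;
    uint32_t n = 0;
    do {
        s = step(s);
        n++;
    } while (s != seed);
    return n;
}
```

Both forms cycle through all 65535 non-zero states from any non-zero seed, so the choice between them is about instruction count, not sequence length.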

WDYT ?

martinwhitaker · 2022-02-28T16:51:30Z

martinwhitaker
Feb 28, 2022
Maintainer

This was on my list of things to look at, because I noticed the increase in fan speed when the random number sequence test is running.

The current PRSG implementation is in fact very efficient if you care about correlation between successive numbers, as it generates 32 new bits at a time, not just one. But for this application, that's probably not necessary.


martinwhitaker · 2022-02-28T21:58:26Z

martinwhitaker
Feb 28, 2022
Maintainer

This looks like a good alternative: https://en.wikipedia.org/wiki/Xorshift
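For reference, a minimal sketch of the 32-bit xorshift variant described on that page, using the (13, 17, 5) shift triple. This is only an illustration of the algorithm; the function name is an assumption, and the word size and shift constants used in the test code may differ.

```c
#include <stdint.h>

/* One xorshift32 step: three shift-and-XOR operations on the whole word.
 * The state must be non-zero, because zero maps to zero. */
static uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}
```

Starting from a state of 1, the first value produced is 270369 (0x42021).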


debrouxl · 2022-02-28T22:43:03Z

debrouxl
Feb 28, 2022
Maintainer Author

That's indeed fewer shifts than a 64-bit Fibonacci LFSR, but more than a Galois LFSR :)
Nearly 20 years ago, on an M68000-based platform, I used a 3-instruction implementation of the Galois LFSR. It was part of a significant speed boost over the previous C implementation, which didn't take advantage of the carry flag: asm volatile("lsr.w #1,%0; bcc.s 0f; eor.w %1,%0; 0: " : "+d"(seq) : "d"(mask) : "cc"); with mask initialized once to the parallel XOR tap value, e.g. 0xB400 for a 16-bit LFSR and 0xA3000000 for a 32-bit LFSR.
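A portable C counterpart of that shift / branch-on-carry / XOR sequence might look like the sketch below, here for the 32-bit tap value. The function and macro names are hypothetical, and the branch is replaced by a mask so that a compiler is free to emit either a conditional jump or branchless code.

```c
#include <stdint.h>

#define GALOIS32_TAPS 0xA3000000u  /* 32-bit tap mask quoted above */

/* One Galois LFSR step on a 32-bit state. */
static uint32_t galois32_step(uint32_t s)
{
    uint32_t carry = s & 1u;            /* bit the shift pushes out (the carry flag in the asm) */
    s >>= 1;
    s ^= GALOIS32_TAPS & (0u - carry);  /* all-ones mask when carry is set, zero otherwise */
    return s;
}
```

As with any LFSR, a state of zero is a fixed point, so the seed must be non-zero.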

Xorshift may mix the bits better; the choice depends on whether we need that, or a higher duty cycle for the memory writes we're interested in, and thereby faster turnaround time for test_mov_inv_random().


martinwhitaker · 2022-02-28T22:54:19Z

martinwhitaker
Feb 28, 2022
Maintainer

Agreed, if we don't care that the random sequence is actually a walking bit pattern with only one random new bit per word, the Galois LFSR is fine.
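The "walking bit pattern" point can be checked with a small sketch, assuming the 16-bit Galois LFSR with tap mask 0xB400 (all names here are hypothetical). Each new word is the previous word shifted right by one, with at most the tap positions toggled, so successive words are strongly correlated.

```c
#include <stdint.h>

/* 16-bit Galois LFSR step with tap mask 0xB400. */
static uint16_t galois16_step(uint16_t s)
{
    uint16_t lsb = s & 1u;
    s >>= 1;
    if (lsb)
        s ^= 0xB400u;
    return s;
}

/* Count the set bits in a 16-bit word. */
static unsigned popcount16(uint16_t v)
{
    unsigned n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Largest number of bit positions in which a new word differs from the
 * previous word shifted right by one, over `steps` steps. */
static unsigned max_bits_changed(uint16_t seed, unsigned steps)
{
    uint16_t s = seed;
    unsigned worst = 0;
    for (unsigned i = 0; i < steps; i++) {
        uint16_t next = galois16_step(s);
        unsigned d = popcount16((uint16_t)(next ^ (s >> 1)));
        if (d > worst)
            worst = d;
        s = next;
    }
    return worst;
}
```

The difference never exceeds the number of set bits in the tap mask (4 for 0xB400); the rest of each word is just the previous word moved along by one position.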


martinwhitaker · 2022-03-05T20:15:02Z

martinwhitaker
Mar 5, 2022
Maintainer

I've rewritten the code to in-line the random number generation and keep its state in a local variable. Using the xorshift algorithm roughly halved the time taken for test 8 on my main machine. Changing that to a Galois LFSR algorithm made no noticeable further improvement, so I've left it with the xorshift algorithm. Feel free to improve on that!
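The general shape of that change can be sketched as follows. This is not the actual test code: the function names, the 32-bit word size, the buffer handling and the seed are all assumptions made for illustration. The point is that the generator state is a local variable updated inside the loop, so no function call or memory round trip is needed per word, and the verify pass regenerates the same sequence from the same seed.

```c
#include <stddef.h>
#include <stdint.h>

/* Write pass: fill the buffer with a xorshift32 sequence. */
static void fill_random(uint32_t *buf, size_t n, uint32_t seed)
{
    uint32_t x = seed;  /* state lives in a local, so it can stay in a register */
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buf[i] = x;
    }
}

/* Verify pass: replay the sequence and count words that do not match. */
static size_t count_mismatches(const uint32_t *buf, size_t n, uint32_t seed)
{
    uint32_t x = seed;
    size_t errors = 0;
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        if (buf[i] != x)
            errors++;
    }
    return errors;
}

/* Hypothetical demo: fill a small buffer, optionally flip one bit to
 * simulate a memory error, and return the number of mismatches found. */
static size_t demo_roundtrip(int corrupt)
{
    static uint32_t buf[1024];
    fill_random(buf, 1024, 0x12345678u);
    if (corrupt)
        buf[100] ^= 1u;
    return count_mismatches(buf, 1024, 0x12345678u);
}
```

A real memory test would typically access the buffer through volatile pointers so that the compiler cannot elide the reads and writes; that detail is left out here to keep the sketch short.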


debrouxl · 2022-03-11T22:03:31Z

debrouxl
Mar 11, 2022
Maintainer Author

That's good :)
Memory tests could probably be made faster by using more hand-optimized assembly; there's already some in tests/block_move.c and tests/mov_inv_fixed.c. However, while I can mostly read x86 assembly, and I have used inline ASM with C operands a fair bit on another ISA, I have basically no experience writing x86 assembly.
Temporarily relying on the compiler to perform the initial code generation work, an approach I used a couple years ago to bootstrap the conversion of a relatively small program from C to pure 68000 ASM, only goes so far, though I guess I could just try it :)


martinwhitaker · 2022-03-11T22:49:56Z

martinwhitaker
Mar 11, 2022
Maintainer

I did test it, because there used to be hand-written assembler for all the tests. The compiler-generated code was just as fast. That saved having to write 64-bit variants of all the assembler code.

I could always beat a compiler when writing 68000 assembler. With a super-scalar, out-of-order, speculatively executing processor, it's a lot harder.

