Document bias and entropy-exhausted behavior #184

SimonSapin · 2024-07-08T10:04:49Z

choose and choose_iter incorrectly claimed to return Error::NotEnoughData when they in fact default to the first choice. This also documents that default in various other APIs.

Additionally, int_in_range (and APIs that rely on it) has bias for non-power-of-two ranges. u.int_in_range(0..=170) for example will consume one byte of entropy, and take its value modulo 171 (the size of the range) to generate the returned integer. As a result, values in 0..=84 (the first ~half of the range) are twice as likely to get chosen as the rest (assuming the underlying bytes are uniform). In general, the result distribution is only uniform if the range size is a power of two (where the modulo just masks some bits).
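The 0..=170 example can be checked by brute force. The sketch below is a simplified model rather than the crate's code: `modulo_counts` is a hypothetical helper that assumes a single uniformly distributed byte reduced with `%`, as described above.

```rust
/// Simplified model (not the crate's implementation): reduce one uniform
/// byte into `0..range_size` with a modulo, and count how many of the 256
/// byte values land on each result.
fn modulo_counts(range_size: usize) -> Vec<u32> {
    assert!((1..=256).contains(&range_size));
    let mut counts = vec![0u32; range_size];
    for byte in 0..=255usize {
        counts[byte % range_size] += 1;
    }
    counts
}

fn main() {
    // Range 0..=170 has 171 values, which is not a power of two.
    let counts = modulo_counts(171);
    let twice = counts.iter().filter(|&&c| c == 2).count();
    let once = counts.iter().filter(|&&c| c == 1).count();
    println!("{twice} results come from two byte values, {once} from one");

    // A power-of-two range size spreads the 256 byte values evenly.
    let even = modulo_counts(128);
    println!("uniform for 128: {}", even.iter().all(|&c| c == even[0]));
}
```

In this model, results 0..=84 are each produced by two byte values and results 85..=170 by one, which matches the 2:1 skew described above.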

It would be accurate to document that return values are biased towards lower values when the range size is not a power of two, but do we want this much detail in the documented “contract” of this method?

Similarly, I just called ratio “approximate”. u.ratio(5, 7) returns true for 184 out of 256 possible underlying byte values, ~0.6% too often. In the worst case, u.ratio(84, 170) returns true ~33% too often.
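Both figures can be reproduced with a small model. This sketch assumes that ratio(numerator, denominator) draws x in 1..=denominator from a single byte via modulo and returns x <= numerator; that is an assumption chosen to match the numbers above, not a copy of the crate's code, and `true_count` and `relative_excess` are hypothetical helper names.

```rust
/// Number of byte values (out of 256) for which the modeled `ratio`
/// returns true. Assumed model: x = byte % denominator + 1, result x <= numerator.
fn true_count(numerator: u32, denominator: u32) -> u32 {
    assert!(0 < numerator && numerator <= denominator && denominator <= 256);
    (0..256u32)
        .filter(|&byte| {
            let x = byte % denominator + 1;
            x <= numerator
        })
        .count() as u32
}

/// How much more often than the ideal probability the model returns true,
/// as a fraction of the ideal (0.01 means 1% too often).
fn relative_excess(numerator: u32, denominator: u32) -> f64 {
    let observed = f64::from(true_count(numerator, denominator)) / 256.0;
    let ideal = f64::from(numerator) / f64::from(denominator);
    observed / ideal - 1.0
}

fn main() {
    for (n, d) in [(5, 7), (84, 170)] {
        println!(
            "ratio({n}, {d}): true for {} of 256 bytes, {:.1}% too often",
            true_count(n, d),
            relative_excess(n, d) * 100.0
        );
    }
}
```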

Notably, #[derive(Arbitrary)] chooses enum variants not with choose_index (although that seems most appropriate from reading Unstructured docs) but by always consuming 4 bytes of entropy:

```rust
// Use a multiply + shift to generate a ranged random number
// with slight bias. For details, see:
// https://lemire.me/blog/2016/06/30/fast-random-shuffling
Ok(match (u64::from(<u32 as arbitrary::Arbitrary>::arbitrary(u)?) * #count) >> 32 {
    #(#variants,)*
    _ => unreachable!()
})
```

int_in_range tries to minimize entropy consumption based on the range size, but that contributes to it having more bias than multiply + shift. Is this a real trade-off that justifies having two methods?
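To put rough numbers on the difference in bias, the sketch below compares the two schemes for a range of 171 values. It is a model under stated assumptions: `modulo_count` assumes a single uniform byte reduced with `%`, `multiply_shift_count` assumes a uniform u32 reduced with `(x * n) >> 32` as in the derive snippet, and all three function names are hypothetical. The multiply + shift counts are computed arithmetically rather than by looping over all 2^32 inputs.

```rust
/// How many of the 2^32 possible u32 inputs map to `k` under `(x * n) >> 32`.
/// Those inputs are exactly x in [ceil(k * 2^32 / n), ceil((k + 1) * 2^32 / n)).
fn multiply_shift_count(k: u64, n: u64) -> u64 {
    let ceil_div = |a: u64, b: u64| (a + b - 1) / b;
    ceil_div((k + 1) << 32, n) - ceil_div(k << 32, n)
}

/// How many of the 256 possible byte values map to `k` under `byte % n`.
fn modulo_count(k: u64, n: u64) -> u64 {
    (0..256u64).filter(|&byte| byte % n == k).count() as u64
}

/// Ratio between the most and least likely outputs (1.0 means uniform).
fn max_min_ratio(n: u64, count: impl Fn(u64, u64) -> u64) -> f64 {
    let counts: Vec<u64> = (0..n).map(|k| count(k, n)).collect();
    let max = *counts.iter().max().unwrap();
    let min = *counts.iter().min().unwrap();
    max as f64 / min as f64
}

fn main() {
    let n = 171;
    println!("one byte + modulo:        {:.9}", max_min_ratio(n, modulo_count));
    println!("four bytes + multiply/shift: {:.9}", max_min_ratio(n, multiply_shift_count));
}
```

In this model the modulo scheme makes the most likely result twice as likely as the least likely one, while with multiply + shift the per-result counts differ by at most one input out of roughly 25 million. The price is four bytes of entropy per draw instead of one.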

fitzgen

Thanks!

In general, PRs to improve the accuracy of output distributions when given actually-random input bytes are appreciated (modulo performance regressions). But also note that because the input is generally provided by a grey-box fuzzer like libFuzzer, it is not made of actually-random bytes, so the output distribution is going to be skewed either way. That is roughly the intention, since we want the grey-box fuzzer's knowledge of code coverage (or whatever) to bias the results towards things that the fuzzer deems more interesting.

SimonSapin · 2024-07-09T11:19:41Z

Do you have a sense of whether int_in_range minimizing the number of bytes consumed is helpful, or how important it is, compared to multiply + shift, which always consumes 4 bytes of entropy?

SimonSapin · 2024-08-12T10:33:25Z

The clippy::legacy_numeric_constants failure on CI seems unrelated to this PR.

SimonSapin changed the title ~~Document bias and behavior when running out of entropy~~ Document bias and entropy-exhausted behavior Jul 8, 2024

fitzgen approved these changes Jul 8, 2024

Manishearth approved these changes Jul 9, 2024
