Sequence sampling: `seq`, `WeightedChoice` #82
I have not looked all that closely at what the … The new … Is there really not a better solution? It should be possible to adapt RNGs like PCG (or anything LCG- or MCG-based) to work on a certain number of bits. What if we use an RNG with just enough bits to be able to generate all possible values, and reject the ones outside the range? Then all values are guaranteed to occur only once, and no vector with many comparisons or a more complex hashmap is necessary. And no dependency on … This has the disadvantage of not working with a user-supplied RNG. And maybe I am missing something big? cc @vitiral, @burdges |
😄 Realized it just after commenting. If we have to pick values from, say, 0..800 we need 10 bits. That would also mean the seed of the RNG is 10 bits. Then the RNG can never generate more than 2^10 different orderings, 1024. While there are 800! possible orderings, which is basically infinite in my book. So that is an argument to not use this method, but one that can utilize an RNG with a larger period. With an RNG with a period of 2^64 or 2^128 it is at least impossible to exhaustively list all possible orderings. |
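The rejection idea discussed above can be sketched with a full-period LCG over the next power of two ("cycle walking"): every state appears exactly once per period, and out-of-range states are skipped. This is only an illustrative sketch, not a proposal for `rand`; the constants are arbitrary (chosen to satisfy the Hull–Dobell full-period conditions), and, as argued above, the tiny period severely limits how many orderings it can produce.

```rust
/// Visit every value in `0..range` exactly once, in a scrambled order, by
/// stepping a full-period LCG over the next power of two and rejecting
/// states that fall outside the range ("cycle walking").
fn scrambled_range(range: u32) -> Vec<u32> {
    let m = u64::from(range.next_power_of_two());
    // Hull–Dobell conditions for a power-of-two modulus: increment odd,
    // multiplier ≡ 1 (mod 4). These constants are arbitrary.
    let (a, c) = (173u64, 711u64);
    let mut x = 0u64;
    let mut out = Vec::with_capacity(range as usize);
    for _ in 0..m {
        x = (a * x + c) % m;
        if (x as u32) < range {
            out.push(x as u32); // reject states >= range
        }
    }
    out
}
```

Because the LCG has full period, the `m` successive states form a permutation of `0..m`, so after rejection each value in `0..range` is emitted exactly once.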
That's some out-of-the-box thinking! I guess your custom sampling RNG can be seeded from the provided RNG, so your idea could work, if there is a way to reliably generate a high-quality random-hop RNG visiting all nodes of the given range exactly once in random-like order. That sounds hard to me! Mostly though I made this issue in the hope that someone can design a nice API tying all the above things together. |
But that is much more difficult! 😄. I see … Just today I found out about this nice problem of picking a random point on a sphere: http://mathworld.wolfram.com/SpherePointPicking.html. This is the same problem as picking a direction in 3D, which comes up in games. Would it make sense to add a module collecting stuff like this? They may use the … |
That was my thought too; removing it is definitely an option. |
A PRNG can only generate unique values in a range with high probability if the range is much larger than the number of values requested. I noticed …
Isn't there a use case for returning the second half of a Fisher–Yates shuffle applied to permute the first half? I think it gives you a faster combination in some situations: rust-random#169 (comment) In principle, we could provide …
It becomes tricky with … I suppose … |
After writing this, there was seemingly a mistake in … I'm still vaguely uncomfortable doing all this only for … |
@burdges I'm not really sure about the rest of your issue, but generic sample_indices is covered in rust-random#202 |
We might as well provide the ref and mut iterators as well:
|
We have roughly this functionality:
Some of this functionality is currently in … Rough proposal: add a trait something like the following, with impls for both iterators and slices (optimised to each appropriately): pub trait SampleSeq<T> {
type Item;
// as the current seq::sample:
fn sample<R>(self, n: usize, rng: &mut R) -> Result<Vec<T>, Vec<T>>;
// as above, but with key extraction per element:
fn sample_weighted<R, F>(self, weight_fn: F, n: usize, rng: &mut R) -> Result<Vec<T>, Vec<T>>
where F: Fn(&Self::Item) -> f64;
// sample but with n == 1:
fn choose<R>(self, rng: &mut R);
}
pub fn sample_indices<R>(rng: &mut R, length: usize, n: usize) -> Vec<usize>; |
A trait that can be implemented for both slices and iterators seems like a nice idea. But I don't follow it completely. I am missing … |
You might be right about Choose on iterators (maybe it is strange): rust-random#198 I don't know; this needs more thought; I was just mentioning an idea to explore later. |
Slice algorithms
Which nice algorithms do we have? I have excluded …
|
Iterator algorithms
Is it possible to implement methods for iterators similar to those for slices?
|
An interesting alternative to Robert Floyd's algorithm is Jeffrey Vitter's Efficient Algorithm for Sequential Random Sampling (method D). It works by going over a list once, picking an element, and then skipping multiple elements with just the right probability. This makes its running time linear, as opposed to Robert Floyd's algorithm, which has to do a quadratic number of comparisons. So there will be a tipping point where one algorithm is faster than the other. What makes it interesting/unique API-wise is that it can be used as an iterator over a slice. (Not as an adapter to other iterators though, as it requires the number of elements to be known from the start.) Could be part of the methods on a slice as … |
How does Method D differ from Jeffrey Vitter's Algorithm R used in |
Algorithm Z does not need to know the number of elements it can sample from. It will first fill the destination slice with many items, and then at a decreasing rate start replacing them with elements from further on in the source slice. Method D needs to know the number of elements it can sample from. It will pick a random interval and copy a first element. Then based on the number of remaining elements it will calculate an interval with the appropriate probability to skip, and then sample a second element. Then a third, until exactly the required number of elements is reached, and roughly the end of the source slice. So Algorithm Z will do many copies (some logarithm of … |
Oh, I am sorry, I didn't read your question closely enough and confused some things. Algorithm Z is supposed to be an improved Algorithm R, even said to be optimal. But thank you for naming it; I was still searching for which one was used there 👍. R has to calculate a random value in a range for every element, Z only for some logarithm of them. |
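For reference, the baseline Algorithm R that Z improves on fits in a few lines. A minimal sketch, with a toy LCG standing in for a real `Rng` (illustration only; function names are made up):

```rust
// Minimal LCG standing in for a real RNG; use a proper `Rng` in practice.
struct Lcg(u64);

impl Lcg {
    fn next_below(&mut self, n: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) as usize % n
    }
}

/// Waterman's Algorithm R: fill a reservoir with the first `k` items, then
/// let the i-th item (0-based) replace a random reservoir slot with
/// probability k / (i + 1). One RNG call per element, unlike Z, which
/// computes how many elements to skip.
fn reservoir_sample<T, I: IntoIterator<Item = T>>(iter: I, k: usize, rng: &mut Lcg) -> Vec<T> {
    let mut reservoir = Vec::with_capacity(k);
    for (i, item) in iter.into_iter().enumerate() {
        if i < k {
            reservoir.push(item);
        } else {
            let j = rng.next_below(i + 1);
            if j < k {
                reservoir[j] = item; // kept with probability k / (i + 1)
            }
        }
    }
    reservoir
}
```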
Some random collected thoughts: I now think implementing … I don't think returning a result type for … If we reverse …: fn choose_multiple_from<R: Rng + ?Sized>(&self,
rng: &mut R,
source: &[Self::Item],
amount: usize); What should … Is … It seems to me reservoir sampling from an iterator can be parallelized using Rayon. Now I don't know much about Rayon. Every subdivision needs to sample to its own reservoir of … |
pub trait SliceRandom {
type Item;
/// Returns a reference to one random element of the slice, or `None` if the
/// slice is empty.
fn choose<R>(&self, rng: &mut R) -> Option<&Self::Item>
where R: Rng + ?Sized;
/// Returns a mutable reference to one random element of the slice, or
/// `None` if the slice is empty.
fn choose_mut<R>(&mut self, rng: &mut R) -> Option<&mut Self::Item>
where R: Rng + ?Sized;
/// Shuffle a slice in place.
///
/// This applies Durstenfeld's algorithm for the [Fisher–Yates shuffle](
/// https://wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle), which produces
/// an unbiased permutation.
fn shuffle<R>(&mut self, rng: &mut R) where R: Rng + ?Sized;
/// Shuffle a slice in place, but exit early.
///
/// Returns two mutable slices from the source slice. The first contains
/// `amount` elements randomly permuted. The second has the remaining
/// elements that are not fully shuffled.
///
/// This is an efficient method to select `amount` elements at random from
/// the slice, provided the slice may be mutated.
///
/// If you only need to choose elements randomly and `amount > self.len()/2`
/// then you may improve performance by taking
/// `amount = values.len() - amount` and using only the second slice.
///
/// If `amount` is greater than the number of elements in the slice, this
/// will perform a full shuffle.
fn partial_shuffle<R>(&mut self, rng: &mut R, amount: usize)
-> (&mut [Self::Item], &mut [Self::Item]) where R: Rng + ?Sized;
/// Produces an iterator that chooses `amount` elements from the slice at
/// random without repeating any.
///
/// TODO: This may internally use Jeffrey Vitter's algorithm D,
/// Robert Floyd's algorithm, or some other method.
///
/// TODO: Do we want to guarantee the chosen elements are sequential?
fn choose_multiple<R>(&self, rng: &mut R, amount: usize)
-> ReservoirSampler<R> where R: Rng + ?Sized;
/// Fill a mutable slice with elements chosen at random from a finite
/// iterator.
///
/// Uses Alan Waterman's algorithm R, or Jeffrey Vitter's algorithm Z.
///
/// Note: only algorithm R is available if the `std` feature is not enabled,
/// because algorithm Z needs `f64::log` and `f64::exp`.
///
/// TODO: What to do if the iterator does not produce enough elements to
/// fill the slice?
fn choose_multiple_from<R, I>(&mut self, rng: &mut R, iterable: I)
where R: Rng + ?Sized, I: IntoIterator<Item=Self::Item>;
}
impl<T> SliceRandom for [T] {
type Item = T;
/* ... */
}
impl<R> Iterator for ReservoirSampler<R> where R: Rng + ?Sized {
/* ... */
}
pub trait IteratorRandom: Iterator {
/// Choose one element at random from the iterator.
///
/// FIXME: is it ever possible for an iterator to return no elements?
fn choose<R>(self, rng: &mut R) -> Option<Self::Item>
where R: Rng + ?Sized
{ /* ... */ }
/// Collects `amount` values at random from the iterator into a vector.
///
/// Uses Alan Waterman's algorithm R, or Jeffrey Vitter's algorithm Z.
///
/// Note: only algorithm R is available if the `std` feature is not enabled,
/// because algorithm Z needs `f64::log` and `f64::exp`.
#[cfg(any(feature="std", feature="alloc"))]
fn choose_multiple<R>(self, rng: &mut R, amount: usize) -> Vec<Self::Item>
where R: Rng + ?Sized
{ /* ... */ }
}
/// Choose `amount` elements from `0..range` using [Robert Floyd's algorithm](
/// http://fermatslibrary.com/s/a-sample-of-brilliance).
///
/// It only makes `amount` calls to the random number generator instead of
/// `range` calls. It does `amount^2 / 2` extra comparisons however.
pub fn choose_from_range<R>(rng: &mut R, range: usize, amount: usize) -> Vec<usize>
where R: Rng + ?Sized;

As it turns out, the iterator methods don't add much, but they may be ergonomic. With … @dhardy What do you think of methods like this? I think they are a little less easy to use than the current methods in … (I am still working on weighted sampling, but that is an area with many different choices) |
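As a concrete illustration of the `choose_from_range` signature above, Floyd's algorithm can be sketched as follows. This is only a sketch under assumptions: the toy `Lcg` stands in for a real `Rng`, and `amount <= range` is taken for granted rather than checked.

```rust
use std::collections::HashSet;

// Minimal LCG standing in for a real RNG; use a proper `Rng` in practice.
struct Lcg(u64);

impl Lcg {
    fn next_below(&mut self, n: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) as usize % n
    }
}

/// Robert Floyd's sampling: `amount` distinct values from `0..range` with
/// exactly `amount` RNG calls. Assumes `amount <= range`.
fn choose_from_range(rng: &mut Lcg, range: usize, amount: usize) -> Vec<usize> {
    let mut chosen = HashSet::new();
    let mut out = Vec::with_capacity(amount);
    for j in (range - amount)..range {
        let t = rng.next_below(j + 1); // uniform in 0..=j
        // If t was already chosen, pick j instead: j is new this round,
        // and this substitution keeps every subset equally likely.
        let v = if chosen.insert(t) { t } else { chosen.insert(j); j };
        out.push(v);
    }
    out
}
```

The `HashSet` is where the "amount^2 / 2 extra comparisons" of the naive vector-scan version get traded for hashing.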
Uh, I was trying to ignore this until after 0.5; can it wait at least until we decide exactly what needs to be done before the 0.5 release? |
Weighted random sampling
For uniform sampling we had two primary methods: sampling with replacement, and sampling without replacement. With replacement could be implemented most naturally with a method that just picks one result, … For weighted sampling we have three methods:
How to think about the difference between 'with replacement' and 'without replacement'?
The linked paper has two examples to illustrate the difference between 'defined probabilities' and 'defined weights'. One way that helped me to think about it:
From the paper:
The algorithm for reservoir sampling with defined probabilities is the method by M. T. Chao from 1982, A general purpose unequal probability sampling plan. Further, there is one constant across all weighted sampling methods: they all have to iterate over the entire slice(/iterator) at least once, because the sum of all the probabilities must be known. For weighted sampling with replacement we may use simpler methods. For example, it is possible to do a one-time set-up that calculates the sum of all weights. Then generate a random number in that range, and iterate over all weights until the running sum is greater than the random value. (Somewhat similar to the simple … ) This just provides an overview of the involved choices. The real choice of algorithms, how to 'bind' weights to elements, and what API to present is left to explore. And something else, that seems somewhat related to weighted sampling: |
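The linear-scan method described above can be sketched with the uniform draw passed in as an argument, which keeps it independent of any particular RNG (the function name is made up for illustration):

```rust
/// Pick an index with probability proportional to `weights[i]`, given a
/// uniform draw `r` in `[0, total_weight)`: walk the weights, subtracting
/// each one until the draw falls inside the current weight.
fn pick_weighted(weights: &[f64], mut r: f64) -> usize {
    for (i, &w) in weights.iter().enumerate() {
        if r < w {
            return i;
        }
        r -= w;
    }
    weights.len() - 1 // guard against floating-point drift at the boundary
}
```

For many draws over the same weights, a precomputed cumulative table with binary search (or the alias method) amortises better than this O(n) scan.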
The more you read, the more improvements to the algorithms you find. But I think I have now collected the current 'best' for uniform sampling. People seem very creative when coming up with algorithm names... It began with Knuth in The Art of Computer Programming vol. II, where he used the names algorithm S for a selection sampling technique, and algorithm R for reservoir sampling.

Selection sampling
Robert Floyd improved upon it with algorithm F, and also came up with a variant that guarantees the collected samples are permuted, as algorithm P. (A Sample of Brilliance, 1987) Algorithm S is usually faster than F, because it is not necessary to recompute the acceptable zone for every changing range. But S performs very badly when the number of items to sample gets close to the number of available elements. So it is probably best to switch between both: S when the number of items to sample is less than something like 25%, and P otherwise.

Reservoir sampling
Jeffrey Vitter improved upon it with algorithms X, Y and Z, which need fewer values from the RNG. (Random Sampling with a Reservoir, 1985) Kim-Hung Li introduced algorithms K, L, and M, which are similar to algorithm Z but have different trade-offs. Of the four, L seems to be the fastest choice for most situations. (Reservoir-Sampling Algorithms of Time Complexity O(n(1 + log(N/n))), 1994)

Sequential random sampling
Jeffrey Vitter introduced algorithms A and D (Edit: and B and C in-between) to efficiently sample from a known number of elements sequentially, without needing extra memory. (Faster Methods for Random Sampling, 1984 and Efficient Algorithm for Sequential Random Sampling, 1987) K. Aiyappan Nair improved upon it with algorithm E. (An Improved Algorithm for Ordered Sequential Random Sampling, 1990) As I understand it, algorithm D falls back on A if that is faster, and E falls back to A in more situations. So E is the best choice. E is expected to have similar performance to SG*. |
So we have two methods for weighted random sampling without replacement. In the literature they are known as A-ES and A-Chao. They have a subtle difference, in that one takes the weights as relative to the total weight of all items ('using defined weights'), and the other as relative to the remaining items ('using defined probabilities'). Should we support both in Rand? A few points to consider:
I think we can say that for most users, A-ES is the algorithm they expect. Do we want to explain the difference, and provide users with a choice? I think a small explanation is good, together with naming both algorithms. But we do not have to include every possible algorithm in Rand. There is no problem implementing A-Chao in another crate. Weighted sampling without replacement is good to have, yet including two with subtle differences seems like too much. I would suggest Rand only implements A-ES. |
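For reference, A-ES (Efraimidis and Spirakis) is compact: each item gets the key u^(1/w) for a uniform u in (0, 1), and the `amount` items with the largest keys form the sample. A sketch under assumptions (toy `Lcg` in place of a real `Rng`; full sort instead of a bounded heap):

```rust
// Minimal LCG standing in for a real RNG; `next_unit` returns a uniform
// value strictly inside (0, 1).
struct Lcg(u64);

impl Lcg {
    fn next_unit(&mut self) -> f64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        ((self.0 >> 11) as f64 + 0.5) / 9007199254740992.0 // 2^53
    }
}

/// A-ES weighted sampling without replacement: key each item by
/// u^(1/weight) and keep the `amount` largest keys. Items with weight 0
/// get key 0 and thus lose to any positively weighted item.
fn sample_weighted_aes(weights: &[f64], amount: usize, rng: &mut Lcg) -> Vec<usize> {
    let mut keyed: Vec<(f64, usize)> = weights
        .iter()
        .enumerate()
        .map(|(i, &w)| (rng.next_unit().powf(1.0 / w), i))
        .collect();
    // Largest keys first; a bounded binary heap would avoid the full sort
    // when `amount` is much smaller than the input.
    keyed.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    keyed.truncate(amount);
    keyed.into_iter().map(|(_, i)| i).collect()
}
```

Note this realises the 'defined weights' semantics discussed above; A-Chao's 'defined probabilities' variant needs a different reservoir update and is not shown.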
Implemented Vitter's method D today (although there might still be a couple of bugs in it...). At least enough to see how it works and performs. Jeffrey Vitter produced two papers about it. One in 1984 that details the theory behind it, and all the possibilities for optimization. Very handy 😄. The second paper from 1987 is much shorter and does not add all that much. I translated the Pascal-like code from the second paper to Rust. The loops map very cleanly to an iterator implementation in Rust. Also our exponential distribution (Ziggurat method) considerably improved performance. But because method D has to use … Performance is very comparable to the current … Method D has a few small advantages:
pub trait SliceRandom {
type Item;
/* ... */
/// Produces an iterator that chooses `amount` elements from the slice at
/// random without repeating any, in sequential order.
fn pick_multiple<'a, R: Rng>(&'a self, amount: usize, rng: &'a mut R)
-> RandomSampler<'a, Self::Item, R>;
// TODO: probably a pick_multiple_mut can also be useful
}
pub struct RandomSampler<'a, T: 'a, R: Rng + 'a> {
/* ... */
}
impl<'a, T: 'a, R: Rng + 'a> Iterator for RandomSampler<'a, T, R> {
type Item = &'a T;
fn next(&mut self) -> Option<&'a T> {
/* ... */
}
}
/* Example of using pick_multiple to sample 20 values from a slice with 100 integers */
let mut r = thread_rng();
let vals = (0..100).collect::<Vec<i32>>();
let small_sample: Vec<_> = vals.pick_multiple(20, &mut r).collect();
println!("{:?}", small_sample); I also tried to implement the suggested improvements in the paper from Nair (1990) as algorithm E. He claims that by doing an extra calculation, it is possible to run the relatively slow method D less often. And that is true. His method calculates a lower and an upper bound, and if they are the same, no extra calculations are necessary. But computing the bounds for every iteration is much more work! His paper does not include any measurements. It actually makes the algorithm ~50% slower. (Edit: I may of course have messed up the implementation, but I think not, because it is only a couple of straightforward lines.) I also see no way his proposed improvement to Algorithm A can make it any faster. Both were nice insights, but not practical. |
What's the current status here? I'm quite interested to jump in and help. Is it still up for discussion which algorithms to implement and what API to use? |
Great! There is not much more current status than what you see here. Mostly just my explorations from two months ago. Now that the 0.5 release is around the corner, we can start looking again at new features.
Yes, completely. What do you think of my proposal for an API in #82 (comment)? My effort at implementing it (still in rough shape) lives in https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs. Weighted sampling is still a bit of an open problem. I was thinking that an API taking two iterators, one with the data and one with the weights, would be most flexible, but I don't have anything concrete yet. Feel free to come up with any different ideas, or take code, as you want. |
Cool! I think that API for weighted sampling makes a lot of sense. I imagine something like (with one of these arguments passed as …): fn pick_weighted<IterItems, IterWeights, R>(items: IterItems,
weights: IterWeights,
rng: &mut R)
-> IterItems::Item where IterItems: Iterator,
IterWeights: Iterator+Clone,
IterWeights::Item: SampleUniform+From<u8>+Sum<IterWeights::Item> {
let total_weight: IterWeights::Item = weights.clone().sum();
pick_weighted_with_total(items, weights, total_weight, rng)
}
fn pick_weighted_with_total<IterItems, IterWeights, R>(items: IterItems,
weights: IterWeights,
total_weight: IterWeights::Item,
rng: &mut R)
-> IterItems::Item where IterItems: Iterator,
IterWeights: Iterator,
IterWeights::Item: SampleUniform+From<u8> {
...
} I'm happy to provide a patch for this. On a separate, and more general, topic regarding API design, I'm trying to understand why the proposals tend to stick these functions on … I.e. why … I certainly agree that sticking functions on existing types generally results in nicer APIs than creating global functions, i.e. … However, between
But I admit I'm also strongly biased by coming from C++, which for obvious reasons doesn't have a tradition of libraries extending existing types. And there are a couple of good arguments for the …
Are there any general Rust guidelines around this? Should we leave this particular topic to a separate issue? |
There's this https://rust-lang-nursery.github.io/api-guidelines/ idk how much is applicable though |
I think I've roughly caught up on this, except for weighted sampling and not looking into all the algorithms... I quite like @pitdicker's proposed API, except that I would move … One thing that stands out is that … Another point is that with iterators, if the exact length is known, much more efficient implementations are possible. But I guess this situation is unlikely, so users can make their own wrappers around … Is it actually possible to implement …
A lot of this requires a good … |
Also something I was thinking about. Your rationales are important; a couple more:
I think at least methods consuming iterators should put the methods on the iterator, i.e. … For slices I'm less sure. (We can even have both, as with … |
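Putting `choose` on the iterator side is cheap to sketch: it is just a reservoir of size one. A sketch with the toy `Lcg` again standing in for a real `Rng` (illustration only):

```rust
// Minimal LCG standing in for a real RNG; use a proper `Rng` in practice.
struct Lcg(u64);

impl Lcg {
    fn next_below(&mut self, n: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) as usize % n
    }
}

/// `choose` for an iterator of unknown length: a reservoir of size one.
/// The i-th element (0-based) survives with probability 1 / (i + 1), so
/// every element ends up equally likely. Returns `None` when the iterator
/// is empty — answering the FIXME above: yes, that case must be handled.
fn choose<T, I: IntoIterator<Item = T>>(iter: I, rng: &mut Lcg) -> Option<T> {
    let mut chosen = None;
    for (i, item) in iter.into_iter().enumerate() {
        if rng.next_below(i + 1) == 0 {
            chosen = Some(item);
        }
    }
    chosen
}
```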
Having noodled on this for a few days, I can certainly see the logic in having Rng only have functions which generate values, and putting functions which operate on values in slices and iterators into extension traits for those types. However, that does leave APIs like … But if we go with this approach I think it's important that we only stick functions like … And I do still have concerns around discoverability and documentation... If we go with this approach, I agree with @dhardy that we should do … |
FWIW, I'd really like to come to some sort of conclusion here. I'd enjoy writing up more patches for sequence-related functions, but don't want to waste my time on PRs that aren't going to go anywhere due to using the wrong API. |
rust-random#483 was my attempt at drawing some kind of conclusion from this. It's more complicated than it should be (I need to clean up the commits), but essentially just waiting for feedback on the basic design stuff. |
@pitdicker: Is the code for sampling multiple values at https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs working? I.e. is … |
I have tested it, but only lightly. But I have re-checked multiple times that it is implemented exactly as in the paper. b.t.w. @dhardy @sicking I have enjoyed working with you on rand, but it is a bit more difficult to find the time at the moment (as you probably have noticed). Feel free to cc me with @pitdicker on anything, and hopefully I can do more after the summer. |
No problem @pitdicker; thank you for all the work! |
Thanks for the work @pitdicker! I've enjoyed working with you a lot. I'm actually also likely to have less time. I'm starting a new job next week so will have significantly less free time on my hands. I still have a to-do list of items that I want to get done, and hope to be able to get through it before the 1.0 release. So for now hopefully I'll still be able to stay involved. |
I believe we've finally wrapped most of this up. |
This part of `rand` has needed work for a while, and hasn't had much love recently, excepting the excellent work on `seq::sample*` by @vitiral.

`WeightedChoice` is a distribution for sampling one item from a slice, with weights. The current API only accepts a slice (reference) and clones objects when sampling.

The `seq` module has functions to: …

Issues to solve:

- Should `WeightedChoice` and `seq::*` stay in completely separate modules, the first as a distribution and the second as simple functions?
- … `rand`?

References:

- rust-random/rand#169 [alg]: `sample` with randomised ordering
- rust-random/rand#202 [API]: investigate generic `sample_indicies`
- rust-random/rand#142 [API]: `WeightedChoice` is hard to use with borrow checker
- `RandomChoice` impl
- `Rng::pick`/`Choose` (`WeightedChoice` owning, adds `Choose`, some reorganisation)

I'm going to say nothing is fixed at this point. In theory it would be nice not to break the new `seq` code.
code.The text was updated successfully, but these errors were encountered: