Sequence sampling: seq, WeightedChoice #82

Closed

dhardy opened this issue Jan 6, 2018 · 37 comments

@dhardy (Owner) commented Jan 6, 2018

This part of rand has needed work for a while, and hasn't had much love recently, excepting the excellent work on seq::sample* by @vitiral.

WeightedChoice is a distribution for sampling one item from a slice, with weights. The current API only accepts a slice (reference) and clones objects when sampling.

The seq module has functions to:

  • sample a set number of items from an iterator of unknown length, returning some of the iterated items
  • sample a set number of items from a slice, returning clones of the items
  • sample a set number of items from a slice, returning references into the original slice

Issues to solve:

  • Is this sufficiently generic/flexible? Do we cover all expected functionality?
  • Is the code succinct?
  • Is the documentation sufficient?
  • Why are WeightedChoice and seq::* in completely separate modules, the first as a distribution, the second as simple functions?
  • Is there excess functionality which arguably doesn't belong in rand?

References:

I'm going to say nothing is fixed at this point. In theory it would be nice not to break the new seq code.

@pitdicker

I have not looked all that closely at what the sample functions are supposed to do to be honest. Does this summarize it?

The new seq::sample* functions and Robert Floyd's combination algorithm (rust-random#144) are supposed to pick values from a range without returning any duplicates. And the problem is that all the previous results have to be kept in memory and checked for collisions.

Is there really not a better solution? It should be possible to adapt RNGs like PCG (or anything LCG- or MCG-based) to work on a certain number of bits. What if we use an RNG with just enough bits to be able to generate all possible values, and reject the ones outside the range? Then all values are guaranteed to occur only once, and no vector with many comparisons or more complex hashmap is necessary. And no dependency on alloc or std.

This has the disadvantage of not working with a user-supplied RNG. And maybe I am missing something big? cc @vitiral, @burdges
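
For illustration, a minimal sketch of this idea (all names here are hypothetical, not rand API): an LCG with a power-of-two modulus, odd increment, and multiplier ≡ 1 (mod 4) has full period, so it visits every k-bit value exactly once per cycle; rejecting values outside the target range then yields each value in 0..range exactly once, without storing earlier results.

use rand::Rng;  // only used to seed; the walk itself is the tiny LCG below

struct TinyLcg { state: u32, bits: u32 }

impl TinyLcg {
    fn new(seed: u32, bits: u32) -> Self {
        assert!(bits >= 1 && bits < 32);
        TinyLcg { state: seed & ((1 << bits) - 1), bits }
    }
    fn next_value(&mut self) -> u32 {
        // Multiplier ≡ 1 (mod 4) and an odd increment satisfy the Hull-Dobell
        // conditions for a power-of-two modulus, giving full period 2^bits.
        let mask = (1u32 << self.bits) - 1;
        self.state = self.state.wrapping_mul(0x2545_F491).wrapping_add(1) & mask;
        self.state
    }
}

/// Yields every value in `0..range` exactly once, in an LCG-determined order.
fn unique_values<R: Rng>(rng: &mut R, range: u32) -> impl Iterator<Item = u32> {
    assert!(range >= 1);
    let bits = (32 - (range - 1).leading_zeros()).max(1);  // enough bits to cover 0..range
    let mut lcg = TinyLcg::new(rng.gen(), bits);
    (0..1u64 << bits).map(move |_| lcg.next_value()).filter(move |&x| x < range)
}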

@pitdicker

maybe I am missing something big?

😄 Realized it just after commenting. If we have to pick values from, say, 0..800 we need 10 bits. That would also mean the seed of the RNG is 10 bits. Then the RNG can never generate more than 2^10 = 1024 different orderings, while there are 800! possible orderings, which is basically infinite in my book. So that is an argument not to use this method, but one that can utilize an RNG with a larger period. With an RNG with a period of 2^64 or 2^128 it is at least impossible to exhaustively list all possible orderings.

@dhardy (Owner, Author) commented Jan 6, 2018

That's some out-of-the-box thinking! I guess your custom sampling RNG can be seeded from the provided RNG, so your idea could work, if there is a way to reliably generate a high-quality random-hop RNG visiting all nodes of the given range exactly once in random-like order. That sounds hard to me!

Mostly though I made this issue in the hope that someone can design a nice API tying all the above things together.

@pitdicker

Mostly though I made this issue in the hope that someone can design a nice API tying all the above things together.

But that is much more difficult! 😄.

I see WeightedChoice as a somewhat odd functionality that not many will use. Do you think there is a place in rand for functions like this, which implement some smart algorithm to solve a not entirely uncommon problem?

Just today I found out about this nice problem of picking a random point on a sphere: http://mathworld.wolfram.com/SpherePointPicking.html. This is the same problem as picking a direction in 3D, which comes up in games.

Would it make sense to add a module collecting stuff like this?

They may use the Distribution trait if that is convenient, but I don't see that as a reason to live in the distributions module.

@dhardy (Owner, Author) commented Jan 6, 2018

I see WeightedChoice as a somewhat odd functionality that not many will use.

That was my thought too; removing it is definitely an option.

@burdges commented Jan 6, 2018

A PRNG can only generate unique values in a range with high probability if the range is much larger than the number of values requested.

I noticed sample_slice incurs a potentially undesirable allocation. At least sample_slice_ref should avoid allocation by converting all the indices to references directly via something like:

// Reuses the index buffer as storage: each `usize` index is overwritten with the
// address of the corresponding element, then the buffer is reinterpreted as a
// slice of references (this relies on `usize` and `&T` having the same layout).
fn index_to_ref<'a, T>(slice: &'a [T], index: &'a mut [usize]) -> &'a [&'a T] {
    let s = slice.as_ptr();
    let l = slice.len();
    for i in index.iter_mut() {
        assert!(*i < l);
        *i = unsafe { s.offset(*i as isize) } as usize;
    }
    unsafe { mem::transmute(index) }
}
// I don't know if this `mem::transmute` call could/should be replaced with some
// more idiomatic raw pointer call.
// There is no safe linear-time way to implement the mutable variant
//   fn index_to_mut<'a, T>(slice: &'a mut [T], index: &'a mut [usize]) -> &'a mut [&'a mut T]
// but unsafely index_to_ref could be coerced into doing the mutable version for us.

Isn't there a use case for returning the second half of a Fisher-Yates shuffle applied to permute the first half? I think it gives you a faster combination in some situations: rust-random#169 (comment)

In principle, we could provide no_std variants that take a user-supplied buffer, and call them for the -> Vec<T> versions. At least Robert Floyd's combination algorithm works:

fn combination_slice<R: Rng>(rng: &mut R, range: std::ops::Range<usize>, s: &mut [usize]) {
    let mut s = ::slicevec::SliceVec::new(s);  // https://github.com/jonas-schievink/slicevec
    let amount = s.capacity();
    let n = range.end;
    let m = std::cmp::min(amount, n - 1);
    // Should use #![feature(inclusive_range_syntax)] here in future.
    for j in n - m + 1..n + 1 {
        let t = rng.gen_range(range.start, j);
        let t = if s.contains(&t) { j - 1 } else { t };
        s.push(t);
    }
}

fn combination_vec<R: Rng>(rng: &mut R, range: std::ops::Range<usize>, amount: usize) -> Vec<usize> {
    let mut s = Vec::with_capacity(amount);
    unsafe { s.set_len(amount) };
    combination_slice(rng, range, s.as_mut_slice());
    s
}

fn combination_array<R: Rng, const AMOUNT: usize>(rng: &mut R, range: std::ops::Range<usize>) -> [usize; AMOUNT] {
    let mut s = [0usize; AMOUNT];
    combination_slice(rng, range, &mut s);
    s
}

// Deprecated form, using arrayvec's `Array` trait:
fn combination_arrayvec<R: Rng, A: Array<Item = usize>>(rng: &mut R, range: std::ops::Range<usize>) -> A {
    let mut s = ArrayVec::<A>::new();
    for _ in 0..s.capacity() { s.push(0); }  // fill to capacity first
    combination_slice(rng, range, s.as_mut_slice());
    s.into_inner().ok().unwrap()
}

It becomes tricky with T: Drop types of course, but SliceVec seemingly handles them, albeit with unnecessary drops. I suppose allocators might become the preferred mechanism here.

I suppose sample_indices_cache is kinda meant to integrate Robert Floyd's combination algorithm's benefits while avoiding the amount^2 / 2 comparisons by using a HashMap? It's "running backwards" so that you can output from the main loop though? Assuming that's done correctly, then sample_indices_cache could avoid allocating entirely by becoming an Iterator itself, no?

pub struct SampleIndicesCache<'a, R: Rng + 'a> {
    rng: &'a mut R,
    length: usize,  // or Range<usize>
    cache: HashMap<usize, usize>,
    count: usize,
}

fn sample_indices_cache<'a, R: Rng>(rng: &'a mut R, length: usize, amount: usize)
    -> SampleIndicesCache<'a, R>
{
    let cache = HashMap::with_capacity(amount);
    SampleIndicesCache { rng, length, cache, count: 0 }
}

impl<'a, R: Rng> Iterator for SampleIndicesCache<'a, R> {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        let i = self.count;
        if i >= self.length { return None; }
        self.count += 1;
        let j: usize = self.rng.gen_range(i, self.length);

        let tmp = self.cache.remove(&i).unwrap_or(i);  // equiv: let tmp = slice[i];
        if i == j { return Some(tmp); }  // avoid reinserting or botching the return

        // let x = self.cache.remove(&j).unwrap_or(j);  // equiv: slice[i] = slice[j];
        // self.cache.insert(j, tmp);                   // equiv: slice[j] = tmp;
        // It's faster than the previous two lines if we use the Entry API:
        Some(mem::replace(self.cache.entry(j).or_insert(j), tmp))
        // note that in the in-place version, slice[i] is automatically the "returned" value
    }
}

After writing this, there was seemingly a mistake in sample_indices_cache in which j comes up twice and i==j the second time, but an early return fixes it.

I'm still vaguely uncomfortable doing all this only for usize too. Anyone want these for u64 on a 32 bit platform?

@vitiral commented Jan 6, 2018

@burdges I'm not really sure about the rest of your issue, but generic sample_indices is covered in rust-random#202

@burdges commented Jan 7, 2018

We might as well provide the ref and mut iterators as well:

pub struct SampleRefsIter<'a, T: 'a, R: Rng + 'a> {
    slice: &'a [T],
    indices: SampleIndicesCache<'a, R>,
}

fn sample_refs_iter<'a, T, R: Rng>(rng: &'a mut R, slice: &'a [T]) -> SampleRefsIter<'a, T, R> {
    // capacity hint: callers normally `take(amount)` from the iterator
    SampleRefsIter { slice, indices: sample_indices_cache(rng, slice.len(), slice.len()) }
}

impl<'a, T, R: Rng> Iterator for SampleRefsIter<'a, T, R> {
    type Item = &'a T;
    fn next(&mut self) -> Option<&'a T> {
        let slice = self.slice;
        self.indices.next().map(|x| {
            assert!(x < slice.len());
            unsafe { &*slice.as_ptr().offset(x as isize) }
        })
    }
}

pub struct SampleMutsIter<'a, T: 'a, R: Rng + 'a>(SampleRefsIter<'a, T, R>);

fn sample_muts_iter<'a, T, R: Rng>(rng: &'a mut R, slice: &'a mut [T]) -> SampleMutsIter<'a, T, R> {
    SampleMutsIter(sample_refs_iter(rng, slice))
}

impl<'a, T, R: Rng> Iterator for SampleMutsIter<'a, T, R> {
    type Item = &'a mut T;
    fn next(&mut self) -> Option<&'a mut T> {
        // In principle, this should be safe because the sampler cannot return the same
        // index twice, but `mem::transmute` should be replaced with more idiomatic raw
        // pointer calls, and worse, this makes memory safety dependent upon the
        // correctness of `HashMap`. I proposed this approach to remain compatible with
        // other mutable iterators, ala
        // https://doc.rust-lang.org/src/core/slice/mod.rs.html#1422-1426
        // but a bound like `where Self: 'a` might insulate us from `HashMap`, at the
        // cost of some usability: references could not be kept between iterations.
        self.0.next().map(|x| unsafe { mem::transmute(x) })
    }
}

@dhardy (Owner, Author) commented Mar 12, 2018

We have roughly this functionality:

  • choose 1 elt from a slice/iter
  • sample n elts from a slice/iter, by value/index/ref
  • sample values from a weighted distribution

Some of this functionality is currently in Rng (see rust-random#293), some in other places.

Rough proposal: add a trait something like the following, with impls for both iterators and slices (optimised to each appropriately):

pub trait SampleSeq<T> {
    type Item;
    // as the current seq::sample:
    fn sample<R: Rng>(self, n: usize, rng: &mut R) -> Result<Vec<T>, Vec<T>>;
    // as above, but with a weight extracted per element:
    fn sample_weighted<R: Rng, F>(self, weight_fn: F, n: usize, rng: &mut R)
        -> Result<Vec<T>, Vec<T>> where F: Fn(&Self::Item) -> f64;
    // sample, but with n == 1:
    fn choose<R: Rng>(self, rng: &mut R) -> Option<T>;
}

pub fn sample_indices<R: Rng>(rng: &mut R, length: usize, n: usize) -> Vec<usize>;

@pitdicker

A trait that can be implemented for both slices and iterators seems like a nice idea. But I don't follow it completely.

  • I am missing sample_slice_ref, but that is hard to implement for iterators?
  • Is combining choose with iterators not a very strange scenario? Do you have an example of what code using that would look like?
  • And I am not sure what sample_weighted does...

@dhardy (Owner, Author) commented Mar 13, 2018

You might be right about sample_slice_ref. I wonder if we could just template over both the input type and the output type, i.e. SampleSeq<&[i32], &i32>, or possibly not.

Choose on iterators (maybe it is strange): rust-random#198

I don't know; this needs more thought; I was just mentioning an idea to explore later.

@pitdicker commented Mar 14, 2018

Slice algorithms

Which nice algorithms do we have? I have excluded WeightedChoice here; that seems a problem by itself...
Just diving in to figure out the possibilities here.

shuffle: Fisher-Yates shuffle

Shuffle the elements of a mutable slice.

choose: pick one element from a slice

Simply uses gen_range under the hood. Repeatedly calling this method can sometimes return the same element as before.

We could return a copy/clone of the element, a reference or a mutable reference. I would say the last two are the most useful, giving choose and choose_mut as methods, as we have now.
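
A minimal sketch of choose in these terms (assuming rand's two-argument, half-open gen_range):

use rand::Rng;

fn choose<'a, T, R: Rng>(slice: &'a [T], rng: &mut R) -> Option<&'a T> {
    if slice.is_empty() {
        None  // an empty slice has nothing to pick
    } else {
        Some(&slice[rng.gen_range(0, slice.len())])
    }
}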

choose_multiple(n): pick multiple unique elements from a slice

Uses Robert Floyd's algorithm (rust-random#144). The order of the picked elements may not be random, but which elements are picked is. To also have a random order call .shuffle() on the result.

This will either have to return a vector, or have to fill a slice with the results (something to do to support no_std).

During its operation the algorithm needs to see if the resulting vector already contains a picked value. That would mean all the elements need to be unique for the method to work. So this method should not be used on the elements of slices directly, but only on a range of integers (indices) or on references to the elements.

It seems to me there are two nice ways to implement this algorithm. One is as in the PR: the basic combination algorithm, operating only on a range of integers. The resulting list can also be used as indices into a slice. Maybe call it something like choose_from_range(range, n). Or again not take an amount n, but a slice to fill with numbers: choose_from_range(range, &mut slice).

The second could work on a slice directly by creating and comparing references. This would fill a slice with references to the source slice, but I am not sure this works out with lifetimes...

b.t.w. currently seq::sample_indices_cache uses a hashmap, but I think it can better use Robert Floyd's algorithm.

partial_shuffle(n)

In rust-random#169 (comment) a nice design for a function doing a partial Fisher-Yates shuffle came up.

This is an alternative to Robert Floyd's algorithm. It makes a different trade-off: Robert Floyd's algorithm does many (on the order of n²) comparisons, and the order of the elements is not really random. A partial Fisher-Yates shuffle mutates the source slice, but does not need to do comparisons, and the order of the elements is random. Because it mutates the source slice, the slice may have to be cloned before calling this method. choose_multiple(n) is fastest for a relatively small number of elements, partial_shuffle(n) for a relatively large number.

The returned values are two sub-slices of the source slice: one with the partially shuffled elements, one 'scratch space' with the remaining elements.
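
A rough sketch of partial_shuffle as a truncated Fisher-Yates shuffle, following the signature proposed here (not the eventual rand implementation):

use rand::Rng;

fn partial_shuffle<'a, T, R: Rng>(slice: &'a mut [T], rng: &mut R, amount: usize)
    -> (&'a mut [T], &'a mut [T])
{
    let amount = ::std::cmp::min(amount, slice.len());
    for i in 0..amount {
        // As in a full Fisher-Yates shuffle, swap element i with a random
        // element in i..len, but stop once `amount` positions are fixed.
        let j = rng.gen_range(i, slice.len());
        slice.swap(i, j);
    }
    slice.split_at_mut(amount)
}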

Convenient higher-level methods

The current sample_slice, sample_slice_ref and sample_iter can be built from choose_from_range(range, n) and partial_shuffle(n). They may pick the best method and do temporary allocations.

@pitdicker commented Mar 16, 2018

Iterator algorithms

Is it possible to implement methods for iterators similar to those for slices?

shuffle: inside-out Fisher-Yates shuffle

Collect and shuffle the elements of an iterator into a vector. I found no better reference than Wikipedia. We may have to get a little creative to adapt it to Rust and avoid reading uninitialized memory.

This might be slightly faster than .collect().shuffle(), but probably not. We would be up against the standard library 😄, and a specialized collect implementation.

Maybe confusing and not worth it.

choose_multiple(n): pick multiple unique elements from an iterator

Algorithms that can sample from an unknown number of iterations, and that only go over all the iterations once and sequentially, are known as reservoir sampling algorithms. Algorithm Z from Random Sampling with a Reservoir by Jeffrey Vitter seems to be state of the art here.

Note that the goal here is to only go over the iteration once, not to minimize the number of RNG calls (as with Robert Floyd's algorithm, which works on slices).

The current sample_iter uses a much simpler algorithm than Vitter's (it needs to generate a random number for every iteration), but is otherwise similar to choose_multiple(n).

I think returning Err<...> if the iterator turns out to have fewer than n elements, as sample_iter does, is a good idea.
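
A minimal sketch of simple reservoir sampling (algorithm R), which is roughly what the current sample_iter does, including the Err case just mentioned:

use rand::Rng;

fn reservoir_sample<T, I, R>(rng: &mut R, iterable: I, amount: usize)
    -> Result<Vec<T>, Vec<T>>
    where I: IntoIterator<Item = T>, R: Rng
{
    let mut iter = iterable.into_iter();
    let mut reservoir: Vec<T> = iter.by_ref().take(amount).collect();
    if reservoir.len() < amount {
        return Err(reservoir);  // fewer than `amount` elements available
    }
    for (i, elem) in iter.enumerate() {
        // Element number `amount + i` (0-based) replaces a random reservoir
        // slot with probability amount / (amount + i + 1).
        let k = rng.gen_range(0, amount + i + 1);
        if k < amount {
            reservoir[k] = elem;
        }
    }
    Ok(reservoir)
}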

choose: pick one element from an iterator

This would just be a specialization of choose_multiple(n) to only pick one element. A faster alternative to rust-random#198.

choose_one_in_n(n)

On average pick every 1 in n elements. Suppose n = 10. This may pick two elements right after each other, or skip 40 elements, as long as each occurs with the 'right' probability and the average is 10. Not thought this through yet to be honest...

Edit: this is the only method in this list that can be implemented as an iterator itself.

Possible API

(Or how I think it would look, modeled after core::SliceExt, core::Iterator::last and Iterator::collect to the best of my abilities...)

pub trait SliceRandom {
    type Item;

    fn choose<R: Rng + ?Sized>(&self, rng: &mut R) -> Option<&Self::Item>;
    fn choose_mut<R: Rng + ?Sized>(&mut self, rng: &mut R)
    -> Option<&mut Self::Item>;
    fn shuffle<R: Rng + ?Sized>(&mut self, rng: &mut R);
    fn choose_multiple<R: Rng + ?Sized>(&self, rng: &mut R, amount: usize)
    -> Vec<Self::Item>;
        // FIXME: requires Self::Item is Clone
    fn choose_multiple_ref<R: Rng + ?Sized>(&self, rng: &mut R, amount: usize)
    -> Vec<&Self::Item>;
        // FIXME: needs to couple the lifetimes
    fn partial_shuffle<R: Rng + ?Sized>(&mut self, rng: &mut R, amount: usize)
    -> (&mut [Self::Item], &mut [Self::Item]);
}

impl<T> SliceRandom for [T] {
    type Item = T;

    /* ... */
}

pub trait IteratorRandom: Iterator {
    fn choose<R>(self, rng: &mut R) -> Option<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }
    fn shuffle<R>(self, rng: &mut R) -> Vec<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }
    fn choose_multiple<R>(self, rng: &mut R, amount: usize)
    -> Result<Vec<Self::Item>, Vec<Self::Item>>
        where R: Rng + ?Sized
    { /* ... */ }
    fn choose_one_in_n<R>(self, rng: &mut R, n: usize) -> Vec<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }
}

pub fn choose_from_range(/* ... */);

@pitdicker

An interesting alternative to Robert Floyd's algorithm is Jeffrey Vitter's Efficient Algorithm for Sequential Random Sampling, method D.

It works by going over a list once, picking an element, and then skipping multiple elements with just the right probability. This makes its running time linear, as opposed to Robert Floyd's algorithm, which has to do a quadratic number of comparisons. So there will be a tipping point where one algorithm is faster than the other.

What makes it interesting/unique API-wise is that it can be used as an iterator over a slice. (Not as an adapter to other iterators though, as it requires the number of elements to be known from the start.) Could be part of the methods on a slice as choose_multiple_iter(n).

@TheIronBorn commented Mar 16, 2018

How does Method D differ from Jeffrey Vitter's Algorithm R used in seq::sample_iter?

@pitdicker commented Mar 16, 2018

Algorithm Z does not need to know the number of elements it can sample from. It will first fill the destination slice with many items, and then at a decreasing rate start replacing them with elements from further on in the source slice.

Method D needs to know the number of elements it can sample from. It will pick a random interval and copy a first element. Then based on the number of remaining elements it will calculate an interval with the appropriate probability to skip, and then sample a second element. Then a third, until exactly the required number of elements is reached, and roughly the end of the source slice.

So Algorithm Z will do many copies (some logarithmic function of n), Method D will do only n copies. Method D needs to know the number of elements in the source slice, Algorithm Z does not. At least, that is my understanding without touching any code 😄.

@pitdicker commented Mar 16, 2018

How does Method D differ from Jeffrey Vitter's Algorithm R used in seq::sample_iter?

Oh, I am sorry, I didn't read your question closely enough and confused some things. Algorithm Z is supposed to be an improved Algorithm R, even said to be optimal. But thank you for naming it, I was still searching for which one was used there 👍.

R has to calculate a random value in a range for every element, Z only for some logarithmic fraction of them: 3n·ln(N/n) according to the paper, with N the total number of elements and n the amount to sample.

@pitdicker commented Mar 17, 2018

Some random collected thoughts:

I now think implementing shuffle for iterators is really not worth it. It is a bit confusing, less flexible and probably not even faster than the alternative.

I don't think returning a result type for choose_multiple(n), so that an iterator can return an error if it has fewer than n elements, is worth it. Checking the length of the Vec is just as good an indication that there are fewer elements. And isn't returning less when there is simply nothing more available just how APIs that work with iterators operate right now? For example .take(5).collect() does not guarantee you end up with 5 elements.

If we reverse choose_multiple(n), we can make it workable with no_std and not do any allocations: fill a slice of n elements from a larger slice, or even from an iterator.

    fn choose_multiple_from<R: Rng + ?Sized>(&self,
                                             rng: &mut R,
                                             source: &[Self::Item],
                                             amount: usize);

What should partial_shuffle(amount) do if amount is greater than the number of elements in the slice? I suppose it should simply shuffle the entire slice, not return some option or error.

Is choose_multiple_ref a function worth having? Returning a vector of references feels strange to me. Isn't it much better to either generate a vector of indices, or to use the iterator variant with .by_ref()?

It seems to me reservoir sampling from an iterator can be parallelized using Rayon. Now I don't know much about Rayon. Every subdivision needs to sample into its own reservoir of n elements. When joining, one reservoir should be formed by sampling from the two. Not thought this through, but maybe worth exploring.

@pitdicker commented Mar 17, 2018

pub trait SliceRandom {
    type Item;

    /// Returns a reference to one random element of the slice, or `None` if the
    /// slice is empty.
    fn choose<R>(&self, rng: &mut R) -> Option<&Self::Item>
        where R: Rng + ?Sized;

    /// Returns a mutable reference to one random element of the slice, or
    /// `None` if the slice is empty.
    fn choose_mut<R>(&mut self, rng: &mut R) -> Option<&mut Self::Item>
        where R: Rng + ?Sized;

    /// Shuffle a slice in place.
    ///
    /// This applies Durstenfeld's algorithm for the [Fisher–Yates shuffle](
    /// https://wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle), which produces
    /// an unbiased permutation.
    fn shuffle<R>(&mut self, rng: &mut R) where R: Rng + ?Sized;

    /// Shuffle a slice in place, but exit early.
    ///
    /// Returns two mutable slices from the source slice. The first contains
    /// `amount` elements randomly permuted. The second has the remaining
    /// elements that are not fully shuffled.
    ///
    /// This is an efficient method to select `amount` elements at random from
    /// the slice, provided the slice may be mutated.
    ///
    /// If you only need to choose elements randomly and `amount > self.len()/2`
    /// then you may improve performance by taking
    /// `amount = values.len() - amount` and using only the second slice.
    ///
    /// If `amount` is greater than the number of elements in the slice, this
    /// will perform a full shuffle.
    fn partial_shuffle<R>(&mut self, rng: &mut R, amount: usize)
        -> (&mut [Self::Item], &mut [Self::Item]) where R: Rng + ?Sized;

    /// Produces an iterator that chooses `amount` elements from the slice at
    /// random without repeating any.
    ///
    /// TODO: This may internally use Jeffrey Vitter's algorithm D,
    /// Robert Floyd's algorithm, or some other method.
    ///
    /// TODO: Do we want to guarantee the chosen elements are sequential?
    fn choose_multiple<R>(&self, rng: &mut R, amount: usize)
        -> ReservoirSampler<R> where R: Rng + ?Sized;

    /// Fill a mutable slice with elements chosen at random from a finite
    /// iterator.
    ///
    /// Uses Alan Waterman's algorithm R, or Jeffrey Vitter's algorithm Z.
    ///
    /// Note: only algorithm R is available if the `std` feature is not enabled,
    /// because algorithm Z needs `f64::log` and `f64::exp`.
    ///
    /// TODO: What to do if the iterator does not produce enough elements to
    /// fill the slice?
    fn choose_multiple_from<R, I>(&mut self, rng: &mut R, iterable: I)
        where R: Rng + ?Sized, I: IntoIterator<Item=Self::Item>;
}

impl<T> SliceRandom for [T] {
    type Item = T;

    /* ... */
}

impl<R> Iterator for ReservoirSampler<R> where R: Rng + ?Sized {
    /* ... */
}

pub trait IteratorRandom: Iterator {
    /// Choose one element at random from the iterator.
    ///
    /// FIXME: is it ever possible for an iterator to return no elements?
    fn choose<R>(self, rng: &mut R) -> Option<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }

    /// Collects `amount` values at random from the iterator into a vector.
    ///
    /// Uses Alan Waterman's algorithm R, or Jeffrey Vitter's algorithm Z.
    ///
    /// Note: only algorithm R is available if the `std` feature is not enabled,
    /// because algorithm Z needs `f64::log` and `f64::exp`.
    #[cfg(any(feature="std", feature="alloc"))]
    fn choose_multiple<R>(self, rng: &mut R, amount: usize) -> Vec<Self::Item>
        where R: Rng + ?Sized
    { /* ... */ }
}

/// Choose `amount` elements from `0..range` using [Robert Floyd's algorithm](
/// http://fermatslibrary.com/s/a-sample-of-brilliance).
///
/// It only makes `amount` calls to the random number generator instead of
/// `range` calls. It does `amount^2 / 2` extra comparisons however.
pub fn choose_from_range<R>(rng: &mut R, range: usize, amount: usize)
    where R: Rng + ?Sized;
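
For concreteness, a minimal Vec-returning sketch of choose_from_range using Floyd's algorithm (a no_std variant could fill a caller-provided slice instead):

use rand::Rng;

pub fn choose_from_range<R: Rng + ?Sized>(rng: &mut R, range: usize, amount: usize)
    -> Vec<usize>
{
    assert!(amount <= range);
    let mut out: Vec<usize> = Vec::with_capacity(amount);
    for j in (range - amount)..range {
        // Pick t in 0..=j; if t was already picked, j itself cannot have been,
        // so take j instead. This is Floyd's trick for exactly `amount` picks.
        let t = rng.gen_range(0, j + 1);
        out.push(if out.contains(&t) { j } else { t });
    }
    out
}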

As it turns out, the iterator methods don't add much, but they may be ergonomic.

With partial_shuffle, choose_multiple, choose_multiple_from and choose_from_range we get four different algorithms to uniquely sample multiple elements with different trade-offs. And we can do so from ranges, slices and iterators.

@dhardy What do you think of methods like this? I think they are a little less easy to use than the current methods in the seq module, but more flexible, and also usable without std.

(I am still working on weighted sampling, but that is an area with many different choices)

@dhardy (Owner, Author) commented Mar 17, 2018

Uh, I was trying to ignore this until after 0.5; can it wait at least until we decide exactly what needs to be done before the 0.5 release?

@pitdicker commented Mar 18, 2018

Weighted random sampling

For uniform sampling we had two primary methods: sampling with replacement, and sampling without replacement. With replacement could be implemented most naturally with a method that just picks one result, choose. Without replacement needs to know how many elements to pick, with functions such as choose_multiple.

For weighted sampling we have three methods:

  • With replacement (WRS-R)
  • Without replacement, using defined probabilities (WRS-N-P)
  • Without replacement, using defined weights (WRS-N-W)

How to think about the difference between 'with replacement' and 'without replacement'?
Imagine a population of 10 items, the first with a weight 10.0, the rest with weight 1.0. When picking 4 elements, do you expect

  • on average about 2 times item 1, and 2 other items → 'with replacement'
  • a 50% chance to have item 1, and no duplicates → 'without replacement'

The linked paper has two examples to illustrate the difference between 'defined probabilities' and 'defined weights'. One way that helped me to think about it:

  • 'defined probabilities' views the weights as relative to the initial sum of the weights.
  • 'defined weights' views the weights as relative to the sum of the weights of all the not yet sampled elements.

From the paper:

Example 1. Assume that we want to select a weighted random sample of size m = 2 from a population of n = 4 items with weights 1, 1, 1 and 2, respectively. For problem WRS-N-P the probability of items 1, 2 and 3 to be in the random sample is 0.4, whereas the probability of item 4 is 0.8. For WRS-N-W the probability of items 1, 2 and 3 to be in the random sample is 0.433, while the probability of item 4 is 0.7.

The algorithm for reservoir sampling with defined probabilities is the method by M. T. Chao from 1982 A general purpose unequal probability sampling plan.
And the algorithm using defined weights is by Efraimidis and Spirakis from 2005 Weighted Random Sampling.

Further there is one constant for all weighted sampling methods: they all have to iterate over the entire slice(/iterator) at least once, because the sum of all the probabilities must be known.

For weighted sampling with replacement we may use simpler methods. For example, it is possible to do a one-time set-up that calculates the sum of all weights. Then generate a random number in that range, and iterate over all weights until the cumulative sum is greater than the random value. (Somewhat similar to the simple choose method for uniform sampling.) This can be optimised by pre-computing all the partial sums of the weights. With WeightedChoice we currently have a simple implementation of this method that uses integers.
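
A minimal sketch of that cumulative scan, using f64 weights for simplicity (the existing WeightedChoice uses integer weights):

use rand::Rng;

fn choose_weighted<'a, T, R: Rng>(rng: &mut R, items: &'a [T], weights: &[f64]) -> &'a T {
    assert_eq!(items.len(), weights.len());
    let total: f64 = weights.iter().sum();  // the one-time set-up in a real API
    let mut x = rng.gen_range(0.0, total);
    for (item, &w) in items.iter().zip(weights) {
        if x < w {
            return item;
        }
        x -= w;  // skip past this item's share of the total
    }
    items.last().unwrap()  // only reachable through floating-point rounding
}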

This just provides an overview of the involved choices. The real choice of algorithms, how to 'bind' weights to elements, and what API to present is left to explore.

And something else, that seems somewhat related to weighted sampling:
It is possible to sample n elements by taking a constant interval sum/n and only a random starting point. This is known as stochastic universal sampling, and implemented in the Rust Random Choice crate. This picture illustrates how it works. In theory it can be implemented as an efficient iterator that returns n elements from a slice. There are situations where this method can be the right choice, but it is very different from the weighted random sampling discussed here.

@pitdicker commented Mar 19, 2018

The more you read, the more improvements to the algorithms you find. But I think I now have collected the current 'best' for uniform sampling. People seem very creative when coming up with algorithm names...

It began with Knuth in Art of Computer Programming vol. II, where he used the names algorithm S for a selection sampling technique, and algorithm R for reservoir sampling.

Selection sampling
Algorithm S is attributed by Knuth to C. T. Fan, Mervin E. Muller, and Ivan Rezucha, and was also independently discovered by T. G. Jones (1962).

Robert Floyd improved upon it with algorithm F, and also came up with a variant that guarantees the collected samples are permuted as algorithm P. (A sample of brilliance, 1987)

Algorithm S is usually faster than F, because it is not necessary to recompute the acceptable zone for every changing range. But S performs very badly when the number of items to sample gets close to the number of available elements. So it is probably best to switch between both: S when the number of items to sample is less than something like 25%, and P otherwise.
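
For reference, a minimal sketch of algorithm S (selection sampling): walk the range once, selecting each index with probability (still needed) / (still available), which guarantees exactly `amount` selections, returned in increasing order.

use rand::Rng;

fn algorithm_s<R: Rng>(rng: &mut R, n: usize, mut amount: usize) -> Vec<usize> {
    assert!(amount <= n);
    let mut out = Vec::with_capacity(amount);
    for i in 0..n {
        // Select index i with probability `amount / (n - i)`.
        if rng.gen_range(0, n - i) < amount {
            out.push(i);
            amount -= 1;
            if amount == 0 { break; }
        }
    }
    out
}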

Reservoir sampling
Algorithm R is attributed to Alan Waterman.

Jeffrey Vitter improved upon it with algorithm X, Y and Z, which need less values from the RNG. (Random Sampling with a Reservoir, 1985)

Kim-Hung Li introduced algorithm K, L, and M, which are similar to algorithm Z but have different trade-offs. Of the four L seems to be the fastest choice for most situations. (Reservoir-Sampling Algorithms of Time Complexity O(n(1 + log(N/n))), 1994)

Sequential random sampling
In Sequential random sampling (1985) J. H. Ahrens and U. Dieter showed several variants to algorithm S. They renamed S to SU, and introduced SG and SG* as a geometric method, and SH as a method employing hashing techniques. They all have a small disadvantage, in that they need more memory than the reservoir size.

Jeffrey Vitter introduced algorithm A and D (Edit: and B and C in-between) to efficiently sample from a known number of elements sequentially, without needing extra memory. (Faster Methods for Random Sampling, 1984 and Efficient Algorithm for Sequential Random Sampling, 1987)

K. Aiyappan Nair improved upon it with algorithm E. (An Improved Algorithm for Ordered Sequential Random Sampling, 1990)

As I understand it, algorithm D falls back on A if that is faster, and E falls back to A in more situations. So E is the best choice. E is expected to have similar performance to SG*.

@pitdicker commented Mar 21, 2018

So we have two methods for weighted random sampling without replacement. In the literature they are known as A-ES and A-Chao. They have a subtle difference, in that one takes the weights as relative to the total weight of all items ('using defined weights'), and the other as relative to the remaining items ('using defined probabilities'). Should we support both in Rand?

A few points to consider:

  • The difference between both algorithms is insignificant when the number of elements to sample is much smaller the available elements.
  • A-Chao, the algorithm by M. T. Chao, is from 1982. It takes the weights as relative to the remaining items. The problem of taking the weights as relative to the total weight of all items was not cracked until 2005, by Efraimidis and Spirakis (A-ES).
  • Sampling with A-ES gives results that are intuitive to most people, with easy-to-calculate probabilities. You need to be quite statistically oriented to have an intuition for the effects of weights with method A-Chao (when the number of elements to sample is relatively large).

I think we can say that for most users, A-ES is the algorithm they expect.

Do we want to explain the difference, and provide users with a choice? I think a small explanation is good, together with naming both algorithms. But we do not have to include every possible algorithm in Rand. There is no problem implementing A-Chao in another crate. Weighted sampling without replacement is good to have, yet including two with subtle differences seems like too much.

I would suggest Rand only implements A-ES.
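
For reference, a rough sketch of the A-ES key trick: give each item the key u^(1/w), with u uniform in (0,1), and keep the `amount` items with the largest keys. Positive weights are assumed, and a real implementation would use a bounded heap instead of sorting everything:

use rand::Rng;

fn a_es<T: Clone, R: Rng>(rng: &mut R, items: &[(T, f64)], amount: usize) -> Vec<T> {
    let mut keyed: Vec<(f64, &T)> = items.iter()
        .map(|&(ref item, w)| (rng.gen::<f64>().powf(1.0 / w), item))
        .collect();
    // Sort by key, descending; with positive weights the keys are never NaN.
    keyed.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    keyed.into_iter().take(amount).map(|(_, item)| item.clone()).collect()
}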

@pitdicker commented Mar 24, 2018

Implemented Vitter's method D today (although there might still be a couple of bugs in it...). At least enough to see how it works and performs.

Jeffrey Vitter produced two papers about it. One in 1984 that details the theory behind it, and all the possibilities for optimization. Very handy 😄. The second paper from 1987 is much shorter and does not add all that much. I translated the Pascal-like code from the second paper to Rust.

The loops map very cleanly to an iterator implementation in Rust. Also, our exponential distribution (Ziggurat method) considerably improved performance. But because method D has to use ln() and exp() a couple of times, it is not possible to use with just core, only with std. Edit: well, the full method D is not available, but the often somewhat slower method A can be used with no_std.

Performance is very comparable to the current sample_slice and sample_slice_ref. Sometimes a bit faster, sometimes a bit slower, usually within 10%. I was partly disappointed by the performance, but another way to look at it is that the current implementation of sample_slice is pretty good.

Method D has a few small advantages:

  • no need for extra memory (so no allocations),
  • elements are returned in order,
  • no set-up cost,
  • no HashMap or other data structure required (I think a 'pure' algorithm is neater, but that is probably just a useless preference).
pub trait SliceRandom {
    type Item;

    /* ... */
    /// Produces an iterator that chooses `amount` elements from the slice at
    /// random without repeating any, in sequential order.
    fn pick_multiple<'a, R: Rng>(&'a self, amount: usize, rng: &'a mut R)
        -> RandomSampler<'a, Self::Item, R>;

    // TODO: probably a pick_multiple_mut can also be useful
}

pub struct RandomSampler<'a, T: 'a, R: Rng + 'a> {
    /* ... */
}

impl<'a, T: 'a, R: Rng + 'a> Iterator for RandomSampler<'a, T, R> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        /* ... */
    }
}

/* Example of using pick_multiple to sample 20 values from a slice with 100 integers */
let mut r = thread_rng();
let vals = (0..100).collect::<Vec<i32>>();
let small_sample: Vec<_> = vals.pick_multiple(20, &mut r).collect();
println!("{:?}", small_sample);

I also tried to implement the suggested improvements in the paper from Nair (1990) as algorithm E. He claims that by doing an extra calculation, it is possible to run the relatively slow method D step less often. And that is true. His method calculates a lower and an upper bound, and if they are the same, no extra calculations are necessary.

But computing the bounds for every iteration is much more work! His paper does not include any measurements. It actually makes the algorithm ~50% slower. (Edit: I may of course have messed up the implementation, but I think not, because it is only a couple of straight-forward lines.) I also see no way his proposed improvement to algorithm A can make it any faster. Both were nice insights, but not practical.

@sicking commented May 18, 2018

What's the current status here? I'm quite interested to jump in and help.

Is it still up for discussion which algorithms to implement and what API to use?

@pitdicker

What's the current status here? I'm quite interested to jump in and help.

Great! There is not much more current status than what you see here. Mostly just my explorations from two months ago. Now that the 0.5 release is around the corner, we can start looking again at new features.

Is it still up for discussion which algorithms to implement and what API to use?

Yes, completely. What do you think of my proposal for an API in #82 (comment)? My effort at implementing it (still in rough shape) lives at https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs.

Weighted sampling is still a bit of an open problem. I was thinking that an API taking two iterators, one with the data and one with the weights, would be most flexible, but don't have anything concrete yet.

Feel free to come up with any different ideas, or take code, as you want.

@sicking commented May 20, 2018

Cool!

I think that API for weighted sampling makes a lot of sense. I imagine something like (with one of these arguments passed as self):

    fn pick_weighted<IterItems, IterWeights, R>(items: IterItems,
                                                weights: IterWeights,
                                                rng: &mut R)
        -> IterItems::Item where IterItems: Iterator,
                                 IterWeights: Iterator+Clone,
                                 IterWeights::Item: SampleUniform+From<u8>+Sum<IterWeights::Item>,
                                 R: Rng {
        let total_weight: IterWeights::Item = weights.clone().sum();
        pick_weighted_with_total(items, weights, total_weight, rng)
    }

    fn pick_weighted_with_total<IterItems, IterWeights, R>(items: IterItems,
                                                           weights: IterWeights,
                                                           total_weight: IterWeights::Item,
                                                           rng: &mut R)
        -> IterItems::Item where IterItems: Iterator,
                                 IterWeights: Iterator,
                                 IterWeights::Item: SampleUniform+From<u8>,
                                 R: Rng {
        /* ... */
    }

I'm happy to provide a patch for this.

On a separate, and more general, topic regarding API design, I'm trying to understand why the proposals tend to stick these functions on [T] and Iterator rather than on Rng?

I.e. why vec.shuffle(rng) rather than rng.shuffle(vec)?

I certainly agree that sticking functions on existing types generally results in nicer APIs than creating global functions, i.e. vec.shuffle(rng) would be a lot nicer than shuffle(vec, rng). And in instances when there's no other object to tie an operation to then it certainly seems like the right thing to do, i.e. vec.shuffle() is a lot nicer than shuffle(vec).

However between vec.shuffle(rng) and rng.shuffle(vec) I see a few advantages with the latter.

  • Reading documentation for the Rng trait would show all operations that objects implementing Rng can be used for.
  • If you see code like vec.choose_multiple(rng, 10) on github or elsewhere, it's less clear where to look for documentation for the choose_multiple function.
  • For functions like choose_weighted, which operates on two iterators, or choose_multiple_from, which operates on an iterator and a slice, it feels quite arbitrary which of the objects should be passed as self and which should be passed as a normal argument. Using the Rng as self can be done consistently.

But I admit I'm also strongly biased by coming from C++ which for obvious reasons doesn't have a tradition of libraries extending existing types.

And there are a couple of good arguments for the vec.shuffle(rng) style that I can think of:

  • For functions that can operate either on an iterator or a slice it means that we don't have to disambiguate using function names. I.e. we can have iter.pick(rng) and vec.pick(rng), rather than have to use rng.pick(vec) and rng.pick_from_iter(iter).
  • For functions that can be expected to be put in long call chains, it can result in cleaner code: get_a_vector(foo).pick(rng).do_some_stuff() is cleaner than rng.pick(get_a_vector(foo)).do_some_stuff().

Are there any general rust guidelines around this? Should we leave particular topic to a separate issue?

@TheIronBorn

There's this: https://rust-lang-nursery.github.io/api-guidelines/. I don't know how much of it is applicable though.

@dhardy (Owner, Author) commented May 26, 2018

I think I've roughly caught up on this, except for weighted sampling and not looking into all the algorithms...

I quite like @pitdicker's proposed API, except that I would move SliceRandom::choose_multiple_from to IteratorRandom::choose_multiple_fill.

One thing that stands out is that choose and choose_multiple are basically just wrappers around gen_range and sample_indices, so it doesn't matter much if they don't cover all uses perfectly (though it may still be worth having _mut versions). I think on iterators, choose_multiple can just be a wrapper around choose_multiple_fill.

Another point is that with iterators, if the exact length is known, much more efficient implementations are possible. But I guess this situation is unlikely, so users can make their own wrappers around gen_range / sample_indices in this case.

Is it actually possible to implement partial_shuffle significantly more efficiently than with choose_multiple and shuffle? I saw the neat variant of Floyd's algorithm, but it requires insertions at arbitrary positions which is generally not efficient.

SliceRandom::choose_multiple_iter sounds like an interesting extra, though if performance is good enough we could just use this instead of returning a Vec for choose_multiple. Note that the sample_indices approach should also map well to an iterator over output values.

A lot of this requires a good sample_indices function, and I see there are a lot of possible algorithms we could use (much more than rust-lang-nursery#479). It may or may not be worth generalising this (provided it doesn't compromise performance): allow lower-bound other than 0, and be generic over types (probably T: From<u8>).

choose_one_in_n: as I understand it, this is simply iter.filter(|_| rng.gen_bool(p)), hence does not appear an important addition.


I.e. why vec.shuffle(rng) rather than rng.shuffle(vec)?

Also something I was thinking about. Your rationales are important; a couple more:

  • In some ways this results in a cleaner API, though less concise.
  • Method calls typically follow the data; for things like Rng::gen the data source is the RNG, but for choose and shuffle the source is the slice/iterator.

I think at least methods consuming iterators should put the methods on the iterator, i.e. iter.choose_multiple(&mut rng, n).

For slices I'm less sure. (We can even have both as with Distribution::sample and Rng::sample, though probably this isn't desirable). Of course having iter.choose(&mut rng) but rng.choose(slice) is a bit confusing for users, so consistency is probably the best option. This leaves the question of whether we should deprecate Rng::choose and Rng::shuffle.

@sicking commented Jun 4, 2018

Method calls typically follow the data...

Having noodled on this for a few days, I can certainly see logic in having Rng only have functions which generate values, and putting functions which operate on values in slices and iterators in extension traits for those types.

However that does leave APIs like choose_weighted which take two iterators. Randomly picking one of those iterators as self doesn't feel very nice. If choose_weighted (and choose_weighted_with_total) is the only instance of APIs that operate on multiple data sources, then ultimately it's probably ok that this one is a bit ugly. But of course it's nice if we make it as good as it can be.

But if we go with this approach I think it's important that we only stick functions like shuffle and choose on SliceRandom. If some functions which operate on slices live on Rng and some don't, then it's very easy to miss the ones that don't.

And I do still have concerns around discoverability and documentation...

If we go with this approach, I agree with @dhardy that we should do IteratorRandom::choose_multiple_fill rather than SliceRandom::choose_multiple_from

@sicking commented Jun 6, 2018

FWIW, I'd really like to come to some sort of conclusion here. I'd enjoy writing up more patches for sequence-related functions, but don't want to waste my time on PRs that aren't going to go anywhere due to using the wrong API.

@dhardy (Owner, Author) commented Jun 7, 2018

rust-random#483 was my attempt at drawing some kind of conclusion from this. It's more complicated than it should be (I need to clean up the commits), but essentially just waiting for feedback on the basic design stuff.

@sicking commented Jul 5, 2018

@pitdicker: Is the code for sampling multiple values at https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs working? I.e. are RandomSampler and SequentialRandomSampler working? I wanted to benchmark it against other algorithms for multi-sampling.

@pitdicker

Is the code for sampling multiple values at pitdicker/rand:src/seq.rs@slice_random working?

I have tested it, but only lightly. But I have re-checked multiple times that it is implemented exactly as in the paper.

b.t.w. @dhardy @sicking I have enjoyed working with you on rand, but it is a bit more difficult to find the time at the moment (you probably have noticed). Feel free to cc me with @pitdicker on anything, and hopefully I can do more after the summer.

@dhardy (Owner, Author) commented Jul 5, 2018

No problem @pitdicker; thank you for all the work!

@sicking commented Jul 5, 2018

Thanks for the work @pitdicker! I've enjoyed working with you a lot.

I'm actually also likely to have less time. I'm starting a new job next week so will have significantly less free time on my hands. I still have a to-do list of items that I want to get done, and hope to be able to get through it before the 1.0 release. So for now hopefully I'll still be able to stay involved.

@dhardy (Owner, Author) commented Aug 23, 2018

I believe we've finally wrapped most of this up.

dhardy closed this as completed Aug 23, 2018