Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sample with random ordering #169

Closed
TheIronBorn opened this issue Aug 24, 2017 · 17 comments
Closed

sample with random ordering #169

TheIronBorn opened this issue Aug 24, 2017 · 17 comments

Comments

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Aug 24, 2017

I was using rand::sample to shuffle an arbitrary number of elements from a list, until I noticed that the ordering was stable, and so no shuffling was happening, only random selection.

Is there reason to add such functionality to the crate? Or is the expected way of doing this shuffling, then slicing? Or sampling, then shuffling?


rand: 0.3.16
rustc: 1.21.0-nightly
cargo: 0.22.0-nightly

@TheIronBorn
Copy link
Collaborator Author

TheIronBorn commented Aug 24, 2017

Merely exiting the Fisher–Yates shuffle early will suffice for this

@TheIronBorn TheIronBorn changed the title sample without stable ordering sample with random ordering Aug 24, 2017
@TheIronBorn
Copy link
Collaborator Author

TheIronBorn commented Aug 31, 2017

I threw this together:

/// Randomly sample up to `amount` elements from a vector.
/// The order of elements in the sample is random.
///
/// This applies Durstenfeld's algorithm for the [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm)
/// which produces an unbiased permutation.
///
/// # Example
///
/// ```rust
/// use rand::{thread_rng, Rng};
///
/// let mut rng = thread_rng();
/// let mut y = [1, 2, 3];
/// rng.shuffle_unstable(&mut y);
/// println!("{:?}", y);
/// rng.shuffle_unstable(&mut y);
/// println!("{:?}", y);
/// ```
fn sample_unstable<T: Clone>(&mut self, mut values: Vec<T>, amount: usize) -> Vec<T>
    where Self: Sized
{
    let len = values.len();
    let mut i = len;
    while i > len-amount {
        // invariant: elements with index > i have been locked in place.
        i -= 1;
        // lock element i in place.
        values.swap(i, self.gen_range(0, i + 1));
    }
    values.split_off(i)
}

It should run in O(amount).

Note that when amount > n/2 it is faster to choose n - amount items to remove. This will keep most of the original ordering though, falling short of the reason I opened this issue.

@clarfonthey
Copy link
Contributor

@TheIronBorn why not use Vec::split_off and an &mut Vec<T> as an argument?

@TheIronBorn
Copy link
Collaborator Author

I'm not sure this should mutate the input list, as sample doesn't. Though this does have more in common with shuffle which does mutate its list.
I wasn't aware of split_off, I'll change the above.

Thanks

@clarfonthey
Copy link
Contributor

IMO it's better to mutate the list and remove the elements than it is to just drop the entire list and only return the samples

@TheIronBorn
Copy link
Collaborator Author

Something more like this?

// returns the sampled items, leaving the others in `values`
fn sample_unstable<T: Clone>(&mut self, values: &mut Vec<T>, amount: usize) -> Vec<T>
    where Self: Sized
{
    let len = values.len();
    let mut i = len;
    while i > len-amount {
        // invariant: elements with index > i have been locked in place.
        i -= 1;
        // lock element i in place.
        values.swap(i, self.gen_range(0, i + 1));
    }
    values.split_off(i)
}

@TheIronBorn
Copy link
Collaborator Author

TheIronBorn commented Aug 31, 2017

And the k > n/2 could look like this:

fn sample_unstable<T>(&mut self, values: &mut [T], amount: usize) -> &[T]
    where Self: Sized
{
    let len = values.len();
    let mut i = len;

    if amount > len/2 {
        while i > amount {
            // invariant: elements with index > i have been locked in place.
            i -= 1;
            // lock element i in place.
            values.swap(i, self.gen_range(0, i + 1));
        }
        values.split_at(i).0
    } else {
        while i > len-amount {
            // invariant: elements with index > i have been locked in place.
            i -= 1;
            // lock element i in place.
            values.swap(i, self.gen_range(0, i + 1));
        }
        values.split_at(i).1
    }
}

It reduces the worst case time complexity to O(n/2).

(also this uses slices rather than vectors, unsure which would be preferred)

@TheIronBorn
Copy link
Collaborator Author

TheIronBorn commented Aug 31, 2017

I don't know how the k > n/2 should be implemented. It would break the "perfectly shuffled ordering" promise and yet have a significantly improved worst case performance. Even without it sample_unstable would be better than sample's O(n) as far as I can tell

@burdges
Copy link
Contributor

burdges commented Aug 31, 2017

I just used a Range in #144 which I should probably fix to being usize only.

@TheIronBorn
Copy link
Collaborator Author

Perhaps it would be sufficient to merely add to the documentation, pointing users to further reading.

@burdges
Copy link
Contributor

burdges commented Aug 31, 2017

I missed that you're just exiting the Fisher–Yates shuffle early when I posted that.

I'd suggest first using split_at_mut and making the return type (&mut [T], &mut [T]), and second returning both parts but explain how to use them, so roughly like

/// Split a mutable slice into a random sample of `amount` elements
///
/// This applies Durstenfeld's algorithm for the [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm) for an unbiased permutation, but exits
/// early after choosing `amount` elements.  The first slice returned has `amount` elements
/// randomly permuted, while the second has `values.len()-amount` elements only randomly
/// chosen.  If you only need to chose elements randomly and `amount > values.len()/2` then
/// you may improve performance by taking `amount = values.len() - amount` and using
/// only the second slice.
///
/// # Examples
/// ...
fn sample_mut<T>(&mut self, values: &mut [T], amount: usize) -> (&mut [T], &mut [T])
    where Self: Sized
{
    let len = values.len();
    let mut i = len;

    while i > len-amount {
        // invariant: elements with index > i have been locked in place.
        i -= 1;
        // lock element i in place.
        values.swap(i, self.gen_range(0, i + 1));
    }
    let r = values.split_at_mut(i);
    (r.1,r.0)
}

@TheIronBorn
Copy link
Collaborator Author

Oh yeah that's far better

@TheIronBorn
Copy link
Collaborator Author

TheIronBorn commented Aug 31, 2017

Still unsure whether this should become a pull request. @burdges's code alleviates the "perfectly random" concern, but this is perhaps too close to shuffle to be named after sample. Does anyone have guidance for how best to let users achieve this functionality?

@burdges
Copy link
Contributor

burdges commented Sep 1, 2017

I named it sample_mut but names like split_sample_mut or partial_shuffle or split_shuffle work too. I like the simplicity here because Rng will get simpler in the redesign, so maybe this can stay and shuffle can be defined using it :

fn shuffle<T>(&mut self, values: &mut [T]) where Self: Sized {
    self.sample_mut(values,values.len()) 
}

@vitiral
Copy link
Contributor

vitiral commented Jan 6, 2018

The new rand::seq::sample_slice* methods always result in a properly shuffled output (they simply exist the Fisher-Yates early, as you suggested).

@pitdicker
Copy link
Contributor

Does that mean this issue is solved?

@TheIronBorn
Copy link
Collaborator Author

TheIronBorn commented Jan 6, 2018

Yeah, that looks great!

I would note that seq::sample* doesn't have the "choose n > len / 2 in any order" optimization this has. Though that's really another issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants