Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample #195

Merged
merged 9 commits into from
Dec 5, 2017
Merged

Sample #195

merged 9 commits into from
Dec 5, 2017

Conversation

vitiral
Copy link
Contributor

@vitiral vitiral commented Nov 24, 2017

This implements #195, in particular it:

  • depricates rand::sample and bumps cargo version.
  • adds rand::seq::sample_iter function which is very similar to rand::sample but returns a Result.
  • adds the seq::sample_slice and seq::sample_slice_ref functions which sample from a slice in O(amount) time and memory.
  • adds the seq::sample_indices which can sample a range of indicies in O(amount) time and memory (helper function if this is desired).

Second Try:

- renames `sample` -> `sample_reservoir`. `sample_reservoir` is then imported `as sample` to avoid breaking backwards compatibility (for now) with a TODO for removing it. - adds the `Sample` trait which adds a `sample(rng, amout) -> Container` method. - adds the `SampleRef` trait which adds a `sample_ref(rng, amount) -> Container<&T>` method

Comments and criticism welcome!

Original Ticket:

This is an an initial PR for sanity and feedback for #194. This PR contains **breaking changes** so should not be merged into master.

There is still a lot to do here, but I'm hitting a critical issue where the XorShiftRng doesn't seem to be repeatable with the same seed. When running the new test I'm getting log output like:

cache length=4                                  
cache j=3                        
inplace length=4                     
inplace j=0 

So j is different values for the same loop number. Does anyone know a repeatable (with the same seed) random number generator I can use?

@dhardy
Copy link
Member

dhardy commented Nov 24, 2017

All included PRNGs (including XorShiftRng) should be repeatable, if you create them using SeedableRng::from_seed with fixed seed (i.e. not created by thread_rng as I see in your test).

@vitiral
Copy link
Contributor Author

vitiral commented Nov 24, 2017

@dhardy I'm really not seeing that. I simplified the test by commenting stuff out, and this is the stdout:

# cargo test sample::test::test_sample_equivalent
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running target/debug/deps/rand-6f255ac3db0f7285

running 1 test
test sample::test::test_sample_equivalent ... FAILED

failures:

---- sample::test::test_sample_equivalent stdout ----
        inplace length=3
inplace j=0
cache length=3
cache j=1
thread 'sample::test::test_sample_equivalent' panicked at 'assertion failed: `(left == right)`
  left: `[0]`,
 right: `[1]`', src/sample.rs:248:12

j is different for cache and inplace, but they are doing the exact same thing in terms of how they call gen_range... I'm pretty confused at this point ☹️

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know, but I'd think it's a bug in your code, not XorShiftRng.

src/lib.rs Outdated
@@ -260,6 +260,7 @@ pub use os::OsRng;

pub use isaac::{IsaacRng, Isaac64Rng};
pub use chacha::ChaChaRng;
pub use sample::{sample, sample_reservoir};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should remove this pub use? Especially because I want to introduce a Sample trait.

src/sample.rs Outdated
&mut XorShiftRng::from_seed(seed), vals.len(), amount);
let cache = sample_range_cache(
&mut XorShiftRng::from_seed(seed), 0..vals.len(), amount);
assert_eq!(inplace, cache);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do your methods guarantee the same ordering?

@vitiral
Copy link
Contributor Author

vitiral commented Nov 24, 2017

yup, I had a logic error. Getting pretty close.

@vitiral vitiral force-pushed the sample branch 2 times, most recently from c75f356 to 566fbaf Compare November 25, 2017 14:01
@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

Okay, I think I'm basically done. Comments welcome!

src/sample.rs Outdated
let vec: Vec<usize> = (0..length).collect();
assert_eq!(
vec.sample(&mut xor_rng(seed), amount),
regular,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing comma may be an issue in some builds? At least, Travis complains on some builds (also line 325)

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

@dhardy sorry, I was running on nightly before. Ammended a fix!

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not so happy as is; the usage of traits in the API seems unnecessary and not fully taken advantage of. Better to simplify?

src/lib.rs Outdated
sample_reservoir as sample,
sample_reservoir,
Sample,
SampleRef};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on re-exporting these names. Also, Sample conflicts with one of my (to be) proposed changes!

Note: the code is all about sampling from sequences, either Vec or slices, or iterators over a more general container, so usage of a seq module doesn't seem inappropriate to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could rename them SampleContainer and SampleRefContainer. See my comments below for whether we should even keep the traits or go with a function as you suggested elsewhere.

src/sample.rs Outdated
@@ -0,0 +1,330 @@
// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

year

src/lib.rs Outdated
@@ -260,6 +260,12 @@ pub use os::OsRng;

pub use isaac::{IsaacRng, Isaac64Rng};
pub use chacha::ChaChaRng;
pub use sample::{
// TODO: `sample` name will be deprecated in 1.0, use `sample_reservoir` instead
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try #[depercated(since="VER", note="renamed to sample_reservoir"). I don't think there's any point delaying the deprecation. RFC: https://github.com/rust-lang/rfcs/blob/master/text/1270-deprecation.md

src/sample.rs Outdated
where R: Rng,
{
if amount > length {
panic!("`amount` must be less than or equal to `slice.len()`");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no slice param so update msg

src/sample.rs Outdated
///
/// TODO: IMO this should be made public since it can be generally useful, although
/// there might be a way to make the output type more generic/compact.
fn sample_indices<R>(rng: &mut R, length: usize, amount: usize) -> Vec<usize>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So make it public? Prototype seems okay. Maybe it should say: samples amount numbers from the range 0..length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can make it more general, sample_range to sample over any numeric range, and even make the input types generic. I'd like to do that before exporting it.

I'll open a ticket once this gets merged. It's out of scope and just a "nice to have".

src/sample.rs Outdated
///
/// This is intended to be implemented for containers that:
/// - Can be sampled in `O(amount)` time.
/// - Whos items can be `cloned`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/sample.rs Outdated
/// let sample = sample_reservoir(&mut rng, 1..100, 5);
/// println!("{:?}", sample);
/// ```
pub fn sample_reservoir<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> Vec<T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a special case of your traits now, i.e. impl<I: Iterator> Sample for I?

Copy link
Contributor Author

@vitiral vitiral Nov 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as said above, sample_reservoir is far slower.

We could use sample_reservoir for iterators or "generic" types like HashMaps, but we will not be able to guarantee performance of O(m).

On that note, I've investigated sampiling from HashMaps in O(m) and I don't think it can be done. Even the iterator represenation is required to iterate over the full capacity in order to guarantee selection of all elements -- it has no way to select only the "filled buckets".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, anything done without creating an ordered view of items in the HashMap will not be reproducible, even with a reproducible generator. Theoretical and maybe not important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, I think special-casing HashMap is not really possible, at least without architecting a HashMap for just that purpose (in which case... provide your own sampling function for it since you are creating it for that purpose).

For standard hashmap's someone will have to use sample_reservoir and eat the O(N) computational cost.

src/sample.rs Outdated
type Sampled = Vec<T>;

fn sample<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> {
self.as_slice().sample(rng, amount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this just uses the slice representation directly? This makes me wonder whether the trait-based approach is worth it at all.

src/sample.rs Outdated
/// - Whos items can be `cloned`.
///
/// If cloning is impossible or expensive, use `sample_ref` instead.
pub trait Sample {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that traits can't be implemented outside of the module where (a) the trait is defined or (b) the target type is defined. This means no one else can implement Sample for std::collections types.

Likely though, this isn't necessary: users can instead create an iterator and call sample on that. Which begs the questions: (1) why use traits instead of plain functions and (2) why not let a more general function like sample_reservoir take an optional length (or a length and an iterator), and use the appropriate algorithm when length is known?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample_reservoir requires iterating through the entire sequence length N to sample m samples, so is O(N) computationally and O(m) in memory.

Sample::sample does not iterate through the sequence, so is O(m) for both memory and computation.

Copy link
Member

@dhardy dhardy Nov 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just knowing the length allows O(m) computation of indices... but getting those values from the iterator is still O(N). Ok.

The first question still stands though: why use a Sample trait instead of a plain function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, sample guarantees that the order of the returned elements are random, where as sample_resevoir does not.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

So, with HashMap probably being impossible to implement in O(m), I think it makes more sense to have sample and sample_ref functions that accept a slice as the set to sample over.

If you knew the number of nodes in each branch of a tree (at every branch) then you could implement a sampling mechanism in O(m + log N) since you could select a random number of elements to select in each branch based on how many elements there are. If that is possible with rust's BTree* types, the trait might still be worth it.

@dhardy
Copy link
Member

dhardy commented Nov 25, 2017

I suspect even an O(m) BTreeMap implementation would be impossible... it depends whether the BTreeMap insert/remove functions update counts at each node or not — no, it looks like there is a single length counter for BTreeMap, not one per node.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

It looks like rust doesn't store any information about the number of children nodes for each node, which is pretty standard for a BTree and I wouldn't expect them to.

With that knowledge I'd say that not having the traits and essentially redefining sample as what Sample::sample for &[T] is in this PR is probably the way to go (as well as sample_ref).

This will of course be an actually breaking change, since the type signature of sample would change, so it would have to wait until 1.0 (edit: or some other "major" release)

@dhardy
Copy link
Member

dhardy commented Nov 25, 2017

You could also make it a non-breaking change with a different name (e.g. sample_slice or seq::sample).

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

that's true. We could put it in there and just mark sample as deprecated (for now) and eventually replace it with seq::sample.

I think that's the route I'd like to go down. However module seq doesn't exist yet, should I make it?

@dhardy
Copy link
Member

dhardy commented Nov 25, 2017

Give me your opinion: does a seq or sequences module with your sample* functions and a couple of other things as here make sense? If so, yes. IMO WeightedChoice doesn't really fit in distributions.

It's also possible to #[deprecated(..)] pub use seq::sample; but not publically import other names from seq, which I have a slight preference for, but I'll leave you to decide that.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

I would say that makes sense to me, but what else is distributions supposed to do? Could there be a use of WeightedChoice outside of seq? (Edit: if so I would say that keeping it in distributions makes sense)

Hmm... I think for now we should deprecate rand::sample and not export anything from rand::seq like you suggest, but in a major release we will export things and swap rand::sample with seq::sample as well as export sample_ref and sample_reservoir (all at the same time).

I'm debating whether to call it seq::sample or seq::sample_slice. I feel pretty convinced seq::sample (eventually rand:sample as well) is the right name since it is basically the fastest possible way to sample (the algorithm could probably be improved obviously, but it will never be faster in the Big-O sense).

@dhardy
Copy link
Member

dhardy commented Nov 25, 2017

distributions deals with a lot already: conversion to other types, uniform sampling from ranges, sampling from 0..1 interval for floats, sampling from Normal, LogNormal, Exponential distributions, and sampling from a subset of ASCII/unicode characters (this one uses sequences::Choose).

"Sampling" is a fairly generic term, e.g. sampling from a normal distribution. seq::sample is fine IMO because the function is about sampling from sequences, also sample_slice, whereas rand::sample conflicts with Distribution::sample, the Sample trait and Sample::sample.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017 via email

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

@dhardy okay, I think I've addressed all your concerns.

I also:

  • used seq::sample_slice in the main example, instead of sample_reservoir.
  • added new boundary-case tests and fixed some minor bugs.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

also note that I have minor changes to sample_reservoir

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

also, I made sample_indices public.

Should sample_reservoir be named sample_iter instead? Name it what it does rather than how it does it right?

@vitiral
Copy link
Contributor Author

vitiral commented Nov 25, 2017

I'm going to do that, since I think it's the right thing. Let me know if you want me to reverse it.

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this revision myself.

src/lib.rs Outdated
@@ -260,6 +260,8 @@ pub use os::OsRng;

pub use isaac::{IsaacRng, Isaac64Rng};
pub use chacha::ChaChaRng;
#[deprecated(since="0.3.18", note="renamed to seq::sample_iter")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the current version; I guess this should be 0.3.19 with a corresponding version bump in Cargo.toml.

src/seq.rs Outdated
} else {
// Don't hang onto extra memory. There is a corner case where
// `amount <<< len(iterable)` that we want to avoid.
reservoir.shrink_to_fit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're not going with your own advice in #194 to make this fallible? I'm not sure personally; this is an iterator so shrinking to available data is common behaviour.

///
/// The cache avoids allocating the entire `length` of values. This is especially useful when
/// `amount <<< length`, i.e. select 3 non-repeating from 1_000_000
fn sample_indices_cache<R>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative approach would be to use a fixed Range::new(0, length) (slightly faster than using a different range each time), ignore any samples already in your HashSet, and stop when the set.len() == amount. Not sure which would be faster. Obviously if amount is close to length there will be many clashes and resamples, but you don't use this in that case anyway.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 26, 2017 via email

@vitiral
Copy link
Contributor Author

vitiral commented Nov 26, 2017 via email

@dhardy
Copy link
Member

dhardy commented Nov 26, 2017

Er... the method I mentioned saves calling Range::new each iteration, which involves a little work (BTW this is quite different in current rand and in my branch, but each version involves some extra work). If length >= amount * 2 then the expected number of samples is no more than amount * 2, but admittedly worst-case behaviour is far worse (unbounded in fact). HashSet look-ups are (expected) constant time, so it's an O(amount) algorithm, same as yours. Benchmarking would be needed to see which is faster in practice (and it depends heavily on length/amount), but due to worst-case behaviour I agree, better not to use the algorithm I suggested.

Yes, better not to panic. I'm undecided whether it should silently reduce length to match the input or not.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 26, 2017

Hmm, I would not have expected that. If that is the case I will change it to use Rng::next_f64 instead:

let j = (rng.next_f64 * (length - i)) as usize + i

Then we don't have to pay that penalty. I'm surprised it does not already desurgar to this, will your branch be changing this?

@vitiral
Copy link
Contributor Author

vitiral commented Nov 26, 2017

I changed it to use let j = (rng.next_f64 * (length - i)) as usize + i instead. I was surprised to see that if I changed inplace to use the faster method but not cache that they got different results. I would think the underlying method would be the same though. I will add a separate bug as that is not expected behavior IMO.

@dhardy
Copy link
Member

dhardy commented Nov 26, 2017

Don't use floating point sampling; you will not get an even distribution. Just use the range code you have before, it's fine.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 26, 2017

sounds good, I reset it to the previous commit.

Also, it sounds like you have some performance improvements for gen_range comming down the pipe, so it shouldn't be too much of an issue in the near future.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 27, 2017

I think I'm going to change it back to sample_reservoir and do a Result. Does that make sense to you?

@dhardy
Copy link
Member

dhardy commented Nov 27, 2017

I don't know the answer to that. Both options sound ok.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 27, 2017

I kept it sample_iter but used a Result. I think having a different type signture will help prevent people from using sample_iter like sample_slice because they are really not the same!

@vitiral
Copy link
Contributor Author

vitiral commented Nov 27, 2017

I hope this is finally done! 😄 Thanks for the review!

src/seq.rs Outdated
/// The following can be returned:
/// - `Ok`: `Vec` of `amount` non-repeating randomly sampled elements. The order is not random.
/// - `Err`: `Vec` of *less than* `amount` elements in sequential order. This is considered an
/// error since exactly `amount` elements is typically expected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Err case I'd prefer more precise documentation: in case less than amount elements are available, a Vec of exactly the elements in iterable is returned. The order is not random.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the order is actually guaranteed to be sequential. I've updated the doc, let me know if that is better.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 27, 2017

weird, the rust-stable travis doesn't seem to be running

src/seq.rs Outdated
/// This method is used internally by the slice sampling methods, but it can sometimes be useful to
/// have the indices themselves so this is provided as an alternative.
///
/// Panics if `amount > self.len()`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not self.len(). Also slice.len() in the panic text is technically wrong, but may still be the best description.

@vitiral
Copy link
Contributor Author

vitiral commented Dec 3, 2017

@dhardy should be fixed now. What do you mean slice.len() isn't valid?

@dhardy dhardy mentioned this pull request Dec 4, 2017
@dhardy dhardy merged commit 10218d4 into rust-random:master Dec 5, 2017
pitdicker pushed a commit to pitdicker/rand that referenced this pull request Apr 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants