Sample #195

vitiral · 2017-11-24T16:07:46Z

This implements #195, in particular it:

depricates rand::sample and bumps cargo version.
adds rand::seq::sample_iter function which is very similar to rand::sample but returns a Result.
adds the seq::sample_slice and seq::sample_slice_ref functions which sample from a slice in O(amount) time and memory.
adds the seq::sample_indices which can sample a range of indicies in O(amount) time and memory (helper function if this is desired).

Second Try:

- renames `sample` -> `sample_reservoir`. `sample_reservoir` is then imported `as sample` to avoid breaking backwards compatibility (for now) with a TODO for removing it. - adds the `Sample` trait which adds a `sample(rng, amout) -> Container` method. - adds the `SampleRef` trait which adds a `sample_ref(rng, amount) -> Container<&T>` method

Comments and criticism welcome!

Original Ticket:

This is an an initial PR for sanity and feedback for #194. This PR contains **breaking changes** so should not be merged into master.

There is still a lot to do here, but I'm hitting a critical issue where the XorShiftRng doesn't seem to be repeatable with the same seed. When running the new test I'm getting log output like:

cache length=4                                  
cache j=3                        
inplace length=4                     
inplace j=0

So j is different values for the same loop number. Does anyone know a repeatable (with the same seed) random number generator I can use?

dhardy · 2017-11-24T16:54:47Z

All included PRNGs (including XorShiftRng) should be repeatable, if you create them using SeedableRng::from_seed with fixed seed (i.e. not created by thread_rng as I see in your test).

vitiral · 2017-11-24T17:09:00Z

@dhardy I'm really not seeing that. I simplified the test by commenting stuff out, and this is the stdout:

# cargo test sample::test::test_sample_equivalent
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running target/debug/deps/rand-6f255ac3db0f7285

running 1 test
test sample::test::test_sample_equivalent ... FAILED

failures:

---- sample::test::test_sample_equivalent stdout ----
        inplace length=3
inplace j=0
cache length=3
cache j=1
thread 'sample::test::test_sample_equivalent' panicked at 'assertion failed: `(left == right)`
  left: `[0]`,
 right: `[1]`', src/sample.rs:248:12

j is different for cache and inplace, but they are doing the exact same thing in terms of how they call gen_range... I'm pretty confused at this point ☹️

dhardy

Don't know, but I'd think it's a bug in your code, not XorShiftRng.

dhardy · 2017-11-24T19:17:17Z

src/lib.rs

@@ -260,6 +260,7 @@ pub use os::OsRng;

 pub use isaac::{IsaacRng, Isaac64Rng};
 pub use chacha::ChaChaRng;
+pub use sample::{sample, sample_reservoir};


I wonder if we should remove this pub use? Especially because I want to introduce a Sample trait.

dhardy · 2017-11-24T19:28:11Z

src/sample.rs

+                &mut XorShiftRng::from_seed(seed), vals.len(), amount);
+            let cache = sample_range_cache(
+                &mut XorShiftRng::from_seed(seed), 0..vals.len(), amount);
+            assert_eq!(inplace, cache);


Do your methods guarantee the same ordering?

vitiral · 2017-11-24T23:06:52Z

yup, I had a logic error. Getting pretty close.

vitiral · 2017-11-25T14:05:32Z

Okay, I think I'm basically done. Comments welcome!

dhardy · 2017-11-25T14:09:01Z

src/sample.rs

+            let vec: Vec<usize> = (0..length).collect();
+            assert_eq!(
+                vec.sample(&mut xor_rng(seed), amount),
+                regular,


Trailing comma may be an issue in some builds? At least, Travis complains on some builds (also line 325)

vitiral · 2017-11-25T14:11:58Z

@dhardy sorry, I was running on nightly before. Ammended a fix!

…eservoir

dhardy

Sorry, not so happy as is; the usage of traits in the API seems unnecessary and not fully taken advantage of. Better to simplify?

dhardy · 2017-11-25T14:21:10Z

src/lib.rs

+    sample_reservoir as sample,
+    sample_reservoir,
+    Sample,
+    SampleRef};


I'm not keen on re-exporting these names. Also, Sample conflicts with one of my (to be) proposed changes!

Note: the code is all about sampling from sequences, either Vec or slices, or iterators over a more general container, so usage of a seq module doesn't seem inappropriate to me.

we could rename them SampleContainer and SampleRefContainer. See my comments below for whether we should even keep the traits or go with a function as you suggested elsewhere.

dhardy · 2017-11-25T14:22:27Z

src/sample.rs

@@ -0,0 +1,330 @@
+// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT


dhardy · 2017-11-25T14:29:01Z

src/lib.rs

@@ -260,6 +260,12 @@ pub use os::OsRng;

 pub use isaac::{IsaacRng, Isaac64Rng};
 pub use chacha::ChaChaRng;
+pub use sample::{
+    // TODO: `sample` name will be deprecated in 1.0, use `sample_reservoir` instead


Try #[depercated(since="VER", note="renamed to sample_reservoir"). I don't think there's any point delaying the deprecation. RFC: https://github.com/rust-lang/rfcs/blob/master/text/1270-deprecation.md

dhardy · 2017-11-25T14:31:03Z

src/sample.rs

+    where R: Rng,
+{
+    if amount > length {
+        panic!("`amount` must be less than or equal to `slice.len()`");


There's no slice param so update msg

dhardy · 2017-11-25T14:32:04Z

src/sample.rs

+///
+/// TODO: IMO this should be made public since it can be generally useful, although
+/// there might be a way to make the output type more generic/compact.
+fn sample_indices<R>(rng: &mut R, length: usize, amount: usize) -> Vec<usize>


So make it public? Prototype seems okay. Maybe it should say: samples amount numbers from the range 0..length.

I think I can make it more general, sample_range to sample over any numeric range, and even make the input types generic. I'd like to do that before exporting it.

I'll open a ticket once this gets merged. It's out of scope and just a "nice to have".

dhardy · 2017-11-25T14:33:17Z

src/sample.rs

+///
+/// This is intended to be implemented for containers that:
+/// - Can be sampled in `O(amount)` time.
+/// - Whos items can be `cloned`.


Whose
https://en.wiktionary.org/wiki/whose

dhardy · 2017-11-25T14:35:11Z

src/sample.rs

+/// let sample = sample_reservoir(&mut rng, 1..100, 5);
+/// println!("{:?}", sample);
+/// ```
+pub fn sample_reservoir<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> Vec<T>


Shouldn't this be a special case of your traits now, i.e. impl<I: Iterator> Sample for I?

as said above, sample_reservoir is far slower.

We could use sample_reservoir for iterators or "generic" types like HashMaps, but we will not be able to guarantee performance of O(m).

On that note, I've investigated sampiling from HashMaps in O(m) and I don't think it can be done. Even the iterator represenation is required to iterate over the full capacity in order to guarantee selection of all elements -- it has no way to select only the "filled buckets".

Also, anything done without creating an ordered view of items in the HashMap will not be reproducible, even with a reproducible generator. Theoretical and maybe not important.

Ya, I think special-casing HashMap is not really possible, at least without architecting a HashMap for just that purpose (in which case... provide your own sampling function for it since you are creating it for that purpose).

For standard hashmap's someone will have to use sample_reservoir and eat the O(N) computational cost.

dhardy · 2017-11-25T14:38:03Z

src/sample.rs

+    type Sampled = Vec<T>;
+
+    fn sample<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> {
+        self.as_slice().sample(rng, amount)


So this just uses the slice representation directly? This makes me wonder whether the trait-based approach is worth it at all.

dhardy · 2017-11-25T14:41:21Z

src/sample.rs

+/// - Whos items can be `cloned`.
+///
+/// If cloning is impossible or expensive, use `sample_ref` instead.
+pub trait Sample {


Note that traits can't be implemented outside of the module where (a) the trait is defined or (b) the target type is defined. This means no one else can implement Sample for std::collections types.

Likely though, this isn't necessary: users can instead create an iterator and call sample on that. Which begs the questions: (1) why use traits instead of plain functions and (2) why not let a more general function like sample_reservoir take an optional length (or a length and an iterator), and use the appropriate algorithm when length is known?

sample_reservoir requires iterating through the entire sequence length N to sample m samples, so is O(N) computationally and O(m) in memory.

Sample::sample does not iterate through the sequence, so is O(m) for both memory and computation.

Just knowing the length allows O(m) computation of indices... but getting those values from the iterator is still O(N). Ok.

The first question still stands though: why use a Sample trait instead of a plain function?

Also, sample guarantees that the order of the returned elements are random, where as sample_resevoir does not.

vitiral · 2017-11-25T15:28:51Z

So, with HashMap probably being impossible to implement in O(m), I think it makes more sense to have sample and sample_ref functions that accept a slice as the set to sample over.

If you knew the number of nodes in each branch of a tree (at every branch) then you could implement a sampling mechanism in O(m + log N) since you could select a random number of elements to select in each branch based on how many elements there are. If that is possible with rust's BTree* types, the trait might still be worth it.

dhardy · 2017-11-25T15:37:25Z

I suspect even an O(m) BTreeMap implementation would be impossible... it depends whether the BTreeMap insert/remove functions update counts at each node or not — no, it looks like there is a single length counter for BTreeMap, not one per node.

vitiral · 2017-11-25T15:37:53Z

It looks like rust doesn't store any information about the number of children nodes for each node, which is pretty standard for a BTree and I wouldn't expect them to.

With that knowledge I'd say that not having the traits and essentially redefining sample as what Sample::sample for &[T] is in this PR is probably the way to go (as well as sample_ref).

This will of course be an actually breaking change, since the type signature of sample would change, so it would have to wait until 1.0 (edit: or some other "major" release)

dhardy · 2017-11-25T15:43:34Z

You could also make it a non-breaking change with a different name (e.g. sample_slice or seq::sample).

vitiral · 2017-11-25T15:45:57Z

that's true. We could put it in there and just mark sample as deprecated (for now) and eventually replace it with seq::sample.

I think that's the route I'd like to go down. However module seq doesn't exist yet, should I make it?

dhardy · 2017-11-25T15:54:35Z

Give me your opinion: does a seq or sequences module with your sample* functions and a couple of other things as here make sense? If so, yes. IMO WeightedChoice doesn't really fit in distributions.

It's also possible to #[deprecated(..)] pub use seq::sample; but not publically import other names from seq, which I have a slight preference for, but I'll leave you to decide that.

vitiral · 2017-11-25T16:23:51Z

I would say that makes sense to me, but what else is distributions supposed to do? Could there be a use of WeightedChoice outside of seq? (Edit: if so I would say that keeping it in distributions makes sense)

Hmm... I think for now we should deprecate rand::sample and not export anything from rand::seq like you suggest, but in a major release we will export things and swap rand::sample with seq::sample as well as export sample_ref and sample_reservoir (all at the same time).

I'm debating whether to call it seq::sample or seq::sample_slice. I feel pretty convinced seq::sample (eventually rand:sample as well) is the right name since it is basically the fastest possible way to sample (the algorithm could probably be improved obviously, but it will never be faster in the Big-O sense).

dhardy · 2017-11-25T16:52:16Z

distributions deals with a lot already: conversion to other types, uniform sampling from ranges, sampling from 0..1 interval for floats, sampling from Normal, LogNormal, Exponential distributions, and sampling from a subset of ASCII/unicode characters (this one uses sequences::Choose).

"Sampling" is a fairly generic term, e.g. sampling from a normal distribution. seq::sample is fine IMO because the function is about sampling from sequences, also sample_slice, whereas rand::sample conflicts with Distribution::sample, the Sample trait and Sample::sample.

vitiral · 2017-11-25T16:54:35Z

I'll do sample_slice then. Hopefully will address all your comments this afternoon

vitiral · 2017-11-25T21:43:10Z

@dhardy okay, I think I've addressed all your concerns.

I also:

used seq::sample_slice in the main example, instead of sample_reservoir.
added new boundary-case tests and fixed some minor bugs.

vitiral · 2017-11-25T21:47:25Z

also note that I have minor changes to sample_reservoir

vitiral · 2017-11-25T22:15:12Z

also, I made sample_indices public.

Should sample_reservoir be named sample_iter instead? Name it what it does rather than how it does it right?

vitiral · 2017-11-25T22:18:04Z

I'm going to do that, since I think it's the right thing. Let me know if you want me to reverse it.

dhardy

I like this revision myself.

dhardy · 2017-11-26T08:34:25Z

src/lib.rs

@@ -260,6 +260,8 @@ pub use os::OsRng;

 pub use isaac::{IsaacRng, Isaac64Rng};
 pub use chacha::ChaChaRng;
+#[deprecated(since="0.3.18", note="renamed to seq::sample_iter")]


That's the current version; I guess this should be 0.3.19 with a corresponding version bump in Cargo.toml.

dhardy · 2017-11-26T08:38:33Z

src/seq.rs

+    } else {
+        // Don't hang onto extra memory. There is a corner case where
+        // `amount <<< len(iterable)` that we want to avoid.
+        reservoir.shrink_to_fit();


So you're not going with your own advice in #194 to make this fallible? I'm not sure personally; this is an iterator so shrinking to available data is common behaviour.

dhardy · 2017-11-26T08:53:49Z

src/seq.rs

+///
+/// The cache avoids allocating the entire `length` of values. This is especially useful when
+/// `amount <<< length`, i.e. select 3 non-repeating from 1_000_000
+fn sample_indices_cache<R>(


An alternative approach would be to use a fixed Range::new(0, length) (slightly faster than using a different range each time), ignore any samples already in your HashSet, and stop when the set.len() == amount. Not sure which would be faster. Obviously if amount is close to length there will be many clashes and resamples, but you don't use this in that case anyway.

vitiral · 2017-11-26T10:49:17Z

For the alternative implementation, this one is faster. I'm not actually sure if what you suggest saves you *anything* (maybe adding `i` every iteration?), but it looses you a LOT. It would literally be possible (although unlikely) to run for 1000x longer or more using your method, especially if amount was close to length / 2. The other thing this method gains is that it is identical to the inplace version. Literally they are the same method just optimized for different relations of amount to length.

vitiral · 2017-11-26T10:55:40Z

I'm also starting to think that maybe sample_reservoir was the right name... It's good to call out the method since as you mention it has different behavior. I think panicking is a bad idea. We could maybe return an Enum { Sampled(Vec), Collected(Vec)} to signify whether there was any *actual* sampling that happened. Maybe just using Result in that case would be fine as well, as it could be considered an error that the amount < length (then people could just unwrap it if they were certain of the length). I'll bump the cargo version

dhardy · 2017-11-26T11:45:53Z

Er... the method I mentioned saves calling Range::new each iteration, which involves a little work (BTW this is quite different in current rand and in my branch, but each version involves some extra work). If length >= amount * 2 then the expected number of samples is no more than amount * 2, but admittedly worst-case behaviour is far worse (unbounded in fact). HashSet look-ups are (expected) constant time, so it's an O(amount) algorithm, same as yours. Benchmarking would be needed to see which is faster in practice (and it depends heavily on length/amount), but due to worst-case behaviour I agree, better not to use the algorithm I suggested.

Yes, better not to panic. I'm undecided whether it should silently reduce length to match the input or not.

vitiral · 2017-11-26T15:31:21Z

Hmm, I would not have expected that. If that is the case I will change it to use Rng::next_f64 instead:

let j = (rng.next_f64 * (length - i)) as usize + i

Then we don't have to pay that penalty. I'm surprised it does not already desurgar to this, will your branch be changing this?

vitiral · 2017-11-26T16:03:26Z

I changed it to use let j = (rng.next_f64 * (length - i)) as usize + i instead. I was surprised to see that if I changed inplace to use the faster method but not cache that they got different results. I would think the underlying method would be the same though. I will add a separate bug as that is not expected behavior IMO.

dhardy · 2017-11-26T16:29:58Z

Don't use floating point sampling; you will not get an even distribution. Just use the range code you have before, it's fine.

vitiral · 2017-11-26T16:35:04Z

sounds good, I reset it to the previous commit.

Also, it sounds like you have some performance improvements for gen_range comming down the pipe, so it shouldn't be too much of an issue in the near future.

vitiral · 2017-11-27T02:17:33Z

I think I'm going to change it back to sample_reservoir and do a Result. Does that make sense to you?

dhardy · 2017-11-27T08:32:42Z

I don't know the answer to that. Both options sound ok.

vitiral · 2017-11-27T14:19:07Z

I kept it sample_iter but used a Result. I think having a different type signture will help prevent people from using sample_iter like sample_slice because they are really not the same!

vitiral · 2017-11-27T14:22:50Z

I hope this is finally done! 😄 Thanks for the review!

dhardy · 2017-11-27T14:58:14Z

src/seq.rs

+/// The following can be returned:
+/// - `Ok`: `Vec` of `amount` non-repeating randomly sampled elements. The order is not random.
+/// - `Err`: `Vec` of *less than* `amount` elements in sequential order. This is considered an
+///   error since exactly `amount` elements is typically expected.


For the Err case I'd prefer more precise documentation: in case less than amount elements are available, a Vec of exactly the elements in iterable is returned. The order is not random.

the order is actually guaranteed to be sequential. I've updated the doc, let me know if that is better.

vitiral · 2017-11-27T20:01:57Z

weird, the rust-stable travis doesn't seem to be running

dhardy · 2017-12-01T11:52:01Z

src/seq.rs

+/// This method is used internally by the slice sampling methods, but it can sometimes be useful to
+/// have the indices themselves so this is provided as an alternative.
+///
+/// Panics if `amount > self.len()`


It's not self.len(). Also slice.len() in the panic text is technically wrong, but may still be the best description.

vitiral · 2017-12-03T14:28:55Z

@dhardy should be fixed now. What do you mean slice.len() isn't valid?

Sample

vitiral mentioned this pull request Nov 24, 2017

rand::sample's API should be fallible #194

Closed

dhardy reviewed Nov 24, 2017

View reviewed changes

vitiral force-pushed the sample branch 2 times, most recently from c75f356 to 566fbaf Compare November 25, 2017 14:01

dhardy reviewed Nov 25, 2017

View reviewed changes

vitiral force-pushed the sample branch from 566fbaf to 3e9eac0 Compare November 25, 2017 14:10

close rust-random#194: add Sample trait and rename sample -> sample_r…

3e9eac0

…eservoir

dhardy requested changes Nov 25, 2017

View reviewed changes

vitiral force-pushed the sample branch from 5bcc725 to 9723127 Compare November 25, 2017 21:42

vitiral force-pushed the sample branch from 9723127 to 9bf4c95 Compare November 25, 2017 21:46

vitiral force-pushed the sample branch from 9bf4c95 to 74df6c0 Compare November 25, 2017 21:54

fixing comments

74df6c0

vitiral added 2 commits November 25, 2017 15:20

rename sample_reservoir -> sample_iter

4f78755

clean up some warts

dbb1059

dhardy approved these changes Nov 26, 2017

View reviewed changes

dhardy mentioned this pull request Nov 26, 2017

Revise sample function and move to seq module dhardy/rand#64

Closed

vitiral force-pushed the sample branch from 0f82a68 to dbb1059 Compare November 26, 2017 16:34

add distribution test

6de365d

make sample_iter return Result<Vec, Vec> and fix docs

15c1d86

bump cargo version

3a4015a

dhardy reviewed Nov 27, 2017

View reviewed changes

vitiral force-pushed the sample branch from 34154f0 to 70ffaaa Compare November 27, 2017 16:00

clarify Err in sample_iter

70ffaaa

dhardy reviewed Dec 1, 2017

View reviewed changes

clean up len docs

201ec34

dhardy mentioned this pull request Dec 4, 2017

Add Rng::pick method #198

Closed

dhardy merged commit 10218d4 into rust-random:master Dec 5, 2017

dhardy mentioned this pull request Jan 6, 2018

Sequence sampling: seq, WeightedChoice dhardy/rand#82

Closed

5 tasks

pitdicker pushed a commit to pitdicker/rand that referenced this pull request Apr 4, 2018

Merge pull request rust-random#195 from vitiral/sample

9ce5e6d

Sample

		@@ -0,0 +1,330 @@
		// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT

Sample #195

Sample #195

Conversation

vitiral commented Nov 24, 2017 • edited Loading

Second Try:

Original Ticket:

dhardy commented Nov 24, 2017 • edited Loading

vitiral commented Nov 24, 2017 • edited Loading

dhardy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitiral commented Nov 24, 2017

vitiral commented Nov 25, 2017

Choose a reason for hiding this comment

vitiral commented Nov 25, 2017

dhardy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitiral Nov 25, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dhardy Nov 25, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitiral commented Nov 25, 2017

dhardy commented Nov 25, 2017

vitiral commented Nov 25, 2017 • edited Loading

dhardy commented Nov 25, 2017

vitiral commented Nov 25, 2017 • edited Loading

dhardy commented Nov 25, 2017

vitiral commented Nov 25, 2017 • edited Loading

dhardy commented Nov 25, 2017

vitiral commented Nov 25, 2017 via email

vitiral commented Nov 25, 2017

vitiral commented Nov 25, 2017

vitiral commented Nov 25, 2017

vitiral commented Nov 25, 2017

dhardy left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitiral commented Nov 26, 2017 via email

vitiral commented Nov 26, 2017 via email

dhardy commented Nov 26, 2017

vitiral commented Nov 26, 2017 • edited Loading

vitiral commented Nov 26, 2017

dhardy commented Nov 26, 2017

vitiral commented Nov 26, 2017

vitiral commented Nov 27, 2017

dhardy commented Nov 27, 2017

vitiral commented Nov 27, 2017

vitiral commented Nov 27, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitiral commented Nov 27, 2017

Choose a reason for hiding this comment

vitiral commented Dec 3, 2017

vitiral commented Nov 24, 2017 •

edited

Loading

dhardy commented Nov 24, 2017 •

edited

Loading

vitiral commented Nov 24, 2017 •

edited

Loading

vitiral Nov 25, 2017 •

edited

Loading

dhardy Nov 25, 2017 •

edited

Loading

vitiral commented Nov 25, 2017 •

edited

Loading

vitiral commented Nov 25, 2017 •

edited

Loading

vitiral commented Nov 25, 2017 •

edited

Loading

dhardy left a comment •

edited

Loading

vitiral commented Nov 26, 2017 •

edited

Loading