Sample #195
All included PRNGs (including XorShiftRng) should be repeatable, if you create them using |
@dhardy I'm really not seeing that. I simplified the test by commenting stuff out, and this is the stdout:
|
Don't know, but I'd think it's a bug in your code, not XorShiftRng.
src/lib.rs
Outdated
@@ -260,6 +260,7 @@ pub use os::OsRng;

pub use isaac::{IsaacRng, Isaac64Rng};
pub use chacha::ChaChaRng;
pub use sample::{sample, sample_reservoir};
I wonder if we should remove this pub use? Especially because I want to introduce a Sample trait.
src/sample.rs
Outdated
&mut XorShiftRng::from_seed(seed), vals.len(), amount);
let cache = sample_range_cache(
    &mut XorShiftRng::from_seed(seed), 0..vals.len(), amount);
assert_eq!(inplace, cache);
Do your methods guarantee the same ordering?
Yup, I had a logic error. Getting pretty close.
c75f356 to 566fbaf
Okay, I think I'm basically done. Comments welcome!
src/sample.rs
Outdated
let vec: Vec<usize> = (0..length).collect();
assert_eq!(
    vec.sample(&mut xor_rng(seed), amount),
    regular,
Trailing comma may be an issue in some builds? At least, Travis complains on some builds (also line 325)
@dhardy sorry, I was running on nightly before. Amended a fix!
Sorry, not so happy as is; the usage of traits in the API seems unnecessary and not fully taken advantage of. Better to simplify?
src/lib.rs
Outdated
sample_reservoir as sample,
sample_reservoir,
Sample,
SampleRef};
I'm not keen on re-exporting these names. Also, Sample conflicts with one of my (to be) proposed changes!
Note: the code is all about sampling from sequences, either Vec or slices, or iterators over a more general container, so usage of a seq module doesn't seem inappropriate to me.
We could rename them SampleContainer and SampleRefContainer. See my comments below for whether we should even keep the traits or go with a function as you suggested elsewhere.
src/sample.rs
Outdated
@@ -0,0 +1,330 @@
// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT
year
src/lib.rs
Outdated
@@ -260,6 +260,12 @@ pub use os::OsRng;

pub use isaac::{IsaacRng, Isaac64Rng};
pub use chacha::ChaChaRng;
pub use sample::{
    // TODO: `sample` name will be deprecated in 1.0, use `sample_reservoir` instead
Try #[deprecated(since="VER", note="renamed to sample_reservoir")]. I don't think there's any point delaying the deprecation. RFC: https://github.com/rust-lang/rfcs/blob/master/text/1270-deprecation.md
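For illustration, a minimal sketch of what the suggested attribute could look like on a renamed function. The function bodies are stand-ins (they just take the first `amount` items rather than sampling), the `_sketch` names are hypothetical, and `"0.0.0"` is a placeholder for the `VER` value mentioned above.

```rust
/// Stand-in for the new name. The body is a placeholder that simply takes
/// the first `amount` items; it does NOT perform any random sampling.
pub fn sample_reservoir_sketch(values: &[u32], amount: usize) -> Vec<u32> {
    values.iter().cloned().take(amount).collect()
}

/// Old name kept for compatibility. Callers get a compiler warning that
/// points them at the new name. "0.0.0" is a placeholder version.
#[deprecated(since = "0.0.0", note = "renamed to `sample_reservoir_sketch`")]
pub fn sample_sketch(values: &[u32], amount: usize) -> Vec<u32> {
    sample_reservoir_sketch(values, amount)
}
```

Code that still calls the old name keeps compiling, and can silence the warning locally with `#[allow(deprecated)]` while it migrates.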
src/sample.rs
Outdated
where R: Rng,
{
    if amount > length {
        panic!("`amount` must be less than or equal to `slice.len()`");
There's no slice param so update msg
src/sample.rs
Outdated
///
/// TODO: IMO this should be made public since it can be generally useful, although
/// there might be a way to make the output type more generic/compact.
fn sample_indices<R>(rng: &mut R, length: usize, amount: usize) -> Vec<usize>
So make it public? Prototype seems okay. Maybe it should say: samples amount numbers from the range 0..length.
I think I can make it more general, sample_range to sample over any numeric range, and even make the input types generic. I'd like to do that before exporting it.
I'll open a ticket once this gets merged. It's out of scope and just a "nice to have".
src/sample.rs
Outdated
///
/// This is intended to be implemented for containers that:
/// - Can be sampled in `O(amount)` time.
/// - Whose items can be `cloned`.
src/sample.rs
Outdated
/// let sample = sample_reservoir(&mut rng, 1..100, 5);
/// println!("{:?}", sample);
/// ```
pub fn sample_reservoir<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> Vec<T>
Shouldn't this be a special case of your traits now, i.e. impl<I: Iterator> Sample for I?
As said above, sample_reservoir is far slower.
We could use sample_reservoir for iterators or "generic" types like HashMaps, but we will not be able to guarantee performance of O(m).
On that note, I've investigated sampling from HashMaps in O(m) and I don't think it can be done. Even the iterator representation is required to iterate over the full capacity in order to guarantee selection of all elements -- it has no way to select only the "filled buckets".
Also, anything done without creating an ordered view of items in the HashMap will not be reproducible, even with a reproducible generator. Theoretical and maybe not important.
Ya, I think special-casing HashMap is not really possible, at least without architecting a HashMap for just that purpose (in which case... provide your own sampling function for it since you are creating it for that purpose).
For standard HashMaps someone will have to use sample_reservoir and eat the O(N) computational cost.
src/sample.rs
Outdated
type Sampled = Vec<T>;

fn sample<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> {
    self.as_slice().sample(rng, amount)
So this just uses the slice representation directly? This makes me wonder whether the trait-based approach is worth it at all.
src/sample.rs
Outdated
/// - Whose items can be `cloned`.
///
/// If cloning is impossible or expensive, use `sample_ref` instead.
pub trait Sample {
Note that traits can't be implemented outside of the crate where (a) the trait is defined or (b) the target type is defined. This means no one else can implement Sample for std::collections types.
Likely though, this isn't necessary: users can instead create an iterator and call sample on that. Which raises two questions: (1) why use traits instead of plain functions and (2) why not let a more general function like sample_reservoir take an optional length (or a length and an iterator), and use the appropriate algorithm when length is known?
sample_reservoir requires iterating through the entire sequence length N to sample m samples, so is O(N) computationally and O(m) in memory.
Sample::sample does not iterate through the sequence, so is O(m) for both memory and computation.
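To make the O(N) cost concrete, here is a minimal sketch of reservoir sampling (Algorithm R) over an arbitrary iterator. It is not the code from this PR: `XorShift64` is a tiny hypothetical stand-in for a seedable generator, `gen_below` uses a simple modulo (slightly biased, fine for illustration), and `sample_reservoir_sketch` is an illustrative name.

```rust
/// Tiny hypothetical stand-in for a seedable PRNG (Marsaglia xorshift64).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn from_seed(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so avoid it.
        XorShift64 { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Value in `0..bound`. Plain modulo, so slightly biased; `bound` must be > 0.
    fn gen_below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Reservoir sampling: every item of the iterator is visited once (O(N) time),
/// but only `amount` items are ever held in memory (O(m) memory).
fn sample_reservoir_sketch<T, I>(rng: &mut XorShift64, iterable: I, amount: usize) -> Vec<T>
where
    I: IntoIterator<Item = T>,
{
    let mut iter = iterable.into_iter();
    // Fill the reservoir with the first `amount` items.
    let mut reservoir: Vec<T> = iter.by_ref().take(amount).collect();
    if reservoir.len() == amount {
        // The i-th remaining item replaces a random slot with
        // probability amount / (amount + i + 1).
        for (i, item) in iter.enumerate() {
            let k = rng.gen_below(amount + i + 1);
            if k < amount {
                reservoir[k] = item;
            }
        }
    }
    reservoir
}
```

The loop has to consume the whole iterator even when `amount` is tiny, which is the cost being contrasted with index-based slice sampling in this thread.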
Just knowing the length allows O(m) computation of indices... but getting those values from the iterator is still O(N). Ok.
The first question still stands though: why use a Sample trait instead of a plain function?
Also, sample guarantees that the order of the returned elements is random, whereas sample_reservoir does not.
So, with HashMap probably being impossible to implement in If you knew the number of nodes in each branch of a tree (at every branch) then you could implement a sampling mechanism in |
I suspect even an O(m) BTreeMap implementation would be impossible... it depends whether the BTreeMap insert/remove functions update counts at each node or not — no, it looks like there is a single |
It looks like rust doesn't store any information about the number of children nodes for each node, which is pretty standard for a BTree and I wouldn't expect them to. With that knowledge I'd say that not having the traits and essentially redefining This will of course be an actually breaking change, since the type signature of |
You could also make it a non-breaking change with a different name (e.g. |
that's true. We could put it in there and just mark I think that's the route I'd like to go down. However module |
Give me your opinion: does a It's also possible to |
I would say that makes sense to me, but what else is Hmm... I think for now we should deprecate I'm debating whether to call it |
"Sampling" is a fairly generic term, e.g. sampling from a normal distribution. |
I'll do sample_slice then.
Hopefully will address all your comments this afternoon
|
@dhardy okay, I think I've addressed all your concerns. I also:
|
also note that I have minor changes to |
also, I made Should |
I'm going to do that, since I think it's the right thing. Let me know if you want me to reverse it. |
I like this revision myself.
src/lib.rs
Outdated
@@ -260,6 +260,8 @@ pub use os::OsRng;

pub use isaac::{IsaacRng, Isaac64Rng};
pub use chacha::ChaChaRng;
#[deprecated(since="0.3.18", note="renamed to seq::sample_iter")]
That's the current version; I guess this should be 0.3.19 with a corresponding version bump in Cargo.toml.
src/seq.rs
Outdated
} else {
    // Don't hang onto extra memory. There is a corner case where
    // `amount <<< len(iterable)` that we want to avoid.
    reservoir.shrink_to_fit();
So you're not going with your own advice in #194 to make this fallible? I'm not sure personally; this is an iterator so shrinking to available data is common behaviour.
///
/// The cache avoids allocating the entire `length` of values. This is especially useful when
/// `amount <<< length`, i.e. select 3 non-repeating from 1_000_000
fn sample_indices_cache<R>(
An alternative approach would be to use a fixed Range::new(0, length) (slightly faster than using a different range each time), ignore any samples already in your HashSet, and stop when the set.len() == amount. Not sure which would be faster. Obviously if amount is close to length there will be many clashes and resamples, but you don't use this in that case anyway.
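The alternative described here can be sketched roughly as follows. This is only an illustration of the idea, not code from the PR: `XorShift64` is a hypothetical stand-in generator (with a slightly biased modulo in `gen_below`), and `sample_indices_rejection` is an invented name.

```rust
use std::collections::HashSet;

/// Tiny hypothetical stand-in for a seedable PRNG (Marsaglia xorshift64).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn from_seed(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so avoid it.
        XorShift64 { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Value in `0..bound`. Plain modulo, so slightly biased; `bound` must be > 0.
    fn gen_below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Draw from the fixed range `0..length`, skip values already seen,
/// and stop once `amount` distinct indices have been collected.
fn sample_indices_rejection(rng: &mut XorShift64, length: usize, amount: usize) -> Vec<usize> {
    assert!(amount <= length, "`amount` must be less than or equal to `length`");
    let mut seen = HashSet::with_capacity(amount);
    let mut picked = Vec::with_capacity(amount);
    while picked.len() < amount {
        let candidate = rng.gen_below(length);
        // `insert` returns false for a repeat, in which case we resample.
        if seen.insert(candidate) {
            picked.push(candidate);
        }
    }
    picked
}
```

The number of draws is not bounded in advance: the closer `amount` gets to `length`, the more candidates are rejected, which is the trade-off discussed in the replies.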
For the alternative implementation, this one is faster. I'm not actually
sure if what you suggest saves you *anything* (maybe adding `i` every
iteration?), but it loses you a LOT. It would literally be possible
(although unlikely) to run for 1000x longer or more using your method,
especially if amount was close to length / 2.
The other thing this method gains is that it is identical to the inplace
version. Literally they are the same method just optimized for different
relations of amount to length.
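A rough sketch of why the two variants can be "the same method": both run a partial Fisher-Yates shuffle, and the cached version only records the positions that a swap has touched. The code below is an assumption-laden illustration, not the PR's implementation; `XorShift64` and both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Tiny hypothetical stand-in for a seedable PRNG (Marsaglia xorshift64).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn from_seed(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so avoid it.
        XorShift64 { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Value in `0..bound`. Plain modulo, so slightly biased; `bound` must be > 0.
    fn gen_below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Partial Fisher-Yates over a fully allocated index vector: O(length) memory.
fn sample_indices_inplace_sketch(rng: &mut XorShift64, length: usize, amount: usize) -> Vec<usize> {
    assert!(amount <= length, "`amount` must be less than or equal to `length`");
    let mut indices: Vec<usize> = (0..length).collect();
    for i in 0..amount {
        let j = i + rng.gen_below(length - i);
        indices.swap(i, j);
    }
    indices.truncate(amount);
    indices
}

/// Same shuffle, but only swapped positions are stored: O(amount) memory.
/// A position missing from the cache still holds its own index.
fn sample_indices_cache_sketch(rng: &mut XorShift64, length: usize, amount: usize) -> Vec<usize> {
    assert!(amount <= length, "`amount` must be less than or equal to `length`");
    let mut cache: HashMap<usize, usize> = HashMap::with_capacity(amount);
    let mut picked = Vec::with_capacity(amount);
    for i in 0..amount {
        let j = i + rng.gen_below(length - i);
        // Value currently at position i of the virtual array.
        let at_i = cache.get(&i).copied().unwrap_or(i);
        // Move it to position j and take whatever was at j as the i-th pick.
        let at_j = cache.insert(j, at_i).unwrap_or(j);
        picked.push(at_j);
    }
    picked
}
```

Because both functions consume random numbers in the same order, seeding them identically should give identical output, which is the property the `assert_eq!(inplace, cache)` test earlier in this review relies on.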
|
I'm also starting to think that maybe sample_reservoir was the right
name... It's good to call out the method since as you mention it has
different behavior.
I think panicking is a bad idea. We could maybe return an Enum {
Sampled(Vec), Collected(Vec)} to signify whether there was any *actual*
sampling that happened. Maybe just using Result in that case would be fine
as well, as it could be considered an error that the length < amount (then
people could just unwrap it if they were certain of the length).
I'll bump the cargo version
|
Er... the method I mentioned saves calling Yes, better not to panic. I'm undecided whether it should silently reduce length to match the input or not. |
Hmm, I would not have expected that. If that is the case I will change it to use Rng::next_f64 instead:
Then we don't have to pay that penalty. I'm surprised it does not already desugar to this, will your branch be changing this?
I changed it to use |
Don't use floating point sampling; you will not get an even distribution. Just use the range code you had before, it's fine.
sounds good, I reset it to the previous commit. Also, it sounds like you have some performance improvements for |
I think I'm going to change it back to |
I don't know the answer to that. Both options sound ok. |
I kept it |
I hope this is finally done! 😄 Thanks for the review!
src/seq.rs
Outdated
/// The following can be returned:
/// - `Ok`: `Vec` of `amount` non-repeating randomly sampled elements. The order is not random.
/// - `Err`: `Vec` of *less than* `amount` elements in sequential order. This is considered an
///   error since exactly `amount` elements is typically expected.
For the Err case I'd prefer more precise documentation: in case fewer than amount elements are available, a Vec of exactly the elements in iterable is returned. The order is not random.
the order is actually guaranteed to be sequential. I've updated the doc, let me know if that is better.
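The `Ok`/`Err` contract discussed here might look like the following sketch. It assumes a `Result<Vec<T>, Vec<T>>` return type as described in the doc comment above; `XorShift64` and `sample_iter_sketch` are hypothetical names, and the generator uses a slightly biased modulo.

```rust
/// Tiny hypothetical stand-in for a seedable PRNG (Marsaglia xorshift64).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn from_seed(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so avoid it.
        XorShift64 { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Value in `0..bound`. Plain modulo, so slightly biased; `bound` must be > 0.
    fn gen_below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// - `Ok`: exactly `amount` sampled elements (order not random).
/// - `Err`: the iterator ran out early; all of its elements are returned
///   in their original, sequential order.
fn sample_iter_sketch<T, I>(
    rng: &mut XorShift64,
    iterable: I,
    amount: usize,
) -> Result<Vec<T>, Vec<T>>
where
    I: IntoIterator<Item = T>,
{
    let mut iter = iterable.into_iter();
    let mut reservoir: Vec<T> = iter.by_ref().take(amount).collect();
    if reservoir.len() < amount {
        // Nothing was sampled, so the items are still in iteration order.
        return Err(reservoir);
    }
    for (i, item) in iter.enumerate() {
        let k = rng.gen_below(amount + i + 1);
        if k < amount {
            reservoir[k] = item;
        }
    }
    Ok(reservoir)
}
```

A caller who is certain the input is long enough can `unwrap()` the result; one who is not can match on `Err` and decide whether a short, unsampled `Vec` is acceptable.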
weird, the rust-stable travis doesn't seem to be running |
src/seq.rs
Outdated
/// This method is used internally by the slice sampling methods, but it can sometimes be useful to
/// have the indices themselves so this is provided as an alternative.
///
/// Panics if `amount > self.len()`
It's not self.len(). Also slice.len() in the panic text is technically wrong, but may still be the best description.
@dhardy should be fixed now. What do you mean |
This implements #195, in particular it:
- Deprecates rand::sample and bumps the cargo version.
- Adds a rand::seq::sample_iter function which is very similar to rand::sample but returns a Result.
- Adds seq::sample_slice and seq::sample_slice_ref functions which sample from a slice in O(amount) time and memory.
- Adds seq::sample_indices which can sample a range of indices in O(amount) time and memory (helper function if this is desired).
Second Try:
Comments and criticism welcome!
Original Ticket:
There is still a lot to do here, but I'm hitting a critical issue where the XorShiftRng doesn't seem to be repeatable with the same seed. When running the new test I'm getting log output like:
So j is different values for the same loop number. Does anyone know a repeatable (with the same seed) random number generator I can use?
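As a sanity check of the expectation in this ticket, the sketch below shows a seeded generator producing the same sequence on every run. `XorShift64` here is a small hypothetical stand-in written for this example (a Marsaglia xorshift64 step), not the crate's `XorShiftRng`; `first_values` is likewise an invented helper.

```rust
/// Tiny hypothetical stand-in for a seedable PRNG (Marsaglia xorshift64).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn from_seed(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so avoid it.
        XorShift64 { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

/// First `n` outputs of a freshly seeded generator.
fn first_values(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = XorShift64::from_seed(seed);
    (0..n).map(|_| rng.next_u64()).collect()
}
```

Two generators built from the same seed step through identical states, so `first_values(42, 100)` compared with itself always matches. If two seeded runs diverge, the likelier cause is that the code under test consumes a different number of random values on each path.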