Update sampling.jl #45
Conversation
I thought that code … Edit: the reason I removed that part of the code is efficiency; see another comment below.
You can stop Fisher-Yates prematurely, for what it's worth, and then sort only the sampled values. I don't think it's possible to do better than that. Perhaps that's what your code was doing.
@StefanKarpinski The current approach is basically what you said: we are using Fisher-Yates terminated after … Suppose we are trying to draw … However, sampling without replacement only happens for …
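For concreteness, the early-terminated Fisher-Yates approach described above can be sketched as follows (a Python illustration; the project's actual code is Julia):

```python
import random

def partial_fisher_yates(population, k, rng=random):
    """Draw k items without replacement by running only the first k
    steps of a Fisher-Yates shuffle: each step swaps a uniformly
    chosen element from the unshuffled tail into position i."""
    a = list(population)
    n = len(a)
    for i in range(k):
        j = rng.randrange(i, n)  # uniform over the not-yet-fixed positions
        a[i], a[j] = a[j], a[i]
    return a[:k]
```

Sorting the returned k values afterwards gives an in-order sample, which is the "Fisher-Yates + sort" strategy mentioned above.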
@lindahua I see what you're saying, and you're right, but what you've described isn't quite how my code works. I've posted the original code here, and I hope you'll have another look over it and perhaps run some benchmarks to clear up any confusion about its performance. What you'll find is that for … For example, if I want … As I said, please do have a look yourself to make sure I'm not missing anything.
The algorithm used in your gist looks quite interesting, and I believe it would be faster when you want to draw a large number of samples. I will play with it a little bit later.
Great, thanks - let me know if you want any more info.
I went deep into my dark playground today and got some benchmark data. You can follow along yourself with this gist.

My takeaway: @one-more-minute is correct: his method is great for k ~ n situations, and it looks like (and performs quite similarly to) @stevengj's new … But I'm a bit concerned about the (unknown) statistics of the algorithm. This field is littered with algorithms that intuitively should work but don't, due to subtle biases, as you're probably already aware.

Another takeaway is that (as @stevengj discusses in this issue comment) Floyd's algorithm is superior to the current in-order algorithms for all sizes (barring @one-more-minute's), and even the naive permuted version of Floyd's algorithm as described here (which uses …
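For reference, Floyd's algorithm mentioned above can be sketched in Python (the thread's benchmarks are in Julia; this is just an illustration):

```python
import random

def floyd_sample(n, k, rng=random):
    """Robert Floyd's algorithm: a uniformly random k-subset of
    {1, ..., n} using exactly k RNG calls and O(k) extra storage."""
    chosen = set()
    for j in range(n - k + 1, n + 1):
        t = rng.randint(1, j)  # uniform on {1, ..., j}
        # If t was already taken, j itself cannot have been (it's new
        # this round), so inserting j keeps the subset uniform.
        chosen.add(j if t in chosen else t)
    return chosen
```

The output is an unordered set; the "naive permuted version" referenced above would additionally shuffle (or sort) the result to impose an ordering.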
Vitter has an algorithm that may be superior to Floyd's algorithm for small subsets, and produces in-order output with no extra storage or sort pass. I haven't yet implemented or benchmarked this algorithm, however. For larger subsets (bigger than 1/10 or so), my experiments suggest that it is probably best to use the O(n) algorithm from Knuth TAOCP vol. 2 (Algorithm S, section 3.4.2), which also produces in-order output:
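Algorithm S can be sketched in Python as follows (the thread's implementations are in Julia; this follows the TAOCP description):

```python
import random

def knuth_select(seq, k, rng=random):
    """Knuth TAOCP vol. 2, Algorithm S (selection sampling): scan seq
    once, selecting each element with probability
    (still needed) / (still unseen); in-order output, O(len(seq)) time."""
    n = len(seq)
    out = []
    for t, x in enumerate(seq):
        # select x with probability (k - len(out)) / (n - t)
        if (n - t) * rng.random() < k - len(out):
            out.append(x)
            if len(out) == k:
                break
    return out
```

When the number still needed equals the number still unseen, the selection probability reaches 1, so the algorithm always returns exactly k items.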
It might also be nice to mirror the naming in …
I have to run, but quickly wanted to share these results. Here Vitter is Vitter's Algorithm D from the appendix of the 1984 ACM article (switching to Algorithm A for "denser" subsets), and Randsubseq2 is your Knuth code above. Also, Vitter is the best for smaller requested sets (Knuth does poorly here). The updated gist is here and most likely has dog-eating bugs.
Cool! Basically, the upshot is that Vitter's O(n) "Algorithm A" (barely) beats Knuth's O(n) "Algorithm S", and that Vitter's O(|output|) "Algorithm D" wins when the output is < 1/15 of the input.
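Vitter's Algorithm A can be sketched in Python from the pseudocode in the 1984 paper (a hedged illustration, not the thread's implementation, which is Julia):

```python
import random

def vitter_a(seq, k, rng=random):
    """Vitter's Algorithm A: sequential sampling by generating skip
    lengths, one uniform variate per selected record; in-order output."""
    out = []
    N, n, i = len(seq), k, 0  # i: index of the next unread record
    while n >= 2:
        V = rng.random()
        S, top = 0, N - n
        quot = top / N
        while quot > V:        # find the skip length S for this selection
            S += 1
            top -= 1
            N -= 1
            quot = quot * top / N
        i += S                 # skip S records ...
        out.append(seq[i])     # ... and select the next one
        i += 1
        N -= 1
        n -= 1
    if n == 1:                 # last record: a plain uniform choice
        out.append(seq[i + int(N * rng.random())])
    return out
```

Each selection costs one uniform variate plus an expected O(N/n) inner-loop work, which is why it is competitive with Algorithm S while touching the RNG far less often.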
So for licensing purposes, do we need a "clean room" implementation of the …
Implementations based on pseudocode descriptions of an algorithm are generally supposed to be okay, I think, although excessively complex pseudocode seems to be a gray area, and the usual caveats of legal uncertainty apply. The whole point of pseudocode is that it is supposed to boil down the mathematical essence of the algorithm, independent of its expression in any particular programming language, and mathematical algorithms per se are not supposed to be copyrightable. Direct translation of Fortran code is probably not okay, though.
(I will never, ever, publish anything in ACM, for this reason. For years, they've exploited the copyright naiveté of authors who, in most cases, clearly intended to make their code usable by anyone with few or no restrictions, often posting their code to Netlib. ACM has taken this code, claimed copyright ownership, and used it to create a huge copyright land mine covering many of the most important algorithms in computer science and numerical analysis.)
If there's still interest in using my approach (it looks like it might be at least as good as Vitter for large k, assuming it's correct – and it's entirely clean-room, so there are no licensing issues), I'm happy to explain the statistics of it so it can be verified, when I have some time. For the record, I did do some sanity checks and I'm pretty confident in the maths, so it certainly isn't obviously biased – but I can understand that that's not really enough.
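One cheap check along these lines is to estimate each element's inclusion probability empirically and compare against k/n (a Python sketch; `inclusion_frequencies` and the sampler signature here are made up for illustration, and this only catches marginal bias, not bias in the joint distribution):

```python
import random
from collections import Counter

def inclusion_frequencies(sampler, n, k, trials, rng):
    """Estimate P(element i is in the sample) for each i in range(n).

    For an unbiased sampler every element is included with probability
    exactly k/n. Passing this check does not prove the joint
    distribution over subsets is uniform - it is a sanity check only."""
    counts = Counter()
    for _ in range(trials):
        counts.update(sampler(range(n), k, rng))
    return [counts[i] / trials for i in range(n)]

# Example: random.sample as a known-good baseline sampler.
rng = random.Random(0)
freqs = inclusion_frequencies(lambda pop, k, r: r.sample(list(pop), k),
                              n=10, k=3, trials=20000, rng=rng)
# Every frequency should land close to k/n = 0.3.
```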
For large … But I really think we should be able to implement Vitter's algorithm from the pseudocode in his paper.
The code this comment refers to was removed; I've just updated it so it's not confusing.
@lindahua While I'm here, mind if I ask why the dedicated ordered sampling-without-replacement algorithm was removed? I included it originally because it's faster than Fisher-Yates + sort in most cases. I'm sure there's a good reason for its removal; I'm just curious.