Update sampling.jl #45

Merged: 1 commit, Feb 24, 2014

Conversation

@MikeInnes (Contributor)

The code this comment refers to was removed; I've just updated it so it's not confusing.

@lindahua While I'm here, mind if I ask why the dedicated ordered sampling-without-replacement algorithm was removed? I included it originally because it's faster than Fisher-Yates + sort in most cases. I'm sure there's a good reason for its removal; I'm just curious.

@lindahua (Contributor)

I thought that code did not produce samples with equal probability. Let me find the old code and explain more.

Edit: the reason I removed that part of the code was efficiency. See another comment below.

@StefanKarpinski (Contributor)

For what it's worth, you can stop Fisher-Yates prematurely and then sort only the sampled values. I don't think it's possible to do better than that. Perhaps that's what your code was doing.

lindahua added a commit that referenced this pull request on Feb 24, 2014
@lindahua merged commit beedf4b into JuliaStats:master on Feb 24, 2014
@lindahua (Contributor)

@StefanKarpinski The current approach is basically what you said: we use Fisher-Yates terminated after k steps, and we only sort the sampled values when requested (when the keyword argument ordered is set to true).

Suppose we are trying to draw k samples from n distinct values. @one-more-minute, your approach of using binomial sampling is great for sampling with replacement when k >> n. This was demonstrated in your earlier PR, which has been merged.

However, sampling without replacement requires k <= n (usually k << n). Your previous method for that case was to perform repeated Bernoulli trials, which means generating O(n) random numbers in order to draw k samples. Clearly, that is not as efficient as using Fisher-Yates. Even in the cases where you need ordered samples, with an additional sorting step that takes O(k log k), the current approach is still more efficient than generating O(n) random numbers.
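For concreteness, here is a minimal sketch of that partial Fisher-Yates approach (the function name and keyword argument are illustrative only, not the package's actual API; written in the Julia syntax of the time):

function fy_sample(a::AbstractArray, k::Integer; ordered::Bool=false)
    n = length(a)
    0 <= k <= n || error("k must satisfy 0 <= k <= n")
    inds = [1:n]              # candidate indices
    for i = 1:k               # partial Fisher-Yates: only k swap steps
        j = rand(i:n)         # pick one of the remaining indices
        inds[i], inds[j] = inds[j], inds[i]
    end
    x = a[inds[1:k]]          # the k sampled values
    ordered && sort!(x)       # extra O(k log k) only when requested
    return x
end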

@MikeInnes (Contributor, Author)

@lindahua I see what you're saying, and you're right, but what you've described isn't quite how my code works. I've posted the original code here, and I hope that you'll have another look over it and perhaps run some benchmarks to clear up any confusion about its performance.

What you'll find is that for k << n, Fisher-Yates+Sort is indeed faster. However, for (roughly) k > n/20, my algorithm becomes better; in fact, for k > n/10 it's faster than Fisher-Yates before you even get to the sort.

For example, if I want 9e7 samples out of 1e8, Fisher-Yates+Sort takes 23 seconds versus 1.5 for my code, a 15-fold improvement. And this isn't just limited to large values of n: sampling 50 values out of 100 is 2.5 times faster with my code, and it remains faster down to k = 3 (albeit not by much).

As I said, please do have a look yourself to make sure I'm not missing anything.

@lindahua (Contributor)

The algorithm used in your gist looks quite interesting and I believe it would be faster when you want to draw a large number of samples.

I will play with it a little bit later.

@MikeInnes (Contributor, Author)

Great, thanks - let me know if you want any more info.

@binarybana

I went deep into my dark playground today and got some benchmark data. You can follow along yourself with this gist.

[Benchmark plots: 10k-9k-w and 10k-100-wout]

My takeaway: @one-more-minute is correct: his method is great for k ~ n situations, and it looks like (and performs quite similarly to) @stevengj's new randsubseq that just landed in Base (included in these comparisons only as a curiosity, since unlike all the other algorithms here it doesn't return exactly k samples).

But I'm a bit concerned about the (unknown) statistics of the algorithm. This field is littered with algorithms that intuitively should work but don't, due to subtle biases, as you're probably already aware.

Another takeaway is that (as @stevengj discusses in this issue comment) Floyd's algorithm is superior to the current in-order algorithms at all sizes (barring @one-more-minute's), and even the naive permuted version of Floyd's algorithm described here (which uses unshift! on arrays) is better for short sequences, as seen in the second picture above, especially compared with switching straight to prematurely terminated Fisher-Yates.
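For readers unfamiliar with it, here is a minimal sketch of the set-based version of Floyd's algorithm (the function name is hypothetical). It draws k distinct values from 1:n using exactly k random numbers, though the sample comes out unordered:

function floyd_sample(n::Integer, k::Integer)
    s = Set{Int}()
    for j = (n - k + 1):n
        t = rand(1:j)
        # if t was already picked, take j instead; j cannot already be in s,
        # since earlier iterations only drew from smaller ranges
        push!(s, in(t, s) ? j : t)
    end
    return collect(s)
end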

@stevengj (Contributor) commented May 7, 2014

Vitter has an algorithm that may be superior to Floyd's algorithm for small subsets, and produces in-order output with no extra storage or sort pass. I haven't yet implemented or benchmarked this algorithm, however.

For larger subsets (bigger than 1/10 or so), my experiments suggest that it is probably best to use the O(n) algorithm from Knuth TAOCP vol. 2 (Algorithm S, section 3.4.2), which also produces in-order output:

function randsubseq!(S::AbstractVector, A::AbstractArray)
    m = length(S)
    m == 0 && return S # needed for correctness: code below requires m > 0
    n = length(A)
    m == n && return copy!(S, A)
    0 <= m <= n || throw(DomainError())
    i = 0; j = 0
    while true
        # select A[i+1] with probability (m-j)/(n-i): the number of samples
        # still needed divided by the number of items left to scan
        if (n - i)*rand() <= m - j
           S[j += 1] = A[i += 1]
           j == m && return S
        else
           i += 1
        end
    end
end

@stevengj (Contributor) commented May 7, 2014

It might also be nice to mirror the naming in Base to make this more discoverable: as in my Knuth code above, make randsubseq!(S, A) select a random length(S) subsequence of A (without replacement), and let randsubseq(m::Integer, A) = randsubseq!(Array(eltype(A), m), A).
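Under that proposal, usage would look something like this (purely illustrative, assuming the definitions above and 0.3-era syntax):

A = [1:100]                         # the population
S = randsubseq!(Array(Int, 10), A)  # 10 in-order samples without replacement
T = randsubseq(10, A)               # the same thing via the convenience wrapper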

@binarybana

[Benchmark plot: vitter_rss2_10k-8k]

I have to run, but I quickly wanted to share these results. Here Vitter is Vitter's Algorithm D from the appendix of the 1984 ACM article (switching to Algorithm A for 'denser' subsets). Randsubseq2 is your Knuth code above.

Also, Vitter is the best for smaller requested sets (Knuth does poorly here):
[Benchmark plot: vitter_10k-200]

Updated gist is here and most likely has dog-eating bugs.

@stevengj (Contributor) commented May 7, 2014

Cool! Basically, the upshot is that Vitter's O(n) "Algorithm A" (barely) beats Knuth's O(n) "Algorithm S", and that Vitter's O(|output|) "Algorithm D" wins when the output is < 1/15 of the input.
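For anyone following along, here is a rough sketch of Vitter's Algorithm A, written from the paper's description (names and details are mine, not a vetted implementation). It scans the input once but inverts the skip-length distribution with a single uniform variate per selected item, so it uses one random number per output item rather than one per input item:

function vitter_a!(S::AbstractVector, A::AbstractArray)
    k = length(S)                 # number of samples still needed
    n = length(A)                 # number of items not yet scanned
    k <= n || throw(DomainError())
    k == 0 && return S
    i = 0; j = 0                  # items consumed / samples taken so far
    while k > 1
        # invert the skip-length CDF with one uniform variate:
        # P(skip >= s) = prod_{t=0}^{s-1} (n-k-t)/(n-t)
        v = rand()
        s = 0
        quot = (n - k) / n
        while quot > v
            s += 1
            quot *= (n - k - s) / (n - s)
        end
        i += s + 1                # skip s items and take the next one
        S[j += 1] = A[i]
        n -= s + 1
        k -= 1
    end
    S[j + 1] = A[i + ifloor(n*rand()) + 1]  # last sample: uniform over the rest
    return S
end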

@binarybana

So, for licensing purposes, do we need a 'clean room' implementation of the ACM code? Or can we use/relicense a translation of the code found in the appendix?

@stevengj (Contributor) commented May 7, 2014

Implementations based on pseudocode descriptions of an algorithm are generally supposed to be okay, I think, although excessively complex pseudocode seems to be a gray area and the usual caveats of legal uncertainty apply. The whole point of pseudocode is that it boils down the mathematical essence of an algorithm, independent of its expression in any particular programming language, and mathematical algorithms per se are not supposed to be copyrightable.

Direct translation of Fortran code is probably not okay, though.

@stevengj (Contributor) commented May 7, 2014

(I will never, ever, publish anything in ACM, for this reason. For years, they've exploited the copyright-naiveté of authors who, in most cases, clearly intended to make their code usable by anyone with few or no restrictions, often posting their code to Netlib. ACM has taken this code, claimed the copyright ownership, and used it to create a huge copyright land mine covering many of the most important algorithms in computer science and numerical analysis.)

@MikeInnes (Contributor, Author)

If there's still interest in using my approach, I'm happy to explain the statistics behind it so it can be verified, when I have some time. It looks like it might be at least as good as Vitter for large k, assuming it's correct, and it's entirely clean-room, so there are no licensing issues.

For the record, I did do some sanity checks and I'm pretty confident in the maths, so it certainly isn't obviously biased, but I can understand that that's not really enough.

@stevengj (Contributor) commented May 8, 2014

For large k, we can just use Knuth's Algorithm S if we can't use Vitter.

But I really think we should be able to implement Vitter's algorithm from the pseudocode in his paper.
