Update sampling.jl #45

Merged: 1 commit, Feb 24, 2014

Conversation

@MikeInnes (Contributor)

The code this comment refers to was removed; I've just updated it so it's not confusing.

@lindahua While I'm here, mind if I ask why the dedicated ordered sampling-without-replacement algorithm was removed? I included it originally because it's faster than Fisher-Yates + sort in most cases. I'm sure there's a good reason for its removal; I'm just curious.

@lindahua (Contributor)

I thought that code did not produce samples with equal probability. Let me find the old code and explain more.

Edit: the reason I removed that part of the code was efficiency. See another comment below.

@StefanKarpinski (Contributor)

For what it's worth, you can stop Fisher-Yates prematurely and then sort only the sampled values. I don't think it's possible to do better than that. Perhaps that's what your code was doing.

lindahua added a commit that referenced this pull request on Feb 24, 2014
@lindahua merged commit beedf4b into JuliaStats:master on Feb 24, 2014
@lindahua (Contributor)

@StefanKarpinski The current approach is basically what you said: we use Fisher-Yates terminated after k steps, and we only sort the sampled values when requested (when the keyword argument ordered is set to true).

Suppose we are trying to draw k samples from n distinct values. @one-more-minute, your approach of using binomial sampling is great for sampling with replacement when k >> n. This was demonstrated in your earlier PR, which has been merged.

However, sampling without replacement requires k <= n (usually k << n). Your previous method for that case was to perform repeated Bernoulli trials, which means generating O(n) random numbers in order to draw k samples. Clearly, that is not as efficient as using Fisher-Yates. Even in the cases where you need ordered samples, with an additional sorting step that takes O(k log k), the current approach is still more efficient than generating O(n) random numbers.
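For concreteness, here is a minimal sketch of that partial Fisher-Yates approach (the function name and keyword argument are illustrative only, not the package's actual API; written in the Julia syntax of the time):

function fy_sample(a::AbstractArray, k::Integer; ordered::Bool=false)
    n = length(a)
    0 <= k <= n || error("k must satisfy 0 <= k <= n")
    inds = [1:n]              # candidate indices
    for i = 1:k               # partial Fisher-Yates: only k swap steps
        j = rand(i:n)         # pick one of the remaining indices
        inds[i], inds[j] = inds[j], inds[i]
    end
    x = a[inds[1:k]]          # the k sampled values
    ordered && sort!(x)       # extra O(k log k) only when requested
    return x
end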

@MikeInnes (Contributor, Author)

@lindahua I see what you're saying, and you're right, but what you've described isn't quite how my code works. I've posted the original code here, and I hope that you'll have another look over it and perhaps run some benchmarks to clear up any confusion about its performance.

What you'll find is that for k << n, Fisher-Yates+Sort is indeed faster. However, for (roughly) k > n/20, my algorithm becomes better; in fact, for k > n/10 it's faster than Fisher-Yates before you even get to the sort.

For example, if I want 9e7 samples out of 1e8, Fisher-Yates+Sort takes 23 seconds versus 1.5 for my code, a 15-fold improvement. And this isn't just limited to large values of n: sampling 50 values out of 100 is 2.5 times faster with my code, and it remains faster down to k = 3 (albeit not by much).

As I said, please do have a look yourself to make sure I'm not missing anything.

@lindahua (Contributor)

The algorithm used in your gist looks quite interesting and I believe it would be faster when you want to draw a large number of samples.

I will play with it a little bit later.

@MikeInnes (Contributor, Author)

Great, thanks - let me know if you want any more info.

@binarybana

I went deep into my dark playground today and got some benchmark data. You can follow along yourself with this gist.

[Benchmark plots: 10k-9k-w and 10k-100-wout]

My takeaway: @one-more-minute is correct: his method is great for k ~ n situations, and it looks like (and performs quite similarly to) @stevengj's new randsubseq that just landed in Base (included in these comparisons only as a curiosity, since unlike all the other algorithms here it doesn't return exactly k samples).

But I'm a bit concerned about the (unknown) statistics of the algorithm. This field is littered with algorithms that intuitively should work but don't, due to subtle biases, as you're probably already aware.

Another takeaway is that (as @stevengj discusses in this issue comment) Floyd's algorithm is superior to the current in-order algorithms at all sizes (barring @one-more-minute's), and even the naive permuted version of Floyd's algorithm described here (which uses unshift! on arrays) is better for short sequences, as seen in the second picture above, especially compared with switching straight to prematurely terminated Fisher-Yates.
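For readers unfamiliar with it, here is a minimal sketch of the set-based version of Floyd's algorithm (the function name is hypothetical). It draws k distinct values from 1:n using exactly k random numbers, though the sample comes out unordered:

function floyd_sample(n::Integer, k::Integer)
    s = Set{Int}()
    for j = (n - k + 1):n
        t = rand(1:j)
        # if t was already picked, take j instead; j cannot already be in s,
        # since earlier iterations only drew from smaller ranges
        push!(s, in(t, s) ? j : t)
    end
    return collect(s)
end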

@stevengj (Contributor) commented May 7, 2014

Vitter has an algorithm that may be superior to Floyd's algorithm for small subsets, and produces in-order output with no extra storage or sort pass. I haven't yet implemented or benchmarked this algorithm, however.

For larger subsets (bigger than 1/10 or so), my experiments suggest that it is probably best to use the O(n) algorithm from Knuth TAOCP vol. 2 (Algorithm S, section 3.4.2), which also produces in-order output:

function randsubseq!(S::AbstractVector, A::AbstractArray)
    m = length(S)
    m == 0 && return S # needed for correctness: code below requires m > 0
    n = length(A)
    m == n && return copy!(S, A)
    0 <= m <= n || throw(DomainError())
    i = 0; j = 0
    while true
        # select A[i+1] with probability (m-j)/(n-i): the number of samples
        # still needed divided by the number of items left to scan
        if (n - i)*rand() <= m - j
           S[j += 1] = A[i += 1]
           j == m && return S
        else
           i += 1
        end
    end
end

@stevengj (Contributor) commented May 7, 2014

It might also be nice to mirror the naming in Base to make this more discoverable: as in my Knuth code above, make randsubseq!(S, A) select a random length(S) subsequence of A (without replacement), and let randsubseq(m::Integer, A) = randsubseq!(Array(eltype(A), m), A).
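Under that proposal, usage would look something like this (purely illustrative, assuming the definitions above and 0.3-era syntax):

A = [1:100]                         # the population
S = randsubseq!(Array(Int, 10), A)  # 10 in-order samples without replacement
T = randsubseq(10, A)               # the same thing via the convenience wrapper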

@binarybana

[Benchmark plot: vitter_rss2_10k-8k]

I have to run, but I quickly wanted to share these results. Here Vitter is Vitter's Algorithm D from the appendix of the 1984 ACM article (switching to Algorithm A for 'denser' subsets). Randsubseq2 is your Knuth code above.

Also, Vitter is the best for smaller requested sets (Knuth does poorly here):
[Benchmark plot: vitter_10k-200]

Updated gist is here and most likely has dog-eating bugs.

@stevengj (Contributor) commented May 7, 2014

Cool! Basically, the upshot is that Vitter's O(n) "Algorithm A" (barely) beats Knuth's O(n) "Algorithm S", and that Vitter's O(|output|) "Algorithm D" wins when the output is < 1/15 of the input.
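For anyone following along, here is a rough sketch of Vitter's Algorithm A, written from the paper's description (names and details are mine, not a vetted implementation). It scans the input once but inverts the skip-length distribution with a single uniform variate per selected item, so it uses one random number per output item rather than one per input item:

function vitter_a!(S::AbstractVector, A::AbstractArray)
    k = length(S)                 # number of samples still needed
    n = length(A)                 # number of items not yet scanned
    k <= n || throw(DomainError())
    k == 0 && return S
    i = 0; j = 0                  # items consumed / samples taken so far
    while k > 1
        # invert the skip-length CDF with one uniform variate:
        # P(skip >= s) = prod_{t=0}^{s-1} (n-k-t)/(n-t)
        v = rand()
        s = 0
        quot = (n - k) / n
        while quot > v
            s += 1
            quot *= (n - k - s) / (n - s)
        end
        i += s + 1                # skip s items and take the next one
        S[j += 1] = A[i]
        n -= s + 1
        k -= 1
    end
    S[j + 1] = A[i + ifloor(n*rand()) + 1]  # last sample: uniform over the rest
    return S
end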

@binarybana

So, for licensing purposes, do we need a 'clean room' implementation of the ACM code? Or can we use/relicense a translation of the code found in the appendix?

@stevengj (Contributor) commented May 7, 2014

Implementations based on pseudocode descriptions of an algorithm are generally supposed to be okay, I think, although excessively complex pseudocode seems to be a gray area and the usual caveats of legal uncertainty apply. The whole point of pseudocode is that it boils down the mathematical essence of an algorithm, independent of its expression in any particular programming language, and mathematical algorithms per se are not supposed to be copyrightable.

Direct translation of Fortran code is probably not okay, though.

@stevengj (Contributor) commented May 7, 2014

(I will never, ever, publish anything in ACM, for this reason. For years, they've exploited the copyright-naiveté of authors who, in most cases, clearly intended to make their code usable by anyone with few or no restrictions, often posting their code to Netlib. ACM has taken this code, claimed the copyright ownership, and used it to create a huge copyright land mine covering many of the most important algorithms in computer science and numerical analysis.)

@MikeInnes (Contributor, Author)

If there's still interest in using my approach, I'm happy to explain the statistics behind it so it can be verified, when I have some time. It looks like it might be at least as good as Vitter for large k, assuming it's correct, and it's entirely clean-room, so there are no licensing issues.

For the record, I did do some sanity checks and I'm pretty confident in the maths, so it certainly isn't obviously biased, but I can understand that that's not really enough.

@stevengj (Contributor) commented May 8, 2014

For large k, we can just use Knuth's Algorithm S if we can't use Vitter.

But I really think we should be able to implement Vitter's algorithm from the pseudocode in his paper.
