
WIP: Avoid duplicates in sprand #4726

Merged · 2 commits · Nov 25, 2013
Conversation

@timholy (Member) commented Nov 4, 2013

While fixing #4723, I noticed that sprand() often provided matrices of lower density than requested. At the cost of slower performance, here's a fix. Do we want this? If so, do we need to use more sophisticated sampling procedures (as in Distributions)?

@JeffBezanson (Member)

@ViralBShah

@ViralBShah (Member)

Using stuff from Distributions would be nice - perhaps we have an overloaded sprand in Distributions?

In general, we should be ok with giving up some performance for a better sprand, but the choice of the cutoff feels a bit ad-hoc. @dmbates Would be nice to have your thoughts.

@johnmyleswhite (Member)

Might be nice to have an interface that looked something like sprand(Poisson(1.232), 10, 10, 0.1).

@timholy (Member, Author) commented Nov 6, 2013

@ViralBShah, ad-hoc pretty much sums it up---basically I threw this out there to report the issue as much as fix it.

We could at least make it so that it's an "algorithmic cutoff" but not a "result cutoff": when collisions are unlikely (here, corresponding to the path for density < 0.1), we could check for duplicates and as needed generate more entries. However, even with this strategy there will be an uncomfortable difference in the performance of sprand(10^4, 10^4, 0.099) from sprand(10^4, 10^4, 0.101). Yuck.

Another problem with my few-lines solution is that it requires storing something the size of the dense array just to generate a sparse array. You could easily run out of memory this way, although I guess once density crosses 0.33 the sparse array takes more memory than the dense one anyway (on 64-bit). It also seems that most sampling-without-replacement algorithms (including the Fisher-Yates algorithm in Distributions) involve storing the full list, so they have the same problem as the one in this PR. Here's one that seems to avoid this problem, but I have no experience with such things. I guess we should ask ourselves whether we need these matrices to be of high-quality randomness, or whether we can accept something of lower quality.
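For reference, one well-known way to sample without replacement in O(k) extra memory is Robert Floyd's algorithm. This is a sketch in Python (not Julia code from this PR, and `floyd_sample` is a hypothetical name) of the idea: it never materializes the full range of candidates.

```python
import random

def floyd_sample(n, k, rng=None):
    """Robert Floyd's algorithm: sample k distinct integers from 0..n-1
    using O(k) extra memory; the full list 0..n-1 is never stored."""
    rng = rng or random.Random()
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)          # uniform over 0..j
        # If t was already picked, take j instead; j itself is always new,
        # so each iteration adds exactly one element.
        chosen.add(j if t in chosen else t)
    return chosen
```

Each of the k iterations adds exactly one new element, so the result always has exactly k distinct values, and only the k-element set is ever stored.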

@johnmyleswhite, do you just mean that the interface should match a pattern established in Distributions? You can supply a rng as the final argument, not the first.

@ViralBShah (Member)

Can we hold off on this until after the 0.2 release?

@ViralBShah (Member)

Matlab tries harder than we do to get the density right. I would love to make this higher quality. Perhaps there could be an optional argument that selects the quick-and-dirty solution, which may be off when there are more collisions at higher densities.

@timholy (Member, Author) commented Nov 6, 2013

I think waiting until after 0.2 is a very good idea.

@johnmyleswhite (Member)

Just to touch on this for post-0.2: I only thought it might be nice for the interface here to be consistent with the pattern established in Distributions, where the distribution drawn from (rather than its rand function) is provided as the first argument.

@JeffBezanson (Member)

bump

@ViralBShah (Member)

One thing we can do is to keep generating tuples until we hit the required density. The other thing we could do is to explicitly tell the user that we will only draw m*n*density nonzeros, without worrying about collisions.
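The first option (keep drawing until the target is hit) can be sketched as follows. This is an illustrative Python helper (`sample_positions` is a hypothetical name, not code from this PR): it discards collisions and keeps drawing until exactly the requested number of distinct positions exists.

```python
import random

def sample_positions(m, n, density, rng=None):
    """Draw random (i, j) pairs, discarding collisions via a set,
    until round(m*n*density) distinct nonzero positions are collected."""
    rng = rng or random.Random()
    target = round(m * n * density)
    positions = set()
    while len(positions) < target:
        positions.add((rng.randrange(m), rng.randrange(n)))
    return positions
```

This hits the density exactly by construction, but the expected number of extra draws grows as the density approaches 1, which is the performance concern discussed above.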

@timholy (Member, Author) commented Nov 24, 2013

I've got a solution on the way, hopefully later today.

@timholy (Member, Author) commented Nov 24, 2013

OK, this caused a wee bit more adventure than intended...

This new version seems reasonable to me. It takes two main steps:

  • It splits on density=0.5, and for higher densities it randomly draws the indices you want to skip (leave as zeros). If you have to keep sampling until you get enough draws, at very high densities you will have to vastly oversample, and that's not good. Splitting on 0.5 prevents this from ever being a serious problem.
  • It approximately "inverts" the birthday problem using Newton's method to estimate how many draws it will take to generate the target number of random samples.

This doesn't hit the target density exactly (there's basically sqrt(N) uncertainty), but it does a much better job than before, and when the density is high also a much better job than Matlab. In Matlab, when I asked for 0.9 I got around 0.64, whereas this hits pretty closely no matter what you choose. If we want to hit it exactly, we could oversample and then randomly ditch items, but in my personal view that's not worth it. If others disagree, I'm happy to change this.
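The birthday-problem inversion in the second step can be sketched like this. After m uniform draws into N cells, the expected number of distinct cells is N·(1 − (1 − 1/N)^m); given a target count k, Newton's method solves for m. This is a Python sketch of that idea only (the merged Julia code differs in detail, and `draws_needed` is a hypothetical name):

```python
import math

def draws_needed(N, k, iters=10):
    """Estimate how many uniform draws into N cells yield k distinct
    values in expectation, via Newton's method on
    f(m) = N*(1 - (1 - 1/N)**m) - k."""
    q = 1.0 - 1.0 / N
    m = float(k)                          # initial guess: no collisions
    for _ in range(iters):
        f = N * (1.0 - q**m) - k
        fp = -N * (q**m) * math.log(q)    # f'(m) > 0 since log(q) < 0
        m -= f / fp
    return m
```

Because f is increasing and concave and the initial guess m = k lies at or below the root, the iterates converge monotonically; the result agrees with the closed form log(1 − k/N)/log(1 − 1/N).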

@StefanKarpinski (Member)

This is really cool! I think that having an algorithm whose actual expected density matches what was given is good enough, especially since this keeps the variance of that expectation fairly low (as opposed to having ridiculously high variance for large densities). Clearly, the fact that Matlab doesn't give anywhere close to the requested density means this can't be hugely important to get exact.

@timholy (Member, Author) commented Nov 25, 2013

Thanks, Stefan. As far as I know, this is ready to go. @ViralBShah, if you approve, feel free to merge.

@timholy (Member, Author) commented Nov 25, 2013

@johnmyleswhite, I'm also happy to discuss API changes as a separate issue, but I had the impression that one goal for 0.3 is to minimize API changes. If that's still true, then perhaps we should wait until 0.4.

ViralBShah added a commit that referenced this pull request Nov 25, 2013
WIP: Avoid duplicates in sprand
@ViralBShah ViralBShah merged commit bd0100e into master Nov 25, 2013
@ViralBShah (Member)

@timholy This is awesome stuff. We do not need to hit the density exactly - this is perfect.

@ViralBShah (Member)

BTW, I think we should go ahead with API changes for sprand. It is unlikely to be widely used as of now, and the sooner we change, the better. We do not want to make API changes for commonly used stuff for 0.3, but we do need to keep improving stuff like this.

@StefanKarpinski (Member)

Honestly, I'm not sure how realistic the "no API changes in 0.3" thing is, unless we make it a really short release cycle. In which case, we should prepare for a bunch of breaking changes in 0.4.


@timholy timholy mentioned this pull request May 1, 2014
@timholy timholy deleted the teh/sprand branch July 24, 2014 00:08
@ViralBShah ViralBShah added randomness Random number generation and the Random stdlib sparse Sparse arrays labels Nov 22, 2014