
WIP: Avoid duplicates in sprand #4726

Merged · 2 commits · Nov 25, 2013
Conversation

@timholy (Member) commented Nov 4, 2013

While fixing #4723, I noticed that sprand() often provided matrices of lower density than requested. At the cost of slower performance, here's a fix. Do we want this? If so, do we need to use more sophisticated sampling procedures (as in Distributions)?

@JeffBezanson (Member)

@ViralBShah

@ViralBShah (Member)

Using stuff from Distributions would be nice - perhaps we have an overloaded sprand in Distributions?

In general, we should be ok with giving up some performance for a better sprand, but the choice of the cutoff feels a bit ad-hoc. @dmbates Would be nice to have your thoughts.

@johnmyleswhite (Member)

Might be nice to have an interface that looked something like sprand(Poisson(1.232), 10, 10, 0.1).

@timholy (Member, Author) commented Nov 6, 2013

@ViralBShah, ad-hoc pretty much sums it up---basically I threw this out there to report the issue as much as fix it.

We could at least make it so that it's an "algorithmic cutoff" but not a "result cutoff": when collisions are unlikely (here, corresponding to the path for density < 0.1), we could check for duplicates and as needed generate more entries. However, even with this strategy there will be an uncomfortable difference in the performance of sprand(10^4, 10^4, 0.099) from sprand(10^4, 10^4, 0.101). Yuck.

Another problem with my few-lines solution is that it requires storing something the size of the dense array just to generate a sparse array. You could easily run out of memory this way, although I guess once density crosses 0.33 the sparse array takes more memory than the dense one anyway (on 64-bit). It also seems that most sampling-without-replacement algorithms (including the Fisher-Yates algorithm in Distributions) involve storing the full list, so they have the same problem as the one in this PR. Here's one that seems to avoid this problem, but I have no experience with such things. I guess we should ask ourselves whether we need these matrices to be of high-quality randomness, or whether we can accept something of lower quality.
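For reference, one well-known way to sample without replacement in O(k) extra memory is Robert Floyd's algorithm. This is a sketch in Python (not Julia code from this PR, and `floyd_sample` is a hypothetical name) of the idea: it never materializes the full range of candidates.

```python
import random

def floyd_sample(n, k, rng=None):
    """Robert Floyd's algorithm: sample k distinct integers from 0..n-1
    using O(k) extra memory; the full list 0..n-1 is never stored."""
    rng = rng or random.Random()
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)          # uniform over 0..j
        # If t was already picked, take j instead; j itself is always new,
        # so each iteration adds exactly one element.
        chosen.add(j if t in chosen else t)
    return chosen
```

Each of the k iterations adds exactly one new element, so the result always has exactly k distinct values, and only the k-element set is ever stored.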

@johnmyleswhite, do you just mean that the interface should match a pattern established in Distributions? You can supply a rng as the final argument, not the first.

@ViralBShah (Member)

Can we hold off on this until after the 0.2 release?

@ViralBShah (Member)

Matlab tries harder than we do to get the density right. I would love to make this higher quality. Perhaps there could be an optional argument that selects the quick-and-dirty solution, which may be off when there are more collisions at higher densities.

@timholy (Member, Author) commented Nov 6, 2013

I think waiting until after 0.2 is a very good idea.

@johnmyleswhite (Member)

Just to touch on this for post-0.2: I only thought it might be nice for the interface here to be consistent with the pattern established in Distributions, where the distribution drawn from (rather than its rand function) is provided as the first argument.

@JeffBezanson (Member)

bump

@ViralBShah (Member)

One thing we can do is to keep generating tuples until we hit the required density. The other thing we could do is to explicitly tell the user that we will only draw m*n*density nonzeros, without worrying about collisions.
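The first option (keep drawing until the target is hit) can be sketched as follows. This is an illustrative Python helper (`sample_positions` is a hypothetical name, not code from this PR): it discards collisions and keeps drawing until exactly the requested number of distinct positions exists.

```python
import random

def sample_positions(m, n, density, rng=None):
    """Draw random (i, j) pairs, discarding collisions via a set,
    until round(m*n*density) distinct nonzero positions are collected."""
    rng = rng or random.Random()
    target = round(m * n * density)
    positions = set()
    while len(positions) < target:
        positions.add((rng.randrange(m), rng.randrange(n)))
    return positions
```

This hits the density exactly by construction, but the expected number of extra draws grows as the density approaches 1, which is the performance concern discussed above.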

@timholy (Member, Author) commented Nov 24, 2013

I've got a solution on the way, hopefully later today.

@timholy (Member, Author) commented Nov 24, 2013

OK, this caused a wee bit more adventure than intended...

This new version seems reasonable to me. It takes two main steps:

  • It splits on density=0.5, and for higher densities it randomly draws the indices you want to skip (leave as zeros). If you have to keep sampling until you get enough draws, at very high densities you will have to vastly oversample, and that's not good. Splitting on 0.5 prevents this from ever being a serious problem.
  • It approximately "inverts" the birthday problem using Newton's method to estimate how many draws it will take to generate the target number of random samples.

This doesn't hit the target density exactly (there's basically sqrt(N) uncertainty), but it does a much better job than before, and when the density is high also a much better job than Matlab. In Matlab, when I asked for 0.9 I got around 0.64, whereas this hits pretty closely no matter what you choose. If we want to hit it exactly, we could oversample and then randomly ditch items, but in my personal view that's not worth it. If others disagree, I'm happy to change this.
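The birthday-problem inversion in the second step can be sketched like this. After m uniform draws into N cells, the expected number of distinct cells is N·(1 − (1 − 1/N)^m); given a target count k, Newton's method solves for m. This is a Python sketch of that idea only (the merged Julia code differs in detail, and `draws_needed` is a hypothetical name):

```python
import math

def draws_needed(N, k, iters=10):
    """Estimate how many uniform draws into N cells yield k distinct
    values in expectation, via Newton's method on
    f(m) = N*(1 - (1 - 1/N)**m) - k."""
    q = 1.0 - 1.0 / N
    m = float(k)                          # initial guess: no collisions
    for _ in range(iters):
        f = N * (1.0 - q**m) - k
        fp = -N * (q**m) * math.log(q)    # f'(m) > 0 since log(q) < 0
        m -= f / fp
    return m
```

Because f is increasing and concave and the initial guess m = k lies at or below the root, the iterates converge monotonically; the result agrees with the closed form log(1 − k/N)/log(1 − 1/N).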

@StefanKarpinski (Member)

This is really cool! I think that having an algorithm whose actual expected density matches what was given is good enough, especially since this keeps the variance of that expectation fairly low (as opposed to having ridiculously high variance for large densities). Clearly, the fact that Matlab doesn't give anywhere close to the requested density means this can't be hugely important to get exact.

@timholy (Member, Author) commented Nov 25, 2013

Thanks, Stefan. As far as I know, this is ready to go. @ViralBShah, if you approve, feel free to merge.

@timholy (Member, Author) commented Nov 25, 2013

@johnmyleswhite, I'm also happy to discuss API changes as a separate issue, but I had the impression that one goal for 0.3 is to minimize API changes. If that's still true, then perhaps we should wait until 0.4.

ViralBShah added a commit that referenced this pull request Nov 25, 2013
WIP: Avoid duplicates in sprand
@ViralBShah ViralBShah merged commit bd0100e into master Nov 25, 2013
@ViralBShah (Member)

@timholy This is awesome stuff. We do not need to hit the density exactly - this is perfect.

@ViralBShah (Member)

BTW, I think we should go ahead with API changes for sprand. It is unlikely to be widely used as of now, and the sooner we change, the better. We do not want to make API changes for commonly used stuff for 0.3, but we do need to keep improving stuff like this.

@StefanKarpinski (Member)

Honestly, I'm not sure how realistic the "no API changes in 0.3" thing is, unless we make it a really short release cycle. In which case, we should prepare for a bunch of breaking changes in 0.4.


@timholy timholy mentioned this pull request May 1, 2014
@timholy timholy deleted the teh/sprand branch July 24, 2014 00:08
@ViralBShah ViralBShah added randomness Random number generation and the Random stdlib sparse Sparse arrays labels Nov 22, 2014