WIP: Avoid duplicates in sprand #4726
Using stuff from Distributions would be nice - perhaps we have an overloaded In general, we should be ok with giving up some performance for a better |
Might be nice to have an interface that looked something like |
@ViralBShah, ad-hoc pretty much sums it up: basically I threw this out there to report the issue as much as fix it. We could at least make it so that it's an "algorithmic cutoff" but not a "result cutoff": when collisions are unlikely (here, corresponding to the path for density < 0.1), we could check for duplicates and generate more entries as needed. However, even with this strategy there will be an uncomfortable difference in the performance of Another problem with my few-lines solution is that it requires storing something the size of the dense array just to generate a sparse array. You could easily run out of memory this way, although I guess once density crosses 0.33 then the sparse array is going to take more memory than the dense one anyway (on 64-bit). It also seems that most sampling-without-replacement algorithms (including the Fisher-Yates algorithm in Distributions) involve storing the full list, so they have the same problem as the one in this PR. Here's one that seems to avoid this problem, but I have no experience with such things. I guess we should ask ourselves if we need these matrices to be of high-quality randomness, or whether we can accept something of lower quality. @johnmyleswhite, do you just mean that the interface should match a pattern established in Distributions? You can supply an RNG as the final argument, not the first. |
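For illustration only: the algorithm linked in the comment above is not recoverable from this thread, so the sketch below swaps in a different, well-known technique, Floyd's sampling algorithm. It draws `k` distinct indices from `n` candidates while storing only the `k` chosen values, never the full list. It is written in Python rather than Julia, and the function name `floyd_sample` is hypothetical; it is not part of `sprand` or Distributions.

```python
import random


def floyd_sample(n, k, rng):
    """Return a set of k distinct integers drawn uniformly from range(n).

    Floyd's algorithm: memory use is O(k), not O(n), because only the
    chosen values are stored. (Illustrative sketch, not sprand's code.)
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    chosen = set()
    for j in range(n - k, n):
        t = rng.randint(0, j)  # uniform on 0..j, both ends inclusive
        # If t was already picked, take j instead; j cannot be in the set
        # yet because every earlier pick was strictly less than j.
        chosen.add(j if t in chosen else t)
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    m, n = 1000, 1000
    linear = floyd_sample(m * n, 50, rng)
    # Convert linear indices to (row, col) pairs for an m-by-n matrix.
    entries = [divmod(idx, n) for idx in linear]
    print(len(entries))
```

Each iteration adds exactly one new element, so the result always has exactly `k` entries regardless of how dense the request is.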
Can we hold on this until after the 0.2 release? |
Matlab tries harder than we do to get the density right. I would love to make this higher quality. Perhaps there can be an optional argument that gives the quick and dirty solution that may be off when there are more collisions at higher densities. |
I think waiting until after 0.2 is a very good idea. |
Just to touch on this for post-0.2: I only thought it might be nice for the interface here to be consistent with the pattern established in Distributions, where the distribution drawn from (rather than its rand function) is provided as the first argument. |
bump |
One thing we can do is to keep generating tuples until we hit the required density. The other thing we could do is to explicitly tell the user that we will only draw |
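A minimal sketch of the first option, assuming "required density" means a target count of `round(density * m * n)` distinct positions. It is in Python for illustration, and the name `sample_until_full` is hypothetical rather than anything in Julia's `sprand`.

```python
import random


def sample_until_full(m, n, density, rng):
    """Draw (row, col) pairs until the target number of distinct pairs is hit.

    Duplicates are absorbed by the set, so the loop simply keeps drawing.
    (Illustrative sketch under the assumptions stated in the lead-in.)
    """
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be in [0, 1]")
    target = round(density * m * n)
    seen = set()
    while len(seen) < target:
        seen.add((rng.randrange(m), rng.randrange(n)))
    return seen


if __name__ == "__main__":
    pairs = sample_until_full(20, 30, 0.25, random.Random(0))
    print(len(pairs))
```

The trade-off is running time: as the density approaches 1, most draws land on positions already taken, so the number of draws needed grows much faster than the number of entries produced.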
I've got a solution on the way, hopefully later today. |
OK, this caused a wee bit more adventure than intended... This new version seems reasonable to me. It takes two main steps:
This doesn't hit the target density exactly (there's basically |
This is really cool! I think that having an algorithm whose actual expected density matches what was given is good enough, especially since this keeps the variance of that expectation fairly low (as opposed to having ridiculously high variance for large densities). Clearly, the fact that Matlab doesn't give anywhere close to the requested density means this can't be hugely important to get exact. |
Thanks, Stefan. As far as I know, this is ready to go. @ViralBShah, if you approve, feel free to merge. |
@johnmyleswhite, I'm also happy to discuss API changes as a separate issue, but I had the impression that one goal for 0.3 is to minimize API changes. If that's still true, then perhaps we should wait until 0.4. |
@timholy This is awesome stuff. We do not need to hit the density exactly - this is perfect. |
BTW, I think we should go ahead with API changes for |
Honestly, I'm not sure how realistic the "no API changes in 0.3" thing is, unless we make it a really short release cycle. In which case, we should prepare for a bunch of breaking changes in 0.4.
While fixing #4723, I noticed that sprand() often provided matrices of lower density than requested. At the cost of slower performance, here's a fix. Do we want this? If so, do we need to use more sophisticated sampling procedures (as in Distributions)?
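To make the shortfall concrete, here is a rough sketch under one assumption that the thread implies but does not spell out: positions are drawn with replacement and duplicates collapse into a single entry. With `N` positions and `k = density * N` draws, the expected fraction of distinct positions is `1 - (1 - 1/N)**k`, which is roughly `1 - exp(-density)`. The code is in Python for illustration and the function names are hypothetical; it is not the `sprand` source.

```python
import random


def naive_density(m, n, density, rng):
    """Achieved density when round(density*m*n) positions are drawn
    with replacement and duplicates are merged. (Illustrative only.)"""
    total = m * n
    draws = round(density * total)
    distinct = {rng.randrange(total) for _ in range(draws)}
    return len(distinct) / total


def expected_density(total, draws):
    """Expected fraction of distinct positions after `draws` uniform
    draws with replacement from `total` positions."""
    return 1.0 - (1.0 - 1.0 / total) ** draws


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.01, 0.1, 0.5):
        got = naive_density(200, 200, p, rng)
        want = expected_density(200 * 200, round(p * 200 * 200))
        print(f"requested={p:.2f} achieved={got:.4f} expected={want:.4f}")
```

Under this assumption the gap is negligible at low density and grows as the requested density rises, which fits the earlier remark that collisions are unlikely on the low-density path.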