incorrect statistics for sprand #6714
Basically, the problem seems to lie in the algorithm for computing a random subset of the indices. One possible algorithm is Algorithm S from Knuth volume II, section 3.4.2, which I think is the same as Vitter's algorithm. However, this method requires O(N) time if I remember correctly, so if n << N you want to switch to a different algorithm.
I spent some time looking into sampling algorithms, and this was the best I could come up with. (I made it up myself, because I was unhappy with some aspect of everything else I found.)

Did you try increasing the number of iterations in the Newton's-method inversion of the birthday problem? Maybe the error is there, or is just due to something dumb like an off-by-one bug (well, here, an off-by-0.3 bug 😄). But I have to confess that I don't personally see this as a particularly big problem; we're already so much closer to the target density than, say, Matlab, that I wonder whether this actually matters or is largely an intellectual exercise. But if you think it's important, by all means make improvements.
I think that if I were doing statistical experiments with sparse random matrices, I would be unhappy to be limited to 2 significant digits. If we have a general randsubset function, then it will be even more important to have the right statistics. I certainly agree that we can't use O(N) algorithms for very sparse subsets. A randsubset function would want to switch between algorithms depending on the parameters.
Steps that might fix this: …
More Newton iterations don't help, but dithering seems to help, i.e.: …
However, it looks like you are using only an approximate solution to the birthday problem, so there may be some intrinsic error. Is that right?
It's been so long I really can't remember, but certainly I don't recall (intending) to use an approximation. That doesn't mean I didn't screw it up somehow 😄.
It looks like you are using something like approximation 10 or 11 from this article? The actual birthday-problem probability is a much more complicated recurrence relation that doesn't look anything like what your Newton iteration could be solving.
Well, what you want is the expected number of unique values, not the probability of duplications. Let's see if I've done this correctly. If for any specific realization of sampling you let …
where …
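The math in this comment is not recoverable from the text above, but the standard result it appears to be invoking can be stated self-containedly: if $m$ indices are drawn uniformly and independently with replacement from $n$ possibilities, the expected number of unique values is

```latex
\mathbb{E}[\text{unique}] = n\left(1 - \left(1 - \tfrac{1}{n}\right)^{m}\right)
```

since each of the $n$ values is missed by a single draw with probability $1 - 1/n$, hence by all $m$ draws with probability $(1 - 1/n)^m$, and linearity of expectation sums this over the $n$ values.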
@stevengj, it occurs to me that if you asked because you're still seeing some tiny error in the mean number of nonzeros, it could be because we didn't get the dithering quite right. I'll bet the problem arises from the function's curvature between …
@timholy, looks like the corrected dithering does the trick!
Great to hear!
Do you know of any good way to compute …?
Good catch, that's indeed something to worry about. How about …
Or rather …
Right, that should do it. Does this function have a name?
If it does, I don't know it. But it does seem common enough to be in some math library.
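The specific expression in this exchange is elided above, so this is only a guess at the intent: the "function that ought to be in some math library" sounds like the standard `log1p`/`expm1` pair, which is the usual way to evaluate quantities such as 1 - (1 - 1/n)^m without catastrophic cancellation when n is large. A sketch in Python (Julia has the same two functions):

```python
import math

def frac_unique(n, m):
    """Evaluate 1 - (1 - 1/n)**m accurately, even for very large n.

    Rewriting via logs: (1 - 1/n)**m = exp(m * log1p(-1/n)),
    so the whole expression is -expm1(m * log1p(-1/n)).
    """
    return -math.expm1(m * math.log1p(-1.0 / n))
```

For moderate n this agrees with the naive formula; the payoff is that `log1p(-1/n)` stays accurate where `log(1 - 1/n)` would round 1 - 1/n to 1.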
Since the random-subset problem shows up so often and is subtle to implement well, it would be good to put this in an exported subroutine of some sort. (If we adopt my …) However, there are several variants, so it's not clear what the interface should look like. Given a container …
We could have an interface like: …
(For the case of …)
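The proposed signatures are elided above, so purely as an illustration, a minimal shape for such an interface might look like the following Python sketch. The name `randsubset` comes from the thread; everything else here is an assumption, and the body is a placeholder rather than any of the algorithms under discussion:

```python
import random

def randsubset(items, k):
    """Return a list of k distinct elements of `items`, chosen uniformly
    at random without replacement.

    Placeholder implementation; the thread discusses dispatching among
    several algorithms (Floyd, Algorithm S, skip-based) by parameters.
    """
    if not 0 <= k <= len(items):
        raise ValueError("k must be between 0 and the container length")
    indices = random.sample(range(len(items)), k)
    return [items[i] for i in indices]
```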
The notion of how to "bill" this is some of what held me back from splitting this off from … For hitting the target exactly, Floyd's algorithm looks good (it's basically what Stefan was using, but using a …). Regarding order, I used … I haven't thought about the …
@andreasnoackjensen, the reservoir problem is when you don't know the length of the input in advance. But you're right, I should look at the algorithms in StatsBase. And yes, we are talking about random sampling without replacement. However, at first glance it looks like Floyd's algorithm is better than …
@timholy, Floyd's algorithm doesn't reject any random numbers. It is just:
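The code for this comment isn't shown above, so here is a generic rendering of Floyd's algorithm (described by Bentley and Floyd in their 1987 CACM column "A sample of brilliance"), sketched in Python rather than the thread's Julia:

```python
import random

def floyd_sample(n, m):
    """Return a set of m distinct integers from 1..n (Floyd's algorithm).

    Uses exactly m random draws; no draw is ever rejected or retried.
    """
    assert 0 <= m <= n
    chosen = set()
    for j in range(n - m + 1, n + 1):
        t = random.randint(1, j)  # uniform on 1..j
        # If t was already chosen, insert j instead: j is new this
        # iteration, and this substitution preserves uniformity.
        chosen.add(j if t in chosen else t)
    return chosen
```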
(I also suspect that it is not worth optimizing the …)
Gotcha, thanks.
It looks like you don't actually need Newton's method. The number of expected trials …
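The formula itself is elided above, but assuming the expected-unique-count relation E[unique] = n(1 - (1 - 1/n)^m), solving for the number of trials m that hits a target k of unique values needs only logarithms, no Newton iteration. A Python sketch under that assumption:

```python
import math

def trials_for_target(n, k):
    """Solve n * (1 - (1 - 1/n)**m) == k for m in closed form.

    m = log(1 - k/n) / log(1 - 1/n), written with log1p for accuracy.
    The fractional result would then be dithered to an integer trial
    count, as discussed elsewhere in the thread.
    """
    return math.log1p(-k / n) / math.log1p(-1.0 / n)
```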
Nice catch, indeed.
Hmm, if m > n/20 (roughly) it looks like Knuth's Algorithm S is substantially faster than either your algorithm or Floyd's algorithm, and produces ordered output. And according to exercise 8 of Knuth section 3.4.2, it may be possible to speed up Algorithm S substantially for the case of m << n, simply by calculating how many elements to skip rather than testing the inputs one at a time. Something like this may end up being the way to go, and apparently it's described in detail in "Faster methods for random sampling," Commun. ACM 27(7) (July 1984), 703-718. Worth looking into, at any rate.
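For reference, Knuth's Algorithm S (sequential selection sampling) can be sketched in Python as follows; it visits each of the n candidates once, which is why it is O(n), but its output comes out already sorted:

```python
import random

def algorithm_s(n, m):
    """Ordered sample of m distinct integers from 1..n
    (Knuth, TAOCP vol. II, section 3.4.2, Algorithm S)."""
    sample = []
    for i in range(1, n + 1):
        remaining = n - i + 1      # candidates left, including i
        needed = m - len(sample)   # slots still to fill
        # Select i with probability needed / remaining.
        if random.random() * remaining < needed:
            sample.append(i)
            if len(sample) == m:
                break
    return sample
```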
Sounds good, let's go with whatever works best.
The more I think about it, the more I think we really just want two functions:
The first can use Vitter's algorithm. The latter is what I need for my column-by-column … The good news is that the second function looks quite straightforward to implement efficiently. Basically, you just need to skip sequentially through the array …
The following …
(I've done some basic statistical tests, but an extra pair of eyeballs would be appreciated.)
I like it! Is it faster than your speed-improved version, or just the one in base? I imagine the main disadvantage of this approach is the number of times …

Is the distribution of the number produced exactly Poisson? I would guess that this must be approximately true (to reasonably high precision) of my algorithm as well, but I agree that yours seems to have a distribution that would be more easily analyzed. A couple of minor comments:
… would greatly reduce the frequency with which one will have to do a … Also, since …
@timholy, my version in #6726 is faster than all of the versions above. The distribution is exactly a binomial distribution, which asymptotes to a Poisson distribution. I thought about allocating a few extra standard deviations as you suggest. I don't think it will actually make much difference in practice because … My version in #6726 correctly handles the case where … Probably further discussion should happen in #6726.

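The code in #6726 is not shown here, but the property claimed above (the nonzero count is exactly binomial) can be illustrated with a generic sketch; the naive O(n) coin-flip loop for the binomial draw is for clarity only and is not how #6726 works:

```python
import random

def binomial_nnz_indices(n, p):
    """Sketch: draw the nonzero count exactly Binomial(n, p), then pick
    that many distinct positions uniformly. Illustrative only; the
    actual implementation in #6726 differs."""
    k = sum(random.random() < p for _ in range(n))  # Binomial(n, p)
    return sorted(random.sample(range(1, n + 1), k))
```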
When looking into #6708, I noticed that the sprand function seems to generate sparse matrices with slightly the wrong density. For example, the mean number of nonzeros does not seem to be converging to 100.