Faster and simpler sprand for SparseMatrixCSC #30494
Conversation
Would this be considered a breaking change, as the same random seed produces different sparse matrices?

I don't understand (I'd have to look closely) what impact this has on the produced distribution. It is a breaking change, but it will be in a stdlib.

At first glance, it seems like this will produce the same distribution and is indeed a lot simpler. (I don't think we should be guaranteeing specific results from given random seeds, as otherwise we won't be able to change any of the random algorithms.)

Should be the same distribution (each of the n*m entries is nonzero independently with probability p). It outputs a different random matrix because it uses fewer random draws. In addition, it allocates only the output arrays (it needs no temporary arrays).
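For intuition, here is a hedged sketch (not this PR's actual code) of how the same distribution can be sampled with fewer draws: instead of testing every entry, draw geometric gaps between successive nonzeros in column-major order, so the expected work is O(nnz) rather than O(m*n). The function name `sprand_sketch` is hypothetical.

```julia
using Random, SparseArrays

# Hedged sketch, not the PR's actual implementation: sample nonzero
# positions in column-major order by drawing geometric gaps. Each of the
# m*n entries is still nonzero independently with probability p.
function sprand_sketch(rng::AbstractRNG, m::Integer, n::Integer, p::Real)
    0 < p < 1 || throw(ArgumentError("this sketch assumes 0 < p < 1"))
    inds = Int[]              # linear (column-major) indices of the nonzeros
    L = log1p(-p)             # log(1 - p), used to invert the geometric CDF
    k = 0
    while true
        # gap to the next nonzero entry: geometric with success probability p
        k += floor(Int, log1p(-rand(rng)) / L) + 1
        k > m * n && break
        push!(inds, k)
    end
    I = [(i - 1) % m + 1 for i in inds]   # row indices
    J = [(i - 1) ÷ m + 1 for i in inds]   # column indices
    sparse(I, J, rand(rng, length(inds)), m, n)
end
```

Because the indices come out already sorted in column-major order, no temporary arrays or sorting are needed, and only about two random draws per nonzero are consumed.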
I agree. Just for the record, the following passes (I can add this or a similar test if you think it's useful)
FYI, here is NumPy's policy on its RNG, https://www.numpy.org/neps/nep-0019-rng-policy.html (which was recently weakened). It'd be nice if Julia also had a written policy for the RNG explaining exactly what kind of compatibility guarantee should be expected.
The following would fix #30502. Should I add it or better keep things separated?
Additionally, it would be nice to remove the T parameter from the versions accepting rfn, since you can just pass an rfn returning the type you want (T would remain on the versions without rfn, of course).
A separate PR for #30502 would be better.
An interesting read, thanks. I can't say that the policy pictured is completely clear-cut, though (e.g. breaking changes are "allowed with caution"). In any case it seems to restrict the required stability to "core" RNGs rather than higher-level ones such as this. Would you agree?

Ok, thanks. I'll wait to see the fate of this one first, then.
What I agree with is that the performance improvement is awesome and we should have it (thanks for the PR!). What I'm not sure about is allowing "non-core" random functions to change with less consideration. If Random.jl comes up with some compatibility policy on versioning, I think it should be uniformly applied to any post-1.0 Julia packages defining random functions. In any case, I think an ideal situation would be that such libraries are completely decoupled from Julia's release cycle (though I know it's hard for Random.jl), so that "freezing" the random number algorithms per project can easily be done via a Project.toml. We can't do such decoupling at the moment, so I guess I'll trust the core devs' decisions.
I've added some tests making the output distribution of

We should rebase after merging #30576 and make sure the
Can you rebase on master? |
I don't really understand the former version and don't have time now to dive into it, but the proposed version here is straightforward (so much so that it doesn't require any comment to understand the logic), which is a very good point in itself. As it's also generally faster, it seems like a no-brainer to merge this change... the question is when?! (wrt the non-breaking RNG policy)

I'm under the impression that in the last year it was mentioned (more or less formally) more than once that streams of randomly generated numbers would be reproducible after Julia 1.0 is out, at least until stdlibs are versioned. I feel that merging this now would require a clear (official?) note on what is considered breaking wrt random number generation.
I don't think we ever made a promise around the structure of sparse random matrices. I think it is fine to merge.

Failure is unrelated.
You could provide your own random-number generator. For your application, it sounds like you don't care too much about the statistical quality of the pseudorandom numbers, so you could supply e.g. a 32-bit linear congruential generator with a fixed seed in a few lines of code:

```julia
using Random

mutable struct LCG <: AbstractRNG
    state::UInt32
end

Base.rand(r::LCG, ::Type{UInt32}) = r.state = 0x0019660d * r.state + 0x3c6ef35f
Base.rand(r::LCG, ::Random.SamplerTrivial{Random.UInt52{UInt64},UInt64}) =
    (rand(r, UInt32) % UInt64) + ((rand(r, UInt32) % UInt64) << 20)
```

at which point you can do e.g.

It might be good to have a package with a simple LCG random-number generator for applications such as this.
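For illustration, here is a hedged sketch of the determinism such a generator provides. The `LCG` definition is repeated from the comment above so the snippet is self-contained; whether the extra `UInt52` sampler hook suffices for `sprand` depends on the Random internals of the Julia version in use, so that part is an assumption.

```julia
using Random

# Repeated from the comment above so this snippet runs on its own.
mutable struct LCG <: AbstractRNG
    state::UInt32
end
Base.rand(r::LCG, ::Type{UInt32}) = r.state = 0x0019660d * r.state + 0x3c6ef35f

# Two generators with the same seed produce identical streams, regardless
# of the global RNG state or the Julia version:
r1, r2 = LCG(0x00000001), LCG(0x00000001)
a = [rand(r1, UInt32) for _ in 1:3]
b = [rand(r2, UInt32) for _ in 1:3]
```

With the `UInt52` sampler method from the comment above also defined, one could then try e.g. `sprand(LCG(0x00000001), 10^5, 10^4, 1e-4)`; that hook reaches into Random's internal sampler types and may need adjusting across Julia versions.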
No, I take that back: even if the output of the RNG is stable, I don't think we want to guarantee the exact way in which the RNG is used (i.e. the exact sequence of calls) in a function like

It seems like you really need to just host the files (or your own algorithm to generate matrices). Depending on the precise algorithm used by

(You could define your own file format with e.g.
One could use an RNG to generate I, J, V vectors and define a sparse matrix in terms of those instead of using

```julia
using Random, SparseArrays

function mysprand(rng::AbstractRNG, m::Int, n::Int, p::Real)
    N = round(Int, p*m*n)
    sparse(rand(rng, 1:m, N), rand(rng, 1:n, N), rand(rng, N), m, n, (a, b) -> a)
end

mutable struct LCG <: AbstractRNG
    state::UInt32
end

Base.rand(r::LCG, ::Type{UInt32}) = r.state = 0x0019660d * r.state + 0x3c6ef35f
Base.rand(r::LCG, ::Random.SamplerTrivial{Random.UInt52{UInt64},UInt64}) =
    (rand(r, UInt32) % UInt64) + ((rand(r, UInt32) % UInt64) << 20)

S = mysprand(LCG(123), 10^5, 10^4, 0.0001)
```
In our use-case: The

I agree that you can generate the sparse matrices from
Many journals these days allow you to post code as supplementary information attached to the paper. 11 lines of code is not long and complicated in such a context.

Yes, definitely you want to be able to clearly state the distribution of the test matrices that were used, and (our)

But I really think you need to either provide the code generating the matrix or post the matrix as a file if you want readers to be confident that they will be able to reproduce exactly the same matrices and not just the same distributions, especially readers many years in the future. Requiring us to forever produce exactly the same matrix for a given RNG is really too constraining on the implementation, and is not something you would want to rely on in a paper anyway.
The body is literally only two lines. You could golf it down to one pretty easily, e.g.:

```julia
mysprand(rng::AbstractRNG, m::Int, n::Int, p::Real, N::Int=round(Int, p*m*n)) =
    sparse(rand(rng, 1:m, N), rand(rng, 1:n, N), rand(rng, N), m, n, (a, b) -> a)
```

You don't need the LCG if all you want is predictable sparse matrix construction; you can just use the default RNG, which lets you pare the

```julia
mysprand(m::Int, n::Int, p::Real, N::Int=round(Int, p*m*n)) =
    sparse(rand(1:m, N), rand(1:n, N), rand(N), m, n, (a, b) -> a)
```

If you want to simplify further, there's no need for type annotations:

```julia
mysprand(m, n, p, N=round(Int, p*m*n)) =
    sparse(rand(1:m, N), rand(1:n, N), rand(N), m, n, (a, b) -> a)
```

I have a hard time believing that this is too scary for anyone.
Thanks for the code. It was already clear to me that one can make it easier. In my case this is still too much (Julia-specific) code to put in a numerical linear algebra paper. As a reader I would see it as a distraction, and it puts a disproportionate amount of attention on the generation of the example matrices. Even those two lines of code in a manuscript would be too much for my taste. Putting it in a separate attached file would also likely be frowned upon, considering it is just a random example.

In my case, it is more important that the function is stable over Julia versions. It is not so important what distribution is used. A stable distribution does not help me much if the way the sampling is done changes, since I want reproducibility. I think we will not get any further in this discussion, and I don't think this is my decision; I can live with whatever the outcome is. Whoever makes the decision should be aware that it might annoy some people who assumed
What about putting the matrix-generating code in your own small package and referring to it in the paper, so people needing it for testing can just clone it and use it?
You may be right that it's bad to rely on it. It is, however, not an unusual practice in my field, where people have been relying on MATLAB. The generated matrices have not changed there for a very long time. This has been quite convenient. I understand if someone really needs an efficient

Unless: Do you mean that you would also be open to modifying the
Don't you say what Julia version the code is run on? Anyway, see this for a discussion of pretty much exactly this: https://www.numpy.org/neps/nep-0019-rng-policy.html Especially:
If you want reproducibility the way to do it is to give a manifest and a julia version. |
Yes. I understand this means it is reproducible. But if someone wants to try a different algorithm on my matrices, they would have to run an old version of Julia. Doable, but this obstacle does not encourage others to use my example. In NonlinearEigenproblems.jl we have benchmark examples which are randomly generated, for the sole reason that one should be able to easily compare algorithms on identical examples. They were designed under the (apparently incorrect) assumption that the random number generators always generate the same sequence, independent of the Julia version.
Nice. Corresponding documentation for Julia random numbers would be nice. In particular, somewhere it would be good to point out that unit tests should never use any random number generators, since the sampling is not stable. This will cause problems e.g. for ArnoldiMethod.jl (the starting vectors are random).
Certainly you shouldn’t use tests that depend sensitively on the specific random sequence. It should be fine to use tests that are robust to different samples of the same distribution, but this is a good property to have for your algorithms anyway! (You probably still want to fix the random seed, since if the tests are run millions of times you may get a very unlikely sample, such as a poorly conditioned matrix, that breaks your tests. But this caveat shouldn’t apply to changes in the random streams that occur a handful of times over the lifetime of Julia.)

(Similarly, if your paper shows test results from an
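The advice above can be sketched as a test that pins the seed but asserts only distribution-level properties, so it survives changes to the random stream across Julia versions (a hedged sketch; the tolerances chosen here are illustrative assumptions):

```julia
using Random, SparseArrays

# Seed fixed to avoid rare unlucky samples, but only distributional
# properties are asserted, never exact values from the stream.
rng = MersenneTwister(1234)
A = sprandn(rng, 1000, 1000, 0.01)            # about 10_000 N(0,1) nonzeros
@assert size(A) == (1000, 1000)
@assert 8_000 < nnz(A) < 12_000               # nnz concentrates around p*m*n
@assert abs(sum(nonzeros(A))) / nnz(A) < 0.1  # sample mean close to 0
```

Such a test keeps passing under any valid change to the underlying randn sequence, while a hardcoded comparison against specific matrix entries would not.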
You make it sound like this should have been obvious. I think these things can be very hidden, and developers have not taken this into account. A policy document would have helped. A GitHub search gives approximately 50 hits on randn in the stdlib tests alone; I doubt all of these tests would pass if you completely changed the sequence randn generates. I could even argue it in this PR: the jldoctest is wrong, since it depends on the specific random sequence. See line 1487: it depends in a sensitive way on the randn sequence. Consider a later PR which changes the randn sequence for efficiency: it will make the sprandn test fail.

Julia already got some bad reputation before 1.0 that things were changing all the time (mostly unjustified, in my opinion). Changing the random sequence will break code for some people. At the moment this change is not even in the NEWS file.
That's the intention of that test.
I don't think a sprandn unit test should test for valid changes in randn. If someone changes the sequence in randn (in Base) you will obtain a failed test in sprandn (in SparseArrays), although there is actually nothing wrong in either package, following the policy that the randn sequence is allowed to change.

Documentation tests are different. If the sprand documentation shows a specific example, the example needs to be updated if the random stream changes. The doctest ensures that this is done.

I think you are dismissing this too lightly. If SparseArrays were in a different repo with, say, a separate versioning cycle, how is that developer expected to make it pass on two Julia versions that generate different randn sequences? Note that sprandn calls randn. Maintaining different package versions for different Julia versions? If-statements on version numbers? Maybe acceptable considering it's a rare situation, but it's an example of a hidden complication.

I feel you are anyway missing my main point. To quote myself
Independent of the jldoctest example, I'm a bit surprised that nobody in this discussion has acknowledged that this can cause trouble for some developers, and that it may deserve to be mentioned in the Julia 1.2 NEWS file, at least. Just the fact that you detected that COSMO.jl's unit tests started to fail would have been enough to qualify for a mention in the NEWS file, in my opinion. A policy document would also reduce the friction, or just a reference to NumPy's policy document if that is accurate for Julia as well.

(Personally, I don't even feel that strongly about this, although this long thread may give that impression. It's not a big deal for us, since I know how to fix the problems in our case and we will drop 1.0 support before our main release. I was more concerned that Julia can again be criticized for making changes that break developers' code. That criticism should be gone after 1.0.)
Any change at all has the possibility to "break" code (https://xkcd.com/1172/). The term "breaking code" is used when a change in Julia breaks code that is *conforming to the parts of Julia that the developers of Julia voluntarily determined should work in the same way in future releases*. Even so, a NEWS entry for this, as well as a policy for the RNG, would be (or is likely) a good idea.

It definitely needs to be in the NEWS file. If that's all you're asking for, then certainly no one disagrees with that. The people here are acutely aware of the problems this may cause: we run the test suites of all registered Julia packages for new releases and make sure that they are fixed to pass.
IMPORTANT: If you need to reproduce matrices prior to this change, your best bet is to use the old version of the sprand function that you can find here.
The code for sprand for SparseMatrixCSC picks nonempty columns first, and then fills them (with the conditional probability of being nonempty). This seems to be done for faster construction of very sparse matrices (where empty columns are to be expected). I think, however, that this does not give a real advantage, because one still needs to fill the colptr array, even for empty columns. I implemented a simpler version, which seems to be twice as fast or more in most cases. Am I overlooking something? Some benchmarks:

MASTER:
PR:
Raw data
EDIT: Some simple inner-loop optimization improves speed further: