Sort SparseCSC elements function and speed up sparse matrix multiplication #8316
Conversation
… In addition, replacing the dual transpose used in the sparse matrix multiplication. This seems to outperform in particular with larger matrices.

```julia
julia> x = sprand(n, m, 0.0002);

julia> s = sprand(m, k, 0.001);

julia> @time y = x * s;
elapsed time: 56.447965143 seconds (11903648416 bytes allocated, 0.23% gc time)

julia> @time y = spmm(x, s);
elapsed time: 35.558409459 seconds (13790626144 bytes allocated, 5.49% gc time)
```
@gchisholm a heads-up: since Julia's macros share syntax with GitHub's user notification syntax, it's important to quote code with backticks (single backticks for inline, triple for block) to avoid being bad GitHub citizens by pinging random non-Julia-related folks on Julia issues. Thanks!
Interesting addition. Needs some module scoping so it can be seen properly where you're using it. It's also messy to read due to inconsistent indentation; can you try to follow https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#general-formatting-guidelines-for-julia-code-contributions ?
What we really want is in-place sorting with array views. That will avoid all the copying. Also probably a way to pick between sorting and double transpose.
We should be able to allocate once and sort in a range.
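A minimal sketch of that view-based idea, assuming array views are available (the later `view` name is used here; Julia 0.3-era code would use `sub`; `colptr`, `rowval`, and `nzval` are the usual CSC fields and `j` a column index):

```julia
# Sort one column's slice through views rather than copies of the arrays.
r = colptr[j]:colptr[j+1]-1
p = sortperm(rowval[r])        # still allocates; a reusable buffer would fix this
permute!(view(rowval, r), p)   # reorder the row indices in place
permute!(view(nzval, r), p)    # keep the values aligned with their rows
```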
@tkelman I will clean up the code; I had it in a private code base and quickly (and poorly) put it in a pull. My apologies. @ViralBShah Agreed, the repeated allocation is suboptimal. I did have some code to check if additional allocations were required based on the number of rows in a particular column:
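(The quoted snippet did not survive; a hypothetical reconstruction of that kind of guard, where `colptrS`, `rowvalS`, and `nzvalS` are assumed names for the result's CSC buffers:)

```julia
# Only grow the output buffers when this column's rows would overflow them.
if colptrS[j+1] - 1 > length(rowvalS)
    resize!(rowvalS, 2 * length(rowvalS))
    resize!(nzvalS,  2 * length(nzvalS))
end
```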
But this massively reduced the performance; I was violating some Julia language optimization that I did not understand. In addition, if we add a way to choose between sorting and the double transpose, we should also add a method to return a dense matrix from a sparse product. What would the use case be in which a double transpose was desired?
Needs some more cleanup and testing before anyone wastes their time looking at it. Will get back to it early next week.
Indentation looks much better now, thanks. What may help make it easier to work on this, and also to pull in updates to master in the meantime, is if you create a specific new branch in your fork for working on it. Might not end up being necessary here, but a pointer for next time :)
@gchisholm I believe that there may be cases in multiplication where the double transpose is faster, but I haven't tested it yet. Let's make it easy to benchmark both. What values of n, m, and k did you use above?

I wonder if we can just add an optional parameter to sparse matmul in the following way, so that it can continue to work as it does right now, but with sortCSC set to true one can try the new method:
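(The quoted snippet is gone; a sketch of the shape being proposed, with illustrative names throughout. On current Julia the sparse types need `using SparseArrays`, and `'` is lazy today, whereas on 0.3 it materialized the transpose.)

```julia
# sortCSC=false keeps the existing double-transpose cleanup;
# sortCSC=true sorts each result column's (rowval, nzval) pairs instead.
function spmatmul_sketch(A::SparseMatrixCSC, B::SparseMatrixCSC; sortCSC::Bool=false)
    C = A * B                 # stand-in for the real, unsorted accumulation
    if sortCSC
        for j in 1:C.n
            r = C.colptr[j]:C.colptr[j+1]-1
            p = sortperm(C.rowval[r])
            C.rowval[r] = C.rowval[r][p]
            C.nzval[r]  = C.nzval[r][p]
        end
    else
        C = (C')'             # on 0.3, ' materializes; today write copy(C') twice
    end
    return C
end
```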
We will then benchmark on various different types of matrices. If sortCSC is a clear winner, we will make it the default. Otherwise, we will have to keep it as an option. We certainly cannot be allocating the new array inside the multiplication routine on every call.

BTW, sorting is what we used to do earlier, and we switched to the double transpose. The double transpose should ideally be faster, as it is linear in the number of stored elements, while sorting pays an extra logarithmic factor.

So, all in all, this is great and should give us a faster sparse multiplication. I will have a go at it once you get a chance to go over this again, as you mentioned in your comment.
@StefanKarpinski @kmsquire Do we have a way to sort a portion of an array in-place and get its permutation also in-place in a pre-allocated permutation array, without incurring any extra allocations?
It's a little bit annoying to do this right now. The way to do this would be to get the permutation for the portion of the array, and then apply that permutation to sort it. Getting the permutation would involve calling what sortperm does internally (from https://github.com/JuliaLang/julia/blob/master/base/sort.jl#L340-L342):
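(The lines quoted from sort.jl are gone; roughly, `sortperm` there sorts an index vector under the `Perm` ordering, which can be done over a pre-allocated index array. A sketch, with the internal names taken as assumptions:)

```julia
# Permutation of v[start:stop] computed into a reusable index array.
ix = collect(start:stop)      # could be allocated once and refilled
sort!(ix, Base.Sort.DEFAULT_UNSTABLE,
      Base.Order.Perm(Base.Order.Forward, v))
```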
The permutation could be applied with

```julia
tmp = v[start:stop]        # copy(v[start:stop]) after array views are merged
perm = sortperm(tmp)
v[start:stop] = tmp[perm]  # but do this in place
```

There are probably more efficient ways to do (or at least nicer interfaces for) all of this, but none of them are implemented right now. This whole idea will also probably make more sense when array views are merged.
I think that applying the permutation later to sort it is ok, so long as we can avoid memory allocations across different calls to sort. Certainly waiting on array views.
I realized later that the first permutation can just be written as …
… is faster than the dual transpose version, in particular on large sparse structures. Needs improvement in the use of `sortperm`, which is causing a large amount of allocation.
(force-pushed from 6c7c7e3 to 1a4c02f)
In this case, the garbage is generated largely by `sortperm`.
* Allows preallocation (and therefore, reuse) of an index array
* Useful for #8316
@ViralBShah, see #8792.
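(#8792 adds `sortperm!`, which writes the permutation into a caller-supplied index vector, so repeated sorts can reuse one buffer. A small usage sketch with illustrative values:)

```julia
v  = [0.3, 0.1, 0.2]
ix = similar(v, Int)   # index buffer: allocate once, reuse across calls
sortperm!(ix, v)       # fills ix with the permutation that sorts v
v[ix]                  # == [0.1, 0.2, 0.3]
```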
@gchisholm I have done some experiments, and it seems that doing a transpose is faster if there is on the order of one element per row or column, but the sorting is faster in other cases. All this is highly dependent on matrix structure and size. I will make it an optional parameter, and also put in some defaults.
If you want to experiment, use this (you will need the latest master):
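(The snippet itself is lost; a hypothetical harness in its spirit, comparing the two paths with the `spmatmul_sketch` shape from above. The dimensions and densities echo the earlier benchmark but are assumed values.)

```julia
# Compare double transpose vs. column sort on a large, very sparse product.
n, m, k = 100_000, 100_000, 100_000
x = sprand(n, m, 0.0002)
s = sprand(m, k, 0.001)
@time spmatmul_sketch(x, s, sortCSC=false)   # double-transpose cleanup
@time spmatmul_sketch(x, s, sortCSC=true)    # column-sort cleanup
```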
I suspect we can even special-case the sorting of 1, 2, and 3 elements.
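(A sketch of that idea, not code from the thread: length-1 slices need no work, and lengths 2 and 3 reduce to a few compare-and-swap steps on the paired arrays.)

```julia
# One compare-and-swap on paired (row, value) entries; a 3-element column
# is three such steps, so tiny columns can skip the general sort entirely.
function cswap!(rowval, nzval, i, j)
    if rowval[i] > rowval[j]
        rowval[i], rowval[j] = rowval[j], rowval[i]
        nzval[i],  nzval[j]  = nzval[j],  nzval[i]
    end
end
```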
For very large and very sparse matrices, we end up spending more than half the time just sorting. I wonder if we can just generate all the tuples in such cases and sort those directly.
This uses `sortSparseMatrixCSC!(S)` instead of double transpose to sort the result. (Viral committing an improved version of Glen Chisholm's original PR.)
Calling …
Actually, by the time I was done, the double transpose was consistently slower in various experiments with identity matrices, dense matrices in sparse form, and sprand-generated matrices of different sizes. We should probably have a proper perf suite. In my opinion, matrices from UFL sparse tend to be small and are not representative of the scale of things that is common today. Putting a set of sparse matrices together for benchmarks is a separate project, which would be useful by itself beyond sparse multiplication.
Figured it was easier to provide the double transpose option now. Both methods can be accessed with:
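(The access snippet is gone; based on the tests added in the next commit, it is presumably along these lines, with the module path and keyword treated as assumptions:)

```julia
# Presumed interface: pick the cleanup strategy via a keyword argument.
Base.SparseMatrix.spmatmul(A, B, sortindices=:sortcols)         # new method
Base.SparseMatrix.spmatmul(A, B, sortindices=:doubletranspose)  # old method
```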
Add tests for sparse matrix multiplication that exercise both the sortcols and the doubletranspose methods.
Thoughts on whether or not the 3 commits here should be backported? If so, it might be best to wait until right after 0.3.3, just so there's more time for testing, comparison, etc.
Isn't this an API change, which will potentially make 0.3.3 and future releases incompatible with earlier releases in the 0.3 series? We can backport a different algorithm if it is shown faster in a more comprehensive set of tests, but not the API change.
@ViralBShah yes, good point; we can avoid backporting this and the exporting of `sortSparseMatrixCSC!`.
A suggested addition.
Replacing the dual transpose used in the sparse matrix multiplication with an in-place sort of the `rowval` and `nzval` elements. This seems to outperform in particular with larger matrices.
Avoiding the GC time by improving the allocation of the two `Array`s would further improve performance. I did test with an if statement to avoid allocations, but that seemed to decrease performance, particularly with smaller pairs of matrices.
Either way, I assume a sparse sort function would have some longer-term benefit.
[pao: quoting code block]