Sort SparseCSC elements function and speed up sparse matrix multiplication #8316

Closed
wants to merge 3 commits

Conversation

gchisholm

A suggested addition.

This replaces the dual transpose used in sparse matrix multiplication with an in-place sort of the rowval and nzval elements. It seems to outperform the current approach, particularly with larger matrices.

julia> x = sprand(n, m, 0.0002);

julia> s = sprand(m, k, 0.001);

julia> @time y = x * s;
elapsed time: 56.447965143 seconds (11903648416 bytes allocated, 0.23% gc time)

julia> @time y = spmm(x, s);
elapsed time: 35.558409459 seconds (13790626144 bytes allocated, 5.49% gc time)

Avoiding the GC time by improving the allocation of the two Arrays would further improve performance. I did test with an if statement to avoid allocations, but that seemed to decrease performance, particularly with smaller pairs of matrices.

Either way, I assume a sparse sort function would have some longer-term benefit.

[pao: quoting code block]
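[ed: for context, a minimal sketch of the two approaches being compared, in the 0.3-era Julia used throughout this thread. `C` stands for a hypothetical CSC product whose row indices are unsorted within each column, which is what the multiplication kernel produces before any fix-up:]

# Current approach: double transpose. Each transpose is a counting sort
# over the row indices, so they come back sorted.
C = (C.').'

# Proposed approach: sort rowval and nzval within each column in place.
# (The temporary arrays this naive version allocates per column are
# exactly what the PR tries to avoid.)
for j = 1:C.n
    r = C.colptr[j]:(C.colptr[j+1] - 1)
    p = sortperm(C.rowval[r])        # permutation that sorts this column
    C.rowval[r] = C.rowval[r][p]
    C.nzval[r]  = C.nzval[r][p]
end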

@pao
Member

pao commented Sep 11, 2014

@gchisholm a heads-up: since Julia's macros share syntax with GitHub's user notification syntax, it's important to quote code with backticks (single backticks for inline, triple for block) to avoid being bad GitHub citizens by pinging random non-Julia related folks on Julia issues. Thanks!

@tkelman
Contributor

tkelman commented Sep 11, 2014

Interesting addition. Needs some module scoping so it can be seen properly where you're using it. It's also messy to read due to inconsistent indentation; can you try to follow https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#general-formatting-guidelines-for-julia-code-contributions ?

@ViralBShah
Member

What we really want is in-place sorting with array views. That will avoid all the copying. Also, probably a way to pick between sorting and the double transpose.

@ViralBShah
Member

We should be able to allocate once and sort in a range.

@gchisholm
Author

@tkelman I will clean up the code; I had it in a private code base and moved it into this pull request quickly and poorly. My apologies.

@ViralBShah Agreed, the repeated allocation is suboptimal. I did have some code to check whether additional allocations were required, based on the number of rows in a particular column:

    numrows = (col_end - col_start) + 1
    if old_alloc < numrows
        row = Array(Ti, numrows)
        val = Array(Tv, numrows)
        old_alloc = numrows
    end

But this massively reduced performance; I was evidently violating some Julia optimization rule that I did not understand.

In addition, if we add methods to choose between the double transpose and sorting, we should also add a method to return a dense matrix from a sparse product.

What would the use case be where the double transpose is desired?

@tkelman tkelman added the sparse Sparse arrays label Sep 12, 2014
@gchisholm
Author

Needs some more cleanup and testing before anyone wastes their time looking at it. Will get back to it early next week.

@tkelman
Contributor

tkelman commented Sep 12, 2014

Indentation looks much better now, thanks. To make it easier to work on this and to pull in updates from master in the meantime, you could create a dedicated branch in your fork for this work. It might not end up being necessary here, but it's a pointer for next time :)

@ViralBShah ViralBShah self-assigned this Sep 14, 2014
@ViralBShah
Member

@gchisholm I believe that there may be cases in multiplication where the double transpose is faster, but I haven't tested it yet. Let's make it easy to benchmark both. What values of n, m and k do you use above? I wonder if you end up with an extremely sparse product and hence it is faster.

I wonder if we can just add an optional parameter to sparse matmul in the following way, so that it continues to work as it does right now, but with sortCSC set to true one can try the new method via *(A,B,sortCSC=true):

*(A::SparseMatrixCSC, B::SparseMatrixCSC; sortCSC=false)

We will then benchmark on various different types of matrices. If sortCSC is a clear winner, we will make it the default; otherwise, we will have to keep it as an option. We certainly cannot be allocating the new arrays inside the for loop on every iteration. One way to avoid that is by creating SubArrays and sorting them in place, but sortperm does not have an in-place version.
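[ed: a sketch of how that optional parameter could look. This is not the code in this PR; `spmatmul_unsorted` is a hypothetical name for the existing kernel whose result has unsorted row indices within columns, and `sortCSC!` stands for an in-place per-column sorter like the one shown later in this thread:]

# `*` would simply forward to this with the default.
function spmatmul(A::SparseMatrixCSC, B::SparseMatrixCSC; sortCSC::Bool=false)
    C = spmatmul_unsorted(A, B)   # hypothetical kernel; columns come back unsorted
    if sortCSC
        sortCSC!(C)               # new path: sort rowval/nzval within each column
        return C
    else
        return (C.').'            # current behavior: double transpose
    end
end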

BTW, sorting is what we used to do earlier, and we switched to the double transpose. The double transpose should ideally be faster, as it is O(nnz), whereas the sorting is O(n * x_col * log(x_col)), where x_col is the number of stored entries in each column. It is also possible there is some performance issue in the double transpose.
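[ed: to see why the double transpose is O(nnz): a CSC transpose can be done as a counting sort over row indices, and scanning the source columns in order means every output column is filled with ascending row indices. A minimal sketch in the era's syntax, under the illustrative name `csc_transpose`; this is not Base's actual implementation, which also handles conjugation:]

function csc_transpose{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti})
    m, n = size(A)
    nz = length(A.nzval)
    colptr = zeros(Ti, m + 1)   # column pointers of the transpose (m columns)
    rowval = Array(Ti, nz)
    nzval  = Array(Tv, nz)

    # Counting pass: entries per row of A become column lengths of A.'
    colptr[1] = 1
    for k = 1:nz
        colptr[A.rowval[k] + 1] += 1
    end
    # Prefix sum turns lengths into column pointers.
    for i = 1:m
        colptr[i + 1] += colptr[i]
    end

    # Distribution pass: scanning A's columns in order means each output
    # column receives its entries with ascending row indices (sorted for free).
    next = copy(colptr)
    for j = 1:n
        for p = A.colptr[j]:(A.colptr[j+1] - 1)
            i = A.rowval[p]
            q = next[i]
            rowval[q] = j
            nzval[q]  = A.nzval[p]
            next[i] = q + 1
        end
    end
    return SparseMatrixCSC(n, m, colptr, rowval, nzval)
end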

So, all in all, this is great and should give us a faster sparse multiplication. I will have a go at it once you get a chance to go over this again as you mentioned in your comment.

@ViralBShah
Member

@StefanKarpinski @kmsquire Do we have a way to sort a portion of an array in place and also get its permutation in place, in a pre-allocated permutation array, without incurring any extra allocations?

@ViralBShah ViralBShah added this to the 0.4 milestone Sep 14, 2014
@kmsquire
Member

It's a little bit annoying to do this right now.

The way to do this would be to get the permutation for the portion of the array, and then apply that permutation to sort it.

Getting the permutation would involve calling (from https://github.com/JuliaLang/julia/blob/master/base/sort.jl#L340-L342)

perm = sort!([1:stop-start+1], alg, Perm(ord(lt,by,rev,order),sub(v, start:stop)))
perm = sortperm(sub(v, start:stop))

The permutation could be applied with permute!(sub(v, start:stop), perm) (or permute!! with the same args if the permutation is allowed to be destroyed). But if the size of the chunk is large, it's probably faster to copy it out, then back in again:

tmp = v[start:stop] # copy(v[start:stop]) after array views are merged
perm = sortperm(tmp)
v[start:stop] = tmp[perm] # but do this in place

There are probably more efficient ways to do (or at least nicer interfaces for) all of this, but none of them are implemented right now.

This whole idea will also probably make more sense when Array views are merged.
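[ed: putting the pieces above together, a small self-contained example of sorting just a sub-range, assuming sortperm and permute! accept SubArrays as described; lo/hi stand in for start/stop to avoid shadowing Base.start:]

v = [5, 2, 9, 1, 7, 3]
lo, hi = 2, 5                     # the start:stop range from the snippet above

perm = sortperm(sub(v, lo:hi))    # permutation of just the sub-range
permute!(sub(v, lo:hi), perm)     # apply it in place

# v is now [5, 1, 2, 7, 9, 3]; entries outside lo:hi are untouched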

@ViralBShah
Member

I think that applying the permutation later to sort it is OK, so long as we can avoid memory allocations across different calls to sort. We are certainly waiting on array views.

@kmsquire
Member

I realized later that the first permutation can just be written as perm = sortperm(sub(v, start:stop)) (fixed inline). My original thought was to keep track of the original indices, which in theory is possible, but probably not necessary. This doesn't actually look so bad, then.

… is faster than the dual transpose version, in particular with large sparse structures. Needs improvement in the use of sortperm, which is causing a large amount of allocation.
@jiahao jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f on October 11, 2014 22:06
@ViralBShah
Member

spmm is the new version from this PR, and is considerably faster than the current implementation.

julia> m=10^5; n=10^5; k=10^5;
julia> x = sprand(n,m,0.0002);
julia> s = sprand(n,m,0.001);
julia> @time spmm(x,s);
elapsed time: 27.015908333 seconds (15292728544 bytes allocated, 6.46% gc time)
julia> @time x*s;
elapsed time: 110.150445452 seconds (15279436352 bytes allocated, 0.40% gc time)

@ViralBShah
Member

In this case, the garbage is generated largely by sortperm. @kmsquire Do we have a way to pass the result array to sortperm? If I can reuse the result array in the loop, it should give huge gains. For many problems, I am seeing a significant portion of the time going to GC, and I think it is possible to do this whole computation without allocating extra memory in every iteration.

kmsquire added a commit that referenced this pull request Oct 23, 2014
* Allows preallocation (and therefore, reuse) of an index array
* Useful for #8316
@kmsquire kmsquire mentioned this pull request Oct 23, 2014
@kmsquire
Member

@ViralBShah, see #8792.

@ViralBShah
Member

@gchisholm I have done some experiments, and it seems that doing a transpose is faster if there is on the order of 1 element per row or column, but the sorting is faster in other cases. All this is highly dependent on matrix structure and size. I will make it an optional parameter, and also put in some defaults.

@ViralBShah
Member

If you want to experiment, use this, but you will need the latest master.

function sortCSC!{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti})
    m, n = size(A)
    colptr = A.colptr; rowval = A.rowval; nzval = A.nzval

    # Scratch buffers, allocated once at the maximum possible column length.
    index = zeros(Ti, m)
    row = zeros(Ti, m)
    val = zeros(Tv, m)

    for i = 1:n
        @inbounds col_start = colptr[i]
        @inbounds col_end = colptr[i+1] - 1

        numrows = int((col_end - col_start) + 1)

        # Copy this column's row indices and values into the scratch buffers.
        jj = 1
        @simd for j = col_start:col_end
            @inbounds row[jj] = rowval[j]
            @inbounds val[jj] = nzval[j]
            jj += 1
        end

        # Compute the sorting permutation into the preallocated `index`,
        # using unsafe views of just the first `numrows` scratch entries.
        sortperm!(pointer_to_array(pointer(index), numrows),
                  pointer_to_array(pointer(row), numrows))

        # Write the column back in sorted order.
        jj = 1
        @simd for j = col_start:col_end
            @inbounds rowval[j] = row[index[jj]]
            @inbounds nzval[j] = val[index[jj]]
            jj += 1
        end
    end
    return A
end
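[ed: hypothetical usage, assuming the preallocated-output sortperm! from #8792 is available (hence the need for the latest master): build a small CSC matrix whose single column has deliberately unsorted row indices, then repair it in place:]

A = SparseMatrixCSC(3, 1, [1, 4], [3, 1, 2], [30.0, 10.0, 20.0])
sortCSC!(A)
A.rowval   # => [1, 2, 3]
A.nzval    # => [10.0, 20.0, 30.0]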

@ViralBShah
Member

I suspect we can even special-case the sorting of 1, 2, and 3 elements.
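[ed: a sketch of what such special-casing could look like, with a hypothetical helper that is not part of this PR: columns of length 1 are already sorted, and lengths 2 and 3 can be handled with explicit compare-and-swaps instead of a general sortperm call:]

function sort_column_small!(rowval, nzval, lo::Int, hi::Int)
    # Compare-and-swap that keeps rowval and nzval paired.
    function cswap!(i, j)
        if rowval[i] > rowval[j]
            rowval[i], rowval[j] = rowval[j], rowval[i]
            nzval[i],  nzval[j]  = nzval[j],  nzval[i]
        end
    end
    n = hi - lo + 1
    if n == 2
        cswap!(lo, hi)
    elseif n == 3
        # Three compare-and-swaps form a complete sorting network for three elements.
        cswap!(lo, lo+1); cswap!(lo+1, hi); cswap!(lo, lo+1)
    end
    return nothing
end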

@ViralBShah
Member

For very large and very sparse matrices, we end up spending more than half the time just sorting. I wonder if we can just generate all the tuples in such cases and call sparse().
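[ed: a sketch of that idea, with `C` again standing for a product whose columns are unsorted: expand it into coordinate triplets and let sparse() rebuild the CSC structure, which sorts each column (and combines duplicates) during construction:]

I, J, V = findnz(C)                           # entries come out in storage order
C2 = sparse(I, J, V, size(C, 1), size(C, 2))  # construction sorts each column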

@ViralBShah
Member

Cc: @mlubin @IainNZ

@ViralBShah ViralBShah closed this Oct 25, 2014
ViralBShah pushed a commit that referenced this pull request Oct 25, 2014
This uses sortSparseMatrixCSC!(S) instead of double transpose to sort the result.
(Viral committing an improved version of Glen Chisholm's original PR).
@mlubin
Member

mlubin commented Oct 25, 2014

Calling sparse() is the same as the double transpose, I think. I don't see a parameter for this in your commit, @ViralBShah.
I haven't personally benchmarked this, but the double transpose seems much more common in sparse linear algebra texts than sorting. Benchmarking on random sparse matrices is not sufficient; I would've liked to see some benchmarks with matrices from the Matrix Market or the UFL sparse matrix collection before changing the default.

@ViralBShah
Member

Actually, by the time I was done, the double transpose was consistently slower in various experiments with identity matrices, dense matrices in sparse form, and sprand-generated matrices of different sizes.

We should probably have a proper perf suite. In my opinion, matrices from the UFL sparse collection tend to be small and are not representative of the scales that are common today. Putting a set of sparse matrices together for benchmarks is a separate project, which would be useful by itself beyond sparse multiplication.

@ViralBShah
Member

Figured it was easier to provide the double transpose option now. Both methods can be accessed with:

Base.LinAlg.spmatmul(A,B,sortindices=:doubletranspose)
Base.LinAlg.spmatmul(A,B,sortindices=:sortcols)
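[ed: for example, both strategies can be timed on the same inputs; the actual numbers will depend on size and sparsity, as discussed above:]

A = sprand(10^5, 10^5, 0.0002);
B = sprand(10^5, 10^5, 0.001);
@time Base.LinAlg.spmatmul(A, B, sortindices=:doubletranspose);
@time Base.LinAlg.spmatmul(A, B, sortindices=:sortcols);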

ViralBShah added a commit that referenced this pull request Oct 26, 2014
Add tests for sparse matrix multiplication that exercise both the sortcols and the doubletranspose methods.
@tkelman
Contributor

tkelman commented Nov 18, 2014

Thoughts on whether or not the 3 commits here should be backported? If so, might be best to wait until right after 0.3.3 just so there's more time for testing, comparison, etc.

@ViralBShah
Member

Isn't this an API change, which will potentially make 0.3.3 and future releases incompatible with earlier releases in the 0.3 series? We can backport a different algorithm if it is shown to be faster in a more comprehensive set of tests, but not the API change.

@tkelman
Contributor

tkelman commented Nov 19, 2014

@ViralBShah yes, good point; we can avoid backporting this, as well as the exporting of transpose!, which doesn't apply cleanly without this change.
