Sort SparseCSC elements function and speed up sparse matrix multiplication #8316

Closed
wants to merge 3 commits

Conversation

gchisholm

A suggested addition.

This replaces the dual transpose used in sparse matrix multiplication with an in-place sort of the rowval and nzval elements. It seems to outperform the current approach, particularly with larger matrices.

julia> x = sprand(n, m, 0.0002);

julia> s = sprand(m, k, 0.001);

julia> @time y = x * s;
elapsed time: 56.447965143 seconds (11903648416 bytes allocated, 0.23% gc time)

julia> @time y = spmm(x, s);
elapsed time: 35.558409459 seconds (13790626144 bytes allocated, 5.49% gc time)

Avoiding the GC time by improving the allocation of the two Arrays would further improve performance. I did test with an if statement to avoid allocations, but that seemed to decrease performance, particularly with smaller pairs of matrices.

Either way, I assume a sparse sort function would have some longer-term benefit.

[pao: quoting code block]
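[ed: for context, a minimal sketch of the two approaches being compared, in the 0.3-era Julia used throughout this thread. `C` stands for a hypothetical CSC product whose row indices are unsorted within each column, which is what the multiplication kernel produces before any fix-up:]

# Current approach: double transpose. Each transpose is a counting sort
# over the row indices, so they come back sorted.
C = (C.').'

# Proposed approach: sort rowval and nzval within each column in place.
# (The temporary arrays this naive version allocates per column are
# exactly what the PR tries to avoid.)
for j = 1:C.n
    r = C.colptr[j]:(C.colptr[j+1] - 1)
    p = sortperm(C.rowval[r])        # permutation that sorts this column
    C.rowval[r] = C.rowval[r][p]
    C.nzval[r]  = C.nzval[r][p]
end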

@pao
Member

pao commented Sep 11, 2014

@gchisholm a heads-up: since Julia's macros share syntax with GitHub's user notification syntax, it's important to quote code with backticks (single backticks for inline, triple for block) to avoid being bad GitHub citizens by pinging random non-Julia related folks on Julia issues. Thanks!

@tkelman
Contributor

tkelman commented Sep 11, 2014

Interesting addition. Needs some module scoping so it can be seen properly where you're using it. It's also messy to read due to inconsistent indentation; can you try to follow https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#general-formatting-guidelines-for-julia-code-contributions ?

@ViralBShah
Member

What we really want is in-place sorting with array views. That will avoid all the copying. Also, probably a way to pick between sorting and the double transpose.

@ViralBShah
Member

We should be able to allocate once and sort in a range.

@gchisholm
Author

@tkelman I will clean up the code; I had it in a private code base and moved it into this pull request quickly and poorly. My apologies.

@ViralBShah Agreed, the repeated allocation is suboptimal. I did have some code to check whether additional allocations were required, based on the number of rows in a particular column:

    numrows = (col_end - col_start) + 1
    if old_alloc < numrows
        row = Array(Ti, numrows)
        val = Array(Tv, numrows)
        old_alloc = numrows
    end

But this massively reduced performance; I was evidently violating some Julia optimization rule that I did not understand.

In addition, if we add methods to choose between the double transpose and sorting, we should also add a method to return a dense matrix from a sparse product.

What would the use case be where the double transpose is desired?

@tkelman tkelman added the sparse Sparse arrays label Sep 12, 2014
@gchisholm
Author

Needs some more cleanup and testing before anyone wastes their time looking at it. Will get back to it early next week.

@tkelman
Contributor

tkelman commented Sep 12, 2014

Indentation looks much better now, thanks. To make it easier to work on this and to pull in updates from master in the meantime, you could create a dedicated branch in your fork for this work. It might not end up being necessary here, but it's a pointer for next time :)

@ViralBShah ViralBShah self-assigned this Sep 14, 2014
@ViralBShah
Member

@gchisholm I believe that there may be cases in multiplication where the double transpose is faster, but I haven't tested it yet. Let's make it easy to benchmark both. What values of n, m and k do you use above? I wonder if you end up with an extremely sparse product and hence it is faster.

I wonder if we can just add an optional parameter to sparse matmul in the following way, so that it continues to work as it does right now, but with sortCSC set to true one can try the new method via *(A,B,sortCSC=true):

*(A::SparseMatrixCSC, B::SparseMatrixCSC; sortCSC=false)

We will then benchmark on various different types of matrices. If sortCSC is a clear winner, we will make it the default; otherwise, we will have to keep it as an option. We certainly cannot be allocating the new arrays inside the for loop on every iteration. One way to avoid that is by creating SubArrays and sorting them in place, but sortperm does not have an in-place version.
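[ed: a sketch of how that optional parameter could look. This is not the code in this PR; `spmatmul_unsorted` is a hypothetical name for the existing kernel whose result has unsorted row indices within columns, and `sortCSC!` stands for an in-place per-column sorter like the one shown later in this thread:]

# `*` would simply forward to this with the default.
function spmatmul(A::SparseMatrixCSC, B::SparseMatrixCSC; sortCSC::Bool=false)
    C = spmatmul_unsorted(A, B)   # hypothetical kernel; columns come back unsorted
    if sortCSC
        sortCSC!(C)               # new path: sort rowval/nzval within each column
        return C
    else
        return (C.').'            # current behavior: double transpose
    end
end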

BTW, sorting is what we used to do earlier, and we switched to the double transpose. The double transpose should ideally be faster, as it is O(nnz), whereas the sorting is O(n * x_col * log(x_col)), where x_col is the number of stored entries in each column. It is also possible there is some performance issue in the double transpose.
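[ed: to see why the double transpose is O(nnz): a CSC transpose can be done as a counting sort over row indices, and scanning the source columns in order means every output column is filled with ascending row indices. A minimal sketch in the era's syntax, under the illustrative name `csc_transpose`; this is not Base's actual implementation, which also handles conjugation:]

function csc_transpose{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti})
    m, n = size(A)
    nz = length(A.nzval)
    colptr = zeros(Ti, m + 1)   # column pointers of the transpose (m columns)
    rowval = Array(Ti, nz)
    nzval  = Array(Tv, nz)

    # Counting pass: entries per row of A become column lengths of A.'
    colptr[1] = 1
    for k = 1:nz
        colptr[A.rowval[k] + 1] += 1
    end
    # Prefix sum turns lengths into column pointers.
    for i = 1:m
        colptr[i + 1] += colptr[i]
    end

    # Distribution pass: scanning A's columns in order means each output
    # column receives its entries with ascending row indices (sorted for free).
    next = copy(colptr)
    for j = 1:n
        for p = A.colptr[j]:(A.colptr[j+1] - 1)
            i = A.rowval[p]
            q = next[i]
            rowval[q] = j
            nzval[q]  = A.nzval[p]
            next[i] = q + 1
        end
    end
    return SparseMatrixCSC(n, m, colptr, rowval, nzval)
end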

So, all in all, this is great and should give us a faster sparse multiplication. I will have a go at it once you get a chance to go over this again as you mentioned in your comment.

@ViralBShah
Member

@StefanKarpinski @kmsquire Do we have a way to sort a portion of an array in place and also get its permutation in place, in a pre-allocated permutation array, without incurring any extra allocations?

@ViralBShah ViralBShah added this to the 0.4 milestone Sep 14, 2014
@kmsquire
Member

It's a little bit annoying to do this right now.

The way to do this would be to get the permutation for the portion of the array, and then apply that permutation to sort it.

Getting the permutation would involve calling (from https://github.com/JuliaLang/julia/blob/master/base/sort.jl#L340-L342)

perm = sort!([1:stop-start+1], alg, Perm(ord(lt,by,rev,order),sub(v, start:stop)))
perm = sortperm(sub(v, start:stop))

The permutation could be applied with permute!(sub(v, start:stop), perm) (or permute!! with the same args if the permutation is allowed to be destroyed). But if the size of the chunk is large, it's probably faster to copy it out, then back in again:

tmp = v[start:stop] # copy(v[start:stop]) after array views are merged
perm = sortperm(tmp)
v[start:stop] = tmp[perm] # but do this in place

There are probably more efficient ways to do (or at least nicer interfaces for) all of this, but none of them are implemented right now.

This whole idea will also probably make more sense when Array views are merged.
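[ed: putting the pieces above together, a small self-contained example of sorting just a sub-range, assuming sortperm and permute! accept SubArrays as described; lo/hi stand in for start/stop to avoid shadowing Base.start:]

v = [5, 2, 9, 1, 7, 3]
lo, hi = 2, 5                     # the start:stop range from the snippet above

perm = sortperm(sub(v, lo:hi))    # permutation of just the sub-range
permute!(sub(v, lo:hi), perm)     # apply it in place

# v is now [5, 1, 2, 7, 9, 3]; entries outside lo:hi are untouched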

@ViralBShah
Member

I think that applying the permutation later to sort it is OK, so long as we can avoid memory allocations across different calls to sort. We are certainly waiting on array views.

@kmsquire
Member

I realized later that the first permutation can just be written as perm = sortperm(sub(v, start:stop)) (fixed inline). My original thought was to keep track of the original indices, which in theory is possible, but probably not necessary. This doesn't actually look so bad, then.

… is faster than the dual transpose version, in particular with large sparse structures. Needs improvement in the use of sortperm, which is causing a large amount of allocation.
@jiahao jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f on October 11, 2014 22:06
@ViralBShah
Member

spmm is the new version from this PR, and is considerably faster than the current implementation.

julia> m=10^5; n=10^5; k=10^5;
julia> x = sprand(n,m,0.0002);
julia> s = sprand(n,m,0.001);
julia> @time spmm(x,s);
elapsed time: 27.015908333 seconds (15292728544 bytes allocated, 6.46% gc time)
julia> @time x*s;
elapsed time: 110.150445452 seconds (15279436352 bytes allocated, 0.40% gc time)

@ViralBShah
Member

In this case, the garbage is generated largely by sortperm. @kmsquire Do we have a way to pass the result array to sortperm? If I can reuse the result array in the loop, it should give huge gains. For many problems, I am seeing a significant portion of the time going to GC, and I think it is possible to do this whole computation without allocating extra memory in every iteration.

kmsquire added a commit that referenced this pull request Oct 23, 2014
* Allows preallocation (and therefore, reuse) of an index array
* Useful for #8316
@kmsquire kmsquire mentioned this pull request Oct 23, 2014
@kmsquire
Member

@ViralBShah, see #8792.

@ViralBShah
Member

@gchisholm I have done some experiments, and it seems that doing a transpose is faster if there is on the order of 1 element per row or column, but the sorting is faster in other cases. All this is highly dependent on matrix structure and size. I will make it an optional parameter, and also put in some defaults.

@ViralBShah
Member

If you want to experiment, use this, but you will need the latest master.

function sortCSC!{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti})
    m, n = size(A)
    colptr = A.colptr; rowval = A.rowval; nzval = A.nzval

    # Scratch buffers, allocated once at the maximum possible column length.
    index = zeros(Ti, m)
    row = zeros(Ti, m)
    val = zeros(Tv, m)

    for i = 1:n
        @inbounds col_start = colptr[i]
        @inbounds col_end = colptr[i+1] - 1

        numrows = int((col_end - col_start) + 1)

        # Copy this column's row indices and values into the scratch buffers.
        jj = 1
        @simd for j = col_start:col_end
            @inbounds row[jj] = rowval[j]
            @inbounds val[jj] = nzval[j]
            jj += 1
        end

        # Compute the sorting permutation into the preallocated `index`,
        # using unsafe views of just the first `numrows` scratch entries.
        sortperm!(pointer_to_array(pointer(index), numrows),
                  pointer_to_array(pointer(row), numrows))

        # Write the column back in sorted order.
        jj = 1
        @simd for j = col_start:col_end
            @inbounds rowval[j] = row[index[jj]]
            @inbounds nzval[j] = val[index[jj]]
            jj += 1
        end
    end
    return A
end
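[ed: hypothetical usage, assuming the preallocated-output sortperm! from #8792 is available (hence the need for the latest master): build a small CSC matrix whose single column has deliberately unsorted row indices, then repair it in place:]

A = SparseMatrixCSC(3, 1, [1, 4], [3, 1, 2], [30.0, 10.0, 20.0])
sortCSC!(A)
A.rowval   # => [1, 2, 3]
A.nzval    # => [10.0, 20.0, 30.0]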

@ViralBShah
Member

I suspect we can even special-case the sorting of 1, 2, and 3 elements.
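[ed: a sketch of what such special-casing could look like, with a hypothetical helper that is not part of this PR: columns of length 1 are already sorted, and lengths 2 and 3 can be handled with explicit compare-and-swaps instead of a general sortperm call:]

function sort_column_small!(rowval, nzval, lo::Int, hi::Int)
    # Compare-and-swap that keeps rowval and nzval paired.
    function cswap!(i, j)
        if rowval[i] > rowval[j]
            rowval[i], rowval[j] = rowval[j], rowval[i]
            nzval[i],  nzval[j]  = nzval[j],  nzval[i]
        end
    end
    n = hi - lo + 1
    if n == 2
        cswap!(lo, hi)
    elseif n == 3
        # Three compare-and-swaps form a complete sorting network for three elements.
        cswap!(lo, lo+1); cswap!(lo+1, hi); cswap!(lo, lo+1)
    end
    return nothing
end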

@ViralBShah
Member

For very large and very sparse matrices, we end up spending more than half the time just sorting. I wonder if we can just generate all the tuples in such cases and call sparse().
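[ed: a sketch of that idea, with `C` again standing for a product whose columns are unsorted: expand it into coordinate triplets and let sparse() rebuild the CSC structure, which sorts each column (and combines duplicates) during construction:]

I, J, V = findnz(C)                           # entries come out in storage order
C2 = sparse(I, J, V, size(C, 1), size(C, 2))  # construction sorts each column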

@ViralBShah
Member

Cc: @mlubin @IainNZ

@ViralBShah ViralBShah closed this Oct 25, 2014
ViralBShah pushed a commit that referenced this pull request Oct 25, 2014
This uses sortSparseMatrixCSC!(S) instead of double transpose to sort the result.
(Viral committing an improved version of Glen Chisholm's original PR).
@mlubin
Member

mlubin commented Oct 25, 2014

Calling sparse() is the same as the double transpose, I think. I don't see a parameter for this in your commit, @ViralBShah.
I haven't personally benchmarked this, but the double transpose seems much more common in sparse linear algebra texts than sorting. Benchmarking on random sparse matrices is not sufficient; I would've liked to see some benchmarks with matrices from the Matrix Market or the UFL sparse matrix collection before changing the default.

@ViralBShah
Member

Actually, by the time I was done, the double transpose was consistently slower in various experiments with identity matrices, dense matrices in sparse form, and sprand-generated matrices of different sizes.

We should probably have a proper perf suite. In my opinion, matrices from the UFL sparse collection tend to be small and are not representative of the scales that are common today. Putting a set of sparse matrices together for benchmarks is a separate project, which would be useful by itself beyond sparse multiplication.

@ViralBShah
Member

Figured it was easier to provide the double transpose option now. Both methods can be accessed with:

Base.LinAlg.spmatmul(A,B,sortindices=:doubletranspose)
Base.LinAlg.spmatmul(A,B,sortindices=:sortcols)
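[ed: for example, both strategies can be timed on the same inputs; the actual numbers will depend on size and sparsity, as discussed above:]

A = sprand(10^5, 10^5, 0.0002);
B = sprand(10^5, 10^5, 0.001);
@time Base.LinAlg.spmatmul(A, B, sortindices=:doubletranspose);
@time Base.LinAlg.spmatmul(A, B, sortindices=:sortcols);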

ViralBShah added a commit that referenced this pull request Oct 26, 2014
Add tests for sparse matrix multiplication that exercise both the sortcols and the doubletranspose methods.
@tkelman
Contributor

tkelman commented Nov 18, 2014

Thoughts on whether or not the 3 commits here should be backported? If so, might be best to wait until right after 0.3.3 just so there's more time for testing, comparison, etc.

@ViralBShah
Member

Isn't this an API change, which will potentially make 0.3.3 and future releases incompatible with earlier releases in the 0.3 series? We can backport a different algorithm if it is shown to be faster in a more comprehensive set of tests, but not the API change.

@tkelman
Contributor

tkelman commented Nov 19, 2014

@ViralBShah yes, good point; we can avoid backporting this, as well as the exporting of transpose!, which doesn't apply cleanly without this change.
