spmatmul sparse matrix multiplication - performance improvements #30372

Merged · 16 commits · Dec 17, 2018

Conversation

@KlausC (Contributor) commented Dec 12, 2018

The implementation of Gustavson's algorithm used to need a sorting step after the result matrix was constructed. The algorithm has been modified to produce the result column vectors properly ordered.
The speed improvement is about 4x at comparable space usage.
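For readers unfamiliar with the approach, here is a minimal sketch of the idea, not the PR's actual code (`spmatmul_sketch` is a hypothetical name): Gustavson's algorithm with a dense per-column accumulator, where each column's row indices are sorted in place before being written out, so the finished matrix needs no separate sorting pass.

```julia
using SparseArrays

# Sketch: C = A*B in CSC format, one dense accumulator per result column,
# per-column in-place sort of the collected row indices.
function spmatmul_sketch(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    mA, nB = size(A, 1), size(B, 2)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    colptr = Vector{Ti}(undef, nB + 1); colptr[1] = 1
    rowval = Ti[]; nzval = Tv[]        # grown as needed (the PR pre-estimates the result nnz)
    x = zeros(Tv, mA)                  # dense accumulator for one result column
    seen = fill(false, mA)             # Vector{Bool} tagging occupied slots
    for j in 1:nB
        rows = Ti[]
        for kp in nzrange(B, j)
            k = rowvals(B)[kp]; bkj = nonzeros(B)[kp]
            for ip in nzrange(A, k)
                i = rowvals(A)[ip]
                seen[i] || (seen[i] = true; push!(rows, i))
                x[i] += nonzeros(A)[ip] * bkj
            end
        end
        sort!(rows)                    # per-column sort keeps rowval ordered
        for i in rows
            push!(rowval, i); push!(nzval, x[i])
            seen[i] = false; x[i] = zero(Tv)   # reset accumulator for the next column
        end
        colptr[j + 1] = colptr[j] + length(rows)
    end
    return SparseMatrixCSC(mA, nB, colptr, rowval, nzval)
end
```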
old results:

julia> n = 10000;
julia> Random.seed!(0); A = sprandn(n, n, 0.01); b = randn(n);  nnz(A)
1000598
julia> @benchmark A * A
BenchmarkTools.Trial: 
  memory estimate:  977.61 MiB
  allocs estimate:  28
  --------------
  minimum time:     5.815 s (3.76% GC)
  median time:      5.815 s (3.76% GC)
  mean time:        5.815 s (3.76% GC)
  maximum time:     5.815 s (3.76% GC)
  --------------
  samples:          1
  evals/sample:     1

new results:

julia> @benchmark A * A
BenchmarkTools.Trial: 
  memory estimate:  1.04 GiB
  allocs estimate:  10
  --------------
  minimum time:     1.299 s (3.74% GC)
  median time:      1.328 s (5.37% GC)
  mean time:        1.327 s (5.55% GC)
  maximum time:     1.353 s (7.64% GC)
  --------------
  samples:          4
  evals/sample:     1

@jebej (Contributor) commented Dec 13, 2018

This doesn't seem to work with SparseVectors.

@ViralBShah (Member) commented:

Can you test this with matrices that are 10^6x10^6 with 1 and 10 nonzeros per row? IIRC, we used to have this method in the very early days but changed it.

There are definitely problem sizes for which this will be faster (but not asymptotically), and if we want these benefits, we need to have some reasonable heuristics to decide which variant to pick.

@ViralBShah added the sparse (Sparse arrays) label on Dec 13, 2018
@KristofferC (Member) commented:

I had some benchmarks in #29682 that could be used here as well.

@ViralBShah (Member) commented:

Just a thought - we can consider having an optional parameter for accumulating into a dense vector or by sorting. The default would have to be the one that is asymptotically good, but this dense accumulator approach could be picked optionally.

@KlausC (Contributor, Author) commented Dec 16, 2018

Hi @ViralBShah,
I adapted your first proposal and found a robust and simple way to decide between the full scan (my first proposal in this PR, for the less sparse corner cases) and the sorting of the row indices (the original approach, which handles the asymptotic case better: huge matrices with strong sparsity).
The decision is made for each result column separately.
Interestingly, the decision depends on nnz * log2(nnz) * 3 < m, where m is the result column length and nnz is the number of nonzeros in one column.
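As a rough illustration (hypothetical helper names; the PR's actual prefer_sort/ilog2 may differ in detail), the per-column choice could look like:

```julia
# Sorting the collected row indices costs roughly nnz*log2(nnz); scanning the
# whole dense accumulator costs m. Prefer sorting when 3*nnz*log2(nnz) < m.
ilog2(n::Integer) = 8 * sizeof(n) - leading_zeros(n) - 1   # floor(log2(n)) for n >= 1
prefer_sort(nz::Integer, m::Integer) = 3 * nz * ilog2(nz) < m
```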

Some additional changes are:

  • initial estimation of the nnz of the result
  • replace Vector{Int} by Vector{Bool} to tag nonzero slots
  • re-use nzvals as dense accumulator storage
  • use in-place sort! for each result column individually; removed sparseMatrixSort! at the end

As a result, all cases observed so far show a speedup.

[Figure: sort_or_scan0_28]
This plot shows the evaluation of the alternatives: the x-axis is log2(m), the y-axis is log2(nnz). The blue points have better performance with the scanning approach, the yellow points with the sorting approach. The solid line is the graph of nnz -> 3*nnz*log2(nnz).

Benchmark results:
The measurements were done with a copy of the original code, SparseWrappers.spmatmul_orig, and the modified version, SparseWrappers.spmatmul. The latter is identical to the version in this PR.

My own original example:

julia> n = 10000;
julia> Random.seed!(0); A = sprandn(n, n, 0.01); b = randn(n);  nnz(A)
1000598
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  5.259 s (28 allocations: 122.60 MiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  1.400 s (8 allocations: 1.04 GiB)

The asymptotic case of @ViralBShah

julia> n = 10^6
1000000
julia> Random.seed!(0); A = sprandn(n, n, 10 / n); b = randn(n);  nnz(A)
10003563
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  9.613 s (24 allocations: 2.43 GiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  6.449 s (9 allocations: 1.65 GiB)

julia> Random.seed!(0); A = sprandn(n, n, 1 / n); b = randn(n);  nnz(A)
1000757
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  220.634 ms (20 allocations: 106.86 MiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  162.402 ms (11 allocations: 42.20 MiB) 

The benchmarks proposed by @KristofferC

julia> for N = (^).(10, 1:7)
           S = spdiagm(0 => ones(N), 1 => -ones(N-1));
           SA = sparse(S')
           @show N
           @btime SparseWrappers.spmatmul_orig($S, $SA)
           @btime SparseWrappers.spmatmul($S, $SA)
       end
N = 10
  782.991 ns (10 allocations: 1.80 KiB)
  674.351 ns (7 allocations: 2.22 KiB)
N = 100
  6.028 μs (10 allocations: 11.81 KiB)
  4.024 μs (5 allocations: 8.11 KiB)
N = 1000
  55.681 μs (12 allocations: 110.41 KiB)
  36.536 μs (7 allocations: 77.89 KiB)
N = 10000
  544.767 μs (18 allocations: 1.07 MiB)
  356.850 μs (8 allocations: 775.78 KiB)
N = 100000
  5.501 ms (18 allocations: 10.68 MiB)
  3.685 ms (9 allocations: 7.57 MiB)
N = 1000000
  56.884 ms (18 allocations: 106.81 MiB)
  54.472 ms (9 allocations: 75.72 MiB)
N = 10000000
  995.334 ms (18 allocations: 1.04 GiB)
  620.350 ms (9 allocations: 757.22 MiB)

julia> using MatrixDepot # v0.8.0
populating internal database...

julia> md = mdopen("Wissgott/parabolic_fem")
(RS Wissgott/parabolic_fem(#1853)  525825x525825(3674625/2100225) 2007 [A, b] 'computational fluid dynamics problem' [Parabolic FEM, diffusion-convection reaction, constant homogenous diffusion]()

julia> A = sparse(md.A); # MatrixDepot delivers a Symmetric{SparseMatrixCSC}
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  422.052 ms (20 allocations: 248.35 MiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  273.490 ms (9 allocations: 435.51 MiB)

@ViralBShah (Member) commented Dec 16, 2018

This is some really nice work @KlausC!

Is it possible for the spmatmul_orig tests to be run with sortcols and doubleTranspose, just so that we have comprehensive testing of all cases? Also, can we try to test more sprand cases with n varying from 10 to 10^6 and a few varying densities?

This should hopefully be easy to do, and will serve as a good record for this PR.

Is there any benefit from inlining the prefer_sort and ilog2?

@KlausC (Contributor, Author) commented Dec 16, 2018

The same examples with sortindices=:doubletranspose essentially showed inferior results.

julia> for N = (^).(10, 1:7)
           S = spdiagm(0 => ones(N), 1 => -ones(N-1));
           SA = sparse(S')
           @show N
           @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose)
           @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols)
           @btime SparseWrappers.spmatmul($S, $SA)
       end
N = 10
  830.519 ns (12 allocations: 2.25 KiB)
  761.190 ns (10 allocations: 1.80 KiB)
  657.576 ns (7 allocations: 2.22 KiB)
N = 100
  5.906 μs (12 allocations: 15.16 KiB)
  5.898 μs (10 allocations: 11.81 KiB)
  3.900 μs (5 allocations: 8.11 KiB)
N = 1000
  49.402 μs (16 allocations: 141.72 KiB)
  53.948 μs (12 allocations: 110.41 KiB)
  35.144 μs (7 allocations: 77.89 KiB)
N = 10000
  475.227 μs (20 allocations: 1.37 MiB)
  524.470 μs (18 allocations: 1.07 MiB)
  347.792 μs (8 allocations: 775.78 KiB)
N = 100000
  5.015 ms (20 allocations: 13.73 MiB)
  5.319 ms (18 allocations: 10.68 MiB)
  3.577 ms (9 allocations: 7.57 MiB)
N = 1000000
  73.128 ms (20 allocations: 137.33 MiB)
  70.665 ms (18 allocations: 106.81 MiB)
  53.516 ms (9 allocations: 75.72 MiB)
N = 10000000
  1.150 s (20 allocations: 1.34 GiB)
  960.450 ms (18 allocations: 1.04 GiB)
  566.719 ms (9 allocations: 757.22 MiB)

julia> n = 10^6
1000000

julia> Random.seed!(0); A = sprandn(n, n, 1 / n); b = randn(n);  nnz(A)
1000757
julia> S = SA = A;
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose)
  404.522 ms (22 allocations: 106.89 MiB)
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols)
  243.995 ms (20 allocations: 106.86 MiB)
julia> @btime SparseWrappers.spmatmul($S, $SA)
  170.740 ms (11 allocations: 42.20 MiB)

julia> n = 10^6
1000000
julia> Random.seed!(0); A = sprandn(n, n, 10 / n); b = randn(n);  nnz(A)
10003563
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose);
  39.763 s (26 allocations: 3.91 GiB)
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols);
  9.845 s (24 allocations: 2.43 GiB)
julia> @btime SparseWrappers.spmatmul($S, $SA);
  6.227 s (9 allocations: 1.65 GiB)

julia> using MatrixDepot
julia> md = mdopen("Wissgott/parabolic_fem")
(RS Wissgott/parabolic_fem(#1853)  525825x525825(3674625/2100225) 2007 [A, b] 'computational fluid dynamics problem' [Parabolic FEM, diffusion-convection reaction, constant homogenous diffusion]()

julia> A = sparse(md.A);
julia> S = SA = A;
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose);
  715.782 ms (22 allocations: 392.31 MiB)
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols);
  417.349 ms (20 allocations: 248.35 MiB)
julia> @btime SparseWrappers.spmatmul($S, $SA);
  278.560 ms (9 allocations: 435.51 MiB)

@ViralBShah (Member) commented:

@KlausC The tests should be an easy fix. I also propose that we mention this performance enhancement in the NEWS file, and note that we no longer have the two options for spmatmul. Since spmatmul is in a stdlib and also not an exported function, I think we should be OK with this.

@ViralBShah (Member) commented Dec 17, 2018

Also, for future reference, here's a paper by @aydinbuluc on the state of the art in sparse matrix multiplication: https://people.eecs.berkeley.edu/~aydin/spgemm_p2s218.pdf, with code at https://bitbucket.org/YusukeNagasaka/mtspgemmlib

@KlausC (Contributor, Author) commented Dec 17, 2018

... performance tests and see if it helps?

        if numrows <= 16
            alg = Base.Sort.InsertionSort
        else
            alg = Base.Sort.QuickSort
        end

I tried it and did not see a difference. The QuickSort algorithm itself branches to InsertionSort if numrows <= 20, so the best we could gain is saving one call.

@ViralBShah (Member) commented:

Thanks!

@ViralBShah (Member) commented:

I am actually happy to merge this as is (the test changes are not that important).

@ViralBShah merged commit fae262c into JuliaLang:master on Dec 17, 2018
@KlausC (Contributor, Author) commented Dec 17, 2018

I committed additional test cases and a NEWS.md entry after your merge. How do I get these last changes included?

KristofferC pushed a commit that referenced this pull request on Dec 20, 2018:

* General performance improvements for sparse matmul
  Details for the polyalgorithm are in: #30372
  (cherry picked from commit fae262c)

@KristofferC mentioned this pull request on Dec 20, 2018
KristofferC pushed a commit that referenced this pull request on Jan 11, 2019:

* General performance improvements for sparse matmul
  Details for the polyalgorithm are in: #30372
  (cherry picked from commit fae262c)

@KristofferC mentioned this pull request on Jan 11, 2019
@StefanKarpinski added the triage (This should be discussed on a triage call) and backport 1.0 labels and removed the triage label on Jan 31, 2019
@JeffBezanson removed the backport 1.0 and triage labels on Jan 31, 2019
KristofferC pushed a commit to JuliaSparse/SparseArrays.jl that referenced this pull request on Nov 2, 2021:

* General performance improvements for sparse matmul
  Details for the polyalgorithm are in: JuliaLang/julia#30372
Labels: performance (Must go faster), sparse (Sparse arrays)