SparseArrays: mul!(W, X', V) much slower than mul!(V, X, W) for Float32 entries #822

Closed
severinson opened this issue Mar 18, 2021 · 2 comments
Labels: performance (Must go faster), sparse (Sparse arrays)

severinson commented Mar 18, 2021

I'm seeing surprisingly low performance for mul!(W, X', V) when X is a SparseMatrixCSC with Float64 entries (X=sprand(2504, 100000, 0.05)) and W and V are dense matrices with Float32 entries. This operation takes about an order of magnitude longer than the same operation when W and V are dense matrices with Float64 entries. However, if X has Bool entries (X=sprand(Bool, 2504, 100000, 0.05)) I don't see any performance difference between Float64 and Float32 entries for V and W.

julia> using SparseArrays, LinearAlgebra

# Float64 X, V and W
julia> X = sprand(2504, 100000, 0.05); V = randn(2504, 3); W = zeros(100000, 3);

## transposed X
julia> @time mul!(W, X', V);
  0.082900 seconds (80.11 k allocations: 4.399 MiB)
julia> @time mul!(W, X', V);
  0.033305 seconds (1 allocation: 48 bytes)

## non-transposed X
julia> @time mul!(V, X, W);
  0.122455 seconds (46.03 k allocations: 2.414 MiB)
julia> @time mul!(V, X, W);
  0.088595 seconds

# Float64 X, Float32 V and W
julia> X = sprand(2504, 100000, 0.05); V = randn(Float32, 2504, 3); W = zeros(Float32, 100000, 3);

## transposed X
julia> @time mul!(W, X', V);
  0.369262 seconds (77.55 k allocations: 4.190 MiB)
julia> @time mul!(W, X', V);
  0.324316 seconds (1 allocation: 48 bytes) # about 10x slower than the same operation with Float64 entries

## non-transposed X
julia> @time mul!(V, X, W);
  0.123769 seconds (46.30 k allocations: 2.425 MiB)
julia> @time mul!(V, X, W);
  0.087341 seconds # about the same performance as the same operation with Float64 entries

julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 7 1700X Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
julia> LinearAlgebra.versioninfo()
BLAS: libopenblas (OpenBLAS 0.3.9  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Zen MAX_THREADS=32)
LAPACK: libopenblas64_
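
A possible workaround sketch for the Float32 case above, assuming (as the timings suggest) that the uniform-Float64 path stays fast: promote the small dense factor to Float64, accumulate into a Float64 scratch matrix, and convert the result back. V64 and Wbuf below are illustrative names, not from the original report.

# Sketch: run the transposed product entirely in Float64, then convert back.
V64  = Float64.(V)                        # 2504×3, cheap to promote
Wbuf = Matrix{Float64}(undef, 100000, 3)  # Float64 scratch for the result
mul!(Wbuf, X', V64)                       # uniform-Float64 path (~0.03 s above)
W .= Wbuf                                 # element-wise conversion back into the Float32 output

The only extra cost is the two small dense buffers; the sparse matrix itself is untouched.
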
dkarrasch added the performance (Must go faster) and sparse (Sparse arrays) labels on Mar 23, 2021
dkarrasch (Member) commented

I have benchmarked the hell out of different approaches to this in JuliaLang/julia#38876 (just as others before me; see the references there), but couldn't find a better way to do it. The lesson so far is that, if you can afford it memory-wise, you should check whether materializing the adjoint/transpose before multiplication is beneficial. That may depend strongly on your specific use case (sparsity, size, etc.). Materializing the adjoint is not necessarily beneficial, though:

julia> using BenchmarkTools  # for @btime

julia> X = sprand(2504, 100000, 0.05); V = randn(Float32, 2504, 3); W = zeros(Float32, 100000, 3);

julia> @btime mul!($W, copy(($X)'), $V, true, false);
  365.290 ms (9 allocations: 191.07 MiB)

julia> @btime mul!($W, $(X'), $V, true, false);
  266.900 ms (0 allocations: 0 bytes)

Apparently, most of the time is spent on conversion:

julia> X = sprand(Float32, 2504, 100000, 0.05); V = randn(Float32, 2504, 3); W = zeros(Float32, 100000, 3);

julia> @btime mul!($W, copy(($X)'), $V, true, false);
  372.054 ms (11 allocations: 143.27 MiB)

julia> @btime mul!($W, $(X'), $V, true, false);
  44.057 ms (0 allocations: 0 bytes)

Not sure how much error it would introduce to first downscale X to Float32.
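
A minimal sketch of that idea, applied to the original Float64 X from the issue (X32 is an illustrative name):

X = sprand(2504, 100000, 0.05)  # original Float64 sparse matrix
X32 = Float32.(X)               # broadcasting keeps the sparsity structure: a SparseMatrixCSC{Float32} with the same pattern
mul!(W, X32', V)                # all eltypes match, so this hits the fast uniform-Float32 path

The conversion itself only rounds each stored value of X to the nearest Float32; beyond that, the products are then accumulated in Float32 arithmetic, which is where most of any additional error would come from.
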

vtjnash closed this as completed on Apr 13, 2021
vtjnash (Member) commented Apr 13, 2021

Closing, as "couldn't find a better way to do it" seems to be about optimal.
