Closed
Description
I'm seeing surprisingly low performance for `mul!(W, X', V)` when `X` is a `SparseMatrixCSC` with `Float64` entries (`X = sprand(2504, 100000, 0.05)`) and `W` and `V` are dense matrices with `Float32` entries. This operation takes about an order of magnitude longer than the same operation when `W` and `V` are dense matrices with `Float64` entries. However, if `X` has `Bool` entries (`X = sprand(Bool, 2504, 100000, 0.05)`), I don't see any performance difference between `Float64` and `Float32` entries for `V` and `W`.
> using SparseArrays, LinearAlgebra
# Float64 X, V and W
> X = sprand(2504, 100000, 0.05); V = randn(2504, 3); W = zeros(100000, 3);
## transposed X
> @time mul!(W, X', V);
0.082900 seconds (80.11 k allocations: 4.399 MiB)
> @time mul!(W, X', V);
0.033305 seconds (1 allocation: 48 bytes)
## non-transposed X
> @time mul!(V, X, W);
0.122455 seconds (46.03 k allocations: 2.414 MiB)
> @time mul!(V, X, W);
0.088595 seconds
# Float64 X, Float32 V and W
> X = sprand(2504, 100000, 0.05); V = randn(Float32, 2504, 3); W = zeros(Float32, 100000, 3);
## transposed X
> @time mul!(W, X', V);
0.369262 seconds (77.55 k allocations: 4.190 MiB)
> @time mul!(W, X', V);
0.324316 seconds (1 allocation: 48 bytes) # about 10x slower than the same operation with Float64 entries
## non-transposed X
> @time mul!(V, X, W);
0.123769 seconds (46.30 k allocations: 2.425 MiB)
> @time mul!(V, X, W);
0.087341 seconds # about the same performance as the same operation with Float64 entries
> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 7 1700X Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
> LinearAlgebra.versioninfo()
BLAS: libopenblas (OpenBLAS 0.3.9 USE64BITINT DYNAMIC_ARCH NO_AFFINITY Zen MAX_THREADS=32)
LAPACK: libopenblas64_
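
A possible workaround to try while this is open, on the assumption (not confirmed here) that the slow path is triggered by mixing a `Float64` sparse matrix with `Float32` dense matrices: make the element types agree before calling `mul!`. The sketch below uses smaller sizes than the report so it runs quickly, and the names `X32` and `W64` are only illustrative. It shows both directions (converting `X` down to `Float32`, or doing the product in `Float64`) and checks that the results agree to single precision.

```julia
using SparseArrays, LinearAlgebra

# Smaller problem than in the report, same shapes and element types.
X = sprand(250, 10_000, 0.05)        # SparseMatrixCSC{Float64,Int}
V = randn(Float32, 250, 3)
W = zeros(Float32, 10_000, 3)

# Option 1: convert X once so that X, V and W all have Float32 entries.
# This costs one extra copy of the nonzero values of X.
X32 = SparseMatrixCSC{Float32,Int}(X)
mul!(W, X32', V)

# Option 2: keep X as Float64 and do the product with Float64 dense matrices.
W64 = zeros(Float64, size(W))
mul!(W64, X', Float64.(V))

# Both routes should agree up to Float32 rounding.
println(isapprox(W, W64; rtol = 1e-4))
```

Whether either option actually recovers the `Float64` timings would need to be checked with `@time` on the original 2504 × 100000 problem; option 1 also rounds the entries of `X` to single precision, which may or may not be acceptable.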