
Use mul! in Diagonal*Matrix #42321

Merged
merged 16 commits into JuliaLang:master from jishnub:muldiag on Oct 5, 2021

Conversation

jishnub
Contributor

@jishnub jishnub commented Sep 20, 2021

This provides a performance boost for matrices of numbers. See the Discourse post.

On master

julia> A = ones(3000, 2000); DR = Diagonal(ones(size(A, 2))); DL = Diagonal(ones(size(A,1)));

julia> @btime $A * $DR;
  41.844 ms (4 allocations: 45.78 MiB)

julia> @btime $DL * $A;
  42.521 ms (2 allocations: 45.78 MiB)

After this PR

julia> @btime $A * $DR;
  28.670 ms (4 allocations: 45.78 MiB)

julia> @btime $DL * $A;
  29.377 ms (2 allocations: 45.78 MiB)
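For numeric eltypes, multiplying by a Diagonal is pure row/column scaling, which is what the broadcast-based mul! path exploits. A small sketch of the identity involved (the variable names here are illustrative, not from the PR):

```julia
using LinearAlgebra

A  = collect(reshape(1.0:12.0, 3, 4))
DR = Diagonal([1.0, 2.0, 3.0, 4.0])   # A * DR scales the columns of A
DL = Diagonal([1.0, 2.0, 3.0])        # DL * A scales the rows of A

# right multiplication: column j is scaled by DR.diag[j]
@assert A * DR ≈ A .* permutedims(DR.diag)
# left multiplication: row i is scaled by DL.diag[i]
@assert DL * A ≈ DL.diag .* A
```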

@MasonProtter
Contributor

MasonProtter commented Sep 20, 2021

Interesting finding! Is there any reason not to implement this with broadcasting, though? Here are the numbers I'm seeing:

julia> using BenchmarkTools: @btime

julia> using LinearAlgebra: Diagonal, mul!, similar, promote_op, transpose

julia> foreach((10, 50, 100, 500, 1_000, 5_000, 10_000)) do N
           M = N + rand(-5:5)
           A = rand(N, M)
           Dr = Diagonal(float.(1:M))

           println((;N, M))
           println("Current algorithm")
           o1 = @btime $A * $Dr
           println("Proposed algorithm")
           o2 = @btime $A *̂ $Dr
           println("Broadcasting")
           o3 = @btime $A .* transpose($Dr.diag)
           println()

           @assert o1 ≈ o2 ≈ o3
       end
(N = 10, M = 14)
Current algorithm
  212.190 ns (3 allocations: 1.31 KiB)
Proposed algorithm
  192.421 ns (3 allocations: 1.31 KiB)
Broadcasting
  141.649 ns (1 allocation: 1.22 KiB)

(N = 50, M = 48)
Current algorithm
  2.612 μs (4 allocations: 18.92 KiB)
Proposed algorithm
  1.439 μs (4 allocations: 18.92 KiB)
Broadcasting
  1.346 μs (2 allocations: 18.83 KiB)

(N = 100, M = 99)
Current algorithm
  8.623 μs (4 allocations: 77.55 KiB)
Proposed algorithm
  4.386 μs (4 allocations: 77.55 KiB)
Broadcasting
  4.286 μs (2 allocations: 77.45 KiB)

(N = 500, M = 495)
Current algorithm
  212.060 μs (4 allocations: 1.89 MiB)
Proposed algorithm
  124.020 μs (4 allocations: 1.89 MiB)
Broadcasting
  126.190 μs (2 allocations: 1.89 MiB)

(N = 1000, M = 999)
Current algorithm
  1.336 ms (4 allocations: 7.62 MiB)
Proposed algorithm
  799.228 μs (4 allocations: 7.62 MiB)
Broadcasting
  801.688 μs (2 allocations: 7.62 MiB)

(N = 5000, M = 5001)
Current algorithm
  64.819 ms (4 allocations: 190.77 MiB)
Proposed algorithm
  43.458 ms (4 allocations: 190.77 MiB)
Broadcasting
  43.699 ms (2 allocations: 190.77 MiB)

(N = 10000, M = 9997)
Current algorithm
  266.747 ms (4 allocations: 762.71 MiB)
Proposed algorithm
  160.720 ms (4 allocations: 762.71 MiB)
Broadcasting
  159.481 ms (2 allocations: 762.71 MiB)

It seems that just using broadcast is simpler and just as fast or faster.


julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 26

@MasonProtter
Contributor

MasonProtter commented Sep 20, 2021

Normally, matrix multiplication involves a mixture of broadcast and reduction, but involving a Diagonal explicitly removes the reduction part and you're left with just an elementwise operation.

Given how much work was put into our broadcast machinery, we might as well use it.
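Concretely, the reduction over k in (D*A)[i,j] = Σₖ D[i,k]·A[k,j] collapses because D[i,k] vanishes for k ≠ i, leaving a single elementwise product. A quick check of that collapse (illustrative names, not code from the PR):

```julia
using LinearAlgebra

d = [2.0, 3.0, 5.0]
D = Diagonal(d)
A = rand(3, 3)

dense       = Matrix(D) * A   # generic matmul: full sum over k
elementwise = d .* A          # collapsed form: d[i] * A[i,j], no reduction
@assert dense ≈ elementwise
```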

@N5N3
Member

N5N3 commented Sep 21, 2021

In fact, mul!()/rmul!()/lmul!() for Diagonal are implemented via broadcast.

There should be no performance difference between 3-arg mul!() and a manual broadcast, as alpha and beta are propagated as constants.
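For reference, 5-arg mul!(C, A, B, alpha, beta) computes C = alpha*A*B + beta*C in place; the const-propagated alpha = true, beta = false case is what * lowers to. A quick sanity check:

```julia
using LinearAlgebra

D  = Diagonal([2.0, 3.0])
B  = [1.0 2.0; 3.0 4.0]
C0 = ones(2, 2)

C = copy(C0)
mul!(C, D, B, 2.0, 0.5)                  # C = 2*(D*B) + 0.5*C0
@assert C ≈ 2.0 .* (Matrix(D) * B) .+ 0.5 .* C0

C = copy(C0)
mul!(C, D, B)                            # 3-arg form: alpha = true, beta = false
@assert C ≈ Matrix(D) * B
```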

@MasonProtter
Contributor

Hm. Then why is there such a big performance difference here?

@N5N3
Member

N5N3 commented Sep 21, 2021

The difference holds only for small matrices. It might come from:

  1. Your manual broadcast uses a lazy transpose, while mul! uses permutedims, which creates a new Matrix.
  2. mul!(z,x,y::Diagonal) calculates z .= x .* permutedims(y.diag) .+ false. Maybe the .+ was not elided by the optimizer?
     That shouldn't happen, though, as
julia> f(x) = x .+= false
f (generic function with 1 method)

julia> @code_llvm f([1,2,3])
;  @ REPL[40]:1 within `f`
; Function Attrs: uwtable
define nonnull {}* @japi1_f_6357({}* %0, {}** %1, i32 %2) #0 {
top:
  %3 = alloca {}**, align 8
  store volatile {}** %1, {}*** %3, align 8
  %4 = load {}*, {}** %1, align 8
  ret {}* %4
}

@MasonProtter
Contributor

MasonProtter commented Sep 21, 2021

Hm. That eager permutedims should probably be revisited. It seems like a waste of allocations, done in the past because we didn't handle structs containing mutable data well.

@N5N3
Member

N5N3 commented Sep 21, 2021

@code_typed shows that their loop bodies have almost the same IR, except for some redundant code in mul!, which should be dropped by LLVM and does no harm to performance:

%308 = Base.mul_float(%304, %307)::Float64
%309 = Base.copysign_float(0.0, %308)::Float64
%310 = Base.ifelse(true, %308, %309)::Float64
%311 = Base.add_float(1.0, %310)::Float64
%312 = Base.ifelse(false, %311, %310)::Float64

BTW, mul! seems to perform more checks before the loop kernel; maybe this also contributes to the speed gap for small matrices.

@jishnub
Contributor Author

jishnub commented Sep 21, 2021

The difference for small matrices seems to be entirely due to permutedims materializing a new array, which allocates:

julia> A = ones(10, 14); DR = Diagonal(ones(size(A, 2)));

julia> @btime permutedims($DR.diag);
  98.516 ns (2 allocations: 96 bytes)

julia> @btime transpose($DR.diag);
  2.674 ns (0 allocations: 0 bytes)

julia> @btime $A * $DR;
  439.354 ns (3 allocations: 1.31 KiB)

julia> @btime $A .* transpose($DR.diag);
  344.735 ns (1 allocation: 1.22 KiB)

@N5N3
Member

N5N3 commented Sep 21, 2021

It seems mul!() is done via broadcast only when the non-Diagonal input has strided layout.
For example:

julia> @which mul!(zeros(100,100), Diagonal(1:100), Symmetric(randn(100,100)), true, false)
mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractVecOrMat, alpha::Number, beta::Number) in LinearAlgebra at C:\Users\MYJ\AppData\Local\Programs\Julia-1.7.0-beta4\share\julia\stdlib\v1.7\LinearAlgebra\src\matmul.jl:301
julia> @which mul!(zeros(100,100), Diagonal(1:100), randn(100,100), true, false)
mul!(out::AbstractMatrix, A::Diagonal, in::StridedMatrix{T} where T, alpha::Number, beta::Number) in LinearAlgebra at C:\Users\MYJ\AppData\Local\Programs\Julia-1.7.0-beta4\share\julia\stdlib\v1.7\LinearAlgebra\src\diagonal.jl:370

Such restrictions might have been added to avoid random memory access?
Anyway, the 5-arg mul! should be optimized before or in this PR.
If most Base matrix types are faster with broadcast than with l/rmul!() + copyto!(), I think it's fine to relax StridedMatrix to AbstractMatrix.

Edit: on master it has been relaxed with #41188.

@jishnub
Contributor Author

jishnub commented Sep 21, 2021

Would it make sense to improve mul! in a separate PR? Changing StridedMatrix to AbstractMatrix might need to be tested thoroughly.

@N5N3
Member

N5N3 commented Sep 21, 2021

Both seem OK to me.
At present, in test.jl, lmul!, rmul!, and mul! appear 11, 10, and 13 times, respectively, while * appears 170 times.
Even if we assume that half of those are references for comparison, * is tested much more thoroughly.
Without this PR, we can say that lmul! and rmul! are well tested via *.
After this PR, it becomes: mul! is tested via *, while lmul! and rmul! are not.
So it might be OK to do this change in this PR?

@dkarrasch
Member

Hm. That eager permutedims should probably be revisited. It seems like a waste of allocations, done in the past because we didn't handle structs containing mutable data well.

No, we use permutedims because transpose operates recursively, but all we want is to "reshape" the vector into a row vector for broadcasting. If there is another (lazy) way of telling broadcasting that the vector should be interpreted as 1xn, then we can go with that. Otherwise, products with matrices of matrices will just go wrong.
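The recursion is easy to see on a vector of matrices: transpose transposes the elements too, while permutedims only reshapes the vector into a 1×n row. A quick demonstration:

```julia
using LinearAlgebra

v = [[1 2; 3 4], [5 6; 7 8]]          # a vector of matrices

# transpose is recursive: the element itself gets transposed
@assert transpose(v)[1, 1] == [1 3; 2 4]

# permutedims just makes a 1×2 row, leaving the elements untouched
@assert permutedims(v)[1, 1] == [1 2; 3 4]
```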

@dkarrasch dkarrasch added linear algebra Linear algebra performance Must go faster labels Sep 21, 2021
@MasonProtter
Contributor

Non-recursive lazy transpose sounds very nice to have!

@MasonProtter
Contributor

No, we use permutedims because transpose operates recursively, but all we want is to "reshape" the vector into a row vector for broadcasting.

That shouldn’t be a problem if the method is restricted to operating on Diagonal{T, Vector{T}} where T <: Number, though, right?

I guess if someone made some wacky number type that could be transpose’d, that could be a problem?

@dkarrasch
Member

We could check if the eltype of D is a Number, and in that case use transpose in the broadcasting statement... or, as you say, use multiple dispatch.

@N5N3
Member

N5N3 commented Sep 22, 2021

Since (*)(A::AbstractMatrix, D::Diagonal) is accelerated,
maybe we can also optimize (*)(A::AbstractTriangular, D::Diagonal) the same way in this PR.
For example:

@inline function _copy_diag!(data, diag)
    for i in 1:length(diag)
        data[i,i] = diag[i]
    end
    data
end
for Tri in (:UpperTriangular, :LowerTriangular)
    UTri = Symbol(:Unit, Tri)
    @eval (*)(A::$Tri, D::Diagonal) = $Tri(A.data * D)
    @eval (*)(A::$UTri, D::Diagonal) = $Tri(_copy_diag!(A.data * D, D.diag))
    @eval (*)(D::Diagonal, A::$Tri) = $Tri(D * A.data)
    @eval (*)(D::Diagonal, A::$UTri) = $Tri(_copy_diag!(D * A.data, D.diag))
end

Some Benchmark (With This PR)

julia> A = LowerTriangular(randn(1000,1000));

julia> @btime $A * $Diagonal(ones(1000)); # before
  2.215 ms (5 allocations: 7.64 MiB)

julia> Revise.track(LinearAlgebra)

julia> @btime $A * $Diagonal(ones(1000)); # after
  1.208 ms (5 allocations: 7.64 MiB)
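The _copy_diag! step above matters because a UnitUpperTriangular/UnitLowerTriangular stores no diagonal: after computing A.data * D, the result's diagonal must be set to D.diag (the implicit ones times d[i]). A dense-arithmetic check of that invariant:

```julia
using LinearAlgebra

U = UnitUpperTriangular([0 2; 0 0])   # represents [1 2; 0 1]; stored diagonal is ignored
D = Diagonal([3.0, 5.0])

R = Matrix(U) * D                     # reference dense product
@assert diag(R) == D.diag             # diagonal becomes 1 * d[i] = d[i]
@assert istriu(R)                     # still upper triangular, but no longer unit
```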

@N5N3
Member

N5N3 commented Sep 22, 2021

BTW, I found that the above pattern also works for (l/r)mul!.
We can put all these definitions together to simplify code:

for Tri in (:UpperTriangular, :LowerTriangular)
    UTri = Symbol(:Unit, Tri)
    for fun in (:*, :rmul!)
        @eval $fun(A::$Tri, D::Diagonal) = $Tri($fun(A.data, D))
        @eval $fun(A::$UTri, D::Diagonal) = $Tri(_copy_diag!($fun(A.data, D), D.diag))
    end
    for fun in (:*, :lmul!)
        @eval $fun(D::Diagonal, A::$Tri) = $Tri($fun(D, A.data))
        @eval $fun(D::Diagonal, A::$UTri) = $Tri(_copy_diag!($fun(D, A.data), D.diag))
    end
    @eval mul!(out::$Tri, D::Diagonal, A::$Tri) = $Tri(mul!(out.data, D, A.data))
    @eval mul!(out::$Tri, D::Diagonal, A::$UTri) = $Tri(_copy_diag!(mul!(out.data, D, A.data), D.diag))
    @eval mul!(out::$Tri, A::$Tri, D::Diagonal) = $Tri(mul!(out.data, A.data, D))
    @eval mul!(out::$Tri, A::$UTri, D::Diagonal) = $Tri(_copy_diag!(mul!(out.data, A.data, D), D.diag))
end

The above code supports separate optimization for *() and (r/l)mul!.
It's also easier to cover all cases, as at present only rmul! is optimized for Union{LowerTriangular,UpperTriangular}:

rmul!(A::Union{LowerTriangular,UpperTriangular}, D::Diagonal) = typeof(A)(rmul!(A.data, D))

The corresponding lmul! has been missing for years, at least since this file was created.

@dkarrasch should this be included in a separate PR?

@jishnub
Contributor Author

jishnub commented Sep 22, 2021

Perhaps this should be specialized to cases where A.data * D is fast? This shouldn't hit slow AbstractArray multiplication methods. The present methods that rely on a copy get around these methods for the cases where A.data is not a Matrix. Even the method added in this PR might need to be restricted (perhaps to StridedMatrix?).

Currently on master

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));

julia> @btime $A * $D;
  8.931 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  2.308 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  61.452 μs (7 allocations: 12.95 KiB)

With this PR, we're hitting mul! in A.data * D, which will be slow for AbstractArrays in general.

Edit: Oops, this benchmark was on Julia 1.6 and not master. On master I obtain

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));

julia> @btime $A * $D;
  3.408 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  2.331 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  3.239 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, convert(AbstractArray{eltype($A)}, $D));
  1.010 μs (4 allocations: 13.11 KiB)

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(Int, size(A,2)));

julia> @btime $A * $D;
  2.985 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  2.396 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  918.500 ns (3 allocations: 12.72 KiB)

julia> A = LowerTriangular(reshape(Float64.(1:1600), 40, 40)); D = Diagonal(ones(size(A,2)));

julia> @btime $A * $D;
  2.913 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  1.998 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  791.733 ns (3 allocations: 12.72 KiB)

So mul! seems superior if the eltypes match, and converting the eltype of the Diagonal seems to be a good strategy.

@N5N3
Member

N5N3 commented Sep 22, 2021

If I have not made a mistake:
The difference between the mul!- and r/lmul!-based *(A::AbstractMatrix, D::Diagonal) is:

  1. r/lmul! --> copyto!(similar(A, Tdest, size(A)), A) .*= D.diag
  2. mul! --> similar(A, Tdest, size(A)) .= A .* D.diag

So r/lmul! pre-promotes A if needed, while mul! does not. (Both of them promote D during the calculation.)
Since every element of A is accessed only once, should pre-promotion offer any performance gain?
(Promotion is vectorizable in most cases.)
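The two strategies listed above can be written out directly for the numeric case (illustrative names; the destination eltype is taken as the promoted type):

```julia
using LinearAlgebra

A = rand(1:9, 3, 4)                   # Int matrix: needs promotion to Float64
D = Diagonal(rand(4))
T = promote_type(eltype(A), eltype(D))

# 1. r/lmul!-style: copy (promoting) first, then scale in place
out1 = copyto!(similar(A, T, size(A)), A)
out1 .*= permutedims(D.diag)

# 2. mul!-style: write the products straight into an undef destination
out2 = similar(A, T, size(A))
out2 .= A .* permutedims(D.diag)

@assert out1 ≈ out2 ≈ A * D
```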

BTW, for your example

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));
julia> @btime $A.data * $D;
  2.331 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  3.239 μs (3 allocations: 12.72 KiB)

Their return types are not the same: A.data * D returns a Matrix{Float64}, while the mul! version returns a Matrix{Int64}. So the "regression" comes from the extra truncation.
On my PC, mul! is faster for Matrix{Float64}:

julia> @btime $A.data * $D;
  1.950 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  2.422 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, Float64, size($A.data)), $A.data, $D);
  1.250 μs (3 allocations: 12.72 KiB)

@dkarrasch
Member

I don't follow why we need this strange Numbers distinction. I think the main difference between the mul!- and [r/l]mul!-based implementations is that in the first case we only allocate a similar array and write the result into it, whereas in the latter case we allocate a similar matrix, fill it with the values of A, and then pass over it again to multiply each value. So the only difference is whether we fill the destination array with the (promoted) values of A and then modify it in place, or whether we write into an undef destination array. But why the distinction with Numbers? Why not just do the size checks, allocate the destination array, and then mul! into it?

@jishnub
Contributor Author

jishnub commented Sep 27, 2021

@dkarrasch @N5N3 could you take a look now? I have incorporated the suggested changes. I have added some extra methods for the size check and the actual multiplication to avoid replicating code; I hope this is fine.

I've fixed the 5-arg mul! by adding a branch instead of adding extra 3-arg mul! methods. This fixes the bug on master as well:

on master

julia> A = reshape([[1 2; 3 4], zeros(Int,2,2), zeros(Int, 2, 2), [5 6; 7 8]], 2, 2)
2×2 Matrix{Matrix{Int64}}:
 [1 2; 3 4]  [0 0; 0 0]
 [0 0; 0 0]  [5 6; 7 8]

julia> D = Diagonal(A)
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
 [1 2; 3 4]       ⋅
      ⋅       [5 6; 7 8]

julia> A * D
2×2 Matrix{Matrix{Int64}}:
 [7 10; 15 22]  [0 0; 0 0]
 [0 0; 0 0]     [67 78; 91 106]

julia> mul!(similar(A), A, D)
ERROR: MethodError: no method matching +(::Matrix{Int64}, ::Bool)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar

This PR

julia> mul!(similar(A), A, D)
2×2 Matrix{Matrix{Int64}}:
 [7 10; 15 22]  [0 0; 0 0]
 [0 0; 0 0]     [67 78; 91 106]

Member

@N5N3 N5N3 left a comment


Just some recommendations:

  1. There seems no reason to retain the l/rmul! in L245, L255. We can make them call 3-arg mul! (to reuse the input check).
  2. It looks good to have an optimized mul! for Diagonal outputs. (L374 is suboptimal in this case.) With it, we can omit L327 and L328.
  3. Add require_one_based_indexing in the size check function?
  4. Edited: L339 seems suboptimal for eltype <: Matrix, as mul!(A,D,B) falls back to A .= (D.diag .* B) .* true, which allocates twice unnecessarily in this case.

@dkarrasch
Member

4\. As 3-arg `mul!(A,D,B)` calls `A .= D.diag .* B .* 1`, which allocates twice unnecessarily in this case.

No, it doesn't. Broadcasting will fuse all operations, including writing into A.

@N5N3
Member

N5N3 commented Sep 27, 2021

My fault, I didn't notice that we have optimized *(::Matrix,::Matrix,::Number) via mat_mat_scalar.

@dkarrasch
Member

Just some recommendations:

1. There seems so reason to retain the `l\rmul!` in L245, L255. We can make them call 3-arg `mul!` (to reuse the input check).

I agree with this one.

2. It looks good to have an optimized `mul!` for `Diagonal` outputs. (L374 is suboptimal in this case). With it, we can omit L327 and L328.

I agree with this one, as well.

3. Add `require_one_based_indexing` in the size check function?

This is unnecessary. One-based index checking is only necessary when we index into an array in a loop that has a loop variable i in 1:n. Otherwise, there is no need for such a check.

4. L339 seems suboptimal for `eltype <: Matrix`. As 3-arg `mul!(A,D,B)` calls `A .= D.diag .* B .* 1`, which allocates twice unnecessarily in this case.

Up to the obsolete multiplication with 1 there is no issue with memory allocation. I don't think we want to introduce more runtime checks here.

@dkarrasch
Member

My fault, I didn't notice that we have optimized *(::Matrix,::Matrix,::Number) via mat_mat_scalar.

It has nothing to do with that method. It is using broadcasting, which internally writes out the loops and writes into the destination array within that loop. So that operation is in-place, as you can check with BenchmarkTools.

@N5N3
Member

N5N3 commented Sep 27, 2021

For Number elements I agree with you. But for Matrix elements:

julia> a = [randn(3,3) for i in 1:10]; b = [randn(3,3) for i in 1:10];
julia> @btime $a .* $b;
  507.216 ns (11 allocations: 1.38 KiB)
julia> @btime $a .* $b .* 1;
  895.122 ns (22 allocations: 2.67 KiB)
julia> @btime @. $a * $b * 1;
  515.544 ns (11 allocations: 1.38 KiB)

I just noticed that we won't call *(::Matrix,::Matrix,::Number), since we use .*ₛ. So the allocation does happen twice.

@dkarrasch
Member

So the allocation does happen twice.

Are you sure? IIRC, that special multiplication recognizes constant trues and avoids performing the multiplication by true = 1.

@N5N3
Member

N5N3 commented Sep 27, 2021

julia> g(a, b) = a .* b .* true
g (generic function with 1 method)
julia> @btime g($a,$b);
  880.000 ns (21 allocations: 2.62 KiB)
julia> g2(a, b) = a .* b
g2 (generic function with 1 method)
julia> @btime g2($a,$b);
  509.896 ns (11 allocations: 1.38 KiB)

I think the above example is persuasive enough.
At present, replacing a * true with a is done by LLVM.
Since matrix * number calls broadcast_preserving_zero_d, where materialize definitely calls similar, I don't think LLVM can prove this * true is droppable.

An easy solution is to replace A .* B .*ₛ alpha with .*ₛ(A, B, alpha), and make it call *(A,B,C) if needed.
Edit: .*ₛ(A, B, alpha) only works when both A and B have matrix elements. If one of them isa AbstractArray{<:Number}, then *(A,B,true) also falls back to (A*B) * true and allocates twice.

julia> a = randn(10,10);
julia> f(a,b) = a * b * true
f (generic function with 1 method)
julia> @btime f($a,$a);
  300.000 ns (1 allocation: 896 bytes)
julia> @btime f($a,$1);
  206.667 ns (2 allocations: 1.75 KiB)
julia> @btime f($1,$a);
  204.071 ns (2 allocations: 1.75 KiB)

So also add something like?

*(a::AbstractArray{<:Number},b::Number,c::Number) = b * c * a
*(a::Number,b::AbstractArray{<:Number},c::Number) = a * c * b

(I'm not sure whether Number should be restricted to Union{Real,Complex} here.)

@dkarrasch
Member

So also add something like?

*(a::AbstractArray{<:Number},b::Number,c::Number) = b * c * a
*(a::Number,b::AbstractArray{<:Number},c::Number) = a * c * b

Better not to change the order of multiplication and use broadcast:

*(a::AbstractArray{<:Number},b::Number,c::Number) = a .* b .* c
*(a::Number,b::AbstractArray{<:Number},c::Number) = a .* b .* c

I'm not sure how sustainable such definitions are, though. These would help with the array of arrays case, but what about the array of arrays of arrays? Feels a bit weird to mess with such a basic operation.

@jishnub
Contributor Author

jishnub commented Oct 1, 2021

If we add an isone check while broadcasting over *ₛ

Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
    iszero(beta::Number) ? false :
    isone(beta::Number) ? broadcasted(identity, out) : broadcasted(*, out, beta)

we may get rid of the extra allocation:

julia> D
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
 [7 15; 10 22]        ⋅
        ⋅       [67 91; 78 106]

julia> A = collect(D)
2×2 Matrix{Matrix{Int64}}:
 [7 15; 10 22]  [0 0; 0 0]
 [0 0; 0 0]     [67 91; 78 106]

julia> B = similar(A);

julia> @btime $B .= $D.diag .* $A;
  323.026 ns (4 allocations: 384 bytes)

julia> @btime mul!($B, $D, $A, true, false);
  336.775 ns (4 allocations: 384 bytes)

@jishnub
Contributor Author

jishnub commented Oct 1, 2021

I've also specialized zero and one for Diagonals, so now these functions work for nested diagonals:

julia> D = Diagonal([collect(reshape(1:4, 2, 2)), collect(reshape(5:8, 2, 2))])
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
 [1 3; 2 4]      ⋅
      ⋅      [5 7; 6 8]

julia> D2 = Diagonal([D, D]);

julia> D2 + zero(D2) == D2
true

julia> D2 * one(D2) == D2
true

@N5N3
Member

N5N3 commented Oct 2, 2021

I think the behaviour change of .*ₛ is acceptable.
In most cases .*ₛ is used to dispatch on alpha and beta, so the performance influence won't be large?

BTW, if someone dislike such modification, maybe we can use MulAddMul to achieve the same goal:

@noinline ___muldiag!(out,A,B,f::MulAddMul{<:Any,true}) = @. out = f(A * B) # out might be undef
@noinline ___muldiag!(out,A,B,f::MulAddMul) = @. out = f(A * B,out) 
@inline __muldiag!(out, D::Diagonal, B, f) = ___muldiag!(out, D.diag, B, f)
@inline __muldiag!(out, A, D::Diagonal, f) = ___muldiag!(out, A, permutedims(D.diag), f)
@inline function _muldiag!(out, A, B, alpha, beta)
    _muldiag_size_check(out, A, B)
    iszero(alpha) && return _rmul_or_fill!(out, beta)
    __muldiag!(out, A, B, MulAddMul(alpha, beta))
end

On my desktop this dynamic dispatch takes about 20 ns.
So performance does regress slightly compared with this PR in some cases, such as a non-const beta == 0.

@jishnub jishnub changed the title Use mul! in Diagonal*Matrix of numbers Use mul! in Diagonal*Matrix Oct 5, 2021
@jishnub
Contributor Author

jishnub commented Oct 5, 2021

@dkarrasch I've made most of the changes suggested, could you take a look at this?

@dkarrasch dkarrasch merged commit a8db751 into JuliaLang:master Oct 5, 2021
@jishnub jishnub deleted the muldiag branch October 5, 2021 09:47
@staticfloat
Member

@dkarrasch @jishnub I am observing a pretty substantial slowdown on the LinearAlgebra tests due to this commit.

Checking out a8db7510b97571cb9ad6cc89fc2fae527c741ea6 and the commit before it, I see (for instance) the LinearAlgebra/addmul tests go from 280s to 340s. This is when running on an Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz.

Member

@N5N3 N5N3 left a comment


Just a quick glance.

Comment on lines 10 to +12

  Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
-     iszero(beta::Number) ? false : broadcasted(*, out, beta)
+     iszero(beta::Number) ? false :
+     isone(beta::Number) ? broadcasted(identity, out) : broadcasted(*, out, beta)
Member

@N5N3 N5N3 Feb 18, 2022


This change might introduce more instability?
Usage in this PR should be union-splittable, and the branch should be eliminated in 3-arg mul! and l/rmul!.
But it might make a mess elsewhere?

Comment on lines +245 to +246
rmul!(A::AbstractMatrix, D::Diagonal) = mul!(A, A, D)
lmul!(D::Diagonal, B::AbstractVecOrMat) = mul!(B, D, B)
Member


rmul! and lmul! won't be vectorized on master, as they are self-in-place broadcasts (4x slower).
Adding @inline is useless here. Without a fix like #43185, the only solution is recoding with a for-loop.

But this should not be caught by CI, as the test sizes are usually small.
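The self-aliasing issue can be illustrated by comparing an in-place broadcast, where destination and source are the same array, with an explicit loop (a sketch with hypothetical helper names; whether SIMD actually kicks in depends on the Julia/LLVM version):

```julia
using LinearAlgebra

# rmul!-style column scaling as a self-in-place broadcast:
# A appears on both sides, so broadcast must reason about aliasing
scale_bcast!(A, d) = (A .= A .* permutedims(d); A)

# the same operation as a plain loop, with no unaliasing stage
function scale_loop!(A, d)
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        A[i, j] *= d[j]
    end
    return A
end

A = rand(4, 3); d = [2.0, 3.0, 4.0]
@assert scale_bcast!(copy(A), d) == scale_loop!(copy(A), d)
```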

Member


Thanks for taking a look, @N5N3! I vaguely remember that in a similar situation you proposed the fix of inlining the rhs. Why doesn't that help here?

Member

@N5N3 N5N3 Mar 11, 2022


IIUC, our broadcast unaliases the src from the dest to avoid memory contention, which prevents LLVM from proving dest === src.
Our for-loop-based implementation doesn't include such an unaliasing stage, so @inline is enough to enable SIMD.

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022