-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mul! performance regression on master #34013
Comments
Likely happened some time between
But there's not much between: julia/stdlib/LinearAlgebra/src/matmul.jl Lines 552 to 578 in 1a6baef
returns:
|
So the issue seems to come from when the constructor alpha, beta = promote(α, β, zero(T))
if alpha isa T && beta isa T
return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
else
return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(α, β))
end In #33743, we construct the |
@dkarrasch Thanks for the comment, looks like julia thinks
|
Now that you mention it, it's not surprising: the exact type (including its parameters) of |
No, actually, I think we do have function barriers, |
@daviehh Have you considered using using LinearAlgebra
using BenchmarkTools
using StaticArrays
ndim = 3
m1 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
m2 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
ou = MMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
@btime mul!($ou, $m1, $m2) |
That can be one weird way to get around this issue... performance seems to be the reason why julia has special code for 2x2 and 3x3 matrices multiplication in the first place due to the overhead of sending matrices to I tried StaticArrays before, for one of my particular project, the code need to tackle both large and small matrices, and it can be messy to have separate code and keep them synced & maintained to support both standard and static arrays. |
Bump, this needs to be fixed / worked-around with some priority. |
It looks like a single |
Thanks for the fix! Since the 2x2 and 3x3 matrix multiplication is dealt with differently, can this be put in the nanosildier's benchmark script? possibly append 2 or 3 to the SIZES at |
As I said in #34394 (comment), #34394 does not quite make sense to me since |
As a longer term solution, I think it's better to defer bringing 0'ness to the type domain as late as possible (which is the opposite of what we are doing ATM). I think something like this might work: struct Zero <: Number end
*(::Zero, x) = zero(x)
*(x, ::Zero) = zero(x)
function mul_impl!(C, A, B, alpha, beta)
iszero(beta) && !(beta isa Zero) && return mul_impl!(C, A, B, alpha, Zero())
...
end |
It seems that #34384 together with #29634 (comment) is the better solution. This should make everybody happy for the moment (i.e., for v1.4), since it makes 3- and 5-arg mul! for small matrices and multiplication of structured matrices performant. From there, we can think about further improvements. I think @daviehh suggested to add the 2x2 and 3x3 multiplication tests into the benchmark suite. That would help the future development. |
I included #29634 (comment) in #34384 |
Yes, #34384 together with #29634 (comment) is the better solution, thanks a lot! Also added the test for 3 and 5 arg |
#34601 is yet another PR to fix for this. It doesn't add the |
@KristofferC Thanks! Sorry for pinging/nagging but can the corresponding nano-soldier script be updated to include the benchmark test for the 2x2 and 3x3 matrix productions? I've submitted a PR at JuliaCI/BaseBenchmarks.jl#255 |
Merged, but the Nanosoldier bot also need to update to use the latest BaseBenchmarks for it to take effect. |
On the current master branch,
mul!
of small matrices can be 10x slower than 1.3 (and also allocates)test code:
With the release-1.3 version,
on master:
While it's nano-seconds, when one has
mul!
in the inner-most/hot loop, this can easily translate to big performance degradation when most of the calculation is with such matrix productions. Larger matrices (e.g.ndim = 30
) appears unaffected (dispatched to BLAS?)2.732 μs (0 allocations: 0 bytes)
on 1.3 and2.774 μs (0 allocations: 0 bytes)
on master.The text was updated successfully, but these errors were encountered: