`mul!` performance regression on master #684
Likely happened some time between
But there's not much between:
returns:
So the issue seems to come from the constructor:

```julia
alpha, beta = promote(α, β, zero(T))
if alpha isa T && beta isa T
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
else
    return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(α, β))
end
```

In JuliaLang/julia#33743, we construct the `MulAddMul` object from the runtime values of `α` and `β`.
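The core problem can be illustrated with a small sketch (hypothetical type, not the actual `MulAddMul` definition): when a struct's type parameter is chosen from a runtime value, the compiler cannot infer a concrete return type, so subsequent calls on the result go through dynamic dispatch.

```julia
# Hypothetical stand-in for MulAddMul: the type parameter records whether
# beta is zero, but that is only known from the *value* of beta at runtime.
struct AddMul{BIsZero}
    beta::Float64
end
AddMul(beta::Float64) = AddMul{iszero(beta)}(beta)

# The return type is AddMul{true} or AddMul{false} depending on the value,
# so the caller cannot infer it and calls on the result become dynamic.
make(beta) = AddMul(beta)
```

`@code_warntype make(0.5)` would report an abstract/`Union` return type here, which is the inference failure being discussed.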
@dkarrasch Thanks for the comment, looks like Julia thinks …
Now that you mention it, it's not surprising: the exact type (including its parameters) of `MulAddMul` depends on the runtime values of `α` and `β`, not just on their types.
No, actually, I think we do have function barriers.
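For reference, a minimal function-barrier sketch (hypothetical helper names, not the actual LinearAlgebra code): the value-dependent type is created in the outer function, the single dynamic dispatch is paid at the barrier call, and the kernel behind it is fully type-stable.

```julia
function axpby_outer!(y, x, β)
    b = iszero(β) ? Val(true) : Val(false)  # lift the runtime value into a type
    return axpby_inner!(y, x, β, b)         # function barrier: one dynamic dispatch
end

# Each Val gets its own specialized, type-stable method.
axpby_inner!(y, x, β, ::Val{true})  = (y .= x; y)            # β == 0: never touch β*y
axpby_inner!(y, x, β, ::Val{false}) = (y .= x .+ β .* y; y)
```

The question in this issue is whether the barrier sits in the right place for the small-matrix code paths.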
@daviehh Have you considered using StaticArrays?

```julia
using LinearAlgebra
using BenchmarkTools
using StaticArrays

ndim = 3
m1 = SMatrix{ndim,ndim}(rand(ComplexF64, ndim, ndim))
m2 = SMatrix{ndim,ndim}(rand(ComplexF64, ndim, ndim))
ou = MMatrix{ndim,ndim}(rand(ComplexF64, ndim, ndim))
@btime mul!($ou, $m1, $m2)
```
That would be one way to work around the issue. Performance is presumably why Julia has special code for 2x2 and 3x3 matrix multiplication in the first place, given the overhead of sending such small matrices to BLAS. I tried StaticArrays before, but for one particular project the code needs to handle both large and small matrices, and it can be messy to keep separate code paths for standard and static arrays synced and maintained.
Bump, this needs to be fixed / worked around with some priority.
It looks like a single …
Thanks for the fix! Since 2x2 and 3x3 matrix multiplication is handled specially, could this be added to the Nanosoldier benchmark script? Possibly by appending 2 or 3 to the `SIZES` list.
Putting triage since we need to decide between JuliaLang/julia#34384 and JuliaLang/julia#34394 for 1.4. What do people think?
As I said in JuliaLang/julia#34394 (comment), JuliaLang/julia#34394 does not quite make sense to me, since …
As a longer-term solution, I think it's better to defer bringing 0-ness into the type domain as late as possible (which is the opposite of what we are doing ATM). I think something like this might work:

```julia
struct Zero <: Number end

*(::Zero, x) = zero(x)
*(x, ::Zero) = zero(x)

function mul_impl!(C, A, B, alpha, beta)
    iszero(beta) && !(beta isa Zero) && return mul_impl!(C, A, B, alpha, Zero())
    ...
end
```
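A self-contained sketch of that idea (assumed names; `addmul!` is a toy elementwise kernel standing in for the real `mul!` method): `Zero` acts as a "strong zero" that annihilates even `NaN`, and 0-ness only enters the type domain after the runtime `iszero` check.

```julia
import Base: *

# Strong-zero singleton: Zero() * x == zero(x), even when x is NaN.
struct Zero <: Number end
*(::Zero, x::Number) = zero(x)
*(x::Number, ::Zero) = zero(x)

# Toy kernel: C .= alpha*A*B + beta*C, with the Zero re-dispatch up front.
function addmul!(C, A, B, alpha, beta)
    if iszero(beta) && !(beta isa Zero)
        return addmul!(C, A, B, alpha, Zero())  # bring 0-ness into the type domain late
    end
    C .= alpha .* (A * B) .+ beta .* C
    return C
end
```

With `beta = 0.0` and `C` full of `NaN`, the re-dispatch to `Zero()` makes `beta .* C` vanish instead of propagating `NaN`, which is exactly the behavior the `MulAddMul` machinery exists to provide.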
It seems that JuliaLang/julia#34384 together with JuliaLang/julia#29634 (comment) is the better solution. This should make everybody happy for the moment (i.e., for v1.4), since it makes 3- and 5-arg `mul!` for small matrices, and multiplication of structured matrices, performant. From there, we can think about further improvements. I think @daviehh suggested adding the 2x2 and 3x3 multiplication tests to the benchmark suite; that would help future development.
I included JuliaLang/julia#29634 (comment) in JuliaLang/julia#34384.
Yes, JuliaLang/julia#34384 together with JuliaLang/julia#29634 (comment) is the better solution, thanks a lot! Also added tests for 3- and 5-arg `mul!` …
JuliaLang/julia#34601 is yet another PR to fix this. It doesn't add the …
@KristofferC Thanks! Sorry for pinging/nagging, but can the corresponding Nanosoldier script be updated to include the benchmark tests for 2x2 and 3x3 matrix products? I've submitted a PR at JuliaCI/BaseBenchmarks.jl#255.
Merged, but the Nanosoldier bot also needs to be updated to use the latest BaseBenchmarks for it to take effect.
On the current master branch, `mul!` of small matrices can be 10x slower than 1.3 (and also allocates).

test code: …

With the release-1.3 version: …

on master: …

While it's nanoseconds, when one has `mul!` in the inner-most/hot loop, this can easily translate to a big performance degradation when most of the calculation involves such matrix products. Larger matrices (e.g. `ndim = 30`) appear unaffected (dispatched to BLAS?): `2.732 μs (0 allocations: 0 bytes)` on 1.3 and `2.774 μs (0 allocations: 0 bytes)` on master.
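The benchmark script from the report is not shown in this extract; a minimal reproduction along the lines described above might look like this (hypothetical reconstruction):

```julia
using LinearAlgebra
using BenchmarkTools

ndim = 3  # small matrices take the specialized 2x2/3x3 path rather than BLAS
m1 = rand(ComplexF64, ndim, ndim)
m2 = rand(ComplexF64, ndim, ndim)
ou = similar(m1)

@btime mul!($ou, $m1, $m2)  # compare time and allocations on 1.3 vs master
```

Running this under both Julia versions is what exposes the regression: the timing should be tens of nanoseconds with zero allocations on 1.3.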