
Fix mul! performance regression for 2x2 and 3x3 matrices #34384

Closed · wants to merge 7 commits

Conversation

@tkf (Member) commented Jan 15, 2020

This is a quick patch to fix JuliaLang/LinearAlgebra.jl#684.

On my laptop, the result of the benchmark @btime mul!($ou, $m1, $m2) from JuliaLang/LinearAlgebra.jl#684 is:

  • Before (07a16d6): 335.059 ns (1 allocation: 16 bytes)
  • After (9274757): 37.268 ns (0 allocations: 0 bytes)

@goggle (Contributor) commented Jan 15, 2020

I can confirm these results with the following benchmark:

using LinearAlgebra
using BenchmarkTools
for ndim = 2:3
    m1 = rand(ComplexF64, ndim, ndim)
    m2 = rand(ComplexF64, ndim, ndim)
    ou = similar(m1)
    @btime mul!($ou, $m1, $m2)
end

on master:

364.115 ns (1 allocation: 16 bytes)
387.926 ns (1 allocation: 16 bytes)

on PR branch:

18.957 ns (0 allocations: 0 bytes)
35.137 ns (0 allocations: 0 bytes)

Version info:

Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)

@ViralBShah added the linear algebra label · Jan 15, 2020
@daviehh (Contributor) commented Jan 15, 2020

One additional thing: it appears that while this @inline fixes the 3-argument mul!, the constant-propagation issue with the 5-argument mul! may still remain:

using LinearAlgebra
using BenchmarkTools


ndim = 3

al = .5;
bt = .7im;

m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);

@btime mul!($ou, $m1, $m2);

@btime mul!($ou, $m1, $m2, $al, $bt);

@btime mul!($ou, $m1, $m2, .5, .7im);

41.007 ns (0 allocations: 0 bytes)
334.309 ns (3 allocations: 80 bytes)
54.090 ns (0 allocations: 0 bytes)

It gets a bit better if one also incorporates #29634 (comment):

40.915 ns (0 allocations: 0 bytes)
77.679 ns (1 allocation: 32 bytes)
54.872 ns (0 allocations: 0 bytes)
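For context, here is a hedged illustration of why the interpolated-α/β case above is slow, assuming (as in the LinearAlgebra internals of this era) that MulAddMul's type parameters record isone(α) and iszero(β). When α and β are only known at runtime, the constructed type cannot be fully inferred unless constant propagation kicks in, which forces allocation and dynamic dispatch inside mul!:

using LinearAlgebra: MulAddMul
using InteractiveUtils  # for @code_warntype

# α and β are runtime values here, so the Bool type parameters of the
# returned MulAddMul depend on their *values*, not just their types.
construct(α, β) = MulAddMul(α, β)

@code_warntype construct(0.5, 0.7im)  # return type is not fully concrete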

@KristofferC (Member) commented Jan 15, 2020

How about just copy-pasting two versions of the 2x2 and 3x3 mul, one for the 5-argument form and one for the 3-argument form? Relying on constant propagation seems brittle.
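Concretely, the suggestion would look roughly like the sketch below (hypothetical function names; the real matmul2x2! also handles the transpose/adjoint flags and takes a MulAddMul argument):

# 3-argument version: C = A*B, no alpha/beta handling at all.
function matmul2x2_3arg!(C, A, B)
    A11, A21, A12, A22 = A[1,1], A[2,1], A[1,2], A[2,2]
    B11, B21, B12, B22 = B[1,1], B[2,1], B[1,2], B[2,2]
    C[1,1] = A11*B11 + A12*B21; C[1,2] = A11*B12 + A12*B22
    C[2,1] = A21*B11 + A22*B21; C[2,2] = A21*B12 + A22*B22
    return C
end

# 5-argument version: C = alpha*A*B + beta*C, with the scaling written out.
function matmul2x2_5arg!(C, A, B, alpha, beta)
    A11, A21, A12, A22 = A[1,1], A[2,1], A[1,2], A[2,2]
    B11, B21, B12, B22 = B[1,1], B[2,1], B[1,2], B[2,2]
    C[1,1] = alpha*(A11*B11 + A12*B21) + beta*C[1,1]
    C[1,2] = alpha*(A11*B12 + A12*B22) + beta*C[1,2]
    C[2,1] = alpha*(A21*B11 + A22*B21) + beta*C[2,1]
    C[2,2] = alpha*(A21*B12 + A22*B22) + beta*C[2,2]
    return C
end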

@daviehh (Contributor) commented Jan 15, 2020

The 2x2 and 3x3 mul is called in many places inside the various wrappers, so copy-pasting it everywhere may introduce bloat and make future changes harder to maintain... Alternatively, one could short-circuit MulAddMul as in #34394; would it be easier to improve upon that?

@tkf (Member, Author) commented Jan 15, 2020

The MulAddMul approach was useful for improving performance with structured matrices; see e.g. #29634 (comment).

I think the branching trick from #29634 (comment) is an easier solution.
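For reference, the "branching trick" amounts to enumerating the special values of α and β at the call site, so that each branch constructs a fully concrete object that the kernel can specialize on. A self-contained sketch of the idea, using a hypothetical ScaleAdd stand-in rather than LinearAlgebra's internal MulAddMul:

# ScaleAdd's type parameters record whether alpha == 1 and beta == 0.
struct ScaleAdd{ais1,bis0,TA,TB}
    alpha::TA
    beta::TB
end
ScaleAdd(alpha, beta) =
    ScaleAdd{isone(alpha),iszero(beta),typeof(alpha),typeof(beta)}(alpha, beta)

# Without branching, the concrete type depends on the runtime *values* of
# alpha and beta. Enumerating the cases makes each branch type-stable.
function apply_scaled!(f!, C, A, B, α, β)
    if isone(α) && iszero(β)
        f!(C, A, B, ScaleAdd{true,true,Bool,Bool}(true, false))
    elseif isone(α)
        f!(C, A, B, ScaleAdd{true,false,Bool,typeof(β)}(true, β))
    elseif iszero(β)
        f!(C, A, B, ScaleAdd{false,true,typeof(α),Bool}(α, false))
    else
        f!(C, A, B, ScaleAdd{false,false,typeof(α),typeof(β)}(α, β))
    end
    return C
end

Each branch builds a concrete type known at compile time, so f! specializes on it and the value-dependent instability disappears; this mirrors what the linked comment does with MulAddMul.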

@tkf mentioned this pull request · Jan 16, 2020
@StefanKarpinski (Member) commented:

Inlining all of gemm_wrapper! seems very excessive: all we really want to inline is the part that checks if the 2x2 and 3x3 cases should be called. If you want to inline this, it should be refactored so that the only part that's actually inlined is this:

if mA == 2 && nA == 2 && nB == 2
    return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))
end
if mA == 3 && nA == 3 && nB == 3
    return matmul3x3!(C, tA, tB, A, B, MulAddMul(α, β))
end

It should be further streamlined to something like this:

if 2 == size(A, 1) && 2 == size(A, 2) && 2 == size(B, 1) && 2 == size(B, 2)
    return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))
elseif 3 == size(A, 1) && 3 == size(A, 2) && 3 == size(B, 1) && 3 == size(B, 2)
    return matmul3x3!(C, tA, tB, A, B, MulAddMul(α, β))
end

After that there can be a non-inlined call to a function that checks dimension compatibility, looks at the stride arguments, etc. and does the relatively slow, expensive GEMM call.
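A hedged sketch of that shape is below; the names _matmul_small! and _gemm_slowpath! are placeholders for illustration, not the actual internal functions:

using LinearAlgebra: MulAddMul, matmul2x2!, matmul3x3!

# Only this thin wrapper is inlined; it contains nothing but the small-size check.
@inline function _matmul_small!(C, tA, tB, A, B, α, β)
    if 2 == size(A, 1) && 2 == size(A, 2) && 2 == size(B, 1) && 2 == size(B, 2)
        return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))
    elseif 3 == size(A, 1) && 3 == size(A, 2) && 3 == size(B, 1) && 3 == size(B, 2)
        return matmul3x3!(C, tA, tB, A, B, MulAddMul(α, β))
    end
    return _gemm_slowpath!(C, tA, tB, A, B, α, β)
end

# Deliberately not inlined: dimension/stride validation and the expensive
# BLAS gemm! call would live here, out of the caller's hot path.
@noinline function _gemm_slowpath!(C, tA, tB, A, B, α, β)
    # ... checks and the actual GEMM call ...
    return C
end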

@tkf (Member, Author) commented Jan 31, 2020

I think 1c614e3 implements what you suggested.

I also created another PR, #34601, that does not add a new @inline.

@StefanKarpinski (Member) commented:

Great! Does this version still fix the performance issue?

@KristofferC (Member) commented:

Seems to have some CI problems:

Test Failed at /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.5/LinearAlgebra/test/matmul.jl:586
  Expression: mul!(copy(C), A, B, α, 1.0) == C
   Evaluated: [NaN NaN NaN; NaN NaN NaN; NaN NaN NaN] == [1.0 1.0 1.0; 1.0 1.0 1.0; 1.0 1.0 1.0]

@StefanKarpinski (Member) commented:

[image: "Nananana" reaction GIF]

Successfully merging this pull request may close: mul! performance regression on master.

6 participants