Use Octavian.jl for large mixed-mode CPU calculations. #125
base: master
Conversation
Interestingly, this only speeds things up on 1.9. I can't imagine Octavian.jl being that much slower on <1.9?
Codecov Report: patch and project coverage have no change.

```
@@           Coverage Diff           @@
##           master     #125   +/-   ##
=======================================
  Coverage   30.27%   30.27%
=======================================
  Files          11       11
  Lines         786      786
=======================================
  Hits          238      238
  Misses        548      548
```
For timings, I get:

```julia
julia> @time using Octavian
  0.217284 seconds (396.12 k allocations: 21.375 MiB, 6.10% gc time, 6.09% compilation time)

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  43.139 ms … 44.684 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.791 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.750 ms ± 447.341 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁  ▁   ▁ ▁   ▁      ▁▁█           ▁ ▁            ▁          ▁
  █▁▁█▁▁▁█▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  43.1 ms         Histogram: frequency by time        44.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  42.711 ms … 43.548 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.004 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.067 ms ± 267.509 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▁▁ █ █ ▁▁ ▁ ▁ ▁ ▁▁
  █▁▁▁▁▁▁██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
  42.7 ms         Histogram: frequency by time        43.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  44.262 ms … 54.795 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.080 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.153 ms ±  3.564 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂
  ▅▅▁██▅▁▅▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▅ ▁
  44.3 ms         Histogram: frequency by time        54.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0-DEV.1608
Commit 0e8af1c162 (2023-06-30 04:06 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_PATH = @.
  LD_LIBRARY_PATH = /usr/local/lib/
  JULIA_NUM_THREADS = 8

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so
```

Which, aside from …. Although, GitHub Actions CI is generally restricted to 1 core, so single-threaded is probably representative. I don't know about buildkite.
I'm surprised it isn't <1.8, as 1.8 added …

It should not be compiling for differently sized inputs, only different types.

```julia
julia> C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048);

julia> @time using Octavian
  0.205357 seconds (396.14 k allocations: 21.375 MiB, 2.34% gc time, 6.29% compilation time)

julia> @time @eval matmul!(C,A,B);
 10.354272 seconds (25.52 M allocations: 1.312 GiB, 2.72% gc time, 99.67% compilation time)
```

With code coverage enabled:

```julia
julia> @time @eval matmul!(C,A,B);
202.818763 seconds (82.94 M allocations: 3.568 GiB, 0.28% gc time, 34.71% compilation time)
```

But hopefully only GemmKernels' coverage gets taken with …
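To illustrate the point above, here is a small sketch of my own (not from the thread): the compilation cost should be paid once per element-type combination, so a second call with the same types but different sizes should not recompile. The variable names and matrix sizes are arbitrary.

```julia
using Octavian

# First call for the (Float32, Float16, Float16) combination: this pays the
# one-time compilation cost for these element types.
C = zeros(Float32, 256, 256); A = rand(Float16, 256, 256); B = rand(Float16, 256, 256)
@time matmul!(C, A, B)

# Same element types, different sizes: compilation should not be repeated,
# so this call is expected to report (almost) no compilation time.
C2 = zeros(Float32, 1024, 1024); A2 = rand(Float16, 1024, 1024); B2 = rand(Float16, 1024, 1024)
@time matmul!(C2, A2, B2)
```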
Thanks for the input! Yes, we're only using a single thread, as we use multiple processes to run multiple tests in parallel.
We're just setting …
Disabling coverage on 1.6-1.8 didn't help, so this seems like a different issue.
LinearAlgebra is hilariously slow for large mixed-mode (i.e. not supported by BLAS) multiplications; Octavian.jl fares quite a bit better.

However, replacing all of our `LinearAlgebra.mul!` uses with `Octavian.matmul!` regresses test time. @chriselrod, is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

For now, only use Octavian for large mixed-mode cases, which gets test times back to before #124.
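For reference, a minimal sketch of what such a size/type gate could look like. This is not the PR's actual code: the `mixed_mul!` name, the BLAS-eltype check, and the 512×512 threshold are made-up placeholders.

```julia
using LinearAlgebra
using Octavian

# Hypothetical helper: route large multiplications whose element types BLAS
# cannot handle directly (e.g. Float16 inputs accumulated into Float32) to
# Octavian.matmul!, and keep LinearAlgebra.mul! for everything else.
function mixed_mul!(C, A, B)
    blas_eltypes = (Float32, Float64, ComplexF32, ComplexF64)
    mixed = !(eltype(A) === eltype(B) === eltype(C) && eltype(C) in blas_eltypes)
    large = length(C) >= 512 * 512   # arbitrary placeholder threshold, not tuned
    if mixed && large
        Octavian.matmul!(C, A, B)
    else
        LinearAlgebra.mul!(C, A, B)
    end
    return C
end

# Usage: Float16 inputs with a Float32 accumulator, as in the benchmarks above.
A = rand(Float16, 2048, 2048); B = rand(Float16, 2048, 2048)
C = zeros(Float32, 2048, 2048)
mixed_mul!(C, A, B)
```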