Use Octavian.jl for large mixed-mode CPU calculations. #125
base: master
Conversation
Interestingly, this only speeds things up on 1.9. I can't imagine Octavian.jl being that much slower on <1.9?
Codecov Report: patch and project coverage have no change.

```
@@           Coverage Diff           @@
##           master     #125   +/-   ##
=======================================
  Coverage   30.27%   30.27%
=======================================
  Files          11       11
  Lines         786      786
=======================================
  Hits          238      238
  Misses        548      548
```
For timings, I get:

```julia
julia> @time using Octavian
  0.217284 seconds (396.12 k allocations: 21.375 MiB, 6.10% gc time, 6.09% compilation time)

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  43.139 ms … 44.684 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.791 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.750 ms ± 447.341 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁  ▁   ▁ ▁   ▁      ▁▁█           ▁ ▁            ▁          ▁
  █▁▁█▁▁▁█▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  43.1 ms         Histogram: frequency by time        44.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  42.711 ms … 43.548 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.004 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.067 ms ± 267.509 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▁▁ █ █ ▁▁ ▁ ▁ ▁ ▁▁
  █▁▁▁▁▁▁██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
  42.7 ms         Histogram: frequency by time        43.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  44.262 ms … 54.795 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.080 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.153 ms ±  3.564 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂
  ▅▅▁██▅▁▅▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▅ ▁
  44.3 ms         Histogram: frequency by time        54.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0-DEV.1608
Commit 0e8af1c162 (2023-06-30 04:06 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_PATH = @.
  LD_LIBRARY_PATH = /usr/local/lib/
  JULIA_NUM_THREADS = 8

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so
```

Which, aside from …. Although, GitHub Actions CI is generally restricted to 1 core, so single-threaded is probably representative. I don't know about buildkite.
I'm surprised it isn't <1.8, as 1.8 added …

It should not be compiling for differently sized inputs, only different types.

```julia
julia> C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048);

julia> @time using Octavian
  0.205357 seconds (396.14 k allocations: 21.375 MiB, 2.34% gc time, 6.29% compilation time)

julia> @time @eval matmul!(C,A,B);
 10.354272 seconds (25.52 M allocations: 1.312 GiB, 2.72% gc time, 99.67% compilation time)
```

With code coverage enabled:

```julia
julia> @time @eval matmul!(C,A,B);
202.818763 seconds (82.94 M allocations: 3.568 GiB, 0.28% gc time, 34.71% compilation time)
```

But hopefully only GemmKernels' coverage gets taken with …
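To illustrate the point above, here is a small sketch of my own (not from the thread): the compilation cost should be paid once per element-type combination, so a second call with the same types but different sizes should not recompile. The variable names and matrix sizes are arbitrary.

```julia
using Octavian

# First call for the (Float32, Float16, Float16) combination: this pays the
# one-time compilation cost for these element types.
C = zeros(Float32, 256, 256); A = rand(Float16, 256, 256); B = rand(Float16, 256, 256)
@time matmul!(C, A, B)

# Same element types, different sizes: compilation should not be repeated,
# so this call is expected to report (almost) no compilation time.
C2 = zeros(Float32, 1024, 1024); A2 = rand(Float16, 1024, 1024); B2 = rand(Float16, 1024, 1024)
@time matmul!(C2, A2, B2)
```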
Thanks for the input! Yes, we're only using a single thread, as we use multiple processes to run multiple tests in parallel.
We're just setting …
Disabling coverage on 1.6-1.8 didn't help, so this seems like a different issue.
LinearAlgebra is hilariously slow for large mixed-mode (i.e. not supported by BLAS) multiplications; Octavian.jl fares quite a bit better.

However, replacing all of our `LinearAlgebra.mul!` uses with `Octavian.matmul!` regresses test time. @chriselrod, is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

For now, only use Octavian for large mixed-mode cases, which gets test times back to before #124.
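For reference, a minimal sketch of what such a size/type gate could look like. This is not the PR's actual code: the `mixed_mul!` name, the BLAS-eltype check, and the 512×512 threshold are made-up placeholders.

```julia
using LinearAlgebra
using Octavian

# Hypothetical helper: route large multiplications whose element types BLAS
# cannot handle directly (e.g. Float16 inputs accumulated into Float32) to
# Octavian.matmul!, and keep LinearAlgebra.mul! for everything else.
function mixed_mul!(C, A, B)
    blas_eltypes = (Float32, Float64, ComplexF32, ComplexF64)
    mixed = !(eltype(A) === eltype(B) === eltype(C) && eltype(C) in blas_eltypes)
    large = length(C) >= 512 * 512   # arbitrary placeholder threshold, not tuned
    if mixed && large
        Octavian.matmul!(C, A, B)
    else
        LinearAlgebra.mul!(C, A, B)
    end
    return C
end

# Usage: Float16 inputs with a Float32 accumulator, as in the benchmarks above.
A = rand(Float16, 2048, 2048); B = rand(Float16, 2048, 2048)
C = zeros(Float32, 2048, 2048)
mixed_mul!(C, A, B)
```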