Changing the default LU? #357

Closed
ChrisRackauckas opened this issue Aug 8, 2023 · 74 comments

Comments

@ChrisRackauckas
Member

ChrisRackauckas commented Aug 8, 2023

This is a thread for investigating changes to the LU defaults, based on benchmarks like #356.

(Note: there's a Mac-specific version 3 posts down)

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, MKL_jll
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), MKLLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0= rand(rng, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")

lubench.pdf
lubench

Judging by this, the justification for RecursiveFactorization.jl still looks very strong.

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 32 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 32

Needs examples on other systems.
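
For reading the GFLOPs axis, it may help to see what `luflop` actually totals. The sketch below re-expresses the same count with explicit loops and checks it against a closed form derived here (it is not stated in the thread). As written, `scaleflop` sums the row indices `k+1..m` rather than counting them, so for square matrices the total appears to come out as n^3 - n^2 + n, whereas the usual leading-order LU count is about 2n^3/3. Every algorithm is divided by the same count, so the relative rankings in the plots should be unaffected. The names `luflop_loop` and `luflop_closed` are hypothetical.

```julia
# Same counting as `luflop` in the script above, written with explicit loops.
function luflop_loop(m, n = m; innerflop = 2)
    total = 0
    for k in 1:min(m, n)
        invflop = 1
        scaleflop = sum((k + 1):m)   # sums the indices k+1..m, exactly as the script does
        updateflop = innerflop * length((k + 1):m) * length((k + 1):n)
        total += invflop + scaleflop + updateflop
    end
    return total
end

# Closed form for the square case with innerflop = 2 (derived here, an assumption to verify).
luflop_closed(n) = n^3 - n^2 + n

# Leading-order textbook LU flop count, for comparison only.
lu_textbook_flops(n) = 2 * n^3 / 3

# Ratio between the script's count and the textbook count at a given size.
count_ratio(n) = luflop_loop(n) / lu_textbook_flops(n)
```

For large `n`, `count_ratio(n)` approaches 1.5, so absolute GFLOPs from the script are best compared only against other runs of the same script.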

@ChrisRackauckas
Member Author

OpenBLAS looks drunk.

@oscardssmith
Contributor

OpenBLAS is just multithreading at sizes that are way too small. RFLU looks like a good option. Is the (pre)compile time impact reasonable?

@ChrisRackauckas
Member Author

ChrisRackauckas commented Aug 8, 2023

A Mac version:

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), AppleAccelerateLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0= rand(rng, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")

lubench.pdf
lubench

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores

@oscardssmith
Contributor

It's interesting that RFLU does poorly on Mac. Does it not know about the lower vectorization width?

@ViralBShah
Contributor

ViralBShah commented Aug 8, 2023

Ideally you should have a tune or plan API, which can do it on a user's system. Applications that care about it can opt into a tuning run and save these preferences. If something like a switch-able BLAS package gets done, this sort of tuning can be even easier.
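
A minimal sketch of what such an opt-in tuning run could look like, using only the standard library. This is not an existing LinearSolve API: `tune_lu`, the candidate names, and the preferences file location are all hypothetical, and `LinearAlgebra.generic_lufact!` is an unexported function used here only as a second candidate.

```julia
using LinearAlgebra, TOML

# Hypothetical helper: time each candidate factorization on an n×n matrix
# and return the name of the fastest one.
function tune_lu(candidates, n::Integer; reps::Integer = 5)
    A = rand(n, n)
    best_name, best_time = "", Inf
    for (name, f) in candidates
        f(copy(A))                                        # warm-up, excludes compilation
        t = minimum(@elapsed(f(copy(A))) for _ in 1:reps) # best of `reps` runs
        if t < best_time
            best_name, best_time = name, t
        end
    end
    return best_name
end

# Two stdlib candidates; `generic_lufact!` is unexported, so treat its use as an assumption.
candidates = Dict(
    "lapack"  => A -> lu!(A),
    "generic" => A -> LinearAlgebra.generic_lufact!(A),
)

# One tuning run per size, saved so later sessions can skip the benchmark.
prefs = Dict("n$n" => tune_lu(candidates, n) for n in (16, 128))
prefs_file = joinpath(mktempdir(), "lu_prefs.toml")       # hypothetical location
open(io -> TOML.print(io, prefs), prefs_file, "w")
```

A later session could read the file back with `TOML.parsefile(prefs_file)` and pick the recorded winner for the nearest size.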

@vpuri3
Member

vpuri3 commented Aug 8, 2023

Running the Mac-specific version from the comment above:

 julia +beta --startup-file=no --proj
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-beta1 (2023-07-25)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(tmp) pkg> st
Status `~/.julia/dev/tmp/Project.toml`
  [13e28ba4] AppleAccelerate v0.4.0
  [6e4b80f9] BenchmarkTools v1.3.2
  [7ed4a6bd] LinearSolve v2.5.0
  [dde4c033] Metal v0.5.0
  [91a5bcdd] Plots v1.38.17
  [3d5dd08c] VectorizationBase v0.21.64

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 8
  JULIA_DEPOT_PATH = /Users/vp/.julia
  JULIA_PKG_DEVDIR = /Users/vp/.julia/dev

[lubench.pdf](https://github.com/SciML/LinearSolve.jl/files/12292316/lubench.pdf)
lubench

@oscardssmith
Contributor

oscardssmith commented Aug 8, 2023

Julia Version 1.11.0-DEV.235
Commit 9f9e989f24 (2023-08-06 04:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4,1

image
Looks like RFLU is really good for small sizes but isn't doing that well once all your threads have non-trivial sized problems.

@ChrisRackauckas
Member Author

I wonder if that's a trend of Intel vs AMD. Need more data.

@nilshg

nilshg commented Aug 8, 2023

Here is:

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

image

Might make sense to remove the two lowest performing methods here to speed up benchmarks given they probably won't be chosen?

@ChrisRackauckas
Member Author

Yeah, maybe, though it doesn't change the plot much and it confirms at what point the BLAS matters.

@oscardssmith
Contributor

> I wonder if that's a trend of Intel vs AMD. Need more data.

I'm pretty sure it's the number of threads. OpenBLAS is generally pretty bad at ramping up the number of threads as size increases and often just goes straight from 1 to full multithreading. As such, on CPUs with lots of cores it performs incredibly badly in the region where it should be using 2-4 cores and is instead using 16.
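
One way to check this claim on a given machine is to time the same factorization at several BLAS thread counts for a fixed size. The sketch below uses only `LinearAlgebra`; `blas_thread_sweep` is a hypothetical helper, and the thread counts and sizes are placeholders to adjust.

```julia
using LinearAlgebra

# Time `lu` on an n×n matrix for each BLAS thread count.
# Returns a vector of (threads => best seconds) pairs.
function blas_thread_sweep(n::Integer, counts = (1, 2, 4); reps::Integer = 5)
    A = rand(n, n)
    old = BLAS.get_num_threads()
    results = Pair{Int,Float64}[]
    try
        for t in counts
            BLAS.set_num_threads(t)
            lu(A)                                   # warm-up run
            push!(results, t => minimum(@elapsed(lu(A)) for _ in 1:reps))
        end
    finally
        BLAS.set_num_threads(old)                   # restore the caller's setting
    end
    return results
end
```

Running `blas_thread_sweep(200, (1, 2, 4, 16))` and comparing the timings would show whether a mid-sized problem is better served by a few threads than by all of them.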

@ChrisRackauckas
Member Author

But that doesn't explain what was actually mentioned, which doesn't involve OpenBLAS at all: it's RecursiveFactorization vs MKL and where the cutoff is. From the looks so far, I'd say:

  1. On Mac, default to always using Accelerate
  2. On Intel, default to RecursiveFactorization, switching to MKL at a cutoff of n=150
  3. On AMD, default to RecursiveFactorization with no cutoff

and never use OpenBLAS.
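
The three proposals above can be sketched as a small selection rule. This is only an illustration of the proposal, not LinearSolve's actual default logic: `proposed_default_lu` is a hypothetical name, the return values are plain symbols rather than algorithm types, and treating n=150 itself as still on the RecursiveFactorization side is an assumption.

```julia
# Hypothetical selection rule mirroring the three proposals above.
function proposed_default_lu(n::Integer; apple::Bool = Sys.isapple(),
                             vendor::Symbol = :intel, intel_cutoff::Integer = 150)
    apple && return :AppleAccelerate                   # 1. Mac: always Accelerate
    vendor === :amd && return :RecursiveFactorization  # 3. AMD: no cutoff
    # 2. Intel: RecursiveFactorization up to the cutoff, MKL beyond it
    return n <= intel_cutoff ? :RecursiveFactorization : :MKL
end
```

Keeping the cutoff as a keyword argument makes it easy to try other values against the benchmark data.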

@ejmeitz

ejmeitz commented Aug 8, 2023

When I tried to run this the code errored on the 348 x 348 problem size (twice). Not sure what's up with that since I made a clean tmp environment. Could just be me but thought I'd share.

MethodError: no method matching add_bisecting_if_branches!(::Expr, ::Int64, ::Int64, ::Int64, ::Bool)
The applicable method may be too new: running in world age 35110, while current world is 37210.
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores

@DaniGlez

DaniGlez commented Aug 8, 2023

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7800X3D 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

image

@oscardssmith
Contributor

That's interesting. There are now two of us that have a fairly noticeable performance bump in the 350 to 380 region for RFLU. Any ideas as to what could be causing it? It looks like we aren't using more threads soon enough, maybe?

@ejmeitz

ejmeitz commented Aug 8, 2023

Considering mine crashes in RFLU at 350, something is definitely going on there.

@ChrisRackauckas
Member Author

@chriselrod

@chriselrod
Contributor

Multithreaded and singlethreaded are going to look very different.

MKL does very well multithreaded, while OpenBLAS is awful (as has been discussed).

RF does not scale well with multiple threads. That is a known issue, but Yingbo and I never had time to address it.

@chriselrod
Contributor

@ejmeitz, you aren't doing something weird like not using precompiled modules, are you?

@ejmeitz

ejmeitz commented Aug 8, 2023

When I added the packages it spammed the message below. I started from a clean env so I just kind of ignored the messages, but that is probably the issue. I did run precompile after adding all the packages just to be sure though.

┌ Warning: Module VectorizationBase with build ID fafbfcfd-2196-d9ff-0000-9410f7322d5a is missing from the cache.
│ This may mean VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f] does not support precompilation but is imported by a module that does.

@chriselrod
Contributor

chriselrod commented Aug 8, 2023

Nuke your precompile cache and try again.

$ rm -rf ~/.julia/compiled/

@ejmeitz

ejmeitz commented Aug 8, 2023

That fixed it, thanks! RFLU seems better on my machine for longer than some of the others.

lubench

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores

@chriselrod
Contributor

chriselrod commented Aug 8, 2023

lubench

julia> versioninfo()
Julia Version 1.11.0-DEV.238
Commit 8b8da91ad7 (2023-08-08 01:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores

@ChrisRackauckas
Member Author

@ejmeitz your GFLOPs are generally very low, getting stomped by much cheaper CPUs. Maybe it's the clock rate?

@chriselrod
Contributor

chriselrod commented Aug 8, 2023

Likely, because of the poor multithreaded scaling.

The big winner here is the Ryzen 7800X3D. It probably wants more multithreading to kick in at a much smaller size.

@ejmeitz

ejmeitz commented Aug 8, 2023

I noticed that too. I also thought it would be the clocks (max turbo is 4 GHz) but it still felt low to me. Probably a combo of poor multithread scaling and the clocks being low. I can run it on a 128-core AMD CPU if you'd be curious to see that data.

@ChrisRackauckas
Member Author

Definitely curious now.

@Leticia-maria

I have run (sorry for the delay, had to install/upgrade some dependencies):

Julia Version 1.10.0-alpha1
Commit f8ad15f7b16 (2023-07-06 10:36 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 5 on 6 virtual cores
Environment:
  LD_LIBRARY_PATH = 
  JULIA_NUM_THREADS = 4
Screenshot 2023-08-08 at 10 59 55 AM

@mastrof

mastrof commented Aug 8, 2023

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores

lubench

This is instead what I get running julia with -t 1
lubench_1

@albheim

albheim commented Aug 9, 2023

Using single julia thread

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

lubench-1

Using julia -t auto

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 11 on 8 virtual cores

lubench

@lgoettgens

single thread

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

lubench1

julia -t auto

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores

lubench

@ViralBShah
Contributor

@ChrisRackauckas Where can I find out what each of the factorizations does?

@chriselrod
Contributor

I think we may want

@static if Sys.ARCH === :x86_64
const libMKL = MKL_jll.libmkl_rt # more convenient name
function mkl_set_num_threads(N::Integer)
    ccall((:MKL_Set_Num_Threads,libMKL), Cvoid, (Int32,), N % Int32)
end
mkl_set_num_threads(Threads.nthreads())
end

for single threaded comparisons.

Single threaded performance where we actually restrict the solves to a single thread would likely be useful for ensemble solves.
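
A minimal sketch of that ensemble scenario, assuming the parallelism comes from Julia threads and each individual solve is pinned to one BLAS thread. It uses only the standard library's `lu`; `ensemble_solve` is a hypothetical helper, not part of LinearSolve.

```julia
using LinearAlgebra

# Solve many independent small systems, one per loop iteration,
# with BLAS restricted to a single thread for the duration.
function ensemble_solve(As::Vector{Matrix{Float64}}, bs::Vector{Vector{Float64}})
    length(As) == length(bs) || throw(ArgumentError("need one right-hand side per matrix"))
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(1)          # avoid oversubscription: Julia threads do the parallelism
    xs = Vector{Vector{Float64}}(undef, length(As))
    try
        Threads.@threads for i in eachindex(As)
            xs[i] = lu(As[i]) \ bs[i]
        end
    finally
        BLAS.set_num_threads(old)    # restore the caller's setting
    end
    return xs
end
```

In this setting the single-threaded curves are the relevant ones, since each solve only ever sees one core.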

@ChrisRackauckas
Member Author

@ViralBShah just the standard docs https://docs.sciml.ai/LinearSolve/stable/solvers/solvers/. Accelerate isn't in there yet though. I'll PR with Metal.jl as an option too when I land.

@ViralBShah
Contributor

Note that the way to get/set MKL threads should be through the domain API: JuliaLinearAlgebra/libblastrampoline#74

LBT doesn't do that but should be easy to do it here.

@chriselrod
Contributor

chriselrod commented Aug 9, 2023

> Note that the way to get/set MKL threads should be through the domain API: JuliaLinearAlgebra/libblastrampoline#74
>
> LBT doesn't do that but should be easy to do it here.

Okay, so we should use something like

using MKL_jll
mkl_blas_set_num_threads(numthreads::Int) =
           Bool(ccall((:MKL_Domain_Set_Num_Threads, MKL_jll.libmkl_rt),
           Cuint, (Cint,Cint), numthreads, 1))

Or, more elaborately

using MKL_jll
mkl_set_num_threads(numthreads::Int, domain::Cint = zero(Cint)) =
           Bool(ccall((:MKL_Domain_Set_Num_Threads, MKL_jll.libmkl_rt),
           Cuint, (Cint,Cint), numthreads, domain))


const MKL_DOMAIN_ALL = Cint(0)
const MKL_DOMAIN_BLAS = Cint(1)
const MKL_DOMAIN_FFT = Cint(2)
const MKL_DOMAIN_VML = Cint(3)
const MKL_DOMAIN_PARDISO = Cint(4)

mkl_set_num_blas_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_BLAS)
mkl_set_num_fft_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_FFT)
mkl_set_num_vml_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_VML)
mkl_set_num_pardiso_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_PARDISO)

@chriselrod
Contributor

chriselrod commented Aug 9, 2023

lubench_singlethread
Single threaded on the 10980XE.
Multithreaded: #357 (comment)

Note that RFLU did not benefit from multithreading until like 450x450. It was hurt below that.

@ChrisRackauckas
Member Author

ChrisRackauckas commented Aug 9, 2023

I set up Metal.jl:

#361

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, Metal
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [AppleAccelerateLUFactorization(), MetalLUFactorization()]
res = [Float32[] for i in 1:length(algs)]

ns = 200:600:15000
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, Float32, n, n)
    global b = rand(rng, Float32, n)
    global u0= rand(rng, Float32, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        GC.gc()
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("metal_large_lubench.png")
savefig("metal_large_lubench.pdf")

metallubench.pdf
metallubench

metal_large_lubench.pdf
metal_large_lubench

@ChrisRackauckas
Member Author

Can I get some results of folks doing CUDA offloading with the following script?

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, CUDA, MKL_jll
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

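# Note: the exported name is CudaOffloadFactorization; the originally posted
# CUDAOffloadFactorization was a capitalization typo (see the replies below).
algs = [MKLLUFactorization(), CudaOffloadFactorization()]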
res = [Float32[] for i in 1:length(algs)]

ns = 200:400:10000
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, Float32, n, n)
    global b = rand(rng, Float32, n)
    global u0= rand(rng, Float32, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("cudaoffloadlubench.png")
savefig("cudaoffloadlubench.pdf")

@joelandman

julia> algs = [MKLLUFactorization(), CUDAOffloadFactorization()]
ERROR: UndefVarError: CUDAOffloadFactorization not defined
Stacktrace:
[1] top-level scope
@ REPL[8]:1
[2] top-level scope
@ ~/.julia/packages/CUDA/tVtYo/src/initialization.jl:185

(@v1.9) pkg> st CUDA
Status ~/.julia/environments/v1.9/Project.toml
[052768ef] CUDA v4.4.0

Is this CUDA.jl from the repo tip?

@ChrisRackauckas
Member Author

It should load when you do using CUDA because it's an extension library.

@christiangnrd
Contributor

@joelandman @ChrisRackauckas There's a capitalization typo in the benchmark: it should be CudaOffloadFactorization.

@chriselrod
Contributor

[ Info: 200 × 200
ERROR: MethodError: no method matching getrf!(::Matrix{Float32}; ipiv::Vector{Int64}, info::Base.RefValue{Int64})

Closest candidates are:
  getrf!(::AbstractMatrix{<:Float64}; ipiv, info, check)
   @ LinearSolveMKLExt ~/.julia/packages/LinearSolve/Tcmzb/ext/LinearSolveMKLExt.jl:13

@carstenbauer

carstenbauer commented Aug 9, 2023

I can run the snippet on A100 and A40. However, I get

ERROR: LoadError: UndefVarError: `MKLLUFactorization` not defined
Stacktrace:
 [1] top-level scope
   @ /scratch/pc2-mitarbeiter/bauerc/playground/linearsolvetest/script.jl:21
 [2] include(fname::String)
   @ Base.MainInclude ./client.jl:478
 [3] top-level scope
   @ REPL[1]:1
in expression starting at /scratch/pc2-mitarbeiter/bauerc/playground/linearsolvetest/script.jl:21

Update: With LinearSolve#main I get the same error as @chriselrod above:

julia> include("script.jl")
[ Info: 200 × 200
ERROR: LoadError: MethodError: no method matching getrf!(::Matrix{Float32}; ipiv::Vector{Int64}, info::Base.RefValue{Int64})

Closest candidates are:
  getrf!(::AbstractMatrix{<:Float64}; ipiv, info, check)
   @ LinearSolveMKLExt /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:13

Stacktrace:
  [1] #solve!#2
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:45 [inlined]
  [2] solve!
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:39 [inlined]
  [3] #solve!#6
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:197 [inlined]
  [4] solve!
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:196 [inlined]
  [5] #solve#5
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:193 [inlined]
  [6] solve
[...]

@joelandman

I threw together a quick patch for LinearSolveMKLExt.jl to accommodate the Float32 version. @ChrisRackauckas please let me know if you want a PR or a patch for it. Results incoming (running now).

@ChrisRackauckas
Member Author

I put an MKL 32-bit patch into the MKL PR #361. I noticed that it's not using the MKL backsolve, so that could potentially make it a bit faster, but it shouldn't affect the CUDA cutoff point.

@joelandman

cudaoffloadlubench.pdf
cudaoffloadlubench

joe@zap:~ $ nvidia-smi
Wed Aug 9 14:02:47 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P8 1W / 80W | 6MiB / 6144MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1621 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+

@joelandman

Same zen2 laptop

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
Threads: 16 on 16 virtual cores
Environment:
LD_LIBRARY_PATH =
JULIA_HOME = /home/joe/local

@carstenbauer

carstenbauer commented Aug 9, 2023

Julia 1.9.2 (1 Julia thread)

A40 + Intel(R) Xeon(R) Gold 6148F:
cudaoffloadlubench_A40

A100 + AMD EPYC 7742:
cudaoffloadlubench_A100

@tylerjthomas9

RTX 3090 + 5950x

cudaoffloadlubench_3090

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores
NVIDIA-SMI 525.125.06   
Driver Version: 525.125.06   
CUDA Version: 12.0
NVIDIA GeForce RTX 3090 (420W)

A6000 ADA + EPYC 7713

cudaoffloadlubench_a6000

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 256 × AMD EPYC 7713 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 256 virtual cores
NVIDIA-SMI 535.86.05              
Driver Version: 535.86.05    
CUDA Version: 12.2 
NVIDIA RTX 6000 Ada (300W)

@zygi

zygi commented Aug 9, 2023

i9-13900K, RTX 4090

cudaoffloadlubench

Versioninfo:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
  Threads: 1 on 32 virtual cores

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.86.5

CUDA libraries:
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+535.86.5

Julia packages:
- CUDA: 4.4.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0

Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 18.567 GiB / 23.988 GiB available)

@christiangnrd
Contributor

All running julia 1.9.2

RTX 3060 + 3700X

cudaoffloadlubenchhome

M2 Max 30-core GPU

metal_large_lubench

@chriselrod
Contributor

I was hoping that these benchmarks would show that we should drop RecursiveFactorization from the defaults now that we have MKL, but they just don't show that.

Another 8% improvement in RF coming up:
JuliaLinearAlgebra/RecursiveFactorization.jl#84
The butterfly may help even more, when we get around to it.

@ChrisRackauckas
Member Author

It looks like GPU offloading doesn't make sense once things are using MKL. Cutoff is >1000

@chriselrod
Contributor

chriselrod commented Aug 11, 2023

cudaoffloadlubench
Accidentally forgot to do --startup-file=no.
By the looks of it, I'd need a 4090 to beat this CPU.

@mikeingold

For what it's worth, a similar system to the given example but a generation older and apparently running fewer threads (16 vs 32).

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 16 on 32 virtual cores

Environment:
  JULIA_NUM_THREADS = 16

lubench

@ChrisRackauckas
Member Author

Thanks everyone, the defaults now take these results into account. MKL is now the default in many scenarios (along with AppleAccelerate on Macs), with a switch at n=200, which seems to be a roughly optimal spot to go from RFLU to MKL.
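
For anyone wanting to check the switch point against their own run, a rough sketch of estimating a crossover from two benchmark curves follows. `crossover` is a hypothetical helper, not something from LinearSolve, and the rule it applies (the first size after which the second curve never falls behind again) is just one reasonable definition.

```julia
# First size after which curve `b` stays at or above curve `a`.
# Returns `nothing` if `a` is still ahead at the largest size.
function crossover(ns, a, b)
    length(ns) == length(a) == length(b) || throw(ArgumentError("length mismatch"))
    idx = findlast(i -> b[i] < a[i], eachindex(ns))   # last point where `a` still wins
    idx === nothing && return first(ns)               # `b` wins everywhere
    idx == lastindex(ns) && return nothing            # `a` wins at the largest size
    return ns[idx + 1]
end
```

With the `ns` and `res` produced by the first script in this thread, `crossover(ns, res[3], res[4])` would give the size at which MKL overtakes RFLU on that machine, since `algs[3]` is `RFLUFactorization()` and `algs[4]` is `MKLLUFactorization()` there.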
