macOS Accelerate Optimizations and Julia versus Python Benchmarks on Apple Silicon #50806
Comments
Interesting! How does the Apple Silicon binary from julialang.org/downloads compare?
I now include julialang.org results above, which are comparable to the performance of the MacPorts binaries. I also updated the Julia benchmarks to use more accurate BenchmarkTools.jl timing with @belapsed.
Did you try AppleAccelerate.jl?
Thank you! That's what I was looking for:

Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)
I'm going to reopen this issue as a feature request for compiled Julia on Apple Silicon to use the Accelerate framework by default. I cannot imagine a scenario where one wouldn't want this; it's a 3–4x performance improvement.
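For reference, a minimal sketch of how a user can opt in today via the AppleAccelerate.jl package, which as I understand it points libblastrampoline at Accelerate when loaded; the Float32 element type and the 4096 size are just illustrative choices:

using LinearAlgebra
using AppleAccelerate            # on load, switches the libblastrampoline backend to Accelerate

# Confirm which BLAS/LAPACK libraries are active; Accelerate should now be
# listed instead of libopenblas64_ (exact library names vary by Julia version).
BLAS.get_config()

# Rough one-off matrix-multiplication throughput check, in GFLOPS.
n = 4096
A = randn(Float32, n, n)
B = randn(Float32, n, n)
A * B                            # warm up
t = @elapsed A * B
println("matmul GFLOPS: ", 2 * n^3 / t / 1e9)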
How large were these benchmarks? Were matrix factorizations part of them? The discussion here seemed to conclude that AA was indeed much better for matrix-matrix multiply but often worse on LU. Nor was it easy to see the O(N^3) cost as the dimension increased. I'd stick with OpenBLAS.
I haven't seen comprehensive benchmarks published, but it's easy to run a few cases. I observe a 3–4x speedup for BLAS and comparable performance on the standard matrix decompositions, with the exception of large, dense SVDs. Based on my own workloads and observations, my preference is for an Accelerate framework default.

Julia Benchmark Code

using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate   # uncomment to route BLAS/LAPACK through Accelerate

j_type = Float32

# QR decomposition: ~2*n^2*(n - n/3) flops for an n-by-n matrix
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed qr($a)
    @printf("| %d\t| %.1f\t|\n", sz, 2*sz^2*(sz - sz/3) / ts / 1e9)
end

# SVD: ~(2 + 11)*n^3 flop estimate for an n-by-n matrix
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed svd($a)
    @printf("| %d\t| %.1f\t|\n", sz, sz^3*(2 + 11) / ts / 1e9)
end

Julia Matrix Decompositions (GFLOPS)
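For the LU case raised above, a sketch in the same style; the sizes mirror the other benchmarks, and the ~(2/3)n^3 flop count is the usual estimate for an n-by-n LU rather than anything taken from the tables here:

using BenchmarkTools
using LinearAlgebra
using Printf

# LU factorization: ~(2/3)*n^3 flops for an n-by-n matrix
for sz in [2048, 4096, 8192, 16384]
    a = randn(Float32, sz, sz)
    ts = @belapsed lu($a)
    @printf("| %d\t| %.1f\t|\n", sz, (2/3)*sz^3 / ts / 1e9)
end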
[Solution to this issue: Use AppleAccelerate.jl.]
I'm a maintainer for the MacPorts julia port and am trying to make sure we have the best build formula for Apple Silicon. I see a ≈2.5–3× difference in CPU performance between Julia and Python on basic dense matrix operations, which suggests that we may not be using the Accelerate framework appropriately. (Metal.jl benchmarks are within a few TFLOPS of PyTorch+MPS, so at least that part looks okay.)
How does one ensure that Julia is compiled to use the macOS Accelerate optimizations? We follow the build instructions provided by Julia itself, so this performance issue may originate in Julia.
https://github.com/macports/macports-ports/blob/0f6d1c42dfc3bda20673e34529c51ab34a4f3da4/lang/julia/Portfile#L57-L58
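A quick runtime check, independent of how the binary was built, is to ask libblastrampoline which backend it is forwarding to; a minimal sketch, assuming Julia 1.7 or later:

using LinearAlgebra
# Lists the BLAS/LAPACK libraries libblastrampoline currently forwards to
# (libopenblas64_ for the stock binaries; Accelerate would appear here
# instead if it were the active backend).
BLAS.get_config()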
On a Mac Studio M2 Ultra, I observe that NumPy with Accelerate achieves about 2.5–3 TFLOPS for dense ops, while Julia achieves 1–1.4 TFLOPS with both the MacPorts and julialang.org binaries.
Here's some basic benchmarking code and results:
Benchmarks on Mac Studio M2 Ultra
Julia Benchmark Code
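The collapsed code block did not survive the copy here; a minimal sketch of a matrix-multiplication benchmark in the same style as the decomposition benchmark elsewhere in this thread (the sizes and the use of @belapsed are assumptions):

using BenchmarkTools
using LinearAlgebra
using Printf

j_type = Float32

# Dense matrix multiplication: 2*n^3 flops for an n-by-n multiply
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    b = randn(j_type, sz, sz)
    ts = @belapsed $a * $b
    @printf("| %d\t| %.1f\t|\n", sz, 2*sz^3 / ts / 1e9)
end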
Julia (MacPorts) Matrix Multiplication (GFLOPS)
Julia (julialang.org) Matrix Multiplication (GFLOPS)
Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)
Python Benchmark Code
Python Matrix Multiplication (GFLOPS)