
macOS Accelerate Optimizations and Julia versus Python Benchmarks on Apple Silicon #50806

Open
essandess opened this issue Aug 5, 2023 · 7 comments
Labels: performance, system:apple silicon

Comments

essandess commented Aug 5, 2023

[Solution to this issue: Use AppleAccelerate.jl.]

I'm a maintainer for the MacPorts julia port and am trying to make sure we have the best build formula for Apple Silicon.

I see a ≈2.5–3× difference in the CPU performance of Julia versus Python on basic dense matrix operations, which suggests that we may not be using the Accelerate framework appropriately. (Metal.jl benchmarks are comparable within a few TFLOPS to PyTorch+MPS, so at least that part looks okay.)

How does one ensure that Julia is compiled to use macOS Accelerate optimizations? We follow the build instructions provided by Julia itself, so this performance issue may originate in Julia's default build configuration:

https://github.com/macports/macports-ports/blob/0f6d1c42dfc3bda20673e34529c51ab34a4f3da4/lang/julia/Portfile#L57-L58
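One quick diagnostic (not part of the Portfile above; `BLAS.get_config()` is the standard libblastrampoline introspection call in Julia ≥ 1.7) is to ask a running binary which BLAS it actually loaded:

```julia
using LinearAlgebra

# A stock Julia build reports libopenblas64_ here; a session backed by
# Accelerate would list Accelerate instead.
BLAS.get_config()

# BLAS threading also affects dense-matmul throughput.
BLAS.get_num_threads()
```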

On a Mac Studio M2 Ultra, I observe that Numpy with Accelerate achieves about 2.5–3 TFLOPS for dense ops, but Julia achieves 1–1.4 TFLOPS, using both MacPorts and julialang.org binaries.

Here's some basic benchmarking code and results:

Benchmarks on Mac Studio M2 Ultra

Julia Benchmark Code

```julia
# using AppleAccelerate   # uncomment to use the Accelerate BLAS backend
using BenchmarkTools
using Metal
using Printf

j_type = Float32

for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    b = randn(j_type, sz, sz)
    a_mtl = MtlArray(a)
    b_mtl = MtlArray(b)
    ts = @belapsed $a * $b
    # indexing the result forces GPU synchronization before timing stops
    ts_mtl = @belapsed ($a_mtl * $b_mtl)[1, 1]
    # an n×n matmul costs n^2 * (2n - 1) flops
    @printf("| %d\t| %.1f\t| %.1f\t|\n", sz, sz^2*(2*sz - 1) / ts / 1e9, sz^2*(2*sz - 1) / ts_mtl / 1e9)
end
```
Julia (MacPorts) Matrix Multiplication (GFLOPS)

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 1068.3 | 11071.3 |
| 4096 | 1168.5 | 16652.6 |
| 8192 | 1350.6 | 18281.8 |
| 16384 | 1353.1 | 17988.2 |

Julia (julialang.org) Matrix Multiplication (GFLOPS)

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 962.1 | 10760.9 |
| 4096 | 1162.4 | 16134.1 |
| 8192 | 1348.3 | 17379.8 |
| 16384 | 1322.1 | 17831.5 |

Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |
Python Benchmark Code

```python
# Run under IPython/Jupyter: the %timeit magics below are not plain Python.
import numpy as np
import torch

mpsDevice = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

rg = np.random.default_rng(1)
np_type = np.float32
torch_type = torch.float32

print("Python Matrix Multiplication (GFLOPS)\n")
print("| Size\t| Numpy+Accelerate \t| PyTorch+MPS |")
print("| -----:\t| -----:\t| -----: |")
for size in (2048, 4096, 8192, 16384):
    a_np = rg.random((size, size), dtype=np_type)
    b_np = rg.random((size, size), dtype=np_type)
    a_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    b_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    ts_np = %timeit -n1 -r5 -q -o a_np @ b_np
    # .cpu() forces synchronization with the MPS device before timing stops
    ts_torch = %timeit -n1 -r5 -q -o (a_torch @ b_torch)[0, 0].cpu()
    print("| {:d}\t| {:.1f}\t| {:.1f} |".format(size, size**2*(2*size - 1) / np.median(ts_np.all_runs) / 1e9, size**2*(2*size - 1) / np.median(ts_torch.all_runs) / 1e9))
```
Python Matrix Multiplication (GFLOPS)

| Size | Numpy+Accelerate | PyTorch+MPS |
| -----: | -----: | -----: |
| 2048 | 2134.5 | 10679.2 |
| 4096 | 2626.8 | 20309.6 |
| 8192 | 2845.0 | 20988.9 |
| 16384 | 3015.1 | 19577.4 |
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.5.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores
```
brenhinkeller added the performance and system:apple silicon labels on Aug 5, 2023
brenhinkeller (Contributor) commented:

Interesting! How does the Apple Silicon binary from julialang.org/downloads compare?

brenhinkeller added the building label on Aug 5, 2023
essandess (Author) commented:

I now include julialang.org results above, which are comparable to the performance of MacPorts binaries. I also updated the Julia benchmarks to use a more accurate BenchmarkTools.jl timing with @belapsed.
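(For context, `@belapsed` returns the minimum elapsed time in seconds over BenchmarkTools' samples, with `$` interpolation to avoid timing global-variable access. A toy example, not from the benchmarks above:

```julia
using BenchmarkTools

x = randn(Float32, 1024, 1024)
# minimum elapsed seconds across many samples; $ interpolates the global
t = @belapsed $x * $x
```
)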

giordano (Contributor) commented Aug 6, 2023

Did you try AppleAccelerate.jl, which doesn't require rebuilding Julia?
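For reference, a minimal sketch of what that looks like (assuming a recent AppleAccelerate.jl, which swaps the BLAS/LAPACK backend at runtime through libblastrampoline, no rebuild required):

```julia
using LinearAlgebra
using AppleAccelerate   # forwards BLAS/LAPACK calls to Accelerate

# Verify the swap: Accelerate should now appear ahead of OpenBLAS.
BLAS.get_config()

a = randn(Float32, 4096, 4096)
b = randn(Float32, 4096, 4096)
@time a * b   # now dispatched to Accelerate's sgemm
```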

essandess (Author) commented:

Thank you! That's what I was looking for:

Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |

essandess (Author) commented:

I'm going to reopen this issue as a feature request for compiled Julia on Apple Silicon to use the Accelerate framework by default. I cannot imagine a scenario where one wouldn't want this: it's a 3–4× performance improvement.
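Until that happens, one per-user workaround (an unofficial sketch, not a supported build option) is to load the package from `~/.julia/config/startup.jl` so every session defaults to Accelerate:

```julia
# ~/.julia/config/startup.jl
# `using` cannot appear directly inside try at top level, hence @eval.
try
    @eval using AppleAccelerate
catch err
    @warn "AppleAccelerate.jl not available; staying on OpenBLAS" exception = err
end
```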

essandess reopened this on Aug 9, 2023
ctkelley commented Aug 9, 2023

How large were these benchmarks? Were matrix factorizations part of them? The discussion here seemed to conclude that Accelerate was indeed much better for matrix-matrix multiply but often worse on LU. Nor was it easy to see the expected O(N^3) cost scaling as the dimension increased.

I'd stick with OpenBLAS.

essandess (Author) commented:

I haven't seen comprehensive benchmarks published, but it's easy to run a few cases. I observe a 3–4× speedup for BLAS, and comparable performance on standard matrix decompositions, with the exception of large, dense SVDs.

My own preference, based on my workloads and observations, is an Accelerate framework default.

Julia Benchmark Code

```julia
using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate   # uncomment to benchmark the Accelerate backend

j_type = Float32

for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed qr($a)
    # QR of an n×n matrix costs ≈ 4n^3/3 flops
    @printf("| %d\t| %.1f\t|\n", sz, 2*sz^2*(sz - sz/3) / ts / 1e9)
end
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed svd($a)
    # full SVD flop estimate used here: ≈ 13n^3
    @printf("| %d\t| %.1f\t|\n", sz, sz^3*(2 + 11) / ts / 1e9)
end
```
Julia Matrix Decompositions (GFLOPS)

| Size | QR (OpenBLAS) | QR (AA) | SVD (OpenBLAS) | SVD (AA) |
| -----: | -----: | -----: | -----: | -----: |
| 2048 | 167.3 | 122.9 | 122.3 | 163.0 |
| 4096 | 226.2 | 189.3 | 134.7 | 98.7 |
| 8192 | 300.5 | 316.9 | 198.1 | 80.9 |
| 16384 | 359.2 | 370.8 | 255.8 | 81.6 |
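The code above covers QR and SVD but not the LU case raised earlier; a minimal sketch for that comparison (not from this thread; it uses the textbook ≈ 2n^3/3 flop count for LU) would be:

```julia
using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate   # uncomment to time the Accelerate backend

for sz in [2048, 4096, 8192, 16384]
    a = randn(Float32, sz, sz)
    ts = @belapsed lu($a)
    # LU of an n×n matrix costs ≈ 2n^3/3 flops
    @printf("| %d\t| %.1f\t|\n", sz, 2 * sz^3 / 3 / ts / 1e9)
end
```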

brenhinkeller removed the building label on Aug 10, 2023