
macOS Accelerate Optimizations and Julia versus Python Benchmarks on Apple Silicon #50806

Open
essandess opened this issue Aug 5, 2023 · 7 comments
Labels: performance, system:apple silicon

Comments

essandess commented Aug 5, 2023

[Solution to this issue: Use AppleAccelerate.jl.]

I'm a maintainer for the MacPorts julia port and am trying to make sure we have the best build formula for Apple Silicon.

I see a ≈2.5–3× difference in the CPU performance of Julia versus Python on basic dense matrix operations, which suggests that we may not be using the Accelerate framework appropriately. (Metal.jl benchmarks are comparable within a few TFLOPS to PyTorch+MPS, so at least that part looks okay.)

How does one ensure that Julia is compiled to use macOS Accelerate optimizations? We follow the build instructions provided by Julia itself, so this performance issue may originate in Julia's default build configuration:

https://github.com/macports/macports-ports/blob/0f6d1c42dfc3bda20673e34529c51ab34a4f3da4/lang/julia/Portfile#L57-L58
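One quick diagnostic (not part of the Portfile above; `BLAS.get_config()` is the standard libblastrampoline introspection call in Julia ≥ 1.7) is to ask a running binary which BLAS it actually loaded:

```julia
using LinearAlgebra

# A stock Julia build reports libopenblas64_ here; a session backed by
# Accelerate would list Accelerate instead.
BLAS.get_config()

# BLAS threading also affects dense-matmul throughput.
BLAS.get_num_threads()
```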

On a Mac Studio M2 Ultra, I observe that Numpy with Accelerate achieves about 2.5–3 TFLOPS for dense ops, but Julia achieves 1–1.4 TFLOPS, using both MacPorts and julialang.org binaries.

Here's some basic benchmarking code and results:

Benchmarks on Mac Studio M2 Ultra

Julia Benchmark Code

```julia
# using AppleAccelerate   # uncomment to use the Accelerate BLAS backend
using BenchmarkTools
using Metal
using Printf

j_type = Float32

for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    b = randn(j_type, sz, sz)
    a_mtl = MtlArray(a)
    b_mtl = MtlArray(b)
    ts = @belapsed $a * $b
    # indexing the result forces GPU synchronization before timing stops
    ts_mtl = @belapsed ($a_mtl * $b_mtl)[1, 1]
    # an n×n matmul costs n^2 * (2n - 1) flops
    @printf("| %d\t| %.1f\t| %.1f\t|\n", sz, sz^2*(2*sz - 1) / ts / 1e9, sz^2*(2*sz - 1) / ts_mtl / 1e9)
end
```
Julia (MacPorts) Matrix Multiplication (GFLOPS)

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 1068.3 | 11071.3 |
| 4096 | 1168.5 | 16652.6 |
| 8192 | 1350.6 | 18281.8 |
| 16384 | 1353.1 | 17988.2 |

Julia (julialang.org) Matrix Multiplication (GFLOPS)

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 962.1 | 10760.9 |
| 4096 | 1162.4 | 16134.1 |
| 8192 | 1348.3 | 17379.8 |
| 16384 | 1322.1 | 17831.5 |

Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |
Python Benchmark Code

```python
# Run under IPython/Jupyter: the %timeit magics below are not plain Python.
import numpy as np
import torch

mpsDevice = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

rg = np.random.default_rng(1)
np_type = np.float32
torch_type = torch.float32

print("Python Matrix Multiplication (GFLOPS)\n")
print("| Size\t| Numpy+Accelerate \t| PyTorch+MPS |")
print("| -----:\t| -----:\t| -----: |")
for size in (2048, 4096, 8192, 16384):
    a_np = rg.random((size, size), dtype=np_type)
    b_np = rg.random((size, size), dtype=np_type)
    a_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    b_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    ts_np = %timeit -n1 -r5 -q -o a_np @ b_np
    # .cpu() forces synchronization with the MPS device before timing stops
    ts_torch = %timeit -n1 -r5 -q -o (a_torch @ b_torch)[0, 0].cpu()
    print("| {:d}\t| {:.1f}\t| {:.1f} |".format(size, size**2*(2*size - 1) / np.median(ts_np.all_runs) / 1e9, size**2*(2*size - 1) / np.median(ts_torch.all_runs) / 1e9))
```
Python Matrix Multiplication (GFLOPS)

| Size | Numpy+Accelerate | PyTorch+MPS |
| -----: | -----: | -----: |
| 2048 | 2134.5 | 10679.2 |
| 4096 | 2626.8 | 20309.6 |
| 8192 | 2845.0 | 20988.9 |
| 16384 | 3015.1 | 19577.4 |
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.5.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores
```
brenhinkeller added the performance and system:apple silicon labels on Aug 5, 2023
brenhinkeller (Contributor) commented:

Interesting! How does the Apple Silicon binary from julialang.org/downloads compare?

brenhinkeller added the building label on Aug 5, 2023
essandess (Author) commented:

I now include julialang.org results above, which are comparable to the performance of MacPorts binaries. I also updated the Julia benchmarks to use a more accurate BenchmarkTools.jl timing with @belapsed.
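(For context, `@belapsed` returns the minimum elapsed time in seconds over BenchmarkTools' samples, with `$` interpolation to avoid timing global-variable access. A toy example, not from the benchmarks above:

```julia
using BenchmarkTools

x = randn(Float32, 1024, 1024)
# minimum elapsed seconds across many samples; $ interpolates the global
t = @belapsed $x * $x
```
)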

giordano (Contributor) commented Aug 6, 2023

Did you try AppleAccelerate.jl, which doesn't require rebuilding Julia?
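For reference, a minimal sketch of what that looks like (assuming a recent AppleAccelerate.jl, which swaps the BLAS/LAPACK backend at runtime through libblastrampoline, no rebuild required):

```julia
using LinearAlgebra
using AppleAccelerate   # forwards BLAS/LAPACK calls to Accelerate

# Verify the swap: Accelerate should now appear ahead of OpenBLAS.
BLAS.get_config()

a = randn(Float32, 4096, 4096)
b = randn(Float32, 4096, 4096)
@time a * b   # now dispatched to Accelerate's sgemm
```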

essandess (Author) commented:

Thank you! That's what I was looking for:

Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |

essandess (Author) commented:

I'm going to reopen this issue as a feature request for compiled Julia on Apple Silicon to use the Accelerate framework by default. I cannot imagine a scenario where one wouldn't want this: it's a 3–4× performance improvement.
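Until that happens, one per-user workaround (an unofficial sketch, not a supported build option) is to load the package from `~/.julia/config/startup.jl` so every session defaults to Accelerate:

```julia
# ~/.julia/config/startup.jl
# `using` cannot appear directly inside try at top level, hence @eval.
try
    @eval using AppleAccelerate
catch err
    @warn "AppleAccelerate.jl not available; staying on OpenBLAS" exception = err
end
```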

essandess reopened this on Aug 9, 2023
ctkelley commented Aug 9, 2023

How large were these benchmarks? Were matrix factorizations part of them? The discussion here seemed to conclude that Accelerate was indeed much better for matrix-matrix multiply but often worse on LU. Nor was it easy to see the expected O(N^3) cost scaling as the dimension increased.

I'd stick with OpenBLAS.

essandess (Author) commented:

I haven't seen comprehensive benchmarks published, but it's easy to run a few cases. I observe a 3–4× speedup for BLAS, and comparable performance on standard matrix decompositions, with the exception of large, dense SVDs.

My own preference, based on my workloads and observations, is an Accelerate framework default.

Julia Benchmark Code

```julia
using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate   # uncomment to benchmark the Accelerate backend

j_type = Float32

for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed qr($a)
    # QR of an n×n matrix costs ≈ 4n^3/3 flops
    @printf("| %d\t| %.1f\t|\n", sz, 2*sz^2*(sz - sz/3) / ts / 1e9)
end
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed svd($a)
    # full SVD flop estimate used here: ≈ 13n^3
    @printf("| %d\t| %.1f\t|\n", sz, sz^3*(2 + 11) / ts / 1e9)
end
```
Julia Matrix Decompositions (GFLOPS)

| Size | QR (OpenBLAS) | QR (AA) | SVD (OpenBLAS) | SVD (AA) |
| -----: | -----: | -----: | -----: | -----: |
| 2048 | 167.3 | 122.9 | 122.3 | 163.0 |
| 4096 | 226.2 | 189.3 | 134.7 | 98.7 |
| 8192 | 300.5 | 316.9 | 198.1 | 80.9 |
| 16384 | 359.2 | 370.8 | 255.8 | 81.6 |
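The code above covers QR and SVD but not the LU case raised earlier; a minimal sketch for that comparison (not from this thread; it uses the textbook ≈ 2n^3/3 flop count for LU) would be:

```julia
using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate   # uncomment to time the Accelerate backend

for sz in [2048, 4096, 8192, 16384]
    a = randn(Float32, sz, sz)
    ts = @belapsed lu($a)
    # LU of an n×n matrix costs ≈ 2n^3/3 flops
    @printf("| %d\t| %.1f\t|\n", sz, 2 * sz^3 / 3 / ts / 1e9)
end
```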

brenhinkeller removed the building label on Aug 10, 2023