
BLAS support for M1 ARM64 via Apple's Accelerate #869

Closed
domschl opened this issue Sep 19, 2021 · 96 comments
Labels
needs decision: A decision on this change is needed
system:apple silicon: Affects Apple Silicon only (Darwin/ARM64), e.g. M1 and other M-series chips

Comments

@domschl

domschl commented Sep 19, 2021

The default BLAS Julia uses is OpenBLAS. Apple's M1 has proprietary dedicated matrix hardware that is only accessible via Apple's Accelerate BLAS implementation. That proprietary interface can provide 2x to 4x speedups for some linear algebra use cases (see https://discourse.julialang.org/t/does-mac-m1-in-multithreads-is-slower-that-in-single-thread/61114/12?u=kristoffer.carlsson for some benchmarks and discussion).

Since Julia 1.7 there's a BLAS multiplexer (libblastrampoline, LBT).

So in theory it should be possible to extend this so that, depending on the platform, either OpenBLAS or another BLAS implementation is used transparently by default.

This issue discusses what needs to be done to make Accelerate's access to the M1's hardware acceleration available by default in Julia.
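For illustration, here is a minimal sketch of what a backend swap looks like from the Julia side via libblastrampoline; the path below is purely a placeholder (in practice a package such as MKL.jl performs the forwarding in its __init__), so the call is shown commented out.

using LinearAlgebra

# Show which backend(s) libblastrampoline is currently forwarding to (OpenBLAS by default).
BLAS.get_config()

# Forward all BLAS/LAPACK calls to another library at runtime.
# The path is a placeholder; a real package supplies the correct one for the platform.
# BLAS.lbt_forward("/path/to/some/blas.dylib"; clear = true)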

@ViralBShah
Member

The plan was to have LBT as a way to pick a different BLAS than the default OpenBLAS for now. That requires you to load a package every time you start Julia to change the default. Eventually, once the Preferences mechanism becomes standard, we want to use that so that users can pick a different BLAS by default.

I don't think we want to depend on the Apple-provided BLAS by default for the M1 for now.

@ViralBShah ViralBShah added needs decision A decision on this change is needed system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips labels Sep 19, 2021
@domschl
Author

domschl commented Sep 20, 2021

Some (anecdotal) benchmark scenarios that might illustrate why Accelerate makes sense (at least for gemm-type operations). I've written a machine learning framework and benchmark suite, syncognite/bench, in C++ using Eigen3 (so neither Julia nor OpenBLAS), but it might serve as a hint for what kind of performance improvements are possible.

The following table compares a benchmark run with native Eigen3 on ARM64 against a second run using Apple's Accelerate within Eigen (via the USE_SYSTEM_BLAS variable). The percentage columns show the speed-up for the forward (fw) and backward (bw) passes of neural network training when using Accelerate, e.g.:

  • Affine layers (basically a matrix multiplication): 480-600% faster
  • Affine layer with ReLU: 420-500% faster
  • LSTM recurrent networks: 200-300% faster

All that with very low energy usage!

Layer             fw(ms) fw/N(ms)   %  bw(ms) bw/N(ms)   %
OneHot             1.068  0.0107  0.1   0.000  0.0000 66.6
Affine             0.464  0.0050 620.1  1.210  0.0130 485.4
AffineRelu         0.538  0.0056 506.4  1.294  0.0135 424.4
Relu               0.023  0.0002  5.8   0.423  0.0042 -1.8
Nonlin-Relu        0.064  0.0006 -1.9   0.067  0.0007  0.7
Nonlin-Sgmd        0.126  0.0012 -1.2   0.067  0.0007 -1.8
Nonlin-Tanh        0.099  0.0010 -0.0   0.070  0.0007 -0.5
Nonlin-Selu        0.746  0.0075 -0.1   0.649  0.0065 -0.4
Nonlin-Resilu      0.202  0.0020  0.2   0.243  0.0024 -0.6
Dropout            0.713  0.0072 -0.8   0.030  0.0003  1.5
BatchNorm          0.159  0.0016 -3.1   0.217  0.0022 -4.6
SpatialBtchNrm     2.610  0.0260 -1.0   3.936  0.0393 -0.2
Convolution       54.388  0.5470 20.3  58.559  0.5869 45.1
Pooling            6.421  0.0643 -0.0  22.589  0.2262  1.7
LSTM              27.616  0.2778 299.8 78.172  0.7677 198.0
RNN               16.929  0.1660 110.4 33.648  0.3297 114.6
TemporalAff        5.704  0.0569 105.7  5.545  0.0554 209.3
WordEmbed          5.379  0.0541 129.1  4.227  0.0419 116.0
SVM                0.229  0.0023  0.5   0.225  0.0022  0.9
Softmax            0.228  0.0023  0.1   0.022  0.0002  1.5
TemporalSM         4.356  0.0434 -0.8   1.401  0.0140  0.1
TwoLayerNet        1.776  0.0177 300.5  2.555  0.0255 425.1

@DilumAluthge
Member

DilumAluthge commented Sep 20, 2021

I think @chriselrod has played around with this a little bit.

@ViralBShah
Member

The first thing is to create an MKL.jl-like package for Apple Accelerate. We already have some support in LBT for Accelerate, so this should be fairly quick.

@chriselrod
Contributor

A difficulty is that Accelerate uses the 32-bit (LP64) API, which is why AppleAccelerateLinAlgWrapper manually defines the methods it uses (and is based on Elliot's code).

(Also, AppleAccelerateLinAlgWrapper has a deliberately cumbersome name to avoid stealing/squatting on potentially valuable names for a future such package that supersedes it.)
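To make the 32-bit issue concrete, here is a rough sketch (not AppleAccelerateLinAlgWrapper's actual code) of what such a manual wrapper has to do: call Accelerate's classic dgemm_ with Int32 dimensions, whereas Julia's default ILP64 BLAS expects Int64. The function name is made up for illustration.

using Libdl

# Classic (LP64) Accelerate BLAS; this is the standard macOS framework location.
const libacc = "/System/Library/Frameworks/Accelerate.framework/Accelerate"

# C := A*B using Accelerate's dgemm_ with 32-bit integer arguments.
function acc_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A); n = size(B, 2)
    ccall((:dgemm_, libacc), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                  # transA, transB
           Ref{Int32}, Ref{Int32}, Ref{Int32},      # m, n, k -- note Int32, not Int64
           Ref{Float64}, Ptr{Float64}, Ref{Int32},  # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}), # beta, C, ldc
          UInt8('N'), UInt8('N'), Int32(m), Int32(n), Int32(k),
          1.0, A, Int32(m), B, Int32(k), 0.0, C, Int32(m))
    return C
end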

@ViralBShah
Member

ViralBShah commented Sep 5, 2022

Since Accelerate only has an LP64 BLAS and Julia wants ILP64, this is quite difficult, unless we can somehow make this choice dynamic as discussed in #891.

It should be possible to have a separate wrapper like @chriselrod discusses above that packages can directly invoke, but swapping it in as the default BLAS in Julia is fairly non-trivial.

@zinphi

zinphi commented Oct 11, 2022

I'm no expert, but wouldn't it be quite easy to write some kind of wrapper libblas which just redirects level-3 BLAS calls to Apple's Accelerate BLAS and all other calls to OpenBLAS? ILP64 does not really play much of a role for level-3 BLAS anyway, imho. On the other hand, level-3 BLAS routines are probably the only routines which benefit from Apple's AMX extension…
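As a note on feasibility, the ILP64-to-LP64 redirection itself is mostly just integer narrowing plus an overflow guard; a tiny sketch (the helper name is hypothetical):

# Narrow an ILP64 dimension/stride to the 32-bit integer an LP64 BLAS expects.
function to_lp64(n::Integer)
    n <= typemax(Int32) || throw(ArgumentError("dimension $n does not fit in an LP64 (Int32) BLAS call"))
    return Int32(n)
end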

@ViralBShah
Member

Yes we could have a wrapper library that redirects all 64-bit ILP64 calls to a 32-bit BLAS. It seems like it would be easier to have Apple just provide ILP64 support with mangled names. Intel is doing that in MKL now.

Do we have a way to ask Apple to do this? @Keno @staticfloat Perhaps one of you has a contact at Apple?

@Keno
Member

Keno commented Oct 15, 2022

We can ask.

@zinphi

zinphi commented Oct 17, 2022

Nice that you guys have the appropriate contacts ;-) However, from what I've heard in other discussions, Apple currently seems to assign very few resources to their BLAS/LAPACK development, so I wouldn't bet on them... Nevertheless, I keep my fingers crossed ^^
The other proper solution would be to have AMX kernels implemented in OpenBLAS. It seems that (Apple M1) AMX has now been reverse-engineered (unofficially) more or less completely: https://github.com/corsix/amx. I guess with this knowledge the folks from OpenBLAS should be able to do their job (if no legal restrictions apply).

@ViralBShah
Member

Please do file the pointer to the Apple AMX kernels in an issue on the OpenBLAS GitHub repo. Yes, it would be great for OpenBLAS to have those kernels.

@zinphi

zinphi commented Oct 17, 2022

I tried my best and opened a new issue there (see OpenMathLib/OpenBLAS#3789). Let's see what they think about that.

@mzy2240

mzy2240 commented Nov 24, 2022

Might be relevant: mlpack/mlpack#3308

@ctkelley

ctkelley commented Jan 31, 2023

I just got a shiny new Mac Mini with an M2 Pro, so I thought I'd see how Apple Accelerate scaled. I timed gemm and lu with both OpenBLAS and Accelerate. It seems that Accelerate's advantage declines as the problem size increases. This is worse with lu than with gemm.

It's also interesting, at least to me, that Accelerate does so well for the single-precision matrix multiply.

This is far from a definitive analysis, but it makes me nervous about swapping OpenBLAS for Accelerate.

I ran this on 1.9-beta3

julia> using Random, LinearAlgebra, AppleAccelerateLinAlgWrapper, BenchmarkTools

julia> function testme(T = Float64)
           Random.seed!(46071)
           for p = 8:13
              N = 2^p

              A = rand(T, N, N)

              tblasmm = @belapsed $A * $A
              taccmm = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $A)
              println("MM: Dim= $N. BLAS time = $tblasmm. Apple time = $taccmm")


              tblaslu = @belapsed lu($A)
              tacclu = @belapsed AppleAccelerateLinAlgWrapper.lu($A)
              println("LU: Dim= $N. BLAS time = $tblaslu. Apple time = $tacclu")
          end

       end
testme (generic function with 2 methods)


The results for double precision:

julia> testme()
MM: Dim= 256. BLAS time = 1.84333e-04. Apple time = 9.31660e-05
LU: Dim= 256. BLAS time = 7.17958e-04. Apple time = 1.82250e-04
MM: Dim= 512. BLAS time = 8.25958e-04. Apple time = 4.33709e-04
LU: Dim= 512. BLAS time = 1.27475e-03. Apple time = 1.05083e-03
MM: Dim= 1024. BLAS time = 6.32771e-03. Apple time = 3.39408e-03
LU: Dim= 1024. BLAS time = 4.43283e-03. Apple time = 3.46121e-03
MM: Dim= 2048. BLAS time = 4.78387e-02. Apple time = 2.98090e-02
LU: Dim= 2048. BLAS time = 2.33295e-02. Apple time = 1.78365e-02
MM: Dim= 4096. BLAS time = 3.94479e-01. Apple time = 2.33508e-01
LU: Dim= 4096. BLAS time = 1.61876e-01. Apple time = 1.50505e-01
MM: Dim= 8192. BLAS time = 3.11572e+00. Apple time = 1.82149e+00
LU: Dim= 8192. BLAS time = 1.17175e+00. Apple time = 2.54976e+00

and for single

julia> testme(Float32)
MM: Dim= 256. BLAS time = 1.34667e-04. Apple time = 2.45840e-05
LU: Dim= 256. BLAS time = 6.53583e-04. Apple time = 1.25709e-04
MM: Dim= 512. BLAS time = 4.42458e-04. Apple time = 1.05875e-04
LU: Dim= 512. BLAS time = 1.26879e-03. Apple time = 5.36500e-04
MM: Dim= 1024. BLAS time = 3.32025e-03. Apple time = 8.74250e-04
LU: Dim= 1024. BLAS time = 3.40737e-03. Apple time = 2.66488e-03
MM: Dim= 2048. BLAS time = 2.44754e-02. Apple time = 9.16629e-03
LU: Dim= 2048. BLAS time = 1.38886e-02. Apple time = 1.42406e-02
MM: Dim= 4096. BLAS time = 1.94998e-01. Apple time = 7.03759e-02
LU: Dim= 4096. BLAS time = 8.70666e-02. Apple time = 8.09671e-02
MM: Dim= 8192. BLAS time = 1.54402e+00. Apple time = 5.09572e-01
LU: Dim= 8192. BLAS time = 6.15488e-01. Apple time = 6.45579e-01

@domschl
Author

domschl commented Feb 1, 2023

So the decision might depend on your application scenario.

For machine learning, the decision would be clear (tested on MacBook Pro M2 Max, Julia head from 2023-01-26):

testme(Float32)
MM: Dim= 256. BLAS time = 5.1542e-5. Apple time = 2.4458e-5 factor=2.107367732439284
MM: Dim= 512. BLAS time = 0.000364542. Apple time = 0.0001065 factor=3.4229295774647883
MM: Dim= 1024. BLAS time = 0.003256166. Apple time = 0.000854958 factor=3.8085683741189627
MM: Dim= 2048. BLAS time = 0.024809042. Apple time = 0.008663458 factor=2.8636419776029385
MM: Dim= 4096. BLAS time = 0.205842959. Apple time = 0.067481875 factor=3.0503443924757576
MM: Dim= 8192. BLAS time = 1.73104175. Apple time = 0.503544333 factor=3.4377146887680294

A pretty consistent 3x speed advantage of Accelerate over OpenBLAS for matrix sizes relevant for machine learning operations.

@chriselrod
Contributor

I'd expect OpenBLAS sgemm to take at least 1.3 seconds with 8 cores for the 8192x8192 matrices:

julia> 2 * 8192^3 / (4*4*2*3.25e9*8)
1.3215283987692308

It requires 2 * 8192^3 FLOPs.
The CPU has 4 execution units with 4 Float32 each, doing 2 FLOPs/fma instruction, running at around 3.25e9 clock cycles/second, and there are 8 cores.

So, 0.615s reported by @ctkelley sounds too fast, and @domschl's 1.73s realistic.

Odd that @domschl's accelerate time was faster (0.5 vs 0.65s).

@ctkelley

ctkelley commented Feb 1, 2023

My 0.615 s was for LU. My MM numbers are pretty close to what @domschl got. So the OpenBLAS LU time is roughly 1/3 of the OpenBLAS MM time, as you would expect. The Apple LU times are hard for me to understand as the dimension grows. For dim = 8192, LU takes more time than MM.

@domschl
Author

domschl commented Mar 12, 2023

This might be an interesting simplification; there is new support for an ILP64 interface:

Release Notes Ventura 13.3 Beta 3

Accelerate

New Features

  • The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. These new interfaces provide additional functionality and a new ILP64 interface. To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers. For ILP64 interfaces, also define ACCELERATE_LAPACK_ILP64. For Swift projects, specify ACCELERATE_NEW_LAPACK=1 and ACCELERATE_LAPACK_ILP64=1 as preprocessor macros in Xcode build settings. (105572917)

@ViralBShah
Member

Nice!

@staticfloat
Member

Okay, I spun up a VM and tried it out. The good news is, many things work! The bad news is, it requires a hack to LBT to use their symbol names, since they don't use a simple suffix on the F77 symbols; they drop the trailing underscore from the symbol name (e.g. dgemm_ -> dgemm$NEWLAPACK$ILP64). I've requested that they keep that trailing underscore (both after dgemm and after ILP64, to maintain compatibility with gfortran compilers, which require a trailing underscore for all symbol names), but we'll see what they say. Another good piece of news is that their LAPACK implementation has been updated from 3.2.x to 3.9.x, so I think we're seeing a significant increase in Accelerate functionality!

I was running inside a VM, so benchmarks are basically useless; all I'll say is that Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x when running peakflops().
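For the curious, the renamed symbols can be inspected directly; a sketch assuming macOS 13.3 or later (on older systems the dlsym lookup simply fails):

using Libdl

handle = dlopen("/System/Library/Frameworks/Accelerate.framework/Accelerate")

# The new interface drops the Fortran trailing underscore and appends a suffix:
# dgemm_  ->  dgemm$NEWLAPACK$ILP64   (the `$` must be escaped inside a Julia string)
ptr = dlsym(handle, "dgemm\$NEWLAPACK\$ILP64")

# `ptr` can then be called via ccall with 64-bit (Int64) integer arguments,
# which is the kind of forwarding libblastrampoline sets up for every routine.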

@ViralBShah
Member

ViralBShah commented Mar 14, 2023

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to openblas (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.

OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.

@staticfloat
Member

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build

Yes, such a switch is actually quite easy to implement; we can even just try loading ILP64 Accelerate, and if it fails, we load OpenBLAS instead. It would go right here: https://github.com/JuliaLang/julia/blob/7ba7e326293bd3eddede81567bbe98078e81f775/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L645. We could also have it set via a Preference, or something like that.
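A sketch of what such a try-then-fall-back might look like (a hypothetical helper, not the actual stdlib code; it assumes an LBT new enough to accept a suffix_hint, and the OpenBLAS path is a placeholder):

using LinearAlgebra

function pick_default_blas(openblas_path::AbstractString)
    accelerate = "/System/Library/Frameworks/Accelerate.framework/Accelerate"
    try
        # The $NEWLAPACK$ILP64 symbols only exist on macOS >= 13.3, so this is
        # expected to fail on older systems.
        BLAS.lbt_forward(accelerate; clear = true, suffix_hint = "\$NEWLAPACK\$ILP64")
    catch
        BLAS.lbt_forward(openblas_path; clear = true)   # fall back to the bundled OpenBLAS
    end
    return BLAS.get_config()
end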

@ctkelley

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to openblas (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.

OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.

I'd like to see how the matrix factorizations do as well. Things were a bit strange (see my post above) with the old version. If everything is 3X faster now, everyone wins.

@cbrnr

cbrnr commented Mar 23, 2023

From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance.

Again, this is only anecdotal evidence, and I am certainly not trying to bash OpenBLAS. However, I think this underscores the importance of being able to use an optimized BLAS on every supported platform. Currently, BLAS-dependent calculations are much slower than they could/should be on Apple Silicon, to the point that (depending on the algorithm, of course) Julia may no longer be the best choice for extremely fast performance.

@ctkelley

Are you seeing this speedup for factorizations (LU, Cholesky, SVD, QR, ...)? I am not, for the older (pre-OS 12.3) version of Accelerate.

@cbrnr

cbrnr commented Mar 23, 2023

To be honest, I don't know which operations are involved in those ICA algorithms, but I'm guessing that SVD is very likely part of it. I am on macOS 13.2.1.

@ctkelley

What's k? If it's the prefactor in the O-term, then do you mean k n^3 + O(n^2)?

@philipturner

It's the coefficient of the n^3 term. At sufficiently large data sizes, the cubic term will dominate. This will give us an accurate estimate of (computational cost, GFLOPS, GFLOPs, ALU %); accurate to the correct order of magnitude. Whether it's precise is a different story.

@ctkelley

ctkelley commented Mar 31, 2023 via email

@philipturner

You could do that yourself with the data in my post.

My bad. Here's the data. For LU decomposition, k = 2/3. So multiply these numbers by 2/3.

GFLOPS/k O64 A64 O32 A32
256 52 85 65 120
512 157 99 175 221
1024 288 262 396 449
2048 409 385 627 651
4096 449 397 829 878
8192 482 94 912 683
16384 501 312 980 382
32768 479 529 980 907

The data varies a lot because only one trial was taken. I usually run several dozen trials in quick succession, then select the one with maximum speed. I also run the benchmark multiple times, each separated by ~20 seconds, to account for varying system conditions.

@chriselrod
Contributor

chriselrod commented Mar 31, 2023

GFLOPS is already scale invariant.
Floating point operations/time.

When the values change as a function of size, you know the algorithm is more/less efficient at those sizes.

@ctkelley

ctkelley commented Mar 31, 2023 via email

@philipturner

philipturner commented Mar 31, 2023

GFLOPS is already scale invariant.
Floating point operations/time.

Good observation. I originally created that metric because I wanted to write an eigendecomposition algorithm using AMX assembly. Using GFLOPS to measure performance would be misleading for comparing different algorithms. For example:

  • An algorithm costing 10n^3 + an^2 + bn reaches 75% ALU utilization, or 1200 GFLOPS
  • An algorithm costing 2n^3 + an^2 + bn reaches 25% ALU utilization, or 400 GFLOPS

Which one is faster? In addition, for some algorithms you don't know the k prefactor, infamously with a failed eigendecomposition benchmark. My GFLOPS was off by an order of magnitude, and I still don't know what k is.

@chriselrod
Contributor

Using GFLOPS to measure performance would be misleading to compare different algorithms.

Good point. As another example, parallel QR needs more FLOPs. Or Strassen vs standard matmul.
An algorithm that needs fewer FLOPs to get the same answer obviously has an advantage.

I'd consistently use time / f(size), where f(size) should be the same across all algorithms, and not the actual gflops that algorithm used.
You could pick the most efficient or naive algorithm, and use its flop formula to define f, e.g. 2M*K*N for matmul, or f(n) = 2n^3 + an^2 + bn for eigendecomposition.

Which is similar to what you'd suggested.

Maybe you could define f with a naive algorithm.
RecursiveFactorization.jl uses this for its lu benchmarks: https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/cbfeaeed45bbeff53b5945e66de5acb55830f4e4/perf/lu.jl#L7-L19
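A tiny illustration of that convention (the formulas are the usual leading-order operation counts, not the FLOPs any particular implementation performs):

# Normalize timings by a fixed, algorithm-independent flop formula.
mm_flops(m, k, n) = 2m * k * n       # matrix multiply C = A*B
lu_flops(n)       = 2n^3 / 3         # LU factorization, leading term

normalized_gflops(t_seconds, flops) = flops / t_seconds / 1e9

# e.g. normalized_gflops(0.233, mm_flops(4096, 4096, 4096))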

@ctkelley

I did another run on an unloaded machine and got the same results. I use @belapsed from BenchmarkTools to get the timings, and that's been very reproducible, at least for me. So the bizarre M2 Pro behavior for Float64 really seems to be there for lu. One thing I just noticed is that the fan on my Mac Mini was running hard during the OpenBLAS computation and not running at all during the Accelerate part.

I'm doing the computation on my M2 MacBook Air as I type this. The OpenBLAS computation ran hot, and the load averages from the w command were in the 6 range (on a machine with 4 BLAS threads). The Accelerate computation has load averages below 2 and seems to be running cooler. If the machine is on my lap, I like the run-cool part.

@ctkelley

Herewith the M2 MacBook Air lu results. They look a bit more consistent than the ones from the M2 Pro.

I will stop doing this now.

Factorization = lu!
        N      O-64      A-64      R-64      O-32      A-32      R-32
      256  3.43e-04  2.01e-04  5.85e-01  2.69e-04  1.44e-04  5.36e-01
      512  1.08e-03  1.38e-03  1.28e+00  9.97e-04  6.17e-04  6.18e-01
     1024  5.72e-03  1.13e-02  1.98e+00  4.25e-03  3.76e-03  8.87e-01
     2048  4.12e-02  7.23e-02  1.76e+00  2.31e-02  4.11e-02  1.78e+00
     4096  3.16e-01  3.93e-01  1.24e+00  1.70e-01  2.40e-01  1.41e+00
     8192  2.54e+00  2.43e+00  9.56e-01  1.32e+00  1.22e+00  9.25e-01
    16384  2.06e+01  1.88e+01  9.11e-01  1.04e+01  7.31e+00  7.00e-01
    32768  1.79e+02  2.08e+02  1.16e+00  9.02e+01  6.39e+01  7.08e-01

@philipturner

philipturner commented Mar 31, 2023

Which is similar to what you'd suggested.

Yours (k/GFLOPS) would help in an effort to predict the execution time of an entire app. For example, density functional theory (DFT) uses several LAPACK functions. You might sum the k/GFLOPS and multiply by (#electrons)^3. With my metric, I wanted to easily map it to actual ALU %, and if I got k wrong, not have to rewrite the raw data. I can also compare ALU % for complex operations to real operations by shifting k_real <-> 0.25k_complex. That brings up another important point - Accelerate really struggles on complex-valued functions.

OpenBLAS is more than 2x slower on the M2, while Accelerate is exactly 2x slower. OpenBLAS could be trying to treat the heterogeneous P and E blocks as homogeneous - major load imbalance. This would decrease the performance gap with Accelerate.

GFLOPS/k, k=2/3 O64 A64 O32 A32
256 49 83 62 117
512 130 97 135 218
1024 188 95 253 286
2048 168 119 372 209
4096 217 175 404 286
8192 216 226 416 451
16384 213 234 423 602
32768 197 169 390 551

@ctkelley

ctkelley commented Mar 31, 2023

Just tested Accelerate on something I'm working on. It's not very good at the triangular solves you do after lu. I'm seeing the same inconsistent scaling, even in Float32. For this project, I need to report timings on triangular solves to make a point about iterative refinement. OpenBLAS is completely predictable, consistent with theory, and does what I need. Accelerate is a bit too flaky for publication work.

None of the problems we're talking about here should be impossible for Apple to fix, especially the complex function part in @philipturner 's post above, if they care.

@philipturner

philipturner commented Mar 31, 2023

None of the problems we're talking about here should be impossible for Apple to fix

Most problems (aside from minor GFLOPS blips) probably are impossible to fix.

if they care.

I wouldn't phrase it like that. Apple engineers know OpenBLAS exists, and it already serves its purpose: high throughput, no matter the energy cost. Nvidia RTX 4090 exists, and it has the highest TFLOPS but zero regard for power efficiency. Apple's AMX hardware was tailored to be an "ML accelerator" (sic) and not a linear algebra accelerator. It performs real-valued convolutions or batched real-FP32 matrix multiplications. Modestly high-throughput real-FP64 was just a cool side effect.

Rather than reinvent the wheel, they gave developers something new - the ability to trade off between sheer performance and power efficiency. That is crucial for more apps than you think. They also did this with the GPU, sacrificing performance for the sake of efficiency.

@yakovbraver

I might be late to the party, but here are some additional results. I used a script based on @ctkelley's script above. Notably, I added ComplexF64 calculations (the "-64Z" columns below).
Tested on Mac mini M2 Pro (12/19), 32 GB RAM, with julia -t auto (giving 8 Julia threads and also 8 BLAS threads).

Dot product

Test of dot(A, B)
       N    O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
      32  1.9e-08  1.2e-08   0.67  1.0e-08  8.8e-09   0.87  7.0e-09  8.8e-09   1.24
      64  4.0e-08  2.0e-08   0.49  1.6e-08  1.1e-08   0.72  7.9e-09  9.9e-09   1.25 
     128  8.4e-08  3.8e-08   0.45  3.5e-08  1.8e-08   0.51  1.3e-08  1.2e-08   0.95 
     256  1.9e-07  7.2e-08   0.37  7.2e-08  3.1e-08   0.44  1.9e-08  1.8e-08   0.96 
     512  4.1e-07  1.5e-07   0.36  1.7e-07  6.1e-08   0.36  3.2e-08  3.2e-08   1.00 
    1024  8.5e-07  2.9e-07   0.35  3.9e-07  1.3e-07   0.34  6.2e-08  6.3e-08   1.02 
    2048  1.7e-06  5.9e-07   0.34  8.3e-07  2.8e-07   0.34  1.3e-07  1.4e-07   1.01 
    4096  3.5e-06  1.2e-06   0.34  1.7e-06  5.7e-07   0.34  2.8e-07  1.7e-07   0.62 
    8192  7.0e-06  2.9e-06   0.41  3.5e-06  1.2e-06   0.34  5.7e-07  2.9e-07   0.50 
   16384  1.4e-05  5.8e-06   0.41  7.0e-06  2.9e-06   0.41  1.2e-06  5.0e-07   0.43 
   32768  2.8e-05  1.1e-05   0.40  1.4e-05  5.7e-06   0.41  2.9e-06  9.4e-07   0.33

Here, Accelerate gives a nice ~2.5x–3x speed-up for ComplexF64 and Float64.

Matrix multiplication

Test of mul!(C, A, B)
       N    O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
      32  1.2e-05  4.1e-06   0.34  1.7e-06  3.4e-07   0.20  8.9e-07  2.3e-07   0.25 
      64  1.2e-05  1.5e-05   1.33  1.3e-05  1.8e-06   0.14  5.8e-06  6.1e-07   0.11 
     128  1.1e-04  7.8e-05   0.71  2.2e-05  1.2e-05   0.55  1.3e-05  3.5e-06   0.27 
     256  4.4e-04  6.1e-04   1.39  1.1e-04  9.0e-05   0.83  5.2e-05  2.4e-05   0.47 
     512  3.8e-03  3.2e-03   0.84  7.3e-04  3.8e-04   0.51  3.6e-04  1.1e-04   0.29 
    1024  2.7e-02  2.2e-02   0.80  5.7e-03  3.1e-03   0.54  2.9e-03  7.6e-04   0.26 
    2048  2.2e-01  1.5e-01   0.69  4.6e-02  3.0e-02   0.66  2.3e-02  8.3e-03   0.35 
    4096  1.8e+00  1.1e+00   0.60  3.6e-01  2.3e-01   0.63  1.9e-01  6.6e-02   0.35 
    8192  1.4e+01  7.8e+00   0.56  2.9e+00  1.8e+00   0.64  1.5e+00  5.0e-01   0.32 

These results are in line with @ctkelley's. Accelerate shines in real floating-point calculations.

Eigenvalues and eigenvectors for complex Hermitian and real symmetric matrices

Test of eigen(A)
       N    O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
      32  8.9e-05  8.6e-05   0.97  7.7e-05  6.6e-05   0.85  7.0e-05  5.3e-05   0.76 
      64  4.9e-04  4.8e-04   0.99  3.3e-04  2.6e-04   0.80  2.9e-04  2.3e-04   0.79 
     128  2.5e-03  3.3e-03   1.33  1.3e-03  1.2e-03   0.90  1.2e-03  9.7e-04   0.83 
     256  1.2e-02  1.4e-02   1.09  6.4e-03  4.8e-03   0.74  6.6e-03  5.0e-03   0.76 
     512  4.5e-02  8.2e-02   1.81  2.7e-02  2.6e-02   0.96  3.3e-02  3.1e-02   0.93 
    1024  2.1e-01  5.3e-01   2.57  1.3e-01  1.5e-01   1.09  1.4e-01  1.4e-01   1.04 
    2048  1.4e+00  6.1e+00   4.47  8.0e-01  1.1e+00   1.39  6.5e-01  7.8e-01   1.20 
    4096  9.6e+00  4.6e+01   4.83  4.7e+00  1.1e+01   2.22  2.0e+01  6.4e+00   0.31 
    8192  6.6e+01  3.7e+02   5.59  3.3e+01  7.9e+01   2.39  1.6e+02  1.4e+02   0.86

Here, Accelerate slows down the calculation.

Parallel matrix multiplication

Finally, I wanted to see how Accelerate would perform in a parallel context. I prepared 800 pairs of matrices and multiplied them in a sequential loop and also in a Threads.@threads loop (see the sketch below). When running the sequential loop I still allowed OpenBLAS parallelism to take place, but for the threaded loop I used BLAS.set_num_threads(1).
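(Roughly, the threaded variant looks like this sketch; the function name is a placeholder, not the exact script.)

using LinearAlgebra, Base.Threads

# Multiply many independent pairs with one BLAS thread per Julia thread,
# so BLAS-internal threading does not oversubscribe the cores.
function threaded_mul!(Cs, As, Bs)
    BLAS.set_num_threads(1)
    @threads for i in eachindex(As)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end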
For N = 32 I obtained the following:

               O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
 Sequential: 1.0e-02  3.7e-03   0.35  1.6e-03  4.8e-04   0.31  7.7e-04  2.2e-04   0.29 
 Parallel:   1.5e-03  8.9e-04   0.61  7.8e-04  1.5e-04   0.19  6.9e-04  7.9e-05   0.12
 Speed-up:       6.7      4.2             2.1      3.2             1.1      2.8

We get a considerable speed-up when parallelising the Accelerate calculations.
And now N = 512:

               O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
 Sequential: 3.2e+00  2.7e+00   0.85  6.4e-01  5.5e-01   0.86  4.6e-01  1.7e-01   0.36 
 Parallel:   2.8e+00  1.6e+00   0.56  5.7e-01  3.7e-01   0.65  2.8e-01  1.1e-01   0.38
 Speed-up:       1.1      1.7             1.1      1.5             1.6      1.5

Perhaps running all 8 cores to get a 1.5x speed-up is not worth it, but at least it seems that threading Accelerate code does not degrade performance.

All in all, I think it is certainly beneficial to expose Accelerate to the general user.

@philipturner

In my experience, threading significantly degraded performance, at least for SGEMM and DGEMM when harnessing AMX. N=32 was just barely utilizing the AMX on my machine (>50 GFLOPS FP64, <100 GFLOPS FP32). With yours, both FP64 and FP32 would have nonphysical performance if using only single-core NEON.

Matrix Multiplication Performance.xlsx

On the M1 family with old Accelerate, complex multiplications had 2x less ALU utilization than real multiplications. That changed to 1x with M2, new Accelerate.

BLAS Performance.xlsx
https://github.com/danielchalef/openblas-benchmark-m1

ZHEEV and DSYEV, k=unknown. I adjusted k for complex numbers to properly compare ALU % (k -> 0.25k). Accelerate has the same ALU utilization with complex, OpenBLAS has 2x more. I'll need to test whether ZHEEV_2STAGE and DSYEV_2STAGE are faster.

GFLOPS/k_adjusted O64Z A64Z O64 A64
32 1.47 1.52 0.43 0.50
64 2.14 2.18 0.79 1.01
128 3.36 2.54 1.61 1.75
256 5.59 4.79 2.62 3.50
512 11.93 6.55 4.97 5.16
1024 20.45 8.10 8.26 7.16
2048 24.54 5.63 10.74 7.81
4096 28.63 5.98 14.62 6.25
8192 33.32 5.94 16.66 6.96

Sequential ZGEMM and DGEMM, k=2. My machine reached 78 GFLOPS for A64, N=32. Yours reached 109.2, maybe because M2 can issue more AMX instructions/cycle.

GFLOPS/k_adjusted O64Z A64Z O64 A64
32 10.5 28.3 16.4 54.6
512 134.2 159.1 167.8 195.2
GFLOPS O64Z A64Z O64 A64
32 21.0 56.6 32.8 109.2
512 268.4 318.2 335.6 390.4

@ViralBShah
Member

ViralBShah commented Apr 2, 2023

@philipturner Thanks for the in-depth benchmarking. I am doing the dead simple one with peakflops, in which Accelerate gives twice the FLOP rate of OpenBLAS. I suspect the first step here is to get a new version of LBT merged and tagged and included in Julia 1.9.

At that point AppleAccelerate.jl or a similar package can set up BLAS forwarding like we do with MKL.jl. That way, in Julia 1.9 (or 1.9.1), we can get the ability to use Accelerate with just a Pkg.add(). Later on, as we gain more experience, we can decide if we want it to be the default - but looking at your experiments, the evidence for that is unclear.
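For reference, the "dead simple" comparison is just LinearAlgebra.peakflops(), which times a dense Float64 matrix multiply with whichever backend is currently loaded:

using LinearAlgebra

# Run once with the default OpenBLAS, then again after forwarding to Accelerate,
# and compare the returned FLOP rates.
LinearAlgebra.peakflops(4096)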

@staticfloat
Member

Alright, I think the next step is to improve AppleAccelerate.jl to auto-load ILP64 Accelerate when it is available. We may also want to give it the ability to load LAPACK_jll to provide LAPACK and only use Accelerate for the underlying BLAS calls. This simultaneously dodges the Cholesky bug that we identified and lets us see whether vanilla LAPACK_jll provides any kind of performance improvement over Accelerate's LAPACK (doubtful). An even more interesting test would be to load OpenBLAS first, then load only Accelerate's libBLAS.dylib on top of that, so that we use OpenBLAS's threaded LAPACK routines and Accelerate's BLAS routines.
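Sketched out, that layering experiment might look like the following (paths are illustrative; the idea is that a later lbt_forward without clear=true only overrides the symbols the newly loaded library actually provides):

using LinearAlgebra

# 1. Load a full ILP64 BLAS + LAPACK first (e.g. the bundled OpenBLAS or LAPACK_jll).
BLAS.lbt_forward("/path/to/libopenblas64_.dylib"; clear = true)

# 2. Layer only Accelerate's BLAS library on top, keeping the LAPACK routines from step 1.
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/libBLAS.dylib";
                 suffix_hint = "\$NEWLAPACK\$ILP64")

BLAS.get_config()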

@ViralBShah
Member

ViralBShah commented Apr 6, 2023

I think the LAPACK in our OpenBLAS probably links directly to its own BLAS and not to LBT. Some work needs to be done to assemble Accelerate's BLAS, OpenBLAS's accelerated LAPACK routines, and the rest of LAPACK.

The LAPACK in BB, on the other hand, does link to LBT. So I think the recipe would be to load Accelerate first, then LAPACK from BB, and finally manually forward a few LAPACK routines to OpenBLAS.

@staticfloat
Member

With JuliaLinearAlgebra/AppleAccelerate.jl#58 and incorporating this very minor fix to our LinearAlgebra test suite, we actually pass the entire Julia test suite while running on top of Accelerate.

My contact at Apple has asked for any examples of real-world usage that can be shown to work well on Accelerate; given that the LinearAlgebra test suite finishes with a small performance improvement over the OpenBLAS test suite, I'm inclined to think that small-scale LinearAlgebra problems may have a natural advantage here (as that is the majority of our test suite). They are looking for good examples that they can use to bolster the importance of developing good tools for people like us, so don't be shy!

@philipturner

For very small problems, which are too small for OpenBLAS to harness multiple cores, Accelerate should excel. It uses the AMX because it's going to help single-core performance. For example, iPhones have only 2 performance cores so the AMX doesn't have any less vector throughput than the NEON units. Unfortunately, this tactic doesn't scale when the amount of NEON compute increases.

@ViralBShah
Member

ViralBShah commented May 22, 2023

Support is already on AppleAccelerate.jl master, and will be released when JuliaLinearAlgebra/AppleAccelerate.jl#62 merges.

Announcement in https://discourse.julialang.org/t/appleaccelerate-jl-v0-4-0/99351/3
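For end users the flow is then roughly the following (hedged sketch; exact behavior and macOS requirements are described in the AppleAccelerate.jl documentation):

using LinearAlgebra
using AppleAccelerate       # on a sufficiently new macOS this forwards BLAS/LAPACK to Accelerate via LBT

BLAS.get_config()           # should now list Accelerate among the loaded backends
LinearAlgebra.peakflops()   # quick sanity check of the new backend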

@ctkelley

ctkelley commented Sep 26, 2023

Sonoma does better than OS 13.x. Here are some revised lu! numbers

        N      O-64      A-64      R-64      O-32      A-32      R-32
      256  6.87e-04  1.46e-04  2.13e-01  6.66e-04  1.41e-04  2.11e-01
      512  1.22e-03  7.89e-04  6.48e-01  1.26e-03  6.07e-04  4.81e-01
     1024  3.95e-03  3.28e-03  8.32e-01  3.26e-03  2.45e-03  7.51e-01
     2048  2.05e-02  2.07e-02  1.01e+00  1.38e-02  1.55e-02  1.12e+00
     4096  1.49e-01  1.52e-01  1.02e+00  8.37e-02  8.61e-02  1.03e+00
     8192  1.14e+00  1.27e+00  1.11e+00  6.04e-01  6.52e-01  1.08e+00

The discontinuity at higher dimensions seems to be gone. 8 core M2 Pro.

@oscarbg

oscarbg commented Apr 13, 2024

Unrelated to the issue, but can anyone share some M3 perf numbers, e.g. for lu?

@philipturner

Generally, one-sided factorizations like LU should be fast in Accelerate. They can quite easily use block size 32 on a single AMX core, so they reach a good fraction of GEMM GFLOPS.

Two-sided factorizations like tridiagonalization are where Accelerate (and OpenBLAS) are painfully slow. I had to write custom kernels that convert these into panel (QR) factorizations with most computations batched into small GEMM calls, followed by the final bulge chasing, which still needs further optimization at larger block sizes.

Performance on the M3 should be no different from the M1 or M2. The hardware hasn't changed much fundamentally, except for some new instruction-issuing capabilities and lower precisions for AI.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024
@oscarbg

oscarbg commented Nov 27, 2024 via email
