
BLAS support for M1 ARM64 via Apple's Accelerate #869

Closed
domschl opened this issue Sep 19, 2021 · 96 comments
Labels
needs decision: A decision on this change is needed
system:apple silicon: Affects Apple Silicon only (Darwin/ARM64), e.g. M1 and other M-series chips

Comments

@domschl

domschl commented Sep 19, 2021

The default BLAS Julia uses is OpenBLAS. Apple's M1 has proprietary dedicated matrix hardware that is only accessible via Apple's Accelerate BLAS implementation. That proprietary interface can provide 2x to 4x speedups for some linear algebra use cases (see https://discourse.julialang.org/t/does-mac-m1-in-multithreads-is-slower-that-in-single-thread/61114/12?u=kristoffer.carlsson for some benchmarks and discussion).

Since Julia 1.7 there's a BLAS multiplexer (libblastrampoline, LBT).

So in theory it should be possible to extend this so that, depending on the platform, either OpenBLAS or another BLAS implementation is used transparently by default.

This issue discusses what needs to be done to make Accelerate's access to the M1's hardware acceleration available by default in Julia.
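For illustration, here is a minimal sketch of what a backend swap looks like from the Julia side via libblastrampoline; the path below is purely a placeholder (in practice a package such as MKL.jl performs the forwarding in its __init__), so the call is shown commented out.

using LinearAlgebra

# Show which backend(s) libblastrampoline is currently forwarding to (OpenBLAS by default).
BLAS.get_config()

# Forward all BLAS/LAPACK calls to another library at runtime.
# The path is a placeholder; a real package supplies the correct one for the platform.
# BLAS.lbt_forward("/path/to/some/blas.dylib"; clear = true)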

@ViralBShah
Member

The plan was to have LBT as a way to pick a different BLAS than the default OpenBLAS for now. That requires you to load a package every time you start Julia to change the default. Eventually, once the Preferences mechanism becomes standard, we want to use that so that users can pick a different BLAS by default.

I don't think we want to depend on the Apple-provided BLAS by default for the M1 for now.

@ViralBShah ViralBShah added needs decision A decision on this change is needed system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips labels Sep 19, 2021
@domschl
Author

domschl commented Sep 20, 2021

Some (anecdotal) benchmark scenarios that might illustrate why Accelerate makes sense (at least for gemm-type operations). I've written a machine learning framework and benchmark suite, syncognite/bench, in C++ using Eigen3 (so neither Julia nor OpenBLAS), but it might serve as a hint for what kind of performance improvements are possible.

The following table compares a benchmark run with native Eigen3 on ARM64 against a second run using Apple's Accelerate within Eigen (via the USE_SYSTEM_BLAS variable). The percentage columns show the speed-up for the forward (fw) and backward (bw) passes of neural network training when using Accelerate, e.g.:

  • Affine layers (basically a matrix multiplication): 480-600% faster
  • Affine layer with ReLU: 420-500% faster
  • LSTM recurrent networks: 200-300% faster

All that with very low energy usage!

Layer             fw(ms) fw/N(ms)   %  bw(ms) bw/N(ms)   %
OneHot             1.068  0.0107  0.1   0.000  0.0000 66.6
Affine             0.464  0.0050 620.1  1.210  0.0130 485.4
AffineRelu         0.538  0.0056 506.4  1.294  0.0135 424.4
Relu               0.023  0.0002  5.8   0.423  0.0042 -1.8
Nonlin-Relu        0.064  0.0006 -1.9   0.067  0.0007  0.7
Nonlin-Sgmd        0.126  0.0012 -1.2   0.067  0.0007 -1.8
Nonlin-Tanh        0.099  0.0010 -0.0   0.070  0.0007 -0.5
Nonlin-Selu        0.746  0.0075 -0.1   0.649  0.0065 -0.4
Nonlin-Resilu      0.202  0.0020  0.2   0.243  0.0024 -0.6
Dropout            0.713  0.0072 -0.8   0.030  0.0003  1.5
BatchNorm          0.159  0.0016 -3.1   0.217  0.0022 -4.6
SpatialBtchNrm     2.610  0.0260 -1.0   3.936  0.0393 -0.2
Convolution       54.388  0.5470 20.3  58.559  0.5869 45.1
Pooling            6.421  0.0643 -0.0  22.589  0.2262  1.7
LSTM              27.616  0.2778 299.8 78.172  0.7677 198.0
RNN               16.929  0.1660 110.4 33.648  0.3297 114.6
TemporalAff        5.704  0.0569 105.7  5.545  0.0554 209.3
WordEmbed          5.379  0.0541 129.1  4.227  0.0419 116.0
SVM                0.229  0.0023  0.5   0.225  0.0022  0.9
Softmax            0.228  0.0023  0.1   0.022  0.0002  1.5
TemporalSM         4.356  0.0434 -0.8   1.401  0.0140  0.1
TwoLayerNet        1.776  0.0177 300.5  2.555  0.0255 425.1

@DilumAluthge
Member

DilumAluthge commented Sep 20, 2021

I think @chriselrod has played around with this a little bit.

@ViralBShah
Member

The first thing is to create an MKL.jl-like package for Apple Accelerate. We already have some support in LBT for Accelerate, so this should be fairly quick.

@chriselrod
Contributor

A difficulty is that Accelerate uses the 32-bit (LP64) API, which is why AppleAccelerateLinAlgWrapper manually defines the methods it uses (and is based on Elliot's code).

(Also, AppleAccelerateLinAlgWrapper has a deliberately cumbersome name to avoid stealing/squatting on potentially valuable names for a future such package that supersedes it.)
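To make the 32-bit issue concrete, here is a rough sketch (not AppleAccelerateLinAlgWrapper's actual code) of what such a manual wrapper has to do: call Accelerate's classic dgemm_ with Int32 dimensions, whereas Julia's default ILP64 BLAS expects Int64. The function name is made up for illustration.

using Libdl

# Classic (LP64) Accelerate BLAS; this is the standard macOS framework location.
const libacc = "/System/Library/Frameworks/Accelerate.framework/Accelerate"

# C := A*B using Accelerate's dgemm_ with 32-bit integer arguments.
function acc_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A); n = size(B, 2)
    ccall((:dgemm_, libacc), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                  # transA, transB
           Ref{Int32}, Ref{Int32}, Ref{Int32},      # m, n, k -- note Int32, not Int64
           Ref{Float64}, Ptr{Float64}, Ref{Int32},  # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}), # beta, C, ldc
          UInt8('N'), UInt8('N'), Int32(m), Int32(n), Int32(k),
          1.0, A, Int32(m), B, Int32(k), 0.0, C, Int32(m))
    return C
end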

@ViralBShah
Member

ViralBShah commented Sep 5, 2022

Since Accelerate only has an LP64 BLAS and Julia wants ILP64, this is quite difficult, unless we can somehow make this choice dynamic as discussed in #891.

It should be possible to have a separate wrapper like @chriselrod discusses above that packages can directly invoke, but swapping it in as the default BLAS in Julia is fairly non-trivial.

@zinphi

zinphi commented Oct 11, 2022

I'm no expert, but wouldn't it be quite easy to write some kind of wrapper libblas which just redirects level-3 BLAS calls to Apple's Accelerate BLAS and all other calls to OpenBLAS? ILP64 does not really play much of a role for level-3 BLAS anyway, imho. On the other hand, level-3 BLAS routines are probably the only routines which benefit from Apple's AMX extension…
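As a note on feasibility, the ILP64-to-LP64 redirection itself is mostly just integer narrowing plus an overflow guard; a tiny sketch (the helper name is hypothetical):

# Narrow an ILP64 dimension/stride to the 32-bit integer an LP64 BLAS expects.
function to_lp64(n::Integer)
    n <= typemax(Int32) || throw(ArgumentError("dimension $n does not fit in an LP64 (Int32) BLAS call"))
    return Int32(n)
end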

@ViralBShah
Member

Yes we could have a wrapper library that redirects all 64-bit ILP64 calls to a 32-bit BLAS. It seems like it would be easier to have Apple just provide ILP64 support with mangled names. Intel is doing that in MKL now.

Do we have a way to ask Apple to do this? @Keno @staticfloat Perhaps one of you has a contact at Apple?

@Keno
Member

Keno commented Oct 15, 2022

We can ask.

@zinphi

zinphi commented Oct 17, 2022

Nice that you guys have the appropriate contacts ;-) However, from what I've heard in other discussions, Apple currently seems to assign very few resources to their BLAS/LAPACK development, so I wouldn't bet on them... Nevertheless, I keep my fingers crossed ^^
The other proper solution would be to have AMX kernels implemented in OpenBLAS. It seems that (Apple M1) AMX has now been reverse-engineered (unofficially) more or less completely: https://github.com/corsix/amx. I guess with this knowledge the folks from OpenBLAS should be able to do their job (if no legal restrictions apply).

@ViralBShah
Member

Please do file the pointer to the Apple AMX kernels in an issue on the OpenBLAS GitHub repo. Yes, it would be great for OpenBLAS to have those kernels.

@zinphi

zinphi commented Oct 17, 2022

I tried my best and opened a new issue there (see OpenMathLib/OpenBLAS#3789). Let's see what they think about that.

@mzy2240

mzy2240 commented Nov 24, 2022

Might be relevant: mlpack/mlpack#3308

@ctkelley

ctkelley commented Jan 31, 2023

I just got a shiny new Mac Mini with an M2 Pro, so I thought I'd see how Apple Accelerate scaled. I timed gemm and lu with both OpenBLAS and Accelerate. It seems that Accelerate's advantage declines as the problem size increases. This is worse with lu than with gemm.

It's also interesting, at least to me, that Accelerate does so well for the single-precision matrix multiply.

This is far from a definitive analysis, but it makes me nervous about swapping OpenBLAS for Accelerate.

I ran this on 1.9-beta3

julia> using Random, LinearAlgebra, AppleAccelerateLinAlgWrapper, BenchmarkTools

julia> function testme(T = Float64)
           Random.seed!(46071)
           for p = 8:13
              N = 2^p

              A = rand(T, N, N)

              tblasmm = @belapsed $A * $A
              taccmm = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $A)
              println("MM: Dim= $N. BLAS time = $tblasmm. Apple time = $taccmm")


              tblaslu = @belapsed lu($A)
              tacclu = @belapsed AppleAccelerateLinAlgWrapper.lu($A)
              println("LU: Dim= $N. BLAS time = $tblaslu. Apple time = $tacclu")
          end

       end
testme (generic function with 2 methods)


The results for double precision:

julia> testme()
MM: Dim= 256. BLAS time = 1.84333e-04. Apple time = 9.31660e-05
LU: Dim= 256. BLAS time = 7.17958e-04. Apple time = 1.82250e-04
MM: Dim= 512. BLAS time = 8.25958e-04. Apple time = 4.33709e-04
LU: Dim= 512. BLAS time = 1.27475e-03. Apple time = 1.05083e-03
MM: Dim= 1024. BLAS time = 6.32771e-03. Apple time = 3.39408e-03
LU: Dim= 1024. BLAS time = 4.43283e-03. Apple time = 3.46121e-03
MM: Dim= 2048. BLAS time = 4.78387e-02. Apple time = 2.98090e-02
LU: Dim= 2048. BLAS time = 2.33295e-02. Apple time = 1.78365e-02
MM: Dim= 4096. BLAS time = 3.94479e-01. Apple time = 2.33508e-01
LU: Dim= 4096. BLAS time = 1.61876e-01. Apple time = 1.50505e-01
MM: Dim= 8192. BLAS time = 3.11572e+00. Apple time = 1.82149e+00
LU: Dim= 8192. BLAS time = 1.17175e+00. Apple time = 2.54976e+00

and for single

julia> testme(Float32)
MM: Dim= 256. BLAS time = 1.34667e-04. Apple time = 2.45840e-05
LU: Dim= 256. BLAS time = 6.53583e-04. Apple time = 1.25709e-04
MM: Dim= 512. BLAS time = 4.42458e-04. Apple time = 1.05875e-04
LU: Dim= 512. BLAS time = 1.26879e-03. Apple time = 5.36500e-04
MM: Dim= 1024. BLAS time = 3.32025e-03. Apple time = 8.74250e-04
LU: Dim= 1024. BLAS time = 3.40737e-03. Apple time = 2.66488e-03
MM: Dim= 2048. BLAS time = 2.44754e-02. Apple time = 9.16629e-03
LU: Dim= 2048. BLAS time = 1.38886e-02. Apple time = 1.42406e-02
MM: Dim= 4096. BLAS time = 1.94998e-01. Apple time = 7.03759e-02
LU: Dim= 4096. BLAS time = 8.70666e-02. Apple time = 8.09671e-02
MM: Dim= 8192. BLAS time = 1.54402e+00. Apple time = 5.09572e-01
LU: Dim= 8192. BLAS time = 6.15488e-01. Apple time = 6.45579e-01

@domschl
Author

domschl commented Feb 1, 2023

So the decision might depend on your application scenario.

For machine learning, the decision would be clear (tested on MacBook Pro M2 Max, Julia head from 2023-01-26):

testme(Float32)
MM: Dim= 256. BLAS time = 5.1542e-5. Apple time = 2.4458e-5 factor=2.107367732439284
MM: Dim= 512. BLAS time = 0.000364542. Apple time = 0.0001065 factor=3.4229295774647883
MM: Dim= 1024. BLAS time = 0.003256166. Apple time = 0.000854958 factor=3.8085683741189627
MM: Dim= 2048. BLAS time = 0.024809042. Apple time = 0.008663458 factor=2.8636419776029385
MM: Dim= 4096. BLAS time = 0.205842959. Apple time = 0.067481875 factor=3.0503443924757576
MM: Dim= 8192. BLAS time = 1.73104175. Apple time = 0.503544333 factor=3.4377146887680294

A pretty consistent 3x speed advantage of Accelerate over OpenBLAS for matrix sizes relevant for machine learning operations.

@chriselrod
Contributor

I'd expect OpenBLAS sgemm to take at least 1.3 seconds with 8 cores for the 8192x8192 matrices:

julia> 2 * 8192^3 / (4*4*2*3.25e9*8)
1.3215283987692308

It requires 2 * 8192^3 FLOPs.
The CPU has 4 execution units with 4 Float32 each, doing 2 FLOPs/fma instruction, running at around 3.25e9 clock cycles/second, and there are 8 cores.

So, 0.615s reported by @ctkelley sounds too fast, and @domschl's 1.73s realistic.

Odd that @domschl's accelerate time was faster (0.5 vs 0.65s).

@ctkelley

ctkelley commented Feb 1, 2023

My 0.615 s was for LU. My MM numbers are pretty close to what @domschl got. So the OpenBLAS LU time is roughly 1/3 of the OpenBLAS MM time, as you would expect. The Apple LU times are hard for me to understand as the dimension grows. For dim = 8192, LU takes more time than MM.

@domschl
Author

domschl commented Mar 12, 2023

This might be an interesting simplification; there is new support for an ILP64 interface:

Release Notes Ventura 13.3 Beta 3

Accelerate

New Features

  • The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. These new interfaces provide additional functionality and a new ILP64 interface. To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers. For ILP64 interfaces, also define ACCELERATE_LAPACK_ILP64. For Swift projects, specify ACCELERATE_NEW_LAPACK=1 and ACCELERATE_LAPACK_ILP64=1 as preprocessor macros in Xcode build settings. (105572917)

@ViralBShah
Member

Nice!

@staticfloat
Member

Okay, I spun up a VM and tried it out. The good news is, many things work! The bad news is, it requires a hack to LBT to use their symbol names, since they don't use a simple suffix on the F77 symbols; they drop the trailing underscore from the symbol name (e.g. dgemm_ -> dgemm$NEWLAPACK$ILP64). I've requested that they keep that trailing underscore (both after dgemm and after ILP64, to maintain compatibility with gfortran compilers, which require a trailing underscore for all symbol names), but we'll see what they say. Another good piece of news is that their LAPACK implementation has been updated from 3.2.x to 3.9.x, so I think we're seeing a significant increase in Accelerate functionality!

I was running inside a VM, so benchmarks are basically useless; all I'll say is that Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x when running peakflops().
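For the curious, the renamed symbols can be inspected directly; a sketch assuming macOS 13.3 or later (on older systems the dlsym lookup simply fails):

using Libdl

handle = dlopen("/System/Library/Frameworks/Accelerate.framework/Accelerate")

# The new interface drops the Fortran trailing underscore and appends a suffix:
# dgemm_  ->  dgemm$NEWLAPACK$ILP64   (the `$` must be escaped inside a Julia string)
ptr = dlsym(handle, "dgemm\$NEWLAPACK\$ILP64")

# `ptr` can then be called via ccall with 64-bit (Int64) integer arguments,
# which is the kind of forwarding libblastrampoline sets up for every routine.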

@ViralBShah
Member

ViralBShah commented Mar 14, 2023

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to openblas (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.

OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.

@staticfloat
Member

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build

Yes, such a switch is actually quite easy to implement; we can even just try loading ILP64 Accelerate, and if it fails, we load OpenBLAS instead. It would go right here: https://github.com/JuliaLang/julia/blob/7ba7e326293bd3eddede81567bbe98078e81f775/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L645. We could also have it set via a Preference, or something like that.
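A sketch of what such a try-then-fall-back might look like (a hypothetical helper, not the actual stdlib code; it assumes an LBT new enough to accept a suffix_hint, and the OpenBLAS path is a placeholder):

using LinearAlgebra

function pick_default_blas(openblas_path::AbstractString)
    accelerate = "/System/Library/Frameworks/Accelerate.framework/Accelerate"
    try
        # The $NEWLAPACK$ILP64 symbols only exist on macOS >= 13.3, so this is
        # expected to fail on older systems.
        BLAS.lbt_forward(accelerate; clear = true, suffix_hint = "\$NEWLAPACK\$ILP64")
    catch
        BLAS.lbt_forward(openblas_path; clear = true)   # fall back to the bundled OpenBLAS
    end
    return BLAS.get_config()
end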

@ctkelley

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to openblas (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.

OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.

I'd like to see how the matrix factorizations do as well. Things were a bit strange (see my post above) with the old version. If everything is 3X faster now, everyone wins.

@cbrnr

cbrnr commented Mar 23, 2023

From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance.

Again, this is only anecdotal evidence, and I am certainly not trying to bash OpenBLAS. However, I think this underscores the importance of being able to use an optimized BLAS on every supported platform. Currently, BLAS-dependent calculations are much slower than they could/should be on Apple Silicon, to the point that (depending on the algorithm, of course) Julia may no longer be the best choice for extremely fast performance.

@ctkelley

Are you seeing this speedup for factorizations (LU, Cholesky, SVD, QR, ...)? I am not, for the older (pre-OS 12.3) version of Accelerate.

@cbrnr

cbrnr commented Mar 23, 2023

To be honest, I don't know which operations are involved in those ICA algorithms, but I'm guessing that SVD is very likely part of it. I am on macOS 13.2.1.

@ctkelley

What's k? If it's the prefactor in the O-term, then do you mean k n^3 + O(n^2)?

@philipturner

It's the coefficient of the n^3 term. At sufficiently large data sizes, the cubic term will dominate. This will give us an accurate estimate of (computational cost, GFLOPS, GFLOPs, ALU %); accurate to the correct order of magnitude. Whether it's precise is a different story.

@ctkelley

ctkelley commented Mar 31, 2023 via email

@philipturner

You could do that yourself with the data in my post.

My bad. Here's the data. For LU decomposition, k = 2/3. So multiply these numbers by 2/3.

GFLOPS/k O64 A64 O32 A32
256 52 85 65 120
512 157 99 175 221
1024 288 262 396 449
2048 409 385 627 651
4096 449 397 829 878
8192 482 94 912 683
16384 501 312 980 382
32768 479 529 980 907

The data varies a lot because only one trial was taken. I usually run several dozen trials in quick succession, then select the one with maximum speed. I also run the benchmark multiple times, each separated by ~20 seconds, to account for varying system conditions.

@chriselrod
Contributor

chriselrod commented Mar 31, 2023

GFLOPS is already scale invariant.
Floating point operations/time.

When the values change as a function of size, you know the algorithm is more/less efficient at those sizes.

@ctkelley

ctkelley commented Mar 31, 2023 via email

@philipturner

philipturner commented Mar 31, 2023

GFLOPS is already scale invariant.
Floating point operations/time.

Good observation. I originally created that metric because I wanted to write an eigendecomposition algorithm using AMX assembly. Using GFLOPS to measure performance would be misleading for comparing different algorithms. For example:

  • An algorithm costing 10n^3 + an^2 + bn reaches 75% ALU utilization, or 1200 GFLOPS
  • An algorithm costing 2n^3 + an^2 + bn reaches 25% ALU utilization, or 400 GFLOPS

Which one is faster? In addition, for some algorithms you don't know the k prefactor, infamously with a failed eigendecomposition benchmark. My GFLOPS was off by an order of magnitude, and I still don't know what k is.

@chriselrod
Contributor

Using GFLOPS to measure performance would be misleading to compare different algorithms.

Good point. As another example, parallel QR needs more FLOPs. Or Strassen vs standard matmul.
An algorithm that needs fewer FLOPs to get the same answer obviously has an advantage.

I'd consistently use time / f(size), where f(size) should be the same across all algorithms, and not the actual gflops that algorithm used.
You could pick the most efficient or naive algorithm, and use its flop formula to define f, e.g. 2M*K*N for matmul, or f(n) = 2n^3 + an^2 + bn for eigendecomposition.

Which is similar to what you'd suggested.

Maybe you could define f with a naive algorithm.
RecursiveFactorization.jl uses this for its lu benchmarks: https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/cbfeaeed45bbeff53b5945e66de5acb55830f4e4/perf/lu.jl#L7-L19
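A tiny illustration of that convention (the formulas are the usual leading-order operation counts, not the FLOPs any particular implementation performs):

# Normalize timings by a fixed, algorithm-independent flop formula.
mm_flops(m, k, n) = 2m * k * n       # matrix multiply C = A*B
lu_flops(n)       = 2n^3 / 3         # LU factorization, leading term

normalized_gflops(t_seconds, flops) = flops / t_seconds / 1e9

# e.g. normalized_gflops(0.233, mm_flops(4096, 4096, 4096))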

@ctkelley

I did another run on an unloaded machine and got the same results. I use @belapsed from BenchmarkTools to get the timings, and that's been very reproducible, at least for me. So the bizarre M2 Pro behavior for Float64 really seems to be there for lu. One thing I just noticed is that the fan on my Mac Mini was running hard during the OpenBLAS computation and not running at all during the Accelerate part.

I'm doing the computation on my M2 MacBook Air as I type this. The OpenBLAS computation ran hot, and the load averages from the w command were in the 6 range (on a machine with 4 BLAS threads). The Accelerate computation has load averages below 2 and seems to be running cooler. If the machine is on my lap, I like the run-cool part.

@ctkelley

Herewith the M2 MacBook Air lu results. They look a bit more consistent than the ones from the M2 Pro.

I will stop doing this now.

Factorization = lu!
        N      O-64      A-64      R-64      O-32      A-32      R-32
      256  3.43e-04  2.01e-04  5.85e-01  2.69e-04  1.44e-04  5.36e-01
      512  1.08e-03  1.38e-03  1.28e+00  9.97e-04  6.17e-04  6.18e-01
     1024  5.72e-03  1.13e-02  1.98e+00  4.25e-03  3.76e-03  8.87e-01
     2048  4.12e-02  7.23e-02  1.76e+00  2.31e-02  4.11e-02  1.78e+00
     4096  3.16e-01  3.93e-01  1.24e+00  1.70e-01  2.40e-01  1.41e+00
     8192  2.54e+00  2.43e+00  9.56e-01  1.32e+00  1.22e+00  9.25e-01
    16384  2.06e+01  1.88e+01  9.11e-01  1.04e+01  7.31e+00  7.00e-01
    32768  1.79e+02  2.08e+02  1.16e+00  9.02e+01  6.39e+01  7.08e-01

@philipturner

philipturner commented Mar 31, 2023

Which is similar to what you'd suggested.

Yours (k/GFLOPS) would help in an effort to predict the execution time of an entire app. For example, density functional theory (DFT) uses several LAPACK functions. You might sum the k/GFLOPS and multiply by (#electrons)^3. With my metric, I wanted to easily map it to actual ALU %, and if I got k wrong, not have to rewrite the raw data. I can also compare ALU % for complex operations to real operations by shifting k_real <-> 0.25k_complex. That brings up another important point - Accelerate really struggles on complex-valued functions.

OpenBLAS is more than 2x slower on the M2, while Accelerate is exactly 2x slower. OpenBLAS could be trying to treat the heterogeneous P and E blocks as homogeneous - major load imbalance. This would decrease the performance gap with Accelerate.

GFLOPS/k, k=2/3 O64 A64 O32 A32
256 49 83 62 117
512 130 97 135 218
1024 188 95 253 286
2048 168 119 372 209
4096 217 175 404 286
8192 216 226 416 451
16384 213 234 423 602
32768 197 169 390 551

@ctkelley

ctkelley commented Mar 31, 2023

Just tested Accelerate on something I'm working on. It's not very good at the triangular solves you do after lu. I'm seeing the same inconsistent scaling, even in Float32. For this project, I need to report timings on triangular solves to make a point about iterative refinement. OpenBLAS is completely predictable, consistent with theory, and does what I need. Accelerate is a bit too flaky for publication work.

None of the problems we're talking about here should be impossible for Apple to fix, especially the complex function part in @philipturner 's post above, if they care.

@philipturner

philipturner commented Mar 31, 2023

None of the problems we're talking about here should be impossible for Apple to fix

Most problems (aside from minor GFLOPS blips) probably are impossible to fix.

if they care.

I wouldn't phrase it like that. Apple engineers know OpenBLAS exists, and it already serves its purpose: high throughput, no matter the energy cost. Nvidia RTX 4090 exists, and it has the highest TFLOPS but zero regard for power efficiency. Apple's AMX hardware was tailored to be an "ML accelerator" (sic) and not a linear algebra accelerator. It performs real-valued convolutions or batched real-FP32 matrix multiplications. Modestly high-throughput real-FP64 was just a cool side effect.

Rather than reinvent the wheel, they gave developers something new - the ability to trade off between sheer performance and power efficiency. That is crucial for more apps than you think. They also did this with the GPU, sacrificing performance for the sake of efficiency.

@yakovbraver

I might be late to the party, but here are some additional results. I used a script based on @ctkelley's script above. Notably, I added ComplexF64 calculations (the "-64Z" columns below).
Tested on Mac mini M2 Pro (12/19), 32 GB RAM, with julia -t auto (giving 8 Julia threads and also 8 BLAS threads).

Dot product

Test of dot(A, B)
       N    O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
      32  1.9e-08  1.2e-08   0.67  1.0e-08  8.8e-09   0.87  7.0e-09  8.8e-09   1.24
      64  4.0e-08  2.0e-08   0.49  1.6e-08  1.1e-08   0.72  7.9e-09  9.9e-09   1.25 
     128  8.4e-08  3.8e-08   0.45  3.5e-08  1.8e-08   0.51  1.3e-08  1.2e-08   0.95 
     256  1.9e-07  7.2e-08   0.37  7.2e-08  3.1e-08   0.44  1.9e-08  1.8e-08   0.96 
     512  4.1e-07  1.5e-07   0.36  1.7e-07  6.1e-08   0.36  3.2e-08  3.2e-08   1.00 
    1024  8.5e-07  2.9e-07   0.35  3.9e-07  1.3e-07   0.34  6.2e-08  6.3e-08   1.02 
    2048  1.7e-06  5.9e-07   0.34  8.3e-07  2.8e-07   0.34  1.3e-07  1.4e-07   1.01 
    4096  3.5e-06  1.2e-06   0.34  1.7e-06  5.7e-07   0.34  2.8e-07  1.7e-07   0.62 
    8192  7.0e-06  2.9e-06   0.41  3.5e-06  1.2e-06   0.34  5.7e-07  2.9e-07   0.50 
   16384  1.4e-05  5.8e-06   0.41  7.0e-06  2.9e-06   0.41  1.2e-06  5.0e-07   0.43 
   32768  2.8e-05  1.1e-05   0.40  1.4e-05  5.7e-06   0.41  2.9e-06  9.4e-07   0.33

Here, Accelerate gives a nice ~2.5x–3x speed-up for ComplexF64 and Float64.

Matrix multiplication

Test of mul!(C, A, B)
       N    O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
      32  1.2e-05  4.1e-06   0.34  1.7e-06  3.4e-07   0.20  8.9e-07  2.3e-07   0.25 
      64  1.2e-05  1.5e-05   1.33  1.3e-05  1.8e-06   0.14  5.8e-06  6.1e-07   0.11 
     128  1.1e-04  7.8e-05   0.71  2.2e-05  1.2e-05   0.55  1.3e-05  3.5e-06   0.27 
     256  4.4e-04  6.1e-04   1.39  1.1e-04  9.0e-05   0.83  5.2e-05  2.4e-05   0.47 
     512  3.8e-03  3.2e-03   0.84  7.3e-04  3.8e-04   0.51  3.6e-04  1.1e-04   0.29 
    1024  2.7e-02  2.2e-02   0.80  5.7e-03  3.1e-03   0.54  2.9e-03  7.6e-04   0.26 
    2048  2.2e-01  1.5e-01   0.69  4.6e-02  3.0e-02   0.66  2.3e-02  8.3e-03   0.35 
    4096  1.8e+00  1.1e+00   0.60  3.6e-01  2.3e-01   0.63  1.9e-01  6.6e-02   0.35 
    8192  1.4e+01  7.8e+00   0.56  2.9e+00  1.8e+00   0.64  1.5e+00  5.0e-01   0.32 

These results are in line with @ctkelley's. Accelerate shines in real floating-point calculations.

Eigenvalues and eigenvectors for complex Hermitian and real symmetric matrices

Test of eigen(A)
       N    O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
      32  8.9e-05  8.6e-05   0.97  7.7e-05  6.6e-05   0.85  7.0e-05  5.3e-05   0.76 
      64  4.9e-04  4.8e-04   0.99  3.3e-04  2.6e-04   0.80  2.9e-04  2.3e-04   0.79 
     128  2.5e-03  3.3e-03   1.33  1.3e-03  1.2e-03   0.90  1.2e-03  9.7e-04   0.83 
     256  1.2e-02  1.4e-02   1.09  6.4e-03  4.8e-03   0.74  6.6e-03  5.0e-03   0.76 
     512  4.5e-02  8.2e-02   1.81  2.7e-02  2.6e-02   0.96  3.3e-02  3.1e-02   0.93 
    1024  2.1e-01  5.3e-01   2.57  1.3e-01  1.5e-01   1.09  1.4e-01  1.4e-01   1.04 
    2048  1.4e+00  6.1e+00   4.47  8.0e-01  1.1e+00   1.39  6.5e-01  7.8e-01   1.20 
    4096  9.6e+00  4.6e+01   4.83  4.7e+00  1.1e+01   2.22  2.0e+01  6.4e+00   0.31 
    8192  6.6e+01  3.7e+02   5.59  3.3e+01  7.9e+01   2.39  1.6e+02  1.4e+02   0.86

Here, Accelerate slows down the calculation.

Parallel matrix multiplication

Finally, I wanted to see how Accelerate would perform in a parallel context. I prepared 800 pairs of matrices and multiplied them in a sequential loop and also in a Threads.@threads loop (see the sketch below). When running the sequential loop I still allowed OpenBLAS parallelism to take place, but for the threaded loop I used BLAS.set_num_threads(1).
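(Roughly, the threaded variant looks like this sketch; the function name is a placeholder, not the exact script.)

using LinearAlgebra, Base.Threads

# Multiply many independent pairs with one BLAS thread per Julia thread,
# so BLAS-internal threading does not oversubscribe the cores.
function threaded_mul!(Cs, As, Bs)
    BLAS.set_num_threads(1)
    @threads for i in eachindex(As)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end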
For N = 32 I obtained the following:

               O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
 Sequential: 1.0e-02  3.7e-03   0.35  1.6e-03  4.8e-04   0.31  7.7e-04  2.2e-04   0.29 
 Parallel:   1.5e-03  8.9e-04   0.61  7.8e-04  1.5e-04   0.19  6.9e-04  7.9e-05   0.12
 Speed-up:       6.7      4.2             2.1      3.2             1.1      2.8

We get a considerable speed-up when parallelising the Accelerate calculations.
And now N = 512:

               O-64Z    A-64Z  R-64Z     O-64     A-64   R-64     O-32     A-32   R-32 
 Sequential: 3.2e+00  2.7e+00   0.85  6.4e-01  5.5e-01   0.86  4.6e-01  1.7e-01   0.36 
 Parallel:   2.8e+00  1.6e+00   0.56  5.7e-01  3.7e-01   0.65  2.8e-01  1.1e-01   0.38
 Speed-up:       1.1      1.7             1.1      1.5             1.6      1.5

Perhaps running all 8 cores to get a 1.5x speed-up is not worth it, but at least it seems that threading Accelerate code does not degrade performance.

All in all, I think it is certainly beneficial to expose Accelerate to the general user.

@philipturner

In my experience, threading significantly degraded performance, at least for SGEMM and DGEMM when harnessing AMX. N=32 was just barely utilizing the AMX on my machine (>50 GFLOPS FP64, <100 GFLOPS FP32). With yours, both FP64 and FP32 would have nonphysical performance if using only single-core NEON.

Matrix Multiplication Performance.xlsx

On the M1 family with old Accelerate, complex multiplications had 2x less ALU utilization than real multiplications. That changed to 1x with M2, new Accelerate.

BLAS Performance.xlsx
https://github.com/danielchalef/openblas-benchmark-m1

ZHEEV and DSYEV, k=unknown. I adjusted k for complex numbers to properly compare ALU % (k -> 0.25k). Accelerate has the same ALU utilization with complex, OpenBLAS has 2x more. I'll need to test whether ZHEEV_2STAGE and DSYEV_2STAGE are faster.

GFLOPS/k_adjusted O64Z A64Z O64 A64
32 1.47 1.52 0.43 0.50
64 2.14 2.18 0.79 1.01
128 3.36 2.54 1.61 1.75
256 5.59 4.79 2.62 3.50
512 11.93 6.55 4.97 5.16
1024 20.45 8.10 8.26 7.16
2048 24.54 5.63 10.74 7.81
4096 28.63 5.98 14.62 6.25
8192 33.32 5.94 16.66 6.96

Sequential ZGEMM and DGEMM, k=2. My machine reached 78 GFLOPS for A64, N=32. Yours reached 109.2, maybe because M2 can issue more AMX instructions/cycle.

GFLOPS/k_adjusted O64Z A64Z O64 A64
32 10.5 28.3 16.4 54.6
512 134.2 159.1 167.8 195.2
GFLOPS O64Z A64Z O64 A64
32 21.0 56.6 32.8 109.2
512 268.4 318.2 335.6 390.4

@ViralBShah
Member

ViralBShah commented Apr 2, 2023

@philipturner Thanks for the in-depth benchmarking. I am doing the dead simple one with peakflops, in which Accelerate gives twice the FLOP rate of OpenBLAS. I suspect the first step here is to get a new version of LBT merged and tagged and included in Julia 1.9.

At that point AppleAccelerate.jl or a similar package can set up BLAS forwarding like we do with MKL.jl. That way, in Julia 1.9 (or 1.9.1), we can get the ability to use Accelerate with just a Pkg.add(). Later on, as we gain more experience, we can decide if we want it to be the default - but looking at your experiments, the evidence for that is unclear.
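For reference, the "dead simple" comparison is just LinearAlgebra.peakflops(), which times a dense Float64 matrix multiply with whichever backend is currently loaded:

using LinearAlgebra

# Run once with the default OpenBLAS, then again after forwarding to Accelerate,
# and compare the returned FLOP rates.
LinearAlgebra.peakflops(4096)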

@staticfloat
Member

Alright, I think the next step is to improve AppleAccelerate.jl to auto-load ILP64 Accelerate when it is available. We may also want to give it the ability to load LAPACK_jll to provide LAPACK and only use Accelerate for the underlying BLAS calls. This simultaneously dodges the Cholesky bug that we identified and lets us see whether vanilla LAPACK_jll provides any kind of performance improvement over Accelerate's LAPACK (doubtful). An even more interesting test would be to load OpenBLAS first, then load only Accelerate's libBLAS.dylib on top of that, so that we use OpenBLAS's threaded LAPACK routines and Accelerate's BLAS routines.
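Sketched out, that layering experiment might look like the following (paths are illustrative; the idea is that a later lbt_forward without clear=true only overrides the symbols the newly loaded library actually provides):

using LinearAlgebra

# 1. Load a full ILP64 BLAS + LAPACK first (e.g. the bundled OpenBLAS or LAPACK_jll).
BLAS.lbt_forward("/path/to/libopenblas64_.dylib"; clear = true)

# 2. Layer only Accelerate's BLAS library on top, keeping the LAPACK routines from step 1.
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/libBLAS.dylib";
                 suffix_hint = "\$NEWLAPACK\$ILP64")

BLAS.get_config()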

@ViralBShah
Member

ViralBShah commented Apr 6, 2023

I think the LAPACK in our OpenBLAS probably links directly to its own BLAS and not to LBT. Some work needs to be done to assemble Accelerate's BLAS, OpenBLAS's accelerated LAPACK routines, and the rest of LAPACK.

The LAPACK in BB, on the other hand, does link to LBT. So I think the recipe would be to load Accelerate first, then LAPACK from BB, and finally manually forward a few LAPACK routines to OpenBLAS.

@staticfloat
Member

With JuliaLinearAlgebra/AppleAccelerate.jl#58 and incorporating this very minor fix to our LinearAlgebra test suite, we actually pass the entire Julia test suite while running on top of Accelerate.

My contact at Apple has asked for any examples of real-world usage that can be shown to work well on Accelerate; given that the LinearAlgebra test suite finishes with a small performance improvement over the OpenBLAS test suite, I'm inclined to think that small-scale LinearAlgebra problems may have a natural advantage here (as that is the majority of our test suite). They are looking for good examples that they can use to bolster the importance of developing good tools for people like us, so don't be shy!

@philipturner

For very small problems, which are too small for OpenBLAS to harness multiple cores, Accelerate should excel. It uses the AMX because it's going to help single-core performance. For example, iPhones have only 2 performance cores so the AMX doesn't have any less vector throughput than the NEON units. Unfortunately, this tactic doesn't scale when the amount of NEON compute increases.

@ViralBShah
Member

ViralBShah commented May 22, 2023

Support is already on AppleAccelerate.jl master, and will be released when JuliaLinearAlgebra/AppleAccelerate.jl#62 merges.

Announcement in https://discourse.julialang.org/t/appleaccelerate-jl-v0-4-0/99351/3
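For end users the flow is then roughly the following (hedged sketch; exact behavior and macOS requirements are described in the AppleAccelerate.jl documentation):

using LinearAlgebra
using AppleAccelerate       # on a sufficiently new macOS this forwards BLAS/LAPACK to Accelerate via LBT

BLAS.get_config()           # should now list Accelerate among the loaded backends
LinearAlgebra.peakflops()   # quick sanity check of the new backend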

@ctkelley

ctkelley commented Sep 26, 2023

Sonoma does better than OS 13.x. Here are some revised lu! numbers

        N      O-64      A-64      R-64      O-32      A-32      R-32
      256  6.87e-04  1.46e-04  2.13e-01  6.66e-04  1.41e-04  2.11e-01
      512  1.22e-03  7.89e-04  6.48e-01  1.26e-03  6.07e-04  4.81e-01
     1024  3.95e-03  3.28e-03  8.32e-01  3.26e-03  2.45e-03  7.51e-01
     2048  2.05e-02  2.07e-02  1.01e+00  1.38e-02  1.55e-02  1.12e+00
     4096  1.49e-01  1.52e-01  1.02e+00  8.37e-02  8.61e-02  1.03e+00
     8192  1.14e+00  1.27e+00  1.11e+00  6.04e-01  6.52e-01  1.08e+00

The discontinuity at higher dimensions seems to be gone. 8 core M2 Pro.

@oscarbg

oscarbg commented Apr 13, 2024

Unrelated to the issue, but can anyone share some M3 perf numbers, e.g. for lu?

@philipturner

Generally, one-sided factorizations like LU should be fast in Accelerate. They can quite easily use block size 32 on a single AMX core, so they reach a good fraction of GEMM GFLOPS.

Two-sided factorizations like tridiagonalization are where Accelerate (and OpenBLAS) are painfully slow. I had to write custom kernels that convert these into panel (QR) factorizations with most computations batched into small GEMM calls, followed by the final bulge chasing, which still needs further optimization at larger block sizes.

Performance on the M3 should be no different from the M1 or M2. The hardware hasn't changed much fundamentally, except for some new instruction-issuing capabilities and lower precisions for AI.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024
@oscarbg

oscarbg commented Nov 27, 2024 via email
