BLAS support for M1 ARM64 via Apple's Accelerate #869
The plan was to have LBT as a way to pick a different BLAS than the default OpenBLAS for now. That requires you to load a package every time you start Julia to change the default. Eventually, once the Preferences mechanism becomes standard, we want to use that so that users can pick a different BLAS by default. I don't think we want to depend on the Apple provided BLAS by default for the M1 for now. |
Some (anecdotal) benchmark scenarios that might illustrate why Accelerate makes sense, at least for certain workloads. The following table compares a benchmark run with native eigen3 on ARM64 against a second run using Apple's Accelerate within eigen.
All that with very low energy usage!
|
I think @chriselrod has played around with this a little bit. |
The first thing is to create an MKL.jl-like package for Apple Accelerate. We already have some support in LBT for Accelerate, so this should be fairly quick. |
A difficulty is that Accelerate uses the 32-bit API, which is why AppleAccelerateLinAlgWrapper manually defines the methods it uses (and is based on Elliot's code). (Also, AppleAccelerateLinAlgWrapper has a deliberately cumbersome name to avoid stealing/squatting on potentially valuable names for a future package that supersedes it.) |
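For illustration, here is a minimal sketch of what calling Accelerate's classic LP64 (32-bit integer) BLAS directly from Julia can look like, in the spirit of AppleAccelerateLinAlgWrapper but not the package's actual code; the framework path, the `dgemm_` symbol name, and the trailing hidden character-length arguments are assumptions:

```julia
# Sketch only: ccall Accelerate's LP64 dgemm_ with Int32 dimensions,
# bypassing Julia's default ILP64 BLAS. Path and symbol name are assumptions.
const ACCELERATE = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

function accelerate_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A); k2, n = size(B)
    @assert k == k2 && size(C) == (m, n)
    ccall((:dgemm_, ACCELERATE), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                       # transA, transB
           Ref{Int32}, Ref{Int32}, Ref{Int32},           # m, n, k as 32-bit ints (LP64)
           Ref{Float64}, Ptr{Float64}, Ref{Int32},       # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                     # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32},       # beta, C, ldc
           Clong, Clong),                                # hidden Fortran character lengths
          'N', 'N', Int32(m), Int32(n), Int32(k),
          1.0, A, Int32(m), B, Int32(k), 0.0, C, Int32(m), 1, 1)
    return C
end
```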
Since Accelerate only has an LP64 BLAS and Julia wants ILP64, this is quite difficult, unless we can somehow make this choice dynamic as discussed in #891. It should be possible to have a separate wrapper like @chriselrod discusses above that packages can directly invoke, but swapping it in as the default BLAS in Julia is fairly non-trivial. |
I'm no expert, but wouldn't it be quite easy to write some kind of wrapper libblas which just redirects level-3 BLAS calls to the Apple Accelerate BLAS and all other calls to OpenBLAS? I mean, ILP64 does not really play a role for level-3 BLAS imho anyway. On the other hand, level-3 BLAS routines are probably the only routines which benefit from Apple's AMX extension… |
Yes, we could have a wrapper library that redirects all 64-bit ILP64 calls to a 32-bit BLAS. It seems like it would be easier to have Apple just provide ILP64 support with mangled names; Intel is doing that in MKL now. Do we have a way to ask Apple to do this? @Keno @staticfloat Perhaps one of you has a contact at Apple? |
We can ask. |
Nice that you guys have the appropriate contacts ;-) However, from what I heard in other discussions, Apple currently seems to assign very few resources to their BLAS/LAPACK development. So, I wouldn't bet on them... Nevertheless, I keep my fingers crossed ^^ |
Please do file the pointer to Apple AMX kernels in an issue on the openblas github repo. Yes, it would be great for openblas to have those kernels. |
I tried my best and opened a new issue there (see OpenMathLib/OpenBLAS#3789). Let's see what they think about that. |
Might be relevant: mlpack/mlpack#3308 |
I just got a shiny new Mac Mini with an M2 Pro, so I thought I'd see how Apple Accelerate scaled. I timed gemm and lu with both OpenBLAS and Accelerate. It seems that Accelerate's advantage declines as the problem size increases. This is worse with lu than gemm. It's also interesting, at least to me, that Accelerate does so well for the single precision matrix multiply. This is far from a definitive analysis, but it makes me nervous about swapping OpenBLAS for Accelerate. I ran this on 1.9-beta3.
The results for double precision:
and for single
|
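For reference, a rough sketch of this kind of timing script (not the actual script; the matrix sizes and the use of BenchmarkTools are assumptions):

```julia
# Rough benchmarking sketch (assumed details: matrix sizes, BenchmarkTools usage).
# Run once on stock OpenBLAS, then again after switching the BLAS backend, and compare.
using LinearAlgebra, BenchmarkTools

for T in (Float64, Float32), n in (512, 1024, 2048, 4096, 8192)
    A = rand(T, n, n); B = rand(T, n, n); C = similar(A)
    t_mm = @belapsed mul!($C, $A, $B)                  # gemm
    t_lu = @belapsed lu!(F) setup = (F = copy($A))     # lu on a fresh copy each sample
    println("$T n=$n  gemm: $(round(t_mm; sigdigits=3)) s  lu: $(round(t_lu; sigdigits=3)) s")
end
```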
So the decision might depend on your application scenario. For machine learning, the decision would be clear (tested on MacBook Pro M2 Max, Julia head from 2023-01-26):
A pretty consistent 3x speed advantage of Accelerate over OpenBLAS for matrix sizes relevant for machine learning operations. |
I'd expect OpenBLAS sgemm to take at least 1.3 seconds with 8 cores for the 8192x8192 matrices:
julia> 2 * 8192^3 / (4*4*2*3.25e9*8)
1.3215283987692308
It requires 2 * 8192^3 FLOPs. So, the 0.615 s reported by @ctkelley sounds too fast, and @domschl's 1.73 s realistic. Odd that @domschl's Accelerate time was faster (0.5 vs 0.65 s). |
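Spelling out that back-of-the-envelope estimate (the per-core numbers below are assumptions about the NEON configuration and clock, not vendor-published specs):

```julia
# Unpacking the estimate above: minimum sgemm time at full NEON throughput.
function min_sgemm_time(n; lanes = 4,      # FP32 lanes per 128-bit NEON vector
                           fma_units = 4,  # FMA pipes per performance core (assumed)
                           ghz = 3.25,     # clock in GHz (assumed)
                           cores = 8)
    flops = 2 * n^3                         # one multiply + one add per inner-product term
    peak  = lanes * fma_units * 2 * ghz * 1e9 * cores
    return flops / peak                     # seconds
end

min_sgemm_time(8192)   # ≈ 1.32 s
```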
My 0.615 s was for LU. My MM numbers are pretty close to what @domschl got. So the OpenBLAS LU time is roughly 1/3 of the OpenBLAS MM time, as you would expect. The Apple LU times are hard for me to understand as the dimension grows. For dim = 8192, LU takes more time than MM. |
Might be an interesting simplification: new support for the ILP64 interface. From the Release Notes for macOS Ventura 13.3 Beta 3 (Accelerate, New Features):
|
Nice! |
Okay, I spun up a VM and tried it out. The good news is, many things work! The bad news is, it requires a hack to LBT to use their symbol names, since they don't use a simple suffix on the F77 symbols; they drop the trailing underscore from the symbol name. I was running inside of a VM, so benchmarks are basically useless; all I'll say is that Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x on the simple benchmark I ran. |
I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to OpenBLAS (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts. OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up. |
Yes, such a switch is actually quite easy to implement; we can even just try loading ILP64 Accelerate, and if it fails, we load OpenBLAS instead. It would go right here: https://github.com/JuliaLang/julia/blob/7ba7e326293bd3eddede81567bbe98078e81f775/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L645. We could also have it set via a Preference, or something like that. |
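A minimal sketch of that fallback idea, assuming the Accelerate framework path below and that `BLAS.lbt_forward` reports how many symbols it forwarded (this is not the actual LinearAlgebra `__init__` code):

```julia
# Sketch: try to forward LBT to Accelerate, fall back to OpenBLAS otherwise.
using LinearAlgebra, OpenBLAS_jll

const ACCELERATE = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

function pick_default_blas()
    forwarded = try
        BLAS.lbt_forward(ACCELERATE; clear = true)   # assumed to return a symbol count
    catch
        0
    end
    if forwarded == 0
        # No usable (ILP64) Accelerate on this system: restore OpenBLAS.
        BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear = true)
    end
    return BLAS.get_config()   # show which backend ended up active
end
```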
I'd like to see how the matrix factorizations do as well. Things were a bit strange (see my post above) with the old version. If everything is 3X faster now, everyone wins. |
From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance. Again, this is only anecdotal evidence, and I am certainly not trying to bash OpenBLAS. However, I think this might underscore the importance of being able to use an optimized BLAS on every supported platform. Currently, this means that BLAS-dependent calculations are much slower than they could/should be on Apple Silicon, to the point that (depending on the algorithm, of course) Julia might not be the best choice for extremely fast performance anymore. |
Are you seeing this speedup for factorizations (LU, Cholesky, SVD, QR, ...)? I am not with the older (pre OS 12.3) version of Accelerate. |
To be honest, I don't know which operations are involved in those ICA algorithms, but I'm guessing that SVD is very likely part of it. I am on macOS 13.2.1. |
What's k? If it's the prefactor in the O-term, then do you mean k n^3 + O(n^2)? |
It's the coefficient of the n^3 term. At sufficiently large data sizes, the cubic term will dominate. This will give us an accurate estimate of (computational cost, GFLOPS, GFLOPs, ALU %); accurate to the correct order of magnitude. Whether it's precise is a different story. |
So you meant k n^3 + O(n^2). So it is sufficient to look for jumps in n^3/time. You could do that yourself with the data in my post. I will give it a shot and post some results at some point.
|
My bad. Here's the data. For LU decomposition, k = 2/3. So multiply these numbers by 2/3.
GFLOPS/k   O64   A64   O32   A32
256         52    85    65   120
512        157    99   175   221
1024       288   262   396   449
2048       409   385   627   651
4096       449   397   829   878
8192       482    94   912   683
16384      501   312   980   382
32768      479   529   980   907
The data varies a lot because only one trial was taken. I usually run several dozen trials in quick succession, then select the one with maximum speed. I also run the benchmark multiple times, each separated by ~20 seconds, to account for varying system conditions. |
GFLOPS is already scale invariant. When the values change as a function of size, you know the algorithm is more/less efficient at those sizes. |
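In code, the metric under discussion is just a normalization of the timing (the timing used in the example below is made up, not taken from the tables in this thread):

```julia
# Small helpers for the metric being discussed (illustrative only).
gflops_per_k(n, t_seconds) = n^3 / (t_seconds * 1e9)     # scale-invariant "GFLOPS/k"
gflops(n, t_seconds; k) = k * gflops_per_k(n, t_seconds) # actual GFLOPS once k is known

# e.g. an LU factorization (k = 2/3) of a 4096x4096 matrix taking a hypothetical 0.1 s:
gflops_per_k(4096, 0.1)        # ≈ 687
gflops(4096, 0.1; k = 2/3)     # ≈ 458
```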
I ran this a few times with similar results. The only real strange one is the Float64 result from Accelerate. You can (or at least I think I can) see all 8 cores kick in on OpenBLAS at about the 4096 mark. I will try a couple more with my browser shut off, but that should not affect the AMX unit.
What I wanted was the relative timings so I could figure out what LBT+Accelerate would do for me. I think it'll be good for me when I travel with my laptop and want to conserve battery life. On the desktop, it's not as clear. I do a fair amount of single precision right now, and that looks pretty decent. Not much of what I do is matrix-matrix product intensive.
|
Good observation. I originally created that metric because I wanted to make an eigendecomposition algorithm using AMX assembly. Using GFLOPS to measure performance would be misleading when comparing different algorithms; for example, two algorithms can report very different GFLOPS for the same task. Which one is faster? In addition, for some algorithms, you don't know the exact FLOP count. |
Good point. As another example, parallel QR needs more FLOPs, as does Strassen vs. standard matmul. I'd consistently use a FLOP-normalized metric, which is similar to what you'd suggested. Maybe you could define it that way. |
I did another run on an unloaded machine and got the same results. I use @belapsed from BenchmarkTools to get the timings, and that's been very reproducible, at least for me. So the bizarre M2 Pro behavior for Float64 seems to be really there for lu. One thing I just noticed is that the fan on my Mac Mini was running hard during the OpenBLAS computation and not running at all during the Accelerate part. I'm doing the computation on my M2 MacBook Air as I type this. The OpenBLAS computation ran hot, and the load averages were high. |
Herewith the M2 MacBook Air lu results. They look a bit more consistent. I will stop doing this now.
|
Yours (k/GFLOPS) would help in an effort to predict the execution time of an entire app. For example, density functional theory (DFT) uses several LAPACK functions. You might sum the k/GFLOPS and multiply by (#electrons)^3. With my metric, I wanted to easily map it to actual ALU %. OpenBLAS is more than 2x slower on the M2, while Accelerate is exactly 2x slower. OpenBLAS could be trying to treat the heterogeneous P and E blocks as homogeneous, causing a major load imbalance. This would decrease the performance gap with Accelerate.
|
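A tiny illustration of that kind of whole-app estimate (the routines, k values, and GFLOPS figures below are placeholders, not measurements from this thread):

```julia
# t ≈ Σᵢ (kᵢ / GFLOPSᵢ) · n³, with n the problem size (e.g. number of electrons in DFT).
predict_time(n, routines) = sum(k / (gflops * 1e9) for (k, gflops) in routines) * n^3

routines = [(2/3, 400.0),   # e.g. an LU-like step: k = 2/3 at 400 GFLOPS (made up)
            (4/3, 250.0)]   # e.g. a QR-like step:  k = 4/3 at 250 GFLOPS (made up)
predict_time(4096, routines)   # seconds, order-of-magnitude estimate
```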
Just tested Accelerate on something I'm working on. It's not very good at the triangular solves you do after lu. I'm seeing the same inconsistent scaling, even in Float32. For this project, I need to report timings on triangular solves to make a point about iterative refinement. Open BLAS is completely predictable, consistent with theory, and does what I need. Accelerate is a bit too flaky for publication work. None of the problems we're talking about here should be impossible for Apple to fix, especially the complex function part in @philipturner 's post above, if they care. |
Most problems (aside from minor GFLOPS blips) probably are impossible to fix.
I wouldn't phrase it like that. Apple engineers know OpenBLAS exists, and it already serves its purpose: high throughput, no matter the energy cost. Nvidia RTX 4090 exists, and it has the highest TFLOPS but zero regard for power efficiency. Apple's AMX hardware was tailored to be an "ML accelerator" (sic) and not a linear algebra accelerator. It performs real-valued convolutions or batched real-FP32 matrix multiplications. Modestly high-throughput real-FP64 was just a cool side effect. Rather than reinvent the wheel, they gave developers something new - the ability to trade off between sheer performance and power efficiency. That is crucial for more apps than you think. They also did this with the GPU, sacrificing performance for the sake of efficiency. |
I might be late to the party, but here are some additional results. I used a script based on @ctkelley 's script above. Notably, I added ComplexF64 calculations ("-64Z" columns below).
Dot product
Here, Accelerate gives a nice ~2.5x to 3x speed-up for ComplexF64 and Float64.
Matrix multiplication
These results are in line with @ctkelley 's. Accelerate shines in real Float calculations.
Eigenvalues and eigenvectors for complex Hermitian and real symmetric matrices
Here, Accelerate slows down the calculation.
Parallel matrix multiplication
Finally, I wanted to see how Accelerate would perform in a parallel context. I prepared 800 pairs of matrices and multiplied them in a sequential loop and also in a threaded loop.
We get a considerable speed-up when parallelising the Accelerate calculations.
Perhaps running all 8 cores to get a 1.5x speed-up is not worth it, but at least it seems that threading Accelerate code does not degrade performance. All in all, I think it is certainly beneficial to expose Accelerate to the general user. |
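A sketch of that sequential-versus-threaded comparison, assuming 512x512 matrices and `Threads.@threads` with BLAS pinned to one thread (these details are not stated in the comment above):

```julia
# Sketch of the parallel multiplication experiment (sizes and threading setup assumed).
# Start Julia with e.g. `julia -t 8` so Threads.@threads has worker threads.
using LinearAlgebra
BLAS.set_num_threads(1)   # avoid oversubscription when threading at the Julia level

mats = [(rand(512, 512), rand(512, 512)) for _ in 1:800]
out  = Vector{Matrix{Float64}}(undef, length(mats))

# Sequential loop
@time for i in eachindex(mats)
    A, B = mats[i]
    out[i] = A * B
end

# Threaded loop: each iteration works on an independent pair
@time Threads.@threads for i in eachindex(mats)
    A, B = mats[i]
    out[i] = A * B
end
```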
In my experience, threading significantly degraded performance, at least for SGEMM and DGEMM when harnessing the AMX. N=32 was just barely utilizing the AMX on my machine (>50 GFLOPS FP64, <100 GFLOPS FP32). With yours, both FP64 and FP32 would have nonphysical performance if using only single-core NEON.
Matrix Multiplication Performance.xlsx
On the M1 family with the old Accelerate, complex multiplications had 2x less ALU utilization than real multiplications. That changed to 1x with the M2 and the new Accelerate.
BLAS Performance.xlsx
ZHEEV and DSYEV, k=unknown. I adjusted k for complex numbers to properly compare ALU % (k -> 0.25k). Accelerate has the same ALU utilization with complex; OpenBLAS has 2x more. I'll need to test whether ZHEEV_2STAGE and DSYEV_2STAGE are faster.
Sequential ZGEMM and DGEMM, k=2. My machine reached 78 GFLOPS for A64, N=32. Yours reached 109.2, maybe because the M2 can issue more AMX instructions per cycle.
|
@philipturner Thanks for the in-depth benchmarking. I am doing the dead simple one. At that point, AppleAccelerate.jl or a similar package can set up BLAS forwarding like we do with MKL.jl. That way, in Julia 1.9 (or 1.9.1), we can get the ability to use Accelerate with just a package load. |
Alright, I think the next step is to improve AppleAccelerate.jl to auto-load ILP64 Accelerate when it is available. We may also want to give it the ability to load LAPACK_jll to provide LAPACK and only use Accelerate for the underlying BLAS calls. This simultaneously dodges the Cholesky bug that we identified, and gives us the ability to see whether vanilla LAPACK_jll provides any kind of performance improvement over Accelerate's LAPACK (doubtful). An even more interesting test would be to load OpenBLAS first, then load only Accelerate's BLAS symbols on top. |
I think the LAPACK in our OpenBLAS probably links directly to its own BLAS and not to LBT. I think some work needs to be done to assemble the Accelerate BLAS, the OpenBLAS-accelerated LAPACK routines, and the rest of LAPACK. The LAPACK in BB, on the other hand, does link to LBT. So I think the recipe would be to load Accelerate first, then LAPACK from BB, and finally manually forward a few LAPACK routines to OpenBLAS. |
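A hedged sketch of that layering recipe using LBT's forwarding from Julia; the framework path and the JLL product names are assumptions, the ILP64/LP64 interface details are glossed over, and per-routine forwarding is only gestured at in a comment:

```julia
# Sketch: layer a reference LAPACK on top of the Accelerate BLAS, per the recipe above.
using LinearAlgebra
import LAPACK_jll   # assumed JLL name/product

const ACCELERATE = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

# 1. Accelerate first: claims the BLAS (and its own LAPACK) symbols.
BLAS.lbt_forward(ACCELERATE; clear = true)
# 2. Reference LAPACK from BinaryBuilder on top; clear=false keeps Accelerate's
#    forwards and only overrides the symbols this library provides.
BLAS.lbt_forward(LAPACK_jll.liblapack_path; clear = false)
# 3. Forwarding just a handful of hot LAPACK routines back to OpenBLAS's threaded
#    versions would need finer-grained control (e.g. LBT's C-level API), since
#    lbt_forward operates on whole libraries.
BLAS.get_config()   # inspect the resulting stack of forwards
```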
With JuliaLinearAlgebra/AppleAccelerate.jl#58 and incorporating this very minor fix to our LinearAlgebra test suite, we actually pass the entire Julia test suite while running on top of Accelerate. My contact at Apple has asked for any examples of real-world usage that can be shown to work well on Accelerate; given that the LinearAlgebra test suite finishes with a small performance improvement over the OpenBLAS test suite, I'm inclined to think that small-scale LinearAlgebra problems may have a natural advantage here (as that is the majority of our test suite). They are looking for good examples that they can use to bolster the importance of developing good tools for people like us, so don't be shy! |
For very small problems, which are too small for OpenBLAS to harness multiple cores, Accelerate should excel. It uses the AMX because it's going to help single-core performance. For example, iPhones have only 2 performance cores so the AMX doesn't have any less vector throughput than the NEON units. Unfortunately, this tactic doesn't scale when the amount of NEON compute increases. |
Support is already on AppleAccelerate.jl master, and will be released when JuliaLinearAlgebra/AppleAccelerate.jl#62 merges. Announcement in https://discourse.julialang.org/t/appleaccelerate-jl-v0-4-0/99351/3 |
Sonoma does better than OS 13.x. Here are some revised lu! numbers
The discontinuity at higher dimensions seems to be gone. 8 core M2 Pro. |
Unrelated to the issue, but can anyone share some M3 perf numbers, for lu for example? |
Generally, one-sided factorizations like LU should be fast in Accelerate. They can use block size 32 on the AMX single-core quite easily, so a good fraction of GEMM GFLOPS. Two-sided factorizations like tridiagonalization are where Accelerate (and OpenBLAS) are painfully slow. I had to write custom kernels that convert these into panel (QR) factorizations with most computations batched into small GEMM calls, followed by the final bulge chasing, which still needs further optimization at larger block sizes. Perf on M3 should be no different from M1 or M2. The hardware hasn't changed much fundamentally, except for some new instruction-issuing capabilities and lower precisions for AI stuff. |
julia> peakflops(4096)
1.97967547574082e11

julia> using AppleAccelerate

julia> peakflops(4096)
3.979200628037524e11
|
The default BLAS Julia uses is OpenBLAS. Apple's M1 has proprietary dedicated matrix hardware that is only accessible via Apple's Accelerate BLAS implementation. That proprietary interface can provide 2x to 4x speedups for some linear algebra use cases (see https://discourse.julialang.org/t/does-mac-m1-in-multithreads-is-slower-that-in-single-thread/61114/12?u=kristoffer.carlsson for some benchmarks and discussion.)
Since Julia 1.7 there's a BLAS multiplexer:
- https://github.com/staticfloat/libblastrampoline
(this currently -- as far as I understood it -- still requires dedicated code for each BLAS, and for M1 there's so far only a minimal shim via AppleAccelerateLinAlgWrapper.jl: https://github.com/chriselrod/AppleAccelerateLinAlgWrapper.jl)
So in theory, it should be possible to extend this so that depending on a given platform either OpenBLAS or other BLAS solutions are used transparently by default.
So this issue discusses what needs to be done to make Apple's Accelerate, and its access to M1 hardware acceleration, available by default in Julia.