
Cycle counts? #21

Closed · philipturner opened this issue Dec 14, 2022 · 0 comments

Comments

philipturner commented Dec 14, 2022

I reverse-engineered the Apple GPU's ALU and cache sizes. Could this reference be linked in the README?

Thinking out loud (outdated; you don't have to read this)

I've recently been running benchmarks of absolute throughput and found some interesting things. Although the Apple GPU architecture claims to support 16-bit arithmetic, it does 32-bit under the hood: the number of FLOPS in half and single precision was identical. In fact, performance decreased by ~4% in MPSMatrixMultiplication. The scheduler just reads either a 16-bit or 32-bit immediate, then feeds it to the ALU. However, there is no penalty for using 16-bit (or possibly 8-bit, against my expectation) either. In some weird instances, using 16-bit data types increases the chance that the ALU will be fully utilized.
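
For illustration, here's a simplified sketch of what such an FMA-throughput test can look like in Metal Shading Language — not the exact kernels from metal-benchmarks; the names and loop counts are placeholders. Dispatch enough threads to saturate every core, then compare the float and half timings.

```cpp
// Illustrative MSL sketch: two dependent FMA chains, one in float and one in
// half. With enough threads in flight to hide latency, the kernel time is
// bounded by ALU throughput. If FP16 had double-rate execution, the half
// kernel would finish in roughly half the time; on M1-family GPUs both run
// in about the same time.
#include <metal_stdlib>
using namespace metal;

kernel void fma_chain_f32(device float *out [[buffer(0)]],
                          uint tid [[thread_position_in_grid]])
{
    // Two independent accumulators per thread make latency easier to hide.
    float a = float(tid) * 1e-6f, b = a + 1.0f;
    for (ushort i = 0; i < 1024; ++i) {
        a = fma(a, 0.999999f, 1e-7f);
        b = fma(b, 1.000001f, 1e-7f);
    }
    out[tid] = a + b;   // keep the results live so the loop isn't eliminated
}

kernel void fma_chain_f16(device half *out [[buffer(0)]],
                          uint tid [[thread_position_in_grid]])
{
    half a = half(tid & 0xFFu) * 0.001h, b = a + 1.0h;
    for (ushort i = 0; i < 1024; ++i) {
        a = fma(a, 0.999h, 0.001h);
        b = fma(b, 1.001h, 0.001h);
    }
    out[tid] = a + b;
}
```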

Next, Apple seems to have a 32x32+64=64 multiply-accumulate instruction (1/11 IPC per simdgroup, or 33-44 cycles with 3-4x concurrency). I made a shader that runs mulhi and * separately for 32-bit integers. Another version, incorporating madhi along with * and + for the lower bits, ran at the same throughput. However, for 64x64=128 there was a performance drop after incorporating the addition.
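
For context, here is a simplified sketch of the two integer variants (again, not the exact kernels): variant A uses mulhi and * separately, while variant B folds in a 64-bit addend via madhi plus * and +. Carry propagation into the high half is deliberately ignored, since only throughput is being measured.

```cpp
// Illustrative MSL sketch. The xor/feedback keeps the chain data-dependent so
// the compiler can't hoist or constant-fold the multiplies.
#include <metal_stdlib>
using namespace metal;

// Variant A: 64-bit product of two 32-bit values from mulhi and *.
kernel void wide_mul(device uint2 *out [[buffer(0)]],
                     constant uint2 &ab [[buffer(1)]],
                     uint tid [[thread_position_in_grid]])
{
    uint a = ab.x + tid, b = ab.y, hi = 0, lo = 0;
    for (ushort i = 0; i < 256; ++i) {
        hi ^= mulhi(a, b);   // upper 32 bits of a*b
        lo ^= a * b;         // lower 32 bits of a*b
        a += lo;
    }
    out[tid] = uint2(lo, hi);
}

// Variant B: same product, but with a 64-bit addend folded in.
kernel void wide_mad(device uint2 *out [[buffer(0)]],
                     constant uint4 &abc [[buffer(1)]],
                     uint tid [[thread_position_in_grid]])
{
    uint a = abc.x + tid, b = abc.y, hi = 0, lo = 0;
    for (ushort i = 0; i < 256; ++i) {
        hi ^= madhi(a, b, abc.w);   // mulhi(a, b) + high word of the addend
        lo ^= a * b + abc.z;        // low word, carry into hi ignored
        a += lo;
    }
    out[tid] = uint2(lo, hi);
}
```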

I have more instructions to benchmark, and I can reveal the full data later. It will be in metal-benchmarks. Should I add some of this to your repo, or could you hyperlink to my repo after I'm done researching?


Also, about the AMX: using multiple CPU threads* kills performance. The only way I achieved maximum performance was from a single-threaded context (see the sketch below the footnotes). Max FP64 performance was a crazy 700** GFLOPS, FP32 was 2700 GFLOPS, and FP16 was 2400 GFLOPS. BNNS seems not to utilize the 2x throughput ratio of FP16:FP32. Combined with the GPU lacking proper FP16, I see why Apple still pushes for the neural engine in higher-end chips. The GPU is not more powerful than the ANE - I previously thought it was.

*Activity Monitor showed ~190% CPU utilization for xctest, perhaps the two cores necessary to fully utilize the M1 Pro's AMX. There was also a temporary plateau at ~340 GFLOPS FP64 and ~1300 GFLOPS FP32.

**Only achieved for 768x768 matrices, which likely saturate the L2 cache. For larger sizes, it dropped and plateaued at 520 GFLOPS FP64 and 1990 GFLOPS FP32, and the plateau extended into matrices gigabytes in size, which bypass the SLC. By contrast, the GPU was more strongly affected at that scale, dropping from 8400 to 6900 GFLOPS FP32 (32-core M1 Max).
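
For scale, three 768x768 FP64 matrices occupy about 3 × 768² × 8 B ≈ 13.5 MiB, so the whole working set can plausibly stay cache-resident, whereas the multi-gigabyte matrices clearly cannot. The single-threaded measurement sketch mentioned above looks roughly like this — illustrative only, going through Accelerate's cblas_sgemm rather than the BNNS path, with placeholder matrix size and iteration count:

```cpp
// Single-threaded GEMM timing sketch. Accelerate's BLAS routes through the
// AMX on Apple silicon; GFLOPS is computed as 2*N^3 / seconds per call.
// Build with: clang++ -O2 amx_gemm.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 768;   // size quoted in the footnote above
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

    // Warm up once so lazy initialization doesn't pollute the measurement.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, A.data(), N, B.data(), N, 0.0f, C.data(), N);

    const int iters = 50;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0f, A.data(), N, B.data(), N, 0.0f, C.data(), N);
    }
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count() / iters;
    double gflops = 2.0 * N * N * N / seconds / 1e9;
    std::printf("%dx%d SGEMM: %.1f GFLOPS\n", N, N, gflops);
    return 0;
}
```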

Lastly, the GPU and AMX have something in common: for real-world matrix multiplication, both peak at 79-84% of their theoretical GFLOPS.
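
As a back-of-the-envelope check on the GPU side — assuming 128 FP32 ALUs per core and the commonly reported ~1296 MHz clock for the 32-core M1 Max, both assumptions rather than measurements from this thread:

```cpp
// Rough check of the ~79-84% figure for the 32-core M1 Max GPU.
// ASSUMPTIONS: 128 FP32 ALUs per core, 2 FLOPs per FMA, ~1.296 GHz clock.
#include <cstdio>

int main() {
    const double cores = 32, alus = 128, flops_per_fma = 2, clock_ghz = 1.296;
    const double peak_gflops = cores * alus * flops_per_fma * clock_ghz;  // ~10617
    const double measured_gflops = 8400;  // figure quoted in the second footnote above
    std::printf("theoretical %.0f GFLOPS, measured %.0f GFLOPS -> %.0f%%\n",
                peak_gflops, measured_gflops, 100 * measured_gflops / peak_gflops);
    return 0;
}
```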

Edit: Later analysis suggests there may not be a 32x32+64=64 instruction, or it may be divided into two smaller independent instructions. I'd just say it's speculation at this point.
