
Cycle counts? #21

Closed · philipturner opened this issue Dec 14, 2022 · 0 comments

Comments

philipturner commented Dec 14, 2022

I reverse-engineered the Apple GPU's ALU and cache sizes. Could this reference be linked in the README?

Thinking out loud (outdated; you don't have to read this)

I've recently been running benchmarks of absolute throughput and found some interesting things. Although the Apple GPU architecture claims to support 16-bit arithmetic, it does 32-bit under the hood: the number of FLOPS in half and single precision was identical. In fact, performance decreased by ~4% in MPSMatrixMultiplication. The scheduler just reads either a 16-bit or 32-bit immediate, then feeds it to the ALU. However, there is no penalty for using 16-bit (or possibly 8-bit, against my expectation) either. In some weird instances, using 16-bit data types increases the chance that the ALU will be fully utilized.
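
For illustration, here's a simplified sketch of what such an FMA-throughput test can look like in Metal Shading Language — not the exact kernels from metal-benchmarks; the names and loop counts are placeholders. Dispatch enough threads to saturate every core, then compare the float and half timings.

```cpp
// Illustrative MSL sketch: two dependent FMA chains, one in float and one in
// half. With enough threads in flight to hide latency, the kernel time is
// bounded by ALU throughput. If FP16 had double-rate execution, the half
// kernel would finish in roughly half the time; on M1-family GPUs both run
// in about the same time.
#include <metal_stdlib>
using namespace metal;

kernel void fma_chain_f32(device float *out [[buffer(0)]],
                          uint tid [[thread_position_in_grid]])
{
    // Two independent accumulators per thread make latency easier to hide.
    float a = float(tid) * 1e-6f, b = a + 1.0f;
    for (ushort i = 0; i < 1024; ++i) {
        a = fma(a, 0.999999f, 1e-7f);
        b = fma(b, 1.000001f, 1e-7f);
    }
    out[tid] = a + b;   // keep the results live so the loop isn't eliminated
}

kernel void fma_chain_f16(device half *out [[buffer(0)]],
                          uint tid [[thread_position_in_grid]])
{
    half a = half(tid & 0xFFu) * 0.001h, b = a + 1.0h;
    for (ushort i = 0; i < 1024; ++i) {
        a = fma(a, 0.999h, 0.001h);
        b = fma(b, 1.001h, 0.001h);
    }
    out[tid] = a + b;
}
```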

Next, Apple seems to have a 32x32+64=64 multiply-accumulate instruction (1/11 IPC per simdgroup, or 33-44 cycles with 3-4x concurrency). I made a shader that runs mulhi and * separately for 32-bit integers. Another version, incorporating madhi along with * and + for the lower bits, ran at the same throughput. However, for 64x64=128 there was a performance drop after incorporating the addition.
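
For context, here is a simplified sketch of the two integer variants (again, not the exact kernels): variant A uses mulhi and * separately, while variant B folds in a 64-bit addend via madhi plus * and +. Carry propagation into the high half is deliberately ignored, since only throughput is being measured.

```cpp
// Illustrative MSL sketch. The xor/feedback keeps the chain data-dependent so
// the compiler can't hoist or constant-fold the multiplies.
#include <metal_stdlib>
using namespace metal;

// Variant A: 64-bit product of two 32-bit values from mulhi and *.
kernel void wide_mul(device uint2 *out [[buffer(0)]],
                     constant uint2 &ab [[buffer(1)]],
                     uint tid [[thread_position_in_grid]])
{
    uint a = ab.x + tid, b = ab.y, hi = 0, lo = 0;
    for (ushort i = 0; i < 256; ++i) {
        hi ^= mulhi(a, b);   // upper 32 bits of a*b
        lo ^= a * b;         // lower 32 bits of a*b
        a += lo;
    }
    out[tid] = uint2(lo, hi);
}

// Variant B: same product, but with a 64-bit addend folded in.
kernel void wide_mad(device uint2 *out [[buffer(0)]],
                     constant uint4 &abc [[buffer(1)]],
                     uint tid [[thread_position_in_grid]])
{
    uint a = abc.x + tid, b = abc.y, hi = 0, lo = 0;
    for (ushort i = 0; i < 256; ++i) {
        hi ^= madhi(a, b, abc.w);   // mulhi(a, b) + high word of the addend
        lo ^= a * b + abc.z;        // low word, carry into hi ignored
        a += lo;
    }
    out[tid] = uint2(lo, hi);
}
```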

I have more instructions to benchmark, and I can reveal the full data later. It will be in metal-benchmarks. Should I add some of this to your repo, or could you hyperlink to my repo after I'm done researching?


Also, about the AMX: using multiple CPU threads* kills performance. The only way I achieved maximum performance was from a single-threaded context (see the sketch below the footnotes). Max FP64 performance was a crazy 700** GFLOPS, FP32 was 2700 GFLOPS, and FP16 was 2400 GFLOPS. BNNS seems not to utilize the 2x throughput ratio of FP16:FP32. Combined with the GPU lacking proper FP16, I see why Apple still pushes for the neural engine in higher-end chips. The GPU is not more powerful than the ANE - I previously thought it was.

*Activity Monitor showed ~190% CPU utilization for xctest, perhaps the two cores necessary to fully utilize the M1 Pro's AMX. There was also a temporary plateau at ~340 GFLOPS FP64 and ~1300 GFLOPS FP32.

**Only achieved for 768x768 matrices, which likely saturate the L2 cache. For larger sizes, it dropped and plateaued at 520 GFLOPS FP64 and 1990 GFLOPS FP32, and the plateau extended into matrices gigabytes in size, which bypass the SLC. By contrast, the GPU was more strongly affected at that scale, dropping from 8400 to 6900 GFLOPS FP32 (32-core M1 Max).
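
For scale, three 768x768 FP64 matrices occupy about 3 × 768² × 8 B ≈ 13.5 MiB, so the whole working set can plausibly stay cache-resident, whereas the multi-gigabyte matrices clearly cannot. The single-threaded measurement sketch mentioned above looks roughly like this — illustrative only, going through Accelerate's cblas_sgemm rather than the BNNS path, with placeholder matrix size and iteration count:

```cpp
// Single-threaded GEMM timing sketch. Accelerate's BLAS routes through the
// AMX on Apple silicon; GFLOPS is computed as 2*N^3 / seconds per call.
// Build with: clang++ -O2 amx_gemm.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 768;   // size quoted in the footnote above
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

    // Warm up once so lazy initialization doesn't pollute the measurement.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, A.data(), N, B.data(), N, 0.0f, C.data(), N);

    const int iters = 50;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0f, A.data(), N, B.data(), N, 0.0f, C.data(), N);
    }
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count() / iters;
    double gflops = 2.0 * N * N * N / seconds / 1e9;
    std::printf("%dx%d SGEMM: %.1f GFLOPS\n", N, N, gflops);
    return 0;
}
```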

Lastly, the GPU and AMX have something in common: for real-world matrix multiplication, both peak at 79-84% of their theoretical GFLOPS.
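
As a back-of-the-envelope check on the GPU side — assuming 128 FP32 ALUs per core and the commonly reported ~1296 MHz clock for the 32-core M1 Max, both assumptions rather than measurements from this thread:

```cpp
// Rough check of the ~79-84% figure for the 32-core M1 Max GPU.
// ASSUMPTIONS: 128 FP32 ALUs per core, 2 FLOPs per FMA, ~1.296 GHz clock.
#include <cstdio>

int main() {
    const double cores = 32, alus = 128, flops_per_fma = 2, clock_ghz = 1.296;
    const double peak_gflops = cores * alus * flops_per_fma * clock_ghz;  // ~10617
    const double measured_gflops = 8400;  // figure quoted in the second footnote above
    std::printf("theoretical %.0f GFLOPS, measured %.0f GFLOPS -> %.0f%%\n",
                peak_gflops, measured_gflops, 100 * measured_gflops / peak_gflops);
    return 0;
}
```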

Edit: Later analysis suggests there may not be a 32x32+64=64 instruction, or it may be divided into two smaller independent instructions. I'd just say it's speculation at this point.
