Most performant way of running Llama inference on Mac using ExecuTorch? #8571
-
I'll break the ice with numbers for the MPS delegate. So far, the fastest I am able to get with the MPS delegate running Llama 1B on my M1 Pro is ~54 tokens/sec generation time, using the following export command:
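A sketch of what such an export can look like, based on the ExecuTorch `export_llama` examples rather than on this comment; the flags and paths below are assumptions and may differ across ExecuTorch versions (KV cache on, no `--use_sdpa_with_kv_cache`):

```bash
# Hypothetical sketch of an MPS-delegated Llama 3.2 1B export.
# Flag names follow the ExecuTorch examples/models/llama export flow and may
# differ between ExecuTorch versions; checkpoint/params paths are placeholders.
python -m examples.models.llama.export_llama \
    --model llama3_2 \
    --checkpoint /path/to/Llama-3.2-1B/consolidated.00.pth \
    --params /path/to/Llama-3.2-1B/params.json \
    -kv \
    --disable_dynamic_shape \
    --mps \
    -d fp32 \
    --output_name "llama3_2_1b_mps.pte"
```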
This produces a single MPS subgraph. Notice that in the configuration above I didn't use the `--use_sdpa_with_kv_cache` option. If I do, I get 17 delegate subgraphs and performance drops to ~30 tokens/sec, because each call to sdpa_with_kv_cache causes a graph break. With the configuration above, even though generation runs at ~54 tokens/sec, total inference time is dragged down because prompt evaluation is slow. Here are the stats that ET prints out:
@DenisVieriu97 Is there a faster way of running Llama 1B on Mac with the MPS delegate?
-
can we try
-
For torchao kernels in ET, you can do the following:
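A sketch of what that flow can look like, based on the torchao low-bit options in the ExecuTorch Llama examples rather than on this comment; the qmode syntax, group size, and paths below are assumptions and may differ across ExecuTorch versions:

```bash
# Hypothetical sketch of the torchao low-bit export path; qmode syntax and
# group size are assumptions taken from the ExecuTorch Llama examples.
# The llama runner also needs to be built with the experimental torchao
# kernels enabled for these ops to resolve at runtime.
python -m examples.models.llama.export_llama \
    --model llama3_2 \
    --checkpoint /path/to/Llama-3.2-1B/consolidated.00.pth \
    --params /path/to/Llama-3.2-1B/params.json \
    -kv \
    --use_sdpa_with_kv_cache \
    -qmode "torchao:8da4w" \
    --group_size 256 \
    -d fp32 \
    --output_name "llama3_2_1b_torchao_4bit.pte"
```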
I think I measured around ~100 tokens/sec for Llama 1B on M1 Pro earlier with them. For a model that runs completely on ANE, this is near the best we have so far: #8436. It requires a very different kind of runner than other Llama models, though. I estimate its performance on Llama 1B at:
Decode performance is not great, but still faster than reading speed. Prefill performance is excellent.
-
There are a bunch of possible configurations for running Llama on Mac using ExecuTorch: delegating to XNNPACK, MPS, or CoreML; using sdpa with kv cache or not; using fp16, bf16, or fp32 precision. There are also several quantization options: dynamically quantized activations or not, torchao low-bit kernels (3-bit, 4-bit, etc.).
For each delegate, there are big differences in performance from configuration to configuration.
I want to know, directly from the experts on each delegate, the fastest configurations for running Llama inference on Mac using ExecuTorch.
Let's use Llama 3.2 1B for this exploration. If you already have numbers, share them here. Please post instructions on how to reproduce.
If you have numbers for Llama 3.1 8B, please also post them here.
@metascroy @digantdesai @cccclai @shoumikhin @DenisVieriu97 @kimishpatel