Most performant way of running Llama inference on Mac using ExecuTorch? #8571
-
I'll break the ice with numbers for the MPS delegate. So far, the fastest I am able to get with the MPS delegate running Llama 1B on my M1 Pro is ~54 tokens/sec generation time, using the following export command:
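A sketch of what such an export can look like, based on the ExecuTorch `export_llama` examples rather than on this comment; the flags and paths below are assumptions and may differ across ExecuTorch versions (KV cache on, no `--use_sdpa_with_kv_cache`):

```bash
# Hypothetical sketch of an MPS-delegated Llama 3.2 1B export.
# Flag names follow the ExecuTorch examples/models/llama export flow and may
# differ between ExecuTorch versions; checkpoint/params paths are placeholders.
python -m examples.models.llama.export_llama \
    --model llama3_2 \
    --checkpoint /path/to/Llama-3.2-1B/consolidated.00.pth \
    --params /path/to/Llama-3.2-1B/params.json \
    -kv \
    --disable_dynamic_shape \
    --mps \
    -d fp32 \
    --output_name "llama3_2_1b_mps.pte"
```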
This produces a single MPS subgraph. Notice that in the configuration above I didn't use the `--use_sdpa_with_kv_cache` option. If I do, I get 17 delegate subgraphs and performance drops to ~30 tokens/sec, because each call to sdpa_with_kv_cache causes a graph break. With the configuration above, even though generation runs at ~54 tokens/sec, total inference time is dragged down because prompt evaluation is slow. Here are the stats that ET prints out:
@DenisVieriu97 Is there a faster way of running Llama 1B on Mac with the MPS delegate?
-
can we try
-
For torchao kernels in ET, you can do the following:
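A sketch of what that flow can look like, based on the torchao low-bit options in the ExecuTorch Llama examples rather than on this comment; the qmode syntax, group size, and paths below are assumptions and may differ across ExecuTorch versions:

```bash
# Hypothetical sketch of the torchao low-bit export path; qmode syntax and
# group size are assumptions taken from the ExecuTorch Llama examples.
# The llama runner also needs to be built with the experimental torchao
# kernels enabled for these ops to resolve at runtime.
python -m examples.models.llama.export_llama \
    --model llama3_2 \
    --checkpoint /path/to/Llama-3.2-1B/consolidated.00.pth \
    --params /path/to/Llama-3.2-1B/params.json \
    -kv \
    --use_sdpa_with_kv_cache \
    -qmode "torchao:8da4w" \
    --group_size 256 \
    -d fp32 \
    --output_name "llama3_2_1b_torchao_4bit.pte"
```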
I think I measured around ~100 tokens/sec for Llama 1B on M1 Pro earlier with them. For a model that runs completely on ANE, this is near the best we have so far: #8436. It requires a very different kind of runner than other Llama models, though. I estimate its performance on Llama 1B at:
Decode performance is not great, but still faster than reading speed. Prefill performance is excellent.
-
There are a bunch of possible configurations for running Llama on Mac using ExecuTorch: delegating to XNNPACK, MPS, or CoreML; using sdpa with kv cache or not; using fp16, bf16, or fp32 precision. There are also several quantization options: dynamically quantized activations or not, torchao low-bit kernels (3-bit, 4-bit, etc.).
For each delegate, there are big differences in performance from configuration to configuration.
I want to know, directly from the experts on each delegate, the fastest configurations for running Llama inference on Mac using ExecuTorch.
Let's use Llama 3.2 1B for this exploration. If you already have numbers, share them here. Please post instructions on how to reproduce.
If you have numbers for Llama 3.1 8B, please also post them here.
@metascroy @digantdesai @cccclai @shoumikhin @DenisVieriu97 @kimishpatel