# Run Llama 3.2 3B Instruct on Android (with Qualcomm AI Engine Direct Backend)
This tutorial demonstrates how to export and run the Llama 3.2 3B Instruct model on a Qualcomm device using the Qualcomm AI Engine Direct Backend via ExecuTorch.
We use a static Llama [implementation](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/model/static_llama.py) to optimize performance and memory usage during on-device inference.
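
In a static implementation, buffers such as the KV cache are pre-allocated with fixed shapes, so the exported graph contains no dynamic shapes. The snippet below is only a conceptual sketch of that idea (the tensor names and sizes are made-up illustrations), not the actual `static_llama.py` code:

```python
import torch

# Hypothetical fixed shapes; a static decoder commits to these at export time.
MAX_SEQ_LEN, N_HEADS, HEAD_DIM = 1024, 24, 128

k_cache = torch.zeros(1, N_HEADS, MAX_SEQ_LEN, HEAD_DIM)
v_cache = torch.zeros(1, N_HEADS, MAX_SEQ_LEN, HEAD_DIM)

def update_kv_cache(k_cache, v_cache, k_new, v_new, pos):
    # Write the current token's key/value at a fixed position instead of concatenating,
    # so tensor shapes never change from one decoding step to the next.
    k_cache[:, :, pos] = k_new
    v_cache[:, :, pos] = v_new
    return k_cache, v_cache

# One decoding step at position 5 with dummy key/value projections.
k_new = torch.randn(1, N_HEADS, HEAD_DIM)
v_new = torch.randn(1, N_HEADS, HEAD_DIM)
update_kv_cache(k_cache, v_cache, k_new, v_new, pos=5)
```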
## Prerequisites
## Instructions
### Step 1: Prepare the model checkpoint and tokenizer
1. For the Llama 3.2 tokenizer and checkpoint, please follow the [instructions](https://www.llama.com/models/llama-3) to download `tokenizer.model`, `consolidated.00.pth`, and `params.json`.
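
After downloading, you can quickly verify that all three files are in place. This is just a convenience sketch; the directory name below is a placeholder for wherever you stored the files:

```python
from pathlib import Path

# Placeholder path; point this at the directory containing your downloaded files.
ckpt_dir = Path("./llama3_2_3b_instruct")

for name in ("tokenizer.model", "consolidated.00.pth", "params.json"):
    status = "found" if (ckpt_dir / name).exists() else "MISSING"
    print(f"{name}: {status}")
```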
### Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend
Deploying large language models like Llama 3 on-device presents the following challenges:

1. Model size is too large to fit into device memory for inference.
2. High model loading and inference time.
3. Difficulty in quantization.
To address these, we apply the following optimizations:
1. Quantization: Use `QuantDtype.use_16a4w_block` for post-training quantization to reduce model size and memory usage (a conceptual sketch of block-wise weight quantization follows this list).
2. Mixed Precision Quantization: compresses KV cache tensors to 8-bit and applies `QuantDtype.use_16a8w` to the LM head.
3. SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. SeqMSE uses a search-based approach with `seq_mse_candidates` = 1000, as illustrated in the sketch after this list. (Implementation details: [SeqMSE pass](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/_passes/seq_mse.py))
4. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference.
5. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance.
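
To make the weight-quantization idea in item 1 concrete, here is a rough, self-contained sketch of block-wise 4-bit weight quantization. It is not the backend's implementation; the block size, symmetric rounding, and int4 range are illustrative assumptions:

```python
import torch

def quantize_blockwise_int4(w: torch.Tensor, block_size: int = 16):
    """Conceptual block quantization: one scale per block of `block_size` weights."""
    blocks = w.reshape(-1, block_size)
    # Per-block scale so the largest magnitude in each block maps to the int4 limit.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(blocks / scales), -8, 7).to(torch.int8)  # 4-bit values in int8 storage
    return q, scales

def dequantize_blockwise(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.float() * scales).reshape(shape)

# Quantize a toy weight matrix and check that the reconstruction error stays small.
w = torch.randn(64, 64)
q, scales = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise(q, scales, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```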
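
Item 3 can likewise be illustrated with a toy example. The actual SeqMSE pass (linked above) operates on the encodings in the ExecuTorch graph and evaluates many more candidates; the sketch below only shows the core idea of searching over candidate scales and keeping the one that minimizes the mean squared error between the float and quantized layer outputs:

```python
import torch

def seq_mse_scale_search(weight, activations, num_candidates=20):
    """Toy per-tensor scale search minimizing output MSE against the float layer."""
    ref_out = activations @ weight.t()          # reference output of the float layer
    max_abs = weight.abs().max()
    best_scale, best_mse = None, float("inf")
    for i in range(1, num_candidates + 1):
        # Candidate clipping ranges grow from a small fraction of max|w| up to max|w|.
        scale = (max_abs * i / num_candidates / 7.0).item()
        q = torch.clamp(torch.round(weight / scale), -8, 7)
        mse = torch.mean((ref_out - activations @ (q * scale).t()) ** 2).item()
        if mse < best_mse:
            best_scale, best_mse = scale, mse
    return best_scale, best_mse

weight = torch.randn(32, 64)      # toy linear layer
acts = torch.randn(128, 64)       # toy calibration activations
scale, mse = seq_mse_scale_search(weight, acts)
print(f"best scale: {scale:.5f}, output MSE: {mse:.6f}")
```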
You can find the full optimization configuration in this [file](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/__init__.py), as shown below:
```python
@register_llm_model("llama3_2-1b_instruct")
@dataclass(init=False, frozen=True)
class Llama3_2_1B_Instruct(LLMModelConfig):
    repo_id = None
    params_path = None
    convert_weights = None
    transform_weight = True
    # The Llama3_2 enabled should be instruct, however, Llama's tokenizer does not provide utility to apply chat template.
    # Please note that calibration_data must include the prompt template for special tokens.
    # (remaining fields omitted; see the linked file for the full configuration)
```

Export the model with the command below. The `--compile_only` flag compiles the model into a `.pte` artifact without executing it on the device:

```bash
# export llama
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode kv --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --compile_only
```

### Step 3: Run the model on the device

Run the exported model by pointing `--pre_gen_pte` at the artifact produced by the compile-only step:

```bash
# Run llama
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode kv --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --pre_gen_pte ${PATH_TO_ARTIFACT}
```
## What is coming?
- Performance improvements
- Reduce the memory pressure during inference to support 12GB Qualcomm devices
- Broader LLM Support via [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch?tab=readme-ov-file#llms-large-language-models)
  - Already supported models include Llama 2, Llama 3, Gemma, Qwen, Phi-4, and SmolLM. For usage examples, please refer to the [README](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md)
## FAQ
If you encounter any issues while reproducing the tutorial, please file a GitHub issue on the ExecuTorch repo and use the `#qcom_aisw` tag.