docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md (12 additions, 18 deletions)
@@ -26,47 +26,41 @@ Deploying large language models like Llama 3 on-device presents the following challenges:
To address these, we apply the following optimizations:
- 1. Quantization: Use `QuantDtype.use_16a4w_block` for post-training quantization to reduce model size and memory usage
+ 1. Quantization: Use `QuantDtype.use_16a4w_block` for post-training quantization to reduce model size and memory usage.

2. Mixed Precision Quantization: compresses KV cache tensors to 8-bit and applies `QuantDtype.use_16a8w` to the LM head.

- 3. SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. SeqMSE uses a search-based approach with `seq_mse_candidates` = 1000. (Implementation details: [SeqMSE pass](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/_passes/seq_mse.py))
+ 3. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference. The number of shards might be different depending on the model size.

- 4. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference.
- 5. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance.
+ 4. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance.
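To make the block quantization in item 1 (and the `group_size` = 32 setting in the configuration below) concrete, here is a minimal, self-contained sketch of 4-bit per-group weight quantization. It assumes simple symmetric rounding and per-group scales; the function names are invented for this illustration, and it is not the actual `QuantDtype.use_16a4w_block` implementation, which additionally keeps activations at 16-bit.

```python
# Illustrative sketch only: per-group ("block") symmetric 4-bit weight quantization
# with group_size = 32. Function names are invented for this example; this is not
# the Qualcomm backend's actual 16a4w_block implementation.
import numpy as np

def quantize_weights_int4_per_group(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D float weight vector to int4 codes with one scale per group."""
    assert weights.size % group_size == 0, "pad weights so they split evenly into groups"
    groups = weights.reshape(-1, group_size)
    # One scale per group of 32 weights: map the largest magnitude onto the int4 range [-8, 7].
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero groups
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_per_group(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int4 codes and per-group scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

# Because each group of 32 weights gets its own scale, outliers in one group
# do not degrade the precision of the others.
w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
codes, scales = quantize_weights_int4_per_group(w, group_size=32)
print("max abs reconstruction error:", np.abs(dequantize_per_group(codes, scales) - w).max())
```

Mixed precision quantization (item 2) follows the same principle but spends more bits where accuracy is most sensitive, e.g. 8-bit for the KV cache tensors and `QuantDtype.use_16a8w` for the LM head.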
You can find the full optimization configuration in this [file](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/__init__.py), as shown below:
```python
- @register_llm_model("llama3_2-1b_instruct")
+ @register_llm_model("llama3_2-3b_instruct")
@dataclass(init=False, frozen=True)
- class Llama3_2_1B_Instruct(LLMModelConfig):
+ class Llama3_2_3B_Instruct(LLMModelConfig):
    repo_id = None
    params_path = None
    convert_weights = None
    transform_weight = True
    # The Llama3_2 enabled should be instruct, however, Llama's tokenizer does not provide utility to apply chat template.
    instruct_model = False
-
-     num_sharding = 1
+
+     num_sharding = 4
    # quant config
    ptq = QuantDtype.use_16a4w_block
-     group_size = 32
+     group_size = 32  # Group size used in block quantization for weight quantization. Will only be used when ptq = 16a4w_block
    masked_softmax = False
-     seq_mse_candidates = 1000
+
+     # SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. (Implementation details: ./backends/qualcomm/_passes/seq_mse.py) In this configuration, we set `seq_mse_candidates` = 0, which means SeqMSE quantization is not applied.
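# ----------------------------------------------------------------------------
# Editor's note: the sketch below is illustrative only and is NOT part of the
# diff above or of the ExecuTorch sources. It shows the idea behind the SeqMSE
# comment: for each layer, search over candidate quantization scales and keep
# the one that minimizes the MSE between the layer's float output and its
# quantized output. The function name, the symmetric int4 scale grid, and the
# candidate count are assumptions made for this sketch; the real pass lives in
# backends/qualcomm/_passes/seq_mse.py and differs in detail.
import numpy as np

def seq_mse_search(weight: np.ndarray, activations: np.ndarray, num_candidates: int = 20):
    """Return the per-tensor int4 weight scale (searched over `num_candidates`
    shrink factors) that minimizes the layer-output MSE."""
    ref_out = activations @ weight.T                             # float reference output
    max_abs = float(np.abs(weight).max())
    best_scale, best_mse = None, float("inf")
    for i in range(1, num_candidates + 1):
        scale = max(max_abs * i / num_candidates / 7.0, 1e-12)   # candidate int4 scale
        q = np.clip(np.round(weight / scale), -8, 7) * scale     # fake-quantized weights
        mse = float(np.mean((activations @ q.T - ref_out) ** 2))
        if mse < best_mse:
            best_scale, best_mse = scale, mse
    return best_scale, best_mse

# Example usage with random stand-ins for real calibration data:
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)
print(seq_mse_search(w, x))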
0 commit comments