Commit 3c15e00

add issue link and refine the comments
1 parent 23ecb9a commit 3c15e00

2 files changed: +13 −19 lines

docs/source/backends-qualcomm.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -478,4 +478,4 @@ print(f"Model successfully exported to {model_name}")
 ## FAQ
 
 If you encounter any issues while reproducing the tutorial, please file a github
-issue on ExecuTorch repo and tag use `#qcom_aisw` tag
+[issue](https://github.com/pytorch/executorch/issues) on ExecuTorch repo and tag use `#qcom_aisw` tag
````

docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md

Lines changed: 12 additions & 18 deletions
````diff
@@ -26,47 +26,41 @@ Deploying large language models like Llama 3 on-device presents the following ch
 
 To address these, we apply the following optimizations:
 
-1. Quantization: Use `QuantDtype.use_16a4w_block` for post-training quantization to reduce model size and memory usage
+1. Quantization: Use `QuantDtype.use_16a4w_block` for post-training quantization to reduce model size and memory usage.
 
 2. Mixed Precision Quantization: compresses KV cache tensors to 8-bit and applies `QuantDtype.use_16a8w` to the LM head.
 
-3. SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. SeqMSE uses a search-based approach with `seq_mse_candidates` = 1000. (Implementation details: [SeqMSE pass](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/_passes/seq_mse.py))
+3. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference. The number of shards might be different depending on the model size.
 
-4. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference.
-
-5. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance.
+4. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance.
 
 You can find the full optimization configuration in this [file](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/__init__.py), as shown below:
 
 ``` python
-@register_llm_model("llama3_2-1b_instruct")
+@register_llm_model("llama3_2-3b_instruct")
 @dataclass(init=False, frozen=True)
-class Llama3_2_1B_Instruct(LLMModelConfig):
+class Llama3_2_3B_Instruct(LLMModelConfig):
     repo_id = None
     params_path = None
     convert_weights = None
     transform_weight = True
     # The Llama3_2 enabled should be instruct, however, Llama's tokenizer does not provide utility to apply chat template.
     instruct_model = False
-
-    num_sharding = 1
+
+    num_sharding = 4
     # quant config
     ptq = QuantDtype.use_16a4w_block
-    group_size = 32
+    group_size = 32 # Group size used in block quantization for weight quantization. Will only be used when ptq = 16a4w_block
     masked_softmax = False
-    seq_mse_candidates = 1000
+
+    # SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. (Implementation details: ./backends/qualcomm/_passes/seq_mse.py) In this configuration, we set `seq_mse_candidates` = 0, which means SeqMSE quantization is not applied.
+    seq_mse_candidates = 0
     r1 = False
     r2 = False
     r3 = False
-    quantization_config_down_proj_16a8w = get_ptq_per_channel_quant_config(
-        torch.uint16, weight_dtype=torch.int8, act_observer=MinMaxObserver
-    )
     custom_annotation = (
         annotate_kv_8bit,
         annotate_output_16a8w,
-        partial(
-            annotate_down_proj, quantization_config=quantization_config_down_proj_16a8w
-        ),
     )
 ```
 
@@ -105,4 +99,4 @@ python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL
 ## FAQ
 
 If you encounter any issues while reproducing the tutorial, please file a github
-issue on ExecuTorch repo and tag use `#qcom_aisw` tag
+[issue](https://github.com/pytorch/executorch/issues) on ExecuTorch repo and tag use `#qcom_aisw` tag
````
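For context on the two quantization knobs the refined comments describe (`group_size` for 16a4w block quantization and `seq_mse_candidates` for SeqMSE), the sketch below illustrates the general idea: weights are fake-quantized to 4 bits in groups of `group_size` values with one scale per group, and a SeqMSE-style search can try several candidate scales per group and keep the one with the lowest reconstruction error. This is an illustrative, self-contained approximation only, covering just the 4-bit weight side of 16a4w; the helper name `quantize_weight_blockwise` and the scale-search loop are invented for this example, and the actual passes live in the ExecuTorch Qualcomm backend (e.g. `backends/qualcomm/_passes/seq_mse.py`), where SeqMSE works on layer outputs rather than raw weight error.

``` python
# Illustrative sketch only -- not the ExecuTorch/QNN implementation.
# Shows group-wise ("block") 4-bit weight fake-quantization with group_size = 32,
# plus an optional SeqMSE-style search over candidate scales per group.
import torch


def quantize_weight_blockwise(weight: torch.Tensor, group_size: int = 32,
                              candidates: int = 0) -> torch.Tensor:
    """Fake-quantize a (out_features, in_features) weight to signed 4-bit per group."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    groups = weight.reshape(out_features, in_features // group_size, group_size)

    # Baseline: map each group's max magnitude onto the int4 range [-8, 7].
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)

    def fake_quant(g, s):
        return torch.clamp(torch.round(g / s), -8, 7) * s

    if candidates > 0:
        # SeqMSE-style search: try progressively smaller scales and keep, per group,
        # whichever candidate minimizes the mean-squared reconstruction error.
        best = fake_quant(groups, scales)
        best_err = (best - groups).pow(2).mean(dim=-1, keepdim=True)
        for i in range(1, candidates + 1):
            trial = fake_quant(groups, scales * (1.0 - 0.5 * i / candidates))
            err = (trial - groups).pow(2).mean(dim=-1, keepdim=True)
            better = err < best_err
            best = torch.where(better, trial, best)
            best_err = torch.where(better, err, best_err)
        dequant = best
    else:
        # candidates == 0 mirrors `seq_mse_candidates = 0` above: no search is applied.
        dequant = fake_quant(groups, scales)

    return dequant.reshape(out_features, in_features)


# Example: quantize a projection-sized weight with group_size = 32 (no SeqMSE search).
w = torch.randn(2048, 2048)
w_q = quantize_weight_blockwise(w, group_size=32, candidates=0)
print((w - w_q).abs().mean())
```

With `seq_mse_candidates = 0`, as in the committed config, only the baseline max-abs scales are used, which matches the new comment stating that SeqMSE quantization is not applied.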
