Qualcomm AI Engine Direct - GA Qwen 2.5 0.5B #12333

shewu-quic · 2025-07-10T08:35:54Z

Summary:

Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated in Qnn Backend
Add a e2e script to run qwen 2.5
- Support spin quant R3
- Replace Qwen2Attention with QCQwen2Attention
- Pre-compute freqs_cos and freqs_sin to bypass rotary embedding
- Replace Qwen2RMSNorm with torch.nn,.RMSNorm
- Tag quant IO to avoid insering Q/DQ for I/O
- Reuse executorch llama runner, llama_main

Note that accuracy currently is bad, need to investigate more.

Reproduce command

python3 examples/qualcomm/oss_scripts/qwen/qwen.py  -s <serial>-H <host> -m SM8750  --prompt "My favourite condiment is "  -b build-android --decoder_model qwen2.5_0.5B --ptq 16a16w

Results

7/9

ptq: 16a16w
Speed: 62 tok/sec on SM8750, seq_len = 128
Accuracy: Bad

Outputs:

I 00:00:02.944266 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:02.944270 executorch:stats.h:114] 	Model Load Time:		0.677000 (seconds)
I 00:00:02.944274 executorch:stats.h:124] 	Total inference time:		2.034000 (seconds)		 Rate: 	59.488692 (tokens/second)
I 00:00:02.944279 executorch:stats.h:132] 		Prompt evaluation:	0.093000 (seconds)		 Rate: 	64.516129 (tokens/second)
I 00:00:02.944283 executorch:stats.h:143] 		Generated 121 tokens:	1.941000 (seconds)		 Rate: 	62.339001 (tokens/second)
I 00:00:02.944288 executorch:stats.h:151] 	Time to first generated token:	0.093000 (seconds)
I 00:00:02.944292 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.059000 (seconds)
My favourite condiment is a thing, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan

7/11

ptq: 16a8w
Speed: 135 tok/sec on SM8750, seq_len = 128
Accuracy: Seems better

Outputs:

I 00:00:00.734588 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.648438 MiB (0 if unsupported)
I 00:00:00.734865 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.784392 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.648438 MiB (0 if unsupported)
I 00:00:01.677137 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.648438 MiB (0 if unsupported)
I 00:00:01.677171 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.677180 executorch:stats.h:114] 	Model Load Time:		0.431000 (seconds)
I 00:00:01.677187 executorch:stats.h:124] 	Total inference time:		0.943000 (seconds)		 Rate: 	128.313892 (tokens/second)
I 00:00:01.677193 executorch:stats.h:132] 		Prompt evaluation:	0.050000 (seconds)		 Rate: 	120.000000 (tokens/second)
I 00:00:01.677201 executorch:stats.h:143] 		Generated 121 tokens:	0.893000 (seconds)		 Rate: 	135.498320 (tokens/second)
I 00:00:01.677208 executorch:stats.h:151] 	Time to first generated token:	0.050000 (seconds)
I 00:00:01.677215 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.017000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.7 MB/s (883 bytes in 0.001s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love it so much that I have to have it every day. I have a habit of making it at home. I have a few recipes for iced tea. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite

cc: @winskuo-quic , @haowhsu-quic

pytorch-bot · 2025-07-10T08:36:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12333

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 5 Unrelated Failures

As of commit 30d036f with merge base dfc387b ():

NEW FAILURES - The following jobs have failed:

Build documentation / build (buck2) / Build doc (gh)
At least one of the pre-conditions you specified did not hold
trunk / test-llama-runner-mac (fp32, coreml) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2
trunk / test-llama-runner-mac (fp32, mps) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2
trunk / test-llama-runner-mac (fp32, xnnpack+custom+quantize_kv) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2
trunk / test-llama-torchao-lowbit / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Apple / build-benchmark-app / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
Apple / build-demo-ios / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
Apple / build-frameworks-ios / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
trunk / test-coreml-delegate / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
trunk / test-huggingface-transformers-coreml (qwen3-1.7b|xnnpack|--quantize) / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 138

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-07-10T08:36:36Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

shewu-quic · 2025-07-10T08:47:34Z

Hi @cccclai @kimishpatel,

I am working on supporting the decoder only model from transformer path.
I create a wrapper for decoder model which is based on the TorchExportableModuleWithStaticCache in the transformers.
There are some changes to fully delegate in QNN Backend:

Change an attention mask to avoid the computation in the model
Add buffer for freqs_cos and freqs_sin to bypass rotary embedding in the model
Change the attention with QCQwen2Attention
Replace Qwen2RMSNorm with torch.nn,.RMSNorm

May I know these changes are acceptable?

Thanks.

shewu-quic · 2025-07-10T08:51:46Z

And one more question, when I use your runner with qwen 2.5.
But I get the error for tokenizer. I add a workaround to change behavior with MergedWithPrevious.
Could you give me some advice to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

kimishpatel · 2025-07-10T22:36:46Z

And one more question, when I use your runner with qwen 2.5. But I get the error for tokenizer. I add a workaround to change behavior with MergedWithPrevious. Could you give me some advice to address it?
Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@larryliu0820 on this question

backends/qualcomm/utils/utils.py

kimishpatel · 2025-07-10T22:43:59Z

examples/qualcomm/oss_scripts/qwen/decoder_model_wrapper.py

+        # =====================================================================
+        outs = self.model(
+            input_ids=input_ids,
+            attention_mask=attn_mask,


One question: if you have to specify per layer mask, how would you?

@guangy10 does transformer api allow for per layer mask be specified here as list of tensor or something?

examples/qualcomm/oss_scripts/qwen/qwen.py

kimishpatel · 2025-07-10T23:11:32Z

examples/qualcomm/oss_scripts/qwen/qwen.py

+            )
+            if quant_dtype == QuantDtype.use_16a4w_block:
+                conv_nodes = [
+                    n for n in fx_graph_module.graph.nodes if "conv" in n.name


you dont want to check type of the node or node.target? to see if it is conv

Good point. Thanks!

examples/qualcomm/oss_scripts/qwen/qwen_model.py

kimishpatel · 2025-07-10T23:20:32Z

extension/llm/runner/text_decoder_runner.h

      const float temperature = 0.0f) {
    int32_t result = 0;
-    ET_SWITCH_THREE_TYPES(
+    ET_SWITCH_FOUR_TYPES(


@larryliu0820 these change seem acceptable?

kimishpatel

left some comments but I think this feels not the right approach for enabling transformer models. I left inline comments describing why

kimishpatel · 2025-07-11T02:29:45Z

couple of other questions I have is:

what is the performance like compared to static_llama like solution and
how is the kv cache managed. if it is quantized, how are you doing quantized cache updates?

shewu-quic · 2025-07-11T03:10:07Z

left some comments but I think this feels not the right approach for enabling transformer models. I left inline comments describing why

Let me confirm my understanding. The general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without involving the actual model code structure.
I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow for the specification of custom attention masks or attention functions?
Because it seems that the transformer framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS

kimishpatel · 2025-07-11T03:12:21Z

left some comments but I think this feels not the right approach for enabling transformer models. I left inline comments describing why

Let me confirm my understanding. The general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without involving the actual model code structure. I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow for the specification of custom attention masks or attention functions? Because it seems that the transformer framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS

Yes. I pointed to some examples of that, but note that it wont allow you to do the kind of things maybe you are doing. Like inserting R1/R3 etc. at least to my understanding. if you can do that using attention customization interface, thats great

shewu-quic · 2025-07-11T03:20:33Z

couple of other questions I have is:

what is the performance like compared to static_llama like solution and

how is the kv cache managed. if it is quantized, how are you doing quantized cache updates?

We are currently working on this item which evaluate ppl and performance. We will get back to you with the results as soon as possible.
For transformers path, it manages kv cache with mutable buffer mechanism via index_put to update kv cache in each turn. For quantized cache, we will avoid inserting Q/DQ for the input and output of kv cache with tag_quant_io pass,

cccclai · 2025-07-15T18:38:04Z

And one more question, when I use your runner with qwen 2.5. But I get the error for tokenizer. I add a workaround to change behavior with MergedWithPrevious. Could you give me some advice to address it?
Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

This should be fixed with the latest tokenizer https://github.com/pytorch-labs/tokenizers

shewu-quic · 2025-07-17T01:31:11Z

Hi @kimishpatel and @cccclai,

I have pushed a commit that leverages transformer APIs such as AttentionInterface, AttentionMaskInterface, and TorchExportableModuleForDecoderOnlyLM to make the decoder model QNN-friendly without altering the model structure.
Could you please let me know if this approach meets your expectations for the decoder-only model?

Results

It can be fully delegated and produce reasonable result with Qwen 2.5 0.5B.
The below result is generated by qwen 2.5 0.5B without R3, seq_len = 128, device = SM8750, quant_config = 16a8w

I 00:00:00.900210 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.289062 MiB (0 if unsupported)
I 00:00:00.900500 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.949226 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.289062 MiB (0 if unsupported)
I 00:00:01.851722 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.289062 MiB (0 if unsupported)
I 00:00:01.851748 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.851755 executorch:stats.h:114] 	Model Load Time:		0.516000 (seconds)
I 00:00:01.851763 executorch:stats.h:124] 	Total inference time:		0.952000 (seconds)		 Rate: 	127.100840 (tokens/second)
I 00:00:01.851769 executorch:stats.h:132] 		Prompt evaluation:	0.049000 (seconds)		 Rate: 	122.448980 (tokens/second)
I 00:00:01.851777 executorch:stats.h:143] 		Generated 121 tokens:	0.903000 (seconds)		 Rate: 	133.997785 (tokens/second)
I 00:00:01.851785 executorch:stats.h:151] 	Time to first generated token:	0.049000 (seconds)
I 00:00:01.851792 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.012000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.5 MB/s (893 bytes in 0.002s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love the taste of it, and I love the fact that it is so easy to make. I make it in a couple of ways. The first way is to make a batch of iced tea at the end of the day and put it in the fridge. The second way is to make it in advance and put it in the freezer. I like the second way because it is much easier to make. I make it in advance by putting the tea in a large pitcher and adding the tea leaves. I add the sugar and the milk and stir it all together. I then put the

Validate the pte on wikitext limit = 1

The PPL of original nn module: 49
The PPL of QDQ module: 51
The PPL of QNN delegated on device: 51

Reproduce command

# export command
python3 examples/qualcomm/oss_scripts/qwen/qwen.py  -s <serial>-H <host> -m SM8750  -a qwen2 --prompt "My favourite condiment is "  -b build-android --decoder_model qwen2.5_0.5B  --calibration_tasks wikitext --calibration_limit 1 --ptq 16a8w

# eval command
python3 examples/qualcomm/oss_scripts/qwen/eval_qwen_qnn.py  -s <serial>-H <host> -m SM8750  -a qwen2 -b build-android --limit 1 --tokenizer_path qwen2 /tokenizer.json --pte qwen2 /qwen_qnn_q16.pte --logits_quant_attr_path qwen2 /qwen_qnn_q16_quant_attrs.txt

shewu-quic · 2025-08-18T06:56:58Z

Can you delete this line https://github.com/pytorch/executorch/blob/main/examples/models/llama/TARGETS#L115 since you delete this file?

Thanks for pointing out!

test_qnn_backend_linear_block unit test also seems failing

It appears that graph prepare failed in QNN 2.28, but this issue has been resolved in QNN 2.37.
Should I skip this test until the Qnn SDK is updated to version 2.37?

cccclai · 2025-08-18T17:07:41Z

Can you delete this line https://github.com/pytorch/executorch/blob/main/examples/models/llama/TARGETS#L115 since you delete this file?

Thanks for pointing out!

test_qnn_backend_linear_block unit test also seems failing

It appears that graph prepare failed in QNN 2.28, but this issue has been resolved in QNN 2.37. Should I skip this test until the Qnn SDK is updated to version 2.37?

I see, can you mark the test to skip?

facebook-github-bot · 2025-08-18T19:44:43Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

cccclai · 2025-08-18T21:00:40Z

Can you rebase again?

shewu-quic · 2025-08-19T06:59:31Z

I see, can you mark the test to skip?

I have rebased and add a skip test macro for this unit test. Based on my test, it seems pass after 2.30.
Thanks.

facebook-github-bot · 2025-08-19T17:32:09Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

cccclai · 2025-08-19T18:52:13Z

Actually there are some internal code refering to the rms norm pass...can you restore the change?

shewu-quic · 2025-08-21T06:47:35Z

Rebase completed.

facebook-github-bot · 2025-08-21T15:22:31Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

Summary: - Support decoder-only model from huggingface - Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated in Qnn Backend - Add llm manager - Leaverage AttentionMaskInterface and AttentionInterface without touching model structure - Add Eval script to evaluate ppl on device - Support Qwen 2.5 - Add a e2e script to run qwen 2.5 - Support spin quant R3 - Support enable_x86_64 in qwen - Add unittest for qwen2_5 - Reuse executorch llama runner, llama_main - Move some source transformations to passes - Add a pass to convert linear to conv2d during transform_for_export_pipeline - Support recompose rms norm by pattern-based

shewu-quic · 2025-08-28T05:31:58Z

Rebase completed.

facebook-github-bot · 2025-08-28T17:21:15Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

shoumikhin · 2025-08-31T18:18:32Z

@shewu-quic please take a look if the newly introduced failures on CI are related to this change
See test-llama-runner-mac failing jobs at https://hud.pytorch.org/pytorch/executorch/commit/216a1ecce60a5367b4bc4d82dbb3ce750d7bee54
A snippet from the logs:

2025-08-28T21:59:44.3757840Z flatccrt library is not found.
2025-08-28T21:59:44.3758040Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3758260Z etdump library is not found.
2025-08-28T21:59:44.3758460Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3758690Z bundled_program library is not found.
2025-08-28T21:59:44.3758910Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3759150Z neuron_backend library is not found.
2025-08-28T21:59:44.3759370Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3759600Z qnn_executorch_backend library is not found.
2025-08-28T21:59:44.3759830Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3760060Z extension_evalue_util library is not found.
2025-08-28T21:59:44.3760300Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3760520Z extension_runner_util library is not found.
2025-08-28T21:59:44.3760750Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3760970Z extension_training library is not found.
2025-08-28T21:59:44.3761180Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3761400Z vulkan_backend library is not found.
2025-08-28T21:59:44.3761610Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3761830Z quantized_ops_aot_lib library is not found.
2025-08-28T21:59:44.3762060Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3762270Z -- Configuring done (0.4s)
2025-08-28T21:59:44.3762410Z -- Generating done (0.1s)
2025-08-28T21:59:44.3762540Z CMake Warning:
2025-08-28T21:59:44.3762710Z   Manually-specified variables were not used by the project:
2025-08-28T21:59:44.3762880Z 
2025-08-28T21:59:44.3762920Z     BUILD_TESTING
2025-08-28T21:59:44.3763000Z 
2025-08-28T21:59:44.3763000Z 
2025-08-28T21:59:44.3767600Z -- Build files have been written to: /Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/cmake-out/examples/models/llama
2025-08-28T21:59:44.3768080Z + cmake --build cmake-out/examples/models/llama -j9 --config Release
2025-08-28T21:59:44.3768430Z [ 40%] Building CXX object runner/CMakeFiles/llama_runner.dir/__/tokenizer/llama_tiktoken.cpp.o
2025-08-28T21:59:44.3768790Z [ 40%] Building CXX object runner/CMakeFiles/llama_runner.dir/runner.cpp.o
2025-08-28T21:59:44.3769110Z [ 60%] Linking CXX static library libllama_runner.a
2025-08-28T21:59:44.3769310Z [ 60%] Built target llama_runner
2025-08-28T21:59:44.3769550Z [ 80%] Building CXX object CMakeFiles/llama_main.dir/main.cpp.o
2025-08-28T21:59:44.3769770Z [100%] Linking CXX executable llama_main
2025-08-28T21:59:44.3769940Z ld: unknown options: -rpath=$ORIGIN 
2025-08-28T21:59:44.3770200Z clang: error: linker command failed with exit code 1 (use -v to see invocation)
2025-08-28T21:59:44.3770450Z make[2]: *** [llama_main] Error 1
2025-08-28T21:59:44.3770630Z make[1]: *** [CMakeFiles/llama_main.dir/all] Error 2
2025-08-28T21:59:44.3770810Z make: *** [all] Error 2
2025-08-28T21:59:44.3771290Z ERROR conda.cli.main_run:execute(125): `conda run bash .ci/scripts/test_llama.sh -model stories110M -build_tool cmake -dtype fp32 -mode xnnpack+custom+quantize_kv` failed. (See above for error)

shewu-quic · 2025-09-01T03:33:03Z

Thank you for noticing that.
It appears that $ORIGIN doesn't exist as an environment variable on Mac. I have created a fixed PR to address this issue.

Fix for #12333 which broke mac runner tests on trunk, since `$ORIGIN` is linux only.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2025