shewu-quic
Collaborator

@shewu-quic shewu-quic commented Jul 10, 2025

Summary:

  • Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated in the QNN backend
  • Add an e2e script to run Qwen 2.5
    • Support SpinQuant R3
    • Replace Qwen2Attention with QCQwen2Attention
    • Pre-compute freqs_cos and freqs_sin to bypass rotary embedding (see the sketch below)
    • Replace Qwen2RMSNorm with torch.nn.RMSNorm
    • Tag quant IO to avoid inserting Q/DQ for I/O
    • Reuse the executorch llama runner, llama_main

Note that accuracy is currently bad and needs more investigation.
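As a rough illustration of the freqs pre-computation step above (a minimal sketch only; the buffer names, head_dim, and rope_theta default here are assumptions, not the PR's actual wrapper code):

```python
import torch

def precompute_freqs(head_dim: int, max_seq_len: int, rope_theta: float = 1000000.0):
    # Standard RoPE frequency table, computed once at export time so the exported
    # graph only gathers from these tables instead of running rotary embedding math.
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(max_seq_len).float()
    freqs = torch.outer(t, inv_freq)           # [max_seq_len, head_dim // 2]
    return torch.cos(freqs), torch.sin(freqs)  # freqs_cos, freqs_sin

class DecoderWrapper(torch.nn.Module):
    def __init__(self, model: torch.nn.Module, head_dim: int, max_seq_len: int):
        super().__init__()
        self.model = model
        freqs_cos, freqs_sin = precompute_freqs(head_dim, max_seq_len)
        # Registered as buffers so they become constant tensors in the delegated graph.
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)
```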

Reproduce command

python3 examples/qualcomm/oss_scripts/qwen/qwen.py -s <serial> -H <host> -m SM8750 --prompt "My favourite condiment is " -b build-android --decoder_model qwen2.5_0.5B --ptq 16a16w

Results

7/9

ptq: 16a16w
Speed: 62 tok/sec on SM8750, seq_len = 128
Accuracy: Bad

Outputs:

I 00:00:02.944266 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:02.944270 executorch:stats.h:114] 	Model Load Time:		0.677000 (seconds)
I 00:00:02.944274 executorch:stats.h:124] 	Total inference time:		2.034000 (seconds)		 Rate: 	59.488692 (tokens/second)
I 00:00:02.944279 executorch:stats.h:132] 		Prompt evaluation:	0.093000 (seconds)		 Rate: 	64.516129 (tokens/second)
I 00:00:02.944283 executorch:stats.h:143] 		Generated 121 tokens:	1.941000 (seconds)		 Rate: 	62.339001 (tokens/second)
I 00:00:02.944288 executorch:stats.h:151] 	Time to first generated token:	0.093000 (seconds)
I 00:00:02.944292 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.059000 (seconds)
My favourite condiment is a thing, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan

7/11

ptq: 16a8w
Speed: 135 tok/sec on SM8750, seq_len = 128
Accuracy: Seems better

Outputs:

I 00:00:00.734588 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.648438 MiB (0 if unsupported)
I 00:00:00.734865 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.784392 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.648438 MiB (0 if unsupported)
I 00:00:01.677137 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.648438 MiB (0 if unsupported)
I 00:00:01.677171 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.677180 executorch:stats.h:114] 	Model Load Time:		0.431000 (seconds)
I 00:00:01.677187 executorch:stats.h:124] 	Total inference time:		0.943000 (seconds)		 Rate: 	128.313892 (tokens/second)
I 00:00:01.677193 executorch:stats.h:132] 		Prompt evaluation:	0.050000 (seconds)		 Rate: 	120.000000 (tokens/second)
I 00:00:01.677201 executorch:stats.h:143] 		Generated 121 tokens:	0.893000 (seconds)		 Rate: 	135.498320 (tokens/second)
I 00:00:01.677208 executorch:stats.h:151] 	Time to first generated token:	0.050000 (seconds)
I 00:00:01.677215 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.017000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.7 MB/s (883 bytes in 0.001s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love it so much that I have to have it every day. I have a habit of making it at home. I have a few recipes for iced tea. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite

cc: @winskuo-quic , @haowhsu-quic


pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12333

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 5 Unrelated Failures

As of commit 30d036f with merge base dfc387b:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 10, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@shewu-quic
Collaborator Author

Hi @cccclai @kimishpatel,

I am working on supporting decoder-only models from the transformers path.
I created a wrapper for the decoder model based on TorchExportableModuleWithStaticCache in transformers.
There are some changes needed to fully delegate it in the QNN backend:

  1. Change the attention mask to avoid computing it inside the model (see the sketch at the end of this comment)
  2. Add buffers for freqs_cos and freqs_sin to bypass rotary embedding in the model
  3. Replace Qwen2Attention with QCQwen2Attention
  4. Replace Qwen2RMSNorm with torch.nn.RMSNorm

May I know whether these changes are acceptable?

Thanks.
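For item 1, a minimal sketch of what building the mask outside the model might look like (the additive-mask convention and the names here are assumptions for illustration, not the actual wrapper code):

```python
import torch

def build_static_causal_mask(max_seq_len: int, dtype=torch.float32) -> torch.Tensor:
    # Additive causal mask computed ahead of time: 0 where attention is allowed,
    # a large negative value elsewhere, so no mask math happens inside the model.
    mask = torch.full((max_seq_len, max_seq_len), torch.finfo(dtype).min, dtype=dtype)
    return torch.triu(mask, diagonal=1)

# Hypothetical usage: slice the row(s) for the current step and pass it as attention_mask.
full_mask = build_static_causal_mask(128)
step_mask = full_mask[0:1, :]  # mask for the first decode position
```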

@shewu-quic
Collaborator Author

One more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround to change the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@kimishpatel
Contributor

One more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround to change the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@larryliu0820 on this question

# =====================================================================
outs = self.model(
    input_ids=input_ids,
    attention_mask=attn_mask,
Contributor

One question: if you have to specify a per-layer mask, how would you?

@guangy10 does the transformers API allow a per-layer mask to be specified here as a list of tensors or something?

)
if quant_dtype == QuantDtype.use_16a4w_block:
    conv_nodes = [
        n for n in fx_graph_module.graph.nodes if "conv" in n.name
Contributor

Don't you want to check the type of the node or node.target to see if it is a conv?

Collaborator Author

Good point. Thanks!
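A minimal sketch of matching conv nodes by op and target rather than by a substring of the node name (the exact set of conv targets is an assumption and may differ from the final pass):

```python
import torch

# Aten conv variants expected in the exported graph (assumed set).
CONV_TARGETS = {
    torch.ops.aten.conv2d.default,
    torch.ops.aten.convolution.default,
}

# fx_graph_module is the GraphModule being processed by the surrounding pass.
conv_nodes = [
    n
    for n in fx_graph_module.graph.nodes
    if n.op == "call_function" and n.target in CONV_TARGETS
]
```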

const float temperature = 0.0f) {
int32_t result = 0;
ET_SWITCH_THREE_TYPES(
ET_SWITCH_FOUR_TYPES(
Contributor

@larryliu0820 do these changes seem acceptable?

Contributor

@kimishpatel kimishpatel left a comment

I left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

@kimishpatel
Contributor

A couple of other questions I have:

  1. What is the performance like compared to a static_llama-like solution?
  2. How is the KV cache managed? If it is quantized, how are you doing quantized cache updates?

@shewu-quic
Collaborator Author

I left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure.
I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow custom attention masks or attention functions to be specified?
It seems the transformers framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS.

@kimishpatel
Contributor

I left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure. I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow custom attention masks or attention functions to be specified? It seems the transformers framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS.

Yes. I pointed to some examples of that, but note that it won't allow you to do some of the things you may be doing, like inserting R1/R3 etc., at least to my understanding. If you can do that using the attention customization interface, that's great.
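For reference, the kind of attention customization being discussed might look roughly like the following. This is a sketch only: it assumes the AttentionInterface registration API available in recent transformers releases, and the function body is plain SDPA-style attention rather than anything QNN-specific; exact signatures may differ.

```python
import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def qnn_friendly_attention(module, query, key, value, attention_mask,
                           scaling=None, dropout=0.0, **kwargs):
    # query/key/value: [batch, heads, seq, head_dim]; attention_mask is additive.
    scale = scaling if scaling is not None else query.shape[-1] ** -0.5
    attn = torch.matmul(query, key.transpose(-1, -2)) * scale
    if attention_mask is not None:
        attn = attn + attention_mask
    attn = torch.softmax(attn, dim=-1)
    out = torch.matmul(attn, value)
    return out.transpose(1, 2).contiguous(), attn

# Hypothetical registration; "qnn_sdpa" is an arbitrary name chosen for illustration.
AttentionInterface.register("qnn_sdpa", qnn_friendly_attention)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", attn_implementation="qnn_sdpa"
)
```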

@shewu-quic
Collaborator Author

A couple of other questions I have:

  1. What is the performance like compared to a static_llama-like solution?
  2. How is the KV cache managed? If it is quantized, how are you doing quantized cache updates?

  1. We are currently working on this item, which evaluates PPL and performance. We will get back to you with the results as soon as possible.
  2. For the transformers path, the KV cache is managed with the mutable-buffer mechanism, using index_put to update the cache on each turn. For the quantized cache, we avoid inserting Q/DQ for the KV cache input and output with the tag_quant_io pass (see the sketch below).
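A minimal sketch of the mutable-buffer style of KV cache update described in item 2 (shapes and names are assumptions for illustration, not the actual cache implementation):

```python
import torch

class StaticKVCache(torch.nn.Module):
    def __init__(self, n_heads: int, max_seq_len: int, head_dim: int):
        super().__init__()
        # Mutable buffers: torch.export keeps these as state that is updated in place.
        self.register_buffer("k_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))
        self.register_buffer("v_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))

    def update(self, cache_position: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # cache_position: [seq_len] positions of the new tokens;
        # k, v: [1, n_heads, seq_len, head_dim]. The slice assignment lowers to
        # aten.index_put during export, mutating the buffers on each decode turn.
        self.k_cache[:, :, cache_position] = k
        self.v_cache[:, :, cache_position] = v
        return self.k_cache, self.v_cache
```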

@cccclai
Contributor

cccclai commented Jul 15, 2025

One more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround to change the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

This should be fixed with the latest tokenizer https://github.com/pytorch-labs/tokenizers

@shewu-quic
Collaborator Author

shewu-quic commented Jul 17, 2025

Hi @kimishpatel and @cccclai,

I have pushed a commit that leverages transformers APIs such as AttentionInterface, AttentionMaskInterface, and TorchExportableModuleForDecoderOnlyLM to make the decoder model QNN-friendly without altering the model structure.
Could you please let me know if this approach meets your expectations for the decoder-only model?
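Roughly, the export path being described might be wired up as below. This is a sketch only: the wrapper's constructor arguments and example inputs are assumptions, and the real script additionally runs the QNN quantizer, the custom attention/mask registration, and the to_backend lowering.

```python
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations.executorch import TorchExportableModuleForDecoderOnlyLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.eval()

# Wrap the decoder-only LM so that a static KV cache is handled inside the module.
wrapper = TorchExportableModuleForDecoderOnlyLM(model, max_batch_size=1, max_cache_len=128)

# Single decode step: one token id plus its cache position.
input_ids = torch.zeros(1, 1, dtype=torch.long)
cache_position = torch.zeros(1, dtype=torch.long)

exported = torch.export.export(wrapper, (input_ids, cache_position))
# ...followed by the QNN quantize / partition / to_backend steps from the example script.
```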

Results

It can be fully delegated and produces reasonable results with Qwen 2.5 0.5B.
The result below was generated by Qwen 2.5 0.5B without R3, with seq_len = 128, device = SM8750, quant_config = 16a8w.

I 00:00:00.900210 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.289062 MiB (0 if unsupported)
I 00:00:00.900500 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.949226 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.289062 MiB (0 if unsupported)
I 00:00:01.851722 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.289062 MiB (0 if unsupported)
I 00:00:01.851748 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.851755 executorch:stats.h:114] 	Model Load Time:		0.516000 (seconds)
I 00:00:01.851763 executorch:stats.h:124] 	Total inference time:		0.952000 (seconds)		 Rate: 	127.100840 (tokens/second)
I 00:00:01.851769 executorch:stats.h:132] 		Prompt evaluation:	0.049000 (seconds)		 Rate: 	122.448980 (tokens/second)
I 00:00:01.851777 executorch:stats.h:143] 		Generated 121 tokens:	0.903000 (seconds)		 Rate: 	133.997785 (tokens/second)
I 00:00:01.851785 executorch:stats.h:151] 	Time to first generated token:	0.049000 (seconds)
I 00:00:01.851792 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.012000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.5 MB/s (893 bytes in 0.002s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love the taste of it, and I love the fact that it is so easy to make. I make it in a couple of ways. The first way is to make a batch of iced tea at the end of the day and put it in the fridge. The second way is to make it in advance and put it in the freezer. I like the second way because it is much easier to make. I make it in advance by putting the tea in a large pitcher and adding the tea leaves. I add the sugar and the milk and stir it all together. I then put the

Validation of the PTE on wikitext (limit = 1):

  • PPL of the original nn.Module: 49
  • PPL of the QDQ module: 51
  • PPL of the QNN-delegated module on device: 51

Reproduce command

# export command
python3 examples/qualcomm/oss_scripts/qwen/qwen.py -s <serial> -H <host> -m SM8750 -a qwen2 --prompt "My favourite condiment is " -b build-android --decoder_model qwen2.5_0.5B --calibration_tasks wikitext --calibration_limit 1 --ptq 16a8w

# eval command
python3 examples/qualcomm/oss_scripts/qwen/eval_qwen_qnn.py -s <serial> -H <host> -m SM8750 -a qwen2 -b build-android --limit 1 --tokenizer_path qwen2/tokenizer.json --pte qwen2/qwen_qnn_q16.pte --logits_quant_attr_path qwen2/qwen_qnn_q16_quant_attrs.txt

@shewu-quic shewu-quic force-pushed the dev1/hutton/ga_qwen2.5_0.5B branch from 381bf9b to 242b602 Compare August 5, 2025 07:59
@shewu-quic shewu-quic marked this pull request as ready for review August 5, 2025 07:59
@shewu-quic
Collaborator Author

Can you delete this line https://github.com/pytorch/executorch/blob/main/examples/models/llama/TARGETS#L115 since you deleted this file?

Thanks for pointing that out!

The test_qnn_backend_linear_block unit test also seems to be failing

It appears that graph prepare fails in QNN 2.28, but this issue has been resolved in QNN 2.37.
Should I skip this test until the QNN SDK is updated to version 2.37?

@cccclai
Contributor

cccclai commented Aug 18, 2025

Can you delete this line https://github.com/pytorch/executorch/blob/main/examples/models/llama/TARGETS#L115 since you deleted this file?

Thanks for pointing that out!

The test_qnn_backend_linear_block unit test also seems to be failing

It appears that graph prepare fails in QNN 2.28, but this issue has been resolved in QNN 2.37. Should I skip this test until the QNN SDK is updated to version 2.37?

I see, can you mark the test to skip?

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

@cccclai
Contributor

cccclai commented Aug 18, 2025

Can you rebase again?

@shewu-quic shewu-quic force-pushed the dev1/hutton/ga_qwen2.5_0.5B branch from 0803d50 to 60a081a Compare August 19, 2025 06:58
@shewu-quic
Collaborator Author

I see, can you mark the test to skip?

I have rebased and added a skip-test macro for this unit test. Based on my testing, it seems to pass after QNN 2.30.
Thanks.
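For reference, the skip is essentially a version gate; a minimal sketch (the version constant here is a hypothetical stand-in for however the test suite actually detects the QNN SDK version):

```python
import unittest

# Hypothetical stand-in; the real suite queries the QNN SDK version differently.
QNN_SDK_VERSION = (2, 28)

class TestQNNQuantizedOperator(unittest.TestCase):
    @unittest.skipIf(
        QNN_SDK_VERSION < (2, 30),
        "graph prepare for linear_block fails before QNN 2.30",
    )
    def test_qnn_backend_linear_block(self):
        ...  # original test body unchanged
```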

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

@cccclai
Contributor

cccclai commented Aug 19, 2025

Actually, there is some internal code referring to the RMSNorm pass... can you restore the change?

@shewu-quic shewu-quic force-pushed the dev1/hutton/ga_qwen2.5_0.5B branch 2 times, most recently from bda8231 to 11543a4 Compare August 21, 2025 06:46
@shewu-quic
Collaborator Author

Rebase completed.

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

Summary:

- Support decoder-only models from Hugging Face
  - Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated in the QNN backend
  - Add llm manager
  - Leverage AttentionMaskInterface and AttentionInterface without touching the model structure
  - Add an eval script to evaluate PPL on device
- Support Qwen 2.5
  - Add an e2e script to run Qwen 2.5
  - Support SpinQuant R3
  - Support enable_x86_64 in qwen
  - Add unittest for qwen2_5
  - Reuse the executorch llama runner, llama_main
- Move some source transformations to passes
  - Add a pass to convert linear to conv2d during transform_for_export_pipeline (see the sketch below)
  - Support recomposing RMSNorm via pattern matching
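As context for the linear-to-conv2d item above, the underlying equivalence is a 1x1 convolution over a 1x1 spatial map. A sketch of the math only (not the actual graph pass):

```python
import torch
import torch.nn.functional as F

def linear_as_conv2d(x: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
    # x: [N, in_features]; weight: [out_features, in_features]
    out = F.conv2d(
        x.view(*x.shape, 1, 1),            # [N, in, 1, 1]
        weight.view(*weight.shape, 1, 1),  # [out, in, 1, 1]
        bias,
    )
    return out.view(x.shape[0], -1)        # back to [N, out_features]

# Sanity check of the equivalence.
x, w, b = torch.randn(2, 8), torch.randn(4, 8), torch.randn(4)
assert torch.allclose(linear_as_conv2d(x, w, b), F.linear(x, w, b), atol=1e-5)
```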
@shewu-quic shewu-quic force-pushed the dev1/hutton/ga_qwen2.5_0.5B branch from 11543a4 to 30d036f Compare August 28, 2025 05:31
@shewu-quic
Collaborator Author

Rebase completed.

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79702302.

@cccclai cccclai merged commit 216a1ec into pytorch:main Aug 28, 2025
245 of 255 checks passed
@shoumikhin
Contributor

@shewu-quic please take a look at whether the newly introduced CI failures are related to this change.
See test-llama-runner-mac failing jobs at https://hud.pytorch.org/pytorch/executorch/commit/216a1ecce60a5367b4bc4d82dbb3ce750d7bee54
A snippet from the logs:

2025-08-28T21:59:44.3757840Z flatccrt library is not found.
2025-08-28T21:59:44.3758040Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3758260Z etdump library is not found.
2025-08-28T21:59:44.3758460Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3758690Z bundled_program library is not found.
2025-08-28T21:59:44.3758910Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3759150Z neuron_backend library is not found.
2025-08-28T21:59:44.3759370Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3759600Z qnn_executorch_backend library is not found.
2025-08-28T21:59:44.3759830Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3760060Z extension_evalue_util library is not found.
2025-08-28T21:59:44.3760300Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3760520Z extension_runner_util library is not found.
2025-08-28T21:59:44.3760750Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3760970Z extension_training library is not found.
2025-08-28T21:59:44.3761180Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3761400Z vulkan_backend library is not found.
2025-08-28T21:59:44.3761610Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3761830Z quantized_ops_aot_lib library is not found.
2025-08-28T21:59:44.3762060Z              If needed rebuild with the proper options in CMakeLists.txt
2025-08-28T21:59:44.3762270Z -- Configuring done (0.4s)
2025-08-28T21:59:44.3762410Z -- Generating done (0.1s)
2025-08-28T21:59:44.3762540Z CMake Warning:
2025-08-28T21:59:44.3762710Z   Manually-specified variables were not used by the project:
2025-08-28T21:59:44.3762880Z 
2025-08-28T21:59:44.3762920Z     BUILD_TESTING
2025-08-28T21:59:44.3763000Z 
2025-08-28T21:59:44.3763000Z 
2025-08-28T21:59:44.3767600Z -- Build files have been written to: /Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/cmake-out/examples/models/llama
2025-08-28T21:59:44.3768080Z + cmake --build cmake-out/examples/models/llama -j9 --config Release
2025-08-28T21:59:44.3768430Z [ 40%] Building CXX object runner/CMakeFiles/llama_runner.dir/__/tokenizer/llama_tiktoken.cpp.o
2025-08-28T21:59:44.3768790Z [ 40%] Building CXX object runner/CMakeFiles/llama_runner.dir/runner.cpp.o
2025-08-28T21:59:44.3769110Z [ 60%] Linking CXX static library libllama_runner.a
2025-08-28T21:59:44.3769310Z [ 60%] Built target llama_runner
2025-08-28T21:59:44.3769550Z [ 80%] Building CXX object CMakeFiles/llama_main.dir/main.cpp.o
2025-08-28T21:59:44.3769770Z [100%] Linking CXX executable llama_main
2025-08-28T21:59:44.3769940Z ld: unknown options: -rpath=$ORIGIN 
2025-08-28T21:59:44.3770200Z clang: error: linker command failed with exit code 1 (use -v to see invocation)
2025-08-28T21:59:44.3770450Z make[2]: *** [llama_main] Error 1
2025-08-28T21:59:44.3770630Z make[1]: *** [CMakeFiles/llama_main.dir/all] Error 2
2025-08-28T21:59:44.3770810Z make: *** [all] Error 2
2025-08-28T21:59:44.3771290Z ERROR conda.cli.main_run:execute(125): `conda run bash .ci/scripts/test_llama.sh -model stories110M -build_tool cmake -dtype fp32 -mode xnnpack+custom+quantize_kv` failed. (See above for error)

@shewu-quic
Collaborator Author

Thank you for noticing that.
It appears that $ORIGIN isn't supported by the macOS linker. I have created a fix PR to address this issue.

jackzhxng added a commit that referenced this pull request Sep 2, 2025
Fix for #12333, which broke Mac runner tests on trunk, since `$ORIGIN` is Linux-only.