Qualcomm AI Engine Direct - GA Qwen 2.5 0.5B #12333
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12333
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures, 5 Unrelated Failures
As of commit 30d036f with merge base dfc387b:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Hi @cccclai @kimishpatel, I am working on supporting decoder-only models from the transformers path.
May I know whether these changes are acceptable? Thanks.
And one more question when I use your runner with Qwen 2.5.
@larryliu0820 on this question
# =====================================================================
outs = self.model(
    input_ids=input_ids,
    attention_mask=attn_mask,
One question: if you have to specify a per-layer mask, how would you?
@guangy10 does the transformers API allow a per-layer mask to be specified here, e.g. as a list of tensors?
)
if quant_dtype == QuantDtype.use_16a4w_block:
    conv_nodes = [
        n for n in fx_graph_module.graph.nodes if "conv" in n.name
Don't you want to check the type of the node or node.target to see whether it is a conv?
Good point. Thanks!
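A minimal sketch of the suggested check, matching on `node.target` instead of a name substring; the op list below is an assumption and should be extended for whichever conv variants the graph actually contains:

```python
import torch

# Match conv nodes by their call target rather than by name substring,
# which can produce false positives (e.g. a node named "deconv_proj").
conv_targets = (
    torch.ops.aten.convolution.default,
    torch.ops.aten.conv2d.default,
)
conv_nodes = [
    n
    for n in fx_graph_module.graph.nodes
    if n.op == "call_function" and n.target in conv_targets
]
```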
const float temperature = 0.0f) {
    int32_t result = 0;
    ET_SWITCH_THREE_TYPES(
    ET_SWITCH_FOUR_TYPES(
@larryliu0820 do these changes seem acceptable?
I left some comments, but I don't think this is the right approach for enabling transformer models; the inline comments describe why.
A couple of other questions I have:
Let me confirm my understanding. The general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure.
Yes. I pointed to some examples of that, but note that it won't allow you to do the kinds of things you may be doing, like inserting R1/R3 etc., at least to my understanding. If you can do that using the attention customization interface, that's great.
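For reference, a minimal sketch of the attention customization hook being discussed. The registration call follows the documented transformers AttentionInterface API, but the callable signature and the selection mechanism should be checked against your transformers version:

```python
import torch
from transformers import AttentionInterface

def qnn_friendly_attention(module, query, key, value, attention_mask, scaling=None, **kwargs):
    # Plain eager attention spelled out with ops a backend like QNN handles
    # well; R1/R3-style rotations could be inserted around the matmuls here
    # without editing the model source.
    if scaling is None:
        scaling = query.shape[-1] ** -0.5
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = torch.softmax(attn_weights, dim=-1)
    attn_output = torch.matmul(attn_weights, value)
    return attn_output.transpose(1, 2).contiguous(), attn_weights

# Register under a name, then select it when loading the model, e.g.
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="qnn_friendly").
AttentionInterface.register("qnn_friendly", qnn_friendly_attention)
```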
This should be fixed with the latest tokenizer: https://github.com/pytorch-labs/tokenizers
Force-pushed from 34838d6 to 381bf9b.
Hi @kimishpatel and @cccclai, I have pushed a commit that leverages transformers APIs such as AttentionInterface, AttentionMaskInterface, and TorchExportableModuleForDecoderOnlyLM to make the decoder model QNN-friendly without altering the model structure.

Results
It can be fully delegated and produces reasonable results with Qwen 2.5 0.5B.
Validate the .pte on wikitext with limit = 1.
Reproduce command
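As a rough illustration of the export flow described above (not the exact code in this PR; the wrapper's constructor and export() signature vary across transformers versions):

```python
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations.executorch import (
    TorchExportableModuleForDecoderOnlyLM,
)

# Wrap the HF model so static-cache handling and masking live in the
# wrapper, leaving the model source untouched.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.eval()
wrapper = TorchExportableModuleForDecoderOnlyLM(model)

# Export a single decode step; the example shapes are illustrative.
input_ids = torch.tensor([[1]], dtype=torch.long)
cache_position = torch.tensor([0], dtype=torch.long)
exported_program = wrapper.export(input_ids=input_ids, cache_position=cache_position)
```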
Force-pushed from 381bf9b to 242b602.
Thanks for pointing that out!
It appears that graph prepare failed with QNN 2.28, but this issue has been resolved in QNN 2.37.
I see, can you mark the test as skipped?
Can you rebase again?
Force-pushed from 0803d50 to 60a081a.
I have rebased and added a skip macro for this unit test. Based on my testing, it passes after QNN 2.30.
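A minimal sketch of such a version gate, assuming a helper that reports the installed QNN SDK version is available to the test suite (the helper name here is hypothetical):

```python
import unittest

def get_qnn_sdk_version():
    # Hypothetical helper: return the installed QNN SDK version as a tuple.
    return (2, 37)

class TestQwen2_5(unittest.TestCase):
    @unittest.skipIf(
        get_qnn_sdk_version() < (2, 30),
        "graph prepare fails on QNN SDK versions below 2.30",
    )
    def test_qwen2_5(self):
        # Placeholder for the actual export/compile check in the suite.
        self.assertTrue(True)
```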
Actually, there is some internal code referring to the RMS norm pass... can you restore the change?
Force-pushed from bda8231 to 11543a4.
Rebase completed.
Summary:
- Support decoder-only models from Hugging Face
- Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated to the QNN backend
- Add LLM manager
- Leverage AttentionMaskInterface and AttentionInterface without touching the model structure
- Add an eval script to evaluate perplexity on device
- Support Qwen 2.5
- Add an e2e script to run Qwen 2.5
- Support SpinQuant R3
- Support enable_x86_64 for Qwen
- Add a unit test for qwen2_5
- Reuse the ExecuTorch llama runner, llama_main
- Move some source transformations to passes
- Add a pass to convert linear to conv2d during transform_for_export_pipeline (see the sketch after this list)
- Support recomposing RMS norm via pattern matching
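For context on the linear-to-conv2d pass, a small sketch of the 1x1-convolution equivalence it relies on (illustrative only, not the actual ExecuTorch pass):

```python
import torch

# A Linear(in, out) is equivalent to a Conv2d(in, out, kernel_size=1)
# once the activations are reshaped to NCHW; some mobile backends
# prefer the conv form.
linear = torch.nn.Linear(64, 128)
conv = torch.nn.Conv2d(64, 128, kernel_size=1)
conv.weight.data = linear.weight.data.view(128, 64, 1, 1)
conv.bias.data = linear.bias.data.clone()

x = torch.randn(2, 16, 64)                 # (batch, seq, features)
y_linear = linear(x)
x_nchw = x.permute(0, 2, 1).unsqueeze(-1)  # (batch, features, seq, 1)
y_conv = conv(x_nchw).squeeze(-1).permute(0, 2, 1)
assert torch.allclose(y_linear, y_conv, atol=1e-5)
```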
Force-pushed from 11543a4 to 30d036f.
Rebase completed.
@shewu-quic please take a look at whether the newly introduced CI failures are related to this change.
Thank you for noticing that.
Fix for #12333, which broke the Mac runner tests on trunk, since `$ORIGIN` is Linux-only.
Summary:
Note that accuracy is currently bad; more investigation is needed.
Reproduce command
Results
7/9
ptq: 16a16w
Speed: 62 tok/sec on SM8750, seq_len = 128
Accuracy: Bad
Outputs:
7/11
ptq: 16a8w
Speed: 135 tok/sec on SM8750, seq_len = 128
Accuracy: Seems better
Outputs:
cc: @winskuo-quic, @haowhsu-quic