[ANDROID][ADRENO] Various enhancements for Adreno target #1536
Conversation
This improves the Adreno dispatches for best utilization of hardware resources. Changes to the network avoid unnecessary transposes, which are not required for Adreno. Performance on Snapdragon Gen 2 is now improved up to: Prefill: 30 toks/sec, Decode: 11 toks/sec
Thank you for your contribution, Siva. I think the relax model is not used directly any more. mlc-llm/python/mlc_chat/interface/compile.py Line 130 in 095858c
I am unable to invoke your changes as that flow is not called for model conversion any more.
Thanks @srkreddy1238 for the impressive work! The performance is excellent, but I wonder if the schedule optimization could be generalized to other models. In other words, could we update DLight rules instead of dispatching TIR kernels?
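The trade-off being discussed (exact-match dispatch versus a generic DLight-style rule) can be sketched in plain Python. This is a conceptual illustration only, with no TVM involved; the dispatch table contents, names, and shapes are hypothetical, not the real MLC-LLM/TVM API:

```python
from typing import Callable, Dict, Tuple

# Conceptual sketch of the trade-off: a dispatch table hand-tunes exact
# workloads, while a generic DLight-style rule must derive a schedule
# for any shape. All names and shapes here are illustrative.

Workload = Tuple[str, Tuple[int, ...]]

# Dispatch approach: a hand-written schedule keyed on an exact workload.
ADRENO_DISPATCH: Dict[Workload, Callable[[], str]] = {
    ("matmul", (1, 4096, 4096)): lambda: "hand-tuned Adreno decode GEMV",
}

def generic_rule(op: str, shape: Tuple[int, ...]) -> str:
    """A generic rule computes a schedule from the workload description."""
    return f"generic schedule for {op} with shape {shape}"

def schedule(op: str, shape: Tuple[int, ...]) -> str:
    """Try an exact-match dispatch first (quick to add, model-specific),
    then fall back to the generic rule (portable across models)."""
    entry = ADRENO_DISPATCH.get((op, shape))
    return entry() if entry else generic_rule(op, shape)
```

The dispatch path is quick to add and can squeeze out model-specific wins (as in this PR), while the generic-rule path benefits every model at once, which is the generalization @Hzfengsy is asking about.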
Thanks @Hzfengsy. This PR is to address LLaMA (V1 & V2) performance needs at the moment.
I think we should get it in because it's an enhancement :)
This is a great addition to the old workflow, and I very much look forward to seeing such significant performance enhancements on the new one too!
thanks @srkreddy1238! We are moving towards the new SLM-based pipeline; do you mind checking that out and confirming whether the approach works on SLM? Also cc @Kartik14, see #1494. I think one takeaway is that the dispatch-based approach is still useful for quickly improving the perf of specific models, and we also need a clear guide on how to do so in the SLM pipeline.
Sure, I will check and confirm. |
thanks @srkreddy1238, let me know how it goes |
merging for now so there is a record in the main. Likely we will migrate to SLM very soon, this means we will phase out |
This PR fixes the Llama compilation failure in the old flow:
* updating the uses of `relax.call_inplace_packed` and other ops to follow the signature in TVM.
* fixing the typo in the attention computation introduced by #1536.
@hpcer we used q4f16_0 (group size 32) |
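The q4f16_0 scheme mentioned above groups weights and stores one fp16 scale per group of 32. Here is a rough, hypothetical NumPy sketch of that structure; the actual MLC-LLM bit-packing format and kernels differ, and this only illustrates the "one fp16 scale per 32 weights" idea:

```python
import numpy as np

# Hypothetical sketch of group-wise 4-bit quantization with group size 32,
# roughly the idea behind a q4f16_0-style scheme. Illustrative only.
GROUP_SIZE = 32

def quantize_q4(w: np.ndarray):
    """Map a flat weight vector to signed 4-bit codes plus fp16 scales."""
    groups = w.reshape(-1, GROUP_SIZE)            # one scale per 32 weights
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # codes in [-7, 7]
    scale[scale == 0] = 1.0                       # avoid divide-by-zero
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate fp32 weights from codes and scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
q, s = quantize_q4(w)
max_err = np.abs(w - dequantize_q4(q, s)).max()  # bounded by scale / 2 per group
```

A smaller group size (32 vs. the more common 128) spends more memory on scales but tracks local weight magnitudes more closely, which is the accuracy/size trade-off the group-size choice controls.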
This improves the Adreno dispatches for best utilization of hardware resources.
Changes to the network avoid unnecessary transposes, which are not required for Adreno.
The decode performance is improved up to
Snapdragon Gen 2 : 11 toks/sec
Snapdragon Gen 3 : 14 toks/sec