
[ANDROID][ADRENO] Various enhancements for Adreno target #1536

Merged 1 commit into mlc-ai:main on Feb 14, 2024

Conversation

srkreddy1238
Contributor

This improves the Adreno dispatches for better utilization of hardware resources.
It also changes the network to avoid unnecessary transposes that are not required for Adreno.

The decode performance is improved up to:
Snapdragon Gen 2: 11 toks/sec
Snapdragon Gen 3: 14 toks/sec

…erformance

This improves the Adreno dispatches for better utilization of hardware resources.
It also changes the network to avoid unnecessary transposes that are not required for Adreno.

The performance on Snapdragon Gen 2 is now improved up to:

Prefill: 30 toks/sec
Decode : 11 toks/sec
@Nick-infinity

Thank you for your contribution, Siva. I think the relax model is not used directly anymore.
There is a new compilation flow for models:

# Step 1. Create the quantized model (SLM flow; command and flags approximate)
mlc_chat convert_weight <model-path> --quantization q4f16_1 -o <output-dir>

I am unable to invoke your changes, as that flow is no longer called for model conversion.

@Hzfengsy
Member

Hzfengsy commented Jan 4, 2024

Thanks @srkreddy1238 for the impressive work! The performance is excellent, but I wonder if the schedule optimization could be generalized to other models. In other words, could we update DLight rules instead of dispatching TIR kernels?
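(For context: DLight applies generic schedule rules to whatever TIR it pattern-matches, rather than replacing a specific model's kernels. Below is a minimal sketch of the rule-based approach, assuming TVM's tvm.dlight module and its stock GPU rules; nothing in it comes from this PR.)

import tvm
from tvm import dlight as dl

def apply_gpu_rules(mod: tvm.IRModule, target: tvm.target.Target) -> tvm.IRModule:
    # Each rule matches a class of TIR PrimFuncs (matmuls, GEMVs, reductions, ...)
    # and schedules whatever it matches, so one rule set generalizes across
    # models, unlike a per-model TIR kernel dispatch.
    with target:
        return dl.ApplyDefaultSchedule(
            dl.gpu.Matmul(),
            dl.gpu.GEMV(),
            dl.gpu.Reduction(),
            dl.gpu.Fallback(),
        )(mod)

A dispatch-based approach, by contrast, swaps in hand-scheduled kernels keyed on one model's shapes, which is faster to land but does not carry over to other models.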

@srkreddy1238
Contributor Author

Thanks @Hzfengsy
DLight rules for Adreno are a WIP (agreed, this is important for generalizing the enhancements to all models). We also have a few Adreno-specific transforms that need to be added to the new compilation flow while making sure the performance is retained.

This PR is to address LLaMa (V1 & V2) performance needs at the moment.

@junrushao
Member

I think we should get it in because it's an enhancement :)

@junrushao
Member

This is a great addition to the old workflow, and I very much look forward to seeing such significant performance enhancements on the new one too!

@tqchen
Contributor

tqchen commented Jan 5, 2024

thanks @srkreddy1238! We are moving towards the new SLM-based pipeline; do you mind checking it out and confirming whether the approach works on SLM? Also cc @Kartik14, see #1494.

I think one takeaway is that the dispatch-based approach is still useful for quickly improving the performance of specific models, and we also need a clear guide for doing so in the SLM pipeline.

@srkreddy1238
Contributor Author

thanks @srkreddy1238! We are moving towards the new SLM-based pipeline; do you mind checking it out and confirming whether the approach works on SLM? Also cc @Kartik14, see #1494.

Sure, I will check and confirm.

@tqchen
Contributor

tqchen commented Jan 10, 2024

thanks @srkreddy1238, let me know how it goes

@tqchen tqchen merged commit 5953cce into mlc-ai:main Feb 14, 2024
@tqchen
Contributor

tqchen commented Feb 14, 2024

Merging for now so there is a record in main.

We will likely migrate to SLM very soon, which means we will phase out the old mlc_llm build flow. @srkreddy1238, it would be great to follow up with a PR for the latest SLM flow under python/.

MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Feb 19, 2024

This PR fixes the Llama compilation failure in the old flow:

* updating the uses of `relax.call_inplace_packed` and other ops to follow their signatures in TVM;
* fixing the typo in the attention computation introduced by mlc-ai#1536.
MasterJH5574 added a commit that referenced this pull request Feb 19, 2024 (same fix as above).
@hpcer

hpcer commented Mar 19, 2024

This improves the Adreno dispatches for better utilization of hardware resources. It also changes the network to avoid unnecessary transposes that are not required for Adreno.

The decode performance is improved up to:
Snapdragon Gen 2: 11 toks/sec
Snapdragon Gen 3: 14 toks/sec

Hello, what quantization scheme was used for testing this set of data, and what is the group size of the quantization?

@srkreddy1238
Contributor Author

@hpcer we used q4f16_0 (group size 32)
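(For context, a minimal NumPy sketch of group-wise 4-bit quantization with group size 32. It illustrates the general idea only; the exact q4f16_0 bit packing, scale storage, and zero-point handling in mlc-llm may differ.)

import numpy as np

def quantize_group(w: np.ndarray):
    # w: one group of 32 weights; each group stores 4-bit ints plus one fp16 scale.
    scale = max(np.abs(w).max() / 7.0, 1e-8)                  # guard all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # signed 4-bit range
    return q, np.float16(scale)

def dequantize_group(q: np.ndarray, scale: np.float16) -> np.ndarray:
    # Runtime reconstruction: int4 value times the per-group fp16 scale.
    return q.astype(np.float16) * scale

w = np.random.randn(32).astype(np.float32)
q, s = quantize_group(w)
w_hat = dequantize_group(q, s)   # approximate reconstruction of w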
