
[ANDROID][ADRENO] Various enhancements for Adreno target #1536

Merged 1 commit into mlc-ai:main on Feb 14, 2024

Conversation

srkreddy1238
Contributor

This improves the Adreno dispatches for better utilization of hardware resources.
It also changes the network to avoid unnecessary transposes that are not required for Adreno.

The decode performance is improved up to:
Snapdragon Gen 2: 11 toks/sec
Snapdragon Gen 3: 14 toks/sec

…erformance

This improves the Adreno dispatches for better utilization of hardware resources.
It also changes the network to avoid unnecessary transposes that are not required for Adreno.

The performance on Snapdragon Gen 2 is now improved up to:

Prefill: 30 toks/sec
Decode : 11 toks/sec
@Nick-infinity

Thank you for your contribution, Siva. I think the relax model is not used directly anymore.
There is a new compilation flow for models:

# Step 1. Create the quantized model (SLM flow; command and flags approximate)
mlc_chat convert_weight <model-path> --quantization q4f16_1 -o <output-dir>

I am unable to invoke your changes, as that flow is no longer called for model conversion.

@Hzfengsy
Member

Hzfengsy commented Jan 4, 2024

Thanks @srkreddy1238 for the impressive work! The performance is excellent, but I wonder if the schedule optimization could be generalized to other models. In other words, could we update DLight rules instead of dispatching TIR kernels?
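(For context: DLight applies generic schedule rules to whatever TIR it pattern-matches, rather than replacing a specific model's kernels. Below is a minimal sketch of the rule-based approach, assuming TVM's tvm.dlight module and its stock GPU rules; nothing in it comes from this PR.)

import tvm
from tvm import dlight as dl

def apply_gpu_rules(mod: tvm.IRModule, target: tvm.target.Target) -> tvm.IRModule:
    # Each rule matches a class of TIR PrimFuncs (matmuls, GEMVs, reductions, ...)
    # and schedules whatever it matches, so one rule set generalizes across
    # models, unlike a per-model TIR kernel dispatch.
    with target:
        return dl.ApplyDefaultSchedule(
            dl.gpu.Matmul(),
            dl.gpu.GEMV(),
            dl.gpu.Reduction(),
            dl.gpu.Fallback(),
        )(mod)

A dispatch-based approach, by contrast, swaps in hand-scheduled kernels keyed on one model's shapes, which is faster to land but does not carry over to other models.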

@srkreddy1238
Contributor Author

Thanks @Hzfengsy
DLight rules for Adreno are a WIP (agreed, this is important for generalizing the enhancements to all models). We also have a few Adreno-specific transforms that need to be added to the new compilation flow while making sure the performance is retained.

This PR is to address LLaMa (V1 & V2) performance needs at the moment.

@junrushao
Member

I think we should get it in because it's an enhancement :)

@junrushao
Member

This is a great addition to the old workflow, and I very much look forward to seeing such significant performance enhancements on the new one too!

@tqchen
Contributor

tqchen commented Jan 5, 2024

thanks @srkreddy1238! We are moving towards the new SLM-based pipeline; do you mind checking it out and confirming whether the approach works on SLM? Also cc @Kartik14, see #1494.

I think one takeaway is that the dispatch-based approach is still useful for quickly improving the performance of specific models, and we also need a clear guide for doing so in the SLM pipeline.

@srkreddy1238
Contributor Author

thanks @srkreddy1238! We are moving towards the new SLM-based pipeline; do you mind checking it out and confirming whether the approach works on SLM? Also cc @Kartik14, see #1494.

Sure, I will check and confirm.

@tqchen
Contributor

tqchen commented Jan 10, 2024

thanks @srkreddy1238, let me know how it goes

@tqchen tqchen merged commit 5953cce into mlc-ai:main Feb 14, 2024
@tqchen
Contributor

tqchen commented Feb 14, 2024

Merging for now so there is a record in main.

We will likely migrate to SLM very soon, which means we will phase out the old mlc_llm build flow. @srkreddy1238, it would be great to follow up with a PR for the latest SLM flow under python/.

MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Feb 19, 2024

This PR fixes the Llama compilation failure in the old flow:

* updating the uses of `relax.call_inplace_packed` and other ops to follow their signatures in TVM;
* fixing the typo in the attention computation introduced by mlc-ai#1536.
MasterJH5574 added a commit that referenced this pull request Feb 19, 2024 (same fix as above).
@hpcer

hpcer commented Mar 19, 2024

This improves the Adreno dispatches for better utilization of hardware resources. It also changes the network to avoid unnecessary transposes that are not required for Adreno.

The decode performance is improved up to:
Snapdragon Gen 2: 11 toks/sec
Snapdragon Gen 3: 14 toks/sec

Hello, what quantization scheme was used for testing this set of data, and what is the group size of the quantization?

@srkreddy1238
Contributor Author

@hpcer we used q4f16_0 (group size 32)
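(For context, a minimal NumPy sketch of group-wise 4-bit quantization with group size 32. It illustrates the general idea only; the exact q4f16_0 bit packing, scale storage, and zero-point handling in mlc-llm may differ.)

import numpy as np

def quantize_group(w: np.ndarray):
    # w: one group of 32 weights; each group stores 4-bit ints plus one fp16 scale.
    scale = max(np.abs(w).max() / 7.0, 1e-8)                  # guard all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # signed 4-bit range
    return q, np.float16(scale)

def dequantize_group(q: np.ndarray, scale: np.float16) -> np.ndarray:
    # Runtime reconstruction: int4 value times the per-group fp16 scale.
    return q.astype(np.float16) * scale

w = np.random.randn(32).astype(np.float32)
q, s = quantize_group(w)
w_hat = dequantize_group(q, s)   # approximate reconstruction of w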
