[Question][Android] Lower speed and GPU usage with SLM than legacy workflow on Android Adreno GPU #1896
Comments
Similar pattern found in #1886 (Might need double-check before deprecating legacy workflow). Notably, SLM barely reaches 6.5 token/s there.
I can confirm that I experience this as well.
The SLM flow uses the Paged Attention kernel, which causes the performance regression since it is not yet tuned for Android.
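For background on the kernel mentioned above: paged attention stores the KV cache in fixed-size pages and indirects through a block table, rather than keeping one contiguous buffer per sequence. The sketch below is illustrative only (page size, names, and the NumPy gather are assumptions for exposition, not MLC-LLM's actual implementation); the extra indirection is one reason such kernels need per-backend tuning.

```python
import numpy as np

PAGE_SIZE = 16  # tokens per physical page (illustrative choice)

class PagedKVCache:
    """Minimal paged KV cache for a single attention head."""

    def __init__(self, num_pages, head_dim):
        # Physical page pool for keys and values.
        self.k = np.zeros((num_pages, PAGE_SIZE, head_dim))
        self.v = np.zeros((num_pages, PAGE_SIZE, head_dim))
        self.free = list(range(num_pages))  # free physical pages
        self.block_table = []               # logical page -> physical page
        self.length = 0                     # tokens stored so far

    def append(self, k_vec, v_vec):
        """Append one token's K/V, allocating a new page when needed."""
        slot = self.length % PAGE_SIZE
        if slot == 0:
            self.block_table.append(self.free.pop())
        page = self.block_table[-1]
        self.k[page, slot] = k_vec
        self.v[page, slot] = v_vec
        self.length += 1

    def gather(self):
        """Materialize the logically contiguous K/V via the block table."""
        pages = self.block_table
        k = self.k[pages].reshape(-1, self.k.shape[-1])[: self.length]
        v = self.v[pages].reshape(-1, self.v.shape[-1])[: self.length]
        return k, v
```

A real kernel fuses this gather into the attention computation instead of materializing K/V, which is exactly the part that must be tuned per GPU.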
The attention regression should be fixed by #1915. In the meantime, given that all the known gaps are covered, we will proceed with the deprecation so that follow-up steps can move forward. @srkreddy1238, it would be great if you could also help bring over some of the
Thank you @sbwww for reporting; would love to continue working together on improving the new flow.
We find q4f16_0 to be more convenient for Adreno (though we initially tried improving q4f16_1). The q4f16_0-compatible dlight schedules (GEMV, MatMul) are now improved. I feel the heat with so many options here; will start sending out the internal Adreno optimizations soon.
Love to see these land, @srkreddy1238! I know there can be slight setbacks due to the migration, but hopefully the new engine will offer a path toward more useful things like speculation and more.
Here are the PRs for Adreno improvements with the SLM flow. #2215: Enable OpenCL Host Ptr usage for Android builds. All these changes put together can push decode performance up to 40% above the current baseline on Snapdragon Gen 3. Requesting review.
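As a sanity check on a cumulative figure like "up to 40%": when several independent kernel-level changes each speed up decode, their combined effect composes roughly multiplicatively. The per-PR numbers below are hypothetical (the issue does not give a breakdown); this only illustrates how individual gains could add up to ~1.4x.

```python
def combined_speedup(individual):
    """Compose per-change speedup factors, assuming they are independent
    and multiply (an approximation; overlapping optimizations compose less)."""
    total = 1.0
    for s in individual:
        total *= s
    return total

# Hypothetical per-PR decode gains of 15%, 12%, and 8%:
print(combined_speedup([1.15, 1.12, 1.08]))  # roughly 1.39x overall
```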
❓ General Questions
I tried both workflows, `mlc_llm.build` (legacy) and `mlc_chat compile` (SLM), to compile and deploy the Llama2 7B `q4f16_1` model on a Qualcomm 8gen3 device. With the same input (42 tokens prefilled), the decoding speed diverges between legacy and SLM: SLM is ~10% slower than legacy.
I profiled the GPU usage with https://kite.mi.com/ as follows. It seems that the model compiled with SLM (~80%) has lower GPU usage than legacy (~87%) at the decoding stage.
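For anyone reproducing the ~10% figure, decode throughput can be measured by timing a fixed number of single-token decode steps. The harness below is a generic sketch (the `decode_step` callable is a stand-in, not an MLC-LLM API):

```python
import time

def measure_decode_tps(decode_step, num_tokens=128):
    """Time `decode_step` over `num_tokens` iterations and return tokens/s."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_step()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Usage sketch (engine objects are hypothetical placeholders):
#   tps_legacy = measure_decode_tps(legacy_engine.decode_one)
#   tps_slm    = measure_decode_tps(slm_engine.decode_one)
#   slowdown   = 1 - tps_slm / tps_legacy   # ~0.10 in the report above
```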
Possibly related to:
- `mlc_chat convert_weight` has errors with q4f16_ft quantization #1723 (comment): slower decoding speed with SLM
- enhanced Android implementation with legacy, but not yet integrated into SLM
I assume that #1536 contributes to the faster speed of the legacy workflow. Haven't tested `q4f16_0` quantization yet.