[Question][Android] Lower speed and GPU usage with SLM than legacy workflow on Android Adreno GPU #1896

Closed · sbwww opened this issue Mar 6, 2024 · 8 comments · Label: question (Question about the usage)

sbwww commented Mar 6, 2024

❓ General Questions

I tried both workflows, mlc_llm.build (legacy) and mlc_chat compile (SLM), to compile and deploy the Llama2 7B q4f16_1 model on a Qualcomm Snapdragon 8 Gen 3 device.
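
For context, here is a sketch of the two build flows being compared, based on the MLC-LLM docs around that time; the exact flags, paths, and output names are illustrative and may differ across versions:

```shell
# Legacy flow: one-step build via the mlc_llm.build module
python3 -m mlc_llm.build --model Llama-2-7b-chat-hf \
    --quantization q4f16_1 --target android

# SLM flow: convert weights, generate config, then compile the model library
mlc_chat convert_weight ./dist/models/Llama-2-7b-chat-hf \
    --quantization q4f16_1 -o ./dist/Llama-2-7b-q4f16_1-MLC
mlc_chat gen_config ./dist/models/Llama-2-7b-chat-hf \
    --quantization q4f16_1 --conv-template llama-2 \
    -o ./dist/Llama-2-7b-q4f16_1-MLC
mlc_chat compile ./dist/Llama-2-7b-q4f16_1-MLC/mlc-chat-config.json \
    --device android -o ./dist/libs/Llama-2-7b-q4f16_1-android.tar
```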

With the same input (42 tokens prefilled), decoding speed diverges between the two workflows: SLM is ~10% slower than legacy.

I profiled the GPU usage with https://kite.mi.com/ as follows. It seems that the model compiled with SLM shows lower GPU usage (~80%) than legacy (~87%) during the decode stage.

|        | legacy               | SLM                 |
| ------ | -------------------- | ------------------- |
| decode | 425 tokens, 10.1 t/s | 254 tokens, 9.0 t/s |

[GPU usage profiler screenshots for legacy and SLM]

Possibly related: I assume that #1536 contributes to the faster speed of the legacy workflow. Haven't tested q4f16_0 quantization yet.

sbwww added the question label Mar 6, 2024
sbwww (Author) commented Mar 6, 2024

> Haven't tested q4f16_0 quantization yet.

A similar pattern shows up with q4f16_0 quantization: SLM GPU usage is ~83%, legacy is ~89%.

Notably, SLM barely reaches 6.5 token/s with q4f16_0, while legacy reaches 11.5 token/s (prefill 42 tokens + decode 300 tokens).


We might need to double-check this before deprecating the legacy workflow (#1886).

neobaud commented Mar 6, 2024

I can confirm that I experience this also.

spectrometerHBH (Member) commented Mar 9, 2024

The SLM flow uses the PagedAttention kernel, which causes the perf regression since it's not tuned for Android.

#1915
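
For intuition about the regression, here is a minimal NumPy sketch (not MLC's actual TIR kernel) of single-head decode attention over a paged KV cache. PAGE_SIZE, the layouts, and the function itself are hypothetical; the point is the page-table gather, an indirection that a schedule untuned for Adreno pays for on every decode step:

```python
import numpy as np

PAGE_SIZE = 16  # tokens per KV page (hypothetical block size)

def paged_decode_attention(q, k_pages, v_pages, page_table, seq_len):
    """q: (head_dim,); k_pages, v_pages: (num_phys_pages, PAGE_SIZE, head_dim);
    page_table maps logical page index -> physical page index."""
    n_pages = (seq_len + PAGE_SIZE - 1) // PAGE_SIZE
    # The extra indirection vs. a contiguous cache: gather K/V page by page.
    k = np.concatenate([k_pages[page_table[i]] for i in range(n_pages)])[:seq_len]
    v = np.concatenate([v_pages[page_table[i]] for i in range(n_pages)])[:seq_len]
    scores = k @ q / np.sqrt(q.shape[0])  # (seq_len,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over past tokens
    return w @ v                          # (head_dim,)
```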

tqchen (Contributor) commented Mar 11, 2024

The attention regression should be fixed by #1915. In the meantime, given that all the known gaps are addressed, we will proceed with the deprecation so that follow-up steps can move forward. @srkreddy1238, it would be great if you could also help bring some of the q4f16_0 optimizations to the new flow.

sbwww closed this as completed Mar 12, 2024
tqchen (Contributor) commented Mar 12, 2024

Thank you @sbwww for reporting. We'd love to continue working together on improving the new flow.

srkreddy1238 (Contributor) commented

We find q4f16_0 more convenient for Adreno (though we initially tried improving q4f16_1). The q4f16_0-compatible dlight schedules (GEMV, MatMul) are now improved.
We fell short of earlier performance with PagedAttn regressing. Let me give #1915 a try.

I feel the heat with so many options here; I will start sending out the internal optimizations for Adreno soon.

tqchen (Contributor) commented Apr 2, 2024

Love to see these land, @srkreddy1238! I know there can be slight setbacks due to the migration, but hopefully the new engine will offer a path toward more useful things like speculative decoding and more.

srkreddy1238 (Contributor) commented Apr 25, 2024

Here are the PRs for Adreno improvements with the SLM flow:

- #2215: Enable OpenCL host-pointer usage for Android builds (see the sketch after this list)
- #2214: Restore the CLI utility to work with Android targets (no Python here)
- #2216: Thread limit update for Adreno OpenCL
- apache/tvm#16929: Enable HostPtr (memory-mapped) based data copy
- mlc-ai/relax#319: Schedule improvements for the q4f16_0 schema

All these changes put together can push decode performance up by as much as 40% over the current baseline on Snapdragon 8 Gen 3.
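
As an aside on the host-pointer changes (#2215, apache/tvm#16929): on shared-memory SoCs like Adreno, allocating an OpenCL buffer with ALLOC_HOST_PTR and mapping it lets the CPU write through a direct pointer instead of staging a separate copy. Below is a hypothetical pyopencl illustration of that general pattern, not the actual TVM change:

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 1024
# Ask the driver for host-visible memory backing the device buffer.
buf = cl.Buffer(ctx, cl.mem_flags.ALLOC_HOST_PTR, size=n * 4)

# Map the buffer into the host address space and write through it; on a
# unified-memory GPU this avoids a separate clEnqueueWriteBuffer staging copy.
host_view, _event = cl.enqueue_map_buffer(
    queue, buf, cl.map_flags.WRITE, 0, (n,), np.float32)
host_view[:] = np.arange(n, dtype=np.float32)
del host_view  # dropping the mapped array unmaps, handing data to the device
```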

Requesting review.
