[Question][Android] Lower speed and GPU usage with SLM than legacy workflow on Android Adreno GPU #1896
Comments
Similar pattern found in #1886 (Might need double-check before deprecating legacy workflow). Notably, SLM barely reaches 6.5 token/s there.
I can confirm that I experience this as well.
The SLM flow uses the Paged Attention kernel, which causes the performance regression since it is not yet tuned for Android.
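For background on the kernel mentioned above: paged attention stores the KV cache in fixed-size pages and indirects through a block table, rather than keeping one contiguous buffer per sequence. The sketch below is illustrative only (page size, names, and the NumPy gather are assumptions for exposition, not MLC-LLM's actual implementation); the extra indirection is one reason such kernels need per-backend tuning.

```python
import numpy as np

PAGE_SIZE = 16  # tokens per physical page (illustrative choice)

class PagedKVCache:
    """Minimal paged KV cache for a single attention head."""

    def __init__(self, num_pages, head_dim):
        # Physical page pool for keys and values.
        self.k = np.zeros((num_pages, PAGE_SIZE, head_dim))
        self.v = np.zeros((num_pages, PAGE_SIZE, head_dim))
        self.free = list(range(num_pages))  # free physical pages
        self.block_table = []               # logical page -> physical page
        self.length = 0                     # tokens stored so far

    def append(self, k_vec, v_vec):
        """Append one token's K/V, allocating a new page when needed."""
        slot = self.length % PAGE_SIZE
        if slot == 0:
            self.block_table.append(self.free.pop())
        page = self.block_table[-1]
        self.k[page, slot] = k_vec
        self.v[page, slot] = v_vec
        self.length += 1

    def gather(self):
        """Materialize the logically contiguous K/V via the block table."""
        pages = self.block_table
        k = self.k[pages].reshape(-1, self.k.shape[-1])[: self.length]
        v = self.v[pages].reshape(-1, self.v.shape[-1])[: self.length]
        return k, v
```

A real kernel fuses this gather into the attention computation instead of materializing K/V, which is exactly the part that must be tuned per GPU.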
The attention regression should be fixed by #1915. In the meantime, given that all the known gaps are covered, we will proceed with the deprecation so that follow-up steps can move forward. @srkreddy1238, it would be great if you could also help bring over some of the
Thank you @sbwww for reporting; would love to continue working together on improving the new flow.
We find q4f16_0 to be more convenient for Adreno (though we initially tried improving q4f16_1). The q4f16_0-compatible dlight schedules (GEMV, MatMul) are now improved. I feel the heat with so many options here; will start sending out the internal Adreno optimizations soon.
Love to see these land, @srkreddy1238! I know there can be slight setbacks due to the migration, but hopefully the new engine will offer a path toward more useful things like speculation and more.
Here are the PRs for Adreno improvements with the SLM flow. #2215: Enable OpenCL Host Ptr usage for Android builds. All these changes put together can push decode performance up to 40% above the current baseline on Snapdragon Gen 3. Requesting review.
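As a sanity check on a cumulative figure like "up to 40%": when several independent kernel-level changes each speed up decode, their combined effect composes roughly multiplicatively. The per-PR numbers below are hypothetical (the issue does not give a breakdown); this only illustrates how individual gains could add up to ~1.4x.

```python
def combined_speedup(individual):
    """Compose per-change speedup factors, assuming they are independent
    and multiply (an approximation; overlapping optimizations compose less)."""
    total = 1.0
    for s in individual:
        total *= s
    return total

# Hypothetical per-PR decode gains of 15%, 12%, and 8%:
print(combined_speedup([1.15, 1.12, 1.08]))  # roughly 1.39x overall
```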
❓ General Questions
I tried both workflows, `mlc_llm.build` (legacy) and `mlc_chat compile` (SLM), to compile and deploy the Llama2 7B `q4f16_1` model on a Qualcomm 8gen3 device. With the same input (42 tokens prefilled), the decoding speed diverges between legacy and SLM: SLM is ~10% slower than legacy.
I profiled the GPU usage with https://kite.mi.com/ as follows. It seems that the model compiled with SLM (~80%) has lower GPU usage than legacy (~87%) at the decoding stage.
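For anyone reproducing the ~10% figure, decode throughput can be measured by timing a fixed number of single-token decode steps. The harness below is a generic sketch (the `decode_step` callable is a stand-in, not an MLC-LLM API):

```python
import time

def measure_decode_tps(decode_step, num_tokens=128):
    """Time `decode_step` over `num_tokens` iterations and return tokens/s."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_step()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Usage sketch (engine objects are hypothetical placeholders):
#   tps_legacy = measure_decode_tps(legacy_engine.decode_one)
#   tps_slm    = measure_decode_tps(slm_engine.decode_one)
#   slowdown   = 1 - tps_slm / tps_legacy   # ~0.10 in the report above
```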
Possibly related to:
- `mlc_chat convert_weight` has errors with q4f16_ft quantization #1723 (comment): slower decoding speed with SLM
- enhanced Android implementation with legacy, but not yet integrated into SLM
I assume that #1536 contributes to the faster speed of the legacy workflow. Haven't tested `q4f16_0` quantization yet.