Using CLBlast to call the GPU on an Android device: what is the relationship between the ngl parameter and model output correctness? #6562
Comments
Unfortunately, OpenCL for Android under-performs, and yes, the output can even be incorrect: it is likely a memory alignment/padding issue, and you'll likely see wild results. Related: CLBlast is more of an OpenCL library than an actual backend.
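To illustrate the kind of alignment/padding mismatch suspected above (hypothetical numbers only; nothing here is taken from the actual CLBlast or llama.cpp code): if a GPU kernel assumes buffer sizes are rounded up to some alignment boundary but the host allocates the exact size, reads and writes run past valid data. The usual round-up computation can be sketched as:

```shell
# Round a buffer size up to the next multiple of an alignment boundary.
# Sizes and alignment values are illustrative, not from CLBlast.
align_up() {
  local size=$1 align=$2
  echo $(( (size + align - 1) / align * align ))
}

align_up 1000 64   # a 1000-byte buffer padded to a 64-byte boundary -> 1024
align_up 1024 64   # already aligned -> unchanged, 1024
```

If the host and the kernel disagree on the boundary (say, 32 vs. 64 bytes), the last rows of a tensor end up reading garbage, which matches the "wild results" behavior described here.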
@Jeximo Thank you very much for your answer. Can we take it that llama.cpp's support for GPU calls on SoCs is imperfect at present? Or is it that none of the SoC OpenCL drivers currently support LLM-style inference?
Yes, it's imperfect.
Yes. To put it simply, there is a lot of progress still to be made for LLMs on Android GPUs.
This issue was closed because it has been inactive for 14 days since being marked as stale.
With the help of issue #2169, I successfully used CLBlast on my Qualcomm device (Adreno 740 v2) to make GPU calls.
But I found something interesting when I tried model inference. With the model stories260K.gguf, question-and-answer output was normal, but the GPU was hardly used (a utilization of 1% or even 0%).
With the models llama-2-7b-chat.Q4_K_M.gguf and llama-2-7b-chat.Q5_K_S.gguf, I got output, but the output was incorrect. GPU utilization was about 40%.
With the models llama-2-13b-chat.Q2_K.gguf and llama-2-7b-chat.Q2_K.gguf, I got normal, satisfactory responses when the ngl parameter was set to 2, but when I set ngl close to the total number of offloadable layers (for example, 40 of 41), the output went back to random garbage. GPU utilization was around 50% at that point.
When ngl is 2 or 10 (not very large), the answers are fine;
when ngl is set to 40 (out of 41), the answer is ridiculous.
The command I use to run it is as follows:
GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./bin/main -t 8 -m /data/local/tmp/llama_cpu/llama-2-7b-chat.Q4_K_M.gguf --color -c 2048 -ngl 2 --temp 0.7 -n -1 -i -ins
I didn't change any parameters other than ngl and the model.
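Since low ngl values work and high ones break, one low-effort way to narrow this down is to bisect on -ngl: rerun the same prompt while stepping the layer count and note the first value that produces garbage. Below is a dry-run sketch that only prints the command for each step rather than executing it (the binary and model exist only on the device); it swaps the interactive `-i -ins` flags for a fixed prompt via `-p` and a bounded `-n 64` so runs are comparable — the prompt text, step values, and output file names are arbitrary choices, not from the original command:

```shell
# Dry-run sketch: print one run command per -ngl value to sweep.
# Model path matches the command above; prompt/step choices are arbitrary.
MODEL=/data/local/tmp/llama_cpu/llama-2-7b-chat.Q4_K_M.gguf

gen_cmd() {  # print the run command for a given -ngl value
  local ngl=$1
  echo "GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./bin/main -t 8 -m $MODEL --color -c 2048 -ngl $ngl --temp 0.7 -n 64 -p 'Once upon a time' > out_ngl_$ngl.txt"
}

for NGL in 0 2 10 20 30 40 41; do
  gen_cmd "$NGL"   # pipe this output into sh on the device to run each step
done
```

Comparing `out_ngl_0.txt` (pure CPU, known-good) against each GPU run should show whether the output degrades gradually or breaks at one specific layer count, which would help localize a per-layer buffer problem.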
This looks interesting, and I wonder whether CLBlast is making some kind of error in its GPU calls?
Has anyone else run into this situation? I'd like to know which direction to take to track down this error.