[Feature Request] parallel decoding on mobile #4064
Comments
I found that this does work on other systems, such as a Raspberry Pi. Closing, as this appears to be an environment issue.
My understanding is that when this happens, it means there is not enough compute to saturate the memory bandwidth.
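To illustrate the point about compute versus memory bandwidth, here is a rough roofline-style sketch. It is not taken from llama.cpp; the function name and all the numbers are hypothetical, chosen only to show the shape of the effect. It assumes each decode step reads the weights once for the whole batch, while the arithmetic grows with the number of sequences.

```python
import math

def total_tokens_per_second(batch_size, model_bytes, mem_bandwidth,
                            flops_per_token, compute_flops):
    """Toy estimate of total decode throughput (tokens/s) for one batch.

    Assumptions (illustrative only):
      - the weights are read from memory once per step, shared by the batch
      - compute cost grows linearly with the number of sequences
      - the step takes as long as the slower of the two
    """
    mem_time = model_bytes / mem_bandwidth
    compute_time = batch_size * flops_per_token / compute_flops
    step_time = max(mem_time, compute_time)
    return batch_size / step_time

# Hypothetical device figures, not measurements.
MODEL_BYTES = 0.6e9       # quantized weights read per step
MEM_BANDWIDTH = 20e9      # bytes/s
FLOPS_PER_TOKEN = 2.2e9   # arithmetic per generated token

for label, compute in (("low compute", 50e9), ("high compute", 500e9)):
    rates = [total_tokens_per_second(b, MODEL_BYTES, MEM_BANDWIDTH,
                                     FLOPS_PER_TOKEN, compute)
             for b in (1, 2, 4)]
    print(label, [round(r, 1) for r in rates])
```

In this model, the low-compute device is already compute-bound at batch size 1, so a larger batch adds no total throughput. The high-compute device is memory-bound at small batch sizes, so total throughput scales with the batch until compute becomes the limit.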
Thank you for the notice! @ggerganov
On further testing, I was able to see some gain with smaller models on Android, through Termux.
Pixel 6 Pro:
Q4_K_M tinyllama: batch size 2 = 16.9 t/s
Q4_0 tinyllama: batch size 2 = 9.53 t/s
f16 version, the difference is less noticeable now: batch size 2 = 7.2 t/s
With a Raspberry Pi 400, I can double the total tokens (4 t/s -> 8 t/s) with tinyllama Q4_K_M.
In both cases, the CPU is at its limit; a batch size of 3 or 4 did not improve anything further. I was thinking the chip in the Pixel 6 would have greater compute. Maybe the model runs too fast to do anything in parallel.
@BarfingLemurs the Pixel 6 Pro CPU is a heterogeneous system with 3 types of cores: 2 ultra-fast Cortex-X1, 2 Cortex-A76, and 4 slow but low-power Cortex-A55. The Raspberry Pis all have homogeneous CPUs, so most likely the difference you are observing is due to some of the cores in the Pixel waiting on the slow A55s. BLAS won't be used for batch sizes smaller than 32, so the processing will all be done in llama.cpp directly. Therefore, you can try to tune the thread count. If you set the threads-batch parameter to 2, you may see greater speedups.
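The "waiting on the slow cores" effect can be sketched with a toy model. This is only an illustration: it assumes the work for each step is split equally across threads and that every thread must finish before the next step starts, and the relative core speeds below are made-up values, not measurements of the Pixel 6 Pro.

```python
def step_time(work, core_speeds):
    """Time for one step when `work` is split equally across the given
    cores and the step ends only when the slowest core finishes."""
    share = work / len(core_speeds)
    return max(share / speed for speed in core_speeds)

# Hypothetical relative speeds (higher is faster).
X1, A76, A55 = 3.0, 2.0, 1.0

configs = {
    "2 threads (2x X1)": [X1, X1],
    "4 threads (2x X1 + 2x A76)": [X1, X1, A76, A76],
    "8 threads (all cores)": [X1, X1, A76, A76, A55, A55, A55, A55],
}

for name, cores in configs.items():
    print(f"{name}: step time = {step_time(1.0, cores):.4f}")
```

Under these assumptions, adding the four slow cores does not shorten the step at all, because the faster cores sit idle while the slow ones finish their equal share. Which thread count is actually best depends on the real core speeds and on how the scheduler places threads, so it still has to be measured on the device.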
I get worse speeds with -t 2. 4 threads is still best for my device.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Motivation
The batched and parallel examples do not perform as expected. The examples normally demonstrate that tokens per second (t/s) scales with batch size:
Possible Implementation
Maybe this is implemented, but not working in the environment I tested.