[Feature Request] parallel decoding on mobile #4064

Closed
4 tasks done
BarfingLemurs opened this issue Nov 13, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@BarfingLemurs
Contributor

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Motivation

The batched and parallel examples do not perform as expected. These examples normally demonstrate that tokens/s scales with batch size:

On an x86 CPU:
batch size 1 = 4 t/s
batch size 2 = 8 t/s
batch size 4 = 16 t/s

However:

On mobile:
batch size 1 = 4 t/s
batch size 2 = 4 t/s
batch size 4 = 4 t/s

Possible Implementation

Maybe this is already implemented, but not working in the environment I tested.
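
A simple way to reproduce this comparison is to sweep the number of parallel sequences in the parallel example and compare the reported t/s. This is only a sketch; the model path and prompt are placeholders taken from a later comment in this thread, and -np / -ns control the number of parallel sequences and the number of sequences to decode.

# Sweep batch sizes and compare the reported t/s (adjust -t to the number of physical cores).
MODEL=~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf
for np in 1 2 4; do
  echo "== n_parallel = $np =="
  ./parallel -m "$MODEL" -ns "$np" -np "$np" -p "what is a llama?" -n 30 -t 4
done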

@BarfingLemurs BarfingLemurs added the enhancement New feature or request label Nov 13, 2023
@BarfingLemurs
Contributor Author

I found this does work on other systems, like a Raspberry Pi. Closing, as this appears to be an environment issue.

@ggerganov
Owner

However:
On mobile:
batch size 1 = 4 t/s
batch size 2 = 4 t/s
batch size 4 = 4 t/s

My understanding is that when this happens, it means there is not enough compute to saturate the memory bandwidth.
Here are some more results with AWS instances that behave similarly:

#3478 (comment)
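
A rough back-of-envelope model of that statement (a sketch, not from the thread; W = bytes of weights read per decode step, BW = memory bandwidth, N = parameter count, F = achievable FLOP/s):

total t/s at batch size b ≈ b / max( W / BW , 2·N·b / F )

While the memory term W/BW dominates, total t/s grows roughly linearly with b; once the compute term 2·N·b/F dominates, it flattens out at about F/(2·N). Flat numbers already at b = 1–4, as above, suggest the compute term dominates, i.e. there is not enough compute to exploit the reuse of the weights across sequences.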

@BarfingLemurs
Contributor Author

BarfingLemurs commented Nov 30, 2023

Thank you for the notice!

@ggerganov On further testing, I was able to see some gains with smaller models on Android, through Termux.

Pixel 6 Pro

Q4_K_M tinyllama

batch size 2 = 16.9 t/s
batch size 1 = 14.5 t/s

Q4_0 tinyllama

batch size 2 = 9.53 t/s
batch size 1 = 8.1 t/s

With the f16 version, the difference is less noticeable:

batch size 2 = 7.2 t/s
batch size 1 = 6.9 t/s

With a Raspberry Pi 400, I can double the total throughput (4 t/s -> 8 t/s) with TinyLlama Q4_K_M.

In both cases the CPU is at its limit; a batch size of 3 or 4 did not improve anything further.

I was thinking the chip on the Pixel 6 would have greater compute. Maybe the model runs too fast to gain anything from parallel decoding.

@AutonomicPerfectionist
Contributor

I was thinking the chip on Pixel 6 would have greater compute

@BarfingLemurs the Pixel 6 Pro CPU is a heterogeneous system with 3 types of cores: 2 ultra-fast Cortex-X1, 2 Cortex-A76, and 4 slow but low-power Cortex-A55. The Raspberry Pis all have homogeneous CPUs, so most likely the difference you are observing is due to some of the cores in the Pixel waiting on the slow A55s. BLAS won't be used for batch sizes smaller than 32, so the processing will all be done in llama.cpp directly. Therefore, you can try to tune the thread count; if you set the threads-batch parameter to 2 you may see greater speedups. A sketch of such a sweep is shown below.
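
As a rough sketch of that suggestion (not from the thread): sweep the thread count and, optionally, pin the threads to the faster cores. The CPU mask f0 assumes the Cortex-A55 cores are CPUs 0–3 and the faster cores are CPUs 4–7, which should be verified against /proc/cpuinfo first, and taskset may need to be installed separately in Termux.

# Compare thread counts while restricting threads to the assumed big/mid cores (CPUs 4-7).
# The threads-batch parameter mentioned above can be varied the same way, if the build supports it.
MODEL=~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf
for t in 2 4 6; do
  echo "== threads = $t =="
  taskset f0 ./parallel -m "$MODEL" -ns 2 -np 2 -p "what is a llama?" -n 30 -t "$t"
done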

@BarfingLemurs
Contributor Author

@AutonomicPerfectionist

I get worse speeds with -t 2; 4 is still best for my device:

./parallel -m ~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf -ns 2 -np 2 -p "what is a llama?" -t 4 -n 30
