[Feature Request] parallel decoding on mobile #4064

Closed
4 tasks done
BarfingLemurs opened this issue Nov 13, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@BarfingLemurs
Contributor

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Motivation

The batched and parallel examples do not perform as expected. These examples normally demonstrate that tokens/s scales with batch size:

On an x86 CPU:
batch size 1 = 4 t/s
batch size 2 = 8 t/s
batch size 4 = 16 t/s

However:

On mobile:
batch size 1 = 4 t/s
batch size 2 = 4 t/s
batch size 4 = 4 t/s

Possible Implementation

Maybe this is already implemented, but not working in the environment I tested.
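
A simple way to reproduce this comparison is to sweep the number of parallel sequences in the parallel example and compare the reported t/s. This is only a sketch; the model path and prompt are placeholders taken from a later comment in this thread, and -np / -ns control the number of parallel sequences and the number of sequences to decode.

# Sweep batch sizes and compare the reported t/s (adjust -t to the number of physical cores).
MODEL=~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf
for np in 1 2 4; do
  echo "== n_parallel = $np =="
  ./parallel -m "$MODEL" -ns "$np" -np "$np" -p "what is a llama?" -n 30 -t 4
done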

@BarfingLemurs BarfingLemurs added the enhancement New feature or request label Nov 13, 2023
@BarfingLemurs
Contributor Author

I found this does work on other systems, like a Raspberry Pi. Closing, as this appears to be an environment issue.

@ggerganov
Owner

However:
On mobile:
batch size 1 = 4 t/s
batch size 2 = 4 t/s
batch size 4 = 4 t/s

My understanding is that when this happens, it means there is not enough compute to saturate the memory bandwidth.
Here are some more results with AWS instances that behave similarly:

#3478 (comment)
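
A rough back-of-envelope model of that statement (a sketch, not from the thread; W = bytes of weights read per decode step, BW = memory bandwidth, N = parameter count, F = achievable FLOP/s):

total t/s at batch size b ≈ b / max( W / BW , 2·N·b / F )

While the memory term W/BW dominates, total t/s grows roughly linearly with b; once the compute term 2·N·b/F dominates, it flattens out at about F/(2·N). Flat numbers already at b = 1–4, as above, suggest the compute term dominates, i.e. there is not enough compute to exploit the reuse of the weights across sequences.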

@BarfingLemurs
Contributor Author

BarfingLemurs commented Nov 30, 2023

Thank you for the notice!

@ggerganov On further testing, I was able to see some gains with smaller models on Android, through Termux.

Pixel 6 Pro

Q4_K_M tinyllama

batch size 2 = 16.9 t/s
batch size 1 = 14.5 t/s

Q4_0 tinyllama

batch size 2 = 9.53 t/s
batch size 1 = 8.1 t/s

With the f16 version, the difference is less noticeable:

batch size 2 = 7.2 t/s
batch size 1 = 6.9 t/s

With a Raspberry Pi 400, I can double the total throughput (4 t/s -> 8 t/s) with TinyLlama Q4_K_M.

In both cases the CPU is at its limit; a batch size of 3 or 4 did not improve anything further.

I was thinking the chip on the Pixel 6 would have greater compute. Maybe the model runs too fast to gain anything from parallel decoding.

@AutonomicPerfectionist
Contributor

I was thinking the chip on Pixel 6 would have greater compute

@BarfingLemurs the Pixel 6 Pro CPU is a heterogeneous system with 3 types of cores: 2 ultra-fast Cortex-X1, 2 Cortex-A76, and 4 slow but low-power Cortex-A55. The Raspberry Pis all have homogeneous CPUs, so most likely the difference you are observing is due to some of the cores in the Pixel waiting on the slow A55s. BLAS won't be used for batch sizes smaller than 32, so the processing will all be done in llama.cpp directly. Therefore, you can try to tune the thread count; if you set the threads-batch parameter to 2 you may see greater speedups. A sketch of such a sweep is shown below.
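
As a rough sketch of that suggestion (not from the thread): sweep the thread count and, optionally, pin the threads to the faster cores. The CPU mask f0 assumes the Cortex-A55 cores are CPUs 0–3 and the faster cores are CPUs 4–7, which should be verified against /proc/cpuinfo first, and taskset may need to be installed separately in Termux.

# Compare thread counts while restricting threads to the assumed big/mid cores (CPUs 4-7).
# The threads-batch parameter mentioned above can be varied the same way, if the build supports it.
MODEL=~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf
for t in 2 4 6; do
  echo "== threads = $t =="
  taskset f0 ./parallel -m "$MODEL" -ns 2 -np 2 -p "what is a llama?" -n 30 -t "$t"
done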

@BarfingLemurs
Contributor Author

@AutonomicPerfectionist

I get worse speeds with -t 2; 4 is still best for my device:

./parallel -m ~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf -ns 2 -np 2 -p "what is a llama?" -t 4 -n 30
