-
Description
I recently tried to implement parallel processing of tokens, inspired by baby-llama, i.e. I'm trying to change the dimension of tokens from [1 x N] to [M x N] to process several tokens in parallel at once. Here you can find my fork with the first experiment. Note: I'm using an Apple M2 Max.

Problem
Results on CPU and GPU differ. First I built my experimental app (input-batches-experiment) with -DLLAMA_METAL=OFF, and the output was correct.
But when I built the same app with -DLLAMA_METAL=ON, I got totally inconsistent results (only the first batch is generated correctly).
Question
It looks like I missed something in the Metal initialisation, or I need to make changes to that part of the project.
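For context, here is a minimal sketch of the "M tokens per forward pass" idea expressed with the llama_batch API that llama.cpp exposes (this may differ from what my fork does internally, and eval_tokens_in_parallel is a hypothetical helper name — treat it as an illustration of the idea, not the code from the experiment):

```cpp
#include "llama.h"

// Hypothetical helper: evaluate M prompt tokens of a single sequence
// in one llama_decode call instead of feeding them one at a time.
static int eval_tokens_in_parallel(llama_context * ctx, const llama_token * tokens, int m) {
    llama_batch batch = llama_batch_init(m, /*embd =*/ 0, /*n_seq_max =*/ 1);

    for (int i = 0; i < m; i++) {
        batch.token[i]     = tokens[i];
        batch.pos[i]       = i;     // position within the sequence
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;     // all tokens belong to sequence 0
        batch.logits[i]    = false; // logits only needed for the last token
    }
    batch.n_tokens      = m;
    batch.logits[m - 1] = true;

    const int ret = llama_decode(ctx, batch); // one forward pass over all M tokens
    llama_batch_free(batch);
    return ret;
}
```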
-
There is a batched example available now for parallel decoding.
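A minimal sketch of the pattern that example demonstrates, assuming the llama_batch API (decode_parallel is a hypothetical helper name): one llama_decode call advances every parallel sequence by one token.

```cpp
#include "llama.h"
#include <vector>

// Hypothetical helper: decode one new token for each of n_parallel
// sequences in a single forward pass, as in the batched example.
static int decode_parallel(llama_context * ctx,
                           const std::vector<llama_token> & next_tokens, // one token per sequence
                           llama_pos pos) {
    const int n_parallel = (int) next_tokens.size();
    llama_batch batch = llama_batch_init(n_parallel, /*embd =*/ 0, /*n_seq_max =*/ n_parallel);

    for (int s = 0; s < n_parallel; s++) {
        batch.token[s]     = next_tokens[s];
        batch.pos[s]       = pos;  // same position, different sequences
        batch.n_seq_id[s]  = 1;
        batch.seq_id[s][0] = s;    // each token extends its own sequence
        batch.logits[s]    = true; // sample from every stream afterwards
    }
    batch.n_tokens = n_parallel;

    const int ret = llama_decode(ctx, batch); // one pass advances all streams
    llama_batch_free(batch);
    return ret;
}
```

Afterwards the logits for stream s can be read with llama_get_logits_ith(ctx, s), since logits were requested for every token in the batch.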
-
What does the prompt eval speed look like?
-
@Xarbirus do you have any plans to merge your experiments into …