Parallelization / Batching Explanation #4130

lapp0 · 2023-11-19T00:25:22Z

lapp0
Nov 19, 2023

Hi All,

I'm seeking clarity on the functionality of the --parallel option in /app/server, especially how it interacts with the --cont-batching parameter. My specific observation involves setting --ctx-size to 8192 and --parallel to 32. From the logs, it appears there are 32 slots, each handling a context segment of 256. My question is: Does this configuration imply that each slot processes a distinct segment of the context?

For instance, if I input 32 instances of an identical prompt with a length of 4096, would the first half of the slots remain idle due to the prompt already existing in the KV cache? This leads to confusion, as it seems the total number of submittable jobs is limited by the slot count. This is puzzling because different prompts might rely on the same slot for varied tasks. If the initial segment of the context is identical for multiple prompts, that segment might not require processing, as it's already in the KV cache.

I'm trying to understand the rationale behind dividing the context into segments when batching. Could you provide an explanation of how the --parallel and --cont-batching options function?

References:

server : parallel decoding and multimodal #3589
server : parallel decoding and multimodal (cont) #3677
llama : custom attention mask + parallel decoding + no context swaps #3228
- "To set the KV cache size, use the -c, --context parameter. For example, for 32 parallel streams that are expected to generate a maximum of 128 tokens each (i.e. -n 128), you would need to set -c 4096 (i.e. 32*128). If continuous batching is enabled, you would need some extra KV space to deal with fragmentation of the cache. In the example above, we conveniently set the context size to 8192 to guarantee that there will be no issues."
llama : add support for batched inference #2813
- "We want to be able to generate multiple sequences sharing the same context (a.k.a. prompt) in parallel."
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/server.cpp#L588-L603
- batch handling in server.cpp dividing the n_ctx and calling llama_batch_init(n_ctx, 0, params.n_parallel);

nfairbank · 2024-01-08T16:10:00Z

nfairbank
Jan 8, 2024

Bump- trying to understand this as well...

0 replies

ggerganov · 2024-01-08T18:12:33Z

ggerganov
Jan 8, 2024
Maintainer

The --ctx-size argument actually specifies the total size of the KV cache (legacy name, --kv-size would be better). This corresponds to the total amount of tokens that can be stored across all independent sequences.

For example, if we specify --ctx-size 8192 this means that we can process:

2 sequences, each of max length of 4096 tokens
4 sequences, each of max length of 2048 tokens
8 sequences, each of max length of 1024 tokens
...
32 sequences, each of max length of 256 tokens
etc.

Simply put, if we want to be handling P sequences in parallel and we know that each sequence can have a maximum of T tokens (prompt + generated), then we want to set our KV cache size (i.e. --ctx-size) to T*P in order to be able to handle the worst-case scenario where all P sequences fill-up the maximum T tokens.

Since llama.cpp implements a "unified" cache strategy, the KV cache size is actually shared across all sequences. This means that it's allowed to have sequences with more than T tokens as long as the sum of all tokens does not exceed P*T.

From the logs, it appears there are 32 slots, each handling a context segment of 256. My question is: Does this configuration imply that each slot processes a distinct segment of the context?

No. Each sequence has it's own context. The tokens from each sequences "see" only the tokens from that same sequence. This is achieved with the KQ_mask:

llama.cpp/llama.cpp

Lines 6328 to 6368 in dd5ae06

    
           if (!alloc_inp_KQ_mask && strcmp(name, "KQ_mask") == 0) { 
        
               ggml_allocr_alloc(lctx.alloc, cur); 
        
               if (!ggml_allocr_is_measure(lctx.alloc)) { 
        
                   const int64_t n_kv     = cur->ne[0]; 
        
                   const int64_t n_tokens = cur->ne[1]; 
        
                   float * data; 
        
                   if (ggml_backend_buffer_is_host(cur->buffer)) { 
        
                       data = (float *) cur->data; 
        
                   } else { 
        
                       lctx.buf_copy.resize(ggml_nbytes(cur)); 
        
                       data = (float *) lctx.buf_copy.data(); 
        
                   } 
        
                   for (int h = 0; h < 1; ++h) { 
        
                       for (int j = 0; j < n_tokens; ++j) { 
        
                           const llama_pos    pos    = batch.pos[j]; 
        
                           const llama_seq_id seq_id = batch.seq_id[j][0]; 
        
                           for (int i = 0; i < n_kv; ++i) { 
        
                               float f; 
        
                               if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) { 
        
                                   f = -INFINITY; 
        
                               } else { 
        
                                   f = 0; 
        
                               } 
        
                               data[h*(n_kv*n_tokens) + j*n_kv + i] = f; 
        
                           } 
        
                       } 
        
                   } 
        
                   if (data != cur->data) { 
        
                       ggml_backend_tensor_set(cur, data, 0, ggml_nbytes(cur)); 
        
                   } 
        
               } 
        
               alloc_inp_KQ_mask = true; 
        
           }

Each llama_decode call accepts a llama_batch. The batch can contain an arbitrary set of tokens - each token has it's own position and sequences id(s). The position and the sequence ids of a token determine to which other tokens (both from the batch and the KV cache) it will attend to, by constructing the respective KQ_mask. When processed, the batch of tokens is stored in the KV cache. This way, as long as there is enough free slots in the KV cache, we can call llama_decode. In this framework, continuous batching is trivial.

Another great benefit is that different sequences can share a common prompt without any extra compute. All it takes is to assign multiple sequence ids to the common tokens in the KV cache. A basic example is a system prompt of S tokens that can sit at the beginning of the unified KV cache and be attended to by all sequences that we process.

Together with the simplicity and advantages of this implementation, there are a few disadvantages:

The attention for each sequence is computed over the entire KV cache. This is suboptimal in the case of many sequences and very long contexts because we compute "cross-sequence attention" which is masked after that (i.e. we throw it away)
The determinism of the results is now also a function of where the tokens of a sequence are placed in the KV cache

In order to resolve these, I think we should add a standard attention implementation where each sequence has it's own KV cache buffer and the attention is computed separately. This way, users would be able to choose which implementation to use based on their specific use case.

18 replies

hiarcs Jul 18, 2024

That's it! I tried using the -nocb parameter as the baseline, and the difference is huge.
Thanks a lot.

enn-nafnlaus Jul 18, 2024

That's it! I tried using the -nocb parameter as the baseline, and the difference is huge. Thanks a lot.

Thanks for solving this performance mystery for me too :)

Mushoz Nov 22, 2024

This means that it's allowed to have sequences with more than T tokens as long as the sum of all tokens does not exceed P*T.

How does this work? If I set a context of 2048 with 8 slots, but only use one of those, I still get context shifts when I exceed 256 context length.

enn-nafnlaus Nov 22, 2024

This means that it's allowed to have sequences with more than T tokens as long as the sum of all tokens does not exceed P*T.

How does this work? If I set a context of 2048 with 8 slots, but only use one of those, I still get context shifts when I exceed 256 context length.

Yes. If you don't want that to happen, increase your context to 2048*8 (even though the model only has a max of 2048 tokens)

Mushoz Nov 22, 2024

But according to @ggerganov it is possible for a single request to exceed T tokens, because the KV cache is unified and if other slots are not (fully) in use, that space in the KV cache can allow a single request to process longer sequences. I know 256 will be the limit when all slots are fully in use, but I was expecting based on the earlier explanation that a single request could use 2048 if all other slots were not in use.

chigkim · 2024-07-16T21:33:21Z

chigkim
Jul 16, 2024

If I understand correctly, I'm told that setting -c beyond model context window size degrades output quality. However, does it make it difference in parallel?
For example, if I want to utilize the full context size for llama3 (8192) in 16 parallels, would setting -c 131072 and -np 16 decrease the quality of output?
Or, should I set -c 8192 -np 16 and only send prompt with max 512 tokens?
What about generation tokens? Should prompt + generation should be <= 512?
Then should I set -c 8192 -np 16 -n 256 and only send prompt with max 256 tokens?

1 reply

ggerganov Jul 17, 2024
Maintainer

For example, if I want to utilize the full context size for llama3 (8192) in 16 parallels, would setting -c 131072 and -np 16 decrease the quality of output?

It will not decrease the quality and this is the recommended way to go

What about generation tokens?

For 16 parallel slots, setting -c 131072 -np 16 would guarantee that the worst-case scenario in terms of memory consumption where all 16 sequences are 8192 tokens long, would fit in the KV cache

Then should I set -c 8192 -np 16 -n 256 and only send prompt with max 256 tokens?

Yes, you can do that and it will work in the worst-case scenario where all 16 slots have prompts of 256 tokens each and generate 256 tokens each: 16*(256 + 256) = 8192

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallelization / Batching Explanation #4130

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 19 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Parallelization / Batching Explanation #4130

Replies: 3 comments · 19 replies

ggerganov Jan 8, 2024 Maintainer

ggerganov Jul 17, 2024 Maintainer

Replies: 3 comments 19 replies

ggerganov
Jan 8, 2024
Maintainer

ggerganov Jul 17, 2024
Maintainer