
server : parallel decoding and multimodal (cont) #3677

Merged: 72 commits merged on Oct 22, 2023
Conversation

@ggerganov (Owner) commented Oct 19, 2023

Continuation of #3589

Commands to test:

# parallel decoding from multiple clients (`-np 4`) with continuous batching (`-cb`)
./server -m ${model} --ctx_size 2048 -t 8 -ngl 99 -np 4 -cb

# vision support, upload image in the browser and ask questions
./server -m ./models/llava-7b-v1.5/ggml-model-f16.gguf --mmproj ./models/llava-7b-v1.5/mmproj-model-f16.gguf --ctx_size 2048 -t 8 -ngl 99

@FSSRepo (Collaborator) commented Oct 22, 2023

> Does it still crash after latest commit: 00ae55b

Fixed

// release the slot
if (slot.state == PROCESSING && slot.command == RELEASE)
{
    slot.state = slot.params.cache_prompt ? SLEEPING : IDLE;
Collaborator (review comment):

Is the issue that the slot is not released because of this line? With cache_prompt it sleeps instead of being released.

@ggerganov (Owner, Author) commented Oct 22, 2023

> To avoid this, my idea is to add a timer to the slot and periodically send the signal to keep the slot suspended (while the user is using the web UI). If the slot doesn't receive the signal after the timer's duration, it will be automatically released.

Can't we release the slot after every request?
The logic is that a user provides the slot_id they used last time; if that slot is released, we re-assign it and continue as we currently do. If it is not released, we search for a new slot. If a new request comes in with an unassigned slot (or the requested slot is in use), we assign a slot that has never been used yet, or the released slot that was used the longest time ago. A sketch of this policy follows below.
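A minimal sketch of that selection policy, written as an illustration only: the struct fields, the global slots vector, and the names get_slot and t_last_used are assumptions and do not claim to match the merged server.cpp code.

    #include <cstdint>
    #include <vector>

    enum slot_state { IDLE, PROCESSING };

    struct llama_client_slot {
        int        id          = -1;
        slot_state state       = IDLE;
        int64_t    t_last_used = 0;   // assumed timestamp of the last release (0 = never used)
    };

    static std::vector<llama_client_slot> slots;

    llama_client_slot * get_slot(int requested_id) {
        // 1) the client passed the slot_id it used last time and it is free -> reuse it
        if (requested_id >= 0 && requested_id < (int) slots.size() &&
            slots[requested_id].state == IDLE) {
            return &slots[requested_id];
        }
        // 2) otherwise pick a never-used slot, or the released slot idle the longest
        llama_client_slot * best = nullptr;
        for (auto & slot : slots) {
            if (slot.state != IDLE) {
                continue; // busy slots are never taken over
            }
            if (best == nullptr || slot.t_last_used < best->t_last_used) {
                best = &slot;
            }
        }
        return best; // nullptr -> all slots busy, the request has to wait
    }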

@FSSRepo (Collaborator) commented Oct 22, 2023

> Can't we release the slot after every request?

You can try it by deleting the `slot.params.cache_prompt ? SLEEPING :` part, i.e. changing this:

            // release the slot
            if (slot.state == PROCESSING && slot.command == RELEASE)
            {
                slot.state = slot.params.cache_prompt ? SLEEPING : IDLE;

To this:

            // release the slot
            if (slot.state == PROCESSING && slot.command == RELEASE)
            {
                slot.state = IDLE;

This works with a single client, but a new client (one without an assigned slot, i.e. slot_id: -1) requesting a completion could then take the released slot of the first client. The SLEEPING state is how I avoid this.
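A hedged illustration of that distinction (assumed names, not the actual server code): a SLEEPING slot keeps its prompt cache and is only handed back to the client that owns it, while a brand-new client (slot_id == -1) may only take a fully IDLE slot.

    #include <vector>

    enum slot_state { IDLE, SLEEPING, PROCESSING };

    struct llama_client_slot {
        int        id    = -1;
        slot_state state = IDLE;
    };

    static std::vector<llama_client_slot> slots;

    llama_client_slot * find_slot(int requested_id) {
        for (auto & slot : slots) {
            // the owning client resumes its SLEEPING slot and reuses the cached prompt
            if (slot.id == requested_id && slot.state == SLEEPING) {
                return &slot;
            }
        }
        for (auto & slot : slots) {
            // everyone else only gets slots that were fully released
            if (slot.state == IDLE) {
                return &slot;
            }
        }
        return nullptr; // nothing available right now
    }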

@ggerganov (Owner, Author):
Ok, I think it works now.

I think the server actually worked out pretty good! Good job @FSSRepo, @monatis and all!
There are still some things to improve and the logic is a bit too complicated in some places, but we can gradually improve this.

Will merge this soon if no issues are found

@ggerganov (Owner, Author):
Parallel llava is not working anymore - first client is OK, but starting a second one produces nonsense:

[screenshot]


json prompt;
std::vector<llama_token> embd;
slot_state state = IDLE;
Collaborator (review comment):

Since this enum has only two values now, couldn't this be bool idle = true or bool processing = false?

}

bool is_processing() const {
    return (state == IDLE && command == LOAD_PROMPT) || state == PROCESSING;
Collaborator (review comment):

Can be simplified to `state == PROCESSING || command == LOAD_PROMPT`, since there are only two possible states.
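Applying both review suggestions above might look roughly like the following sketch. It is an illustration, not the merged code; LOAD_PROMPT and RELEASE are taken from the snippets above, while NONE and the partial struct definition are assumptions.

    // Sketch: replace the two-value slot_state enum with a bool and simplify is_processing().
    enum slot_command { NONE, LOAD_PROMPT, RELEASE }; // NONE is assumed

    struct llama_client_slot {
        bool processing = false;        // was: slot_state state = IDLE;
        slot_command command = NONE;

        bool is_processing() const {
            // equivalent to (state == IDLE && command == LOAD_PROMPT) || state == PROCESSING
            // when only two states exist
            return processing || command == LOAD_PROMPT;
        }
    };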

@FSSRepo (Collaborator) commented Oct 22, 2023

> Parallel llava is not working anymore - first client is OK, but starting a second one produces nonsense:

I think that llava doesn't work with parallel decoding

@@ -434,9 +451,12 @@
)
).join("\n"),
});

if (selected_image) {
    prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-10]${msg}\nASSISTANT:`;
Collaborator (review comment):

The image slot id is statically set to 10 here. I think that might be the reason for the parallel LLaVA issue.

@ggerganov (Owner, Author):

Ok, let's fix this later then

Collaborator (review comment):

The images are loaded for each slot, so it doesn't matter if it's 10, as it ultimately loads a different slot.

@FSSRepo (Collaborator) commented Oct 22, 2023

[Screenshot 2023-10-22 152844]
Unexpected behavior. I didn't know I could send multiple messages about the same image; the only problem is that the image has to be loaded every time I ask something.

To avoid reloading the same image every time a question is asked, you could compute the SHA-256 or CRC32 (lower compute cost) of the image and, if a request has the same ID and the same hash, keep the already-loaded image in memory.
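A rough sketch of such a cache, as an illustration only: encode_image() is a hypothetical stand-in for the clip/mmproj encoding step, and FNV-1a is used here as the cheap hash in place of SHA-256/CRC32.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    std::vector<float> encode_image(const std::vector<uint8_t> & img_bytes); // assumed helper

    static uint64_t hash_bytes_fnv1a(const std::vector<uint8_t> & data) {
        uint64_t h = 1469598103934665603ull;   // FNV offset basis
        for (uint8_t b : data) {
            h ^= b;
            h *= 1099511628211ull;             // FNV prime
        }
        return h;
    }

    // image-content hash -> precomputed image embedding
    static std::unordered_map<uint64_t, std::vector<float>> g_image_embd_cache;

    const std::vector<float> & get_image_embd(const std::vector<uint8_t> & img_bytes) {
        const uint64_t key = hash_bytes_fnv1a(img_bytes);
        auto it = g_image_embd_cache.find(key);
        if (it != g_image_embd_cache.end()) {
            return it->second;                 // cache hit: skip decode/encode entirely
        }
        std::vector<float> embd = encode_image(img_bytes);
        return g_image_embd_cache.emplace(key, std::move(embd)).first->second;
    }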

@monatis (Collaborator) commented Oct 22, 2023

Yes, technically we can chat about an image over multiple turns, and the system prompt should only be prepended to the first message, but I think optimizing it to eliminate the need to load/preprocess/encode images via a caching mechanism may be another story, i.e., another PR.

@cebtenzzre (Collaborator) commented Oct 23, 2023

Why did this PR remove --threads-batch from the server? It only existed for one day (added in a8bdd65, removed in 29c8cdd).

@ggerganov (Owner, Author) commented Oct 23, 2023

This is my mistake. I tried to rebase this PR at some point, but the changes to master were all over the place, so I applied them manually. It seems I missed this one. Apologies.

@ibehnam (Contributor) commented Oct 23, 2023

> Continuation of #3589
>
> Commands to test:
>
> # parallel decoding from multiple clients (`-np 4`) with continuous batching (`-cb`)
> ./server -m ${model} --ctx_size 2048 -t 8 -ngl 99 -np 4 -cb
>
> # vision support, upload image in the browser and ask questions
> ./server -m ./models/llava-7b-v1.5/ggml-model-f16.gguf --mmproj ./models/llava-7b-v1.5/mmproj-model-f16.gguf --ctx_size 2048 -t 8 -ngl 99

I can't thank you enough! This is great. I have one question:

  • In my tests, the `parallel` example treats different lines in the prompt as separate prompts and sends each one to a different slot. For example:
prompt = '''llama is an animal that inspired work on large language models.\nThis is why we see a lot of LLM packages with the word "llama" in them.'''

In this case, and assuming that np is at least 2, parallel processes llama is an animal that inspired work on large language models. and This is why we see a lot of LLM packages with the word "llama" in them. separately.

In my tests, the server example doesn't show the same behavior 🤔
Can you please provide more info about how to send multiple queries to the server with parallel processing?

@ibehnam (Contributor) commented Oct 23, 2023

I should mention that using -np actually slowed down inference on my M1 Pro chip:

  • Using parallelism:
./server -m "llama-2-13b.Q8_0.gguf" --ctx_size 4096 -ngl 128 --mlock --threads 10 --numa -np 16 -cb

-> 50 requests sent, total time = 60.38 seconds

  • Not using parallelism:
./server -m "llama-2-13b.Q8_0.gguf" --ctx_size 4096 -ngl 128 --mlock --threads 10 --no-mmap --numa

-> 50 requests sent sequentially, total time = 52.36 seconds

@ggerganov (Owner, Author) commented Oct 23, 2023

Optimal performance when using parallelism with quantized models currently requires manually adjusting some constants depending on the model size and hardware, both for Metal and CUDA. I wrote some thoughts on this in #3524 and in #3545 (comment).

You can try playing with this piece of code to get better parallel performance for your use-case:

llama.cpp/ggml-metal.m

Lines 1032 to 1058 in e393259

// find the break-even point where the matrix-matrix kernel becomes more efficient compared
// to the matrix-vector kernel
int ne11_mm_min = 1;

#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
    switch (src0t) {
        case GGML_TYPE_F16:  ne11_mm_min = 2;  break;
        case GGML_TYPE_Q8_0: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q2_K: ne11_mm_min = 15; break;
        case GGML_TYPE_Q3_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1: ne11_mm_min = 15; break;
        case GGML_TYPE_Q4_K: ne11_mm_min = 11; break;
        case GGML_TYPE_Q5_0:                   // not tested yet
        case GGML_TYPE_Q5_1: ne11_mm_min = 13; break; // not tested yet
        case GGML_TYPE_Q5_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q6_K: ne11_mm_min = 7;  break;
        default:             ne11_mm_min = 1;  break;
    }
}
#endif

But at the moment it's not very user-friendly, and I haven't found a universal way to obtain the best performance automatically.

@ibehnam (Contributor) commented Oct 23, 2023

@ggerganov Thanks, I will have a look into those. Something that occasionally helped speed up inference on my end was using Python's native ThreadPoolExecutor to call the server endpoint in parallel. Although, similar to what you said, it required fiddling with the number of workers; sometimes parallel Python code execution would actually slow down inference.
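For reference, a comparable parallel-client setup written as a C++ sketch (using std::thread and the cpp-httplib client bundled with the server example, instead of Python's ThreadPoolExecutor). The /completion payload below ("prompt", "n_predict") is kept minimal and the host/port are assumptions, so treat it as an illustration rather than a canonical client.

    // Sketch: fire N concurrent requests at the server's /completion endpoint.
    #include <cstdio>
    #include <string>
    #include <thread>
    #include <vector>
    #include "httplib.h"

    int main() {
        const int n_clients = 4; // ideally not more than the server's -np value
        std::vector<std::thread> workers;

        for (int i = 0; i < n_clients; ++i) {
            workers.emplace_back([i]() {
                httplib::Client cli("localhost", 8080);
                cli.set_read_timeout(600, 0); // generation can take a while

                const std::string body =
                    R"({"prompt": "Hello from client )" + std::to_string(i) + R"(", "n_predict": 64})";

                auto res = cli.Post("/completion", body, "application/json");
                if (res && res->status == 200) {
                    std::printf("client %d: %s\n", i, res->body.c_str());
                } else {
                    std::printf("client %d: request failed\n", i);
                }
            });
        }

        for (auto & t : workers) {
            t.join();
        }
        return 0;
    }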

Labels: need feedback (Testing and feedback with results are needed)
Projects: none yet
Linked issue that may be closed by this PR: will ./server support parallel decoding as well?
9 participants