
server : parallel decoding and multimodal (cont) #3677

Merged: 72 commits merged on Oct 22, 2023
Conversation

@ggerganov (Owner) commented Oct 19, 2023

Continuation of #3589

Commands to test:

# parallel decoding from multiple clients (`-np 4`) with continuous batching (`-cb`)
./server -m ${model} --ctx_size 2048 -t 8 -ngl 99 -np 4 -cb

# vision support, upload image in the browser and ask questions
./server -m ./models/llava-7b-v1.5/ggml-model-f16.gguf --mmproj ./models/llava-7b-v1.5/mmproj-model-f16.gguf --ctx_size 2048 -t 8 -ngl 99

@FSSRepo (Collaborator) commented Oct 22, 2023

> Does it still crash after latest commit: 00ae55b

Fixed

// release the slot
if (slot.state == PROCESSING && slot.command == RELEASE)
{
    slot.state = slot.params.cache_prompt ? SLEEPING : IDLE;
Collaborator (review comment):

Is the issue that the slot is not released because of this line? With cache_prompt it sleeps instead of being released.

@ggerganov (Owner, Author) commented Oct 22, 2023

> To avoid this, my idea is to add a timer to the slot and periodically send the signal to keep the slot suspended (while the user is using the web UI). If the slot doesn't receive the signal after the timer's duration, it will be automatically released.

Can't we release the slot after every request?
The logic is that a user provides the slot_id they used last time; if that slot is released, we re-assign it and continue as we currently do. If it is not released, we search for a new slot. If a new request comes in with an unassigned slot (or the requested slot is in use), we assign a slot that has never been used yet, or the released slot that was used the longest time ago. A sketch of this policy follows below.
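A minimal sketch of that selection policy, written as an illustration only: the struct fields, the global slots vector, and the names get_slot and t_last_used are assumptions and do not claim to match the merged server.cpp code.

    #include <cstdint>
    #include <vector>

    enum slot_state { IDLE, PROCESSING };

    struct llama_client_slot {
        int        id          = -1;
        slot_state state       = IDLE;
        int64_t    t_last_used = 0;   // assumed timestamp of the last release (0 = never used)
    };

    static std::vector<llama_client_slot> slots;

    llama_client_slot * get_slot(int requested_id) {
        // 1) the client passed the slot_id it used last time and it is free -> reuse it
        if (requested_id >= 0 && requested_id < (int) slots.size() &&
            slots[requested_id].state == IDLE) {
            return &slots[requested_id];
        }
        // 2) otherwise pick a never-used slot, or the released slot idle the longest
        llama_client_slot * best = nullptr;
        for (auto & slot : slots) {
            if (slot.state != IDLE) {
                continue; // busy slots are never taken over
            }
            if (best == nullptr || slot.t_last_used < best->t_last_used) {
                best = &slot;
            }
        }
        return best; // nullptr -> all slots busy, the request has to wait
    }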

@FSSRepo (Collaborator) commented Oct 22, 2023

> Can't we release the slot after every request?

You can try it by deleting the `slot.params.cache_prompt ? SLEEPING :` part, i.e. changing this:

            // release the slot
            if (slot.state == PROCESSING && slot.command == RELEASE)
            {
                slot.state = slot.params.cache_prompt ? SLEEPING : IDLE;

To this:

            // release the slot
            if (slot.state == PROCESSING && slot.command == RELEASE)
            {
                slot.state = IDLE;

This works with a single client, but a new client (one without an assigned slot, i.e. slot_id: -1) requesting a completion could then take the released slot of the first client. The SLEEPING state is how I avoid this.
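A hedged illustration of that distinction (assumed names, not the actual server code): a SLEEPING slot keeps its prompt cache and is only handed back to the client that owns it, while a brand-new client (slot_id == -1) may only take a fully IDLE slot.

    #include <vector>

    enum slot_state { IDLE, SLEEPING, PROCESSING };

    struct llama_client_slot {
        int        id    = -1;
        slot_state state = IDLE;
    };

    static std::vector<llama_client_slot> slots;

    llama_client_slot * find_slot(int requested_id) {
        for (auto & slot : slots) {
            // the owning client resumes its SLEEPING slot and reuses the cached prompt
            if (slot.id == requested_id && slot.state == SLEEPING) {
                return &slot;
            }
        }
        for (auto & slot : slots) {
            // everyone else only gets slots that were fully released
            if (slot.state == IDLE) {
                return &slot;
            }
        }
        return nullptr; // nothing available right now
    }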

@ggerganov (Owner, Author):
Ok, I think it works now.

I think the server actually worked out pretty good! Good job @FSSRepo, @monatis and all!
There are still some things to improve and the logic is a bit too complicated in some places, but we can gradually improve this.

Will merge this soon if no issues are found

@ggerganov (Owner, Author):
Parallel llava is not working anymore - first client is OK, but starting a second one produces nonsense:

[screenshot]


json prompt;
std::vector<llama_token> embd;
slot_state state = IDLE;
Collaborator (review comment):

Since this enum has only two values now, couldn't this be bool idle = true or bool processing = false?

}

bool is_processing() const {
    return (state == IDLE && command == LOAD_PROMPT) || state == PROCESSING;
Collaborator (review comment):

Can be simplified to `state == PROCESSING || command == LOAD_PROMPT`, since there are only two possible states.
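Applying both review suggestions above might look roughly like the following sketch. It is an illustration, not the merged code; LOAD_PROMPT and RELEASE are taken from the snippets above, while NONE and the partial struct definition are assumptions.

    // Sketch: replace the two-value slot_state enum with a bool and simplify is_processing().
    enum slot_command { NONE, LOAD_PROMPT, RELEASE }; // NONE is assumed

    struct llama_client_slot {
        bool processing = false;        // was: slot_state state = IDLE;
        slot_command command = NONE;

        bool is_processing() const {
            // equivalent to (state == IDLE && command == LOAD_PROMPT) || state == PROCESSING
            // when only two states exist
            return processing || command == LOAD_PROMPT;
        }
    };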

@FSSRepo (Collaborator) commented Oct 22, 2023

> Parallel llava is not working anymore - first client is OK, but starting a second one produces nonsense:

I think that llava doesn't work with parallel decoding

@@ -434,9 +451,12 @@
)
).join("\n"),
});

if (selected_image) {
    prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-10]${msg}\nASSISTANT:`;
Collaborator (review comment):

The image slot id is statically set to 10 here. I think that might be the reason for the parallel LLaVA issue.

@ggerganov (Owner, Author):

Ok, let's fix this later then

Collaborator (review comment):

The images are loaded for each slot, so it doesn't matter if it's 10, as it ultimately loads a different slot.

@FSSRepo (Collaborator) commented Oct 22, 2023

[Screenshot 2023-10-22 152844]
Unexpected behavior. I didn't know I could send multiple messages about the same image; the only problem is that the image has to be loaded every time I ask something.

To avoid reloading the same image every time a question is asked, you could compute the SHA-256 or CRC32 (lower compute cost) of the image and, if a request has the same ID and the same hash, keep the already-loaded image in memory.
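A rough sketch of such a cache, as an illustration only: encode_image() is a hypothetical stand-in for the clip/mmproj encoding step, and FNV-1a is used here as the cheap hash in place of SHA-256/CRC32.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    std::vector<float> encode_image(const std::vector<uint8_t> & img_bytes); // assumed helper

    static uint64_t hash_bytes_fnv1a(const std::vector<uint8_t> & data) {
        uint64_t h = 1469598103934665603ull;   // FNV offset basis
        for (uint8_t b : data) {
            h ^= b;
            h *= 1099511628211ull;             // FNV prime
        }
        return h;
    }

    // image-content hash -> precomputed image embedding
    static std::unordered_map<uint64_t, std::vector<float>> g_image_embd_cache;

    const std::vector<float> & get_image_embd(const std::vector<uint8_t> & img_bytes) {
        const uint64_t key = hash_bytes_fnv1a(img_bytes);
        auto it = g_image_embd_cache.find(key);
        if (it != g_image_embd_cache.end()) {
            return it->second;                 // cache hit: skip decode/encode entirely
        }
        std::vector<float> embd = encode_image(img_bytes);
        return g_image_embd_cache.emplace(key, std::move(embd)).first->second;
    }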

@monatis (Collaborator) commented Oct 22, 2023

Yes, technically we can chat about an image over multiple turns, and the system prompt should only be prepended to the first message, but I think optimizing it to eliminate the need to load/preprocess/encode images via a caching mechanism may be another story, i.e., another PR.

@cebtenzzre (Collaborator) commented Oct 23, 2023

Why did this PR remove --threads-batch from the server? It only existed for one day (added in a8bdd65, removed in 29c8cdd).

@ggerganov (Owner, Author) commented Oct 23, 2023

This is my mistake. I tried to rebase this PR at some point, but the changes to master were all over the place, so I applied them manually. It seems I missed this one. Apologies.

@ibehnam (Contributor) commented Oct 23, 2023

> Continuation of #3589
>
> Commands to test:
>
> # parallel decoding from multiple clients (`-np 4`) with continuous batching (`-cb`)
> ./server -m ${model} --ctx_size 2048 -t 8 -ngl 99 -np 4 -cb
>
> # vision support, upload image in the browser and ask questions
> ./server -m ./models/llava-7b-v1.5/ggml-model-f16.gguf --mmproj ./models/llava-7b-v1.5/mmproj-model-f16.gguf --ctx_size 2048 -t 8 -ngl 99

I can't thank you enough! This is great. I have one question:

  • In my tests, the `parallel` example treats different lines in the prompt as separate prompts and sends each one to a different slot. For example:
prompt = '''llama is an animal that inspired work on large language models.\nThis is why we see a lot of LLM packages with the word "llama" in them.'''

In this case, and assuming that np is at least 2, parallel processes llama is an animal that inspired work on large language models. and This is why we see a lot of LLM packages with the word "llama" in them. separately.

In my tests, the server example doesn't show the same behavior 🤔
Can you please provide more info about how to send multiple queries to the server with parallel processing?

@ibehnam (Contributor) commented Oct 23, 2023

I should mention that using -np actually slowed down inference on my M1 Pro chip:

  • Using parallelism:
./server -m "llama-2-13b.Q8_0.gguf" --ctx_size 4096 -ngl 128 --mlock --threads 10 --numa -np 16 -cb

-> 50 requests sent, total time = 60.38 seconds

  • Not using parallelism:
./server -m "llama-2-13b.Q8_0.gguf" --ctx_size 4096 -ngl 128 --mlock --threads 10 --no-mmap --numa

-> 50 requests sent sequentially, total time = 52.36 seconds

@ggerganov (Owner, Author) commented Oct 23, 2023

Optimal performance when using parallelism with quantized models currently requires manually adjusting some constants depending on the model size and hardware, both for Metal and CUDA. I wrote some thoughts on this in #3524 and in #3545 (comment).

You can try playing with this piece of code to get better parallel performance for your use-case:

llama.cpp/ggml-metal.m

Lines 1032 to 1058 in e393259

// find the break-even point where the matrix-matrix kernel becomes more efficient compared
// to the matrix-vector kernel
int ne11_mm_min = 1;

#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
    switch (src0t) {
        case GGML_TYPE_F16:  ne11_mm_min = 2;  break;
        case GGML_TYPE_Q8_0: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q2_K: ne11_mm_min = 15; break;
        case GGML_TYPE_Q3_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1: ne11_mm_min = 15; break;
        case GGML_TYPE_Q4_K: ne11_mm_min = 11; break;
        case GGML_TYPE_Q5_0:                   // not tested yet
        case GGML_TYPE_Q5_1: ne11_mm_min = 13; break; // not tested yet
        case GGML_TYPE_Q5_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q6_K: ne11_mm_min = 7;  break;
        default:             ne11_mm_min = 1;  break;
    }
}
#endif

But at the moment it's not very user-friendly, and I haven't found a universal way to obtain the best performance automatically.

@ibehnam (Contributor) commented Oct 23, 2023

@ggerganov Thanks, I will have a look into those. Something that occasionally helped speed up inference on my end was using Python's native ThreadPoolExecutor to call the server endpoint in parallel. Although, similar to what you said, it required fiddling with the number of workers; sometimes parallel Python code execution would actually slow down inference.
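For reference, a comparable parallel-client setup written as a C++ sketch (using std::thread and the cpp-httplib client bundled with the server example, instead of Python's ThreadPoolExecutor). The /completion payload below ("prompt", "n_predict") is kept minimal and the host/port are assumptions, so treat it as an illustration rather than a canonical client.

    // Sketch: fire N concurrent requests at the server's /completion endpoint.
    #include <cstdio>
    #include <string>
    #include <thread>
    #include <vector>
    #include "httplib.h"

    int main() {
        const int n_clients = 4; // ideally not more than the server's -np value
        std::vector<std::thread> workers;

        for (int i = 0; i < n_clients; ++i) {
            workers.emplace_back([i]() {
                httplib::Client cli("localhost", 8080);
                cli.set_read_timeout(600, 0); // generation can take a while

                const std::string body =
                    R"({"prompt": "Hello from client )" + std::to_string(i) + R"(", "n_predict": 64})";

                auto res = cli.Post("/completion", body, "application/json");
                if (res && res->status == 200) {
                    std::printf("client %d: %s\n", i, res->body.c_str());
                } else {
                    std::printf("client %d: request failed\n", i);
                }
            });
        }

        for (auto & t : workers) {
            t.join();
        }
        return 0;
    }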

Labels: need feedback (Testing and feedback with results are needed)
Projects: none yet
Linked issue that may be closed by this PR: will ./server support parallel decoding as well?
9 participants