-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server : parallel decoding and multimodal (cont) #3677
Conversation
fix compilation errors with llvm
Fixed |
examples/server/server.cpp
Outdated
// release the slot | ||
if (slot.state == PROCESSING && slot.command == RELEASE) | ||
{ | ||
slot.state = slot.params.cache_prompt ? SLEEPING : IDLE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the issue with the slot not released due to this line? With cache_prompt
it sleeps instead of being released.
Can't we release the slot after every request? |
You can try it, deleting // release the slot
if (slot.state == PROCESSING && slot.command == RELEASE)
{
slot.state = slot.params.cache_prompt ? SLEEPING : IDLE; To this: // release the slot
if (slot.state == PROCESSING && slot.command == RELEASE)
{
slot.state = IDLE; This works with single client, but when a new client (doesn't have an slot it assigned |
|
||
json prompt; | ||
std::vector<llama_token> embd; | ||
slot_state state = IDLE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this enum has only two values now, couldn't this be bool idle = true
or bool processing = false
?
} | ||
|
||
bool is_processing() const { | ||
return (state == IDLE && command == LOAD_PROMPT) || state == PROCESSING; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be simplified to state == PROCESSING || command == LOAD_PROMPT
since there are only two possible states.
I think that llava doesn't work with parallel decoding |
@@ -434,9 +451,12 @@ | |||
) | |||
).join("\n"), | |||
}); | |||
|
|||
if (selected_image) { | |||
prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-10]${msg}\nASSISTANT:`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
image slot id is statically set to 10 here. I think that might the reason for the parallel LLaVA issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's fix this later then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The images are loaded for each slot, so it doesn't matter if it's 10, as it ultimately loads a different slot.
Yes technically we can chat about an image on multiple turns, and system prompt should not be prepended other than the first message, but I think optimizing it to eliminate the need to load/preprocess/encode images by a caching mechanism may be another story, i.e., another PR. |
This is my mistake - I tried to rebase this PR at some point but the changes to |
I can't thank you enough! This is great. I have one question:
In this case, and assuming that In my test, |
I should mention that using
-> 50 requests sent, total time = 60.38 seconds
-> 50 requests sent sequentially, total time = 52.36 seconds |
Optimal performance when using parallelism with quantum models currently requires some manual adjustments to some constants depending on the model size and hardware, both for Metal and CUDA. I wrote some thoughts on this here #3524 and here #3545 (comment) You can try playing with this piece of code to get better parallel performance for your use-case: Lines 1032 to 1058 in e393259
But atm it's not very user-friendly and I haven't found an universal way to obtain the best performance automatically. |
@ggerganov Thanks, I will have a look into those. Something that occasionally helped speed up inference on my end was using native Python ThreadPoolExecutor to call the server endpoint in parallel. Although, similar to what you said, it required fiddling with the number of workers—sometimes parallel Python code execution would actually slow down inference. |
Continuation of #3589
Commands to test: