Garbage output on serving 4 parallel users. #12067
Hi @adi-lb-phoenix, could you please provide your env and device config? In our test, ollama was able to run codellama as expected on MTL Linux.
Hello @sgwhat.
On the host system, the GPU is an Intel Arc A770.
We are currently locating the cause of the
@sgwhat Thank you for picking this up. It has been observed not just for
ggerganov/llama.cpp#9505 (comment)
When serving just one user, IPEX-LLM is faster than llama.cpp.
Below is the result from llama.cpp:
I started a server with the command:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve

We then opened 4 terminals and executed ./ollama run codellama in each, after which the model loaded. In all 4 terminals we entered the prompt "write a long poem." and executed it simultaneously (four parallel requests). The output is garbage values. A minimal reproduction sketch follows.
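For reference, here is a minimal sketch of how the same four-way parallel request could be driven against Ollama's HTTP API instead of four interactive terminals. It assumes an Ollama server already running on the default http://localhost:11434 with the codellama model pulled, and uses Ollama's documented /api/generate endpoint; the script itself is only an illustration, not part of the original report.

```python
# Hypothetical reproduction script: send four identical prompts to an
# Ollama server at the same time and print each completion.
# Assumes: ollama serve was started with OLLAMA_NUM_PARALLEL=4 and the
# codellama model is available locally.
import json
import threading
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
PROMPT = "write a long poem."
NUM_CLIENTS = 4

def ask(client_id: int) -> None:
    payload = json.dumps({
        "model": "codellama",
        "prompt": PROMPT,
        "stream": False,  # return the full response as one JSON object
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    print(f"--- client {client_id} ---\n{body.get('response', '')}\n")

# Launch the four requests concurrently, mimicking four terminals.
threads = [threading.Thread(target=ask, args=(i,)) for i in range(NUM_CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the completions come back garbled only when the requests overlap, that would point at the parallel-decoding path rather than the model itself.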