Garbage output on serving 4 parallel users. #12067

Open
adi-lb-phoenix opened this issue Sep 11, 2024 · 6 comments
@adi-lb-phoenix commented Sep 11, 2024

I started a server with the command OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve. We then opened 4 terminals and executed ./ollama run codellama in each, after which the model loaded. On all 4 terminals we entered the prompt >>> write a long poem and submitted it at the same time (four parallel requests). The output is garbage values.
(screenshot attached: Screenshot_20240911_152331, showing the garbage output)
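
For reference, the same four parallel requests can be reproduced from a single shell against the Ollama REST API instead of four interactive terminals. This is only a sketch: it assumes the default endpoint on localhost:11434 and that codellama has already been pulled.

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve &
sleep 5   # give the server a moment to start

# fire four identical generation requests at the same time
for i in 1 2 3 4; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "codellama", "prompt": "write a long poem", "stream": false}' > out_$i.json &
done
wait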

@sgwhat (Contributor) commented Sep 13, 2024

Hi @adi-lb-phoenix, could you please provide your environment and device configuration? In our test, ollama was able to run codellama as expected on MTL Linux.

@adi-lb-phoenix (Author) commented Sep 13, 2024

Hello @sgwhat.
I have installed podman and distrobox on KDE Neon, and created an Ubuntu container using distrobox. IPEX-LLM is deployed inside that Ubuntu distrobox.
Inside the Ubuntu distrobox:

uname -a
Linux ubuntu22_ollama.JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

On the host system:

Linux JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The GPU is an Intel Arc A770.
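
Roughly, the container was set up like this (a sketch only; the image tag and container name are illustrative, not necessarily the exact ones used):

# on the KDE Neon host, with podman as the distrobox backend
distrobox create --name ubuntu22_ollama --image ubuntu:22.04
distrobox enter ubuntu22_ollama
# the ipex-llm ollama build is then installed and started inside this container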

@sgwhat (Contributor) commented Sep 14, 2024

We are currently investigating the cause of the codellama output issue on Linux with the Arc A770 and will notify you as soon as possible.

@adi-lb-phoenix (Author) commented:

@sgwhat Thank you for picking this up. The issue has been observed not just with codellama but with other models as well.

@adi-lb-phoenix (Author) commented Sep 16, 2024

ggerganov/llama.cpp#9505 (comment)
In the test linked above, llama.cpp does not output garbage values.

@adi-lb-phoenix (Author) commented:

When serving just one user, IPEX-LLM has better generation speed than llama.cpp (35.30 vs 17.46 tokens per second in eval).
Result from ipex-llm:

llama_print_timings:        load time =    7797.13 ms
llama_print_timings:      sample time =      30.64 ms /   400 runs   (    0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time =    1322.78 ms /    13 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   11301.98 ms /   399 runs   (   28.33 ms per token,    35.30 tokens per second)
llama_print_timings:       total time =   12711.93 ms /   412 tokens

Below is the result from llama.cpp

llama_perf_sampler_print:    sampling time =      31.73 ms /   413 runs   (    0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print:        load time =    4317.89 ms
llama_perf_context_print: prompt eval time =     456.68 ms /    13 tokens (   35.13 ms per token,    28.47 tokens per second)
llama_perf_context_print:        eval time =   22846.95 ms /   399 runs   (   57.26 ms per token,    17.46 tokens per second)
llama_perf_context_print:       total time =   23379.98 ms /   412 tokens
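
For context, single-user timing blocks like the ones above come from a run of roughly this shape (a sketch only; the model file, prompt, and 400-token budget are assumptions, not the exact command used):

# llama.cpp: generate ~400 tokens with all layers offloaded to the GPU
./llama-cli -m codellama-7b.Q4_K_M.gguf -ngl 99 -p "write a long poem" -n 400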
