Garbage output on serving 4 parallel users. #12067

Open
adi-lb-phoenix opened this issue Sep 11, 2024 · 6 comments
@adi-lb-phoenix commented Sep 11, 2024

I started a server with the command OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve. We then opened 4 terminals and executed ./ollama run codellama in each, after which the model loaded. On all 4 terminals we entered the prompt >>> write a long poem and submitted it at the same time (four parallel requests). The output is garbage values.
(screenshot attached: Screenshot_20240911_152331, showing the garbage output)
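
For reference, the same four parallel requests can be reproduced from a single shell against the Ollama REST API instead of four interactive terminals. This is only a sketch: it assumes the default endpoint on localhost:11434 and that codellama has already been pulled.

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve &
sleep 5   # give the server a moment to start

# fire four identical generation requests at the same time
for i in 1 2 3 4; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "codellama", "prompt": "write a long poem", "stream": false}' > out_$i.json &
done
wait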

@sgwhat (Contributor) commented Sep 13, 2024

Hi @adi-lb-phoenix, could you please provide your environment and device configuration? In our test, ollama was able to run codellama as expected on MTL Linux.

@adi-lb-phoenix (Author) commented Sep 13, 2024

Hello @sgwhat.
I have installed podman and distrobox on KDE Neon, and created an Ubuntu container using distrobox. IPEX-LLM is deployed inside that Ubuntu distrobox.
Inside the Ubuntu distrobox:

uname -a
Linux ubuntu22_ollama.JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

On the host system:

Linux JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The GPU is an Intel Arc A770.
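
Roughly, the container was set up like this (a sketch only; the image tag and container name are illustrative, not necessarily the exact ones used):

# on the KDE Neon host, with podman as the distrobox backend
distrobox create --name ubuntu22_ollama --image ubuntu:22.04
distrobox enter ubuntu22_ollama
# the ipex-llm ollama build is then installed and started inside this container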

@sgwhat (Contributor) commented Sep 14, 2024

We are currently investigating the cause of the codellama output issue on Linux with the Arc A770 and will notify you as soon as possible.

@adi-lb-phoenix (Author) commented:

@sgwhat Thank you for picking this up. The issue has been observed not just with codellama but with other models as well.

@adi-lb-phoenix (Author) commented Sep 16, 2024

ggerganov/llama.cpp#9505 (comment)
In the test linked above, llama.cpp does not output garbage values.

@adi-lb-phoenix (Author) commented:

When serving just one user, IPEX-LLM has better generation speed than llama.cpp (35.30 vs 17.46 tokens per second in eval).
Result from ipex-llm:

llama_print_timings:        load time =    7797.13 ms
llama_print_timings:      sample time =      30.64 ms /   400 runs   (    0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time =    1322.78 ms /    13 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   11301.98 ms /   399 runs   (   28.33 ms per token,    35.30 tokens per second)
llama_print_timings:       total time =   12711.93 ms /   412 tokens

Below is the result from llama.cpp

llama_perf_sampler_print:    sampling time =      31.73 ms /   413 runs   (    0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print:        load time =    4317.89 ms
llama_perf_context_print: prompt eval time =     456.68 ms /    13 tokens (   35.13 ms per token,    28.47 tokens per second)
llama_perf_context_print:        eval time =   22846.95 ms /   399 runs   (   57.26 ms per token,    17.46 tokens per second)
llama_perf_context_print:       total time =   23379.98 ms /   412 tokens
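
For context, single-user timing blocks like the ones above come from a run of roughly this shape (a sketch only; the model file, prompt, and 400-token budget are assumptions, not the exact command used):

# llama.cpp: generate ~400 tokens with all layers offloaded to the GPU
./llama-cli -m codellama-7b.Q4_K_M.gguf -ngl 99 -p "write a long poem" -n 400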
