Completion endpoint returns same response repeatedly #1723
I noticed this behaviour too while working with LocalAI. I also tried setting the prompt_cache_all option to false, but this didn't change anything. Setting temperature, top_k, top_p, and seed to different values doesn't change anything either. The strangest part is that it keeps generating the same output even after recreating the whole container, except the ID is now different. I'm using localai/localai:v2.9.0-cublas-cuda12
While I'm having a look at this: the options seem to be passed just fine up to the gRPC server and llama.cpp. However, even when I set them explicitly at LocalAI/backend/cpp/llama/grpc-server.cpp, line 879 (in dc919e0), the results do not change.
Also to note, that's the full JSON data printed out (and it does indeed look like the settings are applied),
so it definitely looks like something is off in the llama.cpp implementation, as ours is just a gRPC wrapper on top of the http example (with a few edits to avoid bugs like #1333).
Ok, tracing this I tried switching the sampler with phi-2 and finally got a more non-deterministic result. It looks to me like it depends very much on the model/sampler strategy: mirostat can keep more candidates, while the temperature sampler has fewer to select from (so it is more deterministic). E.g. with phi-2:
name: phi-2
context_size: 2048
f16: true
gpu_layers: 90
mmap: true
trimsuffix:
- "\n"
parameters:
model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
temperature: 1.0
top_k: 40
top_p: 0.95
mirostat: 2
mirostat_eta: 1.0
mirostat_tau: 1.0
seed: -1
template:
chat: &template |
Instruct: {{.Input}}
Output:
completion: *template
usage: |
To use this model, interact with the API (in another terminal) with curl for instance:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "phi-2",
"messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
}'
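To double-check that a config like the one above actually yields varying output, you can send the same prompt twice and compare the completions. This is only a sketch: it assumes the endpoint and model name from the usage example above, the OpenAI-compatible response shape, and that jq is installed.

```sh
# Send the same chat request twice; with mirostat sampling the two
# completions should normally differ.
REQ='{"model":"phi-2","messages":[{"role":"user","content":"How are you doing?"}]}'
A=$(curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$REQ" | jq -r '.choices[0].message.content')
B=$(curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$REQ" | jq -r '.choices[0].message.content')
[ "$A" = "$B" ] && echo "outputs identical (still deterministic)" || echo "outputs differ"
```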
The default sampler on some models doesn't return enough candidates, which leads to a false sense of randomness. Tracing back the code, it looks like with the temperature sampler there might not be enough candidates to pick from, and since the seed and "randomness" take effect while picking a candidate, this yields the same results over and over. Fixes #1723 by updating the examples and documentation to use mirostat instead.
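To illustrate the point above with a minimal sketch (not LocalAI or llama.cpp code): the seed only matters for the random draw among the surviving candidate tokens, so if the sampler chain leaves a single dominant candidate, every seed produces the same token and the whole completion becomes deterministic.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sample draws an index from a candidate probability distribution using the given seed.
func sample(probs []float64, seed int64) int {
	r := rand.New(rand.NewSource(seed))
	x := r.Float64()
	acc := 0.0
	for i, p := range probs {
		acc += p
		if x < acc {
			return i
		}
	}
	return len(probs) - 1
}

func main() {
	// After aggressive filtering only one candidate survives: every seed picks token 0.
	narrow := []float64{1.0}
	// A broader candidate set (as mirostat tends to keep): the seed now matters.
	broad := []float64{0.4, 0.3, 0.2, 0.1}

	for _, seed := range []int64{1, 2, 3} {
		fmt.Printf("seed=%d narrow=%d broad=%d\n", seed, sample(narrow, seed), sample(broad, seed))
	}
}
```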
There were inconsistencies in the docs; the samples were also updated in #1820. If the issue persists, feel free to re-open.
* fix(defaults): set better defaults for inferencing. This changeset aims to have better defaults and to properly detect when no inference settings are provided with the model. If not specified, we default to mirostat sampling and offload all the GPU layers (if a GPU is detected). Related to #1373 and #1723
* Adapt tests
* Also pre-initialize default seed
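For reference, a minimal model config relying on those new defaults could look like the sketch below. Field names are assumed to match the phi-2 example earlier in this thread; sampler settings and gpu_layers are deliberately omitted so the fallback described above applies.

```yaml
name: phi-2
# No temperature/top_k/top_p/mirostat settings and no gpu_layers here:
# with the changeset above, LocalAI should default to mirostat sampling
# and offload all GPU layers when a GPU is detected.
parameters:
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  completion: *template
```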
LocalAI version:
LocalAI Release 2.8.0.
Environment, CPU architecture, OS, and Version:
Docker container running in Linux Mint 21.3 "Virginia".
Image built from 2.8.0 Dockerfile.
Ryzen 5 7600x (x86-64)
Describe the bug
In the /v1/chat/completions endpoint, it seems like cached responses are being returned over and over for the same prompt, even if the seed is changed. Changing the seed, temperature, min-p, top-k, and top-p has no effect. The same response with an identical ID will be returned every time. This happens even when prompt_cache_ro and prompt_cache_all are both set to false.
To Reproduce
Send a request to the chat endpoint. A curl command is shown below, which may have to be adapted for your testing circumstances.
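For illustration only (this is a sketch, not the original command from the report; field names follow the OpenAI-compatible request format, and the model name is taken from the phi-2 example earlier on this page), a request of roughly this shape can be sent and then repeated with a different seed value:

```sh
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "phi-2",
  "seed": 42,
  "temperature": 1.0,
  "messages": [{"role": "user", "content": "How are you doing?"}]
}'
```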
Now, change the seed. Observe that the endpoint returns an identical response.
Expected behavior
Two requests to the same model with the same prompt and different parameters may return two different results. LLMs are stochastic due to the sampling process, and parameters like temperature, seed, min-p, top-p, and top-k all act to introduce randomness between responses.
Logs
@lunamidori5 might have some, since they were helping me troubleshoot this problem.
Additional context
Here's the YAML file used when this problem was observed: