Completion endpoint returns same response repeatedly #1723
I noticed this behaviour too while working with LocalAI. I also tried setting the prompt_cache_all option to false, but this didn't change anything. Setting temperature, top_k, top_p, and seed to different values doesn't change anything either. The strangest part is that it keeps generating the same output even after recreating the whole container, except the ID is now different. I'm using localai/localai:v2.9.0-cublas-cuda12
While I'm having a look at this: the options seem to be passed just fine up to the gRPC server and llama.cpp. However, even when I set them explicitly at LocalAI/backend/cpp/llama/grpc-server.cpp, line 879 (in dc919e0), the results do not change.
Also to note, that's the full JSON data printed out (and it does indeed look like the settings are applied),
so it definitely looks like something is off in the llama.cpp implementation, as ours is just a gRPC wrapper on top of the http example (with a few edits to avoid bugs like #1333).
Ok, tracing this I tried switching the sampler with phi-2 and finally got a more non-deterministic result. It looks to me like it depends very much on the model/sampler strategy: mirostat can keep more candidates, while the temperature sampler has fewer to select from (so it is more deterministic). E.g. with phi-2:
name: phi-2
context_size: 2048
f16: true
gpu_layers: 90
mmap: true
trimsuffix:
- "\n"
parameters:
model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
temperature: 1.0
top_k: 40
top_p: 0.95
mirostat: 2
mirostat_eta: 1.0
mirostat_tau: 1.0
seed: -1
template:
chat: &template |
Instruct: {{.Input}}
Output:
completion: *template
usage: |
To use this model, interact with the API (in another terminal) with curl for instance:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "phi-2",
"messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
}'
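To double-check that a config like the one above actually yields varying output, you can send the same prompt twice and compare the completions. This is only a sketch: it assumes the endpoint and model name from the usage example above, the OpenAI-compatible response shape, and that jq is installed.

```sh
# Send the same chat request twice; with mirostat sampling the two
# completions should normally differ.
REQ='{"model":"phi-2","messages":[{"role":"user","content":"How are you doing?"}]}'
A=$(curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$REQ" | jq -r '.choices[0].message.content')
B=$(curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$REQ" | jq -r '.choices[0].message.content')
[ "$A" = "$B" ] && echo "outputs identical (still deterministic)" || echo "outputs differ"
```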
The default sampler on some models doesn't return enough candidates, which leads to a false sense of randomness. Tracing back the code, it looks like with the temperature sampler there might not be enough candidates to pick from, and since the seed and "randomness" take effect while picking a candidate, this yields the same results over and over. Fixes #1723 by updating the examples and documentation to use mirostat instead.
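To illustrate the point above with a minimal sketch (not LocalAI or llama.cpp code): the seed only matters for the random draw among the surviving candidate tokens, so if the sampler chain leaves a single dominant candidate, every seed produces the same token and the whole completion becomes deterministic.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sample draws an index from a candidate probability distribution using the given seed.
func sample(probs []float64, seed int64) int {
	r := rand.New(rand.NewSource(seed))
	x := r.Float64()
	acc := 0.0
	for i, p := range probs {
		acc += p
		if x < acc {
			return i
		}
	}
	return len(probs) - 1
}

func main() {
	// After aggressive filtering only one candidate survives: every seed picks token 0.
	narrow := []float64{1.0}
	// A broader candidate set (as mirostat tends to keep): the seed now matters.
	broad := []float64{0.4, 0.3, 0.2, 0.1}

	for _, seed := range []int64{1, 2, 3} {
		fmt.Printf("seed=%d narrow=%d broad=%d\n", seed, sample(narrow, seed), sample(broad, seed))
	}
}
```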
There were inconsistencies in the docs; the samples were also updated in #1820. If the issue persists, feel free to re-open.
* fix(defaults): set better defaults for inferencing. This changeset aims to have better defaults and to properly detect when no inference settings are provided with the model. If not specified, we default to mirostat sampling and offload all the GPU layers (if a GPU is detected). Related to #1373 and #1723
* Adapt tests
* Also pre-initialize default seed
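For reference, a minimal model config relying on those new defaults could look like the sketch below. Field names are assumed to match the phi-2 example earlier in this thread; sampler settings and gpu_layers are deliberately omitted so the fallback described above applies.

```yaml
name: phi-2
# No temperature/top_k/top_p/mirostat settings and no gpu_layers here:
# with the changeset above, LocalAI should default to mirostat sampling
# and offload all GPU layers when a GPU is detected.
parameters:
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  completion: *template
```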
LocalAI version:
LocalAI Release 2.8.0.
Environment, CPU architecture, OS, and Version:
Docker container running in Linux Mint 21.3 "Virginia".
Image built from 2.8.0 Dockerfile.
Ryzen 5 7600x (x86-64)
Describe the bug
In the /v1/chat/completions endpoint, it seems like cached responses are being returned over and over for the same prompt, even if the seed is changed. Changing the seed, temperature, min-p, top-k, and top-p has no effect. The same response with an identical ID will be returned every time. This happens even when prompt_cache_ro and prompt_cache_all are both set to false.
To Reproduce
Send a request to the chat endpoint. A curl command is shown below, which may have to be adapted for your testing circumstances.
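For illustration only (this is a sketch, not the original command from the report; field names follow the OpenAI-compatible request format, and the model name is taken from the phi-2 example earlier on this page), a request of roughly this shape can be sent and then repeated with a different seed value:

```sh
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "phi-2",
  "seed": 42,
  "temperature": 1.0,
  "messages": [{"role": "user", "content": "How are you doing?"}]
}'
```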
Now, change the seed. Observe that the endpoint returns an identical response.
Expected behavior
Two requests to the same model with the same prompt and different parameters may return two different results. LLMs are stochastic due to the sampling process, and parameters like temperature, seed, min-p, top-p, and top-k all act to introduce randomness between responses.
Logs
@lunamidori5 might have some, since they were helping me troubleshoot this problem.
Additional context
Here's the YAML file used when this problem was observed: