
llama.cpp: infinite loop of context switch #1333

Closed
mudler opened this issue Nov 25, 2023 · 6 comments
Assignees: mudler
Labels: area/llama.cpp, bug (Something isn't working), high prio

@mudler (Owner) commented Nov 25, 2023

This card is a tracker for ggerganov/llama.cpp#3969

This seems to happen to me as well, and only with the llama.cpp backend: I can reproduce it programmatically with certain text by using grammars; see the sketch below.
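A minimal sketch of that kind of request against the OpenAI-compatible endpoint (the model name, prompt and grammar here are placeholders, and max_tokens is deliberately left unset so generation only stops when the model decides to):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "List every country in the world and a fact about each one."}],
        "grammar": "root ::= [a-zA-Z0-9 ,.]+"
      }'

Whether a given grammar actually triggers the loop depends on the model and prompt; the common ingredient is a generation with no token cap that keeps going until the context fills and shifting kicks in.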

Update:

There is an "epic" here that we should keep an eye on: ggerganov/llama.cpp#4216

mudler added the bug (Something isn't working) label Nov 25, 2023
mudler self-assigned this Nov 25, 2023
@localai-bot (Contributor)

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but.... I can also be funny or helpful 😸, and I can generally give good tips or point to places in the documentation or the code worth looking at, based on what you wrote in the issue.

Please don't engage in conversation with me;
I don't support replying (yet)!

⚠️⚠️⚠️⚠️⚠️

I'm sorry, I don't understand what you mean by "this card is a tracker for ggerganov/llama.cpp#3969". Can you please provide more context or clarify the issue you are referring to?


@mudler (Owner, Author) commented Nov 26, 2023

A workaround (not a solution) is available in #1339 - it still doesn't handle connections gracefully, but at least it keeps the API from stalling indefinitely.
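Until that is merged, a client-side timeout at least keeps callers from hanging on a stalled request (a sketch; the endpoint, model name and the 120-second budget are placeholders):

curl --max-time 120 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "hello"}]}'

curl exits with code 28 when the time budget is exceeded, so the caller can retry or fail cleanly instead of waiting forever.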

@dionysius (Contributor) commented Jan 4, 2024

What is weird to me is that I don't have this issue with ollama, and they are using llama.cpp as well, AFAIK.

Model: TinyLlama-1.1B-Chat-v1.0
Config:

context_size: 1024
name: se-chat
parameters:
  model: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: chat
  chat_message: chatml
  completion: completion
threads: 1
gpu_layers: 100
[...]
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr llama_new_context_with_model: total VRAM used: 701.02 MiB (model: 601.02 MiB, context: 100.00 MiB)
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr Available slots:
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr  -> Slot 0 - max context: 1024
4:26PM INF [llama-cpp] Loads OK
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0 is processing [task id: 0]
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0 : kv cache rm - [0, end)
4:45PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0: context shift - n_keep = 0, n_left = 1022, n_discard = 511
[...]
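The last log line above is the context shift at work: with context_size: 1024 and n_keep = 0, the server drops half of the non-kept window every time the context fills, then keeps generating. A rough sketch of the arithmetic, using the values reported in the log (the exact bookkeeping of the couple of tokens reserved out of the 1024 lives upstream and is assumed here):

# values taken from the context-shift log line above (context_size = 1024)
n_keep=0; n_left=1022
n_discard=$((n_left / 2))   # 511: half of the non-kept window is dropped on every shift
echo "n_keep=$n_keep n_left=$n_left -> each shift frees $n_discard tokens and generation continues"

So if the model never emits a stop token, nothing ever ends the request: the window just keeps being halved and refilled.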

Just to be sure they are using exactly the same model, I didn't pull the model with ollama; I downloaded and imported it manually, using a modelfile based on the original one, and named it tinyllama-custom:

root@be406c0fa880:/srv/custom/models# cat tinyllama-1.1b-chat-v1.0.Q4_K_M.modelfile
FROM /srv/custom/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
SYSTEM """You are a helpful AI assistant."""
PARAMETER stop "<|system|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "</s>"
root@be406c0fa880:/srv/custom/models# ollama create tinyllama-custom -f tinyllama-1.1b-chat-v1.0.Q4_K_M.modelfile
transferring model data
creating model layer
creating template layer
creating system layer
creating parameters layer
creating config layer
using already created layer sha256:9fecc3b3cd76bba89d504f29b616eedf7da85b96540e490ca5824d3f7d2776a0
using already created layer sha256:af0ddbdaaa26f30d54d727f9dd944b76bdb926fdaf9a58f63f78c532f57c191f
using already created layer sha256:c8472cd9daed5e7c20aa53689e441e10620a002aacd58686aeac2cb188addb5c
using already created layer sha256:fa956ab37b8c21152f975a7fcdd095c4fee8754674b21d9b44d710435697a00d
writing layer sha256:2dc31b4921bcadca51ac93b60788e51c18955c48cab69d429ce922c0aa67ab82
writing manifest
success
λ ollama run tinyllama-custom
>>> How old is Mickey Mouse?
How old is Mickey Mouse?
As of now (2021), Mickey Mouse is 93 years old.

>>> Send a message (/? for help)

@mudler (Owner, Author) commented Jan 4, 2024

> What is weird to me is that I don't have this issue with ollama, and they are using llama.cpp as well, AFAIK. [...]

ollama doesn't use the llama.cpp http/server code; indeed, ggerganov/llama.cpp#3969 affects only the http implementation. When we switched away from the bindings we started relying directly on the llama.cpp code and building a gRPC server around it in C++, and that brings us much closer to the llama.cpp implementation (with its eventual bugs attached as well).

mudler added a commit that referenced this issue Feb 13, 2024
Infinite context shifting might well trigger an infinite loop if the model hallucinates and does not stop answering. This has the unpleasant effect that the prediction never terminates, which is especially the case with small models, which tend to hallucinate.

Works around #1333 by removing context shifting.

See also upstream issue: ggerganov/llama.cpp#3969
@mudler (Owner, Author) commented Feb 13, 2024

@dionysius this is going to be fixed in #1704

@mudler (Owner, Author) commented Feb 23, 2024

This is fixed in LocalAI. The upstream workaround is likewise to put a cap on max tokens: since models tend to hallucinate, infinite context shifting may actually lead to infinite answers too (see the commit message in c56b6dd). It was nice to see upstream confirm the issue in ggerganov/llama.cpp#3969 (comment) after the above workaround - it sounds much safer not to expose the user to this at all by disabling it entirely, and I think that is what we do: shield the user from such nuances. A sketch of the client-side cap is below.
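For reference, on builds that still have context shifting enabled, capping the request is the practical mitigation (a sketch against the OpenAI-compatible endpoint; the model name and the cap value are placeholders, pick a cap that fits your context_size):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "How old is Mickey Mouse?"}],
        "max_tokens": 256
      }'

With a cap well below context_size, generation is cut off before the window ever fills, so the request cannot fall into the shifting loop.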

We can look at this again if someone really thinks this is still an issue. Closing it for now.
