
llama.cpp: infinite loop of context switch #1333

Closed
mudler opened this issue Nov 25, 2023 · 6 comments
Assignees: mudler
Labels: area/llama.cpp, bug (Something isn't working), high prio

@mudler (Owner) commented Nov 25, 2023

This card is a tracker for ggerganov/llama.cpp#3969

This seems to happen to me as well, and only with the llama.cpp backend: I can reproduce it programmatically with certain text by using grammars; see the sketch below.
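A minimal sketch of that kind of request against the OpenAI-compatible endpoint (the model name, prompt and grammar here are placeholders, and max_tokens is deliberately left unset so generation only stops when the model decides to):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "List every country in the world and a fact about each one."}],
        "grammar": "root ::= [a-zA-Z0-9 ,.]+"
      }'

Whether a given grammar actually triggers the loop depends on the model and prompt; the common ingredient is a generation with no token cap that keeps going until the context fills and shifting kicks in.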

Update:

There is an "epic" here that we should keep an eye on: ggerganov/llama.cpp#4216

mudler added the bug (Something isn't working) label Nov 25, 2023
mudler self-assigned this Nov 25, 2023
@localai-bot (Contributor)

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but.... I can also be funny or helpful 😸, and I can generally give good tips or point to places in the documentation or the code worth looking at, based on what you wrote in the issue.

Please don't engage in conversation with me;
I don't support replying (yet)!

⚠️⚠️⚠️⚠️⚠️

I'm sorry, I don't understand what you mean by "this card is a tracker for ggerganov/llama.cpp#3969". Can you please provide more context or clarify the issue you are referring to?


@mudler (Owner, Author) commented Nov 26, 2023

A workaround (not a solution) is available in #1339 - it still doesn't handle connections gracefully, but at least it keeps the API from stalling indefinitely.
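Until that is merged, a client-side timeout at least keeps callers from hanging on a stalled request (a sketch; the endpoint, model name and the 120-second budget are placeholders):

curl --max-time 120 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "hello"}]}'

curl exits with code 28 when the time budget is exceeded, so the caller can retry or fail cleanly instead of waiting forever.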

@dionysius (Contributor) commented Jan 4, 2024

What is weird to me is that I don't have this issue with ollama, and they are using llama.cpp as well, AFAIK.

Model: TinyLlama-1.1B-Chat-v1.0
Config:

context_size: 1024
name: se-chat
parameters:
  model: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: chat
  chat_message: chatml
  completion: completion
threads: 1
gpu_layers: 100
[...]
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr llama_new_context_with_model: total VRAM used: 701.02 MiB (model: 601.02 MiB, context: 100.00 MiB)
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr Available slots:
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr  -> Slot 0 - max context: 1024
4:26PM INF [llama-cpp] Loads OK
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0 is processing [task id: 0]
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0 : kv cache rm - [0, end)
4:45PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0: context shift - n_keep = 0, n_left = 1022, n_discard = 511
[...]
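The last log line above is the context shift at work: with context_size: 1024 and n_keep = 0, the server drops half of the non-kept window every time the context fills, then keeps generating. A rough sketch of the arithmetic, using the values reported in the log (the exact bookkeeping of the couple of tokens reserved out of the 1024 lives upstream and is assumed here):

# values taken from the context-shift log line above (context_size = 1024)
n_keep=0; n_left=1022
n_discard=$((n_left / 2))   # 511: half of the non-kept window is dropped on every shift
echo "n_keep=$n_keep n_left=$n_left -> each shift frees $n_discard tokens and generation continues"

So if the model never emits a stop token, nothing ever ends the request: the window just keeps being halved and refilled.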

Just to be sure they are using exactly the same model, I didn't pull the model with ollama; I downloaded and imported it manually, using a modelfile based on the original one, and named it tinyllama-custom:

root@be406c0fa880:/srv/custom/models# cat tinyllama-1.1b-chat-v1.0.Q4_K_M.modelfile
FROM /srv/custom/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
SYSTEM """You are a helpful AI assistant."""
PARAMETER stop "<|system|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "</s>"
root@be406c0fa880:/srv/custom/models# ollama create tinyllama-custom -f tinyllama-1.1b-chat-v1.0.Q4_K_M.modelfile
transferring model data
creating model layer
creating template layer
creating system layer
creating parameters layer
creating config layer
using already created layer sha256:9fecc3b3cd76bba89d504f29b616eedf7da85b96540e490ca5824d3f7d2776a0
using already created layer sha256:af0ddbdaaa26f30d54d727f9dd944b76bdb926fdaf9a58f63f78c532f57c191f
using already created layer sha256:c8472cd9daed5e7c20aa53689e441e10620a002aacd58686aeac2cb188addb5c
using already created layer sha256:fa956ab37b8c21152f975a7fcdd095c4fee8754674b21d9b44d710435697a00d
writing layer sha256:2dc31b4921bcadca51ac93b60788e51c18955c48cab69d429ce922c0aa67ab82
writing manifest
success
λ ollama run tinyllama-custom
>>> How old is Mickey Mouse?
How old is Mickey Mouse?
As of now (2021), Mickey Mouse is 93 years old.

>>> Send a message (/? for help)

@mudler (Owner, Author) commented Jan 4, 2024

> What is weird to me is that I don't have this issue with ollama, and they are using llama.cpp as well, AFAIK. [...]

ollama doesn't use the llama.cpp http/server code; indeed, ggerganov/llama.cpp#3969 affects only the http implementation. When we switched away from the bindings we started relying directly on the llama.cpp code and building a gRPC server around it in C++, and that brings us much closer to the llama.cpp implementation (with its eventual bugs attached as well).

mudler added a commit that referenced this issue Feb 13, 2024
Infinite context shifting might well trigger an infinite loop if the model hallucinates and does not stop answering. This has the unpleasant effect that the prediction never terminates, which is especially the case with small models, which tend to hallucinate.

Works around #1333 by removing context shifting.

See also upstream issue: ggerganov/llama.cpp#3969
@mudler (Owner, Author) commented Feb 13, 2024

@dionysius this is going to be fixed in #1704

@mudler (Owner, Author) commented Feb 23, 2024

This is fixed in LocalAI. The upstream workaround is likewise to put a cap on max tokens: since models tend to hallucinate, infinite context shifting may actually lead to infinite answers too (see the commit message in c56b6dd). It was nice to see upstream confirm the issue in ggerganov/llama.cpp#3969 (comment) after the above workaround - it sounds much safer not to expose the user to this at all by disabling it entirely, and I think that is what we do: shield the user from such nuances. A sketch of the client-side cap is below.
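For reference, on builds that still have context shifting enabled, capping the request is the practical mitigation (a sketch against the OpenAI-compatible endpoint; the model name and the cap value are placeholders, pick a cap that fits your context_size):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "How old is Mickey Mouse?"}],
        "max_tokens": 256
      }'

With a cap well below context_size, generation is cut off before the window ever fills, so the request cannot fall into the shifting loop.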

We can look at this again if someone really thinks this is still an issue. Closing it for now.
