-
It is my understanding that llama.cpp shifts the key-value cache when generating more tokens than fit into the context window, which is not supported for DeepSeek Coder V2. To reproduce, start a server with this model:

```sh
./llama-server -m DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf -c 32 -ngl 999 --port 8080
```

and then request a prompt completion:
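For example, a request along these lines should do it (the prompt and `n_predict` value are only illustrative; any completion that generates past the 32-token context should work):

```sh
# Illustrative request against llama-server's /completion endpoint.
# The prompt and n_predict are arbitrary; the point is to force
# generation beyond the 32-token context set with -c 32.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short poem about the sea.", "n_predict": 64}'
```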
This should trigger the error with llama.cpp release b3600. The corresponding code in llama.cpp is here: I believe a saner approach would be to simply stop generating tokens instead of crashing the server. Is there some option that can be set to prevent clients from crashing the server?
-
You can reduce
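For example (assuming this refers to the number of tokens requested per completion), capping `n_predict` in the request so that the prompt plus the generated tokens stay within the context window avoids the KV-cache shift entirely:

```sh
# Hypothetical mitigation: request few enough tokens that prompt + output
# fit inside the 32-token context, so no cache shift is ever attempted.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 16}'
```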
-
Any progress? Or (if not appropriate to discuss here), where would be an appropriate place to follow up?
I did some research and apparently this is classified as a 7.5 high severity security issue, who knew! @ggerganov Could you delete or hide this discussion until it is fixed? I've opened a security issue instead.