-
It is my understanding that llama.cpp shifts the key-value cache when generating more tokens than fit into the context window, which is not supported for DeepSeek Coder V2. To reproduce, start a server with this model:

```sh
./llama-server -m DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf -c 32 -ngl 999 --port 8080
```

and then request a prompt completion:
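For example, a request along these lines should do it (the prompt and `n_predict` value are only illustrative; any completion that generates past the 32-token context should work):

```sh
# Illustrative request against llama-server's /completion endpoint.
# The prompt and n_predict are arbitrary; the point is to force
# generation beyond the 32-token context set with -c 32.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short poem about the sea.", "n_predict": 64}'
```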
This should trigger the error with llama.cpp release b3600. The corresponding code in llama.cpp is here: I believe a saner approach would be to simply stop generating tokens instead of crashing the server. Is there some option that can be set to prevent clients from crashing the server?
-
You can reduce
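For example (assuming this refers to the number of tokens requested per completion), capping `n_predict` in the request so that the prompt plus the generated tokens stay within the context window avoids the KV-cache shift entirely:

```sh
# Hypothetical mitigation: request few enough tokens that prompt + output
# fit inside the 32-token context, so no cache shift is ever attempted.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 16}'
```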
-
Any progress? Or (if not appropriate to discuss here), where would be an appropriate place to follow up?
I did some research and apparently this is classified as a 7.5 high severity security issue, who knew! @ggerganov Could you delete or hide this discussion until it is fixed? I've opened a security issue instead.