# Description

# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code: 96981f3
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
# Expected Behavior
The KV cache is reused, and only the part of the prompt starting from the first mismatched token needs to be processed.
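For clarity, here is a minimal sketch of the prefix-reuse logic I expect (Python, with illustrative names; this is not the actual server code):

```python
def common_prefix_len(cached_tokens: list, prompt_tokens: list) -> int:
    """Length of the shared prefix between the cached tokens and the new prompt."""
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return n

def tokens_to_process(cached_tokens: list, prompt_tokens: list) -> list:
    # Only the suffix after the first mismatch should need evaluation;
    # the KV cache entries for the shared prefix can be kept as-is.
    keep = common_prefix_len(cached_tokens, prompt_tokens)
    return prompt_tokens[keep:]
```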
# Current Behavior
It appears that, after #3677, the server stopped reusing the KV cache and now reprocesses the whole prompt on each completion request.
# Environment and Context
Linux, CLBlast build.
# Steps to Reproduce
Command: `server -c 4096 -m xwin-lm-70b-v0.1.Q6_K.gguf`
```
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 574.13 MB
Available slots:
 -> Slot 0 - max context: 4096
all slots are idle and system prompt is empty, clear the KV cache
request: POST /completion {"prompt":"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Hello, can you help me?\nASSISTANT:"}
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 27268.61 ms / 47 tokens ( 580.18 ms per token, 1.72 tokens per second)
print_timings: eval time = 59156.03 ms / 42 runs ( 1408.48 ms per token, 0.71 tokens per second)
print_timings: total time = 86424.64 ms
slot 0 released (90 tokens in cache)
response:
{"content":" Hello! I'd be happy to help you with any questions or topics you have in mind. Please feel free to ask, and I'll do my best to provide you with useful information and assistance.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"xwin-lm-70b-v0.1.Q6_K.gguf","n_ctx":4096,"n_keep":0,"n_predict":-1,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"xwin-lm-70b-v0.1.Q6_K.gguf","prompt":"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Hello, can you help me?\nASSISTANT:","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":59156.026,"predicted_n":42,"predicted_per_second":0.7099868405629547,"predicted_per_token_ms":1408.4768095238094,"prompt_ms":27268.61,"prompt_n":47,"prompt_per_second":1.7235935385045293,"prompt_per_token_ms":580.1831914893617},"tokens_cached":89,"tokens_evaluated":47,"tokens_predicted":42,"truncated":false}
```
At this point, both the original prompt and the generated text should be in the cache.

Now I make the exact same request as before. The prompt should match the first half of the cache.
```
slot 0 is processing [task id: 1]
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 18216.41 ms / 47 tokens ( 387.58 ms per token, 2.58 tokens per second)
print_timings: eval time = 59435.15 ms / 42 runs ( 1415.12 ms per token, 0.71 tokens per second)
print_timings: total time = 77651.56 ms
slot 0 released (90 tokens in cache)
response:
{"content":" Hello! I'd be happy to help you with any questions or topics you have in mind. Please feel free to ask, and I'll do my best to provide you with useful information and guidance.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"xwin-lm-70b-v0.1.Q6_K.gguf","n_ctx":4096,"n_keep":0,"n_predict":-1,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"xwin-lm-70b-v0.1.Q6_K.gguf","prompt":"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Hello, can you help me?\nASSISTANT:","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":59435.148,"predicted_n":42,"predicted_per_second":0.7066525686114217,"predicted_per_token_ms":1415.1225714285715,"prompt_ms":18216.411,"prompt_n":47,"prompt_per_second":2.580091105761722,"prompt_per_token_ms":387.58321276595746},"tokens_cached":89,"tokens_evaluated":47,"tokens_predicted":42,"truncated":false}
```
Instead, the server erases the whole cache (`slot 0 : kv cache rm - [0, end)`) and processes all 47 prompt tokens again.
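For reference, the behavior can be reproduced with a short Python script (assuming the server listens on the default `http://localhost:8080`; the `/completion` endpoint and the `tokens_evaluated` / `timings` response fields are taken from the logs above):

```python
import json
import urllib.request

URL = "http://localhost:8080/completion"  # assumed default server address
PROMPT = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's "
          "questions.\n\nUSER: Hello, can you help me?\nASSISTANT:")

def complete(prompt):
    # POST the prompt to the server and return the parsed JSON response.
    req = urllib.request.Request(
        URL,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Send the identical prompt twice; with working cache reuse the second
# request should report far fewer evaluated prompt tokens, but it reports
# the same 47 as the first one.
for i in range(2):
    r = complete(PROMPT)
    print(f"request {i}: tokens_evaluated={r['tokens_evaluated']}, "
          f"prompt eval {r['timings']['prompt_ms']:.0f} ms")
```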