llama.cpp server: How to effectively use cache_prompt parameter #10311

Mushoz · 2024-11-15T12:43:44Z

Mushoz
Nov 15, 2024

How are you guys effectively using the cache_prompt parameter? I am using my local LLM with openweb ui and aider, but AFAIK neither has the ability to set the cache_prompt parameter in their requests. So how are you guys effectively enabling the cache with so many clients not supporting this parameter?

Is there some other way to enable caching? The prompt processing would be a lot faster if earlier parts of our conversation could be cached.

ggerganov · 2024-11-15T12:54:59Z

ggerganov
Nov 15, 2024
Maintainer

It's up to the clients to support it. On our side, we can add an option to enable it by default for all requests. Maybe it's not a bad idea since I think it's always better to have it enabled.

12 replies

steampunque Nov 18, 2024

@ggerganov is caching ALWAYS desirable though?

The llama.cpp server prompt cache implementation will make generation non-deterministic, meaning you will get different answers for the same submitted prompt. Therefore if you need deterministic responses (guaranteed to give exact same results for same prompt every time) it will be necessary to turn the prompt cache off. #7745

ggerganov Nov 19, 2024
Maintainer

Submitting the same prompt with or without caching will always be deterministic in the single-user case. For multi-user, it's not deterministic with or without prompt caching.

ggerganov Nov 19, 2024
Maintainer

Correction - I read your #7745 (comment) and this would indeed lead to non-deterministic results when resubmitting the prompt after it was cached. The solution that separates the last token from the prompt makes sense and we should implement it. It will also be needed for speculative decoding support anyway (#10362)

steampunque Nov 19, 2024

Correction - I read your #7745 (comment) and this would indeed lead to non-deterministic results when resubmitting the prompt after it was cached. The solution that separates the last token from the prompt makes sense and we should implement it. It will also be needed for speculative decoding support anyway (#10362)

The code I showed in that comment will only work for server config with single slot. With multiple slots it will be necessary to keep track of the number of tokens added to the accumulating batch on a per slot basis as shown in the partial code block below. I did not worry about it since I only use single slot config. Any multiple slot config will immediately bring back non determinism to the results.

                       if (/*batch.n_tokens > 1*/ slot.batch_tokens > 1) {
                           slot.start_token = batch.token[batch.n_tokens-1];
                           slot.start_pos = batch.pos[batch.n_tokens-1];
                           slot.have_start_token=true;
                           batch.n_tokens--;
                           slot.batch_tokens--;

drunnells Nov 22, 2024

Is there a workaround for now?

The OP specifically mentions Open-WebUI. I've made a pull request to add cache_prompt to the completions call: open-webui/open-webui#7237

Or in the mean time you can manually edit site-packages/open_webui/apps/openai/main.py to add:

payload["cache_prompt"] = True

Right before:

    # Convert the modified body back to JSON
    payload = json.dumps(payload)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama.cpp server: How to effectively use cache_prompt parameter #10311

{{title}}

Replies: 1 comment 12 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

llama.cpp server: How to effectively use cache_prompt parameter #10311

Mushoz Nov 15, 2024

Replies: 1 comment · 12 replies

ggerganov Nov 15, 2024 Maintainer

steampunque Nov 18, 2024

ggerganov Nov 19, 2024 Maintainer

ggerganov Nov 19, 2024 Maintainer

steampunque Nov 19, 2024

drunnells Nov 22, 2024

Mushoz
Nov 15, 2024

Replies: 1 comment 12 replies

ggerganov
Nov 15, 2024
Maintainer

ggerganov Nov 19, 2024
Maintainer

ggerganov Nov 19, 2024
Maintainer