Cache when using v1/chat/completions? #4287
Is it possible to tell the llama.cpp server to cache prompts when using the `v1/chat/completions` endpoint?

I have a CLI interface I created for fiction authors that accesses the OpenAI endpoints. I want to enable it to access local models via the llama.cpp server. I've got it working now, but responses are very slow because the server re-evaluates the entire accumulated prompt with each request. I see that the `/completions` endpoint supports a cache flag, but I don't see one for the `v1/chat/completions` endpoint.
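For reference, one way to experiment with prompt caching in the meantime is to call the server's native completion endpoint directly and set its cache flag in the request body, rather than going through the OpenAI-compatible chat route. The sketch below is only an illustration of that idea, not a confirmed workaround for `v1/chat/completions`: it assumes the server is listening on `http://localhost:8080`, that the native endpoint is `/completion`, and that the cache flag is named `cache_prompt` (names taken from the server's native API, not from the chat endpoint).

```python
# Minimal sketch: ask the llama.cpp server to reuse its evaluated prompt prefix
# between requests. Assumptions (not confirmed in this issue): server address,
# endpoint path /completion, and the field name cache_prompt.
import requests

BASE_URL = "http://localhost:8080"  # assumed local llama.cpp server address


def complete(prompt: str) -> str:
    """Send the accumulated prompt and request that the prefix be cached."""
    payload = {
        "prompt": prompt,
        "n_predict": 256,        # cap on generated tokens for this request
        "cache_prompt": True,    # assumed flag: reuse the unchanged prefix next time
    }
    resp = requests.post(f"{BASE_URL}/completion", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["content"]


if __name__ == "__main__":
    story = "Chapter 1\n\nThe rain had not stopped for three days."
    print(complete(story))
```

With a flag like this honored by the server, the client can stay stateless and keep resending the whole accumulated story text, while the server skips re-evaluating the unchanged prefix, which is what makes long interactive sessions responsive.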