
Cache when using v1/chat/completions? #4287

Closed
Michael-F-Ellis opened this issue Dec 1, 2023 · 1 comment

@Michael-F-Ellis

Is it possible to tell the llama.cpp server to cache prompts when using the v1/chat/completions endpoint?

I have a CLI interface I created for fiction authors that accesses the OpenAI endpoints, and I want to enable it to access local models via the llama.cpp server. I have it working now, but responses are very slow because the server re-evaluates the entire accumulated prompt with each request. I see that the /completions endpoint supports a cache flag, but I don't see one for the v1/chat/completions endpoint.
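
For reference, here is a minimal sketch of how that cache flag is passed to the plain completion endpoint, assuming a llama.cpp server running on localhost:8080 and that the flag is named cache_prompt (please verify the endpoint path and field name against your build's server README):

```python
import requests

# Minimal sketch: llama.cpp server running locally on port 8080 (assumption).
# The completion endpoint takes a JSON body; "cache_prompt" asks the server to
# keep the evaluated prompt in its KV cache so a shared prefix is reused on the
# next request instead of being re-evaluated from scratch.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Chapter 1. The storm had not let up for three days.",
        "n_predict": 128,
        "cache_prompt": True,  # reuse the previously evaluated prompt prefix
    },
    timeout=600,
)
print(resp.json()["content"])
```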

@Michael-F-Ellis
Author

This has been resolved by #4347.
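
For anyone finding this later, here is a hedged sketch of what passing the flag through the OpenAI-compatible endpoint could look like from a Python client. The extra_body pass-through and the cache_prompt field name are assumptions on my part; check the current server documentation for the parameters actually supported by /v1/chat/completions:

```python
from openai import OpenAI

# Assumes a llama.cpp server exposing the OpenAI-compatible API on port 8080.
# Local servers typically ignore the API key, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

resp = client.chat.completions.create(
    model="local-model",  # model name is arbitrary for a local server (assumption)
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Continue the scene from where we left off."},
    ],
    # Passing cache_prompt as an extra body field is an assumption; verify
    # against the server's documented chat-completions parameters.
    extra_body={"cache_prompt": True},
)
print(resp.choices[0].message.content)
```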
