Cache when using v1/chat/completions? #4287
Is it possible to tell the llama.cpp server to cache prompts when using the `v1/chat/completions` endpoint?

I have a CLI interface I created for fiction authors that accesses the OpenAI endpoints. I want to enable it to access local models via the llama.cpp server. I've got it working now, but responses are very slow because the server re-evaluates the entire accumulated prompt with each request. I see that the `/completions` endpoint supports a cache flag, but I don't see one for the `v1/chat/completions` endpoint.
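For reference, one way to experiment with prompt caching in the meantime is to call the server's native completion endpoint directly and set its cache flag in the request body, rather than going through the OpenAI-compatible chat route. The sketch below is only an illustration of that idea, not a confirmed workaround for `v1/chat/completions`: it assumes the server is listening on `http://localhost:8080`, that the native endpoint is `/completion`, and that the cache flag is named `cache_prompt` (names taken from the server's native API, not from the chat endpoint).

```python
# Minimal sketch: ask the llama.cpp server to reuse its evaluated prompt prefix
# between requests. Assumptions (not confirmed in this issue): server address,
# endpoint path /completion, and the field name cache_prompt.
import requests

BASE_URL = "http://localhost:8080"  # assumed local llama.cpp server address


def complete(prompt: str) -> str:
    """Send the accumulated prompt and request that the prefix be cached."""
    payload = {
        "prompt": prompt,
        "n_predict": 256,        # cap on generated tokens for this request
        "cache_prompt": True,    # assumed flag: reuse the unchanged prefix next time
    }
    resp = requests.post(f"{BASE_URL}/completion", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["content"]


if __name__ == "__main__":
    story = "Chapter 1\n\nThe rain had not stopped for three days."
    print(complete(story))
```

With a flag like this honored by the server, the client can stay stateless and keep resending the whole accumulated story text, while the server skips re-evaluating the unchanged prefix, which is what makes long interactive sessions responsive.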