Cache Feature Request #95
I agree the "dummy" caching feature is already really useful; it makes all the difference between me wanting to use this rather than going to OpenAI ;) Regarding a real caching feature, we are waiting for upstream to persist the state correctly, right? Hypothetically, how large would a single state be?
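As a rough back-of-envelope answer to the state-size question: the bulk of a saved state is the KV cache, which scales with 2 (K and V) x layers x context length x embedding width x bytes per element. A sketch assuming 7B LLaMA-style dimensions (32 layers, 4096 embedding width) and an fp16 KV cache; the real saved state also includes logits, embeddings, and RNG state, so treat this as a lower bound:

```python
# Back-of-envelope KV-cache size. The dimensions below are assumptions
# for a 7B LLaMA-style model, not values taken from this project.
def kv_cache_bytes(n_layers, n_ctx, n_embd, bytes_per_elem=2):
    """2 tensors (K and V) per layer, one vector per context position."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_ctx=2048, n_embd=4096)  # fp16 KV
print(size / 2**30, "GiB")  # 1.0 GiB for a full 2048-token context
```

So on the order of a gigabyte per conversation at full context under those assumptions, which is why keeping many states resident gets expensive fast.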
@snxraven @jmtatsch This is definitely high on my list; unfortunately, at the moment I'm blocked because I can't restore the model state. I've tried the upstream state save/restore API without success so far. If anyone gets even a basic example to work using that API, I'd be happy to implement this.
A relevant issue has been opened by another dev over at llama.cpp: ggml-org/llama.cpp#1054
Furthermore, that issue has been reopened, so it may be fixed within llama.cpp. If you would like, we can close this issue, since a solution is clearly coming.
I did once do this by simply having multiple instances of llama running. |
Closing this in favor of #44 |
The current implementation of caching is wonderful; it's been a great help speeding up conversations.
I do notice this trips up when a secondary user starts a conversation. Would it be possible to allow for multi-conversation caching?
The main issue currently is that the cache grows large over time, and if the second user submits a question and then the first user submits another, the first user's entire chat history is re-run all over again.
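For illustration, a per-conversation cache along these lines might look like the following sketch. This is purely hypothetical: `ConversationCache` and the idea of storing one opaque saved state per conversation id are assumptions for discussion, not this project's API. It keys saved states by conversation so one user's prompt no longer clobbers another's, and reuses the longest matching token prefix:

```python
class ConversationCache:
    """Hypothetical sketch: keep one saved model state per conversation
    id so a second user's prompt does not invalidate the first user's
    cache. The 'state' values are treated as opaque blobs."""

    def __init__(self, max_conversations=8):
        self.max_conversations = max_conversations
        self.states = {}   # conversation_id -> (tokens, state)
        self.order = []    # conversation ids, oldest first (for eviction)

    def lookup(self, conversation_id, tokens):
        """Return (saved_state, n_matched) for the cached token prefix
        shared with this prompt, or (None, 0) if nothing matches."""
        entry = self.states.get(conversation_id)
        if entry is None:
            return None, 0
        cached_tokens, state = entry
        n = 0
        for a, b in zip(cached_tokens, tokens):
            if a != b:
                break
            n += 1
        return (state, n) if n else (None, 0)

    def store(self, conversation_id, tokens, state):
        # Evict the oldest conversation if we're at capacity.
        if (conversation_id not in self.states
                and len(self.states) >= self.max_conversations):
            oldest = self.order.pop(0)
            del self.states[oldest]
        self.states[conversation_id] = (list(tokens), state)
        if conversation_id in self.order:
            self.order.remove(conversation_id)
        self.order.append(conversation_id)
```

On a cache hit, only the tokens after `n_matched` would need to be evaluated, so each user keeps their own warm prefix. The trade-off, per the size estimate above, is memory: each resident state can be on the order of the full KV cache.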