Reducing the time needed to reload a piece of text into the model by caching the state #202
#174 also asked this, or do you have something else in mind?
Thank you for using llama.cpp and thank you for sharing your feature request! You'll be excited to hear that what you're requesting is my top priority right now. I'm using #91 as the best place to discuss this, since the solution will entail using mmap(). Everyone is welcome to participate in helping us find the best solution. I believe mmap() will reduce startup latency to effectively zero, for everyone, and it'll work on nearly every platform on earth, including Windows, which has a nearly equivalent API.
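To illustrate why mmap() cuts startup latency: the file is mapped into the address space rather than copied, and the OS faults pages in lazily on first access. A minimal sketch of the idea (not llama.cpp's actual loader; `map_file` is a hypothetical helper):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file into memory read-only. "Loading" returns almost
   immediately; pages are read from disk only when first touched. */
void *map_file(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after closing the fd */
    if (addr == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return addr;
}
```

A further benefit is that multiple processes mapping the same weights file share one copy of those pages in the OS page cache.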
I think this is a different issue — that one is about changing how the model is loaded; this one is about reducing the time needed to reload a piece of text into the model by caching the state.
As you wish. Re-opening.
Basically yes, except that interactive user input and generated results should be saved too. So you could save, stop, and later continue right where the model left off, even on another PC.
I can't find it now, but @ggerganov said save/restore of the K and V tensors would preserve the state, IIRC (see lines 79 to 82 at commit 7213110).
@bitRAKE Yes, those are the transformer's hidden state, and preserving them is sufficient. Now the question is how to edit them properly.
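Preserving the K and V tensors amounts to dumping their backing buffers to disk along with the number of tokens already evaluated, and loading everything back before resuming. A minimal sketch of the idea (the struct and field names here are illustrative stand-ins, not llama.cpp's actual internals):

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for the cached attention state:
   one contiguous buffer per tensor plus the token count it covers. */
struct kv_cache {
    float *k, *v;     /* key/value buffers */
    size_t n_bytes;   /* size of each buffer in bytes */
    int    n_past;    /* number of tokens already evaluated */
};

int kv_cache_save(const struct kv_cache *c, const char *path) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&c->n_past, sizeof c->n_past, 1, f) == 1
          && fwrite(&c->n_bytes, sizeof c->n_bytes, 1, f) == 1
          && fwrite(c->k, 1, c->n_bytes, f) == c->n_bytes
          && fwrite(c->v, 1, c->n_bytes, f) == c->n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}

int kv_cache_load(struct kv_cache *c, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n_bytes = 0;
    int ok = fread(&c->n_past, sizeof c->n_past, 1, f) == 1
          && fread(&n_bytes, sizeof n_bytes, 1, f) == 1
          && n_bytes == c->n_bytes   /* must match the running model */
          && fread(c->k, 1, c->n_bytes, f) == c->n_bytes
          && fread(c->v, 1, c->n_bytes, f) == c->n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}
```

A real implementation would also need to record and verify model parameters (context size, layer count, quantization) so a saved state is never loaded into an incompatible model.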
This issue is a duplicate of #64, isn't it? Since llama-rs did essentially the same thing, first in rustformers/llm#14 and then with a slightly different interface in rustformers/llm#38, this is definitely feasible and would be really useful. May I suggest closing this issue and continuing the discussion in #64? One use case that would benefit greatly from session (KV) caching is story generation: start with an initial prompt and then continue down the most promising alternatives that are being generated.
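The story-generation workflow above boils down to snapshotting the evaluated-prompt state once and forking a fresh copy per alternative, so the expensive prompt evaluation is never repeated. A toy sketch of that pattern (`session` and `session_fork` are hypothetical; in practice the blob would be the model's K/V buffers):

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the state produced by evaluating a prompt. */
struct session {
    unsigned char *state;
    size_t n;
};

/* Each continuation gets its own independent copy of the prompt
   state, so branches can diverge without re-evaluating the prompt. */
struct session session_fork(const struct session *s) {
    struct session copy;
    copy.n = s->n;
    copy.state = malloc(s->n);
    memcpy(copy.state, s->state, s->n);
    return copy;
}
```
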
Yes, it is the same |
Hey!
Is it possible to add a way of dumping the current state into a file, so it can then be reloaded later? This would avoid the time needed to reload a long prompt over and over again.
Thanks
Niansa