Reducing the time needed to reload a piece of text into the model by caching the state #202

Closed
niansa opened this issue Mar 16, 2023 · 9 comments
Labels: enhancement (New feature or request)

Comments

@niansa (Contributor) commented Mar 16, 2023

Hey!

Is it possible to add a way of dumping the current state into a file, so it can then be reloaded later? This would avoid the time needed to reload a long prompt over and over again.

Thanks
Niansa

@bitRAKE (Contributor) commented Mar 16, 2023

#174 also asked this, or do you have something else in mind?

jart added the "duplicate" label Mar 16, 2023
@jart (Contributor) commented Mar 16, 2023

Thank you for using llama.cpp and thank you for sharing your feature request! You'll be excited to hear that what you're requesting is my top priority right now. I'm using #91 as the best place to discuss this, since the solution will entail using mmap(). Everyone is welcome to participate in helping us find the best solution. I believe mmap() will reduce startup latency to effectively zero, for everyone, and it'll work on nearly every platform on earth, including Windows, which has a nearly equivalent API.
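
For context, here is a minimal, self-contained POSIX sketch of the mmap() idea. It is not llama.cpp's actual loader, just an illustration of why mapping avoids the upfront read: the OS pages the weights in lazily, and Windows offers the near-equivalent CreateFileMapping/MapViewOfFile.

```cpp
// Minimal sketch only: map a weight file read-only so pages are faulted in
// on demand by the OS instead of being copied up front with fread().
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Tensor data pointers could point directly into this mapping, so there
    // is no separate "load the whole file" step at startup.
    std::printf("mapped %lld bytes at %p\n", (long long) st.st_size, addr);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```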

jart closed this as completed Mar 16, 2023
@j-f1 (Collaborator) commented Mar 16, 2023

I think this is a different issue — that one is about changing how the model is loaded, this one is about reducing the time needed to reload a piece of text into the model by caching the state.

@jart (Contributor) commented Mar 16, 2023

As you wish. Re-opening.

jart reopened this Mar 16, 2023
jart added the "enhancement" label and removed the "duplicate" label Mar 16, 2023
@niansa (Contributor, Author) commented Mar 16, 2023

> #174 also asked this, or do you have something else in mind?

Basically yes, except that interactive user input and generated results should be saved too. So you could save, stop, and later just continue where the model (and you) left off, even on another PC.

@bitRAKE (Contributor) commented Mar 16, 2023

I can't find it now, but @ggerganov said save/restore of the k&v tensors would preserve the state, iirc.

llama.cpp/main.cpp (lines 79 to 82 in 7213110):

// key + value memory
struct ggml_tensor * memory_k;
struct ggml_tensor * memory_v;
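
For illustration, a rough sketch of what saving and restoring those two tensors could look like is below. The helper names are hypothetical; it assumes ggml's ggml_nbytes() and the tensors' raw data pointers, and the caller would also need to persist the evaluated tokens / n_past alongside the cache. Error handling is kept minimal.

```cpp
// Hypothetical sketch: persist the KV cache (memory_k / memory_v) to a file
// and read it back, so a long prompt does not have to be re-evaluated.
#include <cstdint>
#include <cstdio>

#include "ggml.h"

static bool kv_cache_save(const char * path,
                          const struct ggml_tensor * memory_k,
                          const struct ggml_tensor * memory_v,
                          int32_t n_past) {
    FILE * f = std::fopen(path, "wb");
    if (!f) return false;
    std::fwrite(&n_past, sizeof(n_past), 1, f);                // number of cached tokens
    std::fwrite(memory_k->data, 1, ggml_nbytes(memory_k), f);  // raw key cache
    std::fwrite(memory_v->data, 1, ggml_nbytes(memory_v), f);  // raw value cache
    std::fclose(f);
    return true;
}

static bool kv_cache_load(const char * path,
                          struct ggml_tensor * memory_k,
                          struct ggml_tensor * memory_v,
                          int32_t * n_past) {
    FILE * f = std::fopen(path, "rb");
    if (!f) return false;
    bool ok = std::fread(n_past, sizeof(*n_past), 1, f) == 1
           && std::fread(memory_k->data, 1, ggml_nbytes(memory_k), f) == ggml_nbytes(memory_k)
           && std::fread(memory_v->data, 1, ggml_nbytes(memory_v), f) == ggml_nbytes(memory_v);
    std::fclose(f);
    return ok;
}
```

A real version would also want to store the model/context parameters (n_ctx, n_embd, data type) in a header so a dump cannot be reloaded into an incompatibly shaped cache.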

@jarcen commented Mar 16, 2023

@bitRAKE Yes, those are the transformer's hidden state; preserving them is sufficient. Now the question is how to edit them properly.
I'm also interested in removing the first n elements to deal with the context memory filling up. A purely illustrative sketch of such a shift follows below.
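
The sketch assumes the cache layout used by main.cpp at the time (layer il's entry for token t starting at element (il*n_ctx + t)*n_embd and spanning n_embd contiguous values) and a hypothetical helper name. Note that the cached keys already have rotary position embeddings applied, so a plain shift does not reproduce the state the model would compute for the shortened text; that is exactly the "how to edit them properly" question.

```cpp
// Illustrative only: slide the KV cache down by n_drop tokens per layer.
// Positional information baked into the cached keys is NOT fixed up here.
#include <cstring>

#include "ggml.h"

static void kv_cache_drop_front(struct ggml_tensor * memory,   // memory_k or memory_v
                                int n_layer, int n_ctx, int n_embd,
                                int n_past, int n_drop) {
    const size_t row = (size_t) n_embd * ggml_element_size(memory); // one token in one layer
    char * base = (char *) memory->data;

    for (int il = 0; il < n_layer; ++il) {
        char * layer = base + (size_t) il * n_ctx * row;
        // move tokens [n_drop, n_past) to the front of this layer's slab
        std::memmove(layer, layer + (size_t) n_drop * row,
                     (size_t) (n_past - n_drop) * row);
    }
    // the caller would then continue with n_past - n_drop
}
```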

gjmulder changed the title from '"Saving" current state?' to 'Reducing the time needed to reload a piece of text into the model by caching the state' Mar 17, 2023
@sgoll commented Mar 30, 2023

This issue is a duplicate of #64, isn't it? Since llama-rs did essentially the same thing, first in rustformers/llm#14, then with a slightly different interface in rustformers/llm#38, this is definitely feasible and would be really useful.

May I suggest closing this issue and continuing the discussion in #64?

One use case that would benefit greatly from session (KV) caching is story generation: start with an initial prompt and then continue down the most promising alternatives that are being generated.

@ggerganov (Owner) commented

Yes, it is the same
