Reducing the time needed to reload a piece of text into the model by caching the state #202
#174 also asked this, or do you have something else in mind?
Thank you for using llama.cpp and thank you for sharing your feature request! You'll be excited to hear that what you're requesting is my top priority right now. I'm using #91 as the best place to discuss this, since the solution will entail using mmap(). Everyone is welcome to participate in helping us find the best solution. I believe mmap() will reduce startup latency to effectively zero, for everyone, and it'll work on nearly every platform on earth, including Windows, which has a nearly equivalent API.
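To illustrate why mmap() cuts startup latency: the file is mapped into the address space rather than copied, and the OS faults pages in lazily on first access. A minimal sketch of the idea (not llama.cpp's actual loader; `map_file` is a hypothetical helper):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file into memory read-only. "Loading" returns almost
   immediately; pages are read from disk only when first touched. */
void *map_file(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after closing the fd */
    if (addr == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return addr;
}
```

A further benefit is that multiple processes mapping the same weights file share one copy of those pages in the OS page cache.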
I think this is a different issue — that one is about changing how the model is loaded; this one is about reducing the time needed to reload a piece of text into the model by caching the state.
As you wish. Re-opening.
Basically yes, except that interactive user input and generated results should be saved too. So you could save, stop, and later continue right where the model left off, even on another PC.
I can't find it now, but @ggerganov said save/restore of the K and V tensors would preserve the state, IIRC (see lines 79 to 82 at commit 7213110).
@bitRAKE Yes, those are the transformer's hidden state, and preserving them is sufficient. Now the question is how to edit them properly.
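Preserving the K and V tensors amounts to dumping their backing buffers to disk along with the number of tokens already evaluated, and loading everything back before resuming. A minimal sketch of the idea (the struct and field names here are illustrative stand-ins, not llama.cpp's actual internals):

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for the cached attention state:
   one contiguous buffer per tensor plus the token count it covers. */
struct kv_cache {
    float *k, *v;     /* key/value buffers */
    size_t n_bytes;   /* size of each buffer in bytes */
    int    n_past;    /* number of tokens already evaluated */
};

int kv_cache_save(const struct kv_cache *c, const char *path) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&c->n_past, sizeof c->n_past, 1, f) == 1
          && fwrite(&c->n_bytes, sizeof c->n_bytes, 1, f) == 1
          && fwrite(c->k, 1, c->n_bytes, f) == c->n_bytes
          && fwrite(c->v, 1, c->n_bytes, f) == c->n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}

int kv_cache_load(struct kv_cache *c, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n_bytes = 0;
    int ok = fread(&c->n_past, sizeof c->n_past, 1, f) == 1
          && fread(&n_bytes, sizeof n_bytes, 1, f) == 1
          && n_bytes == c->n_bytes   /* must match the running model */
          && fread(c->k, 1, c->n_bytes, f) == c->n_bytes
          && fread(c->v, 1, c->n_bytes, f) == c->n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}
```

A real implementation would also need to record and verify model parameters (context size, layer count, quantization) so a saved state is never loaded into an incompatible model.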
This issue is a duplicate of #64, isn't it? Since llama-rs did essentially the same thing, first in rustformers/llm#14 and then with a slightly different interface in rustformers/llm#38, this is definitely feasible and would be really useful. May I suggest closing this issue and continuing the discussion in #64? One use case that would benefit greatly from session (KV) caching is story generation: start with an initial prompt and then continue down the most promising alternatives that are being generated.
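The story-generation workflow above boils down to snapshotting the evaluated-prompt state once and forking a fresh copy per alternative, so the expensive prompt evaluation is never repeated. A toy sketch of that pattern (`session` and `session_fork` are hypothetical; in practice the blob would be the model's K/V buffers):

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the state produced by evaluating a prompt. */
struct session {
    unsigned char *state;
    size_t n;
};

/* Each continuation gets its own independent copy of the prompt
   state, so branches can diverge without re-evaluating the prompt. */
struct session session_fork(const struct session *s) {
    struct session copy;
    copy.n = s->n;
    copy.state = malloc(s->n);
    memcpy(copy.state, s->state, s->n);
    return copy;
}
```
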
Yes, it is the same |
Hey!
Is it possible to add a way of dumping the current state into a file, so it can then be reloaded later? This would avoid the time needed to reload a long prompt over and over again.
Thanks
Niansa