Save and restore prompt evaluation state for much faster startup times #1169
Conversation
Out of curiosity, how many GB of memory are required on an M2 Mac to run the 30B model?
@mishudark originally I think it was 20GB per the table in the README, but now with mmap I think it's much lower. Activity Monitor reports only a few GB for me, which I think corresponds to just the model state?
Hi @ejones, good work. How much remains to restore conversations?
@dmahurin in terms of program logic it probably wouldn't take much; starting from the end of the session is actually simpler because you're not finding a common prefix. I think the challenge is settling on the arguments and program behavior: when and how often to save sessions (for which there's currently a slight delay), and whether to restore just the prompt vs. the full session on startup.
Yeah, the first time I used the session I pressed ctrl-d expecting it to be saved, but then I realized it was inserting into the logic path for the next token. I thought about some other methods for forcing a save, such as a different ctrl key or symbol? I personally will be using llama-cpp-python more since my own framework is built around it; I was really just testing it before. Great job though! Besides that one scenario, it works great. =]
Amazing work! Can't wait for this feature. Do you have any idea why it affected generation speed?
One possible improvement in the future is to make `llama_set_state_data()` and `llama_copy_state_data()` work just with the used KV cache, instead of always working with the full buffer. The reason is that for a prompt of 100 tokens, the KV cache will be only 100 / 2048 full, so there is no need to store so many zeros. This will make the session files much smaller.
Lines 2177 to 2188 in 11d9023
```cpp
// copy kv cache
{
    const size_t kv_size = ctx->model.kv_self.buf.size;
    const int    kv_ntok = llama_get_kv_cache_token_count(ctx);

    memcpy(out, &kv_size, sizeof(kv_size)); out += sizeof(kv_size);
    memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);

    if (kv_size) {
        memcpy(out, ctx->model.kv_self.buf.addr, kv_size); out += kv_size;
    }
}
```
Lines 2249 to 2271 in 11d9023
```cpp
// set kv cache
{
    size_t kv_size;
    int    kv_ntok;

    memcpy(&kv_size, in, sizeof(kv_size)); in += sizeof(kv_size);
    memcpy(&kv_ntok, in, sizeof(kv_ntok)); in += sizeof(kv_ntok);

    if (kv_size) {
        LLAMA_ASSERT(ctx->model.kv_self.buf.size == kv_size);

        void * k_data = ctx->model.kv_self.k->data; // remember data pointers
        void * v_data = ctx->model.kv_self.v->data; // because their value is stored in buf and overwritten by memcpy

        memcpy(ctx->model.kv_self.buf.addr, in, kv_size); in += kv_size;

        ctx->model.kv_self.k->data = k_data; // restore correct data pointers
        ctx->model.kv_self.v->data = v_data;
    }

    ctx->model.kv_self.n = kv_ntok;
}
```
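To illustrate the suggestion, here is a rough sketch of what a trimmed-down copy could look like inside `llama_copy_state_data` (using the `ctx` and `out` variables from the snippet above). This is not the implemented change; it assumes the k/v tensors are each laid out as `n_layer` contiguous blocks of `n_ctx * n_embd` elements, with only the first `kv_ntok` tokens of every block populated, which would need to be verified against the real tensor layout:
```cpp
// Sketch only: copy just the used slice of each layer's KV block instead of
// the whole buffer. Layout assumption: per-layer blocks of n_ctx * n_embd
// elements, of which only the first kv_ntok tokens are in use.
{
    const auto & hparams = ctx->model.hparams;
    const int    n_layer = hparams.n_layer;
    const int    n_ctx   = hparams.n_ctx;
    const int    n_embd  = hparams.n_embd;

    const int    kv_ntok  = llama_get_kv_cache_token_count(ctx);
    const size_t elt_size = ggml_element_size(ctx->model.kv_self.k);

    memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);

    for (int il = 0; il < n_layer; ++il) {
        const size_t layer_offset = (size_t) il * n_ctx * n_embd * elt_size; // start of this layer's block
        const size_t used_bytes   = (size_t) kv_ntok * n_embd * elt_size;    // bytes actually populated

        memcpy(out, (const char *) ctx->model.kv_self.k->data + layer_offset, used_bytes); out += used_bytes;
        memcpy(out, (const char *) ctx->model.kv_self.v->data + layer_offset, used_bytes); out += used_bytes;
    }
}
```
The restore side would mirror this, copying each layer's used slice back into place and zeroing (or ignoring) the remainder of the cache.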
@ggerganov that makes sense, thanks! I did notice there was a long stream of zeroes in the files. @Priestru not sure, perhaps there's a higher cost to the initial token(s) when starting from scratch? For the timings I only generated 10 tokens so startup costs there would definitely impact the average.
Kudos for the patch. Why not an example on 7B?
Hi! I decided to take a stab at leveraging the new get / set state APIs to cache initial prompt evaluation in `main`. On my M2 at least, this feature lets me start up `chat-13B.sh` with 65B in seconds (after having run before).

Overview
- `llama_load_session_file` and `llama_save_session_file` APIs to serialize the model state + a user-provided sequence of input tokens (more on that later)
- a `--session` arg to `examples/main` that designates a file to load/save the session (creating it on first run). Currently this is just used to speed up initial prompt evaluation, but it could eventually, e.g., restore conversations; see the usage sketch after this list.
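A minimal sketch of how the new session APIs could be called from application code. The signatures shown follow how `llama_load_session_file` / `llama_save_session_file` later appear in llama.h and may differ slightly from this PR's initial version; the file name and wrapper function are hypothetical:
```cpp
// Sketch only: restore a saved session if present, and save the evaluated
// tokens plus model state afterwards.
#include "llama.h"
#include <vector>

void run_with_session(llama_context * ctx, const std::vector<llama_token> & prompt_tokens) {
    const char * path_session = "prompt.session"; // hypothetical session file

    // Attempt to restore a previous session; on success, session_tokens holds
    // the token sequence whose evaluation produced the saved state.
    std::vector<llama_token> session_tokens(llama_n_ctx(ctx));
    size_t n_session_tokens = 0;
    if (llama_load_session_file(ctx, path_session, session_tokens.data(),
                                session_tokens.size(), &n_session_tokens)) {
        session_tokens.resize(n_session_tokens);
    } else {
        session_tokens.clear(); // first run: nothing to reuse
    }

    // ... evaluate the prompt, skipping the matched prefix (see Approach below) ...

    // Persist the state together with the tokens that produced it.
    llama_save_session_file(ctx, path_session, prompt_tokens.data(), prompt_tokens.size());
}
```
From the command line, the equivalent flow would be something like `./main -m <model> -f <prompt file> --session prompt.session` (paths hypothetical), run once to create the session and again to reuse it.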
Approach
Establishes a binary session file format that prepends some additional metadata to the state returned by `llama_copy_state_data`. The embedded hparams serve as a sanity check that we don't load the state for a different model. The `inp_tokens` stream represents the sequence of input tokens whose evaluation led to `llama_state`.

When a past session is present during model evaluation, the session tokens are used (in `examples/main`) to determine the matching prefix length between the saved session and the current prompt (and, technically, input). These tokens are skipped over using `n_past`, and regular evaluation continues from the next token onward; a sketch of this matching follows at the end of this section.

For convenience, a single `--session` arg in `examples/main` designates the file to save the session to (creating it if needed) and to load from on successive calls.
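To make the prefix reuse concrete, here is a rough sketch of the matching logic, reusing the `session_tokens` / `prompt_tokens` names from the earlier sketch (illustrative, not the PR's exact code):
```cpp
// Sketch: count how many leading tokens the saved session shares with the
// current prompt, then skip them via n_past instead of re-evaluating them.
size_t n_matching = 0;
while (n_matching < session_tokens.size() &&
       n_matching < prompt_tokens.size()  &&
       session_tokens[n_matching] == prompt_tokens[n_matching]) {
    n_matching++;
}

int n_past = (int) n_matching; // these tokens are already reflected in the restored KV cache

// Evaluation then resumes from the first non-matching token, e.g.:
// llama_eval(ctx, prompt_tokens.data() + n_past,
//            (int) prompt_tokens.size() - n_past, n_past, n_threads);
```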
Testing
For interactive sessions, I tested this with `examples/chat-13B.sh` against quantized 30B and 65B. I also tested the regular, non-session usage.

To measure performance I ran `chat-13B.sh`, modified to be non-interactive and generate only 10 tokens.
Results
Some rough timing results from my M2 running 30B on the prompt from `chat-13B.sh`:
- Before this feature: ~37s startup
- After this feature, first run: ~40s startup
- After this feature, successive runs: ~5s
Caveats
- Nothing here changes how `n_past` works, just that it can be leveraged for this prefix behavior
- `examples/main` is oriented to optimizing initial prompt evaluation time. It uses a heuristic to determine if the session should be (re-)saved, such that loading (near) identical prompts doesn't incur the seconds needed to write the session file; a sketch of what such a check might look like follows below.
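Purely as an illustration of the kind of check described above, and not the actual condition used in this PR, the re-save decision could look something like this:
```cpp
// Illustrative guess at the shape of such a heuristic: skip the (slow) session
// write when the saved session already covers the entire prompt just evaluated.
static bool should_resave_session(size_t n_matching, size_t n_prompt_tokens) {
    return n_matching < n_prompt_tokens; // some of the prompt had to be re-evaluated
}
```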