
Save and restore prompt evaluation state for much faster startup times #1169

Merged 1 commit into ggerganov:master on Apr 28, 2023

Conversation

@ejones (Collaborator) commented Apr 25, 2023

Hi! I decided to take a stab at leveraging the new get / set state APIs to cache initial prompt evaluation in examples/main. On my M2 at least, this feature lets me start up chat-13B.sh with the 65B model in seconds (after a previous run).

Overview

  • Adds llama_load_session_file and llama_save_session_file APIs to serialize the model state + a user-provided sequence of input tokens (more on that later)
  • Adds a --session arg to examples/main that designates a file to load/save the session (created on first run). Currently this is only used to speed up initial prompt evaluation, but it could eventually, e.g., restore conversations (see the sketch after this list)
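
For orientation, here is a rough sketch of how these two APIs might fit together in a caller. The signatures follow the new llama.h entries; the helper names and surrounding details are illustrative, not the PR's actual code:

#include "llama.h"

#include <vector>

// Try to restore a previous session; on success, session_tokens holds the
// tokens whose evaluation produced the restored state.
static bool try_load_session(llama_context * ctx, const char * path,
                             std::vector<llama_token> & session_tokens) {
    session_tokens.resize(llama_n_ctx(ctx));
    size_t n_loaded = 0;
    if (!llama_load_session_file(ctx, path, session_tokens.data(),
                                 session_tokens.size(), &n_loaded)) {
        session_tokens.clear();
        return false; // e.g. first run, or the file was written for a different model
    }
    session_tokens.resize(n_loaded);
    return true;
}

// Persist the current state plus the tokens that were evaluated to reach it.
static void save_session(llama_context * ctx, const char * path,
                         const std::vector<llama_token> & evaluated_tokens) {
    llama_save_session_file(ctx, path, evaluated_tokens.data(), evaluated_tokens.size());
}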

Approach

Establishes a binary session file format that prepends some additional metadata to the state returned by llama_copy_state_data.

'ggst' | <u32> 0 | <llama_hparams> | <u32> inp_token_count | <inp_token_count * llama_token> inp_tokens | <llama_state>

The embedded hparams serve as a sanity check so that we don't load state from a different model. The inp_tokens stream records the sequence of input tokens whose evaluation produced llama_state.
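
For illustration only, writing that layout could look roughly like the following. This would live inside llama.cpp, where llama_hparams and llama_token are visible; the constant and helper names here are assumptions, not the PR's actual code:

#include <cstdint>
#include <cstdio>
#include <vector>

static const char     SESSION_MAGIC[4] = {'g', 'g', 's', 't'};
static const uint32_t SESSION_VERSION  = 0;

static void write_session_file(FILE * f, const llama_hparams & hparams,
                               const std::vector<llama_token> & inp_tokens,
                               const std::vector<uint8_t> & state) { // from llama_copy_state_data
    fwrite(SESSION_MAGIC,     1, sizeof(SESSION_MAGIC),      f);
    fwrite(&SESSION_VERSION,  sizeof(SESSION_VERSION),    1, f);
    fwrite(&hparams,          sizeof(hparams),            1, f); // checked against the current model on load
    const uint32_t n_tokens = (uint32_t) inp_tokens.size();
    fwrite(&n_tokens,         sizeof(n_tokens),           1, f);
    fwrite(inp_tokens.data(), sizeof(llama_token), n_tokens, f);
    fwrite(state.data(),      1, state.size(),               f);
}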

When a past session is present during model evaluation, the session tokens are used (in examples/main) to determine the length of the matching prefix between the saved session and the current prompt (and, technically, input). The matched tokens are skipped over using n_past, and regular evaluation continues from the next token onward.
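
A minimal sketch of that prefix-matching step, using the llama_eval call available at the time; variable names loosely follow examples/main and the details are illustrative, not the PR's exact code:

#include "llama.h"

#include <vector>

// Skip the shared prefix via n_past, evaluate only the remaining prompt
// tokens, and return the updated n_past.
static int eval_with_session_prefix(llama_context * ctx,
                                    const std::vector<llama_token> & session_tokens,
                                    const std::vector<llama_token> & embd_inp,
                                    int n_threads) {
    // Length of the prefix shared by the saved session and the current input.
    size_t n_matching = 0;
    while (n_matching < session_tokens.size() && n_matching < embd_inp.size() &&
           session_tokens[n_matching] == embd_inp[n_matching]) {
        n_matching++;
    }

    // The matched prefix is already in the restored KV cache, so start past it.
    int n_past = (int) n_matching;
    for (size_t i = n_matching; i < embd_inp.size(); ++i) {
        llama_eval(ctx, &embd_inp[i], 1, n_past, n_threads);
        n_past++;
    }
    return n_past; // regular generation continues from here
}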

For convenience, a single --session arg in examples/main designates the file to save the session to (creating it if needed) and to load from on subsequent runs.

Testing

For interactive sessions, I tested this with examples/chat-13B.sh against quantized 30B and 65B:

examples/chat-13B.sh -m ~/llama-models/30B/ggml-model-q4_0.bin --session chat-session-30B.bin

I also tested the regular, non-session usage.

To measure performance I ran chat-13B.sh, modified to be non-interactive and generate only 10 tokens.

Results

Some rough timing results from my M2 running 30B on the prompt from chat-13B.sh.

Before this feature, ~37s startup:

llama_print_timings:        load time = 34743.04 ms
llama_print_timings:      sample time =    29.21 ms /    10 runs   (    2.92 ms per run)
llama_print_timings: prompt eval time = 34721.95 ms /   508 tokens (   68.35 ms per token)
llama_print_timings:        eval time =  1994.03 ms /     9 runs   (  221.56 ms per run)
llama_print_timings:       total time = 36766.42 ms

After this feature, first run, ~40s startup:

llama_print_timings:        load time = 35040.24 ms
llama_print_timings:      sample time =    29.48 ms /    10 runs   (    2.95 ms per run)
llama_print_timings: prompt eval time = 35024.65 ms /   508 tokens (   68.95 ms per token)
llama_print_timings:        eval time =  2001.31 ms /     9 runs   (  222.37 ms per run)
llama_print_timings:       total time = 39635.71 ms

After this feature, successive runs, ~5s:

llama_print_timings:        load time =  2874.73 ms
llama_print_timings:      sample time =    28.82 ms /    10 runs   (    2.88 ms per run)
llama_print_timings: prompt eval time =  2148.04 ms /    14 tokens (  153.43 ms per token)
llama_print_timings:        eval time =  1753.77 ms /     9 runs   (  194.86 ms per run)
llama_print_timings:       total time =  4657.45 ms

Caveats

  • I don't have a deep understanding of n_past, just that it can be leveraged for this prefix behavior
  • Session files are on the order of GBs and don't leverage mmap, incurring a slight delay to save/load
  • The session usage in examples/main is oriented toward optimizing initial prompt evaluation time. It uses a heuristic to decide whether the session should be (re-)saved, so that loading a (near-)identical prompt doesn't incur the seconds needed to write the session file (a sketch of one such heuristic follows this list)
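
One plausible shape for that heuristic; this is an assumption for illustration, not necessarily the exact condition used in examples/main:

#include <cstddef>

// Re-save only when the current prompt goes beyond what the saved session
// already covers; a (near-)identical prompt then skips the multi-second write.
static bool should_resave_session(size_t n_matching_session_tokens, size_t n_prompt_tokens) {
    return n_matching_session_tokens < n_prompt_tokens;
}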

@ejones changed the title from "Add saved sessions for near-instant startup on long prompts" to "Save and restore prompt evaluation state for much faster startup times" on Apr 25, 2023
@mishudark commented:

Out of curiosity, how many GB of memory are required on an M2 Mac to run the 30B model?

@ejones (Collaborator, Author) commented Apr 26, 2023

@mishudark originally I think it was 20 GB per the table in the README, but now with mmap I think it's much lower. Activity Monitor reports only a few GB for me, which I think corresponds to just the model state?

@ggerganov added the enhancement (New feature or request) and high priority (Very important issue) labels on Apr 26, 2023
@dmahurin (Contributor) commented:

Currently this is just used to speed up initial prompt evaluation, but could eventually e.g., restore conversations

Hi @ejones, good work. How much remains to restore conversations?

@ejones (Collaborator, Author) commented Apr 27, 2023

@dmahurin in terms of program logic it probably wouldn't take much; starting from the end of the session is actually simpler because you're not finding a common prefix. I think the challenge is settling on the arguments and program behavior for when and how often you save sessions (for which there's currently a slight delay) and restoring prompt vs restoring full session on startup.

@mikeggh commented Apr 27, 2023

@dmahurin in terms of program logic it probably wouldn't take much; starting from the end of the session is actually simpler because you're not finding a common prefix. I think the challenge is settling on the arguments and program behavior for when and how often you save sessions (for which there's currently a slight delay) and restoring prompt vs restoring full session on startup.

Yeah, the first time I used the session I pressed Ctrl-D expecting it to be saved, but then I realized the save happens in the logic path for the next token. I thought about other ways to force a save, such as a different control key or symbol. I'll personally be using llama-cpp-python more, since my own framework is built around it; I was really just testing this before.

Great job though! Besides that one scenario, it works great. =]

@Priestru commented:

Amazing work! Can't wait for this feature.

Do you have any idea why it affected generation speed? It feels like it shouldn't make a difference, but the speed-up in the provided example is noticeable:

llama_print_timings: eval time = 1753.77 ms / 9 runs ( 194.86 ms per run)

@ggerganov (Owner) left a comment

One possible improvement in the future is to make llama_set_state_data() and llama_copy_state_data() work with just the used KV cache, instead of always working with the full buffer. The reason is that for a prompt of 100 tokens, the KV cache will be only 100/2048 full, so there is no need to store so many zeros. This would make the session files much smaller.

llama.cpp/llama.cpp, lines 2177 to 2188 at 11d9023:

    // copy kv cache
    {
        const size_t kv_size = ctx->model.kv_self.buf.size;
        const int    kv_ntok = llama_get_kv_cache_token_count(ctx);

        memcpy(out, &kv_size, sizeof(kv_size)); out += sizeof(kv_size);
        memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);

        if (kv_size) {
            memcpy(out, ctx->model.kv_self.buf.addr, kv_size); out += kv_size;
        }
    }

llama.cpp/llama.cpp, lines 2249 to 2271 at 11d9023:

    // set kv cache
    {
        size_t kv_size;
        int    kv_ntok;

        memcpy(&kv_size, in, sizeof(kv_size)); in += sizeof(kv_size);
        memcpy(&kv_ntok, in, sizeof(kv_ntok)); in += sizeof(kv_ntok);

        if (kv_size) {
            LLAMA_ASSERT(ctx->model.kv_self.buf.size == kv_size);

            void * k_data = ctx->model.kv_self.k->data; // remember data pointers
            void * v_data = ctx->model.kv_self.v->data; // because their value is stored in buf and overwritten by memcpy

            memcpy(ctx->model.kv_self.buf.addr, in, kv_size); in += kv_size;

            ctx->model.kv_self.k->data = k_data; // restore correct data pointers
            ctx->model.kv_self.v->data = v_data;
        }

        ctx->model.kv_self.n = kv_ntok;
    }
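
A sketch of that direction, copying only the used token slots layer by layer instead of the whole buffer. This is illustrative only: n_layer, n_ctx, and n_embd are assumed to come from the hparams, and the V cache would need analogous handling with its own per-token stride:

    // copy only the used part of the K cache, layer by layer (sketch)
    {
        const int kv_ntok = llama_get_kv_cache_token_count(ctx);
        memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);

        const size_t elt_size = ggml_element_size(ctx->model.kv_self.k);

        for (int il = 0; il < n_layer; ++il) {
            // within each layer, only the first kv_ntok token slots hold data;
            // the remaining (n_ctx - kv_ntok) slots are still zero and can be skipped
            const uint8_t * k_layer = (const uint8_t *) ctx->model.kv_self.k->data
                                    + (size_t) il*n_ctx*n_embd*elt_size;
            const size_t k_used = (size_t) kv_ntok*n_embd*elt_size;
            memcpy(out, k_layer, k_used); out += k_used;
        }

        // the V cache is stored with a different per-token stride, so its used
        // slots would need strided (or ggml-view-based) copies instead
    }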

@ggerganov merged commit 1481a9c into ggerganov:master on Apr 28, 2023
@ejones (Collaborator, Author) commented Apr 29, 2023

@ggerganov that makes sense, thanks! I did notice there was a long stream of zeroes in the files.

@Priestru not sure; perhaps there's a higher cost to the initial token(s) when starting from scratch? For the timings I only generated 10 tokens, so startup costs would definitely skew the average.

@syl-00110111 commented:

Kudos for the patch! Why not an example on 7B?
