Save and restore prompt evaluation state for much faster startup times #1169
Conversation
Out of curiosity, how many GB of memory are required on an M2 Mac to run the 30B model?
@mishudark originally I think it was 20GB per the table in the README, but now with mmap I think it's much lower. Activity Monitor reports only a few GB for me, which I think corresponds to just the model state?
Hi @ejones, good work. How much remains to restore conversations?
@dmahurin in terms of program logic it probably wouldn't take much; starting from the end of the session is actually simpler because you're not finding a common prefix. I think the challenge is settling on the arguments and program behavior: when and how often to save sessions (for which there's currently a slight delay), and whether to restore just the prompt vs. the full session on startup.
Yeah, the first time I used the session I pressed ctrl-d expecting it to be saved, but then I realized it was inserting into the logic path for the next token. I thought about some other methods for forcing a save, such as a different ctrl key or symbol? I personally will be using llama-cpp-python more since my own framework is built around it; I was really just testing it before. Great job though! Besides that one scenario, it works great. =]
Amazing work! Can't wait for this feature. Do you have any idea why it affected generation speed?
One possible improvement in the future is to make `llama_set_state_data()` and `llama_copy_state_data()` work just with the used KV cache, instead of always working with the full buffer. The reason is that for a prompt of 100 tokens, the KV cache will be only 100 / 2048 full, so there is no need to store so many zeros. This will make the session files much smaller.
Lines 2177 to 2188 in 11d9023
```cpp
// copy kv cache
{
    const size_t kv_size = ctx->model.kv_self.buf.size;
    const int    kv_ntok = llama_get_kv_cache_token_count(ctx);

    memcpy(out, &kv_size, sizeof(kv_size)); out += sizeof(kv_size);
    memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);

    if (kv_size) {
        memcpy(out, ctx->model.kv_self.buf.addr, kv_size); out += kv_size;
    }
}
```
Lines 2249 to 2271 in 11d9023
```cpp
// set kv cache
{
    size_t kv_size;
    int    kv_ntok;

    memcpy(&kv_size, in, sizeof(kv_size)); in += sizeof(kv_size);
    memcpy(&kv_ntok, in, sizeof(kv_ntok)); in += sizeof(kv_ntok);

    if (kv_size) {
        LLAMA_ASSERT(ctx->model.kv_self.buf.size == kv_size);

        void * k_data = ctx->model.kv_self.k->data; // remember data pointers
        void * v_data = ctx->model.kv_self.v->data; // because their value is stored in buf and overwritten by memcpy

        memcpy(ctx->model.kv_self.buf.addr, in, kv_size); in += kv_size;

        ctx->model.kv_self.k->data = k_data; // restore correct data pointers
        ctx->model.kv_self.v->data = v_data;
    }

    ctx->model.kv_self.n = kv_ntok;
}
```
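To illustrate the suggestion, here is a rough sketch of what a trimmed-down copy could look like inside `llama_copy_state_data` (using the `ctx` and `out` variables from the snippet above). This is not the implemented change; it assumes the k/v tensors are each laid out as `n_layer` contiguous blocks of `n_ctx * n_embd` elements, with only the first `kv_ntok` tokens of every block populated, which would need to be verified against the real tensor layout:
```cpp
// Sketch only: copy just the used slice of each layer's KV block instead of
// the whole buffer. Layout assumption: per-layer blocks of n_ctx * n_embd
// elements, of which only the first kv_ntok tokens are in use.
{
    const auto & hparams = ctx->model.hparams;
    const int    n_layer = hparams.n_layer;
    const int    n_ctx   = hparams.n_ctx;
    const int    n_embd  = hparams.n_embd;

    const int    kv_ntok  = llama_get_kv_cache_token_count(ctx);
    const size_t elt_size = ggml_element_size(ctx->model.kv_self.k);

    memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);

    for (int il = 0; il < n_layer; ++il) {
        const size_t layer_offset = (size_t) il * n_ctx * n_embd * elt_size; // start of this layer's block
        const size_t used_bytes   = (size_t) kv_ntok * n_embd * elt_size;    // bytes actually populated

        memcpy(out, (const char *) ctx->model.kv_self.k->data + layer_offset, used_bytes); out += used_bytes;
        memcpy(out, (const char *) ctx->model.kv_self.v->data + layer_offset, used_bytes); out += used_bytes;
    }
}
```
The restore side would mirror this, copying each layer's used slice back into place and zeroing (or ignoring) the remainder of the cache.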
@ggerganov that makes sense, thanks! I did notice there was a long stream of zeroes in the files. @Priestru not sure, perhaps there's a higher cost to the initial token(s) when starting from scratch? For the timings I only generated 10 tokens so startup costs there would definitely impact the average.
Kudos for the patch. Why not an example on 7B?
Hi! I decided to take a stab at leveraging the new get / set state APIs to cache initial prompt evaluation in `main`. On my M2 at least, this feature lets me start up `chat-13B.sh` with 65B in seconds (after having run before).

Overview
- `llama_load_session_file` and `llama_save_session_file` APIs to serialize the model state + a user-provided sequence of input tokens (more on that later)
- a `--session` arg to `examples/main` that designates a file to load/save the session (creating it on first run). Currently this is just used to speed up initial prompt evaluation, but it could eventually, e.g., restore conversations; see the usage sketch after this list.
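A minimal sketch of how the new session APIs could be called from application code. The signatures shown follow how `llama_load_session_file` / `llama_save_session_file` later appear in llama.h and may differ slightly from this PR's initial version; the file name and wrapper function are hypothetical:
```cpp
// Sketch only: restore a saved session if present, and save the evaluated
// tokens plus model state afterwards.
#include "llama.h"
#include <vector>

void run_with_session(llama_context * ctx, const std::vector<llama_token> & prompt_tokens) {
    const char * path_session = "prompt.session"; // hypothetical session file

    // Attempt to restore a previous session; on success, session_tokens holds
    // the token sequence whose evaluation produced the saved state.
    std::vector<llama_token> session_tokens(llama_n_ctx(ctx));
    size_t n_session_tokens = 0;
    if (llama_load_session_file(ctx, path_session, session_tokens.data(),
                                session_tokens.size(), &n_session_tokens)) {
        session_tokens.resize(n_session_tokens);
    } else {
        session_tokens.clear(); // first run: nothing to reuse
    }

    // ... evaluate the prompt, skipping the matched prefix (see Approach below) ...

    // Persist the state together with the tokens that produced it.
    llama_save_session_file(ctx, path_session, prompt_tokens.data(), prompt_tokens.size());
}
```
From the command line, the equivalent flow would be something like `./main -m <model> -f <prompt file> --session prompt.session` (paths hypothetical), run once to create the session and again to reuse it.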
Approach
Establishes a binary session file format that prepends some additional metadata to the state returned by `llama_copy_state_data`. The embedded hparams serve as a sanity check that we don't load the state for a different model. The `inp_tokens` stream represents the sequence of input tokens whose evaluation led to `llama_state`.

When a past session is present during model evaluation, the session tokens are used (in `examples/main`) to determine the matching prefix length between the saved session and the current prompt (and, technically, input). These tokens are skipped over using `n_past`, and regular evaluation continues from the next token onward; a sketch of this matching follows at the end of this section.

For convenience, a single `--session` arg in `examples/main` designates the file to save the session to (creating it if needed) and to load from on successive calls.
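To make the prefix reuse concrete, here is a rough sketch of the matching logic, reusing the `session_tokens` / `prompt_tokens` names from the earlier sketch (illustrative, not the PR's exact code):
```cpp
// Sketch: count how many leading tokens the saved session shares with the
// current prompt, then skip them via n_past instead of re-evaluating them.
size_t n_matching = 0;
while (n_matching < session_tokens.size() &&
       n_matching < prompt_tokens.size()  &&
       session_tokens[n_matching] == prompt_tokens[n_matching]) {
    n_matching++;
}

int n_past = (int) n_matching; // these tokens are already reflected in the restored KV cache

// Evaluation then resumes from the first non-matching token, e.g.:
// llama_eval(ctx, prompt_tokens.data() + n_past,
//            (int) prompt_tokens.size() - n_past, n_past, n_threads);
```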
Testing
For interactive sessions, I tested this with `examples/chat-13B.sh` against quantized 30B and 65B. I also tested the regular, non-session usage.

To measure performance I ran `chat-13B.sh`, modified to be non-interactive and generate only 10 tokens.
Results
Some rough timing results from my M2 running 30B on the prompt from `chat-13B.sh`:
- Before this feature: ~37s startup
- After this feature, first run: ~40s startup
- After this feature, successive runs: ~5s
Caveats
- Nothing here changes how `n_past` works, just that it can be leveraged for this prefix behavior
- `examples/main` is oriented to optimizing initial prompt evaluation time. It uses a heuristic to determine if the session should be (re-)saved, such that loading (near) identical prompts doesn't incur the seconds needed to write the session file; a sketch of what such a check might look like follows below.
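Purely as an illustration of the kind of check described above, and not the actual condition used in this PR, the re-save decision could look something like this:
```cpp
// Illustrative guess at the shape of such a heuristic: skip the (slow) session
// write when the saved session already covers the entire prompt just evaluated.
static bool should_resave_session(size_t n_matching, size_t n_prompt_tokens) {
    return n_matching < n_prompt_tokens; // some of the prompt had to be re-evaluated
}
```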