New kv_cache API insufficient to restore model state #730
Whoops, sorry, I just realized you obviously still need to re-evaluate the prompt tokens after restoring the state. Here is the working version for future reference.
#include "llama.h"
#include <vector>
#include <iostream>
using namespace std;
int main() {
// init
auto params = llama_context_default_params();
auto ctx = llama_init_from_file("../../models/ggml-model.bin", params);
auto tokens = vector<llama_token>(params.n_ctx);
auto prompt = "The quick brown fox";
auto n_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
// evaluate prompt
llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
auto last_n_tokens_size = 64;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_tokens);
auto n_past = n_tokens;
// save state
auto kv_cache_size = llama_get_kv_cache_size(ctx);
auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
auto kv_cache = llama_get_kv_cache(ctx);
auto kv_cache_copy = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
auto n_past_copy = n_past;
auto last_n_tokens_data_copy = vector<llama_token>(last_n_tokens_data);
// first run
cout << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
last_n_tokens_size,
1,
1.0,
0.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
llama_eval(ctx, &next_token, 1, n_past, 12);
n_past += 1;
}
cout << endl;
//
// restore state
llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_copy;
n_past = n_past_copy;
// call eval again on prompt tokens
llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
//
// second run
cout << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
last_n_tokens_size,
1,
1.0,
0.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
llama_eval(ctx, &next_token, 1, n_past, 12);
n_past += 1;
}
cout << endl;
//
return 0;
} |
It is correct that the PR does not implement this - but it describes that the last tokens etc. are also needed to save the full state :) I just wanted to add the missing API for implementing a prompt-saving mechanism. |
@chrfalch sorry to bug you again on this one, but I think I'm missing something. From my understanding of your response, you should be able to save the internal state to disk, assuming you also save n_past and last_n_tokens; however, I'm still not able to do this correctly, i.e. in a way that reduces processing time once the model is reloaded. |
I would now expect to get the same output based on the original prompt. Appreciate any help on this one, cheers. EDIT: And just to clarify, if I call eval after restoring the kv_cache as I did above, it doesn't seem to reduce processing time. |
I don't think you need to eval the initial prompt, because that is what you wanted to avoid. |
@ivanstepanovftw do you have an example? I've tried not eval'ing but in that case even with |
A great test is to give the model a prompt saying its name, and then the test would be to ask it "What's your name" - if it responds with the correct name everything works. Here is how I have implemented saving the "state" of the model:
This is all you need. After you have eval'ed the prompt you do the above steps and save the results. Then to restore you do the opposite - and test it all with the AI name prompt trick. |
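(For reference, a minimal sketch of the save/restore steps described above, using the kv_cache getters/setters from #685, i.e. the older interface listed later in this thread. This is an illustrative reconstruction, not the elided original snippet; n_past and last_n_tokens are the external variables mentioned in the comments.)
#include <cstdint>
#include <vector>
#include "llama.h"

struct saved_llama_state {
    std::vector<uint8_t>     kv_cache;        // raw bytes of the KV cache
    int                      kv_token_count;  // tokens currently held in the KV cache
    int                      n_past;          // external: number of tokens already evaluated
    std::vector<llama_token> last_n_tokens;   // external: window used for repetition penalty
};

saved_llama_state save_state(llama_context * ctx, int n_past,
                             const std::vector<llama_token> & last_n_tokens) {
    saved_llama_state s;
    const uint8_t * kv = llama_get_kv_cache(ctx);
    s.kv_cache.assign(kv, kv + llama_get_kv_cache_size(ctx));
    s.kv_token_count = llama_get_kv_cache_token_count(ctx);
    s.n_past         = n_past;
    s.last_n_tokens  = last_n_tokens;
    return s;
}

void restore_state(llama_context * ctx, const saved_llama_state & s,
                   int & n_past, std::vector<llama_token> & last_n_tokens) {
    llama_set_kv_cache(ctx, s.kv_cache.data(), s.kv_cache.size(), s.kv_token_count);
    n_past        = s.n_past;
    last_n_tokens = s.last_n_tokens;
}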
@ivanstepanovftw if I don't eval anything after restoring the kv_state in a new context I just get a segfault (I assume some internal buffers only get initialised on first eval). See example below:
#include <vector>
#include <iostream>
#include <chrono>
#include "llama.h"
using namespace std;
int main() {
auto seed = 42;
auto thread_count = 12;
auto last_n_tokens_size = 64;
auto prompt = "The quick brown fox";
auto model_path = "../models/ggml-alpaca.bin";
auto n_past = 0;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
// init
auto params = llama_context_default_params();
params.seed = seed;
auto ctx = llama_init_from_file(model_path, params);
auto tokens = vector<llama_token>(params.n_ctx);
auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
if (n_prompt_tokens < 1) {
cout << "Failed to tokenize prompt" << endl;
return 1;
}
// evaluate prompt
llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
n_past += n_prompt_tokens;
// save kv state, last n tokens and n_past
auto kv_cache_size = llama_get_kv_cache_size(ctx);
auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
auto kv_cache = llama_get_kv_cache(ctx);
auto kv_cache_saved = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
auto n_past_saved = n_past;
// first run
cout << endl << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl << "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl << endl;
// free old model
llama_free(ctx);
// load new model
params = llama_context_default_params();
params.seed = seed;
ctx = llama_init_from_file(model_path, params);
// restore state
llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// second run
cout << endl << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl << "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl << endl;
return 0;
}
To avoid this I also tried saving the first generated token and just evaling that with n_past = n_prompt_tokens, but no luck, I still just get random output. @chrfalch I would imagine this should be the same as what you're suggesting; the model should "know" about the evaluated prompt through the kv_state.
#include <vector>
#include <iostream>
#include <chrono>
#include "llama.h"
using namespace std;
int main() {
auto seed = 42;
auto thread_count = 12;
auto last_n_tokens_size = 64;
auto prompt = "The quick brown fox";
auto model_path = "../models/ggml-alpaca.bin";
auto n_past = 0;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
// init
auto params = llama_context_default_params();
params.seed = seed;
auto ctx = llama_init_from_file(model_path, params);
auto tokens = vector<llama_token>(params.n_ctx);
auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
if (n_prompt_tokens < 1) {
cout << "Failed to tokenize prompt" << endl;
return 1;
}
// evaluate prompt
llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
n_past += n_prompt_tokens;
// save kv state, last n tokens and n_past
auto kv_cache_size = llama_get_kv_cache_size(ctx);
auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
auto kv_cache = llama_get_kv_cache(ctx);
auto kv_cache_saved = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
auto n_past_saved = n_past;
// save first generated token
auto first_generated_token = llama_token(0);
// first run
cout << endl << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1
);
if (i == 0) {
first_generated_token = next_token;
}
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl << "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl << endl;
// free old model
llama_free(ctx);
// load new model
params = llama_context_default_params();
params.seed = seed;
ctx = llama_init_from_file(model_path, params);
// restore state
llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// restore first generated token so we can safely sample
llama_eval(
ctx,
&first_generated_token,
1,
n_past,
thread_count
);
last_n_tokens_data.push_back(first_generated_token);
n_past += 1;
// second run
cout << endl << prompt << llama_token_to_str(ctx, first_generated_token);
for (auto i = 0; i < 5; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl << "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl << endl;
return 0;
} |
I tried with something like this:
size_t llama_get_kv_length(const struct llama_context * ctx, int n_past) {
return ctx->model.hparams.n_layer * ctx->model.hparams.n_embd * n_past;
}
size_t llama_get_kv_float_size(const struct llama_context * ctx) {
return ggml_element_size(ctx->model.kv_self.k);
}
void llama_get_kv_data(const struct llama_context * ctx, int n_past, void *kout, void *vout) {
const auto & model = ctx->model;
const auto & hparams = model.hparams;
auto & kv_self = model.kv_self;
LLAMA_ASSERT(!!kv_self.ctx);
const uint64_t n_embd = hparams.n_embd;
const uint64_t n_layer = hparams.n_layer;
const uint64_t n_ctx = hparams.n_ctx;
const uint64_t nb = ggml_element_size(kv_self.k);
char buffer[4096]; // enough?
ggml_context *cpy_ctx = ggml_init({ sizeof(buffer), buffer, true });
ggml_cgraph gf{};
gf.n_threads = 1;
ggml_tensor * kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, n_past, n_layer);
kout3d->data = kout;
ggml_tensor * vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_past, n_embd, n_layer);
vout3d->data = vout;
ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, n_past, n_layer, nb*n_embd, nb*n_embd*n_ctx, 0);
ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, n_past, n_embd, n_layer, nb*n_ctx, nb*n_ctx*n_embd, 0);
ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
ggml_graph_compute(cpy_ctx, &gf);
} This will copy the data to the user's buffer, that need to have sufficient space, but also that's why more functions to query size. If the full context is needed, then it would be simpler. Also, since ggml_cpy can change data types, it may be possible to let the user extract only in f32 or f16, this code only gives whatever format is used currently by the model. |
@SlyEcho Thank you for your code, I tried it with @abetlen's example. (Sorry for the mess.) As I mentioned before, I don't know C++ well, so I'm not sure if I did it correctly. With this code the result is still not correct, but at least it consistently shows the same result. By the way, would it be possible to save and load them to a file?
#include <vector>
#include <iostream>
#include <chrono>
#include "llama.h"
#include "llama.cpp"
using namespace std;
int main() {
auto seed = 42;
auto thread_count = 12;
auto last_n_tokens_size = 64;
auto prompt = "The quick brown fox";
auto model_path = "../../ggml-vicuna-7b-4bit.bin";
auto n_past = 0;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
// init
auto params = llama_context_default_params();
params.seed = seed;
auto ctx = llama_init_from_file(model_path, params);
auto tokens = vector<llama_token>(params.n_ctx);
auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
if (n_prompt_tokens < 1) {
cout << "Failed to tokenize prompt" << endl;
return 1;
}
// evaluate prompt
llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
n_past += n_prompt_tokens;
// // save kv state, last n tokens and n_past
// auto kv_cache_size = llama_get_kv_cache_size(ctx);
// auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
// auto kv_cache = llama_get_kv_cache(ctx);
// auto kv_cache_saved = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
auto n_past_saved = n_past;
// save first generated token
auto first_generated_token = llama_token(0);
// first run
cout << endl
<< prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1);
if (i == 0) {
first_generated_token = next_token;
}
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl
<< "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl
<< endl;
// load new model
params = llama_context_default_params();
params.seed = seed;
auto ctx2 = llama_init_from_file(model_path, params);
// // restore state
// llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// Copy ctx to ctx2
/** Here!! */
// Work...
// // ctx2->model.kv_self = ctx->model.kv_self;
// ctx2->model.kv_self.k = ctx->model.kv_self.k;
// ctx2->model.kv_self.v = ctx->model.kv_self.v;
// ctx2->model.kv_self.n = ctx->model.kv_self.n;
// Work...
// memcpy(&ctx2->model.kv_self.k, &ctx->model.kv_self.k, ggml_nbytes(ctx->model.kv_self.k));
// memcpy(&ctx2->model.kv_self.v, &ctx->model.kv_self.v, ggml_nbytes(ctx->model.kv_self.v));
// memcpy(&ctx2->model.kv_self.n, &ctx->model.kv_self.n, sizeof(ctx->model.kv_self.n));
// Not work but always show same result.
const auto &model = ctx->model;
const auto &hparams = model.hparams;
auto &kv_self = model.kv_self;
LLAMA_ASSERT(!!kv_self.ctx);
const uint64_t n_embd = hparams.n_embd;
const uint64_t n_layer = hparams.n_layer;
const uint64_t n_ctx = hparams.n_ctx;
const uint64_t nb = ggml_element_size(kv_self.k);
// char buffer[4096]; // enough?
char buffer[65536]; // enough?
ggml_context *cpy_ctx = ggml_init({sizeof(buffer), buffer, true});
ggml_cgraph gf{};
gf.n_threads = 6;
ggml_tensor *kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, n_past, n_layer);
kout3d->data = ctx->model.kv_self.k;
ggml_tensor *vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_past, n_embd, n_layer);
vout3d->data = ctx->model.kv_self.k;
ggml_tensor *k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, n_past_saved, n_layer, nb * n_embd, nb * n_embd * n_ctx, 0);
ggml_tensor *v3d = ggml_view_3d(cpy_ctx, kv_self.v, n_past_saved, n_embd, n_layer, nb * n_ctx, nb * n_ctx * n_embd, 0);
ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
ggml_graph_compute(cpy_ctx, &gf);
ctx2->model.kv_self.k = kout3d;
ctx2->model.kv_self.v = vout3d;
/* Here! **/
// // free old model
// llama_free(ctx);
// restore first generated token so we can safely sample
llama_eval(
ctx2,
&first_generated_token,
1,
n_past,
thread_count);
last_n_tokens_data.push_back(first_generated_token);
n_past += 1;
// second run
cout << endl
<< prompt << llama_token_to_str(ctx2, first_generated_token);
for (auto i = 0; i < 5; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx2,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1);
auto next_token_str = llama_token_to_str(ctx2, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx2, &next_token, 1, n_past, thread_count)) {
cout << endl
<< "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl
<< endl;
return 0;
}
Result:
# Print using ctx
The quick brown fox jumps over the lazy dog
# Print using ctx2
The quick brown fox jashplaying system. |
Sure, you may serialize and deserialize your structure to a byte array. You should keep in mind that pointers need to be serialized correctly. If you are experiencing issues with memory bugs, try address sanitizers (or |
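(A minimal sketch of the point about pointers, independent of llama.cpp: write the sizes plus the pointed-to bytes, never the raw pointer values. The layout and helper names here are purely illustrative.)
#include <cstdint>
#include <cstring>
#include <vector>

// pack n_past and a token list into a flat byte array
std::vector<uint8_t> serialize_session(int32_t n_past, const std::vector<int32_t> & tokens) {
    const uint64_t count = tokens.size();
    std::vector<uint8_t> out(sizeof(n_past) + sizeof(count) + count * sizeof(int32_t));
    uint8_t * p = out.data();
    std::memcpy(p, &n_past, sizeof(n_past));  p += sizeof(n_past);
    std::memcpy(p, &count,  sizeof(count));   p += sizeof(count);
    if (count > 0) std::memcpy(p, tokens.data(), count * sizeof(int32_t));
    return out;
}

void deserialize_session(const std::vector<uint8_t> & in, int32_t & n_past, std::vector<int32_t> & tokens) {
    const uint8_t * p = in.data();
    std::memcpy(&n_past, p, sizeof(n_past));  p += sizeof(n_past);
    uint64_t count = 0;
    std::memcpy(&count, p, sizeof(count));    p += sizeof(count);
    tokens.resize(count);
    if (count > 0) std::memcpy(tokens.data(), p, count * sizeof(int32_t));
}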
I tried dumping the KV cache directly. I've seen that the following code works, but I cannot understand why. Is there any difference between using kv_self.buf and this?
#include <vector>
#include <iostream>
#include <chrono>
#include "llama.h"
#include "llama.cpp"
using namespace std;
int main() {
auto seed = 42;
auto thread_count = 12;
auto last_n_tokens_size = 64;
auto prompt = "The quick brown fox";
auto model_path = "../../ggml-vicuna-7b-4bit.bin";
auto n_past = 0;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
// init
auto params = llama_context_default_params();
params.seed = seed;
auto ctx = llama_init_from_file(model_path, params);
auto tokens = vector<llama_token>(params.n_ctx);
auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
if (n_prompt_tokens < 1) {
cout << "Failed to tokenize prompt" << endl;
return 1;
}
// evaluate prompt
llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
n_past += n_prompt_tokens;
// Save ctx->model.kv_self.k->data and ctx->model.kv_self.v->data and ctx->model.kv_self.n to file
FILE *fp_write = fopen("dump_kv.bin", "wb");
fwrite(ctx->model.kv_self.k->data, ggml_nbytes(ctx->model.kv_self.k), 1, fp_write);
fwrite(ctx->model.kv_self.v->data, ggml_nbytes(ctx->model.kv_self.v), 1, fp_write);
fwrite(&ctx->model.kv_self.n, sizeof(ctx->model.kv_self.n), 1, fp_write);
fclose(fp_write);
// save state
auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
auto n_past_saved = n_past;
// save first generated token
auto first_generated_token = llama_token(0);
// first run
cout << endl
<< prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1);
if (i == 0) {
first_generated_token = next_token;
}
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl
<< "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl
<< endl;
// free old model
llama_free(ctx);
// load new model
params = llama_context_default_params();
params.seed = seed;
auto ctx2 = llama_init_from_file(model_path, params);
// Load ctx->model.kv_self.k->data and ctx->model.kv_self.v->data and ctx->model.kv_self.n from file
FILE *fp_read = fopen("dump_kv.bin", "rb");
fread(ctx2->model.kv_self.k->data, ggml_nbytes(ctx2->model.kv_self.k), 1, fp_read);
fread(ctx2->model.kv_self.v->data, ggml_nbytes(ctx2->model.kv_self.v), 1, fp_read);
fread(&ctx2->model.kv_self.n, sizeof(ctx2->model.kv_self.n), 1, fp_read);
fclose(fp_read);
// restore state
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// restore first generated token so we can safely sample
llama_eval(
ctx2,
&first_generated_token,
1,
n_past,
thread_count);
last_n_tokens_data.push_back(first_generated_token);
n_past += 1;
// second run
cout << endl
<< prompt << llama_token_to_str(ctx2, first_generated_token);
for (auto i = 0; i < 5; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx2,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1);
auto next_token_str = llama_token_to_str(ctx2, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx2, &next_token, 1, n_past, thread_count)) {
cout << endl
<< "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl
<< endl;
return 0;
} |
Yes, Reading and writing |
@ggerganov fantastic, I can confirm that example 2 from this comment does work; however, the first example still causes a segfault. I assume that's because some buffers that are only initialised on the first eval are being accessed during sampling. |
Hmm, just looking at the code, seems like everything should be initialized. |
@ggerganov I believe the issue is that lctx.logits is still empty before the first eval. Adding a check here and running that first example I gave seems to reveal the issue: https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L1493
const int n_logits = lctx.model.hparams.n_vocab;
LLAMA_ASSERT(lctx.logits.size() > 0);
const auto & logits = lctx.logits;
const auto * plogits = logits.data() + logits.size() - n_logits; As you can see, |
Ah sorry - I forgot to mention there is now a new interface for saving / loading the llama state. I think you should try to use the new functions:
// Returns the size in bytes of the state (rng, logits, embedding and kv_cache)
LLAMA_API size_t llama_get_state_size(struct llama_context * ctx);
// Copies the state to the specified destination address.
// Destination needs to have allocated enough memory.
// Returns the number of bytes copied
LLAMA_API size_t llama_copy_state_data(struct llama_context * ctx, uint8_t * dest);
// Set the state reading from the specified address
// Returns the number of bytes read
LLAMA_API size_t llama_set_state_data(struct llama_context * ctx, const uint8_t * src);
The old interface will likely be removed at some point if the above works:
// Returns the KV cache that will contain the context for the
// ongoing prediction with the model.
LLAMA_API const uint8_t * llama_get_kv_cache(struct llama_context * ctx);
// Returns the size of the KV cache
LLAMA_API size_t llama_get_kv_cache_size(struct llama_context * ctx);
// Returns the number of tokens in the KV cache
LLAMA_API int llama_get_kv_cache_token_count(struct llama_context * ctx);
// Sets the KV cache containing the current context for the model
LLAMA_API void llama_set_kv_cache(
struct llama_context * ctx,
const uint8_t * kv_cache,
size_t n_size,
int n_token_count); |
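(A minimal in-memory sketch of how the new functions fit together; the complete file-based example appears in the next comment. ctx is assumed to be an already-initialized llama_context.)
#include <cstdint>
#include <vector>
#include "llama.h"

// capture the full state (rng, logits, embedding and kv_cache) into a byte vector
std::vector<uint8_t> save_full_state(llama_context * ctx) {
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state.data());
    return state;
}

// restore it into a context created from the same model and parameters
void load_full_state(llama_context * ctx, const std::vector<uint8_t> & state) {
    llama_set_state_data(ctx, state.data());
}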
I hope to see an example for @xaedes' PR. By the way, after adding logits to @abetlen's first example and also the second one, it seems to work. I think @chrfalch's method is right and simple, so is it possible to keep it and add logits to
#include <vector>
#include <iostream>
#include "llama.h"
#include "llama.cpp"
using namespace std;
...
/*** logits */
// float* logits_saved = llama_get_logits(ctx);
// size_t logits_size = sizeof(logits_saved);
// or
auto logits_saved = vector<float>(ctx->logits);
/* logits ***/
auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
auto n_past_saved = n_past;
// first run
...
// free old model
llama_free(ctx);
// load new model
params = llama_context_default_params();
params.seed = seed;
auto ctx = llama_init_from_file(model_path, params);
// restore state
llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
/*** logits */
// ctx->logits.clear();
// ctx->logits.insert(ctx->logits.end(), logits_saved, logits_saved + logits_size);
// or
ctx->logits = logits_saved;
/* logits ***/
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// second run
... |
@edp1096 Here is your example adapted to work with llama_copy_state_data & llama_set_state_data:
#include <vector>
#include <iostream>
#include <chrono>
#include "llama.h"
#include "llama.cpp"
using namespace std;
int main() {
auto seed = 42;
auto thread_count = 4;
auto last_n_tokens_size = 64;
auto prompt = "The quick brown fox";
auto model_path = "../../ggml-vicuna-7b-4bit.bin";
auto n_past = 0;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
// init
auto params = llama_context_default_params();
params.seed = seed;
auto ctx = llama_init_from_file(model_path, params);
auto tokens = vector<llama_token>(params.n_ctx);
auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
if (n_prompt_tokens < 1) {
cout << "Failed to tokenize prompt" << endl;
return 1;
}
// evaluate prompt
llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
n_past += n_prompt_tokens;
// Save state (rng, logits, embedding and kv_cache) to file
FILE *fp_write = fopen("dump_state.bin", "wb");
auto state_size = llama_get_state_size(ctx);
auto state_mem = new uint8_t[state_size];
llama_copy_state_data(ctx, state_mem); // could also copy directly to memory mapped file
fwrite(state_mem, 1, state_size, fp_write);
fclose(fp_write);
// save state (last tokens)
auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
auto n_past_saved = n_past;
// save first generated token
auto first_generated_token = llama_token(0);
// first run
cout << endl
<< prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1);
if (i == 0) {
first_generated_token = next_token;
}
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
cout << endl
<< "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl
<< endl;
// free old model
llama_free(ctx);
// load new model
params = llama_context_default_params();
params.seed = seed;
auto ctx2 = llama_init_from_file(model_path, params);
// Load state (rng, logits, embedding and kv_cache) from file
FILE *fp_read = fopen("dump_state.bin", "rb");
auto state_size2 = llama_get_state_size(ctx2);
if (state_size != state_size2) {
cerr << "state size differs\n";
}
fread(state_mem, 1, state_size, fp_read);
llama_set_state_data(ctx2, state_mem); // could also read directly from memory mapped file
fclose(fp_read);
// restore state (last tokens)
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// this should not be necessary with llama_copy_state_data & llama_set_state_data as they will save and restore logits.
// // restore first generated token so we can safely sample
// llama_eval(
// ctx2,
// &first_generated_token,
// 1,
// n_past,
// thread_count);
// last_n_tokens_data.push_back(first_generated_token);
// n_past += 1;
// cout << endl << prompt << llama_token_to_str(ctx2, first_generated_token);
// second run
for (auto i = 0; i < 5; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx2,
&last_n_tokens_data.back() - last_n_tokens_size,
last_n_tokens_size,
40,
1.0,
1.0,
1.1);
auto next_token_str = llama_token_to_str(ctx2, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
if (llama_eval(ctx2, &next_token, 1, n_past, thread_count)) {
cout << endl
<< "Failed to evaluate" << endl;
return 1;
}
n_past += 1;
}
cout << endl
<< endl;
return 0;
} |
It works great for me! Thank you @xaedes ! |
Thank you, works great! I'll close this issue in that case. |
Trying with that example, is it normal behavior to restore using llama_set_state_data, set a new ctx seed, and get the same output for the second run (even when changing top_k and top_p)? Or does the seed not take effect, and the initial prompt needs to be re-evaluated? |
@s2kjn93h Yes, that is the expected behaviour. The seed is used when initializing the random number generator with llama_init_from_file. The state of the random number generator is then saved by llama_copy_state_data and will be restored with llama_set_state_data, so that the sampling results remain consistent. When a different seed is set for a new context and the state is then loaded with llama_set_state_data, the random number generator will be in the state from when llama_copy_state_data was called, i.e. from the previous run.
If you want to generate other numbers after loading the llama state, i.e. to sample different tokens than the saved state would have, you can call llama_sample_top_p_top_k and discard the sampled token, as many times as you wish.
It could be beneficial to have API functions for more precise control in the future. In the meantime, we can gain greater control by directly altering the memory allocated for the llama state. The code block for the rng in llama_copy_state_data shows how that part of the state memory is laid out. Here is an example of initializing the random number generator with seed = 42 * 1337:
// get state from ctx
const size_t state_size = llama_get_state_size(ctx);
uint8_t * state_memory = new uint8_t[state_size];
llama_copy_state_data(ctx, state_memory);
// the rng we want to set in ctx
int seed = 42 * 1337;
auto rng = std::mt19937(seed);
// copy rng to state_memory (code taken from llama_copy_state_data)
#define LLAMA_MAX_RNG_STATE 64*1024
uint8_t * out = state_memory;
{
std::stringstream rng_ss;
rng_ss << rng;
const size_t rng_size = rng_ss.str().size();
char rng_buf[LLAMA_MAX_RNG_STATE];
memset(&rng_buf[0], 0, LLAMA_MAX_RNG_STATE);
memcpy(&rng_buf[0], rng_ss.str().data(), rng_ss.str().size());
memcpy(out, &rng_size, sizeof(rng_size)); out += sizeof(rng_size);
memcpy(out, &rng_buf[0], LLAMA_MAX_RNG_STATE); out += LLAMA_MAX_RNG_STATE;
}
// set our rng in the ctx by setting state from state_memory
llama_set_state_data(ctx, state_memory); |
I may be doing something wrong or misunderstanding the purpose of the kv_cache API, but I believe the recent PR #685 by @chrfalch, which added the ability to get / set the kv_cache, is still insufficient to restore the state of the model, even when resetting external model state such as last_n_tokens_data and n_past.
Here is a minimal example.
I'd expect the following output
But instead I get
Which implies the model is still generating from the end of the first run.