New kv_cache API insufficient to restore model state #730

Closed
abetlen opened this issue Apr 3, 2023 · 23 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed high priority Very important issue

Comments

@abetlen
Collaborator

abetlen commented Apr 3, 2023

I may be doing something wrong or misunderstanding the purpose of the kv_cache API, but I believe the recent PR #685 by @chrfalch, which added the ability to get / set the kv_cache, is still insufficient to restore the state of the model, even when resetting external model state such as last_n_tokens_data and n_past.

Here is a minimal example

#include "llama.h"
#include <vector>
#include <iostream>

using namespace std;

int main() {
    // init
    auto params = llama_context_default_params();
    auto ctx = llama_init_from_file("../../models/ggml-model.bin", params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto prompt = "The quick brown fox";
    auto n_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    // evaluate prompt
    llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
    auto last_n_tokens_size = 64;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_tokens);
    auto n_past = n_tokens;

    // save state
    auto kv_cache_size = llama_get_kv_cache_size(ctx);
    auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
    auto kv_cache = llama_get_kv_cache(ctx);
    auto kv_cache_copy = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
    auto n_past_copy = n_past;
    auto last_n_tokens_data_copy = vector<llama_token>(last_n_tokens_data);
    
    // first run
    cout << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
            last_n_tokens_size,
            1,
            1.0,
            0.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        llama_eval(ctx, &next_token, 1, n_past, 12);
        n_past += 1;
    }
    cout << endl;
    //

    // restore state
    llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
    last_n_tokens_data = last_n_tokens_data_copy;
    n_past = n_past_copy;
    //

    // second run
    cout << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
            last_n_tokens_size,
            1,
            1.0,
            0.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        llama_eval(ctx, &next_token, 1, n_past, 12);
        n_past += 1;
    }
    cout << endl;
    //
    return 0;
}

I'd expect the following output

The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog

But instead I get

The quick brown fox jumps over the lazy dog
The quick brown fox.
The quick brown fo

Which implies the model is still generating from the end of the first run.

@abetlen
Collaborator Author

abetlen commented Apr 3, 2023

Whoops, sorry, I just realized you obviously still need to eval the prompt tokens again.

Here's the working version for future reference.

#include "llama.h"
#include <vector>
#include <iostream>

using namespace std;

int main() {
    // init
    auto params = llama_context_default_params();
    auto ctx = llama_init_from_file("../../models/ggml-model.bin", params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto prompt = "The quick brown fox";
    auto n_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    // evaluate prompt
    llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
    auto last_n_tokens_size = 64;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_tokens);
    auto n_past = n_tokens;

    // save state
    auto kv_cache_size = llama_get_kv_cache_size(ctx);
    auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
    auto kv_cache = llama_get_kv_cache(ctx);
    auto kv_cache_copy = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
    auto n_past_copy = n_past;
    auto last_n_tokens_data_copy = vector<llama_token>(last_n_tokens_data);
    
    // first run
    cout << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
            last_n_tokens_size,
            1,
            1.0,
            0.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        llama_eval(ctx, &next_token, 1, n_past, 12);
        n_past += 1;
    }
    cout << endl;
    //

    // restore state
    llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
    last_n_tokens_data = last_n_tokens_data_copy;
    n_past = n_past_copy;
    // call eval again on prompt tokens
    llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
    //

    // second run
    cout << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
            last_n_tokens_size,
            1,
            1.0,
            0.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        llama_eval(ctx, &next_token, 1, n_past, 12);
        n_past += 1;
    }
    cout << endl;
    //
    return 0;
}

@abetlen abetlen closed this as completed Apr 3, 2023
@chrfalch
Contributor

chrfalch commented Apr 3, 2023

It is correct that the PR does not implement this - but it describes that the last tokens etc. are needed to save the full state :) I just wanted to add the missing API for implementing a prompt-saving mechanism.

@abetlen
Collaborator Author

abetlen commented Apr 11, 2023

@chrfalch sorry to bug you again on this one but I think I'm missing something.

From my understanding of your response, you should be able to save the internal state to disk as long as you also save n_past and last_n_tokens. However, I'm still not able to do this correctly / in a way that reduces processing time once the model is reloaded.

  1. llama_init_from_file the model.
  2. Tokenize the initial prompt (e.g. the quick brown fox jumps)
  3. llama_eval the prompt and set n_past to the number of prompt tokens, and start to fill last_n_tokens_data with the prompt tokens.
  4. Save kv_cache, kv_cache_size, kv_cache_token_count, n_past and last_n_tokens.
  5. Generate some tokens in a llama_sample_top_p_top_k / llama_eval loop (e.g over the lazy dog)
  6. llama_free the context
  7. Reload via llama_init_from_file
  8. Restore the state via llama_set_kv_cache with the saved values from above.
  9. Restore n_past and last_n_tokens to the saved value.
  10. Generate some tokens in a llama_sample_top_p_top_k / llama_eval loop (e.g over the lazy dog) starting from the saved value of n_past and last_n_tokens

I would now expect to get the same output based on the original prompt (e.g. over the lazy dog), but it seems the model is not taking it into account and I instead get back a random response.

Appreciate any help on this one, cheers.

EDIT: And just to clarify, if I call eval after restoring the kv_cache as I did above, it doesn't seem to reduce processing time.

@ivanstepanovftw
Collaborator

I don't think you need to eval the initial prompt, because that is exactly what you wanted to avoid.

@abetlen
Collaborator Author

abetlen commented Apr 13, 2023

@ivanstepanovftw do you have an example? I've tried not eval'ing, but in that case, even with n_past saved, the model fails to generate the same output (just random generation).

@chrfalch
Contributor

A great test is to give the model a prompt stating its name, and then ask it "What's your name?" - if it responds with the correct name, everything works.

Here is how I have implemented saving the "state" of the model:

  1. Save kv_cache size by calling llama_get_kv_cache_size
  2. Get memory in kv_cache by calling llama_get_kv_cache
  3. Save token count by calling llama_get_kv_cache_token_count
  4. Save n_past
  5. Save last_n_tokens and its size

This is all you need. After you have eval'ed the prompt, you do the above steps and save the results.

Then to restore, you do the opposite - and test it all with the AI-name prompt trick. A rough sketch of this flow is shown below.
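A rough sketch of that save/restore flow using the kv_cache API from PR #685 - the struct and helper names are my own assumptions, not part of the PR, and the code is an untested outline:

#include "llama.h"
#include <cstddef>
#include <cstdint>
#include <vector>

// everything that needs to be captured, per the steps above
struct saved_state {
    std::vector<uint8_t>     kv_cache;             // steps 1 + 2
    int                      kv_cache_token_count; // step 3
    int                      n_past;               // step 4
    std::vector<llama_token> last_n_tokens;        // step 5
};

saved_state save_state(llama_context * ctx, int n_past,
                       const std::vector<llama_token> & last_n_tokens) {
    saved_state s;
    const size_t    kv_size = llama_get_kv_cache_size(ctx);
    const uint8_t * kv      = llama_get_kv_cache(ctx);
    s.kv_cache.assign(kv, kv + kv_size);
    s.kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
    s.n_past        = n_past;
    s.last_n_tokens = last_n_tokens;
    return s;
}

void restore_state(llama_context * ctx, const saved_state & s,
                   int & n_past, std::vector<llama_token> & last_n_tokens) {
    llama_set_kv_cache(ctx, s.kv_cache.data(), s.kv_cache.size(), s.kv_cache_token_count);
    n_past        = s.n_past;
    last_n_tokens = s.last_n_tokens;
}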

@abetlen
Collaborator Author

abetlen commented Apr 13, 2023

@ivanstepanovftw if I don't eval anything after restoring the kv_state in a new context, I just get a segfault (I assume some internal buffers only get initialised on the first eval). See the example below:

#include <vector>
#include <iostream>
#include <chrono>

#include "llama.h"

using namespace std;

int main() {
    auto seed = 42;
    auto thread_count = 12;
    auto last_n_tokens_size = 64;
    auto prompt = "The quick brown fox";
    auto model_path = "../models/ggml-alpaca.bin";

    auto n_past = 0;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);

    // init
    auto params = llama_context_default_params();
    params.seed = seed;
    auto ctx = llama_init_from_file(model_path, params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        cout << "Failed to tokenize prompt" << endl;
        return 1;
    }

    // evaluate prompt

    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    // save kv state, last n tokens and n_past
    auto kv_cache_size = llama_get_kv_cache_size(ctx);
    auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
    auto kv_cache = llama_get_kv_cache(ctx);
    auto kv_cache_saved = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);

    auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
    auto n_past_saved = n_past;

    // first run
    cout << endl << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl << endl;

    // free old model
    llama_free(ctx);

    // load new model
    params = llama_context_default_params();
    params.seed = seed;
    ctx = llama_init_from_file(model_path, params);

    // restore state
    llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // second run
    cout << endl << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl << endl;
    return 0;
}

To avoid this I also tried saving the first generated token and just eval'ing that with n_past = n_prompt_tokens, but no luck - I still just get random output.

@chrfalch I would imagine this should be the same as what you're suggesting; the model should "know" about the evaluated prompt through the kv_state.

#include <vector>
#include <iostream>
#include <chrono>

#include "llama.h"

using namespace std;

int main() {
    auto seed = 42;
    auto thread_count = 12;
    auto last_n_tokens_size = 64;
    auto prompt = "The quick brown fox";
    auto model_path = "../models/ggml-alpaca.bin";

    auto n_past = 0;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);

    // init
    auto params = llama_context_default_params();
    params.seed = seed;
    auto ctx = llama_init_from_file(model_path, params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        cout << "Failed to tokenize prompt" << endl;
        return 1;
    }

    // evaluate prompt

    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    // save kv state, last n tokens and n_past
    auto kv_cache_size = llama_get_kv_cache_size(ctx);
    auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
    auto kv_cache = llama_get_kv_cache(ctx);
    auto kv_cache_saved = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);

    auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
    auto n_past_saved = n_past;

    // save first generated token
    auto first_generated_token = llama_token(0);

    // first run
    cout << endl << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1
        );
        if (i == 0) {
            first_generated_token = next_token;
        }
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl << endl;

    // free old model
    llama_free(ctx);

    // load new model
    params = llama_context_default_params();
    params.seed = seed;
    ctx = llama_init_from_file(model_path, params);

    // restore state
    llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // restore first generated token so we can safely sample
    llama_eval(
        ctx,
        &first_generated_token,
        1,
        n_past,
        thread_count
    );
    last_n_tokens_data.push_back(first_generated_token);
    n_past += 1;

    // second run
    cout << endl << prompt << llama_token_to_str(ctx, first_generated_token);
    for (auto i = 0; i < 5; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1
        );
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl << endl;
    return 0;
}

@ggerganov ggerganov reopened this Apr 19, 2023
@ggerganov ggerganov added bug Something isn't working help wanted Extra attention is needed good first issue Good for newcomers high priority Very important issue labels Apr 19, 2023
@SlyEcho
Collaborator

SlyEcho commented Apr 19, 2023

I tried with something like this:

size_t llama_get_kv_length(const struct llama_context * ctx, int n_past) {
    return ctx->model.hparams.n_layer * ctx->model.hparams.n_embd * n_past;
}

size_t llama_get_kv_float_size(const struct llama_context * ctx) {
    return ggml_element_size(ctx->model.kv_self.k);
}

void llama_get_kv_data(const struct llama_context * ctx, int n_past, void *kout, void *vout) {
    const auto & model   = ctx->model;
    const auto & hparams = model.hparams;

    auto & kv_self = model.kv_self;

    LLAMA_ASSERT(!!kv_self.ctx);

    const uint64_t n_embd  = hparams.n_embd;
    const uint64_t n_layer = hparams.n_layer;
    const uint64_t n_ctx   = hparams.n_ctx;

    const uint64_t nb = ggml_element_size(kv_self.k);

    char buffer[4096]; // enough?
    ggml_context *cpy_ctx = ggml_init({ sizeof(buffer), buffer, true });
    ggml_cgraph gf{};
    gf.n_threads = 1;


    ggml_tensor * kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, n_past, n_layer);
    kout3d->data = kout;

    ggml_tensor * vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_past, n_embd, n_layer);
    vout3d->data = vout;

    ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, n_past, n_layer, nb*n_embd, nb*n_embd*n_ctx, 0);
    ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, n_past, n_embd, n_layer, nb*n_ctx, nb*n_ctx*n_embd, 0);

    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
    ggml_graph_compute(cpy_ctx, &gf);
}

This will copy the data to the user's buffer, which needs to have sufficient space - that's also why there are additional functions to query the size. If the full context is needed, it would be simpler. Also, since ggml_cpy can convert data types, it may be possible to let the user extract the data as f32 or f16; this code only gives whatever format the model currently uses.
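For context, this is how a caller might use those proposed functions to size and fill its own buffers. The function names are only the ones sketched above - they are not part of llama.h - so treat this as an untested illustration:

#include <cstdint>
#include <vector>
#include "llama.h"

// allocate caller-owned buffers and let the proposed API fill them;
// llama_get_kv_length / llama_get_kv_float_size / llama_get_kv_data are the
// functions sketched above, not an existing llama.h API
void extract_kv(struct llama_context * ctx, int n_past,
                std::vector<uint8_t> & k_out, std::vector<uint8_t> & v_out) {
    const size_t n_elems   = llama_get_kv_length(ctx, n_past); // elements per tensor
    const size_t elem_size = llama_get_kv_float_size(ctx);     // bytes per element

    k_out.resize(n_elems * elem_size);
    v_out.resize(n_elems * elem_size);

    // copies the K and V data for the first n_past positions into the buffers
    llama_get_kv_data(ctx, n_past, k_out.data(), v_out.data());
}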

@edp1096
Contributor

edp1096 commented Apr 20, 2023

@SlyEcho Thank you for your code, I tried it with @abetlen's example. (Sorry for the mess.)

As I mentioned before, I don't know C++ well, so I'm not sure if I did it correctly.

With this code the result is still not correct, but at least it produces the same result every time.

By the way, would it be possible to save and load them to a file?

Code

#include <vector>
#include <iostream>
#include <chrono>

#include "llama.h"
#include "llama.cpp"

using namespace std;

int main() {
    auto seed = 42;
    auto thread_count = 12;
    auto last_n_tokens_size = 64;
    auto prompt = "The quick brown fox";
    auto model_path = "../../ggml-vicuna-7b-4bit.bin";

    auto n_past = 0;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);

    // init
    auto params = llama_context_default_params();
    params.seed = seed;
    auto ctx = llama_init_from_file(model_path, params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        cout << "Failed to tokenize prompt" << endl;
        return 1;
    }

    // evaluate prompt

    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    // // save kv state, last n tokens and n_past
    // auto kv_cache_size = llama_get_kv_cache_size(ctx);
    // auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
    // auto kv_cache = llama_get_kv_cache(ctx);
    // auto kv_cache_saved = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);

    auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
    auto n_past_saved = n_past;

    // save first generated token
    auto first_generated_token = llama_token(0);

    // first run
    cout << endl
         << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1);
        if (i == 0) {
            first_generated_token = next_token;
        }
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl
                 << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl
         << endl;

    // load new model
    params = llama_context_default_params();
    params.seed = seed;
    auto ctx2 = llama_init_from_file(model_path, params);

    // // restore state
    // llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // Copy ctx to ctx2



    /** Here!! */

    // Work...
    // // ctx2->model.kv_self = ctx->model.kv_self;

    // ctx2->model.kv_self.k = ctx->model.kv_self.k;
    // ctx2->model.kv_self.v = ctx->model.kv_self.v;
    // ctx2->model.kv_self.n = ctx->model.kv_self.n;

    // Work...
    // memcpy(&ctx2->model.kv_self.k, &ctx->model.kv_self.k, ggml_nbytes(ctx->model.kv_self.k));
    // memcpy(&ctx2->model.kv_self.v, &ctx->model.kv_self.v, ggml_nbytes(ctx->model.kv_self.v));
    // memcpy(&ctx2->model.kv_self.n, &ctx->model.kv_self.n, sizeof(ctx->model.kv_self.n));

    // Not work but always show same result.
    const auto &model = ctx->model;
    const auto &hparams = model.hparams;

    auto &kv_self = model.kv_self;

    LLAMA_ASSERT(!!kv_self.ctx);

    const uint64_t n_embd = hparams.n_embd;
    const uint64_t n_layer = hparams.n_layer;
    const uint64_t n_ctx = hparams.n_ctx;

    const uint64_t nb = ggml_element_size(kv_self.k);

    // char buffer[4096];  // enough?
    char buffer[65536];  // enough?
    ggml_context *cpy_ctx = ggml_init({sizeof(buffer), buffer, true});
    ggml_cgraph gf{};
    gf.n_threads = 6;

    ggml_tensor *kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, n_past, n_layer);
    kout3d->data = ctx->model.kv_self.k;

    ggml_tensor *vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_past, n_embd, n_layer);
    vout3d->data = ctx->model.kv_self.k;

    ggml_tensor *k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, n_past_saved, n_layer, nb * n_embd, nb * n_embd * n_ctx, 0);
    ggml_tensor *v3d = ggml_view_3d(cpy_ctx, kv_self.v, n_past_saved, n_embd, n_layer, nb * n_ctx, nb * n_ctx * n_embd, 0);

    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
    ggml_graph_compute(cpy_ctx, &gf);

    ctx2->model.kv_self.k = kout3d;
    ctx2->model.kv_self.v = vout3d;

    /* Here! **/



    // // free old model
    // llama_free(ctx);

    // restore first generated token so we can safely sample
    llama_eval(
        ctx2,
        &first_generated_token,
        1,
        n_past,
        thread_count);
    last_n_tokens_data.push_back(first_generated_token);
    n_past += 1;

    // second run
    cout << endl
         << prompt << llama_token_to_str(ctx2, first_generated_token);
    for (auto i = 0; i < 5; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx2,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1);
        auto next_token_str = llama_token_to_str(ctx2, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx2, &next_token, 1, n_past, thread_count)) {
            cout << endl
                 << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl
         << endl;
    return 0;
}

Result

# Print using ctx
The quick brown fox jumps over the lazy dog

# Print using ctx2
The quick brown fox jashplaying system.

@ivanstepanovftw
Collaborator

By the way, would it be possible to save and load them to a file? - @edp1096

Sure, you can serialize and deserialize your structure to a byte array. Keep in mind that pointers need to be serialized correctly. If you are experiencing memory bugs, try AddressSanitizer (or valgrind --tool=memcheck, but it is slower).
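As an illustration, a minimal sketch of writing the pieces discussed in this thread (the kv_cache bytes, token count, n_past and last_n_tokens) to a file - the file name and layout are arbitrary assumptions, and loading simply mirrors the same order with fread:

#include <stdio.h>
#include <stdint.h>
#include <vector>
#include "llama.h"

// writes the raw data plus enough metadata to rebuild the vectors on load;
// only the pointed-to bytes are written, never the pointers themselves
bool save_to_file(const char * path,
                  const std::vector<uint8_t> & kv_cache, int kv_token_count,
                  int n_past, const std::vector<llama_token> & last_n_tokens) {
    FILE * f = fopen(path, "wb");
    if (!f) return false;

    const uint64_t kv_size = kv_cache.size();
    const uint64_t n_last  = last_n_tokens.size();

    fwrite(&kv_size,        sizeof(kv_size),        1, f);
    fwrite(kv_cache.data(), 1,                      kv_size, f);
    fwrite(&kv_token_count, sizeof(kv_token_count), 1, f);
    fwrite(&n_past,         sizeof(n_past),         1, f);
    fwrite(&n_last,         sizeof(n_last),         1, f);
    fwrite(last_n_tokens.data(), sizeof(llama_token), n_last, f);

    fclose(f);
    return true;
}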

@edp1096
Contributor

edp1096 commented Apr 20, 2023

I tried dumping the KV data directly.

I've seen that the following code works, but I cannot understand why.

Is there any difference between using kv_self.buf and this?

#include <vector>
#include <iostream>
#include <chrono>

#include "llama.h"
#include "llama.cpp"

using namespace std;

int main() {
    auto seed = 42;
    auto thread_count = 12;
    auto last_n_tokens_size = 64;
    auto prompt = "The quick brown fox";
    auto model_path = "../../ggml-vicuna-7b-4bit.bin";

    auto n_past = 0;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);

    // init
    auto params = llama_context_default_params();
    params.seed = seed;
    auto ctx = llama_init_from_file(model_path, params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        cout << "Failed to tokenize prompt" << endl;
        return 1;
    }

    // evaluate prompt

    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    // Save ctx->model.kv_self.k->data and ctx->model.kv_self.v->data and ctx->model.kv_self.n to file
    FILE *fp_write = fopen("dump_kv.bin", "wb");
    fwrite(ctx->model.kv_self.k->data, ggml_nbytes(ctx->model.kv_self.k), 1, fp_write);
    fwrite(ctx->model.kv_self.v->data, ggml_nbytes(ctx->model.kv_self.v), 1, fp_write);
    fwrite(&ctx->model.kv_self.n, sizeof(ctx->model.kv_self.n), 1, fp_write);
    fclose(fp_write);

    // save state
    auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
    auto n_past_saved = n_past;

    // save first generated token
    auto first_generated_token = llama_token(0);

    // first run
    cout << endl
         << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1);
        if (i == 0) {
            first_generated_token = next_token;
        }
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl
                 << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl
         << endl;

    // free old model
    llama_free(ctx);

    // load new model
    params = llama_context_default_params();
    params.seed = seed;
    auto ctx2 = llama_init_from_file(model_path, params);

    // Load ctx->model.kv_self.k->data and ctx->model.kv_self.v->data and ctx->model.kv_self.n from file
    FILE *fp_read = fopen("dump_kv.bin", "rb");
    fread(ctx2->model.kv_self.k->data, ggml_nbytes(ctx2->model.kv_self.k), 1, fp_read);
    fread(ctx2->model.kv_self.v->data, ggml_nbytes(ctx2->model.kv_self.v), 1, fp_read);
    fread(&ctx2->model.kv_self.n, sizeof(ctx2->model.kv_self.n), 1, fp_read);
    fclose(fp_read);

    // restore state
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // restore first generated token so we can safely sample
    llama_eval(
        ctx2,
        &first_generated_token,
        1,
        n_past,
        thread_count);
    last_n_tokens_data.push_back(first_generated_token);
    n_past += 1;

    // second run
    cout << endl
         << prompt << llama_token_to_str(ctx2, first_generated_token);
    for (auto i = 0; i < 5; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx2,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1);
        auto next_token_str = llama_token_to_str(ctx2, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx2, &next_token, 1, n_past, thread_count)) {
            cout << endl
                 << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl
         << endl;
    return 0;
}

@SlyEcho
Collaborator

SlyEcho commented Apr 20, 2023

Is there any difference between using kv_self.buf and this?

Yes, kv_self.buf contains more stuff than just the tensors.

Reading and writing kv_self.k/v.data as binary will work as long as the context length and KV floating-point type are exactly the same. If this were exposed from llama.h, it would be sufficient for most applications, IMHO.
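To make that constraint explicit, one option is to prefix the dump with the parameters it depends on and reject the file at load time if they differ. A sketch, assuming llama.cpp is included so the context internals are visible, as in the examples above:

#include <stdio.h>
#include <stdint.h>

// dump K/V with a small header so a mismatched file is rejected instead of
// silently producing garbage; assumes access to ctx->model.kv_self as above
bool dump_kv(struct llama_context * ctx, const char * path) {
    FILE * f = fopen(path, "wb");
    if (!f) return false;

    const uint32_t n_ctx     = (uint32_t) ctx->model.hparams.n_ctx;
    const uint32_t elem_size = (uint32_t) ggml_element_size(ctx->model.kv_self.k);

    fwrite(&n_ctx,     sizeof(n_ctx),     1, f);
    fwrite(&elem_size, sizeof(elem_size), 1, f);
    fwrite(ctx->model.kv_self.k->data, 1, ggml_nbytes(ctx->model.kv_self.k), f);
    fwrite(ctx->model.kv_self.v->data, 1, ggml_nbytes(ctx->model.kv_self.v), f);
    fwrite(&ctx->model.kv_self.n, sizeof(ctx->model.kv_self.n), 1, f);
    fclose(f);
    return true;
}

The loader would read the two header fields first and bail out if they don't match the current context before fread'ing the tensor data.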

@ggerganov
Member

@abetlen and all

I think this commit should fix the issue: 8687c1f

@abetlen
Collaborator Author

abetlen commented Apr 21, 2023

@ggerganov fantastic, I can confirm that example 2 from this comment does work; however, the first example still causes a segfault. I assume that's because sampling accesses some buffers that are only initialised on the first eval.

@ggerganov
Member

@ggerganov fantastic, I can confirm that example 2 from this comment does work; however, the first example still causes a segfault. I assume that's because sampling accesses some buffers that are only initialised on the first eval.

Hmm, just looking at the code, it seems like everything should be initialized.
I will take a deeper look later if this problem remains unsolved.

@abetlen
Collaborator Author

abetlen commented Apr 22, 2023

@ggerganov I believe the issue is that llama_sample_top_p_top_k expects the logits, but they're not being saved and restored with this kv_cache approach.

Adding a check here and running the first example I gave seems to reveal the issue: https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L1493

    const int n_logits = lctx.model.hparams.n_vocab;

    LLAMA_ASSERT(lctx.logits.size() > 0);
    const auto & logits = lctx.logits;
    const auto * plogits = logits.data() + logits.size() - n_logits;

As you can see, n_logits is just going to be the vocab size, while logits will be a size-0 vector, causing an illegal memory access and the resulting segfault.

@ggerganov
Member

Ah sorry - I forgot to mention there is now a new interface for saving / loading the llama state:

#1105

I think you should try to use the new functions:

    // Returns the size in bytes of the state (rng, logits, embedding and kv_cache)
    LLAMA_API size_t llama_get_state_size(struct llama_context * ctx);

    // Copies the state to the specified destination address.
    // Destination needs to have allocated enough memory.
    // Returns the number of bytes copied
    LLAMA_API size_t llama_copy_state_data(struct llama_context * ctx, uint8_t * dest);

    // Set the state reading from the specified address
    // Returns the number of bytes read
    LLAMA_API size_t llama_set_state_data(struct llama_context * ctx, const uint8_t * src);
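For example, a minimal round trip with these functions might look like the sketch below - just an outline with no error handling (a full file-based example appears later in this thread):

#include <cstdint>
#include <vector>
#include "llama.h"

// copy the full state (rng, logits, embedding and kv_cache) out of one
// context and into another; minimal outline, no error handling
void clone_state(struct llama_context * src, struct llama_context * dst) {
    std::vector<uint8_t> state(llama_get_state_size(src));
    llama_copy_state_data(src, state.data());
    llama_set_state_data(dst, state.data());
}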

The old interface will likely be removed at some point if the above works:

    // Returns the KV cache that will contain the context for the
    // ongoing prediction with the model.
    LLAMA_API const uint8_t * llama_get_kv_cache(struct llama_context * ctx);
    // Returns the size of the KV cache
    LLAMA_API size_t llama_get_kv_cache_size(struct llama_context * ctx);
    // Returns the number of tokens in the KV cache
    LLAMA_API int llama_get_kv_cache_token_count(struct llama_context * ctx);
    // Sets the KV cache containing the current context for the model
    LLAMA_API void llama_set_kv_cache(
            struct llama_context * ctx,
                   const uint8_t * kv_cache,
                          size_t   n_size,
                             int   n_token_count);

@edp1096
Contributor

edp1096 commented Apr 22, 2023

I hope to see an example for @xaedes' PR.

By the way, after adding the logits to @abetlen's first example (and also the second), it seems to work.

I think @chrfalch's method is right and simple, so would it be possible to keep it and add the logits to llama_get_kv_cache and llama_set_kv_cache?

#include <vector>
#include <iostream>

#include "llama.h"
#include "llama.cpp"

using namespace std;
...
    /*** logits */
    // float* logits_saved = llama_get_logits(ctx);
    // size_t logits_size = sizeof(logits_saved);

    // or

    auto logits_saved = vector<float>(ctx->logits);
    /* logits ***/
    auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
    auto n_past_saved = n_past;

    // first run
...
    // free old model
    llama_free(ctx);

    // load new model
    params = llama_context_default_params();
    params.seed = seed;
    auto ctx = llama_init_from_file(model_path, params);

    // restore state
    llama_set_kv_cache(ctx, kv_cache_saved.data(), kv_cache_size, kv_cache_token_count);

    /*** logits */
    // ctx->logits.clear();
    // ctx->logits.insert(ctx->logits.end(), logits_saved, logits_saved + logits_size);

    // or

    ctx->logits = logits_saved;
    /* logits ***/
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // second run
...

@xaedes
Collaborator

xaedes commented Apr 22, 2023

@edp1096 Here is your example adapted to work with llama_copy_state_data & llama_set_state_data.
The main difference is that you need to allocate memory before retrieving state data with llama_copy_state_data.
The reason for this is that the random number generator state, logits, embeddings, and kv cache are not in one shared memory block, so simply returning a single pointer is not easily possible.
Restoring the first generated token for segfault-free sampling after loading is no longer necessary.

#include <vector>
#include <iostream>
#include <chrono>

#include "llama.h"
#include "llama.cpp"

using namespace std;

int main() {
    auto seed = 42;
    auto thread_count = 4;
    auto last_n_tokens_size = 64;
    auto prompt = "The quick brown fox";
    auto model_path = "../../ggml-vicuna-7b-4bit.bin";


    auto n_past = 0;
    auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);

    // init
    auto params = llama_context_default_params();
    params.seed = seed;
    auto ctx = llama_init_from_file(model_path, params);
    auto tokens = vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        cout << "Failed to tokenize prompt" << endl;
        return 1;
    }

    // evaluate prompt

    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, thread_count);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    // Save state (rng, logits, embedding and kv_cache) to file
    FILE *fp_write = fopen("dump_state.bin", "wb");
    auto state_size = llama_get_state_size(ctx);
    auto state_mem = new uint8_t[state_size];
    llama_copy_state_data(ctx, state_mem); // could also copy directly to memory mapped file
    fwrite(state_mem, 1, state_size, fp_write);
    fclose(fp_write);

    // save state (last tokens)
    auto last_n_tokens_data_saved = vector<llama_token>(last_n_tokens_data);
    auto n_past_saved = n_past;

    // save first generated token
    auto first_generated_token = llama_token(0);

    // first run
    cout << endl
         << prompt;
    for (auto i = 0; i < 6; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1);
        if (i == 0) {
            first_generated_token = next_token;
        }
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx, &next_token, 1, n_past, thread_count)) {
            cout << endl
                 << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl
         << endl;

    // free old model
    llama_free(ctx);

    // load new model
    params = llama_context_default_params();
    params.seed = seed;

    auto ctx2 = llama_init_from_file(model_path, params);

    // Load state (rng, logits, embedding and kv_cache) from file
    FILE *fp_read = fopen("dump_state.bin", "rb");
    auto state_size2 = llama_get_state_size(ctx2);
    if (state_size != state_size2) {
        cerr << "state size differs\n";
    }
    fread(state_mem, 1, state_size, fp_read);
    llama_set_state_data(ctx2, state_mem);  // could also read directly from memory mapped file
    fclose(fp_read);

    // restore state (last tokens)
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // this should not be necessary with llama_copy_state_data & llama_set_state_data as they will save and restore logits.
    
    // // restore first generated token so we can safely sample
    // llama_eval(
    //     ctx2,
    //     &first_generated_token,
    //     1,
    //     n_past,
    //     thread_count);
    // last_n_tokens_data.push_back(first_generated_token);
    // n_past += 1;
    // cout << endl << prompt << llama_token_to_str(ctx2, first_generated_token);
    
    // second run
    for (auto i = 0; i < 5; i++) {
        auto next_token = llama_sample_top_p_top_k(
            ctx2,
            &last_n_tokens_data.back() - last_n_tokens_size,
            last_n_tokens_size,
            40,
            1.0,
            1.0,
            1.1);
        auto next_token_str = llama_token_to_str(ctx2, next_token);
        last_n_tokens_data.push_back(next_token);
        cout << next_token_str;
        if (llama_eval(ctx2, &next_token, 1, n_past, thread_count)) {
            cout << endl
                 << "Failed to evaluate" << endl;
            return 1;
        }
        n_past += 1;
    }
    cout << endl
         << endl;
    return 0;
}

@edp1096
Contributor

edp1096 commented Apr 22, 2023

It works great for me! Thank you @xaedes !

@abetlen
Collaborator Author

abetlen commented Apr 23, 2023

Ah sorry - I forgot to mention there is now a new interface for saving / loading the llama state:

#1105

Thank you, works great! I'll close this issue in that case.

@malarau

malarau commented Apr 24, 2023

@xaedes

Here is your example adapted to work with llama_copy_state_data & llama_set_state_data...

Trying that example, is it normal behavior that restoring with llama_set_state_data and setting a new seed on the ctx (and changing top_k and top_p) still gives the same output for the second run?

Or does it have no effect, and the initial prompt needs to be re-evaluated?

@xaedes
Collaborator

xaedes commented Apr 24, 2023

@s2kjn93h

Yes, that is the expected behaviour. The seed is used when initializing the random number generator with llama_init_from_file. The state of the random number generator is then saved by llama_copy_state_data and will be restored with llama_set_state_data, so that the sampling results remain consistent.

When a different seed is set for a new context and the state is then loaded with llama_set_state_data, the random number generator will be in the state from when llama_copy_state_data was called, i.e. from the previous run.

If you want to generate other numbers after loading the llama state, i.e. to sample different tokens than the saved state would have, you can call llama_sample_top_p_top_k and discard the sampled token, as many times as you wish.
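For example, a minimal sketch of that sample-and-discard approach - the sampling parameters are arbitrary placeholders, and it assumes last_n_tokens holds at least last_n_tokens_size entries, as in the examples above:

#include <vector>
#include "llama.h"

// advance the restored RNG by drawing and throwing away a few samples
void burn_samples(struct llama_context * ctx,
                  const std::vector<llama_token> & last_n_tokens,
                  int last_n_tokens_size, int n_discard) {
    for (int i = 0; i < n_discard; i++) {
        (void) llama_sample_top_p_top_k(
            ctx,
            last_n_tokens.data() + last_n_tokens.size() - last_n_tokens_size,
            last_n_tokens_size,
            40, 0.95f, 0.80f, 1.10f); // top_k, top_p, temp, repeat_penalty
    }
}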

It could be beneficial to have API functions for more precise control in the future. In the meantime, we can gain greater control by directly altering the memory allocated for the llama state.

The code block for llama_copy_state_data demonstrates how to write a random number generator state to memory.
https://github.com/ggerganov/llama.cpp/blob/9b0a4d421459f4e5e1af735c9784c3247b379025/llama.cpp#L2116-L2129

Here is an example of initializing the random number generator with seed = 42 * 1337:

// get state from ctx
const size_t state_size = llama_get_state_size(ctx);
uint8_t * state_memory = new uint8_t[state_size];
llama_copy_state_data(ctx, state_memory);

// the rng we want to set in ctx
int seed = 42 * 1337;
auto rng = std::mt19937(seed);

// copy rng to state_memory (code taken from llama_copy_state_data)
#define LLAMA_MAX_RNG_STATE 64*1024
uint8_t * out = state_memory;
{
    std::stringstream rng_ss;
    rng_ss << rng;

    const size_t rng_size = rng_ss.str().size();
    char rng_buf[LLAMA_MAX_RNG_STATE];

    memset(&rng_buf[0], 0, LLAMA_MAX_RNG_STATE);
    memcpy(&rng_buf[0], rng_ss.str().data(), rng_ss.str().size());

    memcpy(out, &rng_size,   sizeof(rng_size));    out += sizeof(rng_size);
    memcpy(out, &rng_buf[0], LLAMA_MAX_RNG_STATE); out += LLAMA_MAX_RNG_STATE;
}

// set our rng in the ctx by setting state from state_memory
llama_set_state_data(ctx, state_memory);
