Original prompt being forgotten. n_keep tokens not functioning as expected? #1647

Closed
MrJackSpade opened this issue May 30, 2023 · 8 comments

@MrJackSpade

Windows.

Maybe I'm misinterpreting the functionality.

I noticed that my long-running bot was forgetting its original prompt, despite n_keep being -1. I pulled down the project and threw a breakpoint in the section that manages the rollover:


           // infinite text generation via context swapping
            // if we run out of context:
            // - take the n_keep first tokens from the original prompt (via n_past)
            // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
            if (n_past + (int) embd.size() > n_ctx) {

                const int n_left = n_past - params.n_keep;

                // always keep the first token - BOS
                n_past = std::max(1, params.n_keep);

                // insert n_left/2 tokens at the start of embd from last_n_tokens
                embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());

                // stop saving session if we run out of context
                path_session.clear();

I'm not 100% sure exactly what this is supposed to do; it's not intuitive to me because I'm a C# guy. But since last_n_tokens is being used and inserted into embd, my assumption is that a new embd is being constructed to contain the new context information. I'm also assuming, based on the comments, that this new embd is supposed to contain the original prompt. So this appears to be functioning like some kind of refresh.

What I'm not seeing, however, is my original prompt anywhere in this new embed.

I'm not sure if this is a bug, or if I'm fundamentally misunderstanding the intent of this block of code.

Could this be the cause of my session forgetting its original prompt?

@mgroeber9110
Contributor

It also took me a while to figure out how this works, but the reason you are not seeing the original prompt is that its context is literally kept when resetting, and embd only contains the dynamically generated context that gets re-scanned.

The important part is the setting of n_past, which indicates the length of the prefix that llama_eval() should preserve before appending the contents of embd. The code even tries to extend this prefix as much as possible to include the entire start of the context that may have been calculated previously.

If you still suspect that something goes wrong in this mechanism, it might be interesting to simulate the state after reset by concatenating the original prompt and the preserved part of the old text into a prompt and letting the model generate from that (with a fixed seed or temp==0 for determinism). You could get at the exact contents of the preserved text by uncommenting the "resetting" printf() code.

You should then see clearly if your text is continued with or without regard for the original prompt, without having to rely on the reset mechanism, i.e. if the model really "forgets" the original prompt, or if the combination of original prompt + half of the most recently generated text just is a poor substitute for actual bigger context.

Thinking this through, I can imagine that the concatenation could sometimes be meaning-distorting, e.g. if the original prompt ends in something like "Human:", but what gets concatenated next is a truncated part of the AI's response, effectively flipping dialog turns.

It may be worth actually writing this out for a concrete example...

@MrJackSpade
Author

Appreciate the response.

Before I dig too deep into it, I should probably ask for clarification on one point.

What you're saying is, by the time it gets to path_session.clear();, embd should NOT contain the original prompt information?

What's confusing me is that embd does contain data from last_n_tokens, which makes me think that this context "rotation" is a destructive operation, and it also looks like the original prompt is treated the same as any other input. So what I'm missing is exactly how it functions such that the original prompt is preserved, but last_n_tokens is not.

The reason I'm trying to hunt this down is that, as far as I can tell, the model is able to properly recall all data from the original prompt until the context is refreshed when it overruns. After that point, the model can't seem to remember anything. We're talking 10/10 accuracy before the context rolls, and (usually) 0/10 after. For some reason, however, every once in a while it will be able to answer questions after the context rolls, but that's pretty rare.

It's just super weird that the model's entire personality pivots with the context roll. It starts hallucinating everything. I can't replicate this behavior at any point before the context switches either.

@mgroeber9110
Contributor

mgroeber9110 commented May 31, 2023

One caveat: I have not written any of the code, I have just tried to understand it some weeks ago out of a similar concern as yours.

What took me a while to notice (obvious in hindsight if you know where to look) is that, in standard --interactive mode, n_keep defaults to zero, not the prompt length. To change this, pass --keep -1 on the command line to freeze exactly the initial prompt that you passed in a file or on the command line.
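
For example (the model path and prompt file here are just placeholders, not something from this issue), a command line along these lines should pin the entire initial prompt:

    ./main -m models/7B/ggml-model-q4_0.bin -f prompt.txt -c 2048 -i --keep -1

With --keep -1, n_keep ends up equal to the tokenized length of that prompt, so every context swap keeps it at the front.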

In the code, the important part for preserving the original prompt is the parameter n_past in this call:

llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)

It indicates how many tokens of the previous state of the model should be preserved before evaluating the contents of embd starting from this state. So llama_eval() is only partially destructive: it does not start from the fresh state of an "empty" model, but from the state after n_past tokens have been evaluated.

This is in fact used for all evaluations, to make sure that new samples are appended at the end of the context.
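
To make the calling pattern concrete, here is a minimal sketch built around the llama_eval() signature quoted above; ctx, prompt, new_token and n_threads are stand-ins for the corresponding variables in main.cpp, not the exact names used there:

    // Sketch, not a complete program: assume ctx is an initialized llama_context*,
    // prompt is the tokenized prompt (std::vector<llama_token>), new_token is a
    // freshly sampled llama_token, and n_threads is the thread count.
    int n_past = 0;

    // Evaluate the original prompt once; its KV cache entries fill positions [0, prompt.size()).
    llama_eval(ctx, prompt.data(), (int) prompt.size(), n_past, n_threads);
    n_past += (int) prompt.size();

    // Each newly sampled token is then evaluated on top of the cached state:
    // only the new token is passed in, the first n_past positions come from the cache.
    llama_eval(ctx, &new_token, 1, n_past, n_threads);
    n_past += 1;

    // After a context swap, n_past is reset to std::max(1, params.n_keep), so the cached
    // entries for the kept prefix (the original prompt) are reused without re-evaluation.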

Of course, none of this really guarantees that the partial buffer together with the prepended original prompt makes enough sense to continue a meaningful dialog. For example, the user's last input may not be included, so the model no longer has any clue what its actual task is...

@SlyEcho
Collaborator

SlyEcho commented Jun 5, 2023

embd is kind of like a queue of new tokens to evaluate. These new tokens come from generation (sampling the likely next token), from user input at the console, or from the program input (-p or -f args).

n_past keeps track of how much of the context the model has evaluated. It can recall the info for past tokens from the KV cache so there is no need to pass in the tokens again, only the number of tokens.

n_ctx is the context size; you can set it from the command line with -c. The default is 512, but the maximum is 2048 for standard LLaMA models and clones.

if (n_past + embd.size() > n_ctx) -- this means that if, for example, n_past is 512 and embd contains one token (that was just generated), the model cannot evaluate any more because it won't fit in the context.

What now has to happen is that the context has to be shortened somehow. A trick that seems to be very effective is used: some of the old context is kept at the beginning (this is optional), and the second half of the non-kept old context is moved right after it, freeing up half or less of the context space for new tokens.

n_keep is the number of tokens to keep from the old context. Using the value -1 should keep all of the original prompt (from -p or -f), but take note that it will not exceed n_ctx.

n_left = n_past - n_keep -- the tokens that are not kept in the beginning. The first half will be deleted and the second half moved before the tokens in embd.

n_past = max(1, n_keep) -- it is pretty much the same as n_past = n_keep, except that when n_keep == 0 the first token (which should always be BOS) is still kept, so it never has to be re-evaluated.

last_n_tokens -- contains the last tokens that were evaluated or generated. This data structure works in a slightly unexpected way: the size of this vector is always n_ctx, and new tokens are added to the end while the first element is removed, keeping the size constant.
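
Based on that description, the update is essentially a sliding window over a fixed-size vector. A sketch (id stands for the token that was just evaluated or sampled):

    std::vector<llama_token> last_n_tokens(n_ctx, 0);   // always n_ctx elements, initially padding

    // every time a token is evaluated or sampled:
    last_n_tokens.erase(last_n_tokens.begin());         // drop the oldest entry from the front
    last_n_tokens.push_back(id);                        // append the newest token at the end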

embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size())

Let's unravel this one.

std::vector<>::insert(pos, first, last) -- will insert the provided elements into the chosen position in embd. Since embd.begin() is used, they are inserted into the beginning.

last_n_tokens.begin() + n_ctx -- this is pointing to the end of last_n_tokens since it always has n_ctx elements. (Technically it should be the same as last_n_tokens.end()?)

last_n_tokens.begin() + n_ctx - n_left/2 -- we move back from the end by the amount of tokens we are keeping.

- embd.size() is taken off because last_n_tokens already contains the same tokens that are in embd. It shifts the start and end iterators by this.


embd will never contain the n_keep tokens again because they never need to be re-evaluated. The effect that n_keep has on the model is applied through n_past instead.
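
To put made-up but concrete numbers on the whole swap: say n_ctx = 512, n_keep = 32, the context is full (n_past = 512), and embd holds just the one token that was sampled last:

    const int n_left = n_past - params.n_keep;   // 512 - 32 = 480 tokens are not kept
    n_past = std::max(1, params.n_keep);         // n_past = 32; cache positions [0, 32) stay valid

    // The insert copies from last_n_tokens.begin() + 512 - 480/2 - 1 (= begin() + 271)
    // up to last_n_tokens.end() - 1, i.e. n_left/2 = 240 tokens, into the front of embd.
    // embd now holds 240 + 1 = 241 tokens, and the next llama_eval() re-fills context
    // positions [32, 273), leaving roughly half of the 512-token context free again.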

@TonyWeimmer40

I've also noticed n_keep acting unexpectedly. With -p, n_keep defaults to 0 as expected, but with -ins it defaults to 2 even if you force --keep -1. Can somebody show their command line with -ins and a working --keep n?

@MrJackSpade
Author

I'm going to close this because, while I can't explain why the model kept forgetting the prompt, after continuing to dig through the code for the last week I can at least say I understand how the eval works.

In the end I've found a different way to ensure the model remembers the prompt.

@TonyWeimmer40

TonyWeimmer40 commented Jun 9, 2023

I'm going to close this because, while I can't explain why the model kept forgetting the prompt, after continuing to dig through the code for the last week I can at least say I understand how the eval works.

In the end I've found a different way to ensure the model remembers the prompt.

Could you please share the method/fix you've found? It would be a great help in getting it up and running.

@amyawu

amyawu commented Jun 28, 2024

^ Trying to figure that out too.
