Original prompt being forgotten. n_keep tokens not functioning as expected? #1647

Closed
MrJackSpade opened this issue May 30, 2023 · 8 comments

@MrJackSpade

Windows.

Maybe I'm misinterpreting the functionality.

I noticed that my long-running bot was forgetting its original prompt, despite n_keep being -1. I pulled down the project and threw a breakpoint in the section that manages the rollover:


           // infinite text generation via context swapping
            // if we run out of context:
            // - take the n_keep first tokens from the original prompt (via n_past)
            // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
            if (n_past + (int) embd.size() > n_ctx) {

                const int n_left = n_past - params.n_keep;

                // always keep the first token - BOS
                n_past = std::max(1, params.n_keep);

                // insert n_left/2 tokens at the start of embd from last_n_tokens
                embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());

                // stop saving session if we run out of context
                path_session.clear();

I'm not 100% sure exactly what this is supposed to do; it's not intuitive to me because I'm a C# guy. But since last_n_tokens is being used and inserted into embd, my assumption is that a new embd is being constructed to contain the new context information. I'm also assuming, based on the comments, that this new embd is supposed to contain the original prompt. So this appears to be functioning like some kind of refresh.

What I'm not seeing, however, is my original prompt anywhere in this new embed.

I'm not sure if this is a bug, or if I'm fundamentally misunderstanding the intent of this block of code.

Could this be the cause of my session forgetting its original prompt?

@mgroeber9110
Contributor

It also took me a while to figure out how this works, but the reason you are not seeing the original prompt is that its context is literally kept when resetting, and embd only contains the dynamically generated context that gets re-scanned.

The important part is the setting of n_past, which indicates the length of the prefix that llama_eval() should preserve before appending the contents of embd. The code even tries to extend this prefix as much as possible to include the entire start of the context that may have been calculated previously.

If you still suspect that something goes wrong in this mechanism, it might be interesting to simulate the state after reset by concatenating the original prompt and the preserved part of the old text into a prompt and letting the model generate from that (with a fixed seed or temp==0 for determinism). You could get at the exact contents of the preserved text by uncommenting the "resetting" printf() code.

You should then see clearly if your text is continued with or without regard for the original prompt, without having to rely on the reset mechanism, i.e. if the model really "forgets" the original prompt, or if the combination of original prompt + half of the most recently generated text just is a poor substitute for actual bigger context.

Thinking this through, I can imagine that the concatenation could sometimes be meaning-distorting, e.g. if the original prompt ends in something like "Human:", but what gets concatenated next is a truncated part of the AI's response, effectively flipping dialog turns.

It may be worth actually writing this out for a concrete example...

@MrJackSpade
Author

Appreciate the response.

Before I dig too deep into it, I should probably ask for clarification on one point.

What you're saying is, by the time it gets to path_session.clear();, embd should NOT contain the original prompt information?

What's confusing me is that embd does contain data from last_n_tokens, which makes me think that this context "rotation" is a destructive operation, and it also looks like the original prompt is treated the same as any other input. So what I'm missing is exactly how it functions such that the original prompt is preserved, but last_n_tokens is not.

The reason I'm trying to hunt this down is that, as far as I can tell, the model is able to properly recall all data from the original prompt until the context is refreshed when it overruns. After that point, the model can't seem to remember anything. We're talking 10/10 accuracy before the context rolls, and (usually) 0/10 after. For some reason, however, every once in a while it will be able to answer questions after the context rolls, but that's pretty rare.

It's just super weird that the model's entire personality pivots with the context roll. It starts hallucinating everything. I can't replicate this behavior at any point before the context switches either.

@mgroeber9110
Contributor

mgroeber9110 commented May 31, 2023

One caveat: I have not written any of the code, I have just tried to understand it some weeks ago out of a similar concern as yours.

What took me a while to notice (obvious in hindsight if you know where to look) is that, in standard --interactive mode, n_keep defaults to zero, not the prompt length. To change this, pass --keep -1 on the command line to freeze exactly the initial prompt that you passed in a file or on the command line.
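
For example (the model path and prompt file here are just placeholders, not something from this issue), a command line along these lines should pin the entire initial prompt:

    ./main -m models/7B/ggml-model-q4_0.bin -f prompt.txt -c 2048 -i --keep -1

With --keep -1, n_keep ends up equal to the tokenized length of that prompt, so every context swap keeps it at the front.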

In the code, the important part for preserving the original prompt is the parameter n_past in this call:

llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)

It indicates how many tokens of the previous state of the model should be preserved before evaluating the contents of embd starting from this state. So llama_eval() is only partially destructive: it does not start from the fresh state of an "empty" model, but from the state after n_past tokens have been evaluated.

This is in fact used for all evaluations, to make sure that new samples are appended at the end of the context.
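
To make the calling pattern concrete, here is a minimal sketch built around the llama_eval() signature quoted above; ctx, prompt, new_token and n_threads are stand-ins for the corresponding variables in main.cpp, not the exact names used there:

    // Sketch, not a complete program: assume ctx is an initialized llama_context*,
    // prompt is the tokenized prompt (std::vector<llama_token>), new_token is a
    // freshly sampled llama_token, and n_threads is the thread count.
    int n_past = 0;

    // Evaluate the original prompt once; its KV cache entries fill positions [0, prompt.size()).
    llama_eval(ctx, prompt.data(), (int) prompt.size(), n_past, n_threads);
    n_past += (int) prompt.size();

    // Each newly sampled token is then evaluated on top of the cached state:
    // only the new token is passed in, the first n_past positions come from the cache.
    llama_eval(ctx, &new_token, 1, n_past, n_threads);
    n_past += 1;

    // After a context swap, n_past is reset to std::max(1, params.n_keep), so the cached
    // entries for the kept prefix (the original prompt) are reused without re-evaluation.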

Of course, none of this really guarantees that the partial buffer together with the prepended original prompt makes enough sense to continue a meaningful dialog. For example, the user's last input may not be included, so the model no longer has any clue what its actual task is...

@SlyEcho
Collaborator

SlyEcho commented Jun 5, 2023

embd is kind of like a queue of new tokens to evaluate. These new tokens come from generation (sampling the likely next token), from user input at the console, or from the program input (-p or -f args).

n_past keeps track of how much of the context the model has evaluated. It can recall the info for past tokens from the KV cache so there is no need to pass in the tokens again, only the number of tokens.

n_ctx is the context size; you can set it from the command line with -c. The default is 512, but the maximum is 2048 for standard LLaMA models and clones.

if (n_past + embd.size() > n_ctx) -- this means that if, for example, n_past is 512 and embd contains one token (that was just generated), the model cannot evaluate any more because it won't fit in the context.

What now has to happen is that the context has to be shortened somehow. A trick that seems to be very effective is used: some of the old context is kept at the beginning (this is optional), and the second half of the non-kept old context is moved right after it, freeing up half or less of the context space for new tokens.

n_keep is the number of tokens to keep from the old context. Using the value -1 should keep all of the original prompt (from -p or -f), but take note that it will not exceed n_ctx.

n_left = n_past - n_keep -- the tokens that are not kept in the beginning. The first half will be deleted and the second half moved before the tokens in embd.

n_past = max(1, n_keep) -- it is pretty much the same as n_past = n_keep, except that when n_keep == 0 the first token (which should always be BOS) is still kept, so it never has to be re-evaluated.

last_n_tokens -- contains the last tokens that were evaluated or generated. This data structure works in a slightly unexpected way: the size of this vector is always n_ctx, and new tokens are added to the end while the first element is removed, keeping the size constant.
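
Based on that description, the update is essentially a sliding window over a fixed-size vector. A sketch (id stands for the token that was just evaluated or sampled):

    std::vector<llama_token> last_n_tokens(n_ctx, 0);   // always n_ctx elements, initially padding

    // every time a token is evaluated or sampled:
    last_n_tokens.erase(last_n_tokens.begin());         // drop the oldest entry from the front
    last_n_tokens.push_back(id);                        // append the newest token at the end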

embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size())

Let's unravel this one.

std::vector<>::insert(pos, first, last) -- will insert the provided elements into the chosen position in embd. Since embd.begin() is used, they are inserted into the beginning.

last_n_tokens.begin() + n_ctx -- this is pointing to the end of last_n_tokens since it always has n_ctx elements. (Technically it should be the same as last_n_tokens.end()?)

last_n_tokens.begin() + n_ctx - n_left/2 -- we move back from the end by the amount of tokens we are keeping.

- embd.size() is taken off because last_n_tokens already contains the same tokens that are in embd. It shifts the start and end iterators by this.


embd will never contain the n_keep tokens again because they never need to be re-evaluated. The effect that n_keep has on the model is applied through n_past instead.
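
To put made-up but concrete numbers on the whole swap: say n_ctx = 512, n_keep = 32, the context is full (n_past = 512), and embd holds just the one token that was sampled last:

    const int n_left = n_past - params.n_keep;   // 512 - 32 = 480 tokens are not kept
    n_past = std::max(1, params.n_keep);         // n_past = 32; cache positions [0, 32) stay valid

    // The insert copies from last_n_tokens.begin() + 512 - 480/2 - 1 (= begin() + 271)
    // up to last_n_tokens.end() - 1, i.e. n_left/2 = 240 tokens, into the front of embd.
    // embd now holds 240 + 1 = 241 tokens, and the next llama_eval() re-fills context
    // positions [32, 273), leaving roughly half of the 512-token context free again.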

@TonyWeimmer40

I've also noticed n_keep acting unexpectedly. With -p, n_keep defaults to 0 as expected, but with -ins it defaults to 2 even if you force --keep -1. Can somebody show their command line with -ins and a working --keep n?

@MrJackSpade
Author

I'm going to close this because, while I can't explain why the model kept forgetting the prompt, after continuing to dig through the code for the last week I can at least say I understand how the eval works.

In the end I've found a different way to ensure the model remembers the prompt.

@TonyWeimmer40

TonyWeimmer40 commented Jun 9, 2023

I'm going to close this because, while I can't explain why the model kept forgetting the prompt, after continuing to dig through the code for the last week I can at least say I understand how the eval works.

In the end I've found a different way to ensure the model remembers the prompt.

Could you please share the method/fix you've found? It would be a great help in getting it up and running.

@amyawu

amyawu commented Jun 28, 2024

^ Trying to figure that out too.
