[Feature Request] Ability to rewind model evaluation by a fixed number of tokens #1281

Closed
abetlen opened this issue May 2, 2023 · 13 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), high priority (Very important issue)

Comments

@abetlen
Collaborator

abetlen commented May 2, 2023

The recent additions of the state and session APIs have made it possible to implement caching for llama models, which has greatly improved responsiveness in many applications.

The current APIs, however, still leave something to be desired: specifically, it would be very useful to be able to rewind / roll back an evaluated model by a fixed number of tokens, so that a single longer saved state could be used to restore any shorter state.
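
(For readers landing here later: a rough sketch of the pattern being asked for, using the state API of that era. The llama_get_state_size / llama_copy_state_data / llama_set_state_data calls are from the mid-2023 C API and may have changed since; ctx, n_past, and the buffer handling are simplified placeholders, not a definitive implementation.)

```c
/* Hedged sketch: snapshot the state once after a long eval, then restore it
 * and "rewind" by continuing from a smaller n_past. `n_past` is caller-side
 * bookkeeping for the position reached so far. */
#include <stdlib.h>
#include "llama.h"

static void rewind_example(struct llama_context * ctx, int n_past) {
    size_t    state_size = llama_get_state_size(ctx);
    uint8_t * state_buf  = malloc(state_size);
    llama_copy_state_data(ctx, state_buf);   /* snapshot the long state */
    int n_past_saved = n_past;               /* remember the position too */

    /* ... later: restore the long state, but continue from an earlier
     * position; this is the "rewind" being asked for. */
    llama_set_state_data(ctx, state_buf);
    n_past = n_past_saved - 16;              /* forget the last 16 tokens */

    /* Note: the logits still reflect the snapshot; see the discussion below
     * about re-evaluating the last kept token before sampling. */
    free(state_buf);
    (void) n_past;
}
```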

@ggerganov added the enhancement, good first issue, and high priority labels May 2, 2023
@mikeggh

mikeggh commented May 3, 2023

How would this be possible, given the way the model evaluates each layer? Or do you mean using RAM to keep X sessions (or diffs from some particular session Y) saved until they're completed or replaced? That would allow rewinding at the cost of more RAM usage.

If you meant it another way, please elaborate =] I'd love to hear other ideas around this, as I'm working on adding roles and other scenarios to my setup and it would help.

@j-f1
Collaborator

j-f1 commented May 4, 2023

I think you should be able to reduce n_past by the number of tokens you want to forget and then run inference/prompt evaluation again, right?

@abetlen
Collaborator Author

abetlen commented May 4, 2023

@j-f1 from my understanding (and experiments), n_past is the number of tokens to keep from the current eval position, not from the start of the context window.

@DannyDaemonic
Contributor

n_past is your position in the context window. I've successfully used an older position to roll back "state." In fact, this is done internally when the context window fills up and you've used the --keep option.

This is also used with --session files. If you change the last characters of your prompt, it will reuse the state up to those characters and re-evaluate the remainder.

@ggerganov
Member

@abetlen The n_past is counted from the start of the context, so reducing n_past by, let's say, 4 is equivalent to forgetting the last 4 tokens without any re-evaluation (see @DannyDaemonic's answer).

So maybe we do not even need an API change; the user code can simply control the n_past number to achieve the desired rewind function.
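
(As an illustration of the rewind described above, here is a minimal sketch against the llama_eval()-style API of that era; newer versions replaced it with llama_decode() and explicit KV-cache calls. The token buffer, counts, and thread count are hypothetical caller-side bookkeeping.)

```c
/* Minimal sketch of the n_past rewind: nothing is re-evaluated, the caller
 * just moves its position back and the next eval overwrites the KV entries
 * past that point. Assumes the mid-2023 llama_eval() signature. */
#include "llama.h"

static void forget_and_continue(struct llama_context * ctx,
                                int * n_past,               /* caller position */
                                const llama_token * new_tokens,
                                int n_new,
                                int n_threads) {
    *n_past -= 4;                            /* forget the last 4 tokens */

    /* Evaluate new input from the rewound position; entries beyond *n_past
     * in the KV cache are simply overwritten. */
    llama_eval(ctx, new_tokens, n_new, *n_past, n_threads);
    *n_past += n_new;
}
```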

@ejones
Collaborator

ejones commented May 4, 2023

On the topic of restoring a shorter state from a longer session, I just put up #1310, which would allow sessions to be updated and restored incrementally. That is potentially more efficient than loading the full session and rewinding.

@mikeggh

mikeggh commented May 4, 2023

> @abetlen The n_past is counted from the start of the context. So reducing n_past by let's say 4 is equivalent to forgetting the last 4 tokens without any re-evaluation (see @DannyDaemonic's answer)

But this doesn't affect the logits/state, though, right? So its only effect on future tokens would be through the sampling algorithm and repeat penalty? Maybe I'm looking at this in completely the wrong way...

@DannyDaemonic
Contributor

The logits are updated after an eval. You can either eval the token at n_past again, or simply begin eval on new tokens if there's input to process.

This is where a new API function could come in. It could recalculate the logits starting at a particular n_past, but in practice, just evaluating the last token or starting to eval new tokens should be sufficient.
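
(A hedged sketch of that suggestion, assuming the same mid-2023 llama_eval() API: after rewinding, re-evaluate the last token you are keeping so that llama_get_logits() once again reflects the rewound position. The tokens array and keep_count are hypothetical caller-side bookkeeping.)

```c
/* Refresh the logits after a rewind by re-evaluating the last kept token.
 * `tokens` is the caller's history of everything evaluated so far and
 * `keep_count` is how many of those tokens should remain in context. */
#include "llama.h"

static float * refresh_logits_after_rewind(struct llama_context * ctx,
                                           const llama_token * tokens,
                                           int keep_count,
                                           int * n_past,
                                           int n_threads) {
    *n_past = keep_count - 1;                 /* position of the last kept token */
    llama_eval(ctx, &tokens[*n_past], 1, *n_past, n_threads);
    *n_past += 1;                             /* context now holds keep_count tokens */

    return llama_get_logits(ctx);             /* valid for sampling the next token */
}
```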

@mikeggh

mikeggh commented May 5, 2023

Thanks for the explanation! I'll be sure to read into all of the new APIs.

@abetlen
Collaborator Author

abetlen commented May 5, 2023

@j-f1 @ggerganov @DannyDaemonic this worked! I must've been doing something wrong (maybe not re-evaluating the last token when the same prompt was used).

I'll close this issue as the n_past solution works great for me; no need to make any API changes. Thanks!

@abetlen closed this as completed May 5, 2023
@mikeggh

mikeggh commented May 5, 2023

Does this mean you could pick a point, for example token 20, and generate 5-10 completely different variations? Once you finish a variation with a specific temperature setting or other choices after token 20, can you simply rewind n_past by 16-32 and generate another entire line from the same starting token? Can you choose a different option from the sampler, without having to load or save a session, until you've generated as many variations as you want?

I think I'm confused by seeing a 1+ GB state file that came from just a few tokens being inserted, and by how I can just use n_past to rewind without that being affected. I looked into ggml's eval and I see that n_past is used, so maybe I just don't understand the network / KV cache aspect. I'll try to dig into some of the papers around the original model, which may help me understand the differences.

@DannyDaemonic
Contributor

I was a bit confused myself. I'm still not 100% sure how it works, but I think the results are built up as you go, and since each result builds on all the previous results, they are stored in a way that lets them be referenced in the future, which means you can jump back to them later as well. See #1111 for a little more discussion, but always verify what you read in a public forum.

I believe this is how a lot of online chat bots and authoring tools actually do it. When you click "Regenerate response" on ChatGPT, it's most likely just jumping back to a past spot in the context and starting again. I do notice a change in "tone," so I think they are also adjusting temperature or top-k or something.
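
(For what it's worth, a sketch of that "regenerate" pattern in terms of the n_past rewind discussed above, assuming the mid-2023 llama_eval() / llama_sample_top_p_top_k() calls; prompt_tokens, n_prompt, n_threads, and the sampling parameters are placeholders, and passing NULL/0 for the repeat-penalty history is only to keep the example short.)

```c
/* Generate several variations from the same prompt by rewinding n_past back
 * to the end of the prompt between runs. Assumes the prompt has already been
 * evaluated once; uses an era-appropriate sampling call for brevity. */
#include <stdio.h>
#include "llama.h"

static void generate_variations(struct llama_context * ctx,
                                const llama_token * prompt_tokens,
                                int n_prompt,
                                int n_threads) {
    for (int v = 0; v < 5; v++) {
        /* Rewind to just before the last prompt token and re-evaluate it so
         * the logits reflect the end of the prompt again (see above). */
        int n_past = n_prompt - 1;
        llama_eval(ctx, &prompt_tokens[n_past], 1, n_past, n_threads);
        n_past += 1;

        for (int i = 0; i < 64; i++) {
            /* NULL/0 history disables the repeat penalty in this sketch. */
            llama_token tok = llama_sample_top_p_top_k(ctx, NULL, 0,
                                                       40, 0.95f, 0.80f, 1.10f);
            printf("%s", llama_token_to_str(ctx, tok));
            llama_eval(ctx, &tok, 1, n_past, n_threads);  /* overwrites old KV */
            n_past += 1;
        }
        printf("\n---\n");
    }
}
```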

@mikeggh

mikeggh commented May 5, 2023

Great, I appreciate your response even though the issue has already been closed. I had been thinking about it the same way you approached it in #1111, and overall it makes more sense now. I will try to follow the code, but I understand that you can rewind with n_past, and I assume it will just overwrite the prior KV cache. This is useful for the 10-15 variations example.

Currently, I am running all my prompts through variations of temperature and different models. This approach would let me save time and reduce CPU usage. Furthermore, I would like to delve deeper into creating more diverse outcomes and exploring GPT "creativity." Additionally, I am attempting to perform secondary queries for each prompt using NLP/spaCy to break down the English used in each response, and I also aim to provide explanations about proper nouns and other relevant information. So conserving CPU through tricks and normal API usage is a wise move for achieving my objectives.
Have a great weekend!
