[Feature Request] Ability to rewind model evaluation by a fixed number of tokens #1281
Comments
How would this be possible, given the way the model evaluates layer by layer? Or do you mean using RAM to keep X sessions (or diffs from some particular Y session) saved until they are completed or replaced? That would give the ability to rewind, at the cost of more RAM usage. If you meant it in another way, then please elaborate.. =] Would love to hear other concepts around this, as I'm working on adding roles and other scenarios to my setup and it would help.. |
I think you should be able to reduce `n_past`. |
@j-f1 from my understanding (and experiments) |
This is also used with |
@abetlen So maybe we do not even need an API change - the user code can simply control the `n_past` value. |
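To make the "just control `n_past`" idea concrete, here is a minimal toy sketch of the mechanism being discussed. All names here are hypothetical, and the real llama.cpp KV cache holds per-layer key/value tensors rather than strings; the point is only that rewinding amounts to decrementing a position counter, after which stale cache slots are overwritten by later evals.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a llama context (hypothetical names): the KV cache is
// modeled as one slot per evaluated position.
struct ToyContext {
    std::vector<std::string> kv;  // one slot per evaluated position
    int n_past = 0;               // number of valid positions in kv
};

// "Eval" one token at position n_past: write (or overwrite) that slot.
void toy_eval(ToyContext &ctx, const std::string &tok) {
    if ((int)ctx.kv.size() <= ctx.n_past) {
        ctx.kv.resize(ctx.n_past + 1);
    }
    ctx.kv[ctx.n_past] = tok;
    ctx.n_past += 1;
}

// Rewinding is just reducing n_past: stale slots past the new position are
// ignored and will be overwritten by subsequent evals.
void toy_rewind(ToyContext &ctx, int n_tokens) {
    ctx.n_past = ctx.n_past > n_tokens ? ctx.n_past - n_tokens : 0;
}
```

In this sketch nothing is copied or freed on rewind; the old entries simply stop being referenced, which is why rewinding can be cheap compared to reloading a saved session.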
On the topic of restoring a shorter state from a longer session, I just put up #1310, which would allow sessions to be updated and restored incrementally. That could be more efficient than loading the full session and rewinding. |
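The incremental idea mentioned above can be sketched roughly as follows. This is a hypothetical shape, not the actual #1310 implementation: the session tracks how much has already been persisted and saves only the delta since the last checkpoint.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of incremental session saving: persist only the
// tokens added since the last checkpoint, not the full state every time.
struct ToySession {
    std::vector<int> tokens;   // everything evaluated so far
    size_t saved_upto = 0;     // how much has already been persisted
};

// Returns the delta to persist; a real implementation would also write the
// corresponding KV-cache data to disk, not just token ids.
std::vector<int> incremental_save(ToySession &s) {
    std::vector<int> delta(s.tokens.begin() + s.saved_upto, s.tokens.end());
    s.saved_upto = s.tokens.size();
    return delta;
}
```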
But this doesn't affect the logits/state, though, right? So its only effect on future tokens would be through the sampling algorithm and penalties, right? Maybe I'm looking at this in the completely wrong way... |
The logits are updated after an eval, so after rewinding you can re-eval the token at the new last position to refresh them. This is where a new API function could come in: it could recalculate the logits starting at a particular `n_past`. |
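A toy illustration of the point above, with hypothetical names: here the fake logits depend only on the last token evaluated (a real model attends over the whole cached prefix, but the refresh pattern is the same). After a rewind, the cached logits still belong to the pre-rewind position until you re-eval.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a model whose logits are produced by the most recent eval.
struct ToyLM {
    int n_past = 0;
    std::vector<float> logits;  // produced by the most recent eval
};

void toy_eval(ToyLM &lm, int token) {
    lm.logits = { (float)token, (float)token * 0.5f };  // fake forward pass
    lm.n_past += 1;
}

// Rewind by n tokens, then re-eval the token now at the last position so the
// cached logits match the rewound state instead of the pre-rewind one.
void rewind_and_refresh(ToyLM &lm, const std::vector<int> &tokens, int n) {
    lm.n_past -= n + 1;               // step back one extra position...
    toy_eval(lm, tokens[lm.n_past]);  // ...and re-eval to refresh the logits
}
```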
Thanks for the explanation! I'll be sure to read into all of the new APIs.. |
@j-f1 @ggerganov @DannyDaemonic this worked! I must've been doing something wrong (maybe not re-evaling the last token when the same prompt was used). I'll close this issue, since controlling `n_past` is enough. |
Does this mean you could select a token, for example token 20, and generate 5-10 completely different variations? Once you finish a variation with a specific temperature setting or other choices at token 20, can you simply rewind `n_past` by 16-32 and generate another entire line from the same initial token? Can you choose a different option from the sampler without having to load or save a session until you've generated as many variations as you want? I think I'm confused by seeing a 1+ GB state file that came from just some tokens being inserted, and by how I can just use `n_past` to rewind without it possibly affecting that. I looked into ggml's eval and I see that `n_past` is used, so maybe I just don't understand the network/KV cache aspect. I'll try to dig into some of the PDFs around the original model, which may help me understand the differences.. |
I was a bit confused myself. I'm still not 100% sure how it works, but I think the results are built up as you go, and since each result builds on all the previous results, they are stored in a way that they can be referenced in the future, which means you can jump back to them later as well. You can see #1111 for a little more discussion, but always verify what you read in a public forum. I believe this is how a lot of online chat bots and authoring tools actually do it. When you click "Regenerate response" on ChatGPT, it's most likely just jumping back to a past spot in the context and starting again. I do notice a change in "tone", so I think they are also adjusting temperature or top-k or something. |
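The "regenerate" pattern described above can be sketched like this. Everything here is hypothetical: `sample_next` stands in for a real sample-then-eval step, and the fake sampler just returns pseudo-random token ids. The key idea is that a single checkpoint of `n_past` lets you branch as many times as you like without saving or loading a session.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Toy generator (hypothetical names): sampling a token also "evals" it,
// advancing n_past by one.
struct ToyGen {
    int n_past = 0;
    std::mt19937 rng{42};
    int sample_next() {
        n_past += 1;                 // pretend we evaluated the sampled token
        return (int)(rng() % 100);   // fake token id
    }
};

// Generate n_var continuations of length len from the same branch point by
// resetting n_past between attempts.
std::vector<std::vector<int>> variations(ToyGen &g, int n_var, int len) {
    std::vector<std::vector<int>> out;
    const int checkpoint = g.n_past;  // branch point, e.g. token 20
    for (int v = 0; v < n_var; ++v) {
        g.n_past = checkpoint;        // rewind: no session save/load needed
        std::vector<int> cont;
        for (int i = 0; i < len; ++i) {
            cont.push_back(g.sample_next());
        }
        out.push_back(cont);
    }
    return out;
}
```

In a real setup you would vary the temperature or top-k between iterations of the outer loop to get the "tone" changes mentioned above.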
Great, I appreciate your response, even though the issue has already been closed. I had been contemplating it in the same way that you had approached it in #1111, and overall it makes more sense now. I will try to follow the code, but I understand that you can rewind with `n_past`, and I assume it will just overwrite the prior KV cache. This is useful for the 10-15 variations example. Currently, I am subjecting all my prompts to variations of temperatures and different models, and this approach would enable me to save time and reduce CPU usage. Furthermore, I would like to delve deeper into creating more diverse outcomes and exploring GPT "creativity." Additionally, I am attempting to perform secondary queries for each prompt using NLP/spaCy to break down the English language used in each response, and I also aim to provide explanations about proper nouns and other relevant information. Thus, conserving CPU through tricks and normal API usage is a wise move for achieving my objectives. |
The recent additions of the state and session APIs have made it possible to implement caching for llama models, which has greatly improved responsiveness in many applications.
The current APIs, however, still leave something to be desired; specifically, it would be very useful to be able to rewind / roll back an evaluated model by a fixed number of tokens, so that a single longer saved state could be used to restore any shorter state.
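The requested "restore any shorter state from one longer save" can be sketched as follows, under the assumption (borne out by the discussion above) that rewinding reduces to lowering `n_past` after a full restore. The names are hypothetical, and a real saved state carries KV-cache data rather than just token ids.

```cpp
#include <cassert>
#include <vector>

// Toy saved state (hypothetical): the tokens covered by the save plus the
// evaluation position.
struct ToyState {
    std::vector<int> tokens;  // tokens covered by the saved state
    int n_past = 0;
};

ToyState save_state(const std::vector<int> &toks) {
    return { toks, (int)toks.size() };
}

// Restore the full state, then rewind to an arbitrary shorter length;
// cached data past target_len is simply ignored by future evals.
ToyState restore_prefix(const ToyState &saved, int target_len) {
    ToyState s = saved;
    if (target_len < s.n_past) {
        s.n_past = target_len;
    }
    return s;
}
```

One long save can therefore stand in for every prefix of the session, instead of keeping one multi-gigabyte state file per prefix.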