-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
main : add option to save full output to session #1338
Conversation
I like what you're doing here. I think there are two different needs being met with session files. I wonder if, for clarity, we don't need two separate options: a prompt cache (which does what Here's what I would proposeAnd these are just my thoughts. I'm not saying it has to be this way by any means. Let me know what you think. Prompt Cache:
|
I just tested this out, and it looks like uninitialized tokens are 0, so we can resume by doing something like this: int n_resume_tokens = session_tokens.size();
for (int i = 0; i < (int) session_tokens.size(); i++) {
if (session_tokens[i] == 0) {
n_resume_tokens = i;
break;
}
printf("%s", llama_token_to_str(ctx, session_tokens[i]));
} |
I like idea to rename --session to --prompt-cache |
Yeah sounds good, I can make those adjustments. As for |
Hmm, I guess I'm fine with it either tossing the prompt or appending it. I don't see the use case for a prompt when you've already got context ready to resume, but it is weird to sort of just ignore the prompt someone passed in. The only downside to appending it would mean be people can't just hit So, yeah. I can see both sides to it. This is the internet, so it might be something you set up one way and find it outrages people - in which case, it'd be easy enough to change. |
Yeah, it's so that we can resume with additional input, e.g. feedback or user messages. The current PR does this as it preserves the prefix matching on prompt. Where this is going is I think we can accomplish interactivity, instruct etc now with repeat invocations of main rather than in process. Then we can retire all that code and refocus main on just single turn generation. I have a POC of chat working this way locally. |
I can see that use case. I agree appending makes the most sense. I just think we should be consistent about it. If there's a session file from a previous An end user could keep a running text of the conversation and feed that in as the prompt each time, but it will "randomly" stop working. The reason being that once you hit your context length, the reset will drop half of your context. Two things will happen, your prompt will be greater than the context so it will be refused, and if you wanted to use just the second half of your prompt so far, it'd be tricky to know what part of the prompt to trim. (It's not exactly half.) It would be a lot more practical if it just printed out the tokens from the context and used the prompt as additional input after. I'm not sure if I'm explaining it clearly. If you have a proof of concept, you could try setting your context to something low, like Edit: Typo: should be |
Yeah, I'm envisioning that context management can be externalized along with interactivity. This matches how the hosted LLMs work anyway. This can be provided in scripts, just doesn't need to be hard-coded in main. |
Interesting. I would be hesitant to force the externalization of the context window. It would seem to me, a script tracking the output would have a lot easier of a time just appending the new output to it's log of the conversation than tracking the actual context so that it can properly resume. |
Not to dissuade you from this approach, but I think what #1345 is inquiring about is an so/dll version of main. It sounds like they may have experience in that area. That might be a better approach if you're looking for more control over the context outside of main. |
@DannyDaemonic I implemented the renames you suggested, but I decided to keep the original semantics of passing in the full prompt. In testing the appending behavior, I felt it was too hard to reason about the state of the session file, as you alluded to. And I do believe the ability to pass in new inputs (however it is done) is what makes this worthwhile; if you're just resuming without inputs, I feel like you could've just passed a larger (As a side benefit, I think this reduces the surprise of the new behavior of Regarding #1345, I believe that actually hits the nail on the head for what I'm saying about retiring interactivity in main. This new |
Oh, sorry. I misspoke earlier. I mean try it with a low context, like |
I mean, I understood what you meant. Is your point that this only works until you fill up the context size? If so I think that doesn't diminish its value? |
Of the two approaches, passing in the full history each time vs resuming the session purely from the state, one keeps working. Passing in the full history only works until you hit your first context reset, then you can't resume anymore. My thought is that it kind of defeats the point of being able to save and resume a session if sometimes when you save, you can't resume. |
I see. Yes, with a growing prompt, the caller (of |
The nice thing about this project is while it lets you control a lot of low level options, you don't have to understand anything about Transformers, context, or even inference to use it. A session could do that as well. I originally thought that your motive was to bring a session function into main. I think perhaps it was just the original name of the option that led me to think this. I apologize for the confusion earlier. I actually feel a bit guilty being that your original As a side note, I don't know if the project is necessarily looking to retire the interactivity of main. A lot of people like it for the simplicity even if it is through a terminal. Your end goal of retiring all interactivity from main is actually much more in line with the idea of just having an so/dll made from main. But these two approaches don't have to be at odds. I think main can offer full prompt caching as you desire for your uses. Edit: Somehow I hit send while editing my response. I've fixed it up. |
I hit send early on accident somehow. I've cleaned up the comment on Github. Just noting here in case you're replying via email instead of on Github. |
Thanks, yeah, it occurred to me that the nomenclature was part of the problem. I think initially I envisioned just a straight-up restoration of state but over the past few days I've shifted to the "full prompt cache" behavior as being more valuable. I'll make those renames. As for main, my vision is more that |
@DannyDaemonic updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks fine, the only real issue I see with this PR is you only write the session file on one exit path. Nothing else noted is really a problem.
There are two points where the program exits. At the end of main
, and in sigint_handler
. If you're in interactive mode, the only way to exit is with Ctrl-C. So as is, for most people, they will end up quitting without their cache being written.
Now that I think about it though, it's probably not safe to call llama_save_session_file
in sigint_handler
because you could be in the middle of an evaluation.
That makes it tricky to just throw in there. Off the top of my head, we'd have to add another variable like is_interacting
but named is_evaluating
(which should both be volatile sig_atomic_t
btw) that we set to true and false around our llama evals and and if someone tries to Ctrl-C
while is_evaluating
it would warn them that it may take a second to save or that they can hit Ctrl-C
again to exit without saving and we'd set params.interactive
to false, n_remain
to 0, and params.n_predict
to 0, which would trigger this exit path.
Edit: You'll still get stuck on input.
If you don't have experience with interrupts, threading, or race conditions, I'd skip that approach.
The next best solution would be write the session out during interactive mode every time before we prompt the user for input. Around line 530, which unfortunately, could be quite often. Optionally, to make this less painful, we could watch the token count (n_past
) and only save when interactive mode is about to block (530) at n_last_save + 128
(or some number that makes sense).
Once again you're stung by the whole interactivity part of main
. I guess you could just state that it doesn't work in interactive or instructional mode in gpt_print_usage
and in README.md
, but that makes it harder to accept this pull request as is.
To me it seems this functionality is mostly needed for non-interactive mode anyway, so if it is difficult to come up with a proper solution for interactive mode, then merge the PR as it is and we can figure it out later. |
@DannyDaemonic addressed comments. I agree with punting on interactive mode for now. I was indeed reluctant to grapple with the complexities of signal handling on that. Added usage text and an error to that effect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just need to move the check for incompatible parameters to gpt_params_parse
and it's ready to merge.
@DannyDaemonic updated |
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
Thanks @DannyDaemonic ! |
@ejones, Is there any possibility this could be used in server.exe as well? It could be very useful to start a server session by loading from a file, then allow saving the session back to the prompt-cache file at any given point in time via an endpoint. This would give the user the flexibility to start a session, chat, reset, chat more, save and pick up later. I have started to take a look at this but am still getting familiar with the code differences between main and server. Also, any advice with the following concerning prompt caching: (TL;DR: Caching the static LLM instructions so they don't have to be computed every time a new session starts.) I feel like I am using the prompt-cache a little different in that I prefer to use it as a way to cache the initial instructions for the LLM before sending it a prompt. I start with a large prompt with instructions (Assistant characteristics, desired output formats, etc.). Because these are static between sessions, I don't want to load this every time. So I start with instructions and Is there a better way to do this? |
EDITED after updates
This is a much scaled-back change in place of #1310. Renames
--session
to--prompt-cache
and adds a new option,--prompt-cache-all
, that causes user input and generations to be saved to the session/cache as well. This new option allows for fast continuation of generations (with additional input).Testing
--prompt-cache
just saves the initial prompt:--prompt-cache-all
saves prompt + generations, allowing ~constant generation time for continuing generation on successive calls:chat-13B.sh
with prompt cache