Chat session state management #560

eublefar · 2024-03-02T16:39:59Z

As discussed here #559

I've added few things:

StatefulExecutorBase.AddPromptAsync - basically runs the text through decode without sampling new tokens.
I am basically running InferInternal twice with WaitForInput = true, but it's very reliant on specific implementation of child classes and not the abstraction itself, so maybe you'll have some suggestions on how to improve it.
SessionState record class reperesenting ChatSession state in memory.
void LoadSession(SessionState state) and SessionState GetSessionState() for ChatSession class to save session state at arbitrary point.
static Task<ChatSession> ChatSession.InitializeSessionFromHistoryAsync(ILLamaExecutor executor, ChatHistory history, CancellationToken cancellationToken = default) and Task<ChatSession> ChatSession.ProcessSystemMessage(string content) to pre-process KV cache.
Functions to reset states of StatefulExecutors and LlamaContext.
Example ChatSessionWithRestart to show off how it works

I didn't find any ChatSession unit tests so I did not write any.

BR,
Mykyta

… stateful executors and context

… from history and process system message methods for pre-processing prompts. Serializing executor state to JSON, to avoid saved states from being updated by reference.

…ration that resets chat to original point of history without extra processing

LLama/LLamaContext.cs

LLama/LLamaExecutorBase.cs

LLama/LLamaInteractExecutor.cs

LLama/LLamaInstructExecutor.cs

martindevans · 2024-03-02T17:08:22Z

Thanks for this PR! I'm not very familiar with the higher level executors, but I've just left a few comments on things I spotted :)

As a general note. I'm personally working on bringing a brand new executor to LLamaSharp based on the BatchedExecutor which I introduced in the least release. It's currently very "low level" compared to these executors, but you may be interested in taking a look at it since it can do a lot of the things you want to do here. I'd be interested in any feedback you have about it.

…ssionState.ContextState no nullable

eublefar · 2024-03-02T19:12:11Z

Sorry, I've over-thought it a bit with nullable ExecutorBaseState and context State. I've just made those non-nullable and removed all the ResetState methods (as now there is no need for them, there will always be some state to load so no need to reset).

eublefar · 2024-03-02T19:48:54Z

@martindevans

As a general note. I'm personally working on bringing a brand new executor to LLamaSharp based on the BatchedExecutor which I introduced in the least release. It's currently very "low level" compared to these executors, but you may be interested in taking a look at it since it can do a lot of the things you want to do here. I'd be interested in any feedback you have about it.

Huh, it indeed does, I should've checked it out.
As for feedback, my use case is a bit weird for this project I guess, but I shall share it nontheless :)
Basically the use case for me is that I am making a game in Unity with a lot of prompts.
I started this project of mine some time ago and decided to write my own high-level interface based on NativeAPI/handles directly.

Some notes on what would make BatchExecutor viable for my use-case:

I found that it's impossible for me to keep everything in one KV cache on GPU because of the limited context size, so Save/Load state in-memory is very important. Any ways to spill the context into RAM in seamless and optimal way, while still being able to run inference in batch mode would be very appreciated.
On a side note: I found out that vanilla Executors actually do something like this while writing this PR (I've cleaned KV cache, but if I did not remove Executor state, then LLM actually continued conversation that should've been removed from the KV cache). Really curious if that can be applied to batch setting.
If I process batches too big - frames per second suffer, so I need to be able to batch requests implicitly (e.g. split single prompt/message into multiple runs). If I am not doing it manually, llama.cpp just crashes whole application with assert that batch size is too small, so C# side needs to be smart in this regard.

I'd be happy to help, if that direction would be useful for you.

martindevans

Thanks for fixing those previous comments, this looks good to me 👍

I won't merge it myself immediately, since I'm not very familiar with the high level chat/executor stuff. If no one else has any comments I'll merge it before the next release happens :)

AsakusaRinne

Thank you for the contribution! The part of pre-filling prompt looks good to me while I have some concerns about the the part of session state. I'm open for any discussion and please feel free to ping me if there's something blocking you. :)

LLama/ChatSession.cs

LLama/LLamaExecutorBase.cs

martindevans · 2024-03-03T17:31:58Z

@eublefar I opened a discussion thread here about the your BatchedExecutor feedback, to keep this PR on topic :)

Xsanf · 2024-03-12T10:19:32Z

@eublefar > If I process batches too big - frames per second suffer....

I'm not sure if this will help, but it helped me in a similar situation.

By default, we transfer control to await UniTask.NextFrame(); and we believe that this is sufficient from the point of view of control logic.

               await foreach (var text in executor.InferAsync(query, inferenceParams))
                {
                    output += text;
                    Output.text = output;
                    await UniTask.NextFrame();
                }

At a fixed frequency of 35 FPS

        Application.targetFrameRate = 35;
        QualitySettings.vSyncCount = 1;

During execution, FPS may drop to 7-6 frames for the entire time of inference.
There arises the difficult question of synchronizing two sequences with different periods, the Moire effect, at what exact moment in the frame the control transfer occurs.

Quite a crude technique, but you can simply add additional await UniTask.NextFrame();

                await foreach (var text in executor.InferAsync(query, inferenceParams))
                {
                    output += text;
                    Output.text = output;
                    await UniTask.NextFrame();
                    await UniTask.NextFrame();
                    await UniTask.NextFrame();
                }

When adding one additional UniTask.NextFrame(); FPS reaches up to 25-35.
When adding two UniTask.NextFrame(); FPS does not drop.

This has almost no effect on the speed of the withdrawal itself, and eliminates the Moire effect (frieze), since regardless of the generation length, only the generation time of one token matters.

Additional question. Why is the "seed" set in ModelParams. When it is more logical in inferenceParams, where it is needed.
This is not very convenient, because if you need to regenerate the request to obtain a new option, you cannot change the "seed" without additional manipulations.

martindevans · 2024-03-12T11:35:26Z

Why is the "seed" set in ModelParams.

LLamaSharp has IModelParams which is everything required to load a model into memory and IContextParams which is everything required to create an inference context. Seed is set on the context params. The ModelParams class is a convenience that implements both interfaces in one place, so you can set all config options in one go. But if you want more control you can implement those two params interfaces yourself.

(If you want to ask any followup questions please open an issue and ping me, to keep this PR on topic)

AsakusaRinne · 2024-03-17T10:15:22Z

@eublefar Hi, how's it going? Many thanks for your contribution. We'll be happy if you could complete this PR but it shouldn't be blamed if you don't have time to continue. Please let us know if you're not available in the future two weeks and I'll finish this work myself. :)

eublefar · 2024-03-17T10:19:17Z

@AsakusaRinne Hey, sorry, just got back from the vacation. I'll get on it today :)

…set operations

… the llama.cpp

eublefar · 2024-03-17T14:59:54Z

@AsakusaRinne I've suffered a bit with serializing/deserializing transforms, but It's working now I think.

eublefar · 2024-03-18T11:19:51Z

Tests pass on my machine and I can't figure out which one is crashing here, any suggestions?

martindevans · 2024-03-18T14:17:34Z

Tests are a little flakey at the moment (language models are huge, so I think we're just trying to do too much work in some tests for the github runners to handle). Since you passed on Linux, and this PR isn't really platform specific, you're probably ok. I've restarted the failed CI runs.

AsakusaRinne

Hi, sorry for the delay of the review. It's impressing and the overall looks good to me, with a few comments left. :)

LLama.Examples/Examples/ChatSessionWithHistory.cs

LLama.Examples/Examples/ChatSessionWithRestart.cs

LLama/ChatSession.cs

LLama/LLamaExecutorBase.cs

AsakusaRinne

LGTM, many thanks for this contribution! :)

eublefar added 3 commits March 2, 2024 14:51

Chat session Get/Load in-memory state operations, reset state ops for…

35153a7

… stateful executors and context

AddPromptAsync method for stateful executors, Chat session initialize…

b2f7dbb

… from history and process system message methods for pre-processing prompts. Serializing executor state to JSON, to avoid saved states from being updated by reference.

Example chat session with preprocessing of chat history and reset ope…

0763f30

…ration that resets chat to original point of history without extra processing