-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window #7480
SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window #7480
Conversation
Consolidate global control variables into gMe, to allow developers to modify some of the behaviour at runtime by changing gMe members in browser's devel-tools console. Add support for a optional (disabled by default) sliding window wrt what amount of the chat history is sent to the server/ai-model when chating. Add configuration fields sent to /chat/completions or /completions end points, as members of gMe (chatRequestOptions), Initial skeleton, for future, for trying to fetch json data as it is getting generated, rather than in one shot at the end, if servers support the same (need to look in to web service api in more detail later). Update the usage messages shown wrt new/fresh chat. Also show the current status of dev console controllable configurations. To keep things simple, avoid mixing chat and completion messages, once at the time of switch, when switching from chat to completion mode. |
Just have a alert msg wrt needing javascript enabled in html. And have usage message from js file. Update the usage message a bit. So also enable switch session wrt setup_ui call. Add a possible system prompt as a placeholder for the system-input.
In completion mode * avoid inserting Role: prefix before each role's message * avoid inserting newline at the begin and end of the prompt message. However if there are multiple role messages, then insert newline when going from one role's message to the next role's message.
Readme update wrt completion mode behavior. Usage help updated wrt completion mode behavior. When changing from input to textarea elment wrt user input, the last newline at the end of the user input wrt textarea, was forgotten to be filtered, this is fixed now. However if user wants to have a explicit newline they can using shift+enter to insert a newline, that wont be removed. The extra newline removal logic uses substring and keyup to keep things simple and avoid some previously noted bugs wrt other events in the key path as well as IME composition etal.
previous logic would have cleared/reset the xchat, without doing the same wrt iLastSys, thus leading to it pointing to a now non existent role-content entry. So if a user set a system prompt and used completion mode, it would have done the half stupid clear, after the model response was got. Inturn when user tries to send a new completion query, it would inturn lead to handle_user_submit trying to add/update system prompt if any, which will fail, bcas iLastSys will be still pointing to a non existant entry. This is fixed now, by having a proper clear helper wrt SC class.
Previously any chat history including model response to a completion query would have got cleared, after showing the same to the user, at the end of handle_user_submit, rather than at the begining. This gave the flexibility that user could switch from chat mode to completion mode and have the chat history till then sent to the ai model, as part of the completion query. However this flow also had the issue that, if user switches between different chat sessions, after getting a completion response, they can no longer see the completion query and its response that they had just got. The new flow changes the clearing of chat history wrt completion mode to the begining of handle_user_submit, so that user doesnt lose the last completion mode query and response, till a new completion mode query is sent to the model, even if they were to switch between the chat sessions. At the same time the loss of flexibility wrt converting previous chat history into being part of the completion query implicitly doesnt matter, because now the end user can enter multiline queries.
For later the server flow doesnt seem to be sending back data early, atleast for the request (inc options) that is currently sent. if able to read json data early on in future, as and when ai model is generating data, then this helper needs to indirectly update the chat div with the recieved data, without waiting for the overall data to be available.
Keep the title simple so that print file name doesnt have chars that need to be removed. Update readme wrt some of the new helpers and options. Change Usage list to a list of lists, add few items and style it to reduce the margin wrt lists.
As some times based on the query from the user, the ai model may get into a run away kind of generation with repeatations etal, so adding max_tokens to try and limit this run away behaviour, if possible.
This allows the end user to see the settings used by the logic, as well as allows users to change/update the settings if they want to by using devel-tools/console
This is disabled by default. However if enabled, then in addition to latest system message, only the last N user messages, after the latest system message and its reponses from the ai model will be sent to the ai-model, when querying for a new response. This specified N also includes the latest user query.
45eb798
to
11d2d31
Compare
Hi @mofosyne, This set of updates/commits is ready for merging with main/master. [Look at next comment] |
Hi @ggerganov @mofosyne @ngxson @teleprint-me / all Based on some testing across 7b and below models (Llama3, Mistral v02, Phi3, Phi2) on a 8GB laptop, what I notice is that
So do you want the default context window sent from the client side to server to be limited to system+last-query-response+cur-query or do you want the client to always blindly send the full chat history wrt a given chat-session. For a developer keeping client side context window limiting logic to unlimited, may be useful to test server/ai-model implications. While for a simple/normal end-user usage keeping client side context window limited to system + last-qeury-response + cur-query may make some sense. Users also have the option of using NewCHAT to start a new chat session. NOTE: The implemented client side context window limiting logic is flexible and one can control the amount of chat history sent interms of latest system prompt update in a given chat session and how much of latest user query + assistant responses after that should be included or should full history be included. But the two possibilities I mentioned above indicates the issue and possible solution in a simple and straight way. NOTE: @ngxson the server currently doesnt seem to honor max_tokens wrt /completions endpoint, while /chat/completions endpoint does seem to honor it. This angle could also help fine tune the CI test failure which was seen wrt server testing, based on a quick glance at what I saw when it was noticed on the original simplechat pr. |
In the The plain |
Regarding this point, maybe you can enable However, this approach only works when you add more messages to the history. This won't work with sliding window, because we remove earlier messages in the history. Please have a look at #5793 for more details. |
Will checkout behaviour with cache_prompt (with and without max_tokens), at same time I am not fully sure this will solve the core issue, which is to an extent caused by model's context window. #5793, I assume you are mentioning wrt future, as it seems to be paused for now? Also from a casual look at things mentioned in it, the current caching (as well as the suggestion discussed in it) is mainly useful for the common case of having the core user entered context in system prompt and or system-prompt + 1st few fixed handshake (which can be indirectly taken as a larger system prompt) inturn followed by a changing tail in the query/chat sent. Which either way inturn also seem to favor a large-system-prompt along with a client-side sliding window mechanism.
|
Reduce chat history context sent to the server/ai-model to be just the system-prompt, prev-user-request-and-ai-response and cur-user-request, instead of the previous full chat history. This way if there is any response with garbage/repeatation, it doesnt mess with things beyond the next question, in some ways. Increase max_tokens to 1024, so that a relatively large previous reponse doesnt eat up the space available wrt next query-response. However dont forget that the server when started should also be started with a model context size of 1k or more, to be on safe side. Add frequency and presence penalty fields set to 1.2 to the set of fields sent to server along with the user query. So that the model is partly set to try avoid repeating text in its response.
The /completions endpoint of examples/server doesnt take max_tokens, instead it takes the internal n_predict, for now add the same on the client side, maybe later add max_tokens to /completions endpoint handling.
Did test with cache_params, didnt change things much. Also specifically didnt check wrt speed impact if any, rather the garbage phase will eat up time. As with before max_tokens ensures that garbage text repeatation doesnt runaway. Additionally presence/frequence penalty, also didnt help wrt avoiding garbage text repeatation wrt the queries which generated the same. Tested the above two aspects for now with Llama3 8b. As a sum total of all, from a overall perspective, for now have setup the defaults in SimpleChat ie client side such that
Developer/end-user is shown this info when ever a new-chat is started. And they can modify these if required, by changing gMe from browser's devel-tools/console, for now. |
@hanishkvc The context can be dynamically managed. Another possibility is to make a dynamic upper limit defined by the user. Artificially limiting the context seems undesirable. |
Hi @teleprint-me @ggerganov @ngxson Currently llama.cpp/server implicitly limits the model context size. What I have seen is that by default it is 512 tokens, ie if one runs server without -c CONTEXT_SIZE argument. And either way independent of llama.cpp, most of the current models either way have implicit relatively small context size limits. Things are moving towards larger implicit ctxt size or sliding ctxt window on the model side etal, but not fully evolved nor wide spread yet. Inturn what I have seen is that when the [system-prompt + chat-history + current-query] and inturn the response generated for the current-query together exceed this context size, most of the times it normally leads to repeating garbage text, which seems to implicitly stop once everything including garbage text reaches around 8K tokens (some limit enforced by llama.cpp or so potentially I havent looked into), which inturn will take a sufficiently lot of time on laptops without GPU etal. Using max_tokens hint from client side, forces the server to stop once that many tokens have been generated by the ai-model, irrespective of whether it indicates end of generation or not. Given that client doesnt have control over the model's context size, as well as no control over the question/query that will be asked by the end user, which inturn may lead to a long answer which can inturn lead to context size exceeding, so that is the reason why I am forcing that max_tokens limit. Its also the reason why I implemented the client side sliding window concept to try and see that the context loaded by the system+history+query doesnt occupy too much of the context size on average. Currently I havent added a UI specifically for setting it and other config parameters. However I expose most of the logic / behaviour controlling parameters over a global variable gMe, so that one can modify them from browser's devel-tool/console. Maybe I will add a UI to control these settings. I originally started on SimpleChat, more to test some logic wrt llama.cpp and inturn server. So idea has been to keep the code basic and minimal, so that anyone wanting to test llama.cpp and server dont have to dig through code too much to change things or so. Also I havent looked into llama.cpp and server code flow in detail currently, only specific parts when I faced some issue. So not sure of the llama and server flow fully. At same time I assume currently to allow for accomodating fixed context and sliding context related models etal, in a straight and simple way maybe llama.cpp and server inturn dont implicitly stop on context size being reached, instead depend on max_tokens/n_predict wrt when to force a stop.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a quick code review. Looks alright, but will wait for one other reviewer for a bit before merging.
Regarding targeted audience. I would think ordinary people would prefer the standard interface. As you said, this is more targeted towards developers so feature set should be lean and mean and focus on explainability of the code itself. If anything, you may want to emphasise that point in code, so developers don't try to feature creep it.
The idea is to be easy enough to use for basic purposes, while also being simple and easily discernable by developers who may not be from web frontend background (so inturn may not be familiar with template/end-use-specific-language-extensions driven flows) so that they can use it to explore/experiment things. Because there is some amount of experimentation currently going on with the default interface exposed by server, so chances are for now this may be more approachable for ordinary users also. And given that the idea is also to help explore/experiment for developers, some flexibility is provided to change behaviour easily using the devel-tools/console for now. And skeletal logic has been implemented to explore some of the end points and ideas/implications around them. |
Sounds reasonable enough. Let's see if we get any positive assessment or objection, otherwise I'll merge it in a few days or so. |
Hey @mofosyne, Please do go ahead and merge this in, this is a logical bunch of straight forward cleanup and improvement to the initial SimpleChat. If possible I want to start a new PR on a optional simple histogram based auto garbage identification and trimming logic, and inturn dial down frequence/presence penalty, bcas those penalities just makes the garbage text repeatition, more difficult to identify, because in some of the models like llama3 it just seems to over-intelligently repeat the text, but with small variations between the repeatations, making it difficult to auto identify the garbage. And also maybe a simple setup ui. But I want to branch it from the merged master here, rather than this PR branch. And finish it tomorrow itself or so, so that I can get back to other things. |
@hanishkvc okay, had a second lookover. Considering the isolated changes to an isolated subsystem, happy to merge in so you can proceed with your next idea. But just gonna double check, do you really want to add these extra feature or is this feature creeping? What is your justification for adding these in this 'simple' ui, rather than placing it in the main tool for instance. You need to balance between the code objective to be read vs the code acting as a developer tool. If needed and you can justify it, then should we make a 'dev tool' ui folder and keep this as a stripped down demo? |
@mofosyne I agree in principle, My idea is to keep the settings related code confined to the Me class (which maintains the global configurable parameters and shows it) or a equivalent new class/file, and the dumb garbage trimming into its own helper class/file. With minimal change to existing code flow and bit more of cleanup/structuring wrt moving some existing helpers into their own file like a ui file or so. So the end goal is to still keep it simple and potentially even more structured, while also hopefully making it easier to use some of its flexibility. |
PR #7548 extends SimpleChat with a simple minded garbage trimming logic and a minimal settings ui. |
@hanishkvc i suspect your new UI may be running into the same caching issues that affect other third party clients, as described in: #7185 |
Hi @khimaros I had looked at prompt_cache sometime back. However because I use relatively fixed system-prompt + a sliding window into user chat flow, so the current prompt_cache caching which expects a contiguous history (for its own valid reasons), wont help much from re-use of cached data perspective. But at the same time, technically there shouldnt be any harm in keeping the prompt_cache enabled, so that for now the system-prompt part (and in future maybe the sliding user-chat part also can get reused in appropriate way, when caching logic acquires that capability). I will send a patch for simplechat keeping it enabled, as well as adding a delete button if there is previous chat history available from a previous session, that one may want to remove, tomorrow. |
In Completion mode
now by default, the logic doesnt insert any role related "THE_ROLE: " prefix to the message, when generating the prompt json entry to the server for /completions endpoint. This gives full control to the end user to decide if they want to insert any prefix or not, based on the model they are communicating with in the completion mode. Given that /completion endpoint doesnt add any additional in-between special-tokens/tags/chat-templating of its own, using the multiline user input support, a end user could generate any combinatiion of messages with special tokens etal, as they see fit to check how the model will respond. Or use it just for a normal completion related query.
avoid the implicit textarea enter-key/newline char wrt the user-input, however if the end user explicitly wants newline in the sent message, they can always use shift+enter key press to insert the same.
fix a corner case with normally unexpected system prompt with completion mode, messing with the logic. Now its up to the end user if they want to put a system prompt (normally not expected nor needed in completion mode) or not, the logic wont raise an exception.
In general
update the readme as well as the usage message to capture the flow better.
add a placeholder hint wrt the system prompt
@mofosyne, do wait for half a day (12 hours), I will cross check things once later today, before you merge this. At sametime if anyone is trying to use SimpleChat in the mean time, the commits in this PR could be useful.