
handle context size overflow in AssistantAgent #9

Closed
1 task
Tracked by #1032
sonichi opened this issue Aug 1, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@sonichi (Contributor) commented Aug 1, 2023

microsoft/FLAML#1098, microsoft/FLAML#1153, and microsoft/FLAML#1158 each address this in a specialized way. Can we integrate these ideas into a generic solution and make AssistantAgent able to overcome this limitation out of the box?


@sonichi added the enhancement label on Aug 1, 2023
@yiranwu0 (Collaborator) commented Aug 2, 2023

I will handle this problem in microsoft/FLAML#1153. The problem should be in generate_reply, when it returns extra-long messages. My current plan includes the following functionality:

  1. Use tiktoken for a more accurate token count, and add a static function that checks the tokens left given the model and the previous messages.
  2. Allow the user to pass in a predefined output limit.
  3. When the generated output (for example, from code execution) exceeds the maximum tokens allowed or the user-predefined limit, return a "result too long" error.

@thinkall has implemented the tiktoken count in microsoft/FLAML#1158. Should I try to fix this concurrently?
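For illustration, here is a minimal sketch of the kind of token-budget helper described in item 1. It assumes the tiktoken package; the context-window sizes and the per-message overhead are assumptions, not authoritative values.

```python
# Sketch of a token-budget helper (item 1 above).
# The context-window table and the +4 per-message overhead are assumptions.
import tiktoken

ASSUMED_CONTEXT_WINDOW = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
}

def count_tokens(messages, model="gpt-3.5-turbo"):
    """Roughly count the tokens in a list of chat messages."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    # The exact per-message overhead is model-specific; 4 is an approximation.
    return sum(len(enc.encode(m.get("content") or "")) + 4 for m in messages)

def tokens_left(messages, model="gpt-3.5-turbo"):
    """Static-style check of how many tokens remain for the next reply."""
    limit = ASSUMED_CONTEXT_WINDOW.get(model, 4096)
    return limit - count_tokens(messages, model)
```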

@sonichi (Contributor, Author) commented Aug 2, 2023

> I will handle this problem in microsoft/FLAML#1153. The problem should be in generate_reply, when it returns extra-long messages. My current plan includes the following functionality:
>
>   1. Use tiktoken for a more accurate token count, and add a static function that checks the tokens left given the model and the previous messages.
>   2. Allow the user to pass in a predefined output limit.
>   3. When the generated output (for example, from code execution) exceeds the maximum tokens allowed or the user-predefined limit, return a "result too long" error.
>
> @thinkall has implemented the tiktoken count in microsoft/FLAML#1158. Should I try to fix this concurrently?

Your proposal can solve part of the problem. It does the check on the sender's side in case the receiver requests a length limit.
There can be other alternatives:

  1. The receiver requests that when the message is longer than the threshold, the sender sends only part of the message, and the two have a protocol to deal with the remaining part. "Continual learning via LearningAgent and TeachingAgent" (FLAML#1098) and "Add RetrieveChat" (FLAML#1158) are examples of this.
  2. The receiver doesn't request a check on the sender's side and instead performs compression on its own side. For example, it can employ the agents in FLAML#1098 to do so. Even when a check on the sender is requested, some compression can still be done on the receiver's side to make room for future messages.

It'll be good to figure out what we want to support and have a comprehensive design.
Could you discuss with @thinkall and @LeoLjl? You are in the same time zone. Once you have a proposal, @qingyun-wu and I can go over it.
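For illustration, a rough sketch of alternative 1 above: the sender splits an over-long message and follows a simple "part i of n" protocol for the remainder. The names split_message, send_in_parts, and the MAX_REPLY_TOKENS threshold are hypothetical, not part of any existing API.

```python
# Hypothetical sketch of alternative 1: the sender chunks an over-long
# message; the receiver can ask for the next part under an agreed protocol.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_REPLY_TOKENS = 1000  # assumed threshold requested by the receiver

def split_message(text: str, limit: int = MAX_REPLY_TOKENS):
    """Split text into chunks of at most `limit` tokens."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]

def send_in_parts(text: str):
    """Yield messages labeled 'part i/n'; the receiver replies with
    something like 'CONTINUE' to request the next part."""
    parts = split_message(text)
    for i, part in enumerate(parts, start=1):
        yield {"role": "user", "content": f"[part {i}/{len(parts)}]\n{part}"}
```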

@yiranwu0 (Collaborator) commented Aug 2, 2023

Sure, I will discuss it with @thinkall and @LeoLjl.

I just updated microsoft/FLAML#1153 to allow the user to set a predefined token limit for outputs from code or function calls; I think this is a different task from handling token_limit in oai_reply.

@yiranwu0 (Collaborator) commented Aug 5, 2023

@sonichi @qingyun-wu Here is my proposed plan:

On AssistantAgent:
Add a parameter on_token_limit taking one of ["Terminate", "Compress"]. We would check whether the token limit is reached before oai.create is called: if set to "Terminate", we would terminate the conversation; if set to "Compress", we would use a compression agent to compress previous messages and prepare for future conversations (we could also set a threshold, such as 80% of the max tokens, to start an async agent). I read that OpenAI summarizes previous messages when the conversation gets too long.
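A hedged sketch of how the on_token_limit check before oai.create could branch. The helpers below (rough_token_count, naive_compress) are naive stand-ins for illustration only, not existing AutoGen or OpenAI APIs, and the 4096-token window is an assumption.

```python
# Illustrative sketch only: branching on an on_token_limit option
# before calling the LLM.
MAX_TOKENS = 4096            # assumed context window
COMPRESS_THRESHOLD = 0.8     # e.g. start compressing at 80% of the window

def rough_token_count(messages):
    # Crude approximation: ~4 characters per token.
    return sum(len(m.get("content") or "") // 4 for m in messages)

def naive_compress(messages):
    # Stand-in for a compression agent: keep the first 200 characters
    # of each older message as a fake "summary".
    return " ".join((m.get("content") or "")[:200] for m in messages)

def prepare_messages(messages, on_token_limit="Compress"):
    if rough_token_count(messages) < COMPRESS_THRESHOLD * MAX_TOKENS:
        return messages                       # enough room, no action needed
    if on_token_limit == "Terminate":
        raise RuntimeError("Token limit reached; terminating the conversation.")
    # "Compress": summarize everything except the system message and the
    # two most recent turns, then rebuild the history around the summary.
    summary = naive_compress(messages[1:-2])
    return [messages[0],
            {"role": "system", "content": f"Summary of earlier turns: {summary}"}] + messages[-2:]
```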

On UserProxyAgent (I already added this in microsoft/FLAML#1153):
Allow the user to specify auto_reply_token_limit, defaulting to -1 (no limit). When auto_reply_token_limit > 0 and the token count from the auto reply (code execution or function call) exceeds the limit, the output will be replaced with an error message. This lets users prevent unexpected cases where the output from code execution or function calls overflows the context.
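A minimal sketch of the auto_reply_token_limit behavior described above; the error-message text and the character-based token approximation are assumptions.

```python
# Sketch of the auto_reply_token_limit idea: if the output of code
# execution or a function call exceeds the limit, replace it with an
# error message instead of sending the over-long text back.
auto_reply_token_limit = -1  # default: no limit

def guard_auto_reply(output: str, token_limit: int = auto_reply_token_limit) -> str:
    if token_limit <= 0:
        return output
    approx_tokens = len(output) // 4  # crude approximation, ~4 chars/token
    if approx_tokens > token_limit:
        return (f"Error: the output (~{approx_tokens} tokens) exceeds the "
                f"auto_reply_token_limit of {token_limit}.")
    return output
```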

From the two changes above, all three generate_reply cases are addressed: oai_reply, code execution, and function calls.
I am thinking of general tasks like problem solving. @BeibinLi likes the "compression" and "terminate" approaches.

For tasks that involve databases and consume a large number of tokens, such as answering questions about a long text or searching for data in a database, I think we need a special design targeting those applications.

@sonichi (Contributor, Author) commented Aug 5, 2023

The proposal is a good start. I like that the design covers two options: dealing with the token limit after/before a reply is made.
I think we can generalize this design:

  1. For each auto reply method, we add an optional argument token_limit to let the method know the token limit for each reply. Allow it to be either a user-specified constant or an auto-decided number. The method is responsible for handling that constraint. This includes the retrieval-based auto reply, such as the one in RetrieveChat.
  2. For oai_reply, we catch the token-limit error and return (False, None) when the error happens. That passes up the chance to finalize the reply and lets the next registered method decide the reply. Then we can register the compressor method to be processed after oai_reply yields.
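A rough sketch of the fallthrough behavior described in item 2: each registered reply method returns (final, reply), and (False, None) passes control to the next one. The names below (call_llm, ContextLengthError, compress_messages, REPLY_METHODS) are hypothetical stand-ins, not the AutoGen API.

```python
# Hypothetical illustration of item 2: registered reply methods are tried
# in order; returning (False, None) lets the next method decide the reply.
class ContextLengthError(Exception):
    """Stands in for the provider's 'context length exceeded' error."""

def call_llm(messages):
    # Stand-in for oai.create: pretend the call fails once the history
    # grows past an arbitrary size.
    if sum(len(m["content"]) for m in messages) > 16000:
        raise ContextLengthError()
    return "ok"

def compress_messages(messages):
    # Naive stand-in for a compression agent: truncate each message.
    return [{"role": m["role"], "content": m["content"][:500]} for m in messages]

def oai_reply(messages):
    try:
        return True, call_llm(messages)
    except ContextLengthError:
        return False, None        # fall through to the next registered method

def compress_then_reply(messages):
    return True, call_llm(compress_messages(messages))

REPLY_METHODS = [oai_reply, compress_then_reply]   # registration order matters

def generate_reply(messages):
    for method in REPLY_METHODS:
        final, reply = method(messages)
        if final:
            return reply
    return None
```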

@yiranwu0 (Collaborator) commented Aug 8, 2023

On second thought, I don’t think we need to pass a token_limit argument. Currently, for function and code execution, I use a class variable, auto_reply_token_limit, to customize the behavior when the limit is reached. When a new agent class overrides a reply method, it can employ this variable or just create a new class variable.
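For example, a subclass could simply override the class variable; the classes here are toy illustrations, not the actual agent classes.

```python
# Toy sketch: customizing the limit via a class variable rather than a
# per-call token_limit argument.
class UserProxyAgent:
    auto_reply_token_limit = -1  # no limit by default

class MyDatabaseAgent(UserProxyAgent):
    auto_reply_token_limit = 500  # this agent caps its auto-reply outputs
```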

@sonichi (Contributor, Author) commented Aug 8, 2023

Should the sender tell the receiver the token limit? "token_limit" and ways to handle token_limit should be separated. "token_limit" is a number that should be sent by the sender. Maybe we can make that a field in the message. The way to handle token_limit is decided in the auto reply method.
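For example, the sender could attach the number as a field of the message itself, and the receiver's auto-reply method would decide how to honor it; the field layout and the honor_token_limit helper below are only a suggestion.

```python
# Illustrative only: the sender advertises its token limit in the message.
message = {
    "role": "user",
    "content": "Please run the script and report the results.",
    "token_limit": 1000,  # the sender asks for replies of at most ~1000 tokens
}

def honor_token_limit(msg, reply: str) -> str:
    limit = msg.get("token_limit")  # None means the sender imposed no limit
    if limit is not None and len(reply) // 4 > limit:  # ~4 chars per token
        return reply[: limit * 4] + "\n[truncated to respect token_limit]"
    return reply
```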

@yiranwu0 (Collaborator) commented Aug 10, 2023

I have a few questions when looking at the code:

  1. In the receive function, generate_reply is called without passing in messages: self.generate_reply(sender=sender), so messages will be None. When a registered method such as generate_oai_reply is called, messages will be None and it takes the pre-stored messages:
        if messages is None:
            messages = self._oai_messages[sender]

It seems that this messages argument is not used. When would it be used?
One possible usage: when generate_reply is called individually.

  2. Would the context argument passed to register_auto_reply be more appropriately renamed to reply_config?
    In oai_reply it is converted to llm_config, and in code execution it is converted to code_execution_config; in other reply methods it is not used. It also seems that "context" can be a field in a message from oai, and "content" is a field in a message.

@sonichi (Contributor, Author) commented Aug 10, 2023

> I have a few questions when looking at the code:
>
>   1. In the receive function, generate_reply is called without passing in messages: self.generate_reply(sender=sender), so messages will be None. When a registered method such as generate_oai_reply is called, messages will be None and it takes the pre-stored messages:
>
>         if messages is None:
>             messages = self._oai_messages[sender]
>
> It seems that this messages argument is not used. When would it be used? One possible usage: when generate_reply is called individually.
>
>   2. Would the context argument passed to register_auto_reply be more appropriately renamed to reply_config?
>     In oai_reply it is converted to llm_config, and in code execution it is converted to code_execution_config; in other reply methods it is not used. It also seems that "context" can be a field in a message from oai, and "content" is a field in a message.

Good questions. Regarding 1, yes, messages will be used when generate_reply is called individually. We can revise the call in the receive function to make it pass messages, to avoid this confusion.
Regarding 2, we can rename it to config if we want to avoid the confusion. One thing to note is that this variable could be updated in the reply function to maintain some state. I wanted to use it in other methods too but haven't done the refactoring. @ekzhu is it OK to rename context to config in generate_reply()?

@sonichi transferred this issue from microsoft/FLAML on Sep 23, 2023