Improve performance of LLM summarizing condenser #6597
Conversation
# Create a new summary event with the condensed content
summary_event = AgentCondensationObservation(summary_response)
for forgotten_event in forgotten_events:
    prompt += str(forgotten_event) + '\n\n'
This might be an argument for standardizing the `Event` -> `Message` conversion and breaking it out of the agent control-flow.
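For illustration, a standalone conversion along those lines might look something like this; the `Message` shape and the `event_to_message` helper are hypothetical, not the project's actual API:

```python
# Hypothetical sketch of a standalone Event -> Message conversion helper,
# decoupled from the agent control-flow. Names and fields are illustrative.
from dataclasses import dataclass
from typing import Any


@dataclass
class Message:
    role: str
    content: str


def event_to_message(event: Any) -> Message:
    """Convert an event-like object into a chat message.

    A real implementation would dispatch on the concrete event type rather
    than relying on str(); this mirrors the prompt-building loop above.
    """
    role = 'assistant' if getattr(event, 'source', None) == 'agent' else 'user'
    return Message(role=role, content=str(event))
```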
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
TESTS: test_format() passed
CHANGES: str(val) replaces f"{val:.16G}"
DEPS: None modified
INTENT: Fix float precision overflow"""
Interesting that Sonnet is doing reasonably well with this prompt. I'm going to test it on Deepseek 😅
(not necessary for this PR, of course, just a thought)
I'm surprised this is what it ended up being, it's definitely not how I would default to prompting!
I started off with a very simple prompt, then passed it and a trajectory or two to Claude and asked what it would change. It took two or three iterations before it met the baseline, but it got me from 0% to 40%. I have to imagine you'd arrive at a very different result if you repeated the process with DeepSeek.
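For reference, a condensation prompt built around the structured fields quoted above (TESTS / CHANGES / DEPS / INTENT) might be sketched like this; the actual prompt in the PR is longer and was iterated on with Claude as described:

```python
# Illustrative sketch only: a summarization prompt template using the
# structured summary fields shown above. The real prompt text differs.
SUMMARY_PROMPT_TEMPLATE = """You are condensing an agent's event history.
Summarize the events below using the following structure:

TESTS: <tests that were run and their results>
CHANGES: <code changes that were made>
DEPS: <dependencies added, removed, or modified>
INTENT: <what the agent is trying to accomplish>

Events:
{events}
"""


def build_summary_prompt(forgotten_events) -> str:
    # Mirror the prompt-building loop from the diff: one block per event.
    rendered = '\n\n'.join(str(event) for event in forgotten_events)
    return SUMMARY_PROMPT_TEMPLATE.format(events=rendered)
```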
mock_llm.completion.assert_called_once()
call_args = mock_llm.completion.call_args[1]
assert 'messages' in call_args
assert len(call_args['messages']) == 1
assert 'Event 1' in call_args['messages'][0]['content']
assert 'Event 2' in call_args['messages'][0]['content']
Maybe we can still test for some message inclusion? Not a big deal, just thinking that tests with condensation are not easy, so maybe we should include whatever we can in the unit tests.
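For illustration, one more assertion in the spirit of the test above could check that every forgotten event's content made it into the prompt; `forgotten_events` here is an assumed fixture, not one from the existing test suite:

```python
# Hypothetical extension of the mock-based test shown above: every forgotten
# event's string form should appear in the single prompt message sent to the
# LLM, matching the prompt-building loop in the diff.
prompt_content = mock_llm.completion.call_args[1]['messages'][0]['content']
for event in forgotten_events:
    assert str(event) in prompt_content
```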
Thanks for this, I absolutely love that we're doing this incrementally, and anyway we can play with that prompt later too.
Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
Updates the LLM-based summarizing condenser to improve performance. `LLMSummarizingCondenser` now relies on the `RollingCondenser` to control when summarization happens. `LLMSummarizingCondenser` is now the default condenser (still off by default!), compared to the more aggressive `AmortizedForgettingCondenser`, which has the same forgetting behavior as the `LLMSummarizingCondenser` but without the summarization.

To test performance, I ran this condenser against the no-condensation baseline (both on OH v0.21.0) on the whole of SWE-bench Verified with a maximum of 100 iterations. The condenser resolved 200 instances, compared to the baseline's 203, and cost $40 more to run due to lower prompt cache utilization. However, the average response latency is a consistent 8 seconds, compared to the baseline's 12 seconds (at iteration 30) and 16 seconds (at iteration 100).
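A rough sketch of the control-flow split described above, for readers unfamiliar with the condenser hierarchy; the class names come from this PR, but the method names, signatures, and halving policy are illustrative assumptions, not the actual OpenHands code:

```python
# Illustrative only: the base class decides *when* to condense, subclasses
# decide *how*. The real implementation differs in details.
from abc import ABC, abstractmethod


class RollingCondenser(ABC):
    """Triggers condensation once the history exceeds a size threshold."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size

    def condensed_history(self, events: list) -> list:
        if len(events) <= self.max_size:
            return events
        return self.condense(events)

    @abstractmethod
    def condense(self, events: list) -> list:
        """Subclasses decide how the history is condensed."""


class AmortizedForgettingCondenser(RollingCondenser):
    """Drops the oldest events outright once the history grows too large."""

    def condense(self, events: list) -> list:
        keep_count = self.max_size // 2
        return events[-keep_count:]


class LLMSummarizingCondenser(RollingCondenser):
    """Same forgetting behavior, but the forgotten prefix is replaced by an
    LLM-generated summary instead of being discarded."""

    def __init__(self, summarize, max_size: int = 100):
        # `summarize` stands in for the actual LLM completion call.
        super().__init__(max_size)
        self.summarize = summarize

    def condense(self, events: list) -> list:
        keep_count = self.max_size // 2
        forgotten, kept = events[:-keep_count], events[-keep_count:]
        summary = self.summarize('\n\n'.join(str(e) for e in forgotten))
        return [summary] + kept
```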
Link of any specific issues this addresses