feat(condenser): Explicit view properties #2116
Conversation
…correct manipulation indices behavior
I'm on it! enyst can track my progress at all-hands.dev
Taste rating: 🔴 Needs improvement
This refactor is aiming at a real problem (the old …), but there are a couple of places where the data structure choice and behavioral guarantees don’t line up, and that’s where you get the kind of “works in tests, breaks in production” bugs.
[CRITICAL ISSUES]
1) …
Re-read / self-audit of my review
After re-reading my own comment + skimming the diff again, I think the ordering contract for … A couple of adjustments in severity/precision:
Net: the main point stands — don’t return an unordered container and then reason about “consecutive indices.” Everything else is secondary / depends on intended semantics.
I reviewed PR #2116 in a “/codereview-roasted (but not exaggerated)” style, focusing on fundamental design and correctness risks rather than style nits, and I posted that review as a markdown PR comment. Then I re-read my own review and re-checked the diff “with fresh eyes,” and adjusted the severity/precision of a couple of points (notably downgrading how certain I am about the …). No code changes were made and nothing was pushed; the requested deliverable was the PR review + the follow-up re-analysis comment.
This property is important to enforce for Anthropic models with thinking enabled.
They expect the first element of such a tool loop to have a thinking block, and use
some checksums to make sure it is correctly placed. In such a setup if we remove any
element of the tool loop we have to remove the whole thing.
Sorry, I know we've been looking at this before, but I still don't know that this has to be this way, and it seems ... not possible? Unless it simply means we need to summarize ~everything.
I'm confused how the first element could have signature generated from tool calls that weren't generated by the model yet... sorry to be dense. I know there was some weird stuff in Anthropic docs, but it may be worth considering if perhaps there are other ways of looking at it.
"Tool loop" seems to mean an "agent turn" - the model is just doing tool calls, we give results, continue. Such an agent turn can run from initial prompt to any number of events, including 500, 1k or 10k just the same. So we summarize the ~whole thing, which means... the whole agent-only source events.
In other words, it seems that maybe all we need to know is if there was a user message (real or synthetic)? If only the initial message -> summarize the rest. If there is another somewhere, we can summarize from start until that message(s)?
Opus generates a thinking block for the first message in an agent turn, and will throw some API exceptions if it sees an agent turn without a thinking block.
The checksums just ensure the thinking blocks it sees were actually generated by Anthropic when it emitted that particular message, which means we can't modify the thinking block or put it anywhere else without Anthropic complaining.
So we can't get rid of the thinking blocks and we can't move or change them. We can modify the suffix of the agent turns (anything past the first action/observation pair) when we condense, but that isn't a very useful condensation strategy to us at the moment. The result is we treat agent turns as atomic.
If the history is just one long agent turn, yeah, we have to summarize the whole thing. Otherwise condensation will "snap" to the agent turn / user message boundaries.
To be clear, I don't like this. I believe our compact-the-prefix-and-keep-the-suffix approach is better for keeping agents on track than Anthropic's all-or-nothing compaction, but they're optimizing for Claude Code usage and this was the best solution I could find at the time.
If the history is just one long agent turn, yeah, we have to summarize the whole thing. Otherwise condensation will "snap" to the agent turn / user message boundaries.
Exactly 🤔
This might actually be a better mental model:
- we summarize the "event blocks" between user messages
- that might mean the whole view.
It's easy to understand or visualize, it seems to me (I'll come back to this, but it's not an issue for this PR, I'm just trying to understand - especially the aspect where we might need information from outside the view...)
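As a rough illustration of that mental model (not the SDK API; `Event` handling and `is_user_message` are stand-ins), the partitioning could look like:

```python
# Illustrative sketch: split a view's events into "blocks" delimited by user
# messages; condensation would then summarize whole blocks at a time.
from typing import Callable, Sequence, TypeVar

E = TypeVar("E")

def event_blocks(events: Sequence[E], is_user_message: Callable[[E], bool]) -> list[list[E]]:
    blocks: list[list[E]] = []
    current: list[E] = []
    for event in events:
        if is_user_message(event) and current:
            blocks.append(current)  # a user message closes the previous block
            current = []
        current.append(event)
    if current:
        blocks.append(current)
    return blocks

# With only the initial user message, everything lands in one block, i.e. the
# "summarize the whole view" case discussed above.
```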
Just remember that mental model only holds for Anthropic endpoints when thinking is enabled. In all other situations the "event blocks" are fine-grained enough for us to summarize parts of them without the completion endpoint complaining.
enyst
left a comment
Thank you for this, it does seem cleaner!
I'd love it if we can give it a little thought, or maybe I just need to understand a bit better, because the most important change, it seems to me, is that now we would force all other LLMs to get their events summarized if they were in an agent's turn ("tool loop").
Is this... the case? Seems so, and if I understand correctly, that is what you pointed out in the description too. What are the consequences of that? Does it work reasonably well with a few other SOTA LLMs - I seem to recall most of the summarize prompt was Sonnet generated -ish, and idk, it's possible that some LLMs work with it as long as they continue the second half. But now there is no more second half.
On the other hand, if we do this... does that mean we could just do full view? All the time, maybe?
I think we are in a moment where SOTA LLMs got good at long-term tasks. GPT-5.x got there first, and Opus 4.6 was specifically targeted at lengthening the history (the agent turn!), i.e. the ability of Claude to do ~similar.
That, it seems to me, is a big deal, and the numbers are difficult to align with our older assumptions (~120, ~240 events): 1k events in a tool loop was possible even with Opus 4.5 without blinking, passing through 2 condensations in between. If I understand correctly, those must have been full event view condensations - so how much reason is there left to do less than full view?
VascoSch92
left a comment
I’ve left a few comments, most of which are just nits or suggestions.
Overall, the core logic makes sense to me.
However, I have a suggestion regarding the ViewPropertyBase interface: the current method names are a bit misleading. Without reading the docstrings, it’s difficult to guess their actual behavior.
I can propose the following renames:
- `enforce` → `get_violations` or `find_invalid_events`
- `manipulation_indices` → `get_allowed_range` or `get_edit_boundaries`
ALL_PROPERTIES: list[ViewPropertyBase] = [
Is that wanted?
Because you are actually instantiating the classes and not exposing closures.
I think this is fine? There's no state associated with the properties, so while we could instantiate them in the view every time this makes them effectively singletons and saves some object creation cycles.
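For illustration, this is the effect being described: stateless property objects created once at import time and shared by every view (class and method names below are placeholders, not the real module):

```python
# Placeholder sketch: instantiating stateless properties at module import time
# makes them de-facto singletons shared by every View that consults the list.
class ViewPropertyBase:
    def enforce(self, view, all_events):  # illustrative signature only
        raise NotImplementedError

class ExampleProperty(ViewPropertyBase):
    def enforce(self, view, all_events):
        return []  # no instance state, so sharing one object is safe

ALL_PROPERTIES: list[ViewPropertyBase] = [ExampleProperty()]
```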
openhands-sdk/openhands/sdk/context/view/properties/tool_call_matching.py
events in the loop.
"""
tool_loops: list[set[EventID]] = []
current_tool_loop: set[EventID] | None = None
why not just `current_tool_loop: set[EventID] = set()`?
I believe the logic stays the same and we don't have two types for one variable.
I like having the explicit None to indicate we're not in a tool loop instead of relying on an empty set for that check. The logic is basically the same but the extra typing requirements make the individual cases more clear (at least to my eyes).
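The two typings being compared, side by side (sketch only; `EventID` here is a stand-in for the SDK's event-id type):

```python
EventID = str  # stand-in for the SDK's event-id type

# Option from the thread: an empty set doubles as "not in a tool loop".
current_tool_loop_a: set[EventID] = set()

# Option in the PR: None is an explicit "not in a tool loop" marker, so the
# type checker forces each branch to handle the two cases separately.
current_tool_loop_b: set[EventID] | None = None
```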
# loops) are a subset of all the events. If a tool loop in the view isn't
# present in the total list of tool loops that indicates some element has
# been forgotten and we have to remove the remaining elements from the view.
if view_tool_loop not in all_tool_loops:
I’m not sure about the typical size of all_tool_loops, but if it starts to grow, we should consider modifying the method to return a set.
Since the current implementation returns a list of sets (and sets aren't hashable), we could convert each set into a string and return a set[str] instead.
This would reduce the complexity from O(n**2) to O(n).
If the length is always small, feel free to ignore this, better to keep it readable than to over-engineer it!
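If it ever does need optimizing, here is a minimal sketch of the idea using `frozenset` (hashable, so no string conversion is needed; the names are made up for illustration):

```python
# Sketch: turn the list of tool loops into a set of frozensets so membership
# checks are O(1) instead of scanning the whole list.
def as_lookup(all_tool_loops: list[set[str]]) -> set[frozenset[str]]:
    return {frozenset(loop) for loop in all_tool_loops}

all_tool_loops = [{"e1", "e2"}, {"e3", "e4"}]
lookup = as_lookup(all_tool_loops)

view_tool_loop = {"e1", "e2"}
if frozenset(view_tool_loop) not in lookup:
    pass  # some element was forgotten; drop the rest of the loop from the view
```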
This might be slightly pre-mature optimization. You're right that the check can be optimized, but our more substantial optimization is going to be when we eliminate the dependency on all_events in the first place. We'll have to push some of this indexing deeper into the conversation state, but there's no reason we shouldn't be incrementally computing and storing tool loops and batches there instead of recomputing them each time in the view.
Sorry to interfere here, but maybe it's worth considering not storing agent-event information and instead storing information about the user messages. I think maybe we get the same result, except user messages are much fewer and maybe it all becomes easier?
Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>
Good catch, definitely unintended behavior. I've added the thinking block requirements back to the tool loop so now it only triggers for Anthropic models.
Depends on the event distribution, but yeah, Opus has been operating like this for at least a month. I still think our condensation approach is better than collapsing the whole view when the models support it though. Our earlier experiments definitely indicated a slight performance boost for weaker models.
Reasonable. I caught earlier examples of this but missed a few test cases that were still casting them to a list. Those tests now rely on set comparisons instead.
Hi! I started running the condenser tests on your PR. You will receive a comment with the results shortly. Note: These are non-blocking tests that validate condenser functionality across different LLMs.
Condenser Test Results (Non-Blocking)
🧪 Integration Tests Results
Overall Success Rate: 100.0%
📊 Summary
📋 Detailed Results
litellm_proxy_anthropic_claude_opus_4_5_20251101
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
Enforcement is intended as a fallback mechanism to handle edge cases, bad data, or
unforeseen situations. Because enforcement assumes the view is in a bad state, it
often requires a much larger perspective on the events and therefore depends on a
sequence of _all_ events in the conversation.
I think maybe we need to try to not depend on all events in the conversation, maybe we have alternatives?
- fallback to forgetting all of the view minus `keep_first`
- fallback to system prompt + user message(s) + a summary of agent actions in the view
- or worst case, system prompt + some info that things went badly and all we know is that this was the user's task, maybe a summary, maybe ask the user for more
For example, I feel the cloud has difficulty keeping up with 2k events, sandboxes keep crashing and getting lost, I don't know the exact reason but I know I didn't manage to have a longer conversation for a while... (on v1)
Agreed we want to streamline some of these operations to avoid dealing with the whole event stream. But enforcing the properties we have unfortunately requires information from events outside the view.
Simple example: say I look at the current view and see a single action/observation pair. If I want to enforce batch atomicity, the only way I can know that the pair isn't part of a larger batch is to know that there are no other action events with matching llm_response_id values in the larger context.
There are a few ways to avoid doing that:
- Compute some kind of batch map in the conversation state and expose that. We'd have to do that tracking for each property we want to enforce and make sure it gets propagated to the view.
- Make sure the batch is never split in the first place. That's what the manipulation indices are trying to do.
In an ideal world we don't need the enforcement at all, just the manipulation indices. That's why every time enforcement happens it logs a warning. If we can verify that we don't see those warnings in practice, we can disable enforcement altogether (or only trigger it when we detect some error and need to recover).
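Spelling out that example (a hypothetical helper; only `llm_response_id` is taken from the real events, the rest is assumed):

```python
# Sketch: deciding whether a batch seen in the view is complete requires
# scanning events *outside* the view for the same llm_response_id.
def batch_is_complete(view_events, all_events, response_id) -> bool:
    ids_in_view = {e.id for e in view_events if e.llm_response_id == response_id}
    ids_overall = {e.id for e in all_events if e.llm_response_id == response_id}
    # Any matching event missing from the view means the batch was split.
    return ids_in_view == ids_overall
```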
# of the all_events sequence -- if the batch ids in the view aren't exactly
# one-to-one with the batch ids generated by the all_events sequence, that
# can only mean something has been forgotten and we need to drop the entire
# batch.
Just a thought, not a review comment, FWIW: I don't know if we tested that this is really necessary, but maybe I'm missing it. It might be worth having a specific test, if we don't.
Batch atomicity? I assume it was needed at some point, it's a carry-over from v0. Probably model-specific behavior. With this PR we can disable it pretty easily by just not applying that property.
I'm not sure, I thought it was new:
It seemed the PR fixed act2 / summary / obs2, but I think it also dropped act1 and obs1, which I don't know if we had to do or tested for.
I think it would be great to test with act1 / obs1, assuming that we still avoid inserting summary between 🤔
Comes from before that: #775
We need a better way to actually stress the API assumptions made by various providers, but I think that's beyond the scope of this PR
I completely forgot that PR! And I reviewed it at the time 😭
Thank you, yes I agree.
Thank you for this!
I do have a question re: the previous results showing that weaker LLMs do better with ~half the events: when did we last test that, and what qualified as "weaker" LLMs at the time? If it's old enough, we really may want to re-eval/reconsider; LLM capabilities have gone ahead a few lifetimes since. 😅
I'd like to (try to) make a few notes:
- I'm starting to feel it's worth it to play with the idea of restating the `tool loops` constraint as a `user messages` constraint - they're much fewer, we can know exactly what's between them (I hope?)
- we need to track `user messages` for a hook that is broken; maybe we could use that tracking for both features
- it might be worth considering to delete the `keep_first` attribute; in V1 we no longer need 4 minimum, the additional environment info that used to be a `recall obs` is now a suffix of the system prompt, so I think we need two: system and user. That may save us from those bugs when we try to enforce `keep_first` but bump into the first events of a `tool loop`
- we might want to test for batch atomicity, as noted, though on the other hand, I suspect you might be right it was in some other form happening in V0...🤔 (update: you just showed it was older)
- it's worth IMHO to think whether the summary of all events should be exactly the same way as a partial view; for example `codex-cli` adds to the prompt that a summary just happened and nudges the agent to re-investigate a little to get its bearings again
- I have seen in other parts of the codebase a need to know the current view, it could be great if we could somehow expose it
Maybe we could make issues for a few of these, to look into, if they don't sound totally off-base?
I think some of these are totally warranted. My mind is on some short-term tasks this PR unblocked, let me sit on this for a day or so and I'll get back to you.
Oh, this would have been way back when the condenser strategies were first being tested. We're due for a re-evaluation. It's been on my radar but haven't had any free cycles to get it set up.
Tracking user messages seems helpful, but I'd be careful about restating the tool loop constraint as a negative like that. Could be a useful optimization to help detect boundaries but we'd still have to double-check the tool loop is actually a tool loop (action and observation events only, starts with thinking blocks, none in between, etc.)
The default for `keep_first` … We could remove it, but we definitely want to keep at least the system prompt event. So...maybe keep the attribute and set it to 1 if that's the behavior we want?
Good point, our "all event" summary is handled by the hard context reset summary generation so we've already got an entry point to specialize the behavior a bit.
Started working on #2141 yesterday (which I notice you've already seen).
Summary
One challenge in maintaining the condenser is ensuring it does not violate any of the properties the downstream APIs expect to hold. These change frequently and often without warning, but the code in the view that enforces them has metastasized and become difficult to update without unforeseen consequences.
This PR addresses this challenge by removing the property enforcement code from the `View` and moving it to a separate `ViewPropertyBase` implementation. This implementation ensures properties hold in two separate ways: via manipulation indices that prevent edits from splitting protected structures in the first place, and via enforcement as a fallback when the view is already in a bad state.
Breaking Changes
- `View.manipulation_indices` type changed from a list of integers to a `ManipulationIndices` type that extends a set of integers. All current usages have been updated.
- `View.find_next_manipulation_index` deprecated, replaced with `ManipulationIndices.find_next`. All current usages have been updated.
- `View.manipulation_indices` no longer a computed property. Change made to avoid defining a Pydantic serialization scheme for `ManipulationIndices` -- since views were never explicitly serialized, this change should not impact any existing code.
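A rough sketch of what the new type could look like, inferred only from the bullets above (the exact `find_next` semantics shown here are an assumption, not the actual implementation):

```python
# Sketch: a set of allowed indices plus an ordered lookup helper.
class ManipulationIndices(set[int]):
    def find_next(self, index: int) -> int | None:
        """Smallest allowed index at or after `index`, or None if none exists."""
        candidates = [i for i in self if i >= index]
        return min(candidates) if candidates else None

indices = ManipulationIndices({0, 4, 9})
assert indices.find_next(5) == 9
assert indices.find_next(10) is None
```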
Other Changes
- `View` tests, now organized around properties.
As a bonus, this PR unlocks several future improvements:
- The properties make it explicit where in the `View` the entire event stream is necessary. This should make it much easier to sever that dependency by storing the appropriate metadata (like batch maps and tool loop ranges) in the conversation and exporting just that.
- We only need to apply the `ToolLoopAtomicityProperty` if we know the model is an Anthropic model with thinking enabled.
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.12-nodejs22`
- `golang:1.21-bookworm`
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:94b828d-python
Run
All tags pushed for this build
About Multi-Architecture Support
- `94b828d-python` is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. `94b828d-python-amd64`) are also available if needed