feat(condenser): Explicit view properties #2116
Conversation
…correct manipulation indices behavior
I'm on it! enyst can track my progress at all-hands.dev
Taste rating: 🔴 Needs improvement
This refactor is aiming at a real problem (the old …), but there are a couple of places where the data structure choice and behavioral guarantees don’t line up, and that’s where you get the kind of “works in tests, breaks in production” bugs.
[CRITICAL ISSUES]
1) …
Re-read / self-audit of my review
After re-reading my own comment + skimming the diff again, I think the ordering contract for … A couple of adjustments in severity/precision:
Net: the main point stands — don’t return an unordered container and then reason about “consecutive indices.” Everything else is secondary / depends on intended semantics.
I reviewed PR #2116 in a “/codereview-roasted (but not exaggerated)” style, focusing on fundamental design and correctness risks rather than style nits, and I posted that review as a markdown PR comment. Then I re-read my own review and re-checked the diff “with fresh eyes,” and adjusted the severity/precision of a couple of points (notably downgrading how certain I am about the …). No code changes were made and nothing was pushed; the requested deliverable was the PR review + the follow-up re-analysis comment.
This property is important to enforce for Anthropic models with thinking enabled.
They expect the first element of such a tool loop to have a thinking block, and use
some checksums to make sure it is correctly placed. In such a setup if we remove any
element of the tool loop we have to remove the whole thing.
Sorry, I know we've been looking at this before, but I still don't know that this has to be this way, and it seems ... not possible? Unless it simply means we need to summarize ~everything.
I'm confused how the first element could have signature generated from tool calls that weren't generated by the model yet... sorry to be dense. I know there was some weird stuff in Anthropic docs, but it may be worth considering if perhaps there are other ways of looking at it.
"Tool loop" seems to mean an "agent turn" - the model is just doing tool calls, we give results, continue. Such an agent turn can run from initial prompt to any number of events, including 500, 1k or 10k just the same. So we summarize the ~whole thing, which means... the whole agent-only source events.
In other words, it seems that maybe all we need to know is if there was a user message (real or synthetic)? If only the initial message -> summarize the rest. If there is another somewhere, we can summarize from start until that message(s)?
Opus generates a thinking block for the first message in an agent turn, and will throw some API exceptions if it sees an agent turn without a thinking block.
The checksums just ensure the thinking blocks it sees were actually generated by Anthropic when it emitted that particular message, which means we can't modify the thinking block or put it anywhere else without Anthropic complaining.
So we can't get rid of the thinking blocks and we can't move or change them. We can modify the suffix of the agent turns (anything past the first action/observation pair) when we condense, but that isn't a very useful condensation strategy to us at the moment. The result is we treat agent turns as atomic.
If the history is just one long agent turn, yeah, we have to summarize the whole thing. Otherwise condensation will "snap" to the agent turn / user message boundaries.
To be clear, I don't like this. I believe our compact-the-prefix-and-keep-the-suffix approach is better for keeping agents on track than Anthropic's all-or-nothing compaction, but they're optimizing for Claude Code usage and this was the best solution I could find at the time.
If the history is just one long agent turn, yeah, we have to summarize the whole thing. Otherwise condensation will "snap" to the agent turn / user message boundaries.
Exactly 🤔
This might actually be a better mental model:
- we summarize the "event blocks" between user messages
- that might mean the whole view.
It's easy to understand or visualize, it seems to me (I'll come back to this, but it's not an issue for this PR, I'm just trying to understand - especially the aspect where we might need information from outside the view...)
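As a rough illustration of that mental model (not the SDK API; `Event` handling and `is_user_message` are stand-ins), the partitioning could look like:

```python
# Illustrative sketch: split a view's events into "blocks" delimited by user
# messages; condensation would then summarize whole blocks at a time.
from typing import Callable, Sequence, TypeVar

E = TypeVar("E")

def event_blocks(events: Sequence[E], is_user_message: Callable[[E], bool]) -> list[list[E]]:
    blocks: list[list[E]] = []
    current: list[E] = []
    for event in events:
        if is_user_message(event) and current:
            blocks.append(current)  # a user message closes the previous block
            current = []
        current.append(event)
    if current:
        blocks.append(current)
    return blocks

# With only the initial user message, everything lands in one block, i.e. the
# "summarize the whole view" case discussed above.
```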
Just remember that mental model only holds for Anthropic endpoints when thinking is enabled. In all other situations the "event blocks" are fine-grained enough for us to summarize parts of them without the completion endpoint complaining.
enyst
left a comment
Thank you for this, it does seem cleaner!
I'd love it if we can give it a little thought, or maybe I just need to understand a bit better, because the most important change, it seems to me, is that now we would force all other LLMs to get their events summarized if they were in an agent's turn ("tool loop").
Is this... the case? Seems so, and if I understand correctly, that is what you pointed out in the description too. What are the consequences of that? Does it work reasonably well with a few other SOTA LLMs - I seem to recall most of the summarize prompt was Sonnet generated -ish, and idk, it's possible that some LLMs work with it as long as they continue the second half. But now there is no more second half.
On the other hand, if we do this... does that mean we could just do full view? All the time, maybe?
I think we are in a moment where SOTA LLMs got good at long-term tasks. GPT-5.x got there first, and Opus 4.6 was specifically targeted at lengthening the history (the agent turn!), i.e. the ability of Claude to do ~similar.
That, it seems to me, is a big deal, and the numbers are difficult to align with our older assumptions (~120, ~240 events): 1k events in a tool loop was possible even with Opus 4.5 without blinking, passing through 2 condensations in between. If I understand correctly, those must have been full event view condensations - so how much reason is there left to do less than full view?
VascoSch92
left a comment
I’ve left a few comments, most of which are just nits or suggestions.
Overall, the core logic makes sense to me.
However, I have a suggestion regarding the ViewPropertyBase interface: the current method names are a bit misleading. Without reading the docstrings, it’s difficult to guess their actual behavior.
I can propose the following renames:
- `enforce` → `get_violations` or `find_invalid_events`
- `manipulation_indices` → `get_allowed_range` or `get_edit_boundaries`
ALL_PROPERTIES: list[ViewPropertyBase] = [
Is that wanted?
Because you are actually instantiating the classes and not exposing closures.
I think this is fine? There's no state associated with the properties, so while we could instantiate them in the view every time this makes them effectively singletons and saves some object creation cycles.
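For illustration, this is the effect being described: stateless property objects created once at import time and shared by every view (class and method names below are placeholders, not the real module):

```python
# Placeholder sketch: instantiating stateless properties at module import time
# makes them de-facto singletons shared by every View that consults the list.
class ViewPropertyBase:
    def enforce(self, view, all_events):  # illustrative signature only
        raise NotImplementedError

class ExampleProperty(ViewPropertyBase):
    def enforce(self, view, all_events):
        return []  # no instance state, so sharing one object is safe

ALL_PROPERTIES: list[ViewPropertyBase] = [ExampleProperty()]
```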
openhands-sdk/openhands/sdk/context/view/properties/tool_call_matching.py
events in the loop.
"""
tool_loops: list[set[EventID]] = []
current_tool_loop: set[EventID] | None = None
why not just `current_tool_loop: set[EventID] = set()`?
I believe the logic stays the same and we don't have two types for one variable.
I like having the explicit None to indicate we're not in a tool loop instead of relying on an empty set for that check. The logic is basically the same but the extra typing requirements make the individual cases more clear (at least to my eyes).
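The two typings being compared, side by side (sketch only; `EventID` here is a stand-in for the SDK's event-id type):

```python
EventID = str  # stand-in for the SDK's event-id type

# Option from the thread: an empty set doubles as "not in a tool loop".
current_tool_loop_a: set[EventID] = set()

# Option in the PR: None is an explicit "not in a tool loop" marker, so the
# type checker forces each branch to handle the two cases separately.
current_tool_loop_b: set[EventID] | None = None
```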
# loops) are a subset of all the events. If a tool loop in the view isn't
# present in the total list of tool loops that indicates some element has
# been forgotten and we have to remove the remaining elements from the view.
if view_tool_loop not in all_tool_loops:
I’m not sure about the typical size of all_tool_loops, but if it starts to grow, we should consider modifying the method to return a set.
Since the current implementation returns a list of sets (and sets aren't hashable), we could convert each set into a string and return a set[str] instead.
This would reduce the complexity from O(n**2) to O(n).
If the length is always small, feel free to ignore this, better to keep it readable than to over-engineer it!
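If it ever does need optimizing, here is a minimal sketch of the idea using `frozenset` (hashable, so no string conversion is needed; the names are made up for illustration):

```python
# Sketch: turn the list of tool loops into a set of frozensets so membership
# checks are O(1) instead of scanning the whole list.
def as_lookup(all_tool_loops: list[set[str]]) -> set[frozenset[str]]:
    return {frozenset(loop) for loop in all_tool_loops}

all_tool_loops = [{"e1", "e2"}, {"e3", "e4"}]
lookup = as_lookup(all_tool_loops)

view_tool_loop = {"e1", "e2"}
if frozenset(view_tool_loop) not in lookup:
    pass  # some element was forgotten; drop the rest of the loop from the view
```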
This might be slightly pre-mature optimization. You're right that the check can be optimized, but our more substantial optimization is going to be when we eliminate the dependency on all_events in the first place. We'll have to push some of this indexing deeper into the conversation state, but there's no reason we shouldn't be incrementally computing and storing tool loops and batches there instead of recomputing them each time in the view.
Sorry to interfere here, but maybe it's worth considering not storing agent-event information and instead storing information about the user messages. I think maybe we get the same result, except user messages are much fewer and maybe it all becomes easier?
Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>
Good catch, definitely unintended behavior. I've added the thinking block requirements back to the tool loop so now it only triggers for Anthropic models.
Depends on the event distribution, but yeah, Opus has been operating like this for at least a month. I still think our condensation approach is better than collapsing the whole view when the models support it though. Our earlier experiments definitely indicated a slight performance boost for weaker models.
Reasonable. I caught earlier examples of this but missed a few test cases that were still casting them to a list. Those tests now rely on set comparisons instead.
Hi! I started running the condenser tests on your PR. You will receive a comment with the results shortly. Note: These are non-blocking tests that validate condenser functionality across different LLMs.
Condenser Test Results (Non-Blocking)
🧪 Integration Tests Results
Overall Success Rate: 100.0%
📊 Summary
📋 Detailed Results
litellm_proxy_anthropic_claude_opus_4_5_20251101
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
Enforcement is intended as a fallback mechanism to handle edge cases, bad data, or
unforeseen situations. Because enforcement assumes the view is in a bad state, it
often requires a much larger perspective on the events and therefore depends on a
sequence of _all_ events in the conversation.
I think maybe we need to try to not depend on all events in the conversation, maybe we have alternatives?
- fallback to forgetting all of the view minus `keep_first`
- fallback to system prompt + user message(s) + a summary of agent actions in the view
- or worst case, system prompt + some info that things went badly and all we know is that this was the user's task, maybe a summary, maybe ask the user for more
For example, I feel the cloud has difficulty keeping up with 2k events, sandboxes keep crashing and getting lost, I don't know the exact reason but I know I didn't manage to have a longer conversation for a while... (on v1)
Agreed we want to streamline some of these operations to avoid dealing with the whole event stream. But enforcing the properties we have unfortunately requires information from events outside the view.
Simple example: say I look at the current view and see a single action/observation pair. If I want to enforce batch atomicity, the only way I can know that the pair isn't part of a larger batch is to know that there are no other action events with matching llm_response_id values in the larger context.
There are a few ways to avoid doing that:
- Compute some kind of batch map in the conversation state and expose that. We'd have to do that tracking for each property we want to enforce and make sure it gets propagated to the view.
- Make sure the batch is never split in the first place. That's what the manipulation indices are trying to do.
In an ideal world we don't need the enforcement at all, just the manipulation indices. That's why every time enforcement happens it logs a warning. If we can verify that we don't see those warnings in practice, we can disable enforcement altogether (or only trigger it when we detect some error and need to recover).
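Spelling out that example (a hypothetical helper; only `llm_response_id` is taken from the real events, the rest is assumed):

```python
# Sketch: deciding whether a batch seen in the view is complete requires
# scanning events *outside* the view for the same llm_response_id.
def batch_is_complete(view_events, all_events, response_id) -> bool:
    ids_in_view = {e.id for e in view_events if e.llm_response_id == response_id}
    ids_overall = {e.id for e in all_events if e.llm_response_id == response_id}
    # Any matching event missing from the view means the batch was split.
    return ids_in_view == ids_overall
```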
# of the all_events sequence -- if the batch ids in the view aren't exactly
# one-to-one with the batch ids generated by the all_events sequence, that
# can only mean something has been forgotten and we need to drop the entire
# batch.
Just a thought, not a review comment, FWIW: I don't know if we tested that this is really necessary, but maybe I'm missing it. It might be worth having a specific test, if we don't.
Batch atomicity? I assume it was needed at some point, it's a carry-over from v0. Probably model-specific behavior. With this PR we can disable it pretty easily by just not applying that property.
I'm not sure, I thought it was new:
It seemed the PR fixed act2 / summary / obs2, but I think it also dropped act1 and obs1, which I don't know if we had to do or tested for.
I think it would be great to test with act1 / obs1, assuming that we still avoid inserting summary between 🤔
Comes from before that: #775
We need a better way to actually stress the API assumptions made by various providers, but I think that's beyond the scope of this PR
I completely forgot that PR! And I reviewed it at the time 😭
Thank you, yes I agree.
Thank you for this!
I do have a question re: the previous results showing that weaker LLMs do better with ~half the events: when did we last test that, and what qualified as "weaker" LLMs at the time? If it's old enough, we really may want to re-eval/reconsider; LLM capabilities have gone ahead a few lifetimes since. 😅
I'd like to (try to) make a few notes:
- I'm starting to feel it's worth it to play with the idea of restating the `tool loops` constraint as a `user messages` constraint - they're much fewer, we can know exactly what's between them (I hope?)
- we need to track `user messages` for a hook that is broken; maybe we could use that tracking for both features
- it might be worth considering to delete the `keep_first` attribute; in V1 we no longer need 4 minimum, the additional environment info that used to be a `recall obs` is now a suffix of the system prompt, so I think we need two: system and user. That may save us from those bugs when we try to enforce `keep_first` but bump into the first events of a `tool loop`
- we might want to test for batch atomicity, as noted, though on the other hand, I suspect you might be right it was in some other form happening in V0...🤔 (update: you just showed it was older)
- it's worth IMHO to think whether the summary of all events should be exactly the same way as a partial view; for example `codex-cli` adds to the prompt that a summary just happened and nudges the agent to re-investigate a little to get its bearings again
- I have seen in other parts of the codebase a need to know the current view, it could be great if we could somehow expose it
Maybe we could make issues for a few of these, to look into, if they don't sound totally off-base?
I think some of these are totally warranted. My mind is on some short-term tasks this PR unblocked, let me sit on this for a day or so and I'll get back to you.
Oh, this would have been way back when the condenser strategies were first being tested. We're due for a re-evaluation. It's been on my radar but haven't had any free cycles to get it set up.
Tracking user messages seems helpful, but I'd be careful about restating the tool loop constraint as a negative like that. Could be a useful optimization to help detect boundaries but we'd still have to double-check the tool loop is actually a tool loop (action and observation events only, starts with thinking blocks, none in between, etc.)
The default for `keep_first` … We could remove it, but we definitely want to keep at least the system prompt event. So...maybe keep the attribute and set it to 1 if that's the behavior we want?
Good point, our "all event" summary is handled by the hard context reset summary generation so we've already got an entry point to specialize the behavior a bit.
Started working on #2141 yesterday (which I notice you've already seen).
Summary
One challenge in maintaining the condenser is ensuring it does not violate any of the properties the downstream APIs expect to hold. These change frequently and often without warning, but the code in the view that enforces them has metastasized and become difficult to update without unforeseen consequences.
This PR addresses this challenge by removing the property enforcement code from the `View` and moving it to a separate `ViewPropertyBase` implementation. This implementation ensures properties hold in two separate ways: via manipulation indices that prevent edits from splitting protected structures in the first place, and via enforcement as a fallback when the view is already in a bad state.
Breaking Changes
- `View.manipulation_indices` type changed from a list of integers to a `ManipulationIndices` type that extends a set of integers. All current usages have been updated.
- `View.find_next_manipulation_index` deprecated, replaced with `ManipulationIndices.find_next`. All current usages have been updated.
- `View.manipulation_indices` no longer a computed property. Change made to avoid defining a Pydantic serialization scheme for `ManipulationIndices` -- since views were never explicitly serialized, this change should not impact any existing code.
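A rough sketch of what the new type could look like, inferred only from the bullets above (the exact `find_next` semantics shown here are an assumption, not the actual implementation):

```python
# Sketch: a set of allowed indices plus an ordered lookup helper.
class ManipulationIndices(set[int]):
    def find_next(self, index: int) -> int | None:
        """Smallest allowed index at or after `index`, or None if none exists."""
        candidates = [i for i in self if i >= index]
        return min(candidates) if candidates else None

indices = ManipulationIndices({0, 4, 9})
assert indices.find_next(5) == 9
assert indices.find_next(10) is None
```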
Other Changes
- `View` tests, now organized around properties.
As a bonus, this PR unlocks several future improvements:
- The properties make it explicit where in the `View` the entire event stream is necessary. This should make it much easier to sever that dependency by storing the appropriate metadata (like batch maps and tool loop ranges) in the conversation and exporting just that.
- We only need to apply the `ToolLoopAtomicityProperty` if we know the model is an Anthropic model with thinking enabled.
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.12-nodejs22`
- `golang:1.21-bookworm`
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:94b828d-python
Run
All tags pushed for this build
About Multi-Architecture Support
- `94b828d-python` is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. `94b828d-python-amd64`) are also available if needed