
feat(condenser): Explicit view properties #2116

Merged

csmith49 merged 48 commits into main from feat/view-properties on Feb 19, 2026

Conversation

csmith49 (Collaborator) commented Feb 18, 2026

Summary

One challenge in maintaining the condenser is ensuring it does not violate any of the properties the downstream APIs expect to hold. These change frequently and often without warning, but the code in the view that enforces them has metastasized and become difficult to update without unforeseen consequences.

This PR addresses the challenge by removing the property enforcement code from the View and moving it to a separate ViewPropertyBase implementation. This implementation ensures properties hold in two ways:

  1. By enforcing them explicitly. This is an expensive and destructive process by which events are removed from the view if they would violate a property. For example, an observation without a matching action would be forcibly removed.
  2. By generating manipulation indices. These are used by the condensers to carefully choose how they modify the view. If the modifications stick to those indices, then the properties we care about should hold inductively.
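For orientation, here is a minimal sketch of what that interface shape could look like. The method names enforce and manipulation_indices come up later in review; the signatures are assumptions, not the PR's actual code:

```python
from abc import ABC, abstractmethod


class ViewPropertyBase(ABC):
    """Illustrative sketch only; the real signatures may differ."""

    @abstractmethod
    def enforce(self, view: "View") -> "View":
        """Destructive fallback: return a view with violating events removed."""

    @abstractmethod
    def manipulation_indices(self, view: "View") -> set[int]:
        """Safe boundaries between which condensers may modify the view."""
```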

Breaking Changes

  • View.manipulation_indices type changed from a list of integers to a ManipulationIndices type that extends set[int]. All current usages have been updated.
  • View.find_next_manipulation_index deprecated, replaced with ManipulationIndices.find_next. All current usages have been updated.
  • View.manipulation_indices is no longer a computed property. This change avoids defining a Pydantic serialization scheme for ManipulationIndices -- since views were never explicitly serialized, it should not impact any existing code.

Other Changes

  • Clean-up of the View tests, now organized around properties.
  • Significant documentation improvements.
  • Simpler property implementations.

Future Improvements

As a bonus, this PR unlocks several future improvements:

  • We know explicitly where in the View the entire event stream is necessary. This should make it much easier to sever that dependency by storing the appropriate metadata (like batch maps and tool loop ranges) in the conversation and exporting just that.
  • Property enforcement is all-or-nothing -- we enforce the same properties for all models at once. But if necessary we could dynamically load certain properties based on the model being used. For example, we might only load the ToolLoopAtomicityProperty if we know the model is an Anthropic model with thinking enabled.
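A hedged sketch of what that conditional loading could look like. ToolLoopAtomicityProperty is named above; the stub classes, helper name, and selection logic are invented for illustration:

```python
class ToolCallMatchingProperty:  # stub standing in for the real property
    pass


class ToolLoopAtomicityProperty:  # stub standing in for the real property
    pass


def properties_for(model: str, thinking_enabled: bool) -> list:
    """Hypothetical per-model selection; not how this PR loads properties."""
    props = [ToolCallMatchingProperty()]  # assumed to apply universally
    if "anthropic" in model.lower() and thinking_enabled:
        # Only Anthropic's thinking-block checksums need loop atomicity.
        props.append(ToolLoopAtomicityProperty())
    return props


assert len(properties_for("anthropic/claude-opus", thinking_enabled=True)) == 2
```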

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image                                   Docs / Tags
java     amd64, arm64   eclipse-temurin:17-jdk                       Link
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22   Link
golang   amd64, arm64   golang:1.21-bookworm                         Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:94b828d-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-94b828d-python \
  ghcr.io/openhands/agent-server:94b828d-python

All tags pushed for this build

ghcr.io/openhands/agent-server:94b828d-golang-amd64
ghcr.io/openhands/agent-server:94b828d-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:94b828d-golang-arm64
ghcr.io/openhands/agent-server:94b828d-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:94b828d-java-amd64
ghcr.io/openhands/agent-server:94b828d-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:94b828d-java-arm64
ghcr.io/openhands/agent-server:94b828d-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:94b828d-python-amd64
ghcr.io/openhands/agent-server:94b828d-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:94b828d-python-arm64
ghcr.io/openhands/agent-server:94b828d-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:94b828d-golang
ghcr.io/openhands/agent-server:94b828d-java
ghcr.io/openhands/agent-server:94b828d-python

About Multi-Architecture Support

  • Each variant tag (e.g., 94b828d-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 94b828d-python-amd64) are also available if needed

github-actions bot (Contributor) commented Feb 18, 2026

Coverage

Coverage Report

File                                              Stmts   Miss  Cover  Missing
openhands-sdk/openhands/sdk/context/condenser
   llm_summarizing_condenser.py                     121     16    86%  275–276, 278–280, 285, 288–289, 292–293, 298, 303, 305–306, 319–320
openhands-sdk/openhands/sdk/context/view
   manipulation_indices.py                           14      1    92%  36
   view.py                                           61      1    98%  76
TOTAL                                             18314   5572    69%

openhands-ai bot commented Feb 19, 2026

I'm on it! enyst can track my progress at all-hands.dev

enyst (Collaborator) commented Feb 19, 2026

Taste rating: 🔴 Needs improvement

This refactor is aiming at a real problem (the old View had a lot of policy embedded in it), and the general direction is good: properties as first-class objects is a clean separation.

But there are a couple of places where the data structure choice and behavioral guarantees don’t line up, and that’s where you get the kind of “works in tests, breaks in production” bugs.


[CRITICAL ISSUES]

1) ManipulationIndices is a set, but the API is still semantically ordered

  • Files:
    • openhands-sdk/openhands/sdk/context/view/manipulation_indices.py (class extends set[int])
    • tests/integration/tests/c01_thinking_block_condenser.py:52 (indices = list(view.manipulation_indices))
    • tests/sdk/context/view/test_view_tool_loop_boundaries.py:209 (assert list(indices) == [0, 1, 2])

A “manipulation index” is fundamentally a boundary in a sequence. The whole idea of “ranges between consecutive indices define atomic units” only makes sense if you have a sorted sequence of indices.

Right now:

  • View.manipulation_indices returns a set.
  • Code/tests convert it to list(...) and then treat adjacent elements as ordered boundaries.

That’s not a nit: a set’s iteration order is not an API contract. Even if it looks stable for small ints, it’s accidental.

Actionable fix options:

  1. Keep the internal representation as a set, but make iteration deterministic and ordered:
    • override __iter__ on ManipulationIndices to yield indices in sorted order, and/or
    • add an explicit sorted()/as_list_sorted() method and update all call sites to use it.
  2. Or don't pretend it's a set: keep it as a sorted list[int] and use set(...) only where you need set operations.

If you want the type to be a set for intersection (&=), fine — but then you need an explicit ordered view for any logic that treats indices as boundaries.
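A runnable sketch of option 1 (ManipulationIndices is the real class name; the override itself is illustrative):

```python
class ManipulationIndices(set[int]):
    # Keep set semantics (so `&=` etc. still work) but make iteration
    # deterministic: always yield indices in ascending order.
    def __iter__(self):
        # Call the base-class iterator explicitly to avoid infinite recursion.
        yield from sorted(set.__iter__(self))


indices = ManipulationIndices({5, 0, 3})
assert list(indices) == [0, 3, 5]  # ordered boundaries, not accidental set order
boundaries = list(indices)
assert list(zip(boundaries, boundaries[1:])) == [(0, 3), (3, 5)]  # atomic ranges
```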


2) ToolCallMatchingProperty regresses the old “None tool_call_id is invalid” behavior

  • File: openhands-sdk/openhands/sdk/context/view/properties/tool_call_matching.py

The old View code had explicit tool_call_id is not None filtering when building the action/observation tool-call-id sets. That wasn’t there for fun — it was defensive against bad/legacy data.

The new enforcement logic effectively treats None as just another ID:

  • None ends up in both sets if present in both event types
  • then an action+observation with tool_call_id=None will “match” and survive enforcement

Even if the type system says ToolCallID = str, the old code suggests there’s a real-world edge case here (and a reviewer already flagged it).

Actionable fix: treat None as “unmatchable”:

  • don’t add None to the tool_call_id sets
  • and/or explicitly mark events with tool_call_id is None for removal

Also worth noting: manipulation_indices() uses pending_tool_call_ids.remove(...), which will throw if the stream is already malformed. If this is truly “impossible” post-enforcement, document that assumption; otherwise, make it robust.
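A self-contained sketch of the "treat None as unmatchable" fix, with a simplified event type standing in for the SDK's real classes:

```python
from dataclasses import dataclass


@dataclass
class Ev:  # simplified stand-in for the SDK's action/observation events
    kind: str  # "action" or "observation"
    tool_call_id: str | None


def unmatched(events: list[Ev]) -> list[Ev]:
    # None never enters either ID set, and any event carrying it is
    # explicitly marked as unmatched (and thus removable by enforcement).
    action_ids = {e.tool_call_id for e in events
                  if e.kind == "action" and e.tool_call_id is not None}
    obs_ids = {e.tool_call_id for e in events
               if e.kind == "observation" and e.tool_call_id is not None}

    def is_matched(e: Ev) -> bool:
        if e.tool_call_id is None:
            return False
        return e.tool_call_id in (obs_ids if e.kind == "action" else action_ids)

    return [e for e in events if not is_matched(e)]


# An action/observation pair with tool_call_id=None no longer "matches":
assert len(unmatched([Ev("action", None), Ev("observation", None)])) == 2
```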


3) ToolLoopAtomicityProperty semantics look broader than the previous implementation

  • File: openhands-sdk/openhands/sdk/context/view/properties/tool_loop_atomicity.py

Old behavior (per the removed View.manipulation_indices doc + logic): tool loops were tied to thinking-block batches.

New behavior: a “tool loop” is any contiguous run of action/observation events.

That’s a meaningful behavior change: it can dramatically shrink available manipulation ranges and cause condensers to hit the “no valid range” path more often — even for models that don’t have the thinking-block checksum constraints.

Maybe that’s intentional (“safer by default”), but then it should be called out as such because it affects condensation availability.

Actionable fix: either

  • restore the “thinking-block starts the loop” rule, or
  • keep the broader rule but explicitly document the change + expected impact, or
  • make loading this property conditional (as your PR description mentions as a future improvement).

[IMPROVEMENT OPPORTUNITIES]

  • View.enforce_properties recursion (openhands-sdk/openhands/sdk/context/view/view.py:95+):
    The recursion will terminate (events strictly decrease), but it’s still an odd choice. A simple while True loop would be clearer and avoids any “oops, we dropped 2k events and hit recursion depth” surprises.

  • Mutable defaults in Pydantic models (View.condensations: list[Condensation] = []):
    This was already present on main, but it’s still a footgun. Prefer Field(default_factory=list).
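Both suggestions in one self-contained sketch (the events field and the single-pass helper are simplified stand-ins for the SDK's real ones):

```python
from pydantic import BaseModel, Field


class View(BaseModel):
    # Field(default_factory=list) instead of `= []` avoids the shared
    # mutable-default footgun noted above.
    events: list[str] = Field(default_factory=list)

    def _enforce_once(self) -> "View":
        # Stand-in for a single enforcement pass; the real code would apply
        # each property and drop violating events.
        return self

    def enforce_properties(self) -> "View":
        # Iterative fixed point instead of recursion: loop until a pass
        # removes nothing, with no recursion-depth risk on large drops.
        view = self
        while True:
            enforced = view._enforce_once()
            if len(enforced.events) == len(view.events):
                return enforced
            view = enforced
```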


[TESTING GAPS]

  • The integration test and test_view_tool_loop_boundaries.py currently depend on set ordering (see critical issue 1 above). They should be fixed to use an ordered form (or the type should provide one).

  • There’s no test covering the tool_call_id=None edge case that the old implementation explicitly guarded against.


VERDICT

Not ready to merge as-is: the ManipulationIndices ordering contract and the tool_call_id=None regression are the kind of “looks fine until it isn’t” issues.

Once those are addressed, the overall architecture (property isolation + tests organized by property) looks like a solid direction.

Key insight: right now the code says “set of safe boundaries” but the condensers/tests use it as an ordered boundary list. Pick one and make it explicit.

enyst (Collaborator) commented Feb 19, 2026

Re-read / self-audit of my review

After re-reading my own comment + skimming the diff again, I think the ordering contract for ManipulationIndices is the only truly “this will bite you” issue I’d keep at 🔴 severity.

A couple adjustments in severity/precision:

  • tool_call_id=None regression: I called this out as a hard regression, but given the current type (ToolCallID = str) and Pydantic validation, it’s plausible that None can’t exist in real persisted events anymore (unless something is using model_construct / bypassing validation / mocks). The old code was defensive, so I still think it’s worth deciding explicitly:

    • either keep the defensive None handling (and add a test that demonstrates a real path where it can occur),
    • or remove the legacy behavior/tests and treat “None tool_call_id” as invalid data that can’t load.
  • Tool loop semantics: This is a real behavior change vs the old “thinking-block-started loops” rule, but I may have framed it too negatively. If the goal is “safer by default across models,” enforcing atomicity on any contiguous tool run is defensible — it just needs to be an explicit, documented tradeoff because it reduces condensation flexibility.

  • __iter__ / sequence semantics: I didn’t call this out originally, and I’m not going to invent it now as a must-fix. The existing View already didn’t implement __iter__ on main, so this PR isn’t introducing that regression. The real issue is that any new tests/code that treat View as directly iterable are silently wrong due to BaseModel.__iter__ yielding (field, value) pairs.

Net: the main point stands — don’t return an unordered container and then reason about “consecutive indices.” Everything else is secondary / depends on intended semantics.

openhands-ai bot commented Feb 19, 2026

I reviewed PR #2116 in a “/codereview-roasted (but not exaggerated)” style, focusing on fundamental design and correctness risks rather than style nits, and I posted that review as a markdown PR comment.

Then I re-read my own review and re-checked the diff “with fresh eyes,” adjusted the severity/precision of a couple points (notably downgrading how certain I am about the tool_call_id=None regression depending on real-world data validity, and reframing the broader tool-loop atomicity as a potentially intentional tradeoff), and posted that re-analysis as a second markdown PR comment.

No code changes were made and nothing was pushed; the requested deliverable was the PR review + the follow-up re-analysis comment.

This property is important to enforce for Anthropic models with thinking enabled.
They expect the first element of such a tool loop to have a thinking block, and use
some checksums to make sure it is correctly placed. In such a setup if we remove any
element of the tool loop we have to remove the whole thing.
Collaborator:

Sorry, I know we've been looking at this before, but I still don't know that this has to be this way, and it seems ... not possible? Unless it simply means we need to summarize ~everything.

I'm confused how the first element could have a signature generated from tool calls that weren't generated by the model yet... sorry to be dense. I know there was some weird stuff in Anthropic docs, but it may be worth considering whether there are other ways of looking at it.

"Tool loop" seems to mean an "agent turn" - the model is just doing tool calls, we give results, continue. Such an agent turn can run from initial prompt to any number of events, including 500, 1k or 10k just the same. So we summarize the ~whole thing, which means... the whole agent-only source events.

In other words, it seems that maybe all we need to know is if there was a user message (real or synthetic)? If only the initial message -> summarize the rest. If there is another somewhere, we can summarize from start until that message(s)?

Collaborator Author:

Opus generates a thinking block for the first message in an agent turn, and will throw some API exceptions if it sees an agent turn without a thinking block.

The checksums just ensure the thinking blocks it sees were actually generated by Anthropic when it emitted that particular message, which means we can't modify the thinking block or put it anywhere else without Anthropic complaining.

So we can't get rid of the thinking blocks and we can't move or change them. We can modify the suffix of the agent turns (anything past the first action/observation pair) when we condense, but that isn't a very useful condensation strategy to us at the moment. The result is we treat agent turns as atomic.

If the history is just one long agent turn, yeah, we have to summarize the whole thing. Otherwise condensation will "snap" to the agent turn / user message boundaries.

To be clear, I don't like this. I believe our compact-the-prefix-and-keep-the-suffix approach is better for keeping agents on track than Anthropic's all-or-nothing compaction, but they're optimizing for Claude Code usage and this was the best solution I could find at the time.

Collaborator:

If the history is just one long agent turn, yeah, we have to summarize the whole thing. Otherwise condensation will "snap" to the agent turn / user message boundaries.

Exactly 🤔

This might actually be a better mental model:

  • we summarize the "event blocks" between user messages
  • that might mean the whole view.

It's easy to understand or visualize, it seems to me. (I'll come back to this; it's not an issue for this PR, I'm just trying to understand -- especially the aspect where we might need information from outside the view...)
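A toy sketch of that mental model (the dict-based event representation is invented for illustration):

```python
def blocks_between_user_messages(events: list[dict]) -> list[list[dict]]:
    # Cut the stream at user messages; each resulting block is a unit we
    # could summarize wholesale -- which might be the entire view.
    blocks: list[list[dict]] = []
    current: list[dict] = []
    for event in events:
        if event.get("role") == "user":
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(event)
    if current:
        blocks.append(current)
    return blocks
```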

Collaborator Author:

Just remember that mental model only holds for Anthropic endpoints when thinking is enabled. In all other situations the "event blocks" are fine-grained enough for us to summarize parts of them without the completion endpoint complaining.

enyst (Collaborator) left a comment:

Thank you for this, it does seem cleaner!

I'd love it if we can give it a little thought, or maybe I just need to understand a bit better, because the most important change, it seems to me, is that now we would force all other LLMs to get their events summarized if they were in an agent's turn ("tool loop").

Is this... the case? Seems so, and if I understand correctly, that is what you pointed out in the description too. What are the consequences? Does it work reasonably well with a few other SOTA LLMs? I seem to recall most of the summarize prompt was Sonnet-generated-ish, and idk, it's possible that some LLMs work with it as long as they continue the second half. But now there is no more second half.

On the other hand, if we do this... does that mean we could just do full view? All the time, maybe?

I think we are in a moment where SOTA LLMs got good at long-term tasks. GPT-5.x got there first; Opus 4.6 specifically targeted lengthening the history (the agent turn!) and Claude's ability to do ~similar.

That, it seems to me, is a big deal, and the numbers are difficult to align with our older assumptions (~120, ~240 events): 1k events in a tool loop was possible even with Opus 4.5 without blinking, passing through 2 condensations in between. If I understand correctly, and those must have been full-event-view condensations, then how much reason is there left to do less than the full view?

VascoSch92 (Contributor) left a comment:

I’ve left a few comments, most of which are just nits or suggestions.

Overall, the core logic makes sense to me.

However, I have a suggestion regarding the ViewPropertyBase interface: the current method names are a bit misleading. Without reading the docstrings, it’s difficult to guess their actual behavior.

I can propose the following renames:

  • enforce → get_violations or find_invalid_events
  • manipulation_indices → get_allowed_range or get_edit_boundaries


ALL_PROPERTIES: list[ViewPropertyBase] = [
Contributor:

Is that wanted?

Because you are actually instantiating the classes and not exposing closures.

Collaborator Author:

I think this is fine? There's no state associated with the properties, so while we could instantiate them in the view every time, this makes them effectively singletons and saves some object-creation cycles.

events in the loop.
"""
tool_loops: list[set[EventID]] = []
current_tool_loop: set[EventID] | None = None
Contributor:

why not just current_tool_loop: set[EventID] = set()?

I believe the logic stays the same and we don't have two types for one variable

Collaborator Author:

I like having the explicit None to indicate we're not in a tool loop instead of relying on an empty set for that check. The logic is basically the same but the extra typing requirements make the individual cases more clear (at least to my eyes).

# loops) are a subset of all the events. If a tool loop in the view isn't
# present in the total list of tool loops that indicates some element has
# been forgotten and we have to remove the remaining elements from the view.
if view_tool_loop not in all_tool_loops:
Contributor:

I’m not sure about the typical size of all_tool_loops, but if it starts to grow, we should consider modifying the method to return a set.

Since the current implementation returns a list of sets (and sets aren't hashable), we could convert each set into a string and return a set[str] instead.

This would reduce the complexity from O(n**2) to O(n).

If the length is always small, feel free to ignore this, better to keep it readable than to over-engineer it!
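For what it's worth, frozenset gets the same O(1) membership without converting to strings; a sketch under the same assumptions (function name invented):

```python
def as_hashable(all_tool_loops: list[set[str]]) -> set[frozenset[str]]:
    # frozenset is hashable, so membership checks average O(1) instead of
    # scanning a list of sets.
    return {frozenset(loop) for loop in all_tool_loops}


loops = as_hashable([{"a1", "o1"}, {"a2", "o2"}])
assert frozenset({"a1", "o1"}) in loops
```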

Collaborator Author:

This might be slightly premature optimization. You're right that the check can be optimized, but the more substantial optimization is going to be eliminating the dependency on all_events in the first place. We'll have to push some of this indexing deeper into the conversation state, but there's no reason we shouldn't be incrementally computing and storing tool loops and batches there instead of recomputing them each time in the view.

Collaborator:

Sorry to interfere here, but maybe it's worth considering not storing agent event information, and instead storing information about the user messages. I think maybe we get the same result, except user messages are much fewer, and maybe it all becomes easier?

csmith49 (Collaborator Author):

I'd love it if we can give it a little thought, or maybe I just need to understand a bit better, because the most important change, it seems to me, is that now we would force all other LLMs to get their events summarized if they were in an agent's turn ("tool loop").

Good catch, definitely unintended behavior. I've added the thinking block requirements back to the tool loop so now it only triggers for Anthropic models.

On the other hand, if we do this... does that mean we could just do full view? All the time, maybe?...If I understand correctly and those must have been full event view condensations, then how much reason is there left, to do less than full view?

Depends on the event distribution, but yeah, Opus has been operating like this for at least a month. I still think our condensation approach is better than collapsing the whole view when the models support it though. Our earlier experiments definitely indicated a slight performance boost for weaker models.

csmith49 (Collaborator Author):

After re-reading my own comment + skimming the diff again, I think the ordering contract for ManipulationIndices is the only truly “this will bite you” issue I’d keep at 🔴 severity.

Reasonable. I caught earlier examples of this but missed a few test cases that were still casting them to a list. Those tests now rely on set comparisons instead.

enyst added the condenser-test label (triggers a run of all condenser integration tests) Feb 19, 2026
github-actions bot (Contributor):

Hi! I started running the condenser tests on your PR. You will receive a comment with the results shortly.

Note: These are non-blocking tests that validate condenser functionality across different LLMs.

github-actions bot (Contributor):

Condenser Test Results (Non-Blocking)

These tests validate condenser functionality and do not block PR merges.

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.95
Models Tested: 2
Timestamp: 2026-02-19 18:01:17 UTC

📊 Summary

Model                                              Overall  Tests Passed  Skipped  Total  Cost   Tokens
litellm_proxy_anthropic_claude_opus_4_5_20251101   100.0%   5/5           0        5      $0.87  373,859
litellm_proxy_gpt_5.1_codex_max                    100.0%   2/2           3        5      $0.08  62,739

📋 Detailed Results

litellm_proxy_anthropic_claude_opus_4_5_20251101

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.87
  • Token Usage: prompt: 357,559, completion: 16,300, cache_read: 304,516, cache_write: 39,704, reasoning: 1,055
  • Run Suffix: litellm_proxy_anthropic_claude_opus_4_5_20251101_2de10f3_opus_condenser_run_N5_20260219_175642

litellm_proxy_gpt_5.1_codex_max

  • Success Rate: 100.0% (2/2)
  • Total Cost: $0.08
  • Token Usage: prompt: 59,232, completion: 3,507, cache_read: 29,696, reasoning: 1,472
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_2de10f3_gpt51_condenser_run_N5_20260219_175642
  • Skipped Tests: 3

Skipped Tests:

  • c01_thinking_block_condenser: Model litellm_proxy/gpt-5.1-codex-max does not support extended thinking or reasoning effort
  • c05_size_condenser: This test stresses long repetitive tool loops to trigger size-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.
  • c04_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

Enforcement is intended as a fallback mechanism to handle edge cases, bad data, or
unforeseen situations. Because enforcement assumes the view is in a bad state, it
often requires a much larger perspective on the events and therefore depends on a
sequence of _all_ events in the conversation.
Collaborator:

I think maybe we need to try not to depend on all events in the conversation; maybe we have alternatives?

  • fallback to forgetting all view minus keep_first
  • fallback to system prompt + user message(s) + a summary of agent actions in the view
  • or worse case, system prompt + some info that things went badly and all we know is that this was user's task, maybe summary, maybe ask the user for more

For example, I feel the cloud has difficulty keeping up with 2k events; sandboxes keep crashing and getting lost. I don't know the exact reason, but I know I haven't managed to have a longer conversation for a while... (on v1)

Collaborator Author:

Agreed we want to streamline some of these operations to avoid dealing with the whole event stream. But enforcing the properties we have unfortunately requires information from events outside the view.

Simple example: say I look at the current view and see a single action/observation pair. If I want to enforce batch atomicity, the only way I can know that the pair isn't part of a larger batch is to know that there are no other action events with matching llm_response_id values in the larger context.

There are a few solutions to avoiding doing that:

  1. Compute some kind of batch map in the conversation state and expose that. We'd have to do that tracking for each property we want to enforce and make sure it gets propagated to the view.
  2. Make sure the batch is never split in the first place. That's what the manipulation indices are trying to do.

In an ideal world we don't need the enforcement at all, just the manipulation indices. That's why every time enforcement happens it logs a warning. If we can verify that we don't see those warnings in practice, we can disable enforcement altogether (or only trigger it when we detect some error and need to recover).
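A self-contained illustration of the batch check described above (event fields simplified; llm_response_id is the field named in this thread):

```python
from dataclasses import dataclass


@dataclass
class Ev:  # simplified event: unique id plus its originating batch
    id: str
    llm_response_id: str


def batch_intact(view: list[Ev], all_events: list[Ev], batch_id: str) -> bool:
    # The batch is atomic in the view only if every event sharing this
    # llm_response_id in the full stream is still present in the view.
    in_view = {e.id for e in view if e.llm_response_id == batch_id}
    in_all = {e.id for e in all_events if e.llm_response_id == batch_id}
    return in_view == in_all


full = [Ev("act1", "b1"), Ev("obs1", "b1")]
assert batch_intact(full, full, "b1")
assert not batch_intact(full[:1], full, "b1")  # half the batch was forgotten
```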

# of the all_events sequence -- if the batch ids in the view aren't exactly
# one-to-one with the batch ids generated by the all_events sequence, that
# can only mean something has been forgotten and we need to drop the entire
# batch.
Collaborator:

Just a thought, not a review comment. FWIW, I don't know if we tested that this is really necessary, but maybe I'm missing it. It might be worth having a specific test if we don't.

Collaborator Author:

Batch atomicity? I assume it was needed at some point; it's a carry-over from v0. Probably model-specific behavior. With this PR we can disable it pretty easily by just not applying that property.

Collaborator:

I'm not sure, I thought it was new:

It seemed the PR fixed act2 / summary / obs2, but I think it also dropped act1 and obs1, which I don't know if we had to do, or tested for.

I think it would be great to test with act1 / obs1, assuming that we still avoid inserting a summary between 🤔

Collaborator Author:

Comes from before that: #775

We need a better way to actually stress the API assumptions made by various providers, but I think that's beyond the scope of this PR.

Collaborator:

I completely forgot that PR! And I reviewed it at the time 😭

Thank you, yes I agree.

enyst (Collaborator) left a comment:

Thank you for this!

I do have a question re: previous results showing that weaker LLMs do better with ~half the events: when did we last test that, and what qualified then as "weaker" LLMs? If it's old enough, we really may want to re-eval/reconsider; LLM capabilities have moved ahead a few lifetimes. 😅

I'd like to (try to) make a few notes:

  • I'm starting to feel it's worth playing with the idea of restating the tool loops constraint as a user messages constraint
  • they're much fewer, and we can know exactly what's between them (I hope?)
  • we need to track user messages for a hook that is broken; maybe we could use that tracking for both features
  • it might be worth considering deleting the keep_first attribute; in V1 we no longer need a minimum of 4, since the additional environment info that used to be a recall obs is now a suffix of the system prompt, so I think we need two: system and user. That may save us from those bugs where we try to enforce keep_first but bump into the first events of a tool loop
  • we might want to test for batch atomicity, as noted, though on the other hand, I suspect you might be right that it was happening in some other form in V0... 🤔 (update: you just showed it was older)
  • it's worth thinking, IMHO, about whether the summary of all events should work exactly the same way as a partial view; for example codex-cli adds to the prompt that a summary just happened and nudges the agent to re-investigate a little to get its bearings again
  • I have seen in other parts of the codebase a need to know the current view; it could be great if we could somehow expose it

Maybe we could make issues for a few of these, to look into, if they don't sound totally off-base?

csmith49 merged commit c687179 into main Feb 19, 2026
61 of 62 checks passed
csmith49 deleted the feat/view-properties branch February 19, 2026 20:48
csmith49 (Collaborator Author):

Maybe we could make issues for a few of these, to look into, if they don't sound totally off-base?

I think some of these are totally warranted. My mind is on some short-term tasks this PR unblocked, let me sit on this for a day or so and I'll get back to you.

csmith49 (Collaborator Author):

I do have a question re: previous results showing that weaker LLMs do better with ~half the events: when did we last test that, and what qualified then as "weaker" LLMs? If it's old enough, we really may want to re-eval/reconsider; LLM capabilities have moved ahead a few lifetimes. 😅

Oh, this would have been way back when the condenser strategies were first being tested. We're due for a re-evaluation. It's been on my radar, but I haven't had any free cycles to get it set up.

  • I'm starting to feel it's worth it to play with the idea of restating the tool loops constraint as a user messages constraint
  • they're much fewer, we can know exactly what's between them (I hope?)
  • we need to track user messages for a hook that is broken; maybe we could use that tracking for both features

Tracking user messages seems helpful, but I'd be careful about restating the tool loop constraint as a negative like that. It could be a useful optimization to help detect boundaries, but we'd still have to double-check that the tool loop is actually a tool loop (action and observation events only, starts with thinking blocks, none in between, etc.), as sketched below.
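A hedged sketch of that double-check (simplified event model; the criteria are the ones listed above):

```python
from dataclasses import dataclass


@dataclass
class Ev:
    kind: str  # "action", "observation", ...
    thinking_block: str | None = None


def is_atomic_tool_loop(run: list[Ev]) -> bool:
    # A candidate run found via user-message boundaries still has to be a
    # real tool loop: action/observation events only, with the first event
    # carrying a thinking block.
    return (bool(run)
            and run[0].thinking_block is not None
            and all(e.kind in ("action", "observation") for e in run))
```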

  • it might be worth considering to delete keep_first attribute; in V1 we no longer need 4 minimum, the additional environment info that used to be a recall obs is now a suffix of the system prompt so I think we need two: system and user. That may save us from those bugs when we try to enforce keep_first but bump into the first events of a tool loop

The default for keep_first is now 2. Some users have saved CLI settings that haven't updated, unfortunately.

We could remove it, but we definitely want to keep at least the system prompt event. So... maybe keep the attribute and set it to 1 if that's the behavior we want?

  • it's worth IMHO to think whether the summary of all events should be exactly the same way as partial view; for example codex-cli adds to the prompt that a summary just happened and nudges the agent to re-investigate a little to get its bearings again

Good point, our "all event" summary is handled by the hard context reset summary generation, so we've already got an entry point to specialize the behavior a bit.

  • I have seen in other parts of the codebase a need to know the current view, it could be great if we could somehow expose it

Started working on #2141 yesterday (which I notice you've already seen).
