Bug: Duplicate ObservationEvent with same tool_call_id causes LLM API error on conversation resume #1782

@jpshackelford

Bug Description

⚠️ This bug puts the conversation in an unrecoverable state. Once triggered, the agent cannot proceed and the conversation is permanently stuck. Every subsequent attempt to run the agent fails with the same error.

When a conversation is resumed after being paused/finished, a duplicate ObservationEvent can be created with the same tool_call_id as an existing observation. This causes the Anthropic API to reject the request with:

litellm.BadRequestError: Error code: 400 - {
  "error": {
    "message": "litellm.BadRequestError: AnthropicException - {
      \"type\": \"error\",
      \"error\": {
        \"type\": \"invalid_request_error\",
        \"message\": \"messages.59: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_01CGASf7KnafqkuQMuLstHi4. Each `tool_use` block must have a corresponding `tool_result` block in the next message.\"
      },
      \"request_id\": \"req_011CXMLWpMmp7TBFbCVyLXY3\"
    }
    No fallback model group found for original model_group=prod/claude-opus-4-5-20251101.
    Fallbacks=[{"qwen3-coder-480b": ["qwen3-coder-480b-or"]}, {"glm-4.5": ["glm-4.5-or"]}].
    Received Model Group=prod/claude-opus-4-5-20251101
    Available Model Group Fallbacks=None",
    "type": null,
    "param": null,
    "code": "400"
  }
}

Because the duplicate observation is persisted in the event stream, the conversation cannot recover: every LLM call includes the malformed message history and fails.

Steps to Reproduce

Observed in conversation af22d964cb9f464e957320d647c7471e:

  1. Start a conversation and let the agent run actions
  2. Let the conversation finish normally (agent calls finish tool)
  3. Wait some time (in this case ~1.5 hours)
  4. Resume the conversation by sending a new user message
  5. The conversation fails with the tool_use/tool_result mismatch error
  6. All subsequent attempts to continue the conversation fail with the same error

Root Cause Analysis

The event stream, retrieved via /api/v1/conversation/{id}/events/search, shows the following sequence:

Event  Timestamp  Type              Details
94     21:04:03   ActionEvent       tool_call_id=toolu_01CGASf7KnafqkuQMuLstHi4
95     21:04:24   ObservationEvent  First observation for action 94
110    21:06:10   ObservationEvent  Finish action - conversation completed
111    21:06:10   StateUpdate       execution_status=finished
       (1.5 hour gap)
114    22:45:53   MessageEvent      User sent new message
116    22:45:54   ObservationEvent  DUPLICATE - same tool_call_id=toolu_01CGASf7KnafqkuQMuLstHi4

The result is:

  • 1 tool_use (ActionEvent) with ID toolu_01CGASf7KnafqkuQMuLstHi4
  • 2 tool_result (ObservationEvents) with the same ID

Claude's API requires exactly one tool_result per tool_use.
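A quick way to confirm this condition in other conversations is to count observations per tool_call_id. The sketch below is only illustrative: the tuple layout `(event_index, type, tool_call_id)` is an assumption for the example, not the actual shape of the events search response, and `duplicated_observation_ids` is a hypothetical helper.

```python
from collections import Counter

# Simplified stand-in for rows from the event stream. The tuple layout
# (event_index, type, tool_call_id) is assumed for illustration only.
events = [
    (94, "ActionEvent", "toolu_01CGASf7KnafqkuQMuLstHi4"),
    (95, "ObservationEvent", "toolu_01CGASf7KnafqkuQMuLstHi4"),
    (114, "MessageEvent", None),
    (116, "ObservationEvent", "toolu_01CGASf7KnafqkuQMuLstHi4"),
]


def duplicated_observation_ids(rows):
    """Return tool_call_ids that have more than one ObservationEvent."""
    counts = Counter(
        tool_call_id
        for _, event_type, tool_call_id in rows
        if event_type == "ObservationEvent" and tool_call_id is not None
    )
    return sorted(tcid for tcid, n in counts.items() if n > 1)


duplicates = duplicated_observation_ids(events)
print(duplicates)
```

Any ID this returns identifies a conversation whose message history would contain more than one tool_result for a single tool_use.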

Likely Causes

  1. Event sync issue during resume: When the conversation was restarted, events may not have been fully loaded from persistence, causing get_unmatched_actions() to incorrectly identify the old action as unmatched and re-execute it.

  2. Missing deduplication in filter_unmatched_tool_calls(): The current implementation in View.filter_unmatched_tool_calls() uses sets to track tool_call_ids. A set records only whether an ID is present, not how many times it occurs, so if there are multiple observations with the same tool_call_id, they are all kept because the ID exists in the action set.
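The second cause can be illustrated with a toy model of set-based matching. This is not the SDK's code: `Ev` and `set_based_filter` are hypothetical names, and the filter is a simplified guess at the general approach (keep events whose tool_call_id appears on both the action and observation side).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ev:
    """Hypothetical minimal event: a kind plus a tool_call_id."""
    kind: str  # "action" or "observation"
    tool_call_id: str


def set_based_filter(events):
    """Keep events whose tool_call_id has both an action and an observation."""
    action_ids = {e.tool_call_id for e in events if e.kind == "action"}
    observation_ids = {e.tool_call_id for e in events if e.kind == "observation"}
    matched = action_ids & observation_ids
    return [e for e in events if e.tool_call_id in matched]


history = [
    Ev("action", "t1"),
    Ev("observation", "t1"),
    Ev("observation", "t1"),  # duplicate created on resume
]
kept = set_based_filter(history)
observations_kept = sum(e.kind == "observation" for e in kept)
print(len(kept), observations_kept)
```

Because membership in `matched` is the only criterion, both observations pass the filter and the duplicate reaches the LLM request.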

Suggested Fixes

Defensive Fix (view.py)

Modify filter_unmatched_tool_calls() to remember which tool_call_ids have already produced an observation and keep only the first observation per tool_call_id:

@staticmethod
def filter_unmatched_tool_calls(
    events: list[LLMConvertibleEvent],
) -> list[LLMConvertibleEvent]:
    # ... existing code ...
    
    # Track which tool_call_ids have already been seen for observations
    seen_observation_tool_call_ids: set[ToolCallID] = set()
    
    result = []
    for event in events:
        if event.id in removed_event_ids:
            continue
        if isinstance(event, ObservationBaseEvent):
            if event.tool_call_id in tool_call_ids_to_remove:
                continue
            # NEW: Skip duplicate observations
            if event.tool_call_id in seen_observation_tool_call_ids:
                logger.warning(
                    f"Skipping duplicate observation for tool_call_id: {event.tool_call_id}"
                )
                continue
            if event.tool_call_id is not None:
                seen_observation_tool_call_ids.add(event.tool_call_id)
        result.append(event)
    return result

This defensive fix would also recover existing stuck conversations by filtering out the duplicate observations at LLM message construction time.
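For experimenting with the deduplication idea outside the SDK, here is a self-contained sketch. The `Event` dataclass and `drop_duplicate_observations` are hypothetical stand-ins, not the real LLMConvertibleEvent types or the real filter, and the sample data mirrors only the events listed in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for an SDK event; not the real class."""
    id: int
    kind: str  # "action", "observation", or "message"
    tool_call_id: Optional[str] = None


def drop_duplicate_observations(events: list[Event]) -> list[Event]:
    """Keep only the first observation for each tool_call_id."""
    seen: set[str] = set()
    result: list[Event] = []
    for event in events:
        if event.kind == "observation" and event.tool_call_id is not None:
            if event.tool_call_id in seen:
                continue  # later duplicate, e.g. event 116 in this issue
            seen.add(event.tool_call_id)
        result.append(event)
    return result


stuck = [
    Event(94, "action", "toolu_01CGASf7KnafqkuQMuLstHi4"),
    Event(95, "observation", "toolu_01CGASf7KnafqkuQMuLstHi4"),
    Event(114, "message"),
    Event(116, "observation", "toolu_01CGASf7KnafqkuQMuLstHi4"),
]
repaired_ids = [e.id for e in drop_duplicate_observations(stuck)]
print(repaired_ids)  # [94, 95, 114]
```

Keeping the first observation rather than the last preserves the one that sits directly after its action (event 95), which is the ordering the error message asks for.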

Root Cause Fix

Investigate why duplicate observations are created during conversation resume. Key locations:

  • openhands-agent-server/openhands/agent_server/event_service.py - the start() method
  • openhands-sdk/openhands/sdk/conversation/state.py - get_unmatched_actions()
  • Event persistence/loading logic during remote sandbox resume
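To make the first hypothesis from Likely Causes concrete, the sketch below shows how an unmatched-action check behaves when it runs over a partially loaded event list. `unmatched_action_ids` is a hypothetical simplification written for this example; it is not the real get_unmatched_actions(), and whether partial loading actually occurs on resume is still unconfirmed.

```python
def unmatched_action_ids(events):
    """Return action tool_call_ids that have no observation in `events`."""
    observed = {tcid for kind, tcid in events if kind == "observation"}
    return [
        tcid for kind, tcid in events
        if kind == "action" and tcid not in observed
    ]


# Complete history: the action at event 94 already has its observation.
full = [
    ("action", "toolu_01CGASf7KnafqkuQMuLstHi4"),
    ("observation", "toolu_01CGASf7KnafqkuQMuLstHi4"),
]

# Assumed failure mode: the observation has not been loaded yet on resume.
partial = full[:1]

print(unmatched_action_ids(full))
print(unmatched_action_ids(partial))
```

With the full history nothing is unmatched, while the partial view reports the old action as pending, which would lead to a second observation being written for it.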

Environment

  • Platform: OpenHands Cloud (app.all-hands.dev)
  • Model: prod/claude-opus-4-5-20251101
  • Conversation ID: af22d964cb9f464e957320d647c7471e

Related Files

  • openhands-sdk/openhands/sdk/context/view.py - filter_unmatched_tool_calls()
  • openhands-sdk/openhands/sdk/conversation/state.py - get_unmatched_actions()
  • openhands-agent-server/openhands/agent_server/event_service.py - start() method

Metadata

  • Labels: bug