[BUG] Agent produces MessageEvent without ActionEvent, causing premature execution termination #1349

@simonrosenberg

Summary

Agent execution incorrectly terminates when the agent produces a MessageEvent with reasoning/thinking content but fails to follow it with an ActionEvent. The system marks the execution as "finished" even though:

  • The agent never called the finish tool
  • Max iterations not reached (only 66/500 used)
  • No timeout or error occurred
  • Agent's reasoning clearly shows intent to continue with next steps

Impact

Severity: P0 - Critical
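Affects: GAIA evaluation success rate (1 of 3 instances in the observed run failed because of this bug)
User Impact: Evaluation results show a 66.7% success rate (2/3). The failed instance was progressing normally when it was cut off, so the true capability is likely higher (estimated 90-100%)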

Reproduction

Evaluation Run Details

  • Run ID: 20026595169
  • Model: claude-sonnet-4-5-20250929 via litellm_proxy
  • Failed Instance ID: c61d22de-5f6c-4958-a7f6-5e9707bd3466
  • Max iterations: 500 (only 66 used before termination)
  • Execution time: about 7.7 minutes (11:34:06 - 11:41:47)

Artifact Locations

GCS Bucket: gs://openhands-evaluation-results/
Artifact File: sdk-main-20026595169-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-08-11-43.tar.gz
Conversation Archive: conversations/c61d22de-5f6c-4958-a7f6-5e9707bd3466.tar.gz (within artifact)

Direct GCS URLs:

gs://openhands-evaluation-results/sdk-main-20026595169-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-08-11-43.tar.gz

Bug Description

Event Sequence (Buggy Behavior)

The conversation trajectory shows this pattern at termination:

Event 201: MessageEvent (kind: "MessageEvent", source: "agent")
  - content: "Let me search for all papers with arXiv ID 1608.03* that might contain these keywords:"
  - reasoning_content: Shows agent planning next steps (see details below)
  - thinking_blocks: Contains extended reasoning about search strategy

Event 202: ConversationStateUpdateEvent (kind: "ConversationStateUpdateEvent", source: "environment")
  - key: "execution_status"
  - value: "finished" ← INCORRECT!

[NO ActionEvent produced between Event 201 and 202]
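
The pattern above can be checked mechanically. The following is a minimal sketch, assuming the trajectory has already been loaded as a list of dicts with the `kind`, `source`, `key` and `value` fields shown in Events 201 and 202. The function name and the sample trajectory are hypothetical and are not part of the SDK.

```python
from typing import Iterable, Optional


def find_premature_finish(events: Iterable[dict]) -> Optional[int]:
    """Return the index of an agent MessageEvent that is followed directly by a
    'finished' status update, or None if no such pair exists."""
    previous = None
    for index, event in enumerate(events):
        is_finished_update = (
            event.get("kind") == "ConversationStateUpdateEvent"
            and event.get("key") == "execution_status"
            and event.get("value") == "finished"
        )
        if (
            is_finished_update
            and previous is not None
            and previous.get("kind") == "MessageEvent"
            and previous.get("source") == "agent"
        ):
            return index - 1
        previous = event
    return None


# Hypothetical, trimmed-down stand-ins for the last three events of the run.
trajectory = [
    {"kind": "ObservationEvent", "source": "environment"},
    {"kind": "MessageEvent", "source": "agent"},
    {
        "kind": "ConversationStateUpdateEvent",
        "source": "environment",
        "key": "execution_status",
        "value": "finished",
    },
]
print(find_premature_finish(trajectory))  # prints 1
```

A hit only flags a candidate: whether a message-only turn is ever a legitimate way to finish depends on SDK behavior that this report does not establish.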

Agent's Last Reasoning (Event 201)

The agent was actively working and planning next steps:

Hmm, none of these papers contain the keywords. Let me think about this more carefully. 
Perhaps the keywords are used in a different form or the submission date isn't exactly 
August 11, 2016. Let me try searching for "equalitarian" (an alternative spelling) or 
"egalitarian society".

Wait, let me reconsider. The task says "submitted to arXiv.org on August 11, 2016". 
This means the submission date should be exactly August 11, 2016. Let me list ALL papers 
with arXiv IDs starting with 1608.03 from physics.soc-ph to make sure I haven't missed any.

Actually, let me search more broadly by looking at all 1608.03 papers in any category 
that might be related to society or physics.soc-ph, and check each one.

The agent clearly intended to continue but no ActionEvent was produced.

Expected Behavior

After Event 201 (MessageEvent with reasoning), the system should:

  1. Wait for ActionEvent: Agent should produce an ActionEvent to execute the planned search
  2. Continue execution: Process the action and return ObservationEvent
  3. Iterate: Continue agent loop until one of these conditions:
    • ✅ Agent calls finish tool explicitly
    • ✅ Max iterations reached (500)
    • ✅ Timeout occurs
    • ✅ Error/exception thrown

Actual Behavior

System immediately marks execution as "finished" after MessageEvent, causing:

  • ❌ No ActionEvent produced
  • ❌ No opportunity for agent to continue
  • ❌ Task incomplete (no <solution> tag in output)
  • ❌ Evaluation marked as failed

Evidence from Logs

Token Usage (Proves agent was actively working)

From conversation trajectory Event 203 (final stats):

{
  "accumulated_token_usage": {
    "model": "litellm_proxy/claude-sonnet-4-5-20250929",
    "prompt_tokens": 4045445,
    "completion_tokens": 11617,
    "cache_read_tokens": 3933291,
    "cache_write_tokens": 111950,
    "reasoning_tokens": 4583
  }
}

Key observation: This failed instance had the HIGHEST token usage (4.0M prompt tokens) and BEST cache efficiency (97% of prompt tokens read from cache) compared to successful instances, which indicates the agent was still actively working on the problem when termination occurred.

Action Count (Proves max iterations not hit)

  • Actions completed: 66
  • Max iterations: 500
  • Capacity remaining: 87%

No iteration limit was hit.
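
The derived percentages in this section can be recomputed from the raw counts as a sanity check. This sketch assumes "cache efficiency" means cache-read tokens divided by prompt tokens; that definition is an assumption, since the report does not spell it out.

```python
# Raw counts copied from the Event 203 token stats and the action count above.
prompt_tokens = 4_045_445
cache_read_tokens = 3_933_291
actions_completed = 66
max_iterations = 500

# Assumed definition: share of prompt tokens that were served from cache.
cache_hit_ratio = cache_read_tokens / prompt_tokens

# Share of the iteration budget that was still unused at termination.
capacity_remaining = (max_iterations - actions_completed) / max_iterations

print(f"cache hit ratio:    {cache_hit_ratio:.1%}")
print(f"capacity remaining: {capacity_remaining:.1%}")
```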

Log Entry

From logs/instance_c61d22de-5f6c-4958-a7f6-5e9707bd3466.log:

2025-12-08 11:41:47,680 - WARNING - benchmarks.gaia.run_infer - No <solution> tag found in: Let me search for all papers with arXiv ID 1608.03* that might contain these keywords:...
2025-12-08 11:41:47,680 - INFO - benchmarks.gaia.run_infer - Instance c61d22de-5f6c-4958-a7f6-5e9707bd3466: score=False, model_answer='Let me search for all papers with arXiv ID 1608.03* that might contain these keywords:', ground_truth='egalitarian'

The warning correctly identifies that no solution was provided, but the system proceeded to mark the task as complete anyway.
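
To illustrate what the warning is checking, here is a hypothetical sketch of a `<solution>` tag extractor. It is not the code in `benchmarks.gaia.run_infer`; the function name and regular expression are assumptions. Returning `None` when the tag is absent lets a caller tell "no answer given" apart from "wrong answer".

```python
import re
from typing import Optional

# Non-greedy match so that only the first <solution>...</solution> pair is taken;
# DOTALL lets the answer span multiple lines.
SOLUTION_PATTERN = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def extract_solution(final_message: str) -> Optional[str]:
    """Return the text inside the first <solution> tag, or None if there is none."""
    match = SOLUTION_PATTERN.search(final_message)
    return match.group(1).strip() if match else None


last_message = (
    "Let me search for all papers with arXiv ID 1608.03* "
    "that might contain these keywords:"
)
print(extract_solution(last_message))                             # None
print(extract_solution("Done. <solution>egalitarian</solution>"))  # egalitarian
```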

Comparison with Successful Runs

Successful Instance (for reference)

Instance ID: 17b5a6a3-bc87-42e8-b0fb-6ab0781ef2cc

Normal event pattern:

Event N:   MessageEvent (agent reasoning)
Event N+1: ActionEvent (e.g., BrowserNavigateAction, TerminalAction)
Event N+2: ObservationEvent (results)
Event N+3: ConversationStateUpdateEvent (stats update)
...continues until agent provides <solution> tag and finishes naturally

Execution stats:

  • Actions: 185 (normal operation)
  • Duration: 10 minutes
  • Result: ✅ Success (provided correct answer with <solution> tags)
  • Execution status: "finished" (CORRECT - agent called finish tool)

Root Cause Analysis

The issue appears to be in the agent execution loop where:

  1. Agent produces MessageEvent with reasoning content
  2. Something fails to trigger ActionEvent generation
  3. System interprets lack of ActionEvent as completion
  4. Execution terminates prematurely

Possible Causes

  1. LLM response parsing issue: Agent's response may not have been properly parsed to extract tool calls
  2. State machine bug: Agent state machine may incorrectly transition to "finished" after MessageEvent
  3. Missing validation: No validation that MessageEvent should be followed by ActionEvent or explicit finish call
  4. Timeout on LLM call: If LLM call times out while generating action, system may default to finishing

Proposed Fix

Validation Check

Add validation after MessageEvent:

def handle_message_event(message_event):
    """Return True if the agent loop should keep running after this message."""
    # Process message event
    ...

    # A message that accompanies an explicit finish call ends the run.
    if agent_called_finish_tool(message_event):
        return False

    # Otherwise the agent MUST produce an ActionEvent next.
    # wait_for_next_event is assumed to return None if the timeout expires.
    next_event = wait_for_next_event(timeout=30)

    if next_event is None or next_event.kind != "ActionEvent":
        # Log warning and prompt agent to continue
        logger.warning(
            "Agent produced MessageEvent without ActionEvent. Message: %s",
            truncate(message_event.content),
        )
        # Inject continuation prompt
        inject_system_message("Please provide your next action.")

    # Keep the loop alive in both cases; never finish on a message alone.
    return True

Termination Conditions

Execution should terminate ONLY when one of the following holds, and should be marked "finished" only when the finish tool was called:

def should_terminate(conversation_state):
    return (
        conversation_state.finish_tool_called or
        conversation_state.action_count >= MAX_ITERATIONS or
        conversation_state.has_error or
        conversation_state.timeout_exceeded
    )

Never terminate based on MessageEvent alone.
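
One risk with re-prompting is an agent that keeps replying with messages only. The toy loop below is a sketch of how the rule above could be combined with a cap on consecutive nudges. `Step`, `run_agent_loop`, `scripted_agent`, `MAX_CONSECUTIVE_NUDGES` and the `"stuck"` status are all hypothetical names for illustration and do not come from the SDK.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ITERATIONS = 500
MAX_CONSECUTIVE_NUDGES = 3  # hypothetical cap, not an SDK setting


@dataclass
class Step:
    kind: str          # "message", "action" or "finish"
    content: str = ""


def run_agent_loop(next_step: Callable[[List[str]], Step]) -> str:
    """Drive a toy agent and return the reason the loop stopped."""
    prompts: List[str] = []
    actions = 0
    nudges = 0
    while actions < MAX_ITERATIONS:
        step = next_step(prompts)
        if step.kind == "finish":
            return "finished"
        if step.kind == "action":
            actions += 1
            nudges = 0  # progress was made, so reset the nudge counter
            continue
        # Message-only turn: do not finish, ask the agent to act instead.
        nudges += 1
        if nudges > MAX_CONSECUTIVE_NUDGES:
            return "stuck"  # surfaced as its own status, not as "finished"
        prompts.append("Please provide your next action.")
    return "max_iterations"


def scripted_agent(script: List[Step]) -> Callable[[List[str]], Step]:
    """Build a fake agent that replays a fixed list of steps."""
    steps = iter(script)
    return lambda prompts: next(steps)


script = [
    Step("action"),
    Step("message", "Let me search for all papers ..."),
    Step("action"),
    Step("finish"),
]
print(run_agent_loop(scripted_agent(script)))  # prints "finished"
```

Reporting a distinct status for the capped case keeps a stalled run from being scored as a normal completion.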

Testing

Test Case 1: Reproduce the bug

  1. Download artifact from GCS: gs://openhands-evaluation-results/sdk-main-20026595169-...tar.gz
  2. Extract conversation: conversations/c61d22de-5f6c-4958-a7f6-5e9707bd3466.tar.gz
  3. Examine events 201-202
  4. Verify no ActionEvent between MessageEvent and execution termination

Test Case 2: Verify fix

  1. Re-run instance c61d22de-5f6c-4958-a7f6-5e9707bd3466 with fixed SDK
  2. Verify agent produces ActionEvent after Event 201's MessageEvent
  3. Verify execution continues until agent provides <solution> tag
  4. Expected result: SUCCESS (agent should find answer "egalitarian")

Additional Context

Related Files

From SDK (likely affected):

  • openhands/sdk/conversation/impl/remote_conversation.py - Conversation management
  • Agent execution loop logic
  • Event handling and state machine

Metrics

Success rate impact:

  • Current (with bug): 66.7% (2/3 successful)
  • Expected (after fix): 90-100% (failed instance was performing well until bug)

Cost impact:

  • Failed run cost: $0.59 (wasted due to bug)
  • Re-run cost: ~$0.60
  • Total waste per bug occurrence: ~$1.20

Environment

  • SDK commit: 7ef3881
  • Evaluation image: ghcr.io/openhands/eval-agent-server:7ef3881-gaia-with-mcp
  • Runtime: Kubernetes (GKE)
  • Max iterations: 500
  • Evaluation date: 2025-12-08

References

  • Evaluation run: 20026595169
  • GitHub Actions workflow: eval-job.yml
  • Benchmark: GAIA 2023 validation set
  • Full analysis: [Internal artifact analysis]

Priority: P0 - Critical
Labels: bug, evaluation, sdk, agent-execution
Assignee: SDK team
Milestone: Next release
