[BUG] Agent produces MessageEvent without ActionEvent, causing premature execution termination
Summary
Agent execution incorrectly terminates when the agent produces a MessageEvent with reasoning/thinking content but fails to follow it with an ActionEvent. The system marks the execution as "finished" even though:
- The agent never called the finish tool
- Max iterations not reached (only 66/500 used)
- No timeout or error occurred
- Agent's reasoning clearly shows intent to continue with next steps
Impact
Severity: P0 - Critical
Affects: GAIA evaluation success rate (observed 1/3 failures due to this bug)
User Impact: Evaluation results show 66.7% success rate when true capability is likely 90-100%
Reproduction
Evaluation Run Details
- Run ID: 20026595169
- Model: claude-sonnet-4-5-20250929 via litellm_proxy
- Failed Instance ID: c61d22de-5f6c-4958-a7f6-5e9707bd3466
- Max iterations: 500 (only 66 used before termination)
- Execution time: 7.5 minutes (11:34:06 - 11:41:47)
Artifact Locations
GCS Bucket: gs://openhands-evaluation-results/
Artifact File: sdk-main-20026595169-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-08-11-43.tar.gz
Conversation Archive: conversations/c61d22de-5f6c-4958-a7f6-5e9707bd3466.tar.gz (within artifact)
Direct GCS URLs:
gs://openhands-evaluation-results/sdk-main-20026595169-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-08-11-43.tar.gz
Bug Description
Event Sequence (Buggy Behavior)
The conversation trajectory shows this pattern at termination:
Event 201: MessageEvent (kind: "MessageEvent", source: "agent")
- content: "Let me search for all papers with arXiv ID 1608.03* that might contain these keywords:"
- reasoning_content: Shows agent planning next steps (see details below)
- thinking_blocks: Contains extended reasoning about search strategy
Event 202: ConversationStateUpdateEvent (kind: "ConversationStateUpdateEvent", source: "environment")
- key: "execution_status"
- value: "finished" ← INCORRECT!
[NO ActionEvent produced between Event 201 and 202]
Agent's Last Reasoning (Event 201)
The agent was actively working and planning next steps:
> Hmm, none of these papers contain the keywords. Let me think about this more carefully.
> Perhaps the keywords are used in a different form or the submission date isn't exactly
> August 11, 2016. Let me try searching for "equalitarian" (an alternative spelling) or
> "egalitarian society".
>
> Wait, let me reconsider. The task says "submitted to arXiv.org on August 11, 2016".
> This means the submission date should be exactly August 11, 2016. Let me list ALL papers
> with arXiv IDs starting with 1608.03 from physics.soc-ph to make sure I haven't missed any.
>
> Actually, let me search more broadly by looking at all 1608.03 papers in any category
> that might be related to society or physics.soc-ph, and check each one.
The agent clearly intended to continue but no ActionEvent was produced.
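The buggy pattern can be detected mechanically in an exported trajectory. Below is a minimal sketch, assuming events are available as dicts with `kind`, `source`, and (for state updates) `key`/`value` fields matching the excerpts above; the exact serialized schema may differ:

```python
def find_premature_finish(events):
    """Return indices where an agent MessageEvent is immediately
    followed by a 'finished' state update, with no ActionEvent between."""
    hits = []
    for i in range(len(events) - 1):
        cur, nxt = events[i], events[i + 1]
        if (cur.get("kind") == "MessageEvent"
                and cur.get("source") == "agent"
                and nxt.get("kind") == "ConversationStateUpdateEvent"
                and nxt.get("key") == "execution_status"
                and nxt.get("value") == "finished"):
            hits.append(i)
    return hits

# Minimal reproduction of the Event 201/202 pattern from this report
trajectory = [
    {"kind": "MessageEvent", "source": "agent"},
    {"kind": "ConversationStateUpdateEvent", "source": "environment",
     "key": "execution_status", "value": "finished"},
]
print(find_premature_finish(trajectory))  # [0]
```

Running this over the extracted conversation archive would flag Event 201 directly, and could double as a regression check for the fix.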
Expected Behavior
After Event 201 (MessageEvent with reasoning), the system should:
- Wait for ActionEvent: Agent should produce an ActionEvent to execute the planned search
- Continue execution: Process the action and return ObservationEvent
- Iterate: Continue agent loop until one of these conditions:
- ✅ Agent calls finish tool explicitly
- ✅ Max iterations reached (500)
- ✅ Timeout occurs
- ✅ Error/exception thrown
Actual Behavior
System immediately marks execution as "finished" after MessageEvent, causing:
- ❌ No ActionEvent produced
- ❌ No opportunity for agent to continue
- ❌ Task incomplete (no <solution> tag in output)
- ❌ Evaluation marked as failed
Evidence from Logs
Token Usage (Proves agent was actively working)
From conversation trajectory Event 203 (final stats):
```json
{
  "accumulated_token_usage": {
    "model": "litellm_proxy/claude-sonnet-4-5-20250929",
    "prompt_tokens": 4045445,
    "completion_tokens": 11617,
    "cache_read_tokens": 3933291,
    "cache_write_tokens": 111950,
    "reasoning_tokens": 4583
  }
}
```

Key observation: this failed instance had the HIGHEST accumulated prompt-token usage (4.0M) and the BEST cache-read ratio (~97%) of the three instances, indicating the agent was actively working on the problem when termination occurred.
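The cache-efficiency figure can be recomputed directly from these stats (a quick check; here "cache efficiency" is taken to mean the share of prompt tokens served from cache):

```python
usage = {
    "prompt_tokens": 4045445,
    "cache_read_tokens": 3933291,
}

# Fraction of prompt tokens that were cache reads
cache_ratio = usage["cache_read_tokens"] / usage["prompt_tokens"]
print(f"{cache_ratio:.1%}")  # 97.2%
```

So roughly 97% of the 4.0M prompt tokens were served from cache, consistent with an agent in a long, stable tool-use loop right up to the moment of termination.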
Action Count (Proves max iterations not hit)
- Actions completed: 66
- Max iterations: 500
- Capacity remaining: 87%
No iteration limit was hit.
Log Entry
From logs/instance_c61d22de-5f6c-4958-a7f6-5e9707bd3466.log:
```
2025-12-08 11:41:47,680 - WARNING - benchmarks.gaia.run_infer - No <solution> tag found in: Let me search for all papers with arXiv ID 1608.03* that might contain these keywords:...
2025-12-08 11:41:47,680 - INFO - benchmarks.gaia.run_infer - Instance c61d22de-5f6c-4958-a7f6-5e9707bd3466: score=False, model_answer='Let me search for all papers with arXiv ID 1608.03* that might contain these keywords:', ground_truth='egalitarian'
```
The warning correctly identifies that no solution was provided, but the system proceeded to mark the task as complete anyway.
Comparison with Successful Runs
Successful Instance (for reference)
Instance ID: 17b5a6a3-bc87-42e8-b0fb-6ab0781ef2cc
Normal event pattern:
Event N: MessageEvent (agent reasoning)
Event N+1: ActionEvent (e.g., BrowserNavigateAction, TerminalAction)
Event N+2: ObservationEvent (results)
Event N+3: ConversationStateUpdateEvent (stats update)
...continues until agent provides <solution> tag and finishes naturally
Execution stats:
- Actions: 185 (normal operation)
- Duration: 10 minutes
- Result: ✅ Success (provided correct answer with <solution> tags)
- Execution status: "finished" (CORRECT: the agent called the finish tool)
Root Cause Analysis
The issue appears to be in the agent execution loop, where:
1. The agent produces a MessageEvent with reasoning content
2. ActionEvent generation is never triggered (cause unknown; candidates below)
3. The system interprets the absence of an ActionEvent as completion
4. Execution terminates prematurely
Possible Causes
- LLM response parsing issue: Agent's response may not have been properly parsed to extract tool calls
- State machine bug: Agent state machine may incorrectly transition to "finished" after MessageEvent
- Missing validation: No validation that MessageEvent should be followed by ActionEvent or explicit finish call
- Timeout on LLM call: If LLM call times out while generating action, system may default to finishing
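To illustrate cause 1: a response parser that treats "no tool calls" as "finished" produces exactly this failure mode. A hypothetical sketch (none of these names come from the SDK); the safer rule only finishes on an explicit finish tool call and keeps the loop running for text-only responses:

```python
def interpret_response(response):
    """Map a raw LLM response dict to the next loop decision.

    Buggy rule:  no tool_calls            => "finished"
    Safer rule:  explicit finish call     => "finished"
                 text-only reasoning      => "continue" (re-prompt)
                 any other tool call      => "act"
    """
    calls = response.get("tool_calls") or []
    if any(c.get("name") == "finish" for c in calls):
        return "finished"
    if not calls:
        # Reasoning without an action: keep the loop alive
        return "continue"
    return "act"

print(interpret_response({"content": "Let me search ...", "tool_calls": []}))  # continue
print(interpret_response({"tool_calls": [{"name": "finish"}]}))  # finished
```

Under the buggy rule, Event 201 (reasoning text, no tool call) would be mapped straight to "finished", matching the observed trajectory.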
Proposed Fix
Validation Check
Add a validation step after each MessageEvent (helper names such as `agent_called_finish_tool`, `wait_for_next_event`, and `inject_system_message` are illustrative):

```python
def handle_message_event(message_event):
    # Process the message event as usual
    ...
    # A MessageEvent without an explicit finish call MUST be
    # followed by an ActionEvent; otherwise re-prompt the agent.
    if not agent_called_finish_tool(message_event):
        next_event = wait_for_next_event(timeout=30)
        if next_event is None or next_event.kind != "ActionEvent":
            logger.warning(
                "Agent produced MessageEvent without ActionEvent. "
                "Message: %s", truncate(message_event.content)
            )
            # Inject a continuation prompt; the agent loop resumes
            # instead of marking the execution finished.
            inject_system_message("Please provide your next action.")
```

Termination Conditions
Execution should ONLY mark status as "finished" when:
```python
def should_terminate(conversation_state):
    return (
        conversation_state.finish_tool_called
        or conversation_state.action_count >= MAX_ITERATIONS
        or conversation_state.has_error
        or conversation_state.timeout_exceeded
    )
```

Never terminate based on a MessageEvent alone.
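Under this rule, a state where the agent has only emitted a MessageEvent does not terminate. A minimal, self-contained sketch with a stand-in state object (the real SDK ConversationState fields may differ):

```python
from dataclasses import dataclass

MAX_ITERATIONS = 500

@dataclass
class ConversationState:
    finish_tool_called: bool = False
    action_count: int = 0
    has_error: bool = False
    timeout_exceeded: bool = False

def should_terminate(state: ConversationState) -> bool:
    return (
        state.finish_tool_called
        or state.action_count >= MAX_ITERATIONS
        or state.has_error
        or state.timeout_exceeded
    )

# The failing instance: 66 actions, no finish call => must keep running
print(should_terminate(ConversationState(action_count=66)))          # False
print(should_terminate(ConversationState(finish_tool_called=True)))  # True
```

This doubles as a unit test for the fix: the first case is exactly the state in which the buggy loop marked the execution "finished".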
Testing
Test Case 1: Reproduce the bug
- Download the artifact from GCS: gs://openhands-evaluation-results/sdk-main-20026595169-...tar.gz
- Extract the conversation: conversations/c61d22de-5f6c-4958-a7f6-5e9707bd3466.tar.gz
- Examine Events 201-202
- Verify that no ActionEvent occurs between the MessageEvent and execution termination
Test Case 2: Verify fix
- Re-run instance c61d22de-5f6c-4958-a7f6-5e9707bd3466 with fixed SDK
- Verify agent produces ActionEvent after Event 201's MessageEvent
- Verify execution continues until the agent provides a <solution> tag
- Expected result: SUCCESS (the agent should find the answer "egalitarian")
Additional Context
Related Files
From SDK (likely affected):
- openhands/sdk/conversation/impl/remote_conversation.py (conversation management)
- Agent execution loop logic
- Event handling and state machine
Metrics
Success rate impact:
- Current (with bug): 66.7% (2/3 successful)
- Expected (after fix): 90-100% (failed instance was performing well until bug)
Cost impact:
- Failed run cost: $0.59 (wasted due to bug)
- Re-run cost: ~$0.60
- Total waste per bug occurrence: ~$1.20
Environment
- SDK commit: 7ef3881
- Evaluation image: ghcr.io/openhands/eval-agent-server:7ef3881-gaia-with-mcp
- Runtime: Kubernetes (GKE)
- Max iterations: 500
- Evaluation date: 2025-12-08
References
- Evaluation run: 20026595169
- GitHub Actions workflow: eval-job.yml
- Benchmark: GAIA 2023 validation set
- Full analysis: [Internal artifact analysis]
Priority: P0 - Critical
Labels: bug, evaluation, sdk, agent-execution
Assignee: SDK team
Milestone: Next release