[smoke-detector] 🔍 Smoke Test Investigation - Smoke OpenCode Run #51: OpenCode Agent Execution Failure

# 🔍 Smoke Test Investigation - Run #51

## Summary
The Smoke OpenCode workflow failed because the OpenCode agent failed in the "Run OpenCode" step. That step ran for only 5 seconds (the agent job as a whole took 33 seconds), so no safe-outputs were created. The downstream `create_issue` job then failed because it could not find the expected `agent_output.json` file.

## Failure Details
- **Run**: [18931623375]((redacted))
- **Run Number**: 51
- **Commit**: 11cfdb26adf4d0ef2fc7b13067263dea361dacc0
- **Branch**: main
- **Trigger**: schedule
- **Duration**: 1.1 minutes
- **Failed Jobs**: agent (33s), create_issue (10s)
- **Workflow**: Smoke OpenCode

## Root Cause Analysis

### Primary Issue
The OpenCode agent failed in the "Run OpenCode" step after initial setup had completed. The failure came 5 seconds into execution, and because no logs or artifacts were produced, there is no evidence that the agent got far enough to:
- Process the prompt
- Make any API calls
- Use any MCP tools
- Create any output artifacts

### Error Chain

**1. Agent Execution Failure** (agent job, step 22: "Run OpenCode")
```
The agent job failed during the 'Run OpenCode' step
Duration: 5 seconds
No error logs available - agent failed before creating stdio log
```

**2. Missing Artifacts** (agent job)
```
##[warning]No files were found with the provided path: /tmp/gh-aw/safeoutputs/outputs.jsonl
No artifacts will be uploaded.

##[warning]No files were found with the provided path: /tmp/gh-aw/agent-stdio.log
No artifacts will be uploaded.
```

**3. Downstream Failure** (create_issue job)
```
##[error]Error reading agent output file: ENOENT: no such file or directory, open '/tmp/gh-aw/safeoutputs/agent_output.json'
```

### Why Did This Happen?

With no error output to go on, the failure is consistent with any of the following scenarios:

1. **OpenCode Installation Issue**: The OpenCode CLI may not have installed correctly, or its dependencies may be missing
2. **Runtime Error**: OpenCode encountered an unhandled exception during startup
3. **API Authentication**: Anthropic API credentials may be invalid or expired
4. **Transient Infrastructure Issue**: Runner environment had a temporary problem
5. **MCP Server Initialization**: One of the configured MCP servers (github, gh-aw, safeoutputs) failed to initialize

The very short execution time (5 seconds) suggests the failure occurred during OpenCode initialization rather than during actual task execution.

## Failed Jobs and Errors

### Job Sequence
1. ✅ **pre_activation** - succeeded (3s)
2. ✅ **activation** - succeeded (2s)
3. ❌ **agent** - failed (33s total, 5s execution)
4. ❌ **create_issue** - failed (10s)
5. ⏭️ **detection** - skipped
6. ⏭️ **missing_tool** - skipped

### Agent Job Steps (29 total)
- Steps 1-21: ✅ All succeeded (setup, download images, configure MCPs, create prompt)
- **Step 22** ("Run OpenCode"): ❌ **FAILED** (5s execution)
- Steps 23-29: Partially succeeded (redaction, artifact uploads with warnings)

### Key Observations
- All setup steps completed successfully
- MCP servers were configured: github, gh-aw, safeoutputs
- OpenCode version 0.15.13 was installed
- Failure occurred immediately when running OpenCode
- No agent stdio log was created (agent failed before logging started)
- No MCP logs were created

## Investigation Findings

### Environment Configuration
```bash
GH_AW_SAFE_OUTPUTS_STAGED=true
GH_AW_WORKFLOW_NAME=Smoke OpenCode
GH_AW_SAFE_OUTPUTS=/tmp/gh-aw/safeoutputs/outputs.jsonl
GH_AW_SAFE_OUTPUTS_CONFIG={"create_issue":{"max":1,"min":1},"missing_tool":{}}
```

### Expected Task
**Prompt**: "Review the last 5 merged pull requests in this repository and post summary in an issue."

**Required**: Agent should use safe-outputs MCP `create_issue` tool (min: 1, max: 1)

### What Was Missing
- ❌ No agent stdio log (`/tmp/gh-aw/agent-stdio.log`)
- ❌ No safe-outputs file (`/tmp/gh-aw/safeoutputs/outputs.jsonl`)
- ❌ No MCP logs (`/tmp/gh-aw/mcp-logs/`)
- ❌ No agent output artifact
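
The missing-file list above can be turned into a check that runs before any downstream job. The following is a minimal sketch, not part of the existing workflow: the function name `check_artifacts` and the optional root-directory argument are hypothetical, while the paths below the root are the ones named in this report.

```bash
#!/usr/bin/env bash
# Sketch: report which expected agent artifacts exist under a root directory.
# The root defaults to /tmp/gh-aw; passing another directory is a hypothetical
# convenience so the check can be exercised outside the runner.
check_artifacts() {
  local root="${1:-/tmp/gh-aw}"
  local missing=0
  local path
  for path in \
    "$root/agent-stdio.log" \
    "$root/safeoutputs/outputs.jsonl" \
    "$root/safeoutputs/agent_output.json" \
    "$root/mcp-logs"; do
    if [ -e "$path" ]; then
      echo "present: $path"
    else
      echo "missing: $path"
      missing=$((missing + 1))
    fi
  done
  # Return non-zero when anything is missing so a workflow step can gate on it.
  [ "$missing" -eq 0 ]
}
```

Called as `check_artifacts || echo "agent produced no usable output"`, this would let a step record the failure explicitly instead of leaving `create_issue` to hit `ENOENT`.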

## Historical Context

This is a **recurring pattern**. Similar failures have occurred:

| Date | Run ID | Issue | Status | Pattern |
|------|--------|-------|--------|---------|
| 2025-10-30 | 18926079635 | Investigation cached | - | Anthropic API error |
| 2025-10-27 | 18840299097 | #2604 | Closed | Agent doesn't use safe-outputs (Codex) |
| 2025-10-24 | 18788162015 | #2307 | Closed | Agent doesn't use safe-outputs (GenAIScript) |
| 2025-10-22 | 18722224746 | #2143 | Closed | Agent doesn't use safe-outputs (OpenCode) |
| 2025-10-22 | 18715612738 | #2121 | Closed | Missing agent_output.json |

### Pattern Classification
- **Pattern ID**: `OPENCODE_AGENT_EXECUTION_FAILURE`
- **Related Patterns**: `OPENCODE_ANTHROPIC_API_ERROR`, `OPENCODE_NO_SAFE_OUTPUTS`
- **Category**: AI Engine - Agent Execution Failure
- **Severity**: High
- **Is Flaky**: Yes - intermittent, not related to code changes
- **Is Transient**: Yes - likely infrastructure or API related
- **Occurrence Count**: 11+ occurrences

### Comparison with Previous Run
The most recent cached investigation (run 18926079635, ~6 hours earlier) shows a similar pattern but with more specific error information indicating an Anthropic API call failure. This suggests the issue may be:
- Transient Anthropic API availability issues
- Rate limiting or quota problems
- Network connectivity to Anthropic's API

## Recommended Actions

### High Priority

- [ ] **Implement retry logic for OpenCode agent**
  ```yaml
  - name: Run OpenCode (with retry)
    uses: nick-fields/retry@v3
    with:
      timeout_minutes: 5
      max_attempts: 3
      retry_wait_seconds: 30
      command: |
        # OpenCode execution command
  ```

- [ ] **Make create_issue job conditional on agent success**
  ```yaml
  create_issue:
    needs: [agent, detection]
    if: needs.agent.result == 'success'
    runs-on: ubuntu-latest
  ```

- [ ] **Add OpenCode validation step**
  ```yaml
  - name: Validate OpenCode Installation
    run: |
      opencode --version
      echo "Testing OpenCode CLI is functional"
  ```

- [ ] **Add pre-flight API health check**
  ```yaml
  - name: Check Anthropic API Health
    run: |
      # Minimal test request to verify API is accessible
      # Skip agent execution if API is down
  ```

### Medium Priority

- [ ] **Add verbose logging for OpenCode execution**
  - Enable debug mode to capture more diagnostic information
  - Log MCP server initialization status
  - Capture any stderr output

- [ ] **Monitor OpenCode failure rate**
  - Track scheduled smoke test success/failure rates
  - Alert if failure rate exceeds threshold
  - Identify patterns (time of day, specific commits, etc.)

- [ ] **Investigate OpenCode 0.15.13 stability**
  - Check if specific version has known issues
  - Consider pinning to a more stable version
  - Review OpenCode release notes for recent changes

### Low Priority

- [ ] **Add fallback mechanism**
  - If agent fails, create informational issue about the failure
  - Preserve workflow outputs for debugging

- [ ] **Improve error messages**
  - Provide clearer feedback when agent execution fails
  - Include troubleshooting steps in workflow output

## Prevention Strategies

1. **Retry Logic**: Implement automatic retries with exponential backoff for transient failures
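   ```yaml
   # Illustrative settings only: GitHub Actions has no built-in `retry` key,
   # so a retry action or wrapper script has to implement these values.
   retry:
     max_attempts: 3
     initial_delay: 30s
     backoff_multiplier: 2
   ```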

2. **Conditional Job Execution**: Only run downstream jobs when agent succeeds
   ```yaml
   # Assumes the agent job is extended to expose a `has_output` job output.
   if: needs.agent.result == 'success' && needs.agent.outputs.has_output == 'true'
   ```

3. **Health Checks**: Add pre-flight checks for:
   - OpenCode CLI installation and version
   - Anthropic API availability
   - MCP server connectivity

4. **Graceful Degradation**: Don't fail the entire workflow if downstream jobs can't run
   ```yaml
   continue-on-error: true
   ```

5. **Enhanced Monitoring**: Track and alert on:
   - Agent execution failure rates
   - Artifact creation success rates
   - API response times and errors
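
The retry settings in strategy 1 (3 attempts, 30s initial delay, multiplier of 2) could be implemented as a small wrapper script. This is a sketch under assumptions rather than the workflow's actual mechanism: `retry_with_backoff` is a hypothetical helper, and the command it wraps is whatever the "Run OpenCode" step currently executes.

```bash
#!/usr/bin/env bash
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS MULTIPLIER COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, multiplying the
# wait between attempts by MULTIPLIER each time.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" multiplier="$3"
  shift 3
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * multiplier))
    attempt=$((attempt + 1))
  done
}
```

With `retry_with_backoff 3 30 2 some-command`, a command that keeps failing is tried three times with waits of 30s and then 60s in between. Retrying only helps if the failure is transient, which this report suspects but has not confirmed.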

## Technical Details

### Workflow Execution Timeline
```
06:11:14 - Workflow triggered (schedule)
06:11:18 - pre_activation started
06:11:21 - pre_activation completed ✅
06:11:24 - activation started
06:11:26 - activation completed ✅
06:11:29 - agent job started
06:11:54 - agent job: Run OpenCode step started
06:11:59 - agent job: Run OpenCode step FAILED ❌ (5s)
06:12:06 - create_issue job started
06:12:09 - create_issue job FAILED ❌
06:12:17 - Workflow completed (failure)
```

### Agent Job Steps Summary
- **Setup**: 21 steps, all succeeded (25s)
- **Execution**: Step 22 failed (5s)
- **Cleanup**: 7 steps, mostly succeeded with warnings (3s)
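
The setup and execution durations above follow directly from the timeline timestamps. As a quick cross-check, the arithmetic can be sketched with two small helpers (`to_seconds` and `elapsed` are hypothetical names introduced for illustration):

```bash
#!/usr/bin/env bash
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
  # Split on ':' into the positional parameters $1 $2 $3.
  old_ifs=$IFS; IFS=:; set -- $1; IFS=$old_ifs
  # ${n#0} strips one leading zero so values like "08" are not read as octal.
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Print the number of seconds between two same-day timestamps.
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 06:11:29 06:11:54   # agent job start -> Run OpenCode start (setup)
elapsed 06:11:54 06:11:59   # Run OpenCode start -> failure (execution)
```

The two calls print 25 and 5, matching the setup and execution figures; adding the 3s of cleanup gives the 33s reported for the agent job.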

### MCP Servers Configured
1. **github** - GitHub MCP server for repository operations
2. **gh-aw** - GitHub Actions workflows MCP server
3. **safeoutputs** - Safe outputs MCP for creating issues

## Related Issues

- #2143 - OpenCode agent doesn't use safe-outputs (closed)
- #2121 - Missing agent_output.json file (closed)
- #2604 - Codex agent output artifact missing (closed)
- #2307 - GenAIScript agent doesn't use safe-outputs (closed)
- #2534 - Task: Add graceful artifact handling (closed)

## Next Steps

1. **Immediate**: Monitor next scheduled run to see if issue recurs
2. **Short-term**: Implement retry logic and conditional jobs
3. **Long-term**: Add comprehensive health checks and monitoring

---

**Investigation Metadata:**
- **Investigator**: Smoke Detector (automated investigator)
- **Investigation Run**: [18931646294]((redacted))
- **Pattern Database**: `/tmp/gh-aw/cache-memory/patterns/opencode_agent_execution_failure.json`
- **Investigation Record**: `/tmp/gh-aw/cache-memory/investigations/2025-10-30-18931623375.json`
- **Related PR**: #2768 (Add permissions validator for GitHub MCP toolsets) - merged, unrelated to failure




> AI generated by [Smoke Detector - Smoke Test Failure Investigator](https://github.com/githubnext/gh-aw/actions/runs/18931646294)
