[smoke-detector] 🔍 Smoke Test Investigation - Smoke OpenCode Run #55: OpenCode Agent Execution Failure (Recurring) #2852

@github-actions

🔍 Smoke Test Investigation - Run #55

Summary

The Smoke OpenCode workflow failed due to an OpenCode agent execution failure during the "Run OpenCode" step. The agent job failed after only ~4 seconds of runtime, preventing the creation of safe-outputs. This is a recurring transient failure pattern that has occurred at least 13 times since October 22, 2025.

Failure Details

Root Cause Analysis

Primary Issue

The OpenCode agent failed during execution in the "Run OpenCode" step after completing initial setup successfully. The failure occurred very early (~4 seconds into execution) before the agent could:

  • Process the prompt completely
  • Make API calls or use MCP tools
  • Create any output artifacts

Error Chain

1. Agent Execution Failure (agent job, step 22: "Run OpenCode")

The agent job failed during the 'Run OpenCode' step
Duration: ~4 seconds
No error logs available - agent failed before creating stdio log

2. Missing Artifacts (agent job)

##[warning]No files were found with the provided path: /tmp/gh-aw/safeoutputs/outputs.jsonl
No artifacts will be uploaded.

##[warning]No files were found with the provided path: /tmp/gh-aw/agent-stdio.log
No artifacts will be uploaded.

##[warning]No files were found with the provided path: /tmp/gh-aw/mcp-logs/
No artifacts will be uploaded.

3. Downstream Failure (create_issue job)

##[error]Error reading agent output file: ENOENT: no such file or directory, open '/tmp/gh-aw/safeoutputs/agent_output.json'

Why Did This Happen?

No error logs were captured for this run, so the cause cannot be confirmed. Based on analysis of 13 similar failures, it is most likely one of the following scenarios:

  1. Anthropic API Transient Failure: The most common cause - Anthropic API becomes temporarily unavailable or returns errors
  2. OpenCode Initialization Error: OpenCode encounters an unhandled exception during startup
  3. MCP Server Initialization: One of the configured MCP servers (github, gh-aw, safeoutputs) fails to initialize
  4. Network/Infrastructure Issue: Runner environment has temporary connectivity problems
  5. Rate Limiting: API rate limits are being hit during scheduled runs

The very short execution time (~4 seconds) suggests the failure occurred during OpenCode initialization or the first API call, rather than during actual task execution.

Failed Jobs and Errors

Job Sequence

  1. pre_activation - succeeded (5s)
  2. activation - succeeded (5s)
  3. agent - failed (35s total, ~4s execution)
  4. create_issue - failed (6s)
  5. ⏭️ detection - skipped
  6. ⏭️ missing_tool - skipped

Agent Job Analysis

  • Steps 1-21: ✅ All succeeded (setup, MCP configuration, prompt creation)
  • Step 22 ("Run OpenCode"): ❌ FAILED (~4s execution)
  • Steps 23-29: Partially succeeded (artifact uploads with warnings)

Key Observations

  • All setup steps completed successfully
  • MCP servers were configured properly
  • OpenCode was installed successfully
  • Failure occurred immediately when running OpenCode
  • No logs were created (agent failed before logging started)

Investigation Findings

Environment Configuration

GH_AW_SAFE_OUTPUTS_STAGED=true
GH_AW_WORKFLOW_NAME=Smoke OpenCode
GH_AW_SAFE_OUTPUTS=/tmp/gh-aw/safeoutputs/outputs.jsonl
GH_AW_SAFE_OUTPUTS_CONFIG={"create_issue":{"max":1,"min":1},"missing_tool":{}}

Expected Task

Prompt: "Review the last 5 merged pull requests in this repository and post summary in an issue."

Required: Agent should use safe-outputs MCP create_issue tool (min: 1, max: 1)

What Was Missing

  • ❌ No agent stdio log (/tmp/gh-aw/agent-stdio.log)
  • ❌ No safe-outputs file (/tmp/gh-aw/safeoutputs/outputs.jsonl)
  • ❌ No MCP logs (/tmp/gh-aw/mcp-logs/)
  • ❌ No agent output artifact
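
The missing-file check above can be sketched as a small Python helper. This is a minimal illustration and not part of the gh-aw workflow: the paths come from the warnings in this report, while `missing_artifacts` and its `root` parameter are hypothetical names introduced here.

```python
from pathlib import Path

# Paths taken from the upload warnings in this report.
EXPECTED_ARTIFACTS = [
    "/tmp/gh-aw/agent-stdio.log",
    "/tmp/gh-aw/safeoutputs/outputs.jsonl",
    "/tmp/gh-aw/mcp-logs/",
]


def missing_artifacts(paths, root="/"):
    """Return the paths that do not exist, or that exist but are empty.

    `root` is a hypothetical hook so the check can run against a copied
    artifact tree instead of the live runner filesystem.
    """
    missing = []
    for raw in paths:
        path = Path(root) / raw.lstrip("/")
        if not path.exists():
            missing.append(raw)
        elif path.is_file() and path.stat().st_size == 0:
            missing.append(raw)
        elif path.is_dir() and not any(path.iterdir()):
            missing.append(raw)
    return missing
```

Running a check like this right after the agent step would let the workflow report "agent produced no logs" explicitly, rather than leaving it to be inferred from upload warnings.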

Historical Context

This is a well-documented recurring pattern. Similar failures have occurred:

Recent Occurrences (Sample)

Date        Run ID       Issue   Pattern                  Notes
2025-10-31  18958976188  Cached  Anthropic API error      API call failed
2025-10-30  18950694671  Cached  No safe outputs          Agent completed without outputs
2025-10-30  18940217849  Cached  Agent execution failure  Failed during execution
2025-10-30  18931623375  #2772   Agent execution failure  Closed as "not planned"
2025-10-29  Multiple     Cached  Various patterns         Multiple occurrences

Pattern Classification

  • Pattern ID: OPENCODE_AGENT_EXECUTION_FAILURE
  • Related Patterns: OPENCODE_ANTHROPIC_API_ERROR, OPENCODE_NO_SAFE_OUTPUTS
  • Category: AI Engine - Agent Execution Failure
  • Severity: High
  • Is Flaky: Yes - intermittent, not related to code changes
  • Is Transient: Yes - likely infrastructure or API related
  • Occurrence Count: 13+ occurrences since 2025-10-22
  • First Seen: 2025-10-22
  • Last Seen: 2025-10-31 (this occurrence)

Previous Related Issues

Recommended Actions

🔴 High Priority

  • Implement retry logic for OpenCode agent (Recommended 13+ times)

    - name: Run OpenCode (with retry)
      uses: nick-fields/retry@v3
      with:
        timeout_minutes: 5
        max_attempts: 3
        # Fixed wait between attempts; the action's documented inputs
        # do not include an exponential_backoff option
        retry_wait_seconds: 30
        command: |
          # OpenCode execution command

    Impact: Expected to keep most of these transient failures (estimated ~90%) from failing the workflow

  • Make create_issue job conditional on agent success (Recommended 13+ times)

    create_issue:
      needs: [agent, detection]
      if: needs.agent.result == 'success'
      runs-on: ubuntu-slim

    Impact: Prevents cascading failures and gives clearer error signals

  • Add pre-flight API health check

    - name: Check Anthropic API Health
      run: |
        # Minimal test request to verify API is accessible
        # Skip agent execution with clear message if API is down

    Impact: Provides early warning and clearer failure messages

🟡 Medium Priority

  • Add OpenCode validation step

    - name: Validate OpenCode Installation
      run: |
        opencode --version
        echo "Testing OpenCode CLI is functional"
  • Enable verbose logging for OpenCode execution

    • Capture debug output to diagnose root cause
    • Log MCP server initialization status
    • Capture stderr output for better diagnostics
  • Monitor OpenCode failure rate

    • Track scheduled smoke test success/failure rates
    • Alert if failure rate exceeds 20% threshold
    • Identify patterns (time of day, concurrent runs, etc.)
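
The 20% alert threshold mentioned above can be expressed as a small check. This is a sketch under stated assumptions: run conclusions are taken to be plain strings such as "success" and "failure", and `failure_rate` and `should_alert` are hypothetical helper names, not existing gh-aw functions.

```python
def failure_rate(results):
    """Fraction of runs whose conclusion is "failure".

    `results` is a list of conclusion strings; an empty list counts as 0.0
    so that a workflow with no history never triggers an alert.
    """
    if not results:
        return 0.0
    failed = sum(1 for conclusion in results if conclusion == "failure")
    return failed / len(results)


def should_alert(results, threshold=0.20):
    """True when the failure rate is strictly above the threshold."""
    return failure_rate(results) > threshold
```

With the strict comparison, one failure in five runs (exactly 20%) does not alert, while two failures in five do. Whether the boundary should alert is a policy choice the report leaves open.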

🟢 Low Priority

  • Investigate OpenCode + Anthropic API patterns

    • Check if specific times have higher failure rates
    • Review API quota and rate limit settings
    • Consider implementing circuit breaker pattern
  • Add graceful degradation

    • Create informational comment about the failure
    • Preserve workflow outputs for debugging

Prevention Strategies

  1. Retry Logic: Implement automatic retries with exponential backoff

    • Max attempts: 3
    • Initial delay: 30s
    • Backoff multiplier: 2
    • Expected success rate improvement: 90%+
  2. Conditional Job Execution: Only run downstream jobs when agent succeeds

    if: needs.agent.result == 'success' && needs.agent.outputs.has_output == 'true'
  3. Health Checks: Add pre-flight checks for:

    • OpenCode CLI installation and version
    • Anthropic API availability (minimal test call)
    • MCP server connectivity
  4. Enhanced Error Handling: Improve error messages and diagnostics

    • Capture more detailed logs
    • Provide troubleshooting guidance in workflow output
    • Create alerts for recurring patterns
  5. Monitoring and Alerting:

    • Track agent execution failure rates over time
    • Alert when failure rate exceeds acceptable threshold
    • Generate weekly reports on smoke test health
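
The retry parameters in item 1 (3 attempts, 30s initial delay, multiplier 2) can be sketched in Python as follows. This is a minimal illustration of exponential backoff, not how gh-aw or OpenCode implement retries; `backoff_delays`, `run_with_retry`, and the injectable `sleep` argument are hypothetical names.

```python
import time


def backoff_delays(max_attempts=3, initial_delay=30.0, multiplier=2.0):
    """Seconds to wait after each failed attempt except the last one."""
    return [initial_delay * multiplier ** i for i in range(max_attempts - 1)]


def run_with_retry(action, max_attempts=3, initial_delay=30.0,
                   multiplier=2.0, sleep=time.sleep):
    """Call `action` until it succeeds or the attempts are exhausted.

    Any exception counts as a transient failure here; a real workflow
    would likely retry only on specific errors. The final failure is
    re-raised so the job still fails visibly.
    """
    delays = backoff_delays(max_attempts, initial_delay, multiplier)
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delays[attempt - 1])
```

With the defaults, the waits are 30s and then 60s, so a run that fails all three attempts spends 90s waiting in addition to the attempts themselves. Passing `sleep` as an argument keeps the function testable without real delays.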

Technical Details

Workflow Execution Timeline

06:11:20 - Workflow triggered (schedule)
06:11:23 - pre_activation started
06:11:28 - pre_activation completed ✅
06:11:32 - activation started
06:11:37 - activation completed ✅
06:11:40 - agent job started
06:12:07 - agent job: Run OpenCode step started
06:12:11 - agent job: Run OpenCode step FAILED ❌ (~4s)
06:12:19 - create_issue job started
06:12:23 - create_issue job FAILED ❌
06:12:26 - Workflow completed (failure)
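
The durations quoted in this report can be rederived from the timeline with a few lines of Python. This is only a convenience sketch; `seconds_between` is a hypothetical helper and assumes both timestamps fall on the same day.

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Whole seconds from `start` to `end`, both given as HH:MM:SS strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Run OpenCode step: started 06:12:07, failed 06:12:11
print(seconds_between("06:12:07", "06:12:11"))  # 4
# Workflow trigger to completion: 06:11:20 to 06:12:26
print(seconds_between("06:11:20", "06:12:26"))  # 66
```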

MCP Servers Configured

  1. github - GitHub MCP server for repository operations
  2. gh-aw - GitHub Actions workflows MCP server
  3. safeoutputs - Safe outputs MCP for creating issues

Analysis: Why This Keeps Happening

After analyzing 13+ occurrences, here's why this pattern persists:

  1. No Retry Logic: Single-attempt execution means any transient failure (API hiccup, network blip) causes total failure
  2. Transient Nature: These are infrastructure/API issues, not code bugs - they resolve themselves but recur randomly
  3. Cascading Failures: The create_issue job always fails when agent fails, creating noise in the logs
  4. Limited Diagnostics: Without detailed logs, it's hard to distinguish between different root causes
  5. Schedule Timing: Scheduled runs may hit API rate limits or maintenance windows

Immediate Next Steps

Given that this pattern has now occurred at least 13 times:

  1. Implement retry logic - This is the highest-impact mitigation for transient failures, although it does not identify the underlying cause
  2. Make create_issue conditional - Reduces noise and clarifies actual failures
  3. Add API health check - Provides early detection and clearer failure messages
  4. Monitor next 5 runs - Verify if retry logic resolves the issue

Related PR Assessment

PR #2843: "Treat under-provisioned permissions as warnings in non-strict mode"

  • Related to failure: No
  • Assessment: The failure is a pre-existing recurring transient pattern unrelated to the PR changes. The PR modifies compiler behavior for permission validation, which has no impact on OpenCode agent runtime execution or Anthropic API calls.

Investigation Metadata:

  • Investigator: Smoke Detector (automated investigator)
  • Investigation Run: 18964399082
  • Investigation Record: /tmp/gh-aw/cache-memory/investigations/2025-10-31-18964376017.json
  • Pattern Database: 13+ similar occurrences since 2025-10-22

Labels: smoke-test, investigation, recurring, opencode, transient-failure

AI generated by Smoke Detector - Smoke Test Failure Investigator
