🔍 Smoke Test Investigation - Run #55
Summary
The Smoke OpenCode workflow failed because the OpenCode agent failed during the "Run OpenCode" step. The step failed after only ~4 seconds of runtime (35s for the agent job as a whole), so no safe-outputs were created. This is a recurring transient failure pattern that has occurred at least 13 times since October 22, 2025.
Failure Details
- Run: 18964376017
- Run Number: 55
- Commit: 9bf9ee9
- Branch: main
- Trigger: schedule
- Duration: 1.1 minutes
- Failed Jobs: agent (35s), create_issue (6s)
- Workflow: Smoke OpenCode
- Related PR: #2843 (Treat under-provisioned permissions as warnings in non-strict mode) - merged, unrelated to failure
Root Cause Analysis
Primary Issue
The OpenCode agent failed during execution in the "Run OpenCode" step after completing initial setup successfully. The failure occurred very early (~4 seconds into execution) before the agent could:
- Process the prompt completely
- Make API calls or use MCP tools
- Create any output artifacts
Error Chain
1. Agent Execution Failure (agent job, step 22: "Run OpenCode")
The agent job failed during the 'Run OpenCode' step
Duration: ~4 seconds
No error logs available - agent failed before creating stdio log
2. Missing Artifacts (agent job)
##[warning]No files were found with the provided path: /tmp/gh-aw/safeoutputs/outputs.jsonl
No artifacts will be uploaded.
##[warning]No files were found with the provided path: /tmp/gh-aw/agent-stdio.log
No artifacts will be uploaded.
##[warning]No files were found with the provided path: /tmp/gh-aw/mcp-logs/
No artifacts will be uploaded.
3. Downstream Failure (create_issue job)
##[error]Error reading agent output file: ENOENT: no such file or directory,
open '/tmp/gh-aw/safeoutputs/agent_output.json'
Why Did This Happen?
Based on analysis of 13 similar failures, this failure is consistent with one of the following scenarios (no logs were captured for this run, so none of them is confirmed):
- Anthropic API Transient Failure: The most common cause - Anthropic API becomes temporarily unavailable or returns errors
- OpenCode Initialization Error: OpenCode encounters an unhandled exception during startup
- MCP Server Initialization: One of the configured MCP servers (github, gh-aw, safeoutputs) fails to initialize
- Network/Infrastructure Issue: Runner environment has temporary connectivity problems
- Rate Limiting: API rate limits are being hit during scheduled runs
The very short execution time (~4 seconds) suggests the failure occurred during OpenCode initialization or the first API call, rather than during actual task execution.
Failed Jobs and Errors
Job Sequence
- ✅ pre_activation - succeeded (5s)
- ✅ activation - succeeded (5s)
- ❌ agent - failed (35s total, ~4s execution)
- ❌ create_issue - failed (6s)
- ⏭️ detection - skipped
- ⏭️ missing_tool - skipped
Agent Job Analysis
- Steps 1-21: ✅ All succeeded (setup, MCP configuration, prompt creation)
- Step 22 ("Run OpenCode"): ❌ FAILED (~4s execution)
- Steps 23-29: Partially succeeded (artifact uploads with warnings)
Key Observations
- All setup steps completed successfully
- MCP servers were configured properly
- OpenCode was installed successfully
- Failure occurred immediately when running OpenCode
- No logs were created (agent failed before logging started)
Investigation Findings
Environment Configuration
GH_AW_SAFE_OUTPUTS_STAGED=true
GH_AW_WORKFLOW_NAME=Smoke OpenCode
GH_AW_SAFE_OUTPUTS=/tmp/gh-aw/safeoutputs/outputs.jsonl
GH_AW_SAFE_OUTPUTS_CONFIG={"create_issue":{"max":1,"min":1},"missing_tool":{}}
Expected Task
Prompt: "Review the last 5 merged pull requests in this repository and post summary in an issue."
Required: Agent should use safe-outputs MCP create_issue tool (min: 1, max: 1)
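The min/max requirement above can be checked mechanically before the create_issue job consumes the agent output. The following is a minimal sketch, assuming each line of outputs.jsonl is a JSON object with a `type` field (the file format is not shown in this report); `count_outputs` and `check_limits` are hypothetical helper names, not part of gh-aw.

```python
import json

# Copied from GH_AW_SAFE_OUTPUTS_CONFIG in this report.
CONFIG = json.loads('{"create_issue":{"max":1,"min":1},"missing_tool":{}}')


def count_outputs(jsonl_text):
    """Count entries per type. Assumes one JSON object per line with a "type" key."""
    counts = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        kind = json.loads(line).get("type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


def check_limits(counts, config):
    """Return human-readable violations of the configured min/max limits."""
    problems = []
    for kind, limits in config.items():
        n = counts.get(kind, 0)
        if "min" in limits and n < limits["min"]:
            problems.append(f"{kind}: got {n}, expected at least {limits['min']}")
        if "max" in limits and n > limits["max"]:
            problems.append(f"{kind}: got {n}, expected at most {limits['max']}")
    return problems


# An agent that produced no output entries violates create_issue min: 1.
print(check_limits(count_outputs(""), CONFIG))
```

In this run the file did not exist at all, so a check like this would only be reached after confirming the file is present.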
What Was Missing
- ❌ No agent stdio log (/tmp/gh-aw/agent-stdio.log)
- ❌ No safe-outputs file (/tmp/gh-aw/safeoutputs/outputs.jsonl)
- ❌ No MCP logs (/tmp/gh-aw/mcp-logs/)
- ❌ No agent output artifact
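A downstream job could report exactly which expected files are absent instead of failing with a bare ENOENT. This is a minimal sketch using the three paths from the warnings in this report; `missing_artifacts` is a hypothetical helper, not an existing gh-aw function.

```python
from pathlib import Path

# Paths taken from the upload warnings quoted in this report.
EXPECTED = [
    "/tmp/gh-aw/agent-stdio.log",
    "/tmp/gh-aw/safeoutputs/outputs.jsonl",
    "/tmp/gh-aw/mcp-logs/",
]


def missing_artifacts(paths):
    """Return the subset of paths that do not exist on disk, in input order."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    for path in missing_artifacts(EXPECTED):
        print(f"missing: {path}")
```

Running this as an early step of create_issue would turn the cascading failure into an explicit "agent produced no output" message.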
Historical Context
This is a well-documented recurring pattern. Similar failures have occurred:
Recent Occurrences (Sample)
| Date | Run ID | Issue | Pattern | Notes |
|---|---|---|---|---|
| 2025-10-31 | 18958976188 | Cached | Anthropic API error | API call failed |
| 2025-10-30 | 18950694671 | Cached | No safe outputs | Agent completed without outputs |
| 2025-10-30 | 18940217849 | Cached | Agent execution failure | Failed during execution |
| 2025-10-30 | 18931623375 | #2772 | Agent execution failure | Closed as "not planned" |
| 2025-10-29 | Multiple | Cached | Various patterns | Multiple occurrences |
Pattern Classification
- Pattern ID: OPENCODE_AGENT_EXECUTION_FAILURE
- Related Patterns: OPENCODE_ANTHROPIC_API_ERROR, OPENCODE_NO_SAFE_OUTPUTS
- Category: AI Engine - Agent Execution Failure
- Severity: High
- Is Flaky: Yes - intermittent, not related to code changes
- Is Transient: Yes - likely infrastructure or API related
- Occurrence Count: 13+ occurrences since 2025-10-22
- First Seen: 2025-10-22
- Last Seen: 2025-10-31 (this occurrence)
Previous Related Issues
- #2772: [smoke-detector] 🔍 Smoke Test Investigation - Smoke OpenCode Run #51: OpenCode Agent Execution Failure (closed as "not planned", 2025-10-30)
- #2604: [smoke-detector] 🔍 Smoke Test Investigation - Smoke Codex Run #49: Agent Output Artifact Missing in Staged Mode (closed)
- #2307: [smoke-detector] 🔍 Smoke Test Investigation - Smoke GenAIScript Run #57: Agent Does Not Use Safe-Outputs MCP Tools (closed)
- #2143: [smoke-detector] 🔍 Smoke Test Investigation - Smoke OpenCode Run #18722224746: Agent Does Not Use Safe-Outputs MCP Tools (closed)
- #2121: [smoke-outpost] 🔍 Smoke Test Investigation - Smoke OpenCode: Missing agent_output.json File (closed)
Recommended Actions
🔴 High Priority
- Implement retry logic for OpenCode agent (Recommended 13+ times)

      - name: Run OpenCode (with retry)
        uses: nick-fields/retry@v3
        with:
          timeout_minutes: 5
          max_attempts: 3
          retry_wait_seconds: 30
          exponential_backoff: true  # verify: may not be a supported input of nick-fields/retry@v3
          command: |
            # OpenCode execution command

  Impact: Estimated to prevent ~90% of these transient failures from failing the workflow (estimate, not measured)
- Make create_issue job conditional on agent success (Recommended 13+ times)

      create_issue:
        needs: [agent, detection]
        if: needs.agent.result == 'success'
        runs-on: ubuntu-slim

  Impact: Prevents cascading failures and gives clearer error signals
- Add pre-flight API health check

      - name: Check Anthropic API Health
        run: |
          # Minimal test request to verify API is accessible
          # Skip agent execution with clear message if API is down

  Impact: Provides early warning and clearer failure messages
🟡 Medium Priority
- Add OpenCode validation step

      - name: Validate OpenCode Installation
        run: |
          opencode --version
          echo "Testing OpenCode CLI is functional"

- Enable verbose logging for OpenCode execution
  - Capture debug output to diagnose root cause
  - Log MCP server initialization status
  - Capture stderr output for better diagnostics
- Monitor OpenCode failure rate
  - Track scheduled smoke test success/failure rates
  - Alert if failure rate exceeds 20% threshold
  - Identify patterns (time of day, concurrent runs, etc.)
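The 20% alert threshold above can be expressed as a small check over recent run results. This is a minimal sketch; the `history` list is made-up illustrative data, and `failure_rate` and `should_alert` are hypothetical helper names.

```python
def failure_rate(results):
    """Fraction of failed runs; results is a list of booleans (True = success)."""
    if not results:
        return 0.0
    return results.count(False) / len(results)


def should_alert(results, threshold=0.20):
    """Alert only when the failure rate is strictly above the threshold."""
    return failure_rate(results) > threshold


# Hypothetical history of the last 10 scheduled runs, 3 of which failed.
history = [True, True, False, True, True, False, True, True, True, False]
print(f"{failure_rate(history):.0%}", should_alert(history))
```

Using a strict comparison means a rate of exactly 20% does not alert; whether the boundary should alert is a policy choice this report does not specify.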
🟢 Low Priority
- Investigate OpenCode + Anthropic API patterns
  - Check if specific times have higher failure rates
  - Review API quota and rate limit settings
  - Consider implementing circuit breaker pattern
- Add graceful degradation
  - Create informational comment about the failure
  - Preserve workflow outputs for debugging
Prevention Strategies
- Retry Logic: Implement automatic retries with exponential backoff
  - Max attempts: 3
  - Initial delay: 30s
  - Backoff multiplier: 2
  - Expected success rate improvement: 90%+ (estimate, not measured)
- Conditional Job Execution: Only run downstream jobs when agent succeeds

      if: needs.agent.result == 'success' && needs.agent.outputs.has_output == 'true'

- Health Checks: Add pre-flight checks for:
  - OpenCode CLI installation and version
  - Anthropic API availability (minimal test call)
  - MCP server connectivity
- Enhanced Error Handling: Improve error messages and diagnostics
  - Capture more detailed logs
  - Provide troubleshooting guidance in workflow output
  - Create alerts for recurring patterns
- Monitoring and Alerting:
  - Track agent execution failure rates over time
  - Alert when failure rate exceeds acceptable threshold
  - Generate weekly reports on smoke test health
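The retry parameters listed under Retry Logic (3 attempts, 30s initial delay, multiplier 2) imply waits of 30s and then 60s between attempts. This is a minimal sketch of that policy, not the workflow's actual implementation; `backoff_delays` and `run_with_retry` are hypothetical names, and `task` stands in for whatever launches the agent.

```python
import time


def backoff_delays(max_attempts=3, initial_delay=30.0, multiplier=2.0):
    """Delays in seconds between attempts: one fewer than the attempt count."""
    return [initial_delay * multiplier**i for i in range(max_attempts - 1)]


def run_with_retry(task, max_attempts=3, initial_delay=30.0, multiplier=2.0,
                   sleep=time.sleep):
    """Call task() until it succeeds or attempts run out, then re-raise the last error.

    Catching every Exception is a simplification; a real workflow would
    retry only on errors known to be transient.
    """
    delays = backoff_delays(max_attempts, initial_delay, multiplier)
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delays[attempt - 1])


print(backoff_delays())
```

The `sleep` parameter is injectable so the policy can be exercised without actually waiting. With these defaults the worst case adds 90 seconds of waiting before the workflow reports failure.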
Technical Details
Workflow Execution Timeline
06:11:20 - Workflow triggered (schedule)
06:11:23 - pre_activation started
06:11:28 - pre_activation completed ✅
06:11:32 - activation started
06:11:37 - activation completed ✅
06:11:40 - agent job started
06:12:07 - agent job: Run OpenCode step started
06:12:11 - agent job: Run OpenCode step FAILED ❌ (~4s)
06:12:19 - create_issue job started
06:12:23 - create_issue job FAILED ❌
06:12:26 - Workflow completed (failure)
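The durations quoted elsewhere in this report can be recomputed from the timeline. This is a small worked example using only the timestamps listed above; `seconds_between` is a hypothetical helper.

```python
from datetime import datetime


def seconds_between(start, end):
    """Elapsed whole seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


setup = seconds_between("06:11:40", "06:12:07")     # agent job start to Run OpenCode start
run_step = seconds_between("06:12:07", "06:12:11")  # Run OpenCode step
workflow = seconds_between("06:11:20", "06:12:26")  # trigger to completion

print(setup, run_step, workflow, round(workflow / 60, 1))
```

The step duration matches the ~4s figure and the workflow total matches the reported 1.1 minutes. Setup plus execution accounts for 31s of the agent job's 35s; the remainder presumably belongs to the artifact upload steps that ran after the failure.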
MCP Servers Configured
- github - GitHub MCP server for repository operations
- gh-aw - GitHub Actions workflows MCP server
- safeoutputs - Safe outputs MCP for creating issues
Analysis: Why This Keeps Happening
After analyzing 13+ occurrences, here's why this pattern persists:
- No Retry Logic: Single-attempt execution means any transient failure (API hiccup, network blip) causes total failure
- Transient Nature: These are infrastructure/API issues, not code bugs - they resolve themselves but recur randomly
- Cascading Failures: The create_issue job always fails when agent fails, creating noise in the logs
- Limited Diagnostics: Without detailed logs, it's hard to distinguish between different root causes
- Schedule Timing: Scheduled runs may hit API rate limits or maintenance windows
Immediate Next Steps
Given at least 13 occurrences of this pattern:
- Implement retry logic - This is the highest-impact fix; it mitigates the transient failures even though the underlying cause is unconfirmed
- Make create_issue conditional - Reduces noise and clarifies actual failures
- Add API health check - Provides early detection and clearer failure messages
- Monitor next 5 runs - Verify if retry logic resolves the issue
Related PR Assessment
PR #2843: "Treat under-provisioned permissions as warnings in non-strict mode"
- Related to failure?: No
- Assessment: The failure is a pre-existing recurring transient pattern unrelated to the PR changes. The PR modifies compiler behavior for permission validation, which has no impact on OpenCode agent runtime execution or Anthropic API calls.
Investigation Metadata:
- Investigator: Smoke Detector (automated investigator)
- Investigation Run: 18964399082
- Investigation Record: /tmp/gh-aw/cache-memory/investigations/2025-10-31-18964376017.json
- Pattern Database: 13+ similar occurrences since 2025-10-22
Labels: smoke-test, investigation, recurring, opencode, transient-failure
AI generated by Smoke Detector - Smoke Test Failure Investigator