🔍 Smoke Test Investigation - Run #51
Summary
The Smoke OpenCode workflow failed because the OpenCode agent failed during the "Run OpenCode" step. That step stopped after only 5 seconds of runtime, before any safe outputs were written. The downstream create_issue job then failed because it could not find the expected agent_output.json file.
Failure Details
- Run: 18931623375
- Run Number: 51
- Commit: 11cfdb2
- Branch: main
- Trigger: schedule
- Duration: 1.1 minutes
- Failed Jobs: agent (33s), create_issue (10s)
- Workflow: Smoke OpenCode
Root Cause Analysis
Primary Issue
The OpenCode agent failed in the "Run OpenCode" step after initial setup had completed. The failure occurred very early (5 seconds into execution), apparently before the agent could:
- Process the prompt
- Make any API calls
- Use any MCP tools
- Create any output artifacts
Error Chain
1. Agent Execution Failure (agent job, step 22: "Run OpenCode")
The agent job failed during the 'Run OpenCode' step
Duration: 5 seconds
No error logs available - agent failed before creating stdio log
2. Missing Artifacts (agent job)
##[warning]No files were found with the provided path: /tmp/gh-aw/safeoutputs/outputs.jsonl
No artifacts will be uploaded.
##[warning]No files were found with the provided path: /tmp/gh-aw/agent-stdio.log
No artifacts will be uploaded.
3. Downstream Failure (create_issue job)
##[error]Error reading agent output file: ENOENT: no such file or directory, open '/tmp/gh-aw/safeoutputs/agent_output.json'
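The ENOENT above comes from the create_issue job reading a file that the agent never wrote. A guard like the one below would turn that crash into an explicit "no output" result, in the spirit of the graceful artifact handling task #2534. This is a minimal Python sketch, not the actual gh-aw code. The real step appears to be Node.js (the error message is in Node's format), and `read_agent_output` is a hypothetical name.

```python
import json
import sys
from pathlib import Path
from typing import Any, Optional

# Path taken from the ENOENT error above.
AGENT_OUTPUT = Path("/tmp/gh-aw/safeoutputs/agent_output.json")


def read_agent_output(path: Path = AGENT_OUTPUT) -> Optional[Any]:
    """Return the parsed agent output, or None if the agent never wrote it.

    A missing file means the agent failed upstream. In that case, log a
    warning and let the caller skip issue creation instead of crashing on ENOENT.
    """
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"warning: {path} not found; agent produced no output", file=sys.stderr)
        return None
    # A file that exists but is malformed is a different bug: let it raise.
    return json.loads(text)
```

Returning None instead of raising keeps two failure modes apart: a missing file (the agent never produced output) and malformed JSON (a bug worth surfacing loudly).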
Why Did This Happen?
The available evidence points to one of the following causes:
- OpenCode Installation Issue: The OpenCode CLI may not have installed correctly or dependencies are missing
- Runtime Error: OpenCode encountered an unhandled exception during startup
- API Authentication: Anthropic API credentials may be invalid or expired
- Transient Infrastructure Issue: Runner environment had a temporary problem
- MCP Server Initialization: One of the configured MCP servers (github, gh-aw, safeoutputs) failed to initialize
The very short execution time (5 seconds) suggests the failure occurred during OpenCode initialization rather than during actual task execution.
Failed Jobs and Errors
Job Sequence
- ✅ pre_activation - succeeded (3s)
- ✅ activation - succeeded (2s)
- ❌ agent - failed (33s total, 5s execution)
- ❌ create_issue - failed (10s)
- ⏭️ detection - skipped
- ⏭️ missing_tool - skipped
Agent Job Steps (29 total)
- Steps 1-21: ✅ All succeeded (setup, download images, configure MCPs, create prompt)
- Step 22 ("Run OpenCode"): ❌ FAILED (5s execution)
- Steps 23-29: Partially succeeded (redaction, artifact uploads with warnings)
Key Observations
- All setup steps completed successfully
- MCP servers were configured: github, gh-aw, safeoutputs
- OpenCode version 0.15.13 was installed
- Failure occurred immediately when running OpenCode
- No agent stdio log was created (agent failed before logging started)
- No MCP logs were created
Investigation Findings
Environment Configuration
```
GH_AW_SAFE_OUTPUTS_STAGED=true
GH_AW_WORKFLOW_NAME=Smoke OpenCode
GH_AW_SAFE_OUTPUTS=/tmp/gh-aw/safeoutputs/outputs.jsonl
GH_AW_SAFE_OUTPUTS_CONFIG={"create_issue":{"max":1,"min":1},"missing_tool":{}}
```
Expected Task
Prompt: "Review the last 5 merged pull requests in this repository and post summary in an issue."
Required: the agent must call the safe-outputs MCP create_issue tool exactly once (min: 1, max: 1)
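GH_AW_SAFE_OUTPUTS_CONFIG sets min: 1 for create_issue. A run that records no create_issue entry would therefore presumably count as a failure even if the agent exits cleanly. The sketch below shows one way to check such min/max limits against outputs.jsonl. It assumes each JSONL line is an object with a "type" field naming the output kind. That schema and the function name `check_safe_outputs` are illustrative assumptions, not gh-aw's documented format.

```python
import json
from collections import Counter


def check_safe_outputs(jsonl_text: str, config: dict) -> list:
    """Compare recorded safe outputs with the min/max limits in config.

    ASSUMPTION: each non-blank JSONL line is an object whose "type" field
    names the output kind (e.g. "create_issue"); gh-aw's real schema may differ.
    Returns human-readable violations; an empty list means the limits are met.
    """
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line).get("type")] += 1

    problems = []
    for kind, limits in config.items():
        seen = counts[kind]  # Counter returns 0 for kinds never recorded
        minimum, maximum = limits.get("min", 0), limits.get("max")
        if seen < minimum:
            problems.append(f"{kind}: expected at least {minimum}, got {seen}")
        if maximum is not None and seen > maximum:
            problems.append(f"{kind}: expected at most {maximum}, got {seen}")
    return problems


# Config copied from GH_AW_SAFE_OUTPUTS_CONFIG; "" stands in for run #51's missing file.
config = json.loads('{"create_issue":{"max":1,"min":1},"missing_tool":{}}')
print(check_safe_outputs("", config))
# ['create_issue: expected at least 1, got 0']
```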
What Was Missing
- ❌ No agent stdio log (/tmp/gh-aw/agent-stdio.log)
- ❌ No safe-outputs file (/tmp/gh-aw/safeoutputs/outputs.jsonl)
- ❌ No MCP logs (/tmp/gh-aw/mcp-logs/)
- ❌ No agent output artifact
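A cleanup step could report which of these files are missing and publish a has_output flag. That flag is the output referenced by the conditional-job recommendation later in this report. The sketch uses the standard GitHub Actions mechanism of appending name=value lines to the file named by $GITHUB_OUTPUT. Several details are hypothetical: the function name, the rule that has_output means "non-empty outputs.jsonl", and the assumption that the agent job exposes this step output as needs.agent.outputs.has_output.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Paths from the "What Was Missing" list above.
SAFE_OUTPUTS = Path("/tmp/gh-aw/safeoutputs/outputs.jsonl")
EXPECTED = [Path("/tmp/gh-aw/agent-stdio.log"), SAFE_OUTPUTS, Path("/tmp/gh-aw/mcp-logs")]


def report_outputs(
    expected: Iterable[Path] = EXPECTED,
    safe_outputs: Path = SAFE_OUTPUTS,
    github_output: Optional[str] = None,
) -> bool:
    """List missing files and publish has_output=true|false as a step output."""
    for path in expected:
        if not path.exists():
            print(f"missing: {path}")
    # HYPOTHETICAL rule: only a non-empty safe-outputs file counts as output.
    has_output = safe_outputs.is_file() and safe_outputs.stat().st_size > 0
    target = github_output or os.environ.get("GITHUB_OUTPUT")
    if target:  # GitHub Actions reads name=value lines appended to this file
        with open(target, "a", encoding="utf-8") as fh:
            fh.write(f"has_output={'true' if has_output else 'false'}\n")
    return has_output
```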
Historical Context
This is a recurring pattern. Similar failures have occurred:
| Date | Run ID | Issue | Status | Pattern |
|---|---|---|---|---|
| 2025-10-30 | 18926079635 | Investigation cached | - | Anthropic API error |
| 2025-10-27 | 18840299097 | #2604 | Closed | Agent doesn't use safe-outputs (Codex) |
| 2025-10-24 | 18788162015 | #2307 | Closed | Agent doesn't use safe-outputs (GenAIScript) |
| 2025-10-22 | 18722224746 | #2143 | Closed | Agent doesn't use safe-outputs (OpenCode) |
| 2025-10-22 | 18715612738 | #2121 | Closed | Missing agent_output.json |
Pattern Classification
- Pattern ID: OPENCODE_AGENT_EXECUTION_FAILURE
- Related Patterns: OPENCODE_ANTHROPIC_API_ERROR, OPENCODE_NO_SAFE_OUTPUTS
- Category: AI Engine - Agent Execution Failure
- Severity: High
- Is Flaky: Yes - intermittent, not related to code changes
- Is Transient: Yes - likely infrastructure or API related
- Occurrence Count: 11+ occurrences
Comparison with Previous Run
The most recent cached investigation (run 18926079635, about 6 hours earlier) shows a similar pattern. However, its logs contained more specific error information pointing to a failed Anthropic API call. This suggests the issue may be caused by:
- Transient Anthropic API availability issues
- Rate limiting or quota problems
- Network connectivity to Anthropic's API
Recommended Actions
High Priority
- Implement retry logic for OpenCode agent
```yaml
- name: Run OpenCode (with retry)
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 5
    max_attempts: 3
    retry_wait_seconds: 30
    command: |
      # OpenCode execution command
```
- Make create_issue job conditional on agent success
```yaml
create_issue:
  needs: [agent, detection]
  if: needs.agent.result == 'success'
  runs-on: ubuntu-latest
```
- Add OpenCode validation step
```yaml
- name: Validate OpenCode Installation
  run: |
    opencode --version
    echo "Testing OpenCode CLI is functional"
```
- Add pre-flight API health check
```yaml
- name: Check Anthropic API Health
  run: |
    # Minimal test request to verify API is accessible
    # Skip agent execution if API is down
```
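The validation step above only prints the version, and the health-check step is still a placeholder. The Python sketch below shows one way to make both checks actually gate the run. `cli_works` requires the command to exist and to exit with status 0. `endpoint_reachable` treats any HTTP response as "reachable". Both names are hypothetical. An unauthenticated probe also cannot detect the invalid-credentials scenario; that would need a real, authenticated request.

```python
import shutil
import subprocess
import urllib.error
import urllib.request
from typing import Sequence


def cli_works(argv: Sequence[str], timeout: float = 30.0) -> bool:
    """True if argv[0] is on PATH and running argv exits with status 0."""
    if shutil.which(argv[0]) is None:
        return False
    try:
        result = subprocess.run(list(argv), capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def endpoint_reachable(url: str, timeout: float = 10.0) -> bool:
    """True if the URL answers with any HTTP status.

    A 401/404 from an unauthenticated probe still shows that DNS, TLS and
    routing work; only connection-level failures count as "down".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, just not with a 2xx/3xx status
    except (urllib.error.URLError, OSError):
        return False  # DNS failure, refused connection, timeout, ...
```

For example, a pre-flight script could call `cli_works(["opencode", "--version"])` and `endpoint_reachable("https://api.anthropic.com")`, then exit non-zero if either returns False so that the agent step is skipped.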
Medium Priority
- Add verbose logging for OpenCode execution
  - Enable debug mode to capture more diagnostic information
  - Log MCP server initialization status
  - Capture any stderr output
- Monitor OpenCode failure rate
  - Track scheduled smoke test success/failure rates
  - Alert if failure rate exceeds threshold
  - Identify patterns (time of day, specific commits, etc.)
- Investigate OpenCode 0.15.13 stability
  - Check if specific version has known issues
  - Consider pinning to a more stable version
  - Review OpenCode release notes for recent changes
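Monitoring the failure rate comes down to a rolling ratio over recent scheduled runs. The sketch below assumes run conclusions have already been fetched, for example from the GitHub REST API's workflow-runs listing, which reports a conclusion per run. The 10-run window and 20% threshold are placeholder values, not figures from this investigation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunResult:
    run_id: int
    conclusion: str  # e.g. "success" or "failure", as reported for the workflow run


def failure_rate(runs: Sequence[RunResult]) -> float:
    """Fraction of runs whose conclusion is "failure" (0.0 for no runs)."""
    if not runs:
        return 0.0
    return sum(r.conclusion == "failure" for r in runs) / len(runs)


def should_alert(runs: Sequence[RunResult], threshold: float = 0.2, window: int = 10) -> bool:
    """True if the failure rate over the last `window` runs exceeds `threshold`.

    ASSUMPTION: runs are ordered oldest to newest; 0.2 and 10 are placeholders.
    """
    return failure_rate(list(runs)[-window:]) > threshold
```

Only "failure" counts against the rate, so cancelled or skipped runs dilute it. That choice is worth revisiting if scheduled runs are often cancelled.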
Low Priority
- Add fallback mechanism
  - If the agent fails, create an informational issue about the failure
  - Preserve workflow outputs for debugging
- Improve error messages
  - Provide clearer feedback when agent execution fails
  - Include troubleshooting steps in workflow output
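The fallback issue could reuse facts the workflow already knows at failure time. Below is a hypothetical sketch of the title and body such a job might build; the title format and wording are illustrative only. The result could be passed to an issue-creation step, for example `gh issue create --title ... --body ...`, in a job guarded by `if: needs.agent.result == 'failure'`.

```python
from typing import Dict, List


def fallback_issue(workflow: str, run_id: int, failed_step: str, missing: List[str]) -> Dict[str, str]:
    """Build a title/body for an informational issue when the agent job fails.

    HYPOTHETICAL format: the title prefix and wording are illustrative only.
    """
    title = f"[smoke] {workflow}: agent failed before producing output (run {run_id})"
    body = [
        f'The "{failed_step}" step failed, so no safe outputs were produced.',
        "",
        "Missing files:",
        *[f"- {path}" for path in missing],
        "",
        "Troubleshooting: check the step log, the installed OpenCode version,",
        "the Anthropic API credentials, and MCP server startup.",
    ]
    return {"title": title, "body": "\n".join(body)}
```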
Prevention Strategies
- Retry Logic: Implement automatic retries with exponential backoff for transient failures
```yaml
retry:
  max_attempts: 3
  initial_delay: 30s
  backoff_multiplier: 2
```
- Conditional Job Execution: Only run downstream jobs when the agent succeeds
```yaml
if: needs.agent.result == 'success' && needs.agent.outputs.has_output == 'true'
```
- Health Checks: Add pre-flight checks for:
  - OpenCode CLI installation and version
  - Anthropic API availability
  - MCP server connectivity
- Graceful Degradation: Don't fail the entire workflow if downstream jobs can't run
```yaml
continue-on-error: true
```
- Enhanced Monitoring: Track and alert on:
  - Agent execution failure rates
  - Artifact creation success rates
  - API response times and errors
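The retry settings in the Retry Logic strategy (3 attempts, 30 s initial delay, multiplier 2) produce waits of 30 s and 60 s between attempts. The Python sketch below computes that schedule and wraps a callable in it. It illustrates exponential backoff in general and is not an existing gh-aw or GitHub Actions feature. `retry` and `backoff_delays` are hypothetical names, and the `sleep` parameter exists only so the schedule can be tested without waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, initial_delay: float, multiplier: float) -> List[float]:
    """Wait before attempt k+2 is initial_delay * multiplier**k (k = 0, 1, ...)."""
    return [initial_delay * multiplier**k for k in range(max_attempts - 1)]


def retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    initial_delay: float = 30.0,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, re-raising the last error after max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, initial_delay, multiplier)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

By contrast, the nick-fields/retry configuration in the high-priority recommendation uses a fixed retry_wait_seconds of 30 between attempts.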
Technical Details
Workflow Execution Timeline
06:11:14 - Workflow triggered (schedule)
06:11:18 - pre_activation started
06:11:21 - pre_activation completed ✅
06:11:24 - activation started
06:11:26 - activation completed ✅
06:11:29 - agent job started
06:11:54 - agent job: Run OpenCode step started
06:11:59 - agent job: Run OpenCode step FAILED ❌ (5s)
06:12:06 - create_issue job started
06:12:09 - create_issue job FAILED ❌
06:12:17 - Workflow completed (failure)
Agent Job Steps Summary
- Setup: 21 steps, all succeeded (25s)
- Execution: Step 22 failed (5s)
- Cleanup: 7 steps, mostly succeeded with warnings (3s)
MCP Servers Configured
- github - GitHub MCP server for repository operations
- gh-aw - GitHub Actions workflows MCP server
- safeoutputs - Safe outputs MCP for creating issues
Related Issues
- [smoke-detector] 🔍 Smoke Test Investigation - Smoke OpenCode Run #18722224746: Agent Does Not Use Safe-Outputs MCP Tools #2143 - OpenCode agent doesn't use safe-outputs (closed)
- [smoke-outpost] 🔍 Smoke Test Investigation - Smoke OpenCode: Missing agent_output.json File #2121 - Missing agent_output.json file (closed)
- [smoke-detector] 🔍 Smoke Test Investigation - Smoke Codex Run #49: Agent Output Artifact Missing in Staged Mode #2604 - Codex agent output artifact missing (closed)
- [smoke-detector] 🔍 Smoke Test Investigation - Smoke GenAIScript Run #57: Agent Does Not Use Safe-Outputs MCP Tools #2307 - GenAIScript agent doesn't use safe-outputs (closed)
- [task] Add graceful artifact handling for missing agent outputs #2534 - Task: Add graceful artifact handling (closed)
Next Steps
- Immediate: Monitor next scheduled run to see if issue recurs
- Short-term: Implement retry logic and conditional jobs
- Long-term: Add comprehensive health checks and monitoring
Investigation Metadata:
- Investigator: Smoke Detector (automated investigator)
- Investigation Run: 18931646294
- Pattern Database: /tmp/gh-aw/cache-memory/patterns/opencode_agent_execution_failure.json
- Investigation Record: /tmp/gh-aw/cache-memory/investigations/2025-10-30-18931623375.json
- Related PR: Add permissions validator for GitHub MCP toolsets #2768 (merged, unrelated to failure)
AI generated by Smoke Detector - Smoke Test Failure Investigator