🔍 Smoke Test Investigation - Run #63
Summary
Smoke Claude workflow failed in the detection job due to a transient Anthropic API 500 error: "Overloaded". The main agent job completed successfully and created the expected issue output, but the secondary threat detection analysis failed when the API service was temporarily overloaded.
Failure Details
- Run: 18907236182
- Workflow: Smoke Claude
- Commit: 8f8cfbd
- Trigger: schedule
- Duration: 4.7 minutes
- Date: 2025-10-29 12:09:23 UTC
Root Cause Analysis
Primary Issue
The detection job failed with an Anthropic API 500 error indicating the service was overloaded:
API Error: 500 {"type":"error","error":{"type":"api_error","message":"Overloaded"},"request_id":null}
This occurred when the detection agent attempted to read /tmp/gh-aw/threat-detection/prompt.txt using the Read tool. The API call failed with "Streaming fallback triggered" before returning the overload error.
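To make the "transient or not" call mechanical, the error line above could be parsed and classified. The following is a minimal sketch only: `parse_api_error` and `is_transient` are hypothetical helpers (not part of gh-aw or any Anthropic SDK), and treating every 5xx status as retryable is an assumption rather than documented behavior.

```python
import json
import re

# Hypothetical helpers for illustration; not part of gh-aw or any SDK.
_ERROR_LINE = re.compile(r"API Error: (\d{3}) (\{.*\})")

def parse_api_error(line):
    """Return (status, error_type, message), or None if the line does not match."""
    match = _ERROR_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    try:
        payload = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    error = payload.get("error", {})
    return status, error.get("type"), error.get("message")

def is_transient(status, message):
    """Assumption: any 5xx status or an "Overloaded" message is worth retrying."""
    return status >= 500 or message == "Overloaded"

# The exact error line reported by the detection job.
log_line = 'API Error: 500 {"type":"error","error":{"type":"api_error","message":"Overloaded"},"request_id":null}'
status, error_type, message = parse_api_error(log_line)
print(status, error_type, message, is_transient(status, message))
# 500 api_error Overloaded True
```

With this split, a 401 authentication error like the one seen in run 18893290104 would be classified as non-transient and surfaced immediately.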
Why This Matters
- The main agent job succeeded (created issue output, 16 turns, 162k tokens, ~$0.22)
- The create_issue job succeeded
- Only the detection job failed due to this transient API error
- This is a new failure pattern: the first occurrence of the "Overloaded" error in our investigation history
Is This a Real Problem?
This is a transient infrastructure issue with the Anthropic API, not a code bug. However, it represents a workflow robustness issue:
- The detection job is a secondary analysis step
- Its failure caused the entire smoke test to fail
- This could mask real issues in future runs
Failed Jobs and Errors
Failed: detection (2.7m)
Error Type: api_error - Anthropic API Overloaded
HTTP Status: 500
Message: "Overloaded"
Context: Failed while attempting to read threat detection prompt file
Is Transient: Yes ✅
Succeeded Jobs
- ✅ pre_activation (4s)
- ✅ activation (4s)
- ✅ agent (1.4m) - Main job completed successfully
- ✅ create_issue (7s) - Issue created successfully
Investigation Findings
Agent Performance
- Turns: 16
- Tokens Used: 162,803
- Estimated Cost: $0.22
- Task: Review last 5 merged PRs and create summary issue
- Result: Task completed successfully ✅
Detection Job Analysis
- Started threat detection analysis
- Attempted to read /tmp/gh-aw/threat-detection/prompt.txt
- Hit Anthropic API overload before completing first read
- Duration: 2.7 minutes (includes wait time for API)
- 4 turns attempted before failure
Pattern Analysis
Pattern ID: ANTHROPIC_API_OVERLOADED (New Pattern)
- First Occurrence: This run (2025-10-29)
- Severity: Medium (transient, not code-related)
- Category: AI Engine - API Failure
- Flaky: Yes - depends on Anthropic API load
Related Patterns Found:
- Similar to OPENCODE_ANTHROPIC_API_ERROR (AI_APICallError) seen in run 18893290104
- Different from typical OpenCode "no safe-outputs" patterns
Recommended Actions
🔴 High Priority
- Add retry logic to detection job: Implement exponential backoff retry for Anthropic API 500 errors
- Initial delay: 30 seconds
- Max retries: 3
- Doubles delay on each retry
- Prevents: Transient API failures causing workflow failure
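The retry parameters above (30 second initial delay, 3 retries, doubling each time) could be sketched as follows. This is an illustration under assumptions, not the detection job's actual code: `TransientAPIError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the caller is assumed to raise `TransientAPIError` when it sees a 500 "Overloaded" response.

```python
import time

class TransientAPIError(Exception):
    """Hypothetical marker for retryable failures such as HTTP 500 "Overloaded"."""

def backoff_delays(initial_delay=30.0, max_retries=3):
    """Delays before each retry; 30s, 60s, 120s with the defaults."""
    return [initial_delay * (2 ** attempt) for attempt in range(max_retries)]

def call_with_retry(operation, initial_delay=30.0, max_retries=3, sleep=time.sleep):
    """Run operation(); on TransientAPIError, wait and retry up to max_retries times."""
    delays = backoff_delays(initial_delay, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientAPIError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            sleep(delays[attempt])

# Demo: fail twice, then succeed. Sleep is replaced so the demo runs instantly.
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientAPIError("Overloaded")
    return "ok"

waited = []
print(call_with_retry(flaky, sleep=waited.append), waited)
# ok [30.0, 60.0]
```

With the defaults, the worst case adds 210 seconds of waiting before the job gives up, which is worth weighing against the job's current 2.7 minute duration.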
🟡 Medium Priority
- Make detection job non-blocking: Add continue-on-error: true to the detection job or make it optional
- The detection job is a secondary analysis
- Main agent success should be the primary signal
- Prevents: Spurious smoke test failures that could mask real issues
- Add rate limit handling: Implement monitoring and graceful handling of Anthropic API rate limits
- Track frequency of overload errors
- Identify peak load times (scheduled runs may create spikes)
- Prevents: Cascading failures during high load periods
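Tracking overload frequency and peak load times could start as simply as bucketing error timestamps by UTC hour. The sketch below assumes timestamps are already collected somewhere; `overloads_by_hour` is a hypothetical helper, and only the first sample corresponds to this report (the run date above). The other two timestamps are invented purely to show the output shape.

```python
from collections import Counter
from datetime import datetime, timezone

def overloads_by_hour(timestamps):
    """Count overload errors per UTC hour of day (0-23)."""
    return Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)

samples = [
    datetime(2025, 10, 29, 12, 9, 23, tzinfo=timezone.utc),  # run date from this report
    datetime(2025, 10, 30, 12, 4, 0, tzinfo=timezone.utc),   # invented for illustration
    datetime(2025, 10, 30, 3, 15, 0, tzinfo=timezone.utc),   # invented for illustration
]
counts = overloads_by_hour(samples)
print(counts.most_common(1))
# [(12, 2)]
```

If a single hour dominates once real data accumulates, that would support moving the scheduled trigger away from it.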
🟢 Low Priority
- Implement fallback detection: Consider simpler static analysis when API unavailable
- Prevents: Complete detection failure
Prevention Strategies
Immediate Actions
- ✅ Pattern documented: Created /tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json
- ✅ Investigation saved: Created /tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json
- Next: Implement retry logic for detection job
Long-term Improvements
- Consider scheduled workflow staggering to reduce concurrent API load
- Monitor Anthropic API status before running detection jobs
- Implement circuit breaker pattern for repeated API failures
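For the circuit breaker item, a minimal sketch might look like the following. All names (`CircuitBreaker`, `CircuitOpenError`) and the threshold and cool-down values are assumptions chosen for illustration, not an existing gh-aw component.

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are skipped because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # illustrative value
        self.reset_after = reset_after              # seconds to stay open; illustrative
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("skipping call: recent repeated failures")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now reopens the breaker immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker fully
        return result
```

Wrapping the detection job's API calls in `breaker.call(...)` would let repeated overload errors fail fast instead of spending minutes waiting on a service that is known to be struggling.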
Historical Context
Similar Past Failures
- Run 18893290104 (2025-10-29): OpenCode with AI_APICallError calling Anthropic API
- Different error: 401 vs 500
- Different engine: OpenCode vs Claude
- Same root: Anthropic API issues
Pattern Comparison
This is the first occurrence of the "Overloaded" error specifically. Previous Anthropic API failures were:
- Authentication errors (401)
- Generic API call errors
- Not "Overloaded" capacity issues
Frequency
- Anthropic API failures: Rare but recurring (3 occurrences in past week across different engines)
- This specific error: First occurrence
Related PR Context
PR #2717: "docs: document zizmor URL links and verbose Docker command output"
- Merged: Just before workflow run
- Content: Documentation updates
- Relevance: ❌ Not related to failure
- Assessment: Failure was due to external Anthropic API overload, not code changes
Investigation Metadata
- Pattern: ANTHROPIC_API_OVERLOADED (New)
- Investigator: smoke-detector
- Investigation Run: 18907360485
- Investigation Files:
- /tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json
- /tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json
AI generated by Smoke Detector - Smoke Test Failure Investigator