[smoke-detector] 🔍 Smoke Test Investigation - Run #63: Anthropic API Overload in Detection Job #2730

@github-actions

🔍 Smoke Test Investigation - Run #63

Summary

The Smoke Claude workflow failed in the detection job because of a transient Anthropic API 500 error ("Overloaded"). The main agent job completed successfully and created the expected issue output, but the secondary threat detection analysis failed while the API was temporarily overloaded.

Failure Details

  • Run: 18907236182
  • Workflow: Smoke Claude
  • Commit: 8f8cfbd
  • Trigger: schedule
  • Duration: 4.7 minutes
  • Date: 2025-10-29 12:09:23 UTC

Root Cause Analysis

Primary Issue

The detection job failed with an Anthropic API 500 error indicating the service was overloaded:

API Error: 500 {"type":"error","error":{"type":"api_error","message":"Overloaded"},"request_id":null}

This occurred when the detection agent attempted to read /tmp/gh-aw/threat-detection/prompt.txt using the Read tool. The log reported "Streaming fallback triggered" before the API call returned the overload error.
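To make the "transient or not" decision mechanical, the logged error line can be parsed and classified. The following is a minimal sketch, not the workflow's actual implementation: `parse_api_error` and `is_transient` are hypothetical helper names, and treating every 5xx status or an "Overloaded" message as retryable is an assumption.

```python
import json
import re

# Hypothetical helpers for illustration; they are not part of the workflow.
ERROR_LINE = re.compile(r"API Error: (\d{3}) (\{.*\})")


def parse_api_error(line):
    """Return (status, error_type, message) from a logged API error line, or None."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    payload = json.loads(match.group(2))
    error = payload.get("error", {})
    return status, error.get("type"), error.get("message")


def is_transient(status, message):
    """Assumption: any 5xx status or an 'Overloaded' message is worth retrying."""
    return status >= 500 or message == "Overloaded"


# The error line exactly as it appears in this run's log.
line = (
    'API Error: 500 {"type":"error","error":{"type":"api_error",'
    '"message":"Overloaded"},"request_id":null}'
)
status, error_type, message = parse_api_error(line)
print(status, error_type, message, is_transient(status, message))
```

Under this rule a 401 authentication error, such as the one seen in run 18893290104, would not be classified as transient and would still fail the job immediately.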

Why This Matters

  • The main agent job succeeded (created issue output, 16 turns, 162k tokens, ~$0.22)
  • The create_issue job succeeded
  • Only the detection job failed due to this transient API error
  • This is a new failure pattern - first occurrence of "Overloaded" error in our investigation history

Is This a Real Problem?

This is a transient infrastructure issue with the Anthropic API, not a code bug. However, it represents a workflow robustness issue:

  1. The detection job is a secondary analysis step
  2. Its failure caused the entire smoke test to fail
  3. Failures caused by this secondary step could mask real issues in future runs

Failed Jobs and Errors

Failed: detection (2.7m)

Error Type: api_error - Anthropic API Overloaded
HTTP Status: 500
Message: "Overloaded"
Context: Failed while attempting to read threat detection prompt file
Is Transient: Yes ✅

Succeeded Jobs

  • ✅ pre_activation (4s)
  • ✅ activation (4s)
  • ✅ agent (1.4m) - Main job completed successfully
  • ✅ create_issue (7s) - Issue created successfully

Investigation Findings

Agent Performance

  • Turns: 16
  • Tokens Used: 162,803
  • Estimated Cost: $0.22
  • Task: Review last 5 merged PRs and create summary issue
  • Result: Task completed successfully ✅

Detection Job Analysis

  • Started threat detection analysis
  • Attempted to read /tmp/gh-aw/threat-detection/prompt.txt
  • Hit Anthropic API overload before completing first read
  • Duration: 2.7 minutes (includes wait time for API)
  • 4 turns attempted before failure

Pattern Analysis

Pattern ID: ANTHROPIC_API_OVERLOADED (New Pattern)

  • First Occurrence: This run (2025-10-29)
  • Severity: Medium (transient, not code-related)
  • Category: AI Engine - API Failure
  • Flaky: Yes - depends on Anthropic API load

Related Patterns Found:

  • Similar to OPENCODE_ANTHROPIC_API_ERROR (AI_APICallError) seen in run 18893290104
  • Different from typical OpenCode "no safe-outputs" patterns

Recommended Actions

🔴 High Priority

  • Add retry logic to detection job: Implement exponential backoff retry for Anthropic API 500 errors
    • Initial delay: 30 seconds
    • Max retries: 3
    • Backoff: double the delay on each retry
    • Prevents: Transient API failures causing workflow failure
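The retry parameters above (30 second initial delay, 3 retries, delay doubling) could be sketched as follows. This is an illustrative sketch under stated assumptions, not the detection job's code: `TransientAPIError` and `retry_with_backoff` are hypothetical names, and the `sleep` parameter exists only so the example runs without actually waiting.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical exception standing in for an HTTP 5xx response from the API."""


def retry_with_backoff(call, max_retries=3, initial_delay=30.0, sleep=time.sleep):
    """Run call(); on TransientAPIError wait and retry, doubling the delay each time."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientAPIError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delay)
            delay *= 2


# Usage: a call that fails twice, then succeeds. Delays are recorded, not slept.
attempts = {"count": 0}
recorded_delays = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientAPIError("Overloaded")
    return "ok"


result = retry_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)
```

With these values the worst case adds 30 + 60 + 120 = 210 seconds of waiting before the job gives up, which is worth weighing against the job's timeout.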

🟡 Medium Priority

  • Make detection job non-blocking: Add continue-on-error: true to the detection job or make it optional
    • The detection job is a secondary analysis
    • Main agent success should be the primary signal
    • Prevents: Spurious smoke test failures that mask real issues
  • Add overload and rate limit handling: Monitor and gracefully handle Anthropic API overload and rate limit errors
    • Track frequency of overload errors
    • Identify peak load times (scheduled runs may create spikes)
    • Prevents: Cascading failures during high load periods
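Tracking overload frequency and peak load times could start as a simple per-hour count. The sketch below assumes overload errors are logged with ISO 8601 timestamps; `overloads_by_hour` is a hypothetical helper. Only the first timestamp comes from this run, and the other two are made-up values included purely to show the aggregation.

```python
from collections import Counter
from datetime import datetime, timezone


def overloads_by_hour(timestamps):
    """Count overload errors per UTC hour of day from ISO 8601 timestamps."""
    hours = Counter()
    for stamp in timestamps:
        moment = datetime.fromisoformat(stamp).astimezone(timezone.utc)
        hours[moment.hour] += 1
    return hours


samples = [
    "2025-10-29T12:09:23+00:00",  # this run
    "2025-10-30T12:10:05+00:00",  # hypothetical, for illustration only
    "2025-10-30T18:00:00+02:00",  # hypothetical, 16:00 UTC
]
counts = overloads_by_hour(samples)
print(counts.most_common(1))
```

A cluster around one hour would support the suspicion that scheduled runs create load spikes.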

🟢 Low Priority

  • Implement fallback detection: Consider simpler static analysis when API unavailable
    • Prevents: Complete detection failure

Prevention Strategies

Immediate Actions

  1. Pattern documented: Created /tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json
  2. Investigation saved: Created /tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json
  3. Next: Implement retry logic for detection job

Long-term Improvements

  • Consider scheduled workflow staggering to reduce concurrent API load
  • Monitor Anthropic API status before running detection jobs
  • Implement circuit breaker pattern for repeated API failures
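The circuit breaker idea could look roughly like the sketch below: after a number of consecutive failures, calls are skipped until a cooldown has passed. The class name, the threshold of 3, and the 300 second cooldown are all assumptions chosen for illustration, and the injectable `clock` exists only to make the example deterministic.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow calls again after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call may proceed."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls are permitted again;
        # another failure re-opens the circuit and restarts the cooldown.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


# Usage with a fake clock so the example does not wait.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow())  # circuit is open right after the third failure
now[0] = 300.0
print(breaker.allow())  # cooldown elapsed, a trial call is allowed
```

Because each workflow run is a separate process, the failure count and open timestamp would have to be persisted between runs for this to have any effect; where to store that state is left open here.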

Historical Context

Similar Past Failures

  • Run 18893290104 (2025-10-29): OpenCode with AI_APICallError calling Anthropic API
    • Different error: 401 vs 500
    • Different engine: OpenCode vs Claude
    • Same root: Anthropic API issues

Pattern Comparison

This is the first occurrence of the "Overloaded" error specifically. Previous Anthropic API failures were one of:

  • Authentication errors (401)
  • Generic API call errors

None of them were "Overloaded" capacity issues.

Frequency

  • Anthropic API failures: Rare but recurring (3 occurrences in past week across different engines)
  • This specific error: First occurrence

Related PR Context

PR #2717: "docs: document zizmor URL links and verbose Docker command output"

  • Merged: Just before workflow run
  • Content: Documentation updates
  • Relevance: ❌ Not related to failure
  • Assessment: Failure was due to external Anthropic API overload, not code changes

Investigation Metadata

  • Pattern: ANTHROPIC_API_OVERLOADED (New)
  • Investigator: smoke-detector
  • Investigation Run: 18907360485
  • Investigation Files:
    • /tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json
    • /tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json

AI generated by Smoke Detector - Smoke Test Failure Investigator
