Research Workflow - Still failing after TAVILY_API_KEY fix (20% success rate, 17 days offline) #11722

@github-actions

Problem

The Research workflow remains largely non-operational, with only a 20% success rate, even though the TAVILY_API_KEY secret has been added. The workflow has been effectively offline for 17 days.

Current Status (2026-01-25)

Root Cause Analysis

Key insight: Daily News workflow recovered immediately after TAVILY_API_KEY was added (2026-01-22), but Research workflow did NOT recover. This suggests:

  1. Hypothesis 1: Workflow needs recompilation

    • Secret was added AFTER last compilation
    • Lock file may not reference the new secret
    • Solution: make recompile
  2. Hypothesis 2: Different MCP Gateway configuration

    • Research may use different MCP server setup than Daily News
    • May need additional configuration beyond TAVILY_API_KEY
    • Review frontmatter differences
  3. Hypothesis 3: Intermittent MCP Gateway issues

    • 1/5 runs succeeded (20% rate)
    • May be timing/connectivity related
    • Could be transient MCP server availability
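Hypothesis 1 can be checked before recompiling anything. The following is a minimal sketch, assuming the compiled lock file names the secret literally (for example in an `env:` block); `lock_references_secret` is a hypothetical helper, not part of any existing tooling.

```shell
# Hypothetical helper: report whether a compiled lock file mentions a secret.
# Usage: lock_references_secret <lock-file> <secret-name>
lock_references_secret() {
  lock_file=$1
  secret_name=$2
  if [ ! -f "$lock_file" ]; then
    echo "missing: $lock_file"
    return 2
  fi
  # -F: match a fixed string, -q: quiet; the exit status reports the match.
  if grep -Fq "$secret_name" "$lock_file"; then
    echo "referenced"
  else
    echo "not referenced"
    return 1
  fi
}

# Example (run from the repository root):
#   lock_references_secret .github/workflows/research.lock.yml TAVILY_API_KEY
```

A "not referenced" result would support Hypothesis 1; a "referenced" result points toward Hypotheses 2 or 3.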

Comparison with Daily News and MCP Inspector

| Aspect | Daily News (✅) | Research (⚠️) | MCP Inspector (❌) |
|---|---|---|---|
| TAVILY_API_KEY | Present | Present | Present |
| Recovery | Immediate | Partial (20%) | None (0%) |
| Success rate | 40% (recovering) | 20% (low) | 0% (failing) |
| Last compiled | Unknown | Unknown | Unknown |
| MCP Gateway | Working | Intermittent | Failing |
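The "Last compiled" row is Unknown for all three workflows. One rough way to fill it in is to use the date of the last commit that touched each lock file as a proxy for the compile date and compare it with the day the secret was added (2026-01-22). This is a sketch under that assumption; `last_commit_date` and `staleness` are hypothetical helpers, and the lock file names for Daily News and MCP Inspector are assumed by analogy with `research.lock.yml`.

```shell
# Hypothetical helper: date (YYYY-MM-DD) of the last commit that touched a file.
# The commit date is only a proxy for when the workflow was last compiled.
last_commit_date() {
  git log -1 --format=%cs -- "$1"
}

# Print "stale" if compile_date is strictly earlier than secret_date,
# "current" otherwise, and "unknown" if compile_date is empty.
# Both dates are YYYY-MM-DD, so stripping the dashes gives comparable integers.
# A same-day compile is reported as "current" even if it preceded the secret.
staleness() {
  if [ -z "$1" ]; then
    echo "unknown"
    return 0
  fi
  compile_num=$(printf '%s' "$1" | tr -d '-')
  secret_num=$(printf '%s' "$2" | tr -d '-')
  if [ "$compile_num" -lt "$secret_num" ]; then
    echo "stale"
  else
    echo "current"
  fi
}

# Example (run from the repository root; file names are assumptions):
#   for wf in daily-news research mcp-inspector; do
#     d=$(last_commit_date ".github/workflows/$wf.lock.yml")
#     echo "$wf: $(staleness "$d" 2026-01-22) (last commit: ${d:-none})"
#   done
```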

Recommended Investigation Steps

Step 1: Recompile Workflow

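# Run from the repository root
cd /path/to/repo
# Regenerate the compiled workflow lock files
make recompile
# Commit and push only the Research lock file
git add .github/workflows/research.lock.yml
git commit -m "Recompile Research workflow after TAVILY_API_KEY fix"
git push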

Step 2: Compare Frontmatter

Compare configurations:

  • .github/workflows/daily-news.md (working, 40% success)
  • .github/workflows/research.md (failing, 20% success)
  • .github/workflows/mcp-inspector.md (failing, 0% success)

Look for differences in:

  • MCP server configuration
  • Tool permissions
  • Timeout settings
  • Environment variables
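The comparison can be narrowed to the frontmatter alone. Below is a minimal sketch, assuming each workflow `.md` file opens with a YAML frontmatter block delimited by `---` lines; `frontmatter` and `diff_frontmatter` are hypothetical helper names.

```shell
# Print the frontmatter of a workflow .md file: the lines between the
# first pair of '---' delimiter lines (assumed layout).
frontmatter() {
  awk '
    /^---[[:space:]]*$/ { delim++; next }  # count delimiter lines, skip printing them
    delim == 1          { print }          # inside the first frontmatter block
    delim >= 2          { exit }           # stop once the block has closed
  ' "$1"
}

# Show a unified diff of the frontmatter of two workflow files.
# Returns the exit status of diff (0 = identical, 1 = different).
diff_frontmatter() {
  fm_a=$(mktemp) && fm_b=$(mktemp) || return 1
  frontmatter "$1" > "$fm_a"
  frontmatter "$2" > "$fm_b"
  diff -u "$fm_a" "$fm_b"
  rc=$?
  rm -f "$fm_a" "$fm_b"
  return "$rc"
}

# Example (run from the repository root):
#   diff_frontmatter .github/workflows/daily-news.md .github/workflows/research.md
```

Diffing Daily News against Research first is likely the most informative pairing, since those two differ in outcome while sharing the same secret.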

Step 3: Analyze Failed Run Logs

Download artifacts from run 21078189533:

  • Check /tmp/gh-aw/mcp-logs/ for MCP Gateway errors
  • Review agent stdio logs
  • Look for timeout or connection issues
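A first pass over the downloaded logs can be automated. This is a rough sketch: `scan_mcp_logs` is a hypothetical helper, the error patterns are guesses at what MCP Gateway failures might look like, and the layout of the artifacts inside the download directory is not assumed, so the whole directory is searched.

```shell
# Hypothetical helper: list log lines that look like timeout or connection errors.
# Output format is path:line-number:matching-line; exit status 1 means no matches.
scan_mcp_logs() {
  # -r: recurse, -n: line numbers, -i: ignore case, -E: extended regex
  grep -rniE 'timed? ?out|connection (refused|reset|closed)|ECONNREFUSED|ETIMEDOUT' "$1"
}

# Example (the target directory name is arbitrary):
#   gh run download 21078189533 --dir run-21078189533
#   scan_mcp_logs run-21078189533
```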

Step 4: Test Manually Multiple Times

# Trigger 5 runs to check for intermittent issues
for i in {1..5}; do
  gh workflow run research.lock.yml
  # Space out the dispatches; this does not wait for a run to finish
  sleep 60
done

Monitor success rate of manual runs.
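The success rate can be computed from the recent run conclusions rather than counted by hand. A minimal sketch follows; `success_rate` is a hypothetical helper that reads one conclusion per line, and the `gh run list` invocation in the trailing comment assumes the runs of interest are the 5 most recent ones for this workflow.

```shell
# Read one run conclusion per line on stdin; print the integer success percentage.
# Empty lines (runs that have not concluded yet) are ignored.
success_rate() {
  awk '
    NF              { total++ }  # every non-empty line is a finished run
    $1 == "success" { ok++ }
    END {
      if (total == 0) { print "no runs"; exit }
      printf "%d\n", (ok * 100) / total
    }
  '
}

# Example:
#   gh run list --workflow research.lock.yml --limit 5 \
#     --json conclusion --jq '.[].conclusion' | success_rate
```

With one success out of five runs this prints 20, matching the rate reported above.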

Success Criteria

  • Research workflow runs successfully
  • Success rate returns to >80% over the next 5 runs (with 5 runs, this means all 5 succeed)
  • Research and knowledge work capabilities fully operational
  • No intermittent failures

Priority: P1 (High)

Impact: Research capabilities severely limited for 17 days. This blocks automated research tasks, knowledge work, and investigation workflows.

Urgency: High - research functionality is critical for knowledge-based agents and analysis workflows.

Next steps:

  1. Recompile workflow (5 min)
  2. Test manually 3-5 times (30 min)
  3. Analyze intermittent failure pattern (30 min)
  4. Apply fix based on findings (variable)

AI generated by Workflow Health Manager - Meta-Orchestrator

  • expires on Jan 26, 2026, 3:08 AM UTC
