Description
Problem
The Research workflow remains largely non-operational, with only a 20% success rate, even though the TAVILY_API_KEY secret has been added. It has been effectively offline for 17 days.
Current Status (2026-01-25)
- Success rate: 1/5 recent runs (20%)
- Latest failure: run 21078189533 (2026-01-16)
- Last success: 2026-01-08 (17 days ago)
- Failed step: Suspected "Start MCP gateway" (similar to MCP Inspector)
- Previous issue: Fix Research workflow - Critical failure (90% failure rate) #11434 (auto-closed 2026-01-24)
Root Cause Analysis
Key insight: Daily News workflow recovered immediately after TAVILY_API_KEY was added (2026-01-22), but Research workflow did NOT recover. This suggests:
Hypothesis 1: Workflow needs recompilation
- Secret was added AFTER last compilation
- Lock file may not reference the new secret
- Solution: make recompile
Hypothesis 2: Different MCP Gateway configuration
- Research may use different MCP server setup than Daily News
- May need additional configuration beyond TAVILY_API_KEY
- Review frontmatter differences
Hypothesis 3: Intermittent MCP Gateway issues
- 1/5 runs succeeded (20% rate)
- May be timing/connectivity related
- Could be transient MCP server availability
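A quick way to probe Hypothesis 1 before recompiling is to check whether the compiled lock file mentions the secret at all. The following is a minimal sketch, not the project's own tooling: the helper name `lock_references_secret` is hypothetical, and it assumes that a lock file which uses the secret contains the literal string TAVILY_API_KEY.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE contains the literal string SECRET_NAME.
# Assumption: a lock file that uses the secret mentions its name verbatim.
lock_references_secret() {
    file=$1
    secret_name=$2
    grep -qF -- "$secret_name" "$file"
}

# Example usage (run from the repository root):
#   if lock_references_secret .github/workflows/research.lock.yml TAVILY_API_KEY; then
#       echo "lock file references the secret"
#   else
#       echo "lock file does not reference the secret; recompile"
#   fi
```

A match does not prove the secret is wired correctly; a miss, however, would be strong evidence for Hypothesis 1.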
Comparison with Daily News and MCP Inspector
| Aspect | Daily News (✅) | Research | MCP Inspector (❌) |
|---|---|---|---|
| TAVILY_API_KEY | Present | Present | Present |
| Recovery | Immediate | Partial (20%) | None (0%) |
| Success rate | 40% recovering | 20% low | 0% failing |
| Last compiled | Unknown | Unknown | Unknown |
| MCP Gateway | Working | Intermittent | Failing |
Recommended Investigation Steps
Step 1: Recompile Workflow
cd /path/to/repo
make recompile
git add .github/workflows/research.lock.yml
git commit -m "Recompile Research workflow after TAVILY_API_KEY fix"
git push
Step 2: Compare Frontmatter
Compare configurations:
- .github/workflows/daily-news.md (working, 40% success)
- .github/workflows/research.md (failing, 20% success)
- .github/workflows/mcp-inspector.md (failing, 0% success)
Look for differences in:
- MCP server configuration
- Tool permissions
- Timeout settings
- Environment variables
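The frontmatter comparison can be done mechanically rather than by eye. This sketch assumes each workflow file opens with a frontmatter block delimited by `---` lines; the helper names `frontmatter` and `diff_frontmatter` are hypothetical.

```shell
#!/bin/sh
# Print the frontmatter of a workflow Markdown file: the lines between the
# first and second '---' delimiter lines.
# Assumption: frontmatter is a '---'-delimited block at the top of the file.
frontmatter() {
    awk '
        /^---[ \t\r]*$/ { delims++; next }   # count delimiter lines, skip them
        delims == 1     { print }            # inside the frontmatter block
        delims >= 2     { exit }             # stop after the closing delimiter
    ' "$1"
}

# Hypothetical helper: unified diff of the frontmatter of two workflow files.
# Exit status follows diff: 0 if identical, 1 if different.
diff_frontmatter() {
    a=$(mktemp) && b=$(mktemp) || return 2
    frontmatter "$1" > "$a"
    frontmatter "$2" > "$b"
    diff -u "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Example usage (run from the repository root):
#   diff_frontmatter .github/workflows/daily-news.md .github/workflows/research.md
```

Diffing only the frontmatter keeps prompt-body changes out of the comparison, so differences in MCP servers, tools, timeouts, or environment variables stand out.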
Step 3: Analyze Failed Run Logs
Download artifacts from run 21078189533:
- Check /tmp/gh-aw/mcp-logs/ for MCP Gateway errors
- Review agent stdio logs
- Look for timeout or connection issues
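The log review can start with a coarse search over the downloaded artifacts. This is a sketch under stated assumptions: the helper name `scan_mcp_logs` is hypothetical, the search patterns are guesses at what timeout or connection errors might look like rather than known MCP Gateway output, and the directory layout inside the artifacts is not confirmed.

```shell
#!/bin/sh
# Hypothetical helper: list log lines under DIR that look like timeout or
# connection failures, prefixed with file name and line number.
# The patterns are guesses, not known MCP Gateway messages.
scan_mcp_logs() {
    grep -rniE 'time(d)? ?out|connection (refused|reset|closed)|ECONNREFUSED|ETIMEDOUT' "$1"
}

# Example usage (requires an authenticated gh CLI; the artifact layout inside
# the download directory is an assumption):
#   gh run download 21078189533 --dir ./run-21078189533
#   scan_mcp_logs ./run-21078189533
```

An empty result does not rule out a gateway problem; it only means none of the guessed patterns appear, in which case the logs need a manual read.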
Step 4: Test Manually Multiple Times
# Run 3-5 times to check for intermittent issues
for i in {1..5}; do
  gh workflow run research.lock.yml
  sleep 60
done
Monitor the success rate of the manual runs.
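Once the manual runs have finished, the success rate can be tallied from their conclusions instead of being counted by hand. A minimal sketch: the helper name `success_rate` is hypothetical, and it treats only the conclusion `success` as a pass. Blank lines (runs that have not concluded yet) are ignored.

```shell
#!/bin/sh
# Read one run conclusion per line on stdin (e.g. "success", "failure") and
# print "<successes>/<total> (<percent>%)". Blank lines are skipped.
success_rate() {
    awk '
        NF              { total++ }
        $1 == "success" { ok++ }
        END {
            if (total == 0) { print "0/0 (n/a)"; exit }
            printf "%d/%d (%d%%)\n", ok, total, ok * 100 / total
        }
    '
}

# Example usage (requires an authenticated gh CLI):
#   gh run list --workflow research.lock.yml --limit 5 \
#       --json conclusion --jq '.[].conclusion' | success_rate
```

Because the percentage is truncated to an integer, 2 of 3 successes prints as 66%.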
Success Criteria
- Research workflow runs successfully
- Success rate returns to >80% over next 5 runs
- Research and knowledge work capabilities fully operational
- No intermittent failures
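Note that with only five runs, a rate strictly above 80% means all five must succeed, since 4/5 is exactly 80%. The check can be written with integer arithmetic only; the helper name `meets_success_criterion` is hypothetical.

```shell
#!/bin/sh
# Succeed if SUCCESSES out of TOTAL is strictly greater than 80%.
# Integer arithmetic only:
#   successes / total > 80 / 100   <=>   successes * 100 > total * 80
meets_success_criterion() {
    [ "$2" -gt 0 ] && [ $(( $1 * 100 )) -gt $(( $2 * 80 )) ]
}

# Example usage:
#   meets_success_criterion 5 5 && echo "criterion met"
#   meets_success_criterion 4 5 || echo "4/5 is exactly 80%, not above it"
```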
Priority: P1 (High)
Impact: Research capabilities severely limited for 17 days. This blocks automated research tasks, knowledge work, and investigation workflows.
Urgency: High - research functionality is critical for knowledge-based agents and analysis workflows.
Next steps:
- Recompile workflow (5 min)
- Test manually 3-5 times (30 min)
- Analyze intermittent failure pattern (30 min)
- Apply fix based on findings (variable)
AI generated by Workflow Health Manager - Meta-Orchestrator
- expires on Jan 26, 2026, 3:08 AM UTC