🔍 Smoke Test Investigation - Run #77
Summary
The Smoke Codex workflow continues to fail with a TOML parse error at line 30, column 109. This is the SIXTH occurrence since October 31st. Issue #2930 was closed as "not_planned", but the failures are continuing, which indicates this critical problem needs urgent attention.
Failure Details
- Run: #19000564081
- Run Number: 77
- Commit: 258cb9f
- Commit Message: "Document --tools flag investigation for Claude Code CLI v2.0.31 (#2947)"
- Trigger: schedule (automated smoke test)
- Duration: 57 seconds
- Status: URGENT - Issue #2930 closed but failures continue
Root Cause Analysis
Primary Error
Error: TOML parse error at line 30, column 109
|
30 | env = { "GH_AW_SAFE_OUTPUTS" = "/tmp/gh-aw/safeoutputs/outputs.jsonl", "GH_AW_SAFE_OUTPUTS_CONFIG" = "\"{\\"create_issue\\":{\\"max\\":1},\\"missing_tool\\":{}}\", ...
| ^
missing comma between key-value pairs, expected `,`
Technical Analysis
The Problem: Codex MCP Configuration + JSON Environment Variables = Invalid TOML
The Codex engine generates its MCP configuration using TOML inline table syntax with shell variable substitution. When GH_AW_SAFE_OUTPUTS_CONFIG (which contains JSON with nested quotes) is substituted into the TOML env inline table, the result is syntactically invalid TOML.
Why This Happens:
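- GH_AW_SAFE_OUTPUTS_CONFIG contains: {"create_issue":{"max":1},"missing_tool":{}}
- This gets escaped for shell safety
- The escaped value is substituted into TOML: env = { "KEY" = "value" }
- Result: The nested quotes produce invalid TOML that the parser rejects. In the generated line shown above, TOML reads each \\" as an escaped backslash followed by a quote that closes the string, so the string ends just before create_issue (column 109) and the parser expects a comma there.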
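To make the failure mode concrete, here is a minimal Go sketch (not the project's actual generator) that contrasts naive substitution of the JSON value with escaping it exactly once as a TOML basic string. The helper name tomlBasicString is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tomlBasicString (hypothetical helper) renders s as a TOML basic string.
// It escapes double quotes and backslashes, and writes control characters
// as \uXXXX escapes, which TOML basic strings accept.
func tomlBasicString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"':
			b.WriteString(`\"`)
		case r == '\\':
			b.WriteString(`\\`)
		case r < 0x20 || r == 0x7f:
			fmt.Fprintf(&b, `\u%04X`, r)
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	jsonValue := `{"create_issue":{"max":1},"missing_tool":{}}`

	// Naive substitution: the first quote inside the JSON closes the string.
	naive := `"GH_AW_SAFE_OUTPUTS_CONFIG" = "` + jsonValue + `"`

	// Escaping once keeps the whole JSON document inside a single string.
	safe := `"GH_AW_SAFE_OUTPUTS_CONFIG" = ` + tomlBasicString(jsonValue)

	fmt.Println("naive:", naive)
	fmt.Println("safe: ", safe)
}
```

The second line printed keeps every inner quote as \" so the value stays a single TOML string. Escaping the value a second time (for the shell, say) reintroduces the problem, because \\" is again an escaped backslash followed by a closing quote.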
Failed Jobs and Errors
Job Sequence
- ✅ pre_activation - succeeded (7s)
- ✅ activation - succeeded (4s)
- ❌ agent - FAILED (21s) - Codex CLI cannot parse TOML config
- ⏭️ detection - skipped
- ⏭️ missing_tool - skipped
- ⏭️ create_issue - skipped
Historical Context
This is a recurring pattern that has now occurred 6 times in roughly 28 hours (2025-10-31 14:24 to 2025-11-01 18:03):
| # | Run ID | Date | Time Since Last | Status |
|---|---|---|---|---|
| 1 | 18975512058 | 2025-10-31 14:24 | Initial | ❌ Failed |
| 2 | 18977321431 | 2025-10-31 15:31 | ~1 hour | ❌ Failed |
| 3 | 18988214642 | 2025-11-01 00:12 | ~9 hours | ❌ Failed |
| 4 | 18992422186 | 2025-11-01 06:03 | ~6 hours | ❌ Failed |
| 5 | 18996415568 | 2025-11-01 12:03 | ~6 hours | ❌ Failed |
| 6 | 19000564081 | 2025-11-01 18:03 | ~6 hours | ❌ Failed |
Critical Note: Issue #2930 documented the 4th occurrence and was created on 2025-11-01 06:12. It was subsequently closed as "not_planned" on 2025-11-01 14:53. However, the 5th occurrence happened at 12:03 (before closure) and now the 6th occurrence has happened at 18:03 (after closure), proving the problem persists.
Pattern: Every scheduled Codex smoke test since October 31st has failed with the exact same error.
Investigation History
Previous investigations have documented this issue extensively:
- Pattern ID: CODEX_TOML_JSON_ESCAPING
- Documentation: /tmp/gh-aw/cache-memory/investigations/SUMMARY-codex-toml-json-escaping.md
- Previous Issue: [smoke-detector] [CRITICAL] Codex Smoke Tests Failing - TOML Parse Error (4th Occurrence) #2930 (closed as "not_planned")
- Recommended Fix: Switch to file-based TOML configuration (documented 6 times)
Recommended Actions
CRITICAL Priority (Do Immediately)
- Implement file-based TOML configuration for Codex
  Implementation Steps:
  - Create renderCodexMCPConfigFile() in pkg/workflow/mcp-config.go:

        func renderCodexMCPConfigFile(mcpServers map[string]interface{}) (string, error) {
            // Generate TOML config file content
            // Write to /tmp/gh-aw/mcp-config/config.toml
            // Return file path
        }

  - Update pkg/workflow/codex_engine.go
    - Call renderCodexMCPConfigFile() during engine setup
    - Change CLI invocation from inline config to: codex --config /tmp/gh-aw/mcp-config/config.toml
    - Remove inline TOML generation logic
  - Pattern to Follow: Copy from Claude and GenAIScript engines which already use file-based configs successfully
  Estimated Effort: 2-4 hours
  Benefits:
  - Eliminates quote escaping issues entirely
  - Matches pattern used by other engines
  - More maintainable and testable
  - Prevents future similar issues
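The implementation steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names renderMCPConfig and writeMCPConfigFile, the simplified map[string]map[string]string input, and the [mcp_servers.<name>.env] table layout are all hypothetical. Rendering is kept separate from file I/O so that the generated TOML can be tested without touching disk.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// tomlEscaper handles the escapes needed for typical values (backslash,
// quote, newline, carriage return, tab). It is a sketch, not a full encoder.
var tomlEscaper = strings.NewReplacer(
	`\`, `\\`, `"`, `\"`, "\n", `\n`, "\r", `\r`, "\t", `\t`,
)

func quote(s string) string { return `"` + tomlEscaper.Replace(s) + `"` }

func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// renderMCPConfig (hypothetical) returns TOML text with one env table per
// server. The table layout is an assumption made for this example.
func renderMCPConfig(servers map[string]map[string]string) string {
	var b strings.Builder
	for _, name := range sortedKeys(servers) {
		fmt.Fprintf(&b, "[mcp_servers.%s.env]\n", quote(name))
		env := servers[name]
		for _, key := range sortedKeys(env) {
			fmt.Fprintf(&b, "%s = %s\n", quote(key), quote(env[key]))
		}
		b.WriteString("\n")
	}
	return b.String()
}

// writeMCPConfigFile (hypothetical) writes the rendered config to
// dir/config.toml and returns the file path.
func writeMCPConfigFile(dir string, servers map[string]map[string]string) (string, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, "config.toml")
	if err := os.WriteFile(path, []byte(renderMCPConfig(servers)), 0o600); err != nil {
		return "", err
	}
	return path, nil
}

func main() {
	dir, err := os.MkdirTemp("", "mcp-config")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path, err := writeMCPConfigFile(dir, map[string]map[string]string{
		"safe_outputs": {
			"GH_AW_SAFE_OUTPUTS_CONFIG": `{"create_issue":{"max":1},"missing_tool":{}}`,
		},
	})
	if err != nil {
		panic(err)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(data))
}
```

Because the value is escaped once by the generator and never passes through shell substitution, the JSON arrives in the file as a single TOML string. In a real change, the rendered text should still be checked with an actual TOML parser, as the integration-test item in this issue recommends.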
HIGH Priority
- Add integration tests for MCP config generation
  - Test that generated TOML is valid (parse with actual TOML parser)
  - Test with JSON-valued environment variables
  - Test across all engines (Claude, Copilot, Codex, GenAIScript)
- Make smoke tests blocking for PRs
  - Prevent merges when smoke tests are failing
  - Add smoke test status to required checks
  - Currently smoke tests run on schedule but don't block merges
- Add pre-merge validation
  - Check for inline TOML generation with env vars
  - Validate generated configs can be parsed
  - Run smoke tests in PR CI (at least subset)
Prevention Strategies
- Architectural:
  - Never use inline config substitution with complex values (JSON, nested quotes, etc.)
  - File-based configs eliminate entire class of escaping bugs
  - Standardize approach across all engines
- Testing:
  - Integration tests that parse generated configs with real parsers
  - Smoke tests as required checks for PRs
  - Test matrix covering all engines × all config scenarios
- CI/CD:
  - Run smoke tests on every PR that touches engine code
  - Block merges when smoke tests fail
  - Alert on repeated failures
- Process:
  - Don't close issues as "not_planned" when failures are ongoing
  - Treat consecutive smoke test failures as P0 incidents
  - Require smoke test fixes before merging other changes
Technical Details
Engine Comparison
| Engine | Config Method | Format | Inline/File | Status |
|---|---|---|---|---|
| Codex | Inline TOML | TOML | Inline | ❌ Broken |
| Claude Code | File-based | JSON | File | ✅ Works |
| Copilot | Inline JSON | JSON | Inline | |
| GenAIScript | File-based | JSON | File | ✅ Works |
Observation: Engines using file-based configs don't have these issues.
Why File-Based Configs Are Better
- No Escaping Issues: File content doesn't go through shell interpretation
- Better Debugging: Can inspect actual config file on disk
- Easier Testing: Can test config generation independently
- More Maintainable: Clearer code, fewer edge cases
- Proven Pattern: Already working for Claude and GenAIScript
Example Fix (Pseudocode)
    // Current (broken):
    func (e *CodexEngine) BuildAgentStep() {
        mcpConfig := buildMCPConfigTOML() // Inline TOML with $VARS
        command := fmt.Sprintf(`codex --config-inline "%s"`, mcpConfig)
        // Shell substitution + TOML parsing = 💥
    }

    // Fixed:
    func (e *CodexEngine) BuildAgentStep() {
        configPath := "/tmp/gh-aw/mcp-config/codex-config.toml"
        renderCodexMCPConfigFile(mcpServers, configPath) // Write to file
        command := fmt.Sprintf(`codex --config "%s"`, configPath)
        // No escaping issues! 🎉
    }

Impact Assessment
Current Impact
- ❌ All Codex smoke tests failing since Oct 31 (roughly 28 hours)
- ❌ No automated validation of Codex engine changes
- ⚠️ Risk of shipping bugs without working smoke tests
- ⚠️ Developer velocity reduced due to manual testing needs
- 🔴 Issue closure without fix creates confusion and technical debt
Risk if Not Fixed
- 🔴 High risk of introducing Codex regressions
- 🔴 Cannot verify Codex engine works in production scenarios
- 🔴 Eroding confidence in CI/CD pipeline
- 🔴 Technical debt accumulating with workarounds
- 🔴 False signal from closing issues without fixing problems
Benefits of Fixing
- ✅ Restore automated Codex validation
- ✅ Prevent future config escaping issues
- ✅ Standardize config approach across engines
- ✅ Improve testability and maintainability
- ✅ Increase confidence in Codex deployments
- ✅ Reduce investigation overhead (6 investigations so far!)
Why This Needs Urgent Attention
- Issue #2930 was closed as "not_planned" but failures continue
- 6 consecutive failures over roughly 28 hours show this is systemic, not transient
- Zero Codex validation for more than a day = high risk of regressions
- Recommended fix is clear and well-documented (file-based config)
- Fix is estimated at only 2-4 hours but saves ongoing investigation time
- Other engines prove the pattern works (Claude, GenAIScript)
Related Information
- Workflow Source: .github/workflows/smoke-codex.md
- Engine Code: pkg/workflow/codex_engine.go
- MCP Config Code: pkg/workflow/mcp-config.go
- Pattern ID: CODEX_TOML_JSON_ESCAPING
- Previous Issue: [smoke-detector] [CRITICAL] Codex Smoke Tests Failing - TOML Parse Error (4th Occurrence) #2930 (closed without fix)
- Investigation Storage: /tmp/gh-aw/cache-memory/investigations/2025-11-01-19000564081.json
- Investigation Timestamp: 2025-11-01 18:07:00 UTC
- Investigator: Smoke Detector
- Investigation Run: #19000577848
- Pattern ID: CODEX_TOML_JSON_ESCAPING
- Severity: CRITICAL
- Occurrence Count: 6 (and counting)
- First Occurrence: 2025-10-31 14:24 UTC (roughly 28 hours ago)
- Is Flaky: No (100% reproducible, deterministic failure)
Labels: smoke-test, investigation, codex, critical, configuration, mcp, toml, urgent
AI generated by Smoke Detector - Smoke Test Failure Investigator