🔍 Smoke Test Investigation - Run #75
Summary
The Smoke Codex workflow continues to fail with a TOML parse error at line 30, column 109. This is the FOURTH occurrence of the same issue since October 31st, indicating a critical systemic problem that needs immediate attention. All scheduled Codex smoke tests are currently failing.
Failure Details
- Run: #18992422186
- Run Number: 75
- Commit: 9545f45
- Commit Message: "Fix command injection and artipacked vulnerabilities in unbloat-docs workflow (#2925)"
- Trigger: schedule (automated smoke test)
- Duration: 56 seconds
- Status: CRITICAL - Blocking all Codex smoke tests
Root Cause Analysis
Primary Error
Error: TOML parse error at line 30, column 109
|
30 | env = { "GH_AW_SAFE_OUTPUTS" = "/tmp/gh-aw/safeoutputs/outputs.jsonl", "GH_AW_SAFE_OUTPUTS_CONFIG" = "\"{\\"create_issue\\":{\\"max\\":1},\\"missing_tool\\":{}}\"", "GH_AW_ASSETS_BRANCH" = "", "GH_AW_ASSETS_MAX_SIZE_KB" = "", "GH_AW_ASSETS_ALLOWED_EXTS" = "", "GITHUB_REPOSITORY" = "githubnext/gh-aw", "GITHUB_SERVER_URL" = "(redacted)" }
| ^
missing comma between key-value pairs, expected `,`
Technical Analysis
The Problem: Codex MCP Configuration + JSON Environment Variables = Invalid TOML
The Codex engine generates its MCP configuration using TOML inline table syntax with shell variable substitution. When GH_AW_SAFE_OUTPUTS_CONFIG (which contains JSON with nested quotes) is substituted into the TOML env inline table, the result is syntactically invalid TOML.
Why This Happens:
- GH_AW_SAFE_OUTPUTS_CONFIG contains: {"create_issue":{"max":1},"missing_tool":{}}
- This gets JSON-escaped for shell safety
- The escaped value is substituted into TOML: env = { "KEY" = "value" }
- Result: Nested quotes create invalid TOML that the parser rejects
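The steps above can be checked with a small, stdlib-only Go sketch. In a TOML basic string, `\\` is an escaped backslash, so the `"` that follows it closes the string. This is consistent with the parser then expecting a comma at the reported column. The helper names `closingQuote` and `tomlBasicString` are hypothetical and not part of gh-aw. The scanner only locates the closing quote and is not a TOML parser.

```go
package main

import (
	"fmt"
	"strings"
)

// closingQuote returns the index of the quote that terminates the TOML basic
// string starting at line[start], or -1 if the string is never closed.
// A backslash always consumes the character that follows it.
func closingQuote(line string, start int) int {
	for i := start + 1; i < len(line); i++ {
		switch line[i] {
		case '\\':
			i++ // skip the escaped character
		case '"':
			return i
		}
	}
	return -1
}

// tomlBasicString renders s as a TOML basic (double-quoted) string,
// escaping backslashes, quotes and control characters.
func tomlBasicString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch r {
		case '\\':
			b.WriteString(`\\`)
		case '"':
			b.WriteString(`\"`)
		case '\n':
			b.WriteString(`\n`)
		case '\r':
			b.WriteString(`\r`)
		case '\t':
			b.WriteString(`\t`)
		default:
			if r < 0x20 || r == 0x7f {
				fmt.Fprintf(&b, `\u%04X`, r)
			} else {
				b.WriteRune(r)
			}
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	// Value shaped like the simplified example in this report.
	broken := `"\"{\\"key\\":\\"value\\"}\""`
	fmt.Println("broken value closes at index", closingQuote(broken, 0), "of", len(broken)-1)

	// The same JSON payload, escaped once from its raw form.
	fixed := tomlBasicString(`{"key":"value"}`)
	fmt.Println(fixed)
	fmt.Println("fixed value closes at index", closingQuote(fixed, 0), "of", len(fixed)-1)
}
```

The broken value closes well before its last character, while the value escaped once from the raw JSON closes exactly at its end. Escaping from the raw value in a single step, rather than layering shell and JSON escapes, is what keeps the string intact.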
Simplified Example:
# Invalid TOML generated:
env = { "CONFIG" = "\"{\\"key\\":\\"value\\"}\"" }
^^ conflicting quotes!
# Valid TOML would be:
env.CONFIG = '{"key":"value"}' # Single quotes, or...
# Use file-based config (recommended)
Failed Jobs and Errors
Job Sequence
- ✅ pre_activation - succeeded (6s)
- ✅ activation - succeeded (4s)
- ❌ agent - FAILED (21s) - Codex CLI cannot parse TOML config
- ⏭️ detection - skipped
- ⏭️ missing_tool - skipped
- ⏭️ create_issue - skipped
Error Timeline
- (redacted) Codex CLI invoked with MCP config
- (redacted) TOML parse error at line 30, column 109
- (redacted) Process exited with code 1
- (redacted) Error validation detected the TOML parse error
Historical Context
This is a recurring failure, not an isolated one:
| Run ID | Date | Time Since Last | Status |
|---|---|---|---|
| 18975512058 | 2025-10-31 (redacted) | Initial | ❌ Failed |
| 18977321431 | 2025-10-31 (redacted) | ~1 hour | ❌ Failed |
| 18988214642 | 2025-11-01 (redacted) | ~9 hours | ❌ Failed |
| 18992422186 | 2025-11-01 (redacted) | ~6 hours | ❌ Failed |
Pattern: Every scheduled Codex smoke test since October 31st has failed with the same error.
Investigation History: Previous investigations have documented this issue extensively:
- Pattern ID: CODEX_TOML_JSON_ESCAPING
- Documented in: /tmp/gh-aw/cache-memory/investigations/SUMMARY-codex-toml-json-escaping.md
- Recommended fix: Switch to file-based TOML configuration
Recommended Actions
CRITICAL Priority (Do Immediately)
- Implement file-based TOML configuration for Codex
  Implementation Steps:
  - Create renderCodexMCPConfigFile() in pkg/workflow/mcp-config.go

        func renderCodexMCPConfigFile(mcpServers map[string]interface{}) (string, error) {
            // Generate TOML config file content
            // Write to /tmp/gh-aw/mcp-config/config.toml
            // Return file path
        }

  - Update pkg/workflow/codex_engine.go
    - Call renderCodexMCPConfigFile() during engine setup
    - Change CLI invocation from inline config to: codex --config /tmp/gh-aw/mcp-config/config.toml
    - Remove inline TOML generation logic
  - Pattern to Follow: Copy from Claude and GenAIScript engines which already use file-based configs successfully
  Estimated Effort: 2-4 hours
  Benefits:
  - Eliminates quote escaping issues entirely
  - Matches pattern used by other engines
  - More maintainable and testable
  - Prevents future similar issues
HIGH Priority
- Add integration tests for MCP config generation
  - Test that generated TOML is valid (parse with actual TOML parser)
  - Test with JSON-valued environment variables
  - Test across all engines (Claude, Copilot, Codex, GenAIScript)
- Make smoke tests blocking for PRs
  - Prevent merges when smoke tests are failing
  - Add smoke test status to required checks
  - Currently smoke tests run on schedule but don't block merges
- Add pre-merge validation
  - Check for inline TOML generation with env vars
  - Validate generated configs can be parsed
  - Run smoke tests in PR CI (at least subset)
MEDIUM Priority
- Standardize ALL engines to file-based configs
  - Audit: Claude (✅ file-based), Copilot (inline JSON), Codex (inline TOML), GenAIScript (✅ file-based)
  - Migrate Copilot to file-based config
  - Document standard approach in contributing guide
- Improve error messages
  - Add context about what config was being parsed
  - Show the problematic line in workflow logs
  - Link to troubleshooting documentation
Prevention Strategies
- Architectural:
  - Never use inline config substitution with complex values (JSON, nested quotes, etc.)
  - File-based configs eliminate entire class of escaping bugs
  - Standardize approach across all engines
- Testing:
  - Integration tests that parse generated configs with real parsers
  - Smoke tests as required checks for PRs
  - Test matrix covering all engines × all config scenarios
- CI/CD:
  - Run smoke tests on every PR that touches engine code
  - Block merges when smoke tests fail
  - Alert on repeated failures
- Documentation:
  - Document the file-based config pattern
  - Add troubleshooting guide for config errors
  - Explain why inline substitution is problematic
Technical Details
Engine Comparison
| Engine | Config Method | Format | Inline/File | Status |
|---|---|---|---|---|
| Codex | Inline TOML | TOML | Inline | ❌ Broken |
| Claude Code | File-based | JSON | File | ✅ Works |
| Copilot | Inline JSON | JSON | Inline | |
| GenAIScript | File-based | JSON | File | ✅ Works |
Observation: Engines using file-based configs don't have these issues.
Why File-Based Configs Are Better
- No Escaping Issues: File content doesn't go through shell interpretation
- Better Debugging: Can inspect actual config file on disk
- Easier Testing: Can test config generation independently
- More Maintainable: Clearer code, fewer edge cases
- Proven Pattern: Already working for Claude and GenAIScript
Example Fix (Pseudocode)
// Current (broken):
func (e *CodexEngine) BuildAgentStep() string {
    mcpConfig := buildMCPConfigTOML() // Inline TOML with $VARS
    // Shell substitution + TOML parsing = 💥
    return fmt.Sprintf(`codex --config-inline "%s"`, mcpConfig)
}

// Fixed:
func (e *CodexEngine) BuildAgentStep(mcpServers map[string]interface{}) (string, error) {
    // Write the config to a file and pass only its path to the CLI.
    configPath, err := renderCodexMCPConfigFile(mcpServers)
    if err != nil {
        return "", err
    }
    // No escaping issues! 🎉
    return fmt.Sprintf(`codex --config "%s"`, configPath), nil
}
Related Information
- Workflow Source: .github/workflows/smoke-codex.md
- Engine Code: pkg/workflow/codex_engine.go
- MCP Config Code: pkg/workflow/mcp-config.go
- Related PR: Fix command injection and artipacked vulnerabilities in unbloat-docs workflow #2925 (Security fixes - unrelated to this issue)
- Pattern ID: CODEX_TOML_JSON_ESCAPING
- Investigation Storage: /tmp/gh-aw/cache-memory/investigations/2025-11-01-18992422186.json
Impact Assessment
Current Impact
- ❌ All Codex smoke tests failing since Oct 31
- ❌ No automated validation of Codex engine changes
- ⚠️ Risk of shipping bugs without working smoke tests
- ⚠️ Developer velocity reduced due to manual testing needs
Risk if Not Fixed
- 🔴 High risk of introducing Codex regressions
- 🔴 Cannot verify Codex engine works in production scenarios
- 🔴 Eroding confidence in CI/CD pipeline
- 🔴 Technical debt accumulating with workarounds
Benefits of Fixing
- ✅ Restore automated Codex validation
- ✅ Prevent future config escaping issues
- ✅ Standardize config approach across engines
- ✅ Improve testability and maintainability
- ✅ Increase confidence in Codex deployments
Investigation (redacted)
- Investigator: Smoke Detector
- Investigation Run: #18992435114
- Pattern ID: CODEX_TOML_JSON_ESCAPING
- Severity: CRITICAL
- Occurrence Count: 4 (and counting)
- First Occurrence: 2025-10-31
- Is Flaky: No (deterministic failure)
Labels: smoke-test, investigation, codex, critical, configuration, mcp, toml
AI generated by Smoke Detector - Smoke Test Failure Investigator