
[smoke-detector] [CRITICAL] Codex Smoke Tests Failing - TOML Parse Error (4th Occurrence) #2930

@github-actions


🔍 Smoke Test Investigation - Run #75

Summary

The Smoke Codex workflow continues to fail with a TOML parse error at line 30, column 109. This is the FOURTH occurrence of the same issue since October 31st, which points to a systemic problem that needs immediate attention. All scheduled Codex smoke tests are currently failing.

Failure Details

Root Cause Analysis

Primary Error

Error: TOML parse error at line 30, column 109
   |
30 | env = { "GH_AW_SAFE_OUTPUTS" = "/tmp/gh-aw/safeoutputs/outputs.jsonl", "GH_AW_SAFE_OUTPUTS_CONFIG" = "\"{\\"create_issue\\":{\\"max\\":1},\\"missing_tool\\":{}}\"", "GH_AW_ASSETS_BRANCH" = "", "GH_AW_ASSETS_MAX_SIZE_KB" = "", "GH_AW_ASSETS_ALLOWED_EXTS" = "", "GITHUB_REPOSITORY" = "githubnext/gh-aw", "GITHUB_SERVER_URL" = "(redacted)" }
   |                                                                                                             ^
missing comma between key-value pairs, expected `,`

Technical Analysis

The Problem: Codex MCP Configuration + JSON Environment Variables = Invalid TOML

The Codex engine generates its MCP configuration using TOML inline table syntax with shell variable substitution. When GH_AW_SAFE_OUTPUTS_CONFIG (which contains JSON with nested quotes) is substituted into the TOML env inline table, the result is syntactically invalid TOML.

Why This Happens:

  1. GH_AW_SAFE_OUTPUTS_CONFIG contains: {"create_issue":{"max":1},"missing_tool":{}}
  2. This gets JSON-escaped for shell safety
  3. The escaped value is substituted into TOML: env = { "KEY" = "value" }
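  4. Result: TOML reads \\ as an escaped backslash, so the " that follows closes the string early (at column 108). The parser then finds create_issue at column 109 where it expects a comma, and rejects the line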

Simplified Example:

# Invalid TOML generated:
env = { "CONFIG" = "\"{\\"key\\":\\"value\\"}\"" }
# The quote after \\ is unescaped, so the string ends just before key

# Valid TOML would be:
env.CONFIG = '{"key":"value"}'  # Single quotes (literal string, no escape processing), or...
# Use file-based config (recommended)
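
The reported column can be reproduced by hand. The following is a minimal sketch, not the Codex parser: it scans the failing line the way a TOML basic string is delimited (a backslash consumes the next byte, an unescaped quote ends the string) and reports where the value stops. The helper names basicStringEnd and firstColumnAfterValue are made up for this illustration, and the log line is shortened after the failing value.

```go
package main

import (
	"fmt"
	"strings"
)

// failingLine is line 30 from the log above, shortened after the failing value.
const failingLine = `env = { "GH_AW_SAFE_OUTPUTS" = "/tmp/gh-aw/safeoutputs/outputs.jsonl", "GH_AW_SAFE_OUTPUTS_CONFIG" = "\"{\\"create_issue\\":{\\"max\\":1},\\"missing_tool\\":{}}\"" }`

// basicStringEnd returns the index of the quote that closes the TOML basic
// string opening at s[open], or -1 if the string never closes. A backslash
// always consumes the following byte, which is all this sketch models.
func basicStringEnd(s string, open int) int {
	for i := open + 1; i < len(s); i++ {
		switch s[i] {
		case '\\':
			i++ // skip the escaped byte
		case '"':
			return i
		}
	}
	return -1
}

// firstColumnAfterValue returns the 1-based column of the first byte after
// the value assigned to key, or -1 if the key or closing quote is missing.
func firstColumnAfterValue(line, key string) int {
	prefix := `"` + key + `" = `
	at := strings.Index(line, prefix)
	if at < 0 {
		return -1
	}
	end := basicStringEnd(line, at+len(prefix))
	if end < 0 {
		return -1
	}
	return end + 2 // +1 for the next byte, +1 for 1-based columns
}

func main() {
	col := firstColumnAfterValue(failingLine, "GH_AW_SAFE_OUTPUTS_CONFIG")
	fmt.Printf("string closes early; next token starts at column %d\n", col)
	fmt.Printf("remaining input starts with: %q\n", failingLine[col-1:col+11])
}
```

Under these assumptions the scan stops right before create_issue, which matches the column in the parser's error message.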

Failed Jobs and Errors

Job Sequence

  1. pre_activation - succeeded (6s)
  2. activation - succeeded (4s)
  3. agent - FAILED (21s) - Codex CLI cannot parse TOML config
  4. ⏭️ detection - skipped
  5. ⏭️ missing_tool - skipped
  6. ⏭️ create_issue - skipped

Error Timeline

  • (redacted) Codex CLI invoked with MCP config
  • (redacted) TOML parse error at line 30, column 109
  • (redacted) Process exited with code 1
  • (redacted) Error validation detected the TOML parse error

Historical Context

This is a recurring pattern. Four consecutive runs have failed within roughly 16 hours:

| Run ID      | Date                  | Time Since Last | Status    |
|-------------|-----------------------|-----------------|-----------|
| 18975512058 | 2025-10-31 (redacted) | Initial         | ❌ Failed |
| 18977321431 | 2025-10-31 (redacted) | ~1 hour         | ❌ Failed |
| 18988214642 | 2025-11-01 (redacted) | ~9 hours        | ❌ Failed |
| 18992422186 | 2025-11-01 (redacted) | ~6 hours        | ❌ Failed |

Pattern: Every scheduled Codex smoke test since October 31st has failed with the same error.

Investigation History: Previous investigations have documented this issue extensively:

  • Pattern ID: CODEX_TOML_JSON_ESCAPING
  • Documented in: /tmp/gh-aw/cache-memory/investigations/SUMMARY-codex-toml-json-escaping.md
  • Recommended fix: Switch to file-based TOML configuration

Recommended Actions

CRITICAL Priority (Do Immediately)

  • Implement file-based TOML configuration for Codex

    Implementation Steps:

    1. Create renderCodexMCPConfigFile() in pkg/workflow/mcp-config.go

      func renderCodexMCPConfigFile(mcpServers map[string]interface{}) (string, error) {
          const configPath = "/tmp/gh-aw/mcp-config/config.toml"
          // TODO: generate the TOML config file content from mcpServers
          // and write it to configPath
          return configPath, nil // Return file path
      }
    2. Update pkg/workflow/codex_engine.go

      • Call renderCodexMCPConfigFile() during engine setup
      • Change CLI invocation from inline config to: codex --config /tmp/gh-aw/mcp-config/config.toml
      • Remove inline TOML generation logic
    3. Pattern to Follow: Copy from Claude and GenAIScript engines which already use file-based configs successfully

    Estimated Effort: 2-4 hours

    Benefits:

    • Eliminates quote escaping issues entirely
    • Matches pattern used by other engines
    • More maintainable and testable
    • Prevents future similar issues
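
As a rough illustration of the file-based approach, the sketch below renders environment variables as a standard TOML table with each value escaped as a TOML basic string, then writes the result to disk. It uses only the Go standard library. The helper names (tomlBasicString, renderEnvTable, writeConfigFile), the table name mcp_servers.example.env, and the output path are all placeholders chosen for this example, not the actual gh-aw code or the real Codex server name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// tomlBasicString quotes s as a TOML basic string, escaping quotes,
// backslashes, and control characters.
func tomlBasicString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"':
			b.WriteString(`\"`)
		case r == '\\':
			b.WriteString(`\\`)
		case r == '\n':
			b.WriteString(`\n`)
		case r == '\r':
			b.WriteString(`\r`)
		case r == '\t':
			b.WriteString(`\t`)
		case r < 0x20 || r == 0x7f:
			fmt.Fprintf(&b, `\u%04X`, r)
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

// renderEnvTable renders env as a standard TOML table named table, one key
// per line, with keys sorted so the output is deterministic.
func renderEnvTable(table string, env map[string]string) string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	fmt.Fprintf(&b, "[%s]\n", table)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s = %s\n", tomlBasicString(k), tomlBasicString(env[k]))
	}
	return b.String()
}

// writeConfigFile writes content to path, creating parent directories first.
func writeConfigFile(path, content string) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	return os.WriteFile(path, []byte(content), 0o644)
}

func main() {
	env := map[string]string{
		"GH_AW_SAFE_OUTPUTS_CONFIG": `{"create_issue":{"max":1},"missing_tool":{}}`,
	}
	// The table name and path are placeholders for this sketch.
	content := renderEnvTable("mcp_servers.example.env", env)
	path := filepath.Join(os.TempDir(), "gh-aw-example", "config.toml")
	if err := writeConfigFile(path, content); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(content)
}
```

Because the JSON value is escaped exactly once, by the code that knows it is producing TOML, and never passes through shell substitution, the nested quotes stay inside the string.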

HIGH Priority

  • Add integration tests for MCP config generation

    • Test that generated TOML is valid (parse with actual TOML parser)
    • Test with JSON-valued environment variables
    • Test across all engines (Claude, Copilot, Codex, GenAIScript)
  • Make smoke tests blocking for PRs

    • Prevent merges when smoke tests are failing
    • Add smoke test status to required checks
    • Currently smoke tests run on schedule but don't block merges
  • Add pre-merge validation

    • Check for inline TOML generation with env vars
    • Validate generated configs can be parsed
    • Run smoke tests in PR CI (at least subset)
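
The "check for inline TOML generation with env vars" item could start as a simple heuristic over generated config text. The sketch below flags lines that assign an inline table to a key named env. The function name findInlineEnvTables and the sample config are assumptions made for illustration, and a regular expression is only a coarse lint; it does not replace parsing the output with a real TOML parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// inlineEnvTable matches a line that assigns an inline table to a key named env.
var inlineEnvTable = regexp.MustCompile(`^\s*env\s*=\s*\{`)

// findInlineEnvTables returns the 1-based line numbers in config where env
// is written as an inline table.
func findInlineEnvTables(config string) []int {
	var hits []int
	for i, line := range strings.Split(config, "\n") {
		if inlineEnvTable.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	// Hypothetical generated config; the server name is a placeholder.
	config := strings.Join([]string{
		`[mcp_servers.example]`,
		`env = { "KEY" = "value" }`,
		`command = "example"`,
	}, "\n")
	for _, n := range findInlineEnvTables(config) {
		fmt.Printf("line %d: env is an inline table; prefer a file-based table\n", n)
	}
}
```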

MEDIUM Priority

  • Standardize ALL engines to file-based configs

    • Audit: Claude (✅ file-based), Copilot (inline JSON), Codex (inline TOML), GenAIScript (✅ file-based)
    • Migrate Copilot to file-based config
    • Document standard approach in contributing guide
  • Improve error messages

    • Add context about what config was being parsed
    • Show the problematic line in workflow logs
    • Link to troubleshooting documentation

Prevention Strategies

  1. Architectural:

    • Never use inline config substitution with complex values (JSON, nested quotes, etc.)
    • File-based configs eliminate entire class of escaping bugs
    • Standardize approach across all engines
  2. Testing:

    • Integration tests that parse generated configs with real parsers
    • Smoke tests as required checks for PRs
    • Test matrix covering all engines × all config scenarios
  3. CI/CD:

    • Run smoke tests on every PR that touches engine code
    • Block merges when smoke tests fail
    • Alert on repeated failures
  4. Documentation:

    • Document the file-based config pattern
    • Add troubleshooting guide for config errors
    • Explain why inline substitution is problematic

Technical Details

Engine Comparison

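| Engine      | Config Method | Format | Inline/File | Status     |
|-------------|---------------|--------|-------------|------------|
| Codex       | Inline TOML   | TOML   | Inline      | ❌ Broken  |
| Claude Code | File-based    | JSON   | File        | ✅ Works   |
| Copilot     | Inline JSON   | JSON   | Inline      | ⚠️ Fragile |
| GenAIScript | File-based    | JSON   | File        | ✅ Works   |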

Observation: Engines using file-based configs don't have these issues.

Why File-Based Configs Are Better

  1. No Escaping Issues: File content doesn't go through shell interpretation
  2. Better Debugging: Can inspect actual config file on disk
  3. Easier Testing: Can test config generation independently
  4. More Maintainable: Clearer code, fewer edge cases
  5. Proven Pattern: Already working for Claude and GenAIScript

Example Fix (Pseudocode)

// Current (broken):
func (e *CodexEngine) BuildAgentStep() {
    mcpConfig := buildMCPConfigTOML()  // Inline TOML with $VARS
    command := fmt.Sprintf(`codex --config-inline "%s"`, mcpConfig)
    // Shell substitution + TOML parsing = 💥
}

// Fixed:
func (e *CodexEngine) BuildAgentStep() {
    // Write the config to a file; returns "/tmp/gh-aw/mcp-config/config.toml"
    configPath, err := renderCodexMCPConfigFile(mcpServers)
    if err != nil {
        return  // Real code should report the write error
    }
    command := fmt.Sprintf(`codex --config "%s"`, configPath)
    // No escaping issues! 🎉
}

Related Information

Impact Assessment

Current Impact

  • All Codex smoke tests failing since Oct 31
  • No automated validation of Codex engine changes
  • ⚠️ Risk of shipping bugs without working smoke tests
  • ⚠️ Developer velocity reduced due to manual testing needs

Risk if Not Fixed

  • 🔴 High risk of introducing Codex regressions
  • 🔴 Cannot verify Codex engine works in production scenarios
  • 🔴 Eroding confidence in CI/CD pipeline
  • 🔴 Technical debt accumulating with workarounds

Benefits of Fixing

  • ✅ Restore automated Codex validation
  • ✅ Prevent future config escaping issues
  • ✅ Standardize config approach across engines
  • ✅ Improve testability and maintainability
  • ✅ Increase confidence in Codex deployments

Investigation (redacted)

  • Investigator: Smoke Detector
  • Investigation Run: #18992435114
  • Pattern ID: CODEX_TOML_JSON_ESCAPING
  • Severity: CRITICAL
  • Occurrence Count: 4 (and counting)
  • First Occurrence: 2025-10-31
  • Is Flaky: No (deterministic failure)

Labels: smoke-test, investigation, codex, critical, configuration, mcp, toml

AI generated by Smoke Detector - Smoke Test Failure Investigator
