Skip to content

[smoke-detector] 🔄 THIRD REGRESSION - Run #39: Second Fix Also Failed #2346

@github-actions

Description

@github-actions

🔄 THIRD REGRESSION ALERT - Run #39: Second Fix Also Failed

Critical Update

BOTH fix attempts have now failed. Run #39 shows that commit 7cc5320, which claimed to address the runtime JSON construction issue identified in #2341, has NOT resolved the problem.

Summary

After TWO separate fix attempts:

  1. ✅ Commit f9dece6 (Run Test coverage console formatting august 13 #38) - Fixed YAML-level quoting
  2. ✅ Commit 7cc5320 (Run Fix Codex token usage summing for multiple chat requests #39) - Removed toJSON() wrapper

Result: ❌ Issue STILL persists

Failure Details - Run #39

Error Analysis

Error Message

Invalid JSON in --additional-mcp-config: Expected ',' or '}' after property value in JSON at position 558 (line 1 column 559)

Error Position Shift

Run # Commit Error Position Status
36 a160185 557 ❌ Original failure
37 (same) 557 ❌ Duplicate
38 f9dece6 557 ❌ First fix failed
39 7cc5320 558 ❌ Second fix failed

Observation: Error position shifted by 1 character (557→558), indicating the JSON structure was modified by the fix, but the fundamental problem remains.

What the Second Fix Attempted

Commit 7cc5320 specifically addressed the runtime JSON construction issue:

Root Cause Analysis (from commit message)

1. Job-level env: GH_AW_SAFE_OUTPUTS_CONFIG = '{"create_issue":...}'
2. GitHub Actions evaluates: ${{ env.GH_AW_SAFE_OUTPUTS_CONFIG }} → {"create_issue":...}
3. With toJSON(): ${{ toJSON(env.GH_AW_SAFE_OUTPUTS_CONFIG) }} → "{\"create_issue\":...}"
4. Go's json.Marshal() adds another layer: ""{\"create_issue\":...}"" (DOUBLE QUOTES)

The Fix

Remove toJSON() wrapper since the env variable is already a JSON string.

Expected Result

"GH_AW_SAFE_OUTPUTS_CONFIG":"{\"create_issue\":...}" (Valid JSON)

Actual Result

❌ Still invalid at position 558

Why This Fix Also Failed

Possible Reasons

  1. Not Applied: The workflow lock files may not have been properly recompiled

    • Need to verify the actual YAML contains the fix
    • Check if there are cached artifacts preventing updates
  2. Multiple Locations: There may be other places using this env var

    • The fix was applied to one code path but not all
    • Different steps/jobs might handle it differently
  3. Third Layer of Escaping: There's another transformation we haven't identified

    • Shell-level escaping before CLI execution?
    • GitHub Actions expression evaluation doing something unexpected?
    • String templating in Go code?
  4. Wrong Analysis: The theory was correct but implementation missed something

    • Maybe toJSON() wasn't actually the problem
    • Or it's only part of the problem
  5. Timing/Cache Issue: Changes not taking effect

    • Build artifacts from previous runs?
    • GitHub Actions cache?

Timeline of Failures

Run # Date Time Commit Status Fix Attempt
36 Oct 24 21:10 a160185 -
37 Oct 24 22:43 (same) -
- Oct 24 23:13 f9dece6 - First fix: YAML quoting
38 Oct 24 23:22 f9dece6 Created #2341
- Oct 24 23:59 7cc5320 - Second fix: Remove toJSON()
39 Oct 25 00:04 7cc5320 This report

Time between second fix and third failure: 5 minutes

Critical Next Steps

1. Verify the Fix Was Actually Applied

# Check the actual workflow lock file
cat .github/workflows/smoke-copilot.lock.yml | grep -A 20 "GH_AW_SAFE_OUTPUTS_CONFIG"

# Look for toJSON() usage
grep -r "toJSON.*GH_AW_SAFE_OUTPUTS_CONFIG" .github/workflows/

If toJSON() is still present: The recompilation didn't work or wasn't committed.

If toJSON() is gone: The fix was applied but something else is wrong.

2. Add Comprehensive Debug Logging

# In the workflow, before executing copilot:
- name: Debug MCP Config
  run: |
    echo "Raw env var:"
    echo "$GH_AW_SAFE_OUTPUTS_CONFIG"
    echo "---"
    echo "After expression evaluation:"
    echo '${{ env.GH_AW_SAFE_OUTPUTS_CONFIG }}'
    echo "---"
    echo "Full MCP config that will be passed:"
    # Print the actual --additional-mcp-config argument

3. Test Locally

# Reproduce outside GitHub Actions:
export GH_AW_SAFE_OUTPUTS_CONFIG='{"create_issue":{"max":1,"min":1},"missing_tool":{}}'

# Build the MCP config JSON as the code does
go run ./cmd/awcli test-mcp-config

4. Examine the Actual Command

From workflow logs, extract the EXACT command being run:

copilot --additional-mcp-config '...' --prompt "..."

Then manually validate that JSON string.

5. Check All Code Paths

# Find all places where GH_AW_SAFE_OUTPUTS_CONFIG is used:
grep -rn "GH_AW_SAFE_OUTPUTS_CONFIG" pkg/
grep -rn "GH_AW_SAFE_OUTPUTS_CONFIG" .github/workflows/

# Check if there are multiple templates or steps:
grep -rn "additional-mcp-config" pkg/workflow/

Recommended Actions

Immediate

  • STOP trying to fix blindly - We need visibility first
  • Add debug logging to see the actual values at each transformation step
  • Verify the second fix was applied by checking the lock file
  • Check if there are multiple code paths that need fixing

Short-term

  • Add validation that fails fast with detailed error message

    func validateMCPConfig(config string) error {
        var test map[string]interface{}
        if err := json.Unmarshal([]byte(config), &test); err != nil {
            log.Printf("INVALID MCP CONFIG:")
            log.Printf("%s", config)
            return fmt.Errorf("MCP config is invalid JSON: %w", err)
        }
        return nil
    }
  • Add integration test that actually runs the CLI with the generated config

  • Simplify the approach: Maybe don't use env vars for JSON, use files instead?

Long-term

  • Redesign env var handling to be type-safe

    • Use structs + json.Marshal() instead of string concatenation
    • Have a clear contract for when values are JSON vs strings
    • Add validation at compile time, not runtime
  • Document the transformation pipeline

    1. Go code defines env var
    2. YAML template is rendered
    3. Lock file is generated  
    4. GitHub Actions evaluates expressions
    5. Shell receives the command
    6. CLI parses the JSON
    

    Identify EXACTLY where each quoting/escaping happens.

Analysis

This is now a systematic failure of the fix process:

  1. ❌ First fix: Incomplete (missed runtime layer)
  2. ❌ Second fix: Also incomplete or not applied

The issue has survived two separate attempts by the Copilot agent to fix it, with detailed analysis and commit messages. This suggests:

  • The problem is more complex than currently understood
  • The debugging approach needs to change
  • We need visibility into the actual data at runtime, not just theory

Recommendation: Before attempting a third fix, add comprehensive logging and verification. A third blind fix attempt without more data will likely also fail.


Investigation stored: /tmp/gh-aw/cache-memory/investigations/2025-10-25-18795080703.json
Pattern ID: COPILOT_DOUBLE_QUOTED_ENV_VAR_IN_MCP_CONFIG
Status: THIRD REGRESSION - TWO FIX ATTEMPTS FAILED
Critical: YES - Blocking Copilot smoke tests

AI generated by Smoke Detector - Smoke Test Failure Investigator

AI generated by Smoke Detector - Smoke Test Failure Investigator

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions