Conversation

@ramarivera ramarivera commented Jan 4, 2026

Summary

This PR fixes a bug where GLM-4.7 occasionally emits tool call XML tags directly inside the reasoning_content thinking block instead of in the separate tool_calls field. The malformed output prevents those tool calls from being executed.

Problem Statement

GLM-4.7 with interleaved thinking mode sometimes "leaks" tool call syntax into the reasoning/thinking block:

Example Malformed Output

Thinking:
<invoke name="bash">
  <command>bun test</command>
  <description>Run tests</description>
</invoke>

Or with MCP tools:

Thinking:
<tool_call>pal_thinkdeep<arg_key>step</arg_key><arg_value>...</arg_value>...</tool_call>

Key Issue: These malformed tool calls in reasoning_content are stored as text and never executed. The client receives them as thinking text rather than executable tool calls.
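
For contrast, a well-formed response keeps the call out of the thinking text and in the structured tool-calls field. The sketch below follows the OpenAI-compatible chat-completions shape referenced later in this PR; the field values are illustrative, not taken from a real session:

{
  "reasoning_content": "I should run the test suite to verify the change.",
  "content": "",
  "tool_calls": [
    {
      "id": "call_example",
      "type": "function",
      "function": {
        "name": "bash",
        "arguments": "{\"command\": \"bun test\", \"description\": \"Run tests\"}"
      }
    }
  ]
}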

Root Cause Analysis

  1. Enhanced Thinking Mechanism: GLM-4.7 implements a "think before acting" mechanism that sometimes causes tool call syntax to leak into the thinking block

  2. Interleaved Thinking Complexity: When the model thinks before each tool call, there's a higher chance of thinking content "leaking" tool call syntax

  3. API Processing: The OpenAI-compatible SDK looks for function_call items in response output, but malformed tool calls in reasoning_content are just stored as text and never executed

  4. ProviderTransform Behavior: When interleaved.field === "reasoning_content", the transform moves reasoning text but doesn't extract tool calls from malformed reasoning

Session Evidence

Real-world example from a development session:

  • User correction: "you sent a tool call in a thinking block, try again"
  • Assistant response contained tool invocation XML in reasoning block
  • pal_thinkdeep is an MCP server tool, NOT a built-in reasoning mechanism
  • Key insight: Tool invocations should NEVER appear in thinking blocks—this indicates model output malformation requiring client-side sanitization

🚨 Alert: Sanitized Session Logs for GLM-4.7 Malformed Tool Calls

📢 Callout: These logs prove the issue where GLM-4.7 embeds malformed XML tool calls in reasoning blocks. The fix extracts and parses these into proper tool-call parts, as shown in the test cases. All sensitive data (paths, IDs) has been sanitized with placeholders.

🔍 View Sanitized Session Logs Hierarchy
ramarivera_glm4.7_interleaved_thinking_fix/
└── session_logs/
    └── <session-id>/
        ├── msg_example.json
        │   {
        │     "id": "msg_b4c2bb883001Yd3E3BGGMOn1tK",
        │     "sessionID": "<session-id>",
        │     "role": "assistant",
        │     "time": {
        │       "created": 1766509492355,
        │       "completed": 1766509558418
        │     },
        │     "parentID": "<msg-id>",
        │     "modelID": "glm-4.7",
        │     "providerID": "zai-coding-plan",
        │     "mode": "build",
        │     "agent": "build",
        │     "path": {
        │       "cwd": "<project-root>",
        │       "root": "<project-root>"
        │     },
        │     "cost": 0,
        │     "tokens": {
        │       "input": 32,
        │       "output": 506,
        │       "reasoning": 0,
        │       "cache": {
        │         "read": 163403,
        │         "write": 0
        │       }
        │     },
        │     "finish": "stop"
        │   }
        └── parts/
            └── msg_<msg-id>_1/
                └── part_example.json
                    {
                      "id": "<part-id>",
                      "sessionID": "<session-id>",
                      "messageID": "<msg-id>",
                      "type": "reasoning",
                      "text": "<tool_call>pal_thinkdeep<arg_key>step</arg_key><arg_value>Reviewing Phase 1.8: Workflow Inference implementation for architecture and code quality assessment.\n\nKey implementation files:\n- <project-root>/packages/portal/scripts/generate/workflow/inference.ts\n- <project-root>/packages/portal/scripts/generate/workflow/manifest.ts\n- <project-root>/packages/portal/scripts/generate/workflow/embeddings.ts\n- <project-root>/packages/portal/scripts/generate/workflow/orchestrator.ts\n- <project-root>/packages/portal/scripts/generate/workflow/types.ts\n- <project-root>/packages/portal/scripts/generate/workflow/workflow.test.ts\n\nOpenSpec context: Phase 1.8 is defined in openspec/changes/add-grimoire-portal/tasks.md\n\nCurrent implementation focus: LLM-driven workflow inference with manifest persistence and embeddings integration.\n\nKey concerns to assess:\n1. Service architecture and separation of concerns\n2. Mocking strategy and test quality\n3. Type safety and TypeScript practices\n4. Integration readiness with shared-embeddings package</arg_value><arg_key>step_number</arg_key><arg_value>1</arg_value><arg_key>total_steps</arg_key><arg_value>4</arg_value><arg_key>next_step_required</arg_key><arg_value>true</arg_value><arg_key>findings</arg_key><arg_value>I am reviewing the Phase 1.8: Workflow Inference implementation across multiple service files. The workflow inference service (inference.ts) appears to be the core orchestrator that calls LLM services, reads volume content, and produces workflow manifests. The manifest service (manifest.ts) handles JSON file persistence. The embeddings service (embeddings.ts) uses LanceEmbeddingStore from @grimoire/shared-embeddings to update workflow relationships. The orchestrator (orchestrator.ts) coordinates the full pipeline flow. Types are defined in workflow/types.ts and re-exported in services/types.ts.\n\nThe implementation seems conceptually sound - LLM inference → manifest write → embeddings update. However I nee...</arg_value><arg_key>step</arg_key><arg_value>Reviewing Phase 1.8 implementation</arg_value><arg_key>step_number</arg_key><arg_value>1</arg_value><arg_key>total_steps</arg_key><arg_value>4</arg_value><arg_key>next_step_required</arg_key><arg_value>true</arg_value>",
                      "time": {
                        "start": 1766509536091,
                        "end": 1766509558307
                      }
                    }

Solution Architecture

Location: packages/opencode/src/provider/transform.ts

Added GLM-specific normalization in ProviderTransform.normalizeMessages(), following the existing Claude and Mistral normalization patterns (an illustrative sketch of the approach appears after this list):

  1. Detect malformed tool calls in reasoning_content using regex pattern matching
  2. Extract and parse embedded tool call XML (<tool_call> and <invoke> tags)
  3. Remove tool call artifacts from reasoning text
  4. Add extracted calls as proper tool-call parts in the message
  5. Preserve clean reasoning text without tool call artifacts
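
A minimal sketch of the extraction idea, assuming the reasoning text arrives as a plain string. This is not the PR's actual implementation; the helper name and regexes below are illustrative only:

// Illustrative sketch: extract leaked tool-call XML from reasoning text.
// The real fix lives in ProviderTransform.normalizeMessages(); names here are made up.
interface ExtractedToolCall {
  name: string
  args: Record<string, string>
}

function extractToolCallsFromReasoning(reasoning: string): {
  cleaned: string
  calls: ExtractedToolCall[]
} {
  const calls: ExtractedToolCall[] = []

  // Matches <invoke name="bash"><command>bun test</command>...</invoke>
  const invokeRe = /<invoke name="([^"]+)">([\s\S]*?)<\/invoke>/g
  // Matches <tool_call>name<arg_key>k</arg_key><arg_value>v</arg_value>...</tool_call>
  const toolCallRe = /<tool_call>([A-Za-z0-9_]+)([\s\S]*?)<\/tool_call>/g

  let cleaned = reasoning.replace(invokeRe, (_, name: string, body: string) => {
    const args: Record<string, string> = {}
    // Direct child tags become named arguments.
    for (const m of body.matchAll(/<([A-Za-z0-9_]+)>([\s\S]*?)<\/\1>/g)) {
      args[m[1]] = m[2]
    }
    calls.push({ name, args })
    return ""
  })

  cleaned = cleaned.replace(toolCallRe, (_, name: string, body: string) => {
    const args: Record<string, string> = {}
    // <arg_key>/<arg_value> pairs arrive interleaved; zip them positionally.
    const keys = [...body.matchAll(/<arg_key>([\s\S]*?)<\/arg_key>/g)]
    const values = [...body.matchAll(/<arg_value>([\s\S]*?)<\/arg_value>/g)]
    keys.forEach((key, i) => {
      if (values[i]) args[key[1]] = values[i][1]
    })
    calls.push({ name, args })
    return ""
  })

  return { cleaned: cleaned.trim(), calls }
}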

Testing

Comprehensive test suite in packages/opencode/test/provider/test_glm47_thinking_fix.test.ts covering the following cases (a representative sketch follows the list):

  • Single tool call XML in reasoning - Extracts a properly formatted bash command from malformed reasoning
  • Multiple tool calls in reasoning - Extracts all tool invocations
  • MCP tools in reasoning - Handles pal_thinkdeep and other MCP server tools correctly
  • Properly formatted responses - Preserves existing structure without modification
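
A representative test in the bun:test style used by the suite, exercising the illustrative helper sketched above rather than the actual test file contents:

import { describe, expect, test } from "bun:test"

// Assumes the illustrative extractToolCallsFromReasoning helper sketched above
// is in scope (e.g. exported next to ProviderTransform).

describe("GLM-4.7 leaked tool calls in reasoning", () => {
  test("extracts a bash invoke from reasoning text", () => {
    const reasoning = [
      "I should run the tests.",
      '<invoke name="bash">',
      "  <command>bun test</command>",
      "  <description>Run tests</description>",
      "</invoke>",
    ].join("\n")

    const { cleaned, calls } = extractToolCallsFromReasoning(reasoning)

    expect(calls).toHaveLength(1)
    expect(calls[0].name).toBe("bash")
    expect(calls[0].args.command).toBe("bun test")
    expect(cleaned).not.toContain("<invoke")
  })

  test("leaves well-formed reasoning untouched", () => {
    const reasoning = "Plain thinking text with no tool call markup."
    const { cleaned, calls } = extractToolCallsFromReasoning(reasoning)
    expect(calls).toHaveLength(0)
    expect(cleaned).toBe(reasoning)
  })
})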

Test Coverage

  • Real-world malformed <invoke name="bash"> patterns
  • Complex <tool_call> with multiple arguments
  • Mixed reasoning content with and without tool calls
  • Regression tests for properly formatted responses

Risk Assessment

Low Risk:

  • Defensive sanitization - only modifies content containing malformed syntax
  • Non-breaking - preserves existing behavior for properly formatted outputs
  • Isolated - only affects GLM-4.7/GLM-4.6 via z.ai provider
  • Easy rollback - the change is conditional on provider/model detection

References

  • Provider: Z.AI (Anthropic-compatible endpoint)
  • Affected Models: GLM-4.7, GLM-4.6
  • Pattern: Follows existing Claude/Mistral normalization patterns
  • Related: OpenAI-compatible SDK response handling

…ing new tests and documentation, and update prompt input placeholders.
Add GLM-specific normalization in ProviderTransform to extract tool call XML
from reasoning_content and convert to proper tool-call parts. Supports both
<arg_key>/<arg_value> pairs and direct child tags.

Includes test cases covering single/multiple tool calls in reasoning and
properly formatted responses that should not be affected.
- Add typeof checks to narrow union type before filtering
- Remove any type casts for better type safety
- Ensures content is an array before calling array methods
- Sanitize sensitive information in investigation files (paths, usernames, session IDs)
- Remove PROPOSED_FIX.md (content migrated to PR description)
- Remove session logs and example files
- Investigation details now in PR description for better context
- Replace 'as any[]' and 'as any' with proper ModelMessage type from ai SDK
- Extract createModel() factory function to reduce boilerplate duplication
- Use Provider.Model type for proper type safety throughout tests
- Keep type narrowing for runtime safety checks

All 4 tests pass with full TypeScript type coverage.
- packages/opencode/src/provider/transform.ts
- packages/opencode/test/provider/test_glm47_thinking_fix.test.ts

Removed unnecessary explanatory comments, replaced let with chained const assignments, avoided any types by using casts, aligned style with project guidelines, and restored necessary test comments that prove the fix's behavior.
- Restored investigation files from a35c5bb with sanitized placeholders
- Removed verbatim versions with real paths and IDs
- Kept only example files with <placeholder> values for privacy

These files prove the GLM-4.7 malformed tool call issue and fix.
- Obliterated PROPOSED_FIX.md as requested
- Content migrated to PR description
@rekram1-node (Collaborator) commented:

I don't think this is the correct fix.

@ramarivera ramarivera closed this Jan 5, 2026
@ramarivera ramarivera deleted the fix/ramarivera_glm4.7_interleaved_thinking_fix branch January 5, 2026 01:38
@ramarivera (Author) commented:

Thanks for the input.
