UTF-8 Decode Error in Claude SDK when Processing CLI stdout #549

@SerenNoble

Description

Summary

The Claude Agent SDK for Python throws a UnicodeDecodeError when processing stdout from the Claude CLI, due to concurrent writes in the Node.js process causing byte stream reordering.

Error Message

'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte

Root Cause

The Claude CLI (Node.js) has multiple concurrent output sources writing to stdout:

  1. Main output (file processing, progress) - outputs Chinese filenames and content
  2. HTTP request logger - outputs logs like [log_b914a8] sending request

These sources interleave their output, causing UTF-8 multi-byte characters to be split across different write operations.

Evidence from Debug Logs

Analyzing the byte stream received from CLI stdout:

Block 132 end (6034 bytes):

...
1780: 50 57 53 74 61 74 69 73 74 69 63 73 20 28 e6 96
1790: 87 e4
  • e6 96 87 = "文" (complete)
  • e4 = "件" (first byte, incomplete)
  • Missing bytes: bb b6

Block 133 (5125 bytes) - HTTP log interleaved:

0000: 5b 6c 6f 67 5f 62 39 31 34 61 38 5d 20 73 65 6e
       [log_b914a8] sen
  • This is an HTTP request log that interrupted the main output

Block 134 (8192 bytes) - delayed bytes appear:

0000: bb b6 3a 20 43 6f 6d 77 61 72 65 20 4c 32 56 50
      »¶: Comware L2VP
  • bb b6 = remaining bytes of "件" (delayed!)
  • These bytes finally appear after the HTTP log
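The hex dumps above can be sanity-checked against the UTF-8 encoding of the text in question. In Python (the SDK's language), "文件" encodes to exactly the six bytes that the debug logs show split across Blocks 132 and 134:

```python
# "文件" ("file") encodes to six bytes; the debug logs show the first
# four (e6 96 87 e4) at the end of Block 132 and the last two (bb b6)
# at the start of Block 134.
encoded = "文件".encode("utf-8")
assert encoded == b"\xe6\x96\x87\xe4\xbb\xb6"

# Per character:
assert "文".encode("utf-8") == b"\xe6\x96\x87"  # complete in Block 132
assert "件".encode("utf-8") == b"\xe4\xbb\xb6"  # split across Blocks 132/134
```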

Timeline of What Happened

T1: Main output writes "文件" → bytes: E6 96 87 E4
    (gets suspended, waiting for async operation)

T2: HTTP request starts → logger outputs: [log_b914a8] sending request
    (this is Block 133, 5125 bytes)

T3: Main output resumes → writes remaining bytes: BB B6 3A ...
    (this is Block 134, 8192 bytes)

Why the SDK Fails

Code Analysis

Layer 1: OS Kernel

  • stdout pipe buffer (8KB)
  • Reads by byte count, no character boundary awareness

Layer 2: TextReceiveStream (anyio/streams/text.py)

class TextReceiveStream:
    def __post_init__(self, encoding: str, errors: str) -> None:
        decoder_class = codecs.getincrementaldecoder(encoding)
        self._decoder = decoder_class(errors=errors)  # ← errors='strict' by default

    async def receive(self) -> str:
        chunk = await self.transport_stream.receive()
        decoded = self._decoder.decode(chunk)  # ← Fails here on 0xBB
        return decoded

When the decoder encounters byte 0xBB at the start of a chunk, it sees a UTF-8 continuation byte with no lead byte: the 0xE4 it belongs to arrived in an earlier chunk and is no longer in the decoder's state. With errors='strict', the decoder raises UnicodeDecodeError.
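This failure is easy to reproduce with Python's incremental decoder. The sketch below assumes the decoder has no pending state when Block 134's delayed continuation bytes arrive (which is what the "position 0" in the reported error message suggests):

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

# A chunk that begins with the delayed continuation bytes bb b6,
# as in Block 134 of the debug logs:
try:
    decoder.decode(b"\xbb\xb6: Comware L2VP")
except UnicodeDecodeError as exc:
    print(exc)
    # 'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte
```

This prints the exact error message from the report.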

Layer 3: _read_messages_impl (subprocess_cli.py:703)

async def _read_messages_impl(self) -> AsyncIterator[dict[str, Any]]:
    json_buffer = ""

    try:
        async for line in self._stdout_stream:  # ← Fails here!
            # JSON buffering logic that never gets a chance to run
            json_buffer += line
            data = json.loads(json_buffer)
            yield data
    except UnicodeDecodeError:
        # Error propagates from Layer 2
        raise

Why JSON Buffering Doesn't Help

The json_buffer mechanism (lines 708-748) is designed to handle incomplete JSON messages at the JSON layer, but the UTF-8 decode error happens earlier, at the byte-stream-to-text conversion layer, before any JSON processing runs.

Error propagation:
IncrementalDecoder.decode() → UnicodeDecodeError ❌
    ↓
TextReceiveStream.receive() → Exception propagates
    ↓
async for line in self._stdout_stream → Iterator fails
    ↓
json_buffer += json_line → Never executes ✗

Impact

  • Severity: High - breaks entire sessions
  • Frequency: Intermittent - depends on async operation timing
  • Scope: Affects any usage where CLI outputs non-ASCII text (Chinese, etc.)

Possible Solutions

Option 1: Use Tolerant Decoding (Recommended)

Change TextReceiveStream initialization to use errors='replace' or errors='ignore':

File: claude_agent_sdk/_internal/transport/subprocess_cli.py
Line: 555

# Current:
self._stdout_stream = TextReceiveStream(self._process.stdout)

# Proposed:
self._stdout_stream = TextReceiveStream(
    self._process.stdout,
    encoding='utf-8',
    errors='replace'  # or 'surrogateescape'
)

Pros:

  • Simple fix
  • Prevents crashes from byte reordering
  • surrogateescape preserves invalid bytes for potential recovery

Cons:

  • May mask actual encoding issues
  • Replaced characters (�) in output
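The trade-off between the two error handlers can be seen in a standalone example (not SDK code), using Block 134's leading bytes:

```python
import codecs

delayed = b"\xbb\xb6: Comware L2VP"  # Block 134's leading bytes

# errors='replace' substitutes U+FFFD for each undecodable byte,
# so the session survives but the split character is lost:
replace_dec = codecs.getincrementaldecoder("utf-8")(errors="replace")
print(replace_dec.decode(delayed))  # ��: Comware L2VP

# errors='surrogateescape' maps each invalid byte to a lone surrogate
# (U+DC80..U+DCFF), so the original bytes can be recovered later:
se_dec = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
text = se_dec.decode(delayed)
assert text.encode("utf-8", errors="surrogateescape") == delayed
```

The round trip in the last line is what makes surrogateescape attractive for "potential recovery": nothing is thrown away, only deferred.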

Option 2: Fix CLI Concurrent Writes

Ensure the CLI process writes to stdout atomically, e.g. by serializing all output through a single writer so that each write emits only complete UTF-8 sequences.

Pros:

  • Fixes root cause
  • No data corruption

Cons:

  • Requires changes to CLI (Node.js)
  • May impact performance

Option 3: Add Byte Stream Recovery

Implement a recovery mechanism in the SDK to detect and handle out-of-order UTF-8 byte sequences.

Pros:

  • Handles edge cases
  • No CLI changes needed

Cons:

  • Complex implementation
  • May not catch all cases
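One possible shape for such a mechanism, as a hypothetical sketch (RecoveringDecoder is not part of the SDK): decode strictly, and on failure reset the decoder and salvage the offending chunk with replacement characters so the stream keeps flowing instead of killing the session:

```python
import codecs


class RecoveringDecoder:
    """Strict incremental UTF-8 decoding with per-chunk fallback."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def decode(self, chunk: bytes) -> str:
        try:
            return self._decoder.decode(chunk)
        except UnicodeDecodeError:
            # Drop any pending partial character and salvage this chunk,
            # trading a few U+FFFD characters for a live session.
            self._decoder.reset()
            return chunk.decode("utf-8", errors="replace")
```

This keeps well-formed chunks lossless and degrades only the chunks affected by reordering; it cannot reassemble the split character itself, which only Option 2 can guarantee.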

Environment

  • SDK Version: claude-agent-sdk 0.1.18
  • CLI Version: claude-code 2.0.72
  • Python: 3.13
  • OS: Linux (Ubuntu)
  • Node.js: v24.3.0

Reproduction

The issue is intermittent but occurs when:

  1. CLI outputs Chinese characters (or any multi-byte UTF-8)
  2. An async operation (HTTP request, file read, etc.) triggers logging
  3. The logger output interleaves with the main output
  4. UTF-8 bytes get reordered

Expected Behavior

The SDK should handle byte stream reordering gracefully without crashing, especially when the root cause is concurrent writes in the CLI process itself.

Actual Behavior

The SDK crashes with UnicodeDecodeError, terminating the entire session.

Additional Context

This is not a traditional 8KB buffer truncation issue. The byte stream reordering happens inside the CLI process due to Node.js's async/concurrent nature, before data even reaches the SDK's receive buffer.

The issue is particularly problematic because:

  • It's intermittent (depends on timing)
  • It affects legitimate use cases (Chinese file paths, content)
  • The error message doesn't indicate the real cause (CLI concurrent writes)

References

  • Related to anyio's TextReceiveStream using errors='strict' by default
  • Similar to issues with multi-byte character handling in stream processing
