UTF-8 Decode Error in Claude SDK when Processing CLI stdout
Summary
The Claude Agent SDK for Python throws a UnicodeDecodeError when processing stdout from the Claude CLI, due to concurrent writes in the Node.js process causing byte stream reordering.
Error Message
```
'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte
```
Root Cause
The Claude CLI (Node.js) has multiple concurrent output sources writing to stdout:
- Main output (file processing, progress) - emits Chinese filenames and content
- HTTP request logger - emits lines like `[log_b914a8] sending request`
These sources interleave their output, causing UTF-8 multi-byte characters to be split across different write operations.
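For reference, the two characters involved encode to exactly the byte values seen in the debug logs below:
```python
>>> "文件".encode("utf-8").hex(" ")
'e6 96 87 e4 bb b6'
```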
Evidence from Debug Logs
Analyzing the byte stream received from CLI stdout:
Block 132 end (6034 bytes):
```
...
1780: 50 57 53 74 61 74 69 73 74 69 63 73 20 28 e6 96
1790: 87 e4
```
- `e6 96 87` = "文" (complete)
- `e4` = first byte of "件" (incomplete)
- Missing bytes: `bb b6`

Block 133 (5125 bytes) - HTTP log interleaved:
```
0000: 5b 6c 6f 67 5f 62 39 31 34 61 38 5d 20 73 65 6e
      [log_b914a8] sen
```
- This is an HTTP request log that interrupted the main output

Block 134 (8192 bytes) - delayed bytes appear:
```
0000: bb b6 3a 20 43 6f 6d 77 61 72 65 20 4c 32 56 50
      »¶: Comware L2VP
```
- `bb b6` = the remaining bytes of "件" (delayed!)
- These bytes finally appear only after the HTTP log
Timeline of What Happened
```
T1: Main output starts writing "文件", but only the first four bytes
    (e6 96 87 e4) are flushed before the writer is suspended,
    waiting on an async operation
T2: HTTP request starts → logger emits: [log_b914a8] sending request
    (this is Block 133, 5125 bytes)
T3: Main output resumes → writes the remaining bytes: bb b6 3a ...
    (this is Block 134, 8192 bytes)
```
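The failure can be reproduced in isolation with Python's incremental UTF-8 decoder. A minimal sketch, with chunk boundaries modeled on the debug log above (the text around the split is taken from Block 132; everything else is illustrative):
```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")(errors="strict")

block_132_tail = b"PWStatistics (\xe6\x96\x87\xe4"  # "文" + lead byte of "件"
block_133 = b"[log_b914a8] sending request"         # interleaved HTTP log
block_134 = b"\xbb\xb6: Comware L2VP"               # delayed continuation bytes

print(dec.decode(block_132_tail))  # 'PWStatistics (文' -- 0xe4 held as pending state
dec.decode(block_133)  # raises UnicodeDecodeError: the pending 0xe4 cannot be
                       # followed by ASCII '[', so strict decoding gives up
```
Exactly which byte the strict decoder blames depends on where its pending state is lost relative to the chunk boundaries; in the session above it surfaced as the "invalid start byte" error on 0xbb.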
Why the SDK Fails
Code Analysis
Layer 1: OS Kernel
- stdout pipe buffer (8KB)
- Reads are by byte count, with no awareness of character boundaries
Layer 2: TextReceiveStream (anyio/streams/text.py)
```python
class TextReceiveStream:
    def __post_init__(self, encoding: str, errors: str) -> None:
        decoder_class = codecs.getincrementaldecoder(encoding)
        self._decoder = decoder_class(errors=errors)  # ← errors='strict' by default

    async def receive(self) -> str:
        chunk = await self.transport_stream.receive()
        decoded = self._decoder.decode(chunk)  # ← fails here on 0xbb
        return decoded
```
When the decoder encounters byte 0xbb (which is not a valid UTF-8 start byte, and no longer follows the expected 0xe4 from the previous chunk), it raises UnicodeDecodeError because errors='strict'.
Layer 3: _read_messages_impl (subprocess_cli.py:703)
```python
async def _read_messages_impl(self) -> AsyncIterator[dict[str, Any]]:
    json_buffer = ""
    try:
        async for line in self._stdout_stream:  # ← fails here!
            # JSON buffering logic that never gets a chance to run
            json_buffer += line
            data = json.loads(json_buffer)
            yield data
    except UnicodeDecodeError:
        ...  # error propagates from Layer 2
```
Why JSON Buffering Doesn't Help
The json_buffer mechanism (lines 708-748) is designed to handle incomplete JSON messages at the JSON layer, but the UTF-8 decode error happens at the byte-stream-to-text conversion layer, before any JSON processing runs.
Error propagation:
```
IncrementalDecoder.decode()              → UnicodeDecodeError ❌
              ↓
TextReceiveStream.receive()              → exception propagates
              ↓
async for line in self._stdout_stream    → iterator fails
              ↓
json_buffer += line                      → never executes ✗
```
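The same dead end can be demonstrated end-to-end with an anyio memory stream standing in for the CLI's stdout; a self-contained sketch (the chunk contents are illustrative):
```python
import anyio
from anyio.streams.text import TextReceiveStream

async def main() -> None:
    send, receive = anyio.create_memory_object_stream(10)
    # Simulate the CLI: a JSON line split inside "件", then an interleaved log.
    await send.send(b'{"path": "\xe6\x96\x87\xe4')
    await send.send(b"[log_b914a8] sending request")
    await send.aclose()

    json_buffer = ""
    try:
        async for chunk in TextReceiveStream(receive):  # errors='strict' by default
            json_buffer += chunk  # runs once for '{"path": "文', then the loop dies
    except UnicodeDecodeError as exc:
        print(f"iterator died before any buffering could run: {exc}")

anyio.run(main)
```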
Impact
- Severity: High - breaks entire sessions
- Frequency: Intermittent - depends on async operation timing
- Scope: Affects any usage where CLI outputs non-ASCII text (Chinese, etc.)
Possible Solutions
Option 1: Use Tolerant Decoding (Recommended)
Change TextReceiveStream initialization to use errors='replace' or errors='ignore':
File: claude_agent_sdk/_internal/transport/subprocess_cli.py
Line: 555
```python
# Current:
self._stdout_stream = TextReceiveStream(self._process.stdout)

# Proposed:
self._stdout_stream = TextReceiveStream(
    self._process.stdout,
    encoding="utf-8",
    errors="replace",  # or 'surrogateescape'
)
```
Pros:
- Simple fix
- Prevents crashes from byte reordering
- `surrogateescape` preserves the invalid bytes for potential recovery
Cons:
- May mask genuine encoding issues
- Replacement characters (�) appear in the output
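For context on the surrogateescape variant: it maps each invalid byte to a lone surrogate, so the original bytes remain recoverable. A quick sketch using the delayed bytes from Block 134:
```python
raw = b"\xbb\xb6: Comware L2VP"
text = raw.decode("utf-8", errors="surrogateescape")  # '\udcbb\udcb6: Comware L2VP'
assert text.encode("utf-8", errors="surrogateescape") == raw  # lossless round-trip
```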
Option 2: Fix CLI Concurrent Writes
Ensure the CLI process writes each message to stdout atomically, e.g. by serializing all output through a single writer.
Pros:
- Fixes root cause
- No data corruption
Cons:
- Requires changes to CLI (Node.js)
- May impact performance
Option 3: Add Byte Stream Recovery
Implement a recovery mechanism in the SDK that detects and handles out-of-order UTF-8 byte sequences (a rough sketch follows the list below).
Pros:
- Handles edge cases
- No CLI changes needed
Cons:
- Complex implementation
- May not catch all cases
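A minimal sketch of what such a recovery layer could look like, assuming the policy is "log and skip invalid byte runs rather than kill the session" (RecoveringDecoder and feed are hypothetical names, not part of the SDK):
```python
import codecs

class RecoveringDecoder:
    """Hypothetical recovery layer: decode incrementally with errors='strict',
    but skip past invalid byte runs instead of propagating the exception."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def feed(self, chunk: bytes) -> str:
        try:
            return self._decoder.decode(chunk)
        except UnicodeDecodeError as exc:
            # exc.object holds the decoder's pending bytes plus this chunk
            bad = exc.object[exc.start:exc.end]
            print(f"dropping invalid bytes: {bad.hex(' ')}")  # real code would log
            self._decoder.reset()
            return self.feed(exc.object[exc.end:])  # resume after the bad run
```
For example, dec.feed(b"\xe6\x96\x87\xe4") returns "文" and holds 0xe4 as pending state; a following dec.feed(b"[log] hi") drops the orphaned 0xe4 and returns "[log] hi" instead of raising.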
Environment
- SDK Version: claude-agent-sdk 0.1.18
- CLI Version: claude-code 2.0.72
- Python: 3.13
- OS: Linux (Ubuntu)
- Node.js: v24.3.0
Reproduction
The issue is intermittent but occurs when:
- CLI outputs Chinese characters (or any multi-byte UTF-8)
- An async operation (HTTP request, file read, etc.) triggers logging
- The logger output interleaves with the main output
- UTF-8 bytes get reordered
Expected Behavior
The SDK should handle byte stream reordering gracefully without crashing, especially when the root cause is concurrent writes in the CLI process itself.
Actual Behavior
The SDK crashes with UnicodeDecodeError, terminating the entire session.
Additional Context
This is not a traditional 8KB buffer truncation issue. The byte stream reordering happens inside the CLI process due to Node.js's async/concurrent nature, before data even reaches the SDK's receive buffer.
The issue is particularly problematic because:
- It's intermittent (depends on timing)
- It affects legitimate use cases (Chinese file paths, content)
- The error message doesn't indicate the real cause (CLI concurrent writes)
References
- Related to anyio's `TextReceiveStream` using `errors='strict'` by default
- Similar to known issues with multi-byte character handling in stream processing