fix(claude-code): respawn persistent CLI process on broken pipe by rabi · Pull Request #7070 · block/goose

rabi · 2026-02-07T07:14:51Z

Summary

Replace OnceCell with Mutex<Option> so a dead CLI process can be detected and respawned. The persistent stream-json process (introduced in #7029) dies when the user uses Ctrl+C, leaving the OnceCell permanently poisoned with a stale process whose stdin pipe is closed.
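A minimal sketch of the pattern described above, assuming a Unix-like host and using only the standard library. `PersistentCli`, `with_live_child`, and `spawn_count` are hypothetical names for illustration, not the identifiers used in this PR. A `OnceCell` cannot be reset once it holds a value, whereas a `Mutex<Option<Child>>` lets the holder check `try_wait()` and swap in a fresh process when the old one has exited.

```rust
use std::io;
use std::process::{Child, Command, Stdio};
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

/// Hypothetical holder for one long-lived CLI child process.
pub struct PersistentCli {
    program: String,
    args: Vec<String>,
    // `None` until first use; replaced whenever the child is found dead.
    slot: Mutex<Option<Child>>,
    // Counts spawns so callers can observe that a respawn happened.
    spawns: AtomicU32,
}

impl PersistentCli {
    pub fn new(program: &str, args: &[&str]) -> Self {
        Self {
            program: program.to_string(),
            args: args.iter().map(|a| a.to_string()).collect(),
            slot: Mutex::new(None),
            spawns: AtomicU32::new(0),
        }
    }

    fn spawn(&self) -> io::Result<Child> {
        let child = Command::new(&self.program)
            .args(&self.args)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        self.spawns.fetch_add(1, Ordering::SeqCst);
        Ok(child)
    }

    pub fn spawn_count(&self) -> u32 {
        self.spawns.load(Ordering::SeqCst)
    }

    /// Runs `f` against a live child, spawning or respawning first if needed.
    pub fn with_live_child<R>(
        &self,
        f: impl FnOnce(&mut Child) -> io::Result<R>,
    ) -> io::Result<R> {
        let mut slot = self.slot.lock().unwrap_or_else(|p| p.into_inner());
        // `try_wait` returns Ok(None) only while the child is still running.
        let alive = match slot.as_mut() {
            Some(child) => matches!(child.try_wait(), Ok(None)),
            None => false,
        };
        if !alive {
            *slot = Some(self.spawn()?);
        }
        let child = slot.as_mut().expect("slot was filled above");
        f(child)
    }
}
```

Callers would do their stdin writes and stdout reads inside the closure, so the liveness check and the I/O happen under the same lock.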

Type of Change

AI Assistance

This PR was created or reviewed with AI assistance

Testing

Tested locally and Ctrl+C works as expected.

Replace OnceCell with Mutex<Option<CliProcess>> so a dead CLI process can be detected and respawned. The persistent stream-json process (introduced in block#7029) could die unexpectedly (e.g. after Ctrl+C), leaving the OnceCell permanently poisoned with a stale process whose stdin pipe is closed.

Signed-off-by: rabi <ramishra@redhat.com>

codefromthecrypt · 2026-02-07T09:33:36Z

curious is this the right solution? "dies when the user uses Ctrl+C," wouldn't we want to not propagate the ctrl-c instead? respawning would lose state.

codefromthecrypt · 2026-02-07T09:36:37Z

basically I'm not sure of the flow here, but in other tools I've worked on, the intent is either that ctrl+c means exit the process (which should propagate) or that it means stop the current command (which should not).

maybe look at how other claude-code runners work. Even if we want to restart the subprocess instead of handling the signal, it is best to cite why.

codefromthecrypt · 2026-02-07T10:19:44Z

spent some time for a more thorough answer.

The child dies on Ctrl-C because it inherits the parent's process group and receives SIGINT directly. Respawning works around this but has side effects:

- Silent context loss: the respawned CLI has no memory of the previous conversation. Message replay doesn't restore Claude Code's internal state (tool results, file edits, etc.).
- Fragile error detection: is_recoverable matches "Broken pipe" as a substring, which is OS/locale-dependent. Other death modes (SIGKILL, OOM) won't match.
- Single retry, no backoff: if the respawn also fails immediately, the error propagates with no further recovery.

The root cause fix is to spawn the child in its own process group so it doesn't receive SIGINT. For reference, forge-core does this with group_spawn(). That one-line change would eliminate the need for respawn logic entirely.
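A minimal sketch of the process-group idea, assuming a Unix target and using the standard library's `CommandExt::process_group` rather than the `group_spawn()` helper from the command_group crate. The function name `spawn_in_own_group` is hypothetical. A terminal delivers Ctrl+C as SIGINT to the foreground process group, so a child placed in its own group does not receive it directly.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, Stdio};

/// Spawns `program` in a new process group (hypothetical helper).
pub fn spawn_in_own_group(program: &str, args: &[&str]) -> io::Result<Child> {
    Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        // 0 means "use the child's own PID as its process group ID",
        // which detaches it from the parent's foreground group.
        .process_group(0)
        .spawn()
}
```

Because the child no longer shares the parent's group, the parent has to terminate it explicitly (for example with `Child::kill` followed by `Child::wait`) when it shuts down; otherwise the child can outlive the parent.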

See also: how other projects manage the Claude Code subprocess

| Project | Signal handling | Persistent NDJSON | Auto-restart |
| --- | --- | --- | --- |
| claude-agent-sdk-go (Go) | None: no OS signal trapping. Close does stdin close + kill after 5s. Interrupt is a protocol message, not a signal. | Yes: one process at Connect(), `--output-format stream-json --input-format stream-json`, bufio.Scanner line reader. | None: `IsAlive()` is stubbed (`return true`, TODO comment). EOF closes channel; consumer must re-`Connect()`. |
| forge-core (Rust) | kill_on_drop + group_spawn: process group killed when handle drops. No explicit signal trapping. | Yes: `--output-format=stream-json --input-format=stream-json` via ProtocolPeer on stdin/stdout. | None: new process per execution. Supports `--fork-session`/`--resume` but no auto-respawn. |
| mcpc (TypeScript) | kill() on `forceCleanup()`. No SIGINT/SIGTERM trapping on the parent process. | Yes: `spawn()` with `ndJsonStream(input, output)` over persistent stdin/stdout pipes. | None: if the process dies, session fails. No retry or respawn logic. |

rabi · 2026-02-07T11:30:08Z

I looked into the group_spawn() approach in forge-core. The command_group crate calls pre_exec + setpgid in an unsafe block under the hood, and it would also introduce orphan-process risks if the parent exits without cleaning up the separate process group. From what I can tell, forge-core spawns a new process per execution, so they never face this persistent-process problem in the first place.

Silent context loss: Context loss would happen regardless if the process dies for any reason. Respawn at least lets the user continue rather than hitting a permanent error. The message replay sends the full conversation history.

Fragile error detection: Good point, but the try_wait() proactive check handles the common case, and the "terminated unexpectedly" string (which we control) covers the rest. But I can see if this can be improved, probably by preserving ErrorKind through the error chain.
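A minimal sketch of what preserving `ErrorKind` through the error chain could look like, assuming the wrapping error exposes the underlying `std::io::Error` via `source()`. `ProviderError` and `is_broken_pipe` are hypothetical stand-ins, not goose's actual types. The check walks the chain and compares `io::ErrorKind::BrokenPipe` instead of searching the message for "Broken pipe".

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical wrapper that keeps the original I/O error as its source.
#[derive(Debug)]
pub struct ProviderError {
    context: String,
    source: io::Error,
}

impl ProviderError {
    pub fn new(context: &str, source: io::Error) -> Self {
        Self { context: context.to_string(), source }
    }
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.context, self.source)
    }
}

impl Error for ProviderError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Returns true if any error in the chain is an io::Error of kind BrokenPipe.
pub fn is_broken_pipe(err: &(dyn Error + 'static)) -> bool {
    let mut current = Some(err);
    while let Some(e) = current {
        if let Some(io_err) = e.downcast_ref::<io::Error>() {
            if io_err.kind() == io::ErrorKind::BrokenPipe {
                return true;
            }
        }
        current = e.source();
    }
    false
}
```

This only covers a failed write to a closed pipe. Other ways for the child to die (SIGKILL, OOM) would still need the `try_wait()` check.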

Single retry, no backoff: I'm not sure backoff makes sense here. If the respawn fails, it's because the binary is gone or permissions changed; unlike with a transient network error, retrying won't help.

codefromthecrypt · 2026-02-08T00:57:02Z

@rabi mainly I'm concerned about respawn in general, especially if the rationale for it is "dies when the user uses Ctrl+C". Having dealt with things like double ctrl+c problems in the past, there are enough concerns and hard-to-test areas as it is.

I'll leave this to another reviewer if you insist on incidental respawn. I think if a process dies, that is a problem, and this is a case of the cure being worse than the disease. If we are killing a process when we shouldn't, this just doesn't seem like the right fix.

Good luck either way!

rabi · 2026-02-08T02:18:16Z

@codefromthecrypt Before, we had a new child process for every conversation turn: one had to use double Ctrl+C to exit, but a single Ctrl+C could stop a response (more pronounced when using streaming). After the change to a persistent process, the user gets the error below after Ctrl+C, and there is no way to continue without starting a new session.

   Context: ○○○○○○○○○○ 0% (100/200000 tokens)
    ( O)> write a poem
    Ran into this error: Request failed: Failed to write to stdin: Broken pipe (os error 32).
    
    Please retry if you think this is a transient or recoverable error.
    
    ⏱️  Elapsed time: 0.13s
    Context: ○○○○○○○○○○ 0% (257/200000 tokens)
    ( O)> Write a poem again
    Ran into this error: Request failed: Failed to write to stdin: Broken pipe (os error 32).
    
    Please retry if you think this is a transient or recoverable error.

I understand the concern about respawn hiding real problems; that's probably a valid worry. But Ctrl+C is just the most reproducible trigger; the persistent process can also die from OOM, crashes, etc. The respawn is meant as general resilience for a long-lived process, not specifically a Ctrl+C workaround. I'm open to alternatives if there's a cleaner approach, but to me this seemed like the least invasive option since it's invisible to the user.

alexhancock

It makes sense to me as an issue, and the code looks reasonable to me.

May want @codefromthecrypt eyes on it as well

codefromthecrypt

the ctrl-C to cancel a stream should not stop any subprocess, not MCP, CLI or ACP processes.

The issue I have with this is that we are papering over a design mistake in the cancel flow. I've attempted to explain this, having personally dealt with the double ctrl-c problem in other projects.

I don't think the "extra resilience" argument holds in this context, and when we move to ACP and delete this provider, we won't control that process directly anyway.

I will raise an alternative.

codefromthecrypt · 2026-02-09T00:38:18Z

First PR to expect will do this:

- add a practical test with the double ctrl-C so we know the impact on stream cancel and our fix
- the fix to the cancel propagation problem

That way, we can keep any "enhanced resilience" concepts, like potentially auto-restarting in a loop, separate. If we wanted to do anything like that, it would need to apply to any subprocess we launch (MCP, ACP, and CLI while they exist), or we'd need to explain why we only do this in claude.

codefromthecrypt · 2026-02-09T01:01:02Z

also, noticed that with claude only complete_with_model is defined, not streaming, so this is an important part of the problem statement.

https://github.com/block/goose/blob/main/crates/goose/src/providers/claude_code.rs#L507

alexhancock · 2026-02-09T03:15:47Z

Ah I missed that you'd already reviewed @codefromthecrypt; I just looked at the diff and weighed in, missed the previous conversation.

codefromthecrypt · 2026-02-09T03:37:22Z

this peels off the first part. we were inconsistent between MCP and CLI providers, so it is an easy fix.

Note I wasn't able to reproduce the ctrl-c killing things, possibly due to platform or linux in use. we can retry after with exact platform info to see what's going on. #7083

rabi · 2026-02-09T05:47:03Z

> Note I wasn't able to reproduce the ctrl-c killing things

Did you try to do ctrl+c during the wait (before the response)?

Context: ○○○○○○○○○○ 0% (0/200000 tokens)
( O)> can you write a poem of 20 lines with start and end markers?
◒  Decoding human intent...
Interrupted before the model replied and removed the last message.

⏱️  Elapsed time: 6.03s
Context: ○○○○○○○○○○ 0% (0/200000 tokens)
( O)> can you write a poem of 20 lines with start and end markers?
Ran into this error: Request failed: Failed to write to stdin: Broken pipe (os error 32).

Please retry if you think this is a transient or recoverable error.

> also, noticed with claude only complete_with_model is defined, not streaming, so this is important part of the problem statement.

Yes it is. I think I mentioned earlier in the conversation that the issue was noticed in the context of the streaming implementation I was doing in #6833, but it also applies when we're not using streaming, as demonstrated above.

codefromthecrypt · 2026-02-09T06:13:03Z

thanks for the clarification @rabi please check based on latest main (build etc) without streaming change. if it kills the subprocess, please report back the Claude Code etc version and platform. if it doesn't kill the subprocess, let's close this PR out so we can focus!

cheers

rabi · 2026-02-09T06:36:48Z

> thanks for the clarification @rabi please check based on latest main (build etc) without streaming change. if it kills the subprocess, please report back the Claude Code etc version and platform. if it doesn't kill the subprocess, let's close this PR out so we can focus!
>
> cheers

Yeah, thanks! #7083 would fix that issue. But I'll keep my concern about the child dying without the parent knowing about it (one has to start a new session) for another day 😄 I've updated #6833 to implement streaming and also handle the draining of leftover NDJSON events from the cancelled response. Have a look when possible.

rabi mentioned this pull request Feb 7, 2026

feat(claude-code): use stream-json protocol for persistent sessions #7029

Merged

3 tasks

alexhancock approved these changes Feb 8, 2026

View reviewed changes

codefromthecrypt requested changes Feb 9, 2026

View reviewed changes

codefromthecrypt self-assigned this Feb 9, 2026

codefromthecrypt mentioned this pull request Feb 9, 2026

fix: use command.process_group(0) for CLI providers, not just MCP #7083

Merged

2 tasks

rabi closed this Feb 9, 2026
