
fix(claude-code): respawn persistent CLI process on broken pipe #7070

Closed
rabi wants to merge 1 commit into block:main from
rabi:fix/respawn-cli-on-broken-pipe

Conversation

@rabi
Contributor

@rabi rabi commented Feb 7, 2026

Summary

Replace OnceCell with Mutex<Option<CliProcess>> so that a dead CLI process can be detected and respawned. The persistent stream-json process (introduced in #7029) dies when the user presses Ctrl+C, leaving the OnceCell permanently poisoned with a stale process whose stdin pipe is closed.
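
The pattern in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `CliSlot` and `send_line` are hypothetical names, and the child is a plain `std::process::Child` rather than goose's `CliProcess`. The idea is that a `Mutex<Option<Child>>` can be emptied and refilled, whereas a `OnceCell` keeps whatever it was first given.

```rust
use std::io::Write;
use std::process::{Child, Command, Stdio};
use std::sync::Mutex;

/// Hypothetical holder for a long-lived CLI child process.
struct CliSlot {
    program: String,
    child: Mutex<Option<Child>>,
}

impl CliSlot {
    fn new(program: &str) -> Self {
        Self {
            program: program.to_string(),
            child: Mutex::new(None),
        }
    }

    /// Spawn the child with a piped stdin so lines can be written to it.
    fn spawn(&self) -> std::io::Result<Child> {
        Command::new(&self.program)
            .stdin(Stdio::piped())
            .stdout(Stdio::null())
            .spawn()
    }

    /// Write one line to the child's stdin, spawning or respawning it first
    /// if the slot is empty or the child has already exited.
    /// Returns the pid of the child that received the line.
    fn send_line(&self, line: &str) -> std::io::Result<u32> {
        let mut slot = self.child.lock().unwrap();

        // try_wait() yields Ok(Some(status)) once the child has exited.
        let dead = match slot.as_mut() {
            Some(child) => child.try_wait()?.is_some(),
            None => true,
        };
        if dead {
            *slot = Some(self.spawn()?);
        }

        let child = slot.as_mut().expect("slot was just filled");
        let stdin = child.stdin.as_mut().expect("stdin was piped");
        writeln!(stdin, "{}", line)?;
        Ok(child.id())
    }
}
```

A caller would build one `CliSlot` per provider and call `send_line` for each turn. Note that the sketch only restarts the process; it does nothing to restore conversation state in the new child.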

Type of Change

  • Feature
  • Bug fix
  • Refactor / Code quality
  • Performance improvement
  • Documentation
  • Tests
  • Security fix
  • Build / Release
  • Other (specify below)

AI Assistance

  • This PR was created or reviewed with AI assistance

Testing

Tested locally and Ctrl+C works as expected.

Replace OnceCell with Mutex<Option<CliProcess>> so a dead CLI process
can be detected and respawned. The persistent stream-json process
(introduced in block#7029) could die unexpectedly (e.g. after Ctrl+C),
leaving the OnceCell permanently poisoned with a stale process whose
stdin pipe is closed.

Signed-off-by: rabi <ramishra@redhat.com>
@codefromthecrypt
Collaborator

Curious: is this the right solution? Regarding "dies when the user uses Ctrl+C": wouldn't we want to stop propagating the Ctrl+C instead? Respawning would lose state.

@codefromthecrypt
Collaborator

Basically, I'm not sure of the flow, but in other tools I've worked on, Ctrl+C either means exit the process (which should propagate) or stop the current command (which should not).

Maybe look at how other claude-code runners work. Even if we want to restart the subprocess instead of handling the signal, it is best to cite why we do it that way.

@codefromthecrypt
Collaborator

Spent some time on a more thorough answer.


The child dies on Ctrl-C because it inherits the parent's process group and receives SIGINT directly. Respawning works around this but has side effects:

  • Silent context loss: the respawned CLI has no memory of the previous conversation. Message replay doesn't restore Claude Code's internal state (tool results, file edits, etc.).
  • Fragile error detection: is_recoverable matches "Broken pipe" as a substring, which is OS/locale-dependent. Other death modes (SIGKILL, OOM) won't match.
  • Single retry, no backoff: if the respawn also fails immediately, the error propagates with no further recovery.

The root cause fix is to spawn the child in its own process group so it doesn't receive SIGINT. For reference, forge-core does this with group_spawn(). That one-line change would eliminate the need for respawn logic entirely.
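
As a rough sketch of the process-group idea, assuming a Unix target and Rust 1.64 or later: the standard library's `CommandExt::process_group(0)` places the child in a new process group whose ID equals the child's PID, so a terminal Ctrl+C (delivered to the foreground group) does not reach it. This is not forge-core's code, which uses `group_spawn()` from the `command_group` crate; `spawn_in_own_group` is a hypothetical helper name.

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, Stdio};

/// Hypothetical helper: spawn `program` in its own process group so that a
/// SIGINT sent to the parent's foreground group is not delivered to it.
fn spawn_in_own_group(program: &str, args: &[&str]) -> std::io::Result<Child> {
    Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        // 0 means "use the child's own PID as the new process group ID".
        .process_group(0)
        .spawn()
}
```

The trade-off is that the parent becomes responsible for cleanup: a child in a separate group no longer receives the terminal's signals, so it has to be killed or closed explicitly when the parent exits.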

See also: how other projects manage the Claude Code subprocess
| Project | Signal handling | Persistent NDJSON | Auto-restart |
| --- | --- | --- | --- |
| claude-agent-sdk-go (Go) | None: no OS signal trapping. Close does stdin close + kill after 5s. Interrupt is a protocol message, not a signal. | Yes: one process at Connect(), --output-format stream-json --input-format stream-json, bufio.Scanner line reader. | None: IsAlive() is stubbed (return true, TODO comment). EOF closes channel; consumer must re-Connect(). |
| forge-core (Rust) | kill_on_drop + group_spawn: process group killed when handle drops. No explicit signal trapping. | Yes: --output-format=stream-json --input-format=stream-json via ProtocolPeer on stdin/stdout. | None: new process per execution. Supports --fork-session/--resume but no auto-respawn. |
| mcpc (TypeScript) | kill() on forceCleanup(). No SIGINT/SIGTERM trapping on the parent process. | Yes: spawn() with ndJsonStream(input, output) over persistent stdin/stdout pipes. | None: if the process dies, session fails. No retry or respawn logic. |

@rabi
Contributor Author

rabi commented Feb 7, 2026

I looked into the group_spawn() approach in forge-core. The command_group crate calls pre_exec + setpgid in an unsafe block under the hood, and it would also introduce orphan-process risks if the parent exits without cleaning up the separate process group. From what I can tell, forge-core spawns a new process per execution, so they never face this persistent-process problem in the first place.

Silent context loss: Context loss would happen regardless, if the process dies for any reason. Respawn at least lets the user continue rather than hitting a permanent error. The message replay sends the full conversation history.

Fragile error detection: Good point, but the try_wait() proactive check handles the common case, and the "terminated unexpectedly" string (which we control) covers the rest. But I can see if this can be improved, probably by preserving ErrorKind through the error chain.

Single retry, no backoff: I'm not sure backoff makes sense here. If the respawn fails, it's because the binary is gone or permissions changed; retrying won't help, unlike with a transient network error.
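
One way the "preserve ErrorKind through the error chain" idea mentioned above could look, as a sketch only: `ProviderError` and `is_broken_pipe` are hypothetical names, not goose types. The wrapper keeps the original `io::Error` as its `source()`, so the check can compare `ErrorKind::BrokenPipe` instead of searching the message for the text "Broken pipe".

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical wrapper that adds context but keeps the io::Error as its source.
#[derive(Debug)]
struct ProviderError {
    context: String,
    source: io::Error,
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.context, self.source)
    }
}

impl Error for ProviderError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Walk the error chain and report whether any link is an io::Error
/// whose kind is BrokenPipe. No string matching is involved.
fn is_broken_pipe(err: &(dyn Error + 'static)) -> bool {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        if let Some(io_err) = e.downcast_ref::<io::Error>() {
            if io_err.kind() == io::ErrorKind::BrokenPipe {
                return true;
            }
        }
        current = e.source();
    }
    false
}
```

This only covers failures that surface as a broken pipe on write. A child killed by SIGKILL or OOM would still need a separate liveness check such as try_wait().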

@codefromthecrypt
Collaborator

@rabi mainly I'm concerned about respawn in general, especially if the rationale for it is "dies when the user uses Ctrl+C". Having dealt with things like double Ctrl+C problems in the past, I know there are enough concerns and hard-to-test areas as it is.

I'll leave this to another reviewer if you insist on incidental respawn. I think if a process dies, that is a problem, and "the cure is worse than the disease" applies. If we are killing a process when we shouldn't, this just doesn't seem like the right fix.

Good luck either way!

@rabi
Contributor Author

rabi commented Feb 8, 2026

@codefromthecrypt Previously we spawned a new child process for every conversation turn. One had to press Ctrl+C twice to exit, but a single Ctrl+C could stop a response (more noticeable when streaming). After the change to a persistent process, the user gets the error below after Ctrl+C, and there is no way to continue without starting a new session.

   Context: ○○○○○○○○○○ 0% (100/200000 tokens)
    ( O)> write a poem
    Ran into this error: Request failed: Failed to write to stdin: Broken pipe (os error 32).
    
    Please retry if you think this is a transient or recoverable error.
    
    ⏱️  Elapsed time: 0.13s
    Context: ○○○○○○○○○○ 0% (257/200000 tokens)
    ( O)> Write a poem again
    Ran into this error: Request failed: Failed to write to stdin: Broken pipe (os error 32).
    
    Please retry if you think this is a transient or recoverable error.

I understand the concern about respawn hiding real problems; that's probably a valid worry. But Ctrl+C is just the most reproducible trigger; the persistent process can also die from OOM, crashes, etc. The respawn is meant as general resilience for a long-lived process, not specifically a Ctrl+C workaround. I'm open to alternatives if there's a cleaner approach, but to me this seemed like the least invasive option since it's invisible to the user.

Collaborator

@alexhancock alexhancock left a comment

It makes sense to me as an issue, and the code looks reasonable to me.

May want @codefromthecrypt eyes on it as well

Collaborator

@codefromthecrypt codefromthecrypt left a comment

the ctrl-C to cancel a stream should not stop any subprocess, not MCP, CLI or ACP processes.

The issue I have with this is that we are papering over a design mistake in the cancel flow. I've attempted to explain this, having personally dealt with the double Ctrl+C problem in other projects.

I don't think the "extra resilience" argument holds in this context, and when we move to ACP and delete this provider, we won't control that process directly anyway.

I will raise an alternative.

@codefromthecrypt codefromthecrypt self-assigned this Feb 9, 2026
@codefromthecrypt
Collaborator

First PR to expect will do this:

  1. add a practical test with the double ctrl C so we know the impact on stream cancel and our fix
  2. the fix to the cancel propagation problem

That way, we can keep any "enhanced resilience" concepts, like potentially auto-restarting in a loop, separate. If we wanted anything like that, it would apply to any subprocess we launch (MCP, ACP, and CLI while they exist), or we'd need to explain why we only do it for claude.

@codefromthecrypt
Collaborator

Also, I noticed that with claude only complete_with_model is defined, not streaming, so this is an important part of the problem statement.

https://github.com/block/goose/blob/main/crates/goose/src/providers/claude_code.rs#L507

@alexhancock
Collaborator

Ah I missed that you'd already reviewed @codefromthecrypt; I just looked at the diff and weighed in, missed the previous conversation.

@codefromthecrypt
Collaborator

this peels off the first part. we were inconsistent between MCP and CLI providers, so it is an easy fix.

Note I wasn't able to reproduce the ctrl-c killing things, possibly due to platform or linux in use. we can retry after with exact platform info to see what's going on. #7083

@rabi
Contributor Author

rabi commented Feb 9, 2026

Note I wasn't able to reproduce the ctrl-c killing things

Did you try to do ctrl+c during the wait (before the response)?

Context: ○○○○○○○○○○ 0% (0/200000 tokens)
( O)> can you write a poem of 20 lines with start and end markers?
◒  Decoding human intent...
Interrupted before the model replied and removed the last message.

⏱️  Elapsed time: 6.03s
Context: ○○○○○○○○○○ 0% (0/200000 tokens)
( O)> can you write a poem of 20 lines with start and end markers?
Ran into this error: Request failed: Failed to write to stdin: Broken pipe (os error 32).

Please retry if you think this is a transient or recoverable error.


also, noticed with claude only complete_with_model is defined, not streaming, so this is important part of the problem statement.

Yes, it is. I think I mentioned earlier in the conversation that the issue was noticed in the context of the streaming implementation I was doing in #6833, but it also applies when we're not using streaming, as demonstrated above.

@codefromthecrypt
Collaborator

thanks for the clarification @rabi please check based on latest main (build etc) without streaming change. if it kills the subprocess, please report back the Claude Code etc version and platform. if it doesn't kill the subprocess, let's close this PR out so we can focus!

cheers

@rabi
Contributor Author

rabi commented Feb 9, 2026

thanks for the clarification @rabi please check based on latest main (build etc) without streaming change. if it kills the subprocess, please report back the Claude Code etc version and platform. if it doesn't kill the subprocess, let's close this PR out so we can focus!

cheers

Yeah, thanks! #7083 would fix that issue. But I'll keep my concern about child dying without the parent knowing about it (one has to start a new session) for another day 😄 I've updated #6833 to implement streaming and also handle the draining of leftover NDJSON events from the cancelled response. Have a look when possible.

@rabi rabi closed this Feb 9, 2026
3 participants

Comments