fix(claude-code): respawn persistent CLI process on broken pipe#7070
fix(claude-code): respawn persistent CLI process on broken pipe#7070rabi wants to merge 1 commit intoblock:mainfrom
Conversation
Replace OnceCell with Mutex<Option<CliProcess>> so a dead CLI process can be detected and respawned. The persistent stream-json process (introduced in block#7029) could die unexpectedly (e.g. after Ctrl+C), leaving the OnceCell permanently poisoned with a stale process whose stdin pipe is closed. Signed-off-by: rabi <ramishra@redhat.com>
|
curious is this the right solution? "dies when the user uses Ctrl+C," wouldn't we want to not propagate the ctrl-c instead? respawning would lose state. |
|
basically I'm not sure the flow, but in other tools I've worked on there's either intent that ctrl+c means exit the process (which should propagate) or it means stop the current command (should not). maybe look at how other claude-code runners work as even if we want to restart the subprocess instead of handling the signal.. if we do it that way, it is best to cite why. |
|
spent some time for a more thorough answer. The child dies on Ctrl-C because it inherits the parent's process group and receives SIGINT directly. Respawning works around this but has side effects:
The root cause fix is to spawn the child in its own process group so it doesn't receive SIGINT. For reference, forge-core does this with See also: how other projects manage the Claude Code subprocess
|
|
I looked into the group_spawn() approach in forge-core, the command_group crate uses calls pre_exec + setpgid in an unsafe block under the hood and would also introduce orphan-process risks if the parent exits without cleaning up the separate process group. From what I can tell, forge-core spawns a new process per execution, so they never face this persistent-process problem in the first place. Silent context loss: Context loss would happen regardless if the process dies for any reason, Respawn at least lets the user continue rather than hitting a permanent error. The message replay sends the full conversation history. Fragile error detection: Good point, but the try_wait() proactive check handles the common case, and the "terminated unexpectedly" string (which we control) covers the rest. But I can see if this can be improved, probably by preserving ErrorKind through the error chain. Single retry, no backoff: I'm not sure backoff makes sense here. If the respawn fails, it's because the binary is gone or permissions changed,retrying won't help unlike a transient network error. |
|
@rabi mainly I'm concerned about respawn in general especially if the rationale to do this is about "dies when the user uses Ctrl+C" having dealt with things like double ctrl+c problems in the past there are enough concerns and hard to test areas as they are. I'll leave this to another reviewer if you insist on incidental respawn. I think if a process dies it is a problem and the cure worse than the disease applies. If we are killing a process when we shouldn't this seems just not the right fix. Good luck either way out! |
|
@codefromthecrypt We had a new child process for every conversation turn before, and one has to use double Ctrl+C to exit, but you can stop a response using Ctrl+C (more amplified when using streaming). After the change to use a persistent process, the user gets below error after Ctrl+C and there is no way to continue without starting a new session. I understand the concern about respawn hiding real problems, that's probably a valid worry. But Ctrl+C is just the most reproducible trigger; the persistent process can also die from OOM, crashes, etc. The respawn is meant as general resilience for a long-lived process, not specifically a Ctrl+C workaround. Open to alternatives, if there's a cleaner approach, but to me this seemed like the least invasive option since it's invisible to the user. |
alexhancock
left a comment
There was a problem hiding this comment.
It makes sense to me as an issue, and the code looks reasonable to me.
May want @codefromthecrypt eyes on it as well
codefromthecrypt
left a comment
There was a problem hiding this comment.
the ctrl-C to cancel a stream should not stop any subprocess, not MCP, CLI or ACP processes.
The issue I have with this is that we are papering over a design mistake where we are papering over the cancel flow problem. I've attempted to explain this, having personally dealt with this in other projects which deal with the double ctrl-c problem.
I don't think the "extra reslience" argument holds in this context, and when we move to ACP and delete this provider, we won't control that process directly anyway.
I will raise an alternative.
|
First PR to expect will do this:
That way, we can keep separate any "enhanced resilience" concepts like potentially auto-restarting in a loop, as doing anything like that, if we wanted to do that would be for any subprocess we launch (MCP, ACP, and CLI while they exist) or we'd need to explain why only do this in claude. |
|
also, noticed with claude only complete_with_model is defined, not streaming, so this is important part of the problem statement. https://github.com/block/goose/blob/main/crates/goose/src/providers/claude_code.rs#L507 |
|
Ah I missed that you'd already reviewed @codefromthecrypt; I just looked at the diff and weighed in, missed the previous conversation. |
|
this peels off the first part. we were inconsistent between MCP and CLI providers, so it is an easy fix. Note I wasn't able to reproduce the ctrl-c killing things, possibly due to platform or linux in use. we can retry after with exact platform info to see what's going on. #7083 |
Did you try to do ctrl+c during the wait (before the response)?
Yes it is, I think I mentioned earlir in the conversation that the issue was noticed in the context of the streaming implementation I was doing in #6833, but it applies to when we're not using streaming as demonstrated above. |
|
thanks for the clarification @rabi please check based on latest main (build etc) without streaming change. if it kills the subprocess, please report back the Claude Code etc version and platform. if it doesn't kill the subprocess, let's close this PR out so we can focus! cheers |
Yeah, thanks! #7083 would fix that issue. But I'll keep my concern about child dying without the parent knowing about it (one has to start a new session) for another day 😄 I've updated #6833 to implement streaming and also handle the draining of leftover NDJSON events from the cancelled response. Have a look when possible. |
Summary
Replace OnceCell with Mutex<Option> so a dead CLI process can be detected and respawned. The persistent stream-json process (introduced in #7029) dies when the user uses Ctrl+C, leaving the OnceCell permanently poisoned with a stale process whose stdin pipe is closed.
Type of Change
AI Assistance
Testing
Tested locally and Ctrl+C works as expected.