-
-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider the behavior of nursery.start when the target nursery is cancelled #1431
Comments
Let me give a little more context and color on my scenario (since I think this scenario will be relatively common in an async codebase of sufficient size). Specifically in our scenario, our code is running in some nursery A, with some child nursery B. We launch a OS process P in nursery B with B.start_soon; the function to launch it waits until it exits and throws if it exits uncleanly. Then also in B we start up a function M with B.start to monitor process P (in a process-specific way; it essentially reads structured logging information from disk); this monitoring task returns a monitoring interface with task_status.started that we query for state changes in P. It doesn't return the monitoring interface until it's able to read the first structured logging information from disk, indicating that P has come up successfully. The problem is if P exits uncleanly during startup, before it starts writing any structured logging, then the task monitoring P will throw an exception, which will cancel B - but monitoring function M will never get any structured logging information, so it will just hang forever. This kind of pattern is pretty common in our codebase - it might be less common for others since we've been developing complex process management code with trio long before trio natively supported subprocesses. |
I came across this today while looking at what AnyIO does in a similar situation (as part of agronholm/anyio#717). I was just running abstract tests rather than encountering it in the wild though (and the nurseries were siblings rather than parent/child). I've got to say, I was very surprised that a cancelled nursery could be held up by a task that has no indication that it should be cancelling. I considered the same two fixes as njs: allow |
There was a discussion today in chat about the subtle interactions of
nursery.start
and cancellation, involving a somewhat complicated example: https://gitter.im/python-trio/general?at=5e6684aea2897318a999fe4eIn the example, @catern has an inner nursery and an outer nursery. A task in the outer nursery tries to spawn a task, using
await inner_nursery.start(...)
. In the mean time, another task inside the inner nursery crashes. And unluckily, the new task's startup involved waiting for the crashed task. This led to a deadlock, because:nursery.start
, and hasn't yet calledtask_status.started()
, then it's in a weird state where it keeps the target nursery open, but it's still in the original task's cancel scope.task_status.started()
, the new child is stuck and should be cancelled, it won't be cancelled until the exception propagates out of the inner nursery → deadlock.I'm not sure how common this particular situation is, but it does seem kinda problematic that we have a state where a nursery is trying to crash but it can't actually cancel everything inside it.
There are two features that are nice on their own, but play a role in creating this situation:
Feature 1: We want
task_status.started()
to be infallible. To achieve this,nursery.start
has to keep the target nursery open while the task is starting; otherwise, the nursery might disappear while the task is starting up (e.g. to all its other tasks exiting).Feature 2: If the new child raises an exception before calling
task_status.started()
(e.g., imagine you passed the wrong number of arguments to the function), then that exception is propagated out of the call tonursery.start
, not into the nursery itself.And, as a general rule, we always have to keep our exception propagation hierarchy and our cancel scope hierarchy lined up, so we can't have wild
Cancelled
exceptions propagating outside of the associated cancel scope. Put together, these mean that the new child has to run inside the caller's cancel scope untiltask_status.started()
is called, and then the call tostarted
transitions it to running inside the target nursery's cancel scope.And, this is also kind of a nice feature in its own right, because e.g. it means if you want to put a timeout on a task starting up, that's easy to do:
So that's how we end up with code that's holding the nursery open, but isn't inside its cancel scope.
Thinking about it, I think we could potentially relax the first feature... we could make it so
nursery.start
didn't keep the nursery open, and if the nursery was closed when we reached the call totask_status.started()
, then it could raiseRuntimeError
. This is already whatnursery.start_soon
andnursery.start
do if the nursery is closed when you call them. But right now, we only check for this once, before the task starts running; the new thing would be that the task could get through its startup phase and then fail due to the missing nursery.This could potentially cause new failures, if code isn't being careful to keep the nursery open until the task has finished starting up? But I think any code that could hit that failure was already race-y, because if the nursery could spontaneously close while
nursery.start
is running, then it could also spontaneously close just beforenursery.start
is called, right? Unless the startup phase code itself does something to trigger the other tasks to exit, but that seems pretty perverse.OTOH, this wouldn't actually help with @catern's particular situation (because their code is deadlocking before it reaches
task_status.started()
). And it doesn't really feel like it addresses the core theoretical problem, that a nursery can be trying to tear itself down while there's still code "running inside it" that hasn't been informed. To do that, we need to actually propagate cancellation into the new task.But, because of "Feature 2", we can't simply go ahead and put the child task directly into the target nursery's cancel scope.
One possibility that came up in chat: we could potentially put the child task into both cancel scopes simultaneously, but with a twist: if the task terminates with a
Cancelled
exception before callingtask_status.started()
, and thenursery.start
scope is cancelled, then theCancelled
is allowed to propagate into the caller, like it does now. But if thisCancelled
came from the nursery's cancel scope, then we can't allow it to propagate out into the caller, because that would become a wildCancelled
that might never be caught. So instead,nursery.start
would need to detect this situation, and convert it into some other exception, like aNurseryWasCancelledError
or something.This seems potentially complicated to implement, and
nursery.start
is already pretty complicated. (Though @oremanj's cancel scope refactoring made it much better; maybe this wouldn't actually be too hard? idk) But it does seem like it might be The Right Solution? I'm not sure.The text was updated successfully, but these errors were encountered: