Add support for asynchronous waitpid on Linux systems. #622
Conversation
Codecov Report

|          | master | #622  | +/-    |
|----------|--------|-------|--------|
| Coverage | 99.31% | 99.3% | -0.02% |
| Files    | 93     | 95    | +2     |
| Lines    | 10954  | 12022 | +1068  |
| Branches | 782    | 1018  | +236   |
| Hits     | 10879  | 11938 | +1059  |
| Misses   | 56     | 63    | +7     |
| Partials | 19     | 21    | +2     |

Continue to review full report at Codecov.
High-level comment on both this and #621: #621 (comment)
High-level review: Looking at this again, I think this is actually working harder than it has to! :-) I made it too complicated in my sample code. (Also, I think maybe I was still imagining we would expose a public API for this.)

On Unix, we know that we'll eventually be calling `waitpid`. So for this private API the operations we need are:

So:

Sorry for sending you astray...
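For orientation, here is a heavily hedged sketch of the general shape under review: a system task runs the blocking `os.waitpid` in a worker thread and hands the result back through an Event plus an outcome. Modern trio names (`trio.to_thread.run_sync`, `trio.lowlevel.spawn_system_task`) stand in for the `run_sync_in_worker_thread` / `_core.spawn_system_task` helpers quoted in this thread; `WaitpidResult` and `_reap` are made-up names, not the PR's actual code.

```python
# Hedged sketch only -- the PR's real module is trio/_subprocess/linux_waitpid.py;
# WaitpidResult and _reap are illustrative stand-ins for its internals.
import os

import outcome
import trio


class WaitpidResult:
    def __init__(self):
        self.event = trio.Event()   # set once the blocking waitpid finishes
        self.outcome = None         # outcome.Value or outcome.Error


async def _reap(pid, result):
    # Runs as a system task: do the blocking os.waitpid in a worker thread.
    try:
        result.outcome = outcome.Value(
            await trio.to_thread.run_sync(os.waitpid, pid, 0)
        )
    except Exception as e:
        # e.g. ChildProcessError; hand it to whoever is waiting
        result.outcome = outcome.Error(e)
    finally:
        # Unblock any waiter, even if we were cancelled during run shutdown.
        result.event.set()


async def waitpid(pid):
    result = WaitpidResult()
    trio.lowlevel.spawn_system_task(_reap, pid, result)
    await result.event.wait()
    return result.outcome.unwrap()
```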
trio/_subprocess/linux_waitpid.py (outdated)

    result = _pending_waitpids.pop(pid)
    result.outcome = outcome.Error(e)
    result.event.set()
    raise
The only way to hit this `except` block (modulo bugs) is if the `run_sync_in_worker_thread` gets cancelled. Cancellation is always a `BaseException`, not an `Exception`, so right now I think this is a no-op?

Also, it's really not clear whether we even care about cancellation here... since this is a system task, the only way it can get cancelled is if the entire `run` is shutting down, either because the main task has already finished, or because there was an internal error and we're crashing. So... I guess the only time it matters whether we call `result.event.set()` here is when someone is doing a cancel-shielded wait for a subprocess to finish? And if they do, we won't handle it correctly anyway – the user would expect a cancel-shielded wait to actually wait for the subprocess to exit, but we'll still abort the `run_sync_in_worker_thread`, so we won't know when the subprocess exits. Maybe we need to resurrect #303 to handle this corner case? Or toggle the system task's shielding on and off depending on whether anyone is currently blocked in `Popen.wait`? Or use `threading.Thread` directly instead of trying to go through `run_sync_in_worker_thread`? (...can we even reliably re-enter the trio thread once the system nursery is cancelled? The re-entry queue processor is also a system task...)

Man, I hate `waitpid`.

This is such an obscure use case that I don't think we need to worry about it right now. Trying to do a cancel-shielded wait for a subprocess isn't too ridiculous – `nursery.__aexit__` is one of the standard use cases for cancel-shielding, and people will want to do nursery-like things for managing subprocesses. So we might end up caring eventually. But right now we should just shrug and accept that it won't do exactly the right thing here. And if we don't care about that, we can make this whole function a lot simpler, like...
    try:
        result.outcome = await run_sync_in_worker_thread(...)
    finally:
        # Paranoia to make sure we unblock any waiters even if something goes horribly wrong
        result.event.set()
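As an aside, here is a minimal standalone sketch (not from this PR) of the `BaseException` point above: trio's cancellation exception derives from `BaseException`, so an `except Exception` clause never sees it.

```python
import trio


async def main():
    with trio.move_on_after(0.1):
        try:
            await trio.sleep(10)  # cancelled after 0.1 seconds
        except Exception:
            print("never reached: Cancelled is not an Exception")
        except BaseException:
            print("cancellation lands here")
            raise  # Cancelled must always propagate


trio.run(main)
```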
> Trying to do a cancel-shielded wait for a subprocess isn't too ridiculous – `nursery.__aexit__` is one of the standard use cases for cancel-shielding, and people will want to do nursery-like things for managing subprocesses.
Actually, on further thought, this is still overstating the importance of this edge case. If someone does a cancel-shielded wait for a subprocess, that will mostly work fine, even with my "simplified" code above. The only way this task gets cancelled is if we're crashing, or if the main task has already exited. That means that the only ways a cancel-shielded `Popen.wait` can fail to actually wait are:

- We're in the middle of crashing with a `TrioInternalError`: well, sorry, that means internal invariants have been violated and we're blowing up the world, so it's OK if your task manager doesn't wait for child processes correctly. At this point all guarantees are void.
- If we're not crashing with `TrioInternalError`, and the system task is cancelled, and someone is doing a cancel-shielded `Popen.wait`, then the main task has already exited, so they must be doing it from inside a system task. But doing anything inside a cancel-shield in a system task is highly dubious, because by the time a system task gets cancelled the world is being torn down around you. Don't do that please.
    async def test_waitpid():
        pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
        result = await waitpid(pid)
        # exit code is a 16-bit int: (code, signal)
In theory it's a bit more complicated than that: there's a "was a core dumped?" flag in there, and the ability to distinguish between stop signals and termination signals (see).

In practice this is a pretty pedantic distinction. If we want to be really POSIX-ly correct, though, I guess the tests should make assertions like

    assert os.WIFEXITED(code) and os.WEXITSTATUS(code) == 1
    assert os.WIFEXITED(code) and os.WEXITSTATUS(code) == 0
    assert os.WIFSIGNALED(code) and os.WTERMSIG(code) == signal.SIGKILL

?
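For reference, a small standalone sketch (not part of the PR) showing how the standard `os` wait-status macros decode the raw status returned by `os.waitpid`, including the core-dump flag and the stopped-vs-terminated distinction mentioned above; `describe_wait_status` is a hypothetical helper name.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw wait status from os.waitpid() using the POSIX macros."""
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        core = " (core dumped)" if os.WCOREDUMP(status) else ""
        return f"killed by signal {signal.Signals(os.WTERMSIG(status)).name}{core}"
    if os.WIFSTOPPED(status):
        return f"stopped by signal {signal.Signals(os.WSTOPSIG(status)).name}"
    return "unrecognized wait status"
```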
    async def test_waitpid_no_process():
        with pytest.raises(ChildProcessError):
            # this PID probably doesn't exist
            await waitpid(100000)
Ooh, I thought of a trick to make this deterministic.

You can only wait for your own child processes. So we need the pid of a process that we know is not a child process. How about... `waitpid(os.getpid())`? I think that deterministically raises the error.

Or I guess `waitpid(1)` would also work, since `init` has to exist and is never a child of any other process.
I have no clue what the hell happened to the tests here.

Huh, bizarre. Could be a change in a third-party dependency, like pytest-cov or coverage...?

There's something weird with your rebase too... did you accidentally rebase master onto this, or something? Somehow the commits in this branch that came from master have different hashes than they do on master...
Ouch. You've been rebasing in the wrong direction, i.e. instead of rebasing your work on top of the release you've been rebasing the release tree on top of your work. :-(
Frankly I don't like rebasing anyway, just reset to 133e46b, merge up, and be done with it.

(and then cherry-pick cfbedd0)
Oh, I think I accidentally rebased this onto master, and then rebased master from my fork onto this. Whoops.
Codecov Report

|          | master | #622   | +/-    |
|----------|--------|--------|--------|
| Coverage | 99.31% | 99.31% | +<.01% |
| Files    | 93     | 95     | +2     |
| Lines    | 10975  | 11028  | +53    |
| Branches | 785    | 786    | +1     |
| Hits     | 10900  | 10953  | +53    |
| Misses   | 56     | 56     |        |
| Partials | 19     | 19     |        |

Continue to review full report at Codecov.
Still getting those weird failures. Lacking better ideas, at this point I'd probably try debugging by trying to isolate what exactly is triggering that, e.g. by temporarily pushing a commit that turns off all the tests added in this PR, and by opening a trivial no-op PR to confirm whether the issue even is specific to this PR.
This is looking good, one small change requested below. (And the nice thing is, that should trigger a new run of the tests and untangle the CI mess.)
    _core.spawn_system_task(_task, waiter)

    await waiter.event.wait()
    return waiter.outcome.unwrap()
I guess we'll eventually need to split this up into two functions, but that's fine, no particular reason to do that now.
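If that split ever happens, one plausible shape, reusing the `WaitpidResult`/`_reap` names from the hedged sketch earlier in this thread (all names here are hypothetical, not the PR's API):

```python
def begin_waitpid(pid):
    # Kick off the background system task and return a handle immediately.
    waiter = WaitpidResult()  # the .event / .outcome container from the sketch above
    trio.lowlevel.spawn_system_task(_reap, pid, waiter)
    return waiter


async def finish_waitpid(waiter):
    # Block until the background task finishes, then return or raise its result.
    await waiter.event.wait()
    return waiter.outcome.unwrap()
```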
    pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
    result = await waitpid(pid)
    # exit code is a 16-bit int: (code, signal)
    assert os.WIFEXITED(result[1]) and os.WEXITSTATUS(result[1]) == 1
We should probably assert `result[0] == pid` as well.
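In other words, the test body would end up looking something like this sketch of the suggested change (same imports as the existing test module; not the final committed code):

```python
async def test_waitpid():
    pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
    result = await waitpid(pid)
    # waitpid returns (pid, status); check both halves
    assert result[0] == pid
    # exit code is a 16-bit int: (code, signal)
    assert os.WIFEXITED(result[1]) and os.WEXITSTATUS(result[1]) == 1
```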
Shouldn't need a rebase – the CI systems don't test the head of the PR branch, they make a temporary merge between the PR branch and master and then test the merge. So the next time they run on this PR, they should pick up the fix from #647.

Tests are failing (typo in the new asserts)

Jenkins seems to be confused... Maybe this will tickle it?

That was a new one: apparently Jenkins just didn't notice the latest push or something; it didn't even create a job record, never mind actually run anything. But close/reopen seems to have fixed it. (Which is funny because usually Jenkins ignores close/reopen.)
Ok, code looks good, CI is green, I'm going to merge this PR quick before it gets hit by another bizarre mishap.
This completes step 6 of #4 (comment).