
Add support for asynchronous waitpid on Linux systems. #622

Merged · 4 commits · Sep 5, 2018

Conversation

@Fuyukai (Member) commented Aug 22, 2018

This completes step 6 of #4 (comment).

@codecov bot commented Aug 22, 2018

Codecov Report

Merging #622 into master will decrease coverage by 0.01%.
The diff coverage is 91.35%.

@@            Coverage Diff            @@
##           master    #622      +/-   ##
=========================================
- Coverage   99.31%   99.3%   -0.02%     
=========================================
  Files          93      95       +2     
  Lines       10954   12022    +1068     
  Branches      782    1018     +236     
=========================================
+ Hits        10879   11938    +1059     
- Misses         56      63       +7     
- Partials       19      21       +2
Impacted Files Coverage Δ
trio/tests/subprocess/test_waitpid_linux.py 100% <100%> (ø)
trio/_subprocess/linux_waitpid.py 86.27% <86.27%> (ø)
trio/_core/_run.py 99.66% <0%> (-0.03%) ⬇️
trio/_core/tests/test_run.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fbb4543...cfbedd0.

@njsmith (Member) commented Aug 23, 2018

High-level comment on both this and #621: #621 (comment)

@njsmith (Member) commented Sep 2, 2018

High-level review:

Looking at this again, I think this is actually working harder than it has to! :-) I made it too complicated in my sample code. (Also, I think maybe I was still imagining we would expose a public API for waitpid? That seems overambitious for right now.)

On Unix, we know that we'll eventually be calling waitpid exactly once for each process: you can't call it multiple times because the PID gets reassigned after you call it once, and you have to call it once to avoid leaking zombies. So in our Popen class, on Linux, I figure the simplest thing to do will be to spawn the waitpid thread immediately when we start the subprocess, and then our Popen.wait method will have to check whether it's already gotten a result, or else wait for that thread to finish.

So for this private API the operations we need are:

  • Spin up the waitpid thread
  • Wait for the waitpid thread to finish
  • Once it's finished, find out what happened to it (exactly once, since then we'll save the result onto the Popen object)

So:

  • The _cached_result stuff to work around Outcome.unwrap consuming the outcome is unnecessary, because we'll only need to look at the outcome once. (This is also good because this reintroduces the bug that Outcome.unwrap is trying to protect you against by consuming the object :-/. Since Python 3 exceptions carry their traceback on them, you get corrupted tracebacks if you raise the same exception object in multiple contexts.)

  • We don't need the pending_waitpids global dict: in our "Spin up the waitpid thread" operation we can create a WaitpidResult, hand it off to the system task to fill in, and then return it. Then our "Wait for the waitpid thread to finish" operation operates on that result object, so... no need to look it up.
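The traceback-corruption hazard that Outcome.unwrap guards against can be seen with a few lines of plain Python (a standalone illustration, not trio code): re-raising the same exception object makes its traceback accumulate frames from every raise site.

```python
import traceback

exc = ValueError("boom")

try:
    raise exc
except ValueError:
    # After the first raise, the traceback has frames from one raise site
    frames_first = len(traceback.extract_tb(exc.__traceback__))

try:
    raise exc
except ValueError:
    # Re-raising the same object appends new frames to the old traceback
    frames_second = len(traceback.extract_tb(exc.__traceback__))

# The traceback grew: raising one object in multiple contexts garbles it
assert frames_second > frames_first
```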

Sorry for sending you astray...
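As a rough illustration of the operations described above, here is a plain-threads sketch. The names WaitpidResult, spawn_waitpid_thread, and wait_for_waitpid are made up for this example; trio's actual implementation goes through a system task, run_sync_in_worker_thread, and trio.Event rather than raw threading.

```python
import os
import threading

class WaitpidResult:
    """Filled in exactly once by the background thread."""
    def __init__(self, pid):
        self.pid = pid
        self.event = threading.Event()  # stand-in for trio.Event
        self.outcome = None             # (pid, status) tuple, or an exception

def spawn_waitpid_thread(pid):
    """Start a thread that blocks in os.waitpid(pid) and records the result.

    Intended to be called once, right when the child is spawned; returns the
    handle that a later wait operation consumes exactly once.
    """
    result = WaitpidResult(pid)

    def _task():
        try:
            result.outcome = os.waitpid(pid, 0)
        except BaseException as exc:    # e.g. ChildProcessError
            result.outcome = exc
        finally:
            result.event.set()          # unblock waiters no matter what

    threading.Thread(target=_task, daemon=True).start()
    return result

def wait_for_waitpid(result):
    """Block until the waitpid thread finishes, then return or raise (once)."""
    result.event.wait()
    if isinstance(result.outcome, BaseException):
        raise result.outcome
    return result.outcome
```

Note how there is no global dict: the result object is created by the spawn operation, handed to the background task to fill in, and returned directly to the caller.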

result = _pending_waitpids.pop(pid)
result.outcome = outcome.Error(e)
result.event.set()
raise
Inline review comment (Member):

The only way to hit this except block (modulo bugs) is if the run_sync_in_worker_thread gets cancelled. Cancellation is always a BaseException, not an Exception, so right now I think this is a no-op?

It's really not clear whether we even care about cancellation here... since this is a system task, the only way it can get cancelled is if the entire run is shutting down, either because the main task has already finished, or because there was an internal error and we're crashing. So... I guess the only time it matters whether we call result.event.set() here is when someone is doing a cancel-shielded wait for a subprocess to finish? And if they do, we won't handle it correctly anyway – the user would expect a cancel-shielded wait to actually wait for the subprocess to exit, but we'll still abort the run_sync_in_worker_task, so we won't know when the subprocess exits. Maybe we need to resurrect #303 to handle this corner case? Or toggle the system task's shielding on and off depending on whether anyone is currently blocked in Popen.wait? Or use threading.Thread directly instead of trying to go through run_sync_in_worker_task? (...can we even reliably re-enter the trio thread once the system nursery is cancelled? The re-entry queue processor is also a system task...)

Man, I hate waitpid.

This is such an obscure use case that I don't think we need to worry about it right now. Trying to do a cancel-shielded wait for a subprocess isn't too ridiculous – nursery.__aexit__ is one of the standard use cases for cancel-shielding, and people will want to do nursery-like things for managing subprocesses. So we might end up caring eventually. But right now we should just shrug and accept that it won't do exactly the right thing here. And if we don't care about that, we can make this whole function a lot simpler, like...

try:
    result.outcome = await run_sync_in_worker_thread(...)
finally:
    # Paranoia to make sure we unblock any waiters even if something goes horribly wrong
    result.event.set()

Inline review comment (Member):

Trying to do a cancel-shielded wait for a subprocess isn't too ridiculous – nursery.__aexit__ is one of the standard use cases for cancel-shielding, and people will want to do nursery-like things for managing subprocesses.

Actually, on further thought, this is still overstating the importance of this edge case. If someone does a cancel-shielded wait for a subprocess, that will mostly work fine, even with my "simplified" code above. The only way this task gets cancelled is if we're crashing, or if the main task has already exited. That means that the only ways a cancel-shielded Popen.wait can fail to actually wait are:

  • We're in the middle of crashing with a TrioInternalError: well, sorry, that means internal invariants have been violated and we're blowing up the world, so it's OK if your task manager doesn't wait for child processes correctly. At this point all guarantees are void.

  • If we're not crashing with TrioInternalError, and the system task is cancelled, and someone is doing a cancel-shielded Popen.wait, then the main task has already exited, so they must be doing it from inside a system task. But doing anything inside a cancel-shield in a system task is highly dubious, because by the time a system task gets cancelled the world is being torn down around you. Don't do that please.

async def test_waitpid():
    pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
    result = await waitpid(pid)
    # exit code is a 16-bit int: (code, signal)
Inline review comment (Member):

In theory it's a bit more complicated than that: there's a "was a core dumped?" flag in there, plus the ability to distinguish between stop signals and termination signals.

In practice this is a pretty pedantic distinction. If we want to be really POSIX-ly correct, though, I guess the tests should make assertions like

assert os.WIFEXITED(code) and os.WEXITSTATUS(code) == 1
assert os.WIFEXITED(code) and os.WEXITSTATUS(code) == 0
assert os.WIFSIGNALED(code) and os.WTERMSIG(code) == signal.SIGKILL

?
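The distinction can be exercised directly against the blocking os.waitpid (a standalone, Unix-only sketch, separate from trio's async wrapper): a normal exit and a signal death produce status words that decode through different macros.

```python
import os
import signal

# Normal exit: /bin/false exits with code 1
pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 1

# Death by signal: WIFEXITED is false here; WTERMSIG carries the signal number
pid = os.spawnvp(os.P_NOWAIT, "/bin/sleep", ("sleep", "60"))
os.kill(pid, signal.SIGKILL)
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
```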

async def test_waitpid_no_process():
    with pytest.raises(ChildProcessError):
        # this PID probably doesn't exist
        await waitpid(100000)
Inline review comment (Member):

oo I thought of a trick to make this deterministic.

You can only wait for your child processes. So we need the pid of a process that we know is not a child process. How about... waitpid(os.getpid())? I think that deterministically raises the error.

Or I guess waitpid(1) would also work, since init has to exist and is never a child of any other process.
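The trick is easy to check against the blocking os.waitpid (a standalone sketch, independent of trio's wrapper): our own PID is never one of our children, so waiting on it deterministically fails with ECHILD.

```python
import os

# os.getpid() is never a child of this process, so waitpid on it always
# fails with ECHILD, which Python raises as ChildProcessError
try:
    os.waitpid(os.getpid(), 0)
    raised = False
except ChildProcessError:
    raised = True

assert raised
```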

@Fuyukai (Member, Author) commented Sep 4, 2018

I have no clue what the hell happened to the tests here.

@njsmith (Member) commented Sep 4, 2018

Huh, bizarre. Could be a change in a third-party dependency, like pytest-cov or coverage...?

@njsmith (Member) commented Sep 4, 2018

There's something weird with your rebase too... did you accidentally rebase master onto this, or something? Somehow the commits in this branch that came from master have different hashes than they do on master...

@smurfix (Contributor) commented Sep 4, 2018

Owch. You've been rebasing in the wrong direction, i.e. instead of rebasing your work on top of the release you've been rebasing the release tree on top of your work. :-(

@smurfix (Contributor) commented Sep 4, 2018

Frankly I don't like rebasing anyway, just reset to 133e46b, merge up, and be done with it.

@smurfix (Contributor) commented Sep 4, 2018

(and then cherry-pick cfbedd0)

@Fuyukai (Member, Author) commented Sep 4, 2018

Oh, I think I accidentally rebased this onto master, and then rebased master from my fork onto this. Whoops.

@codecov-io commented Sep 4, 2018

Codecov Report

Merging #622 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #622      +/-   ##
==========================================
+ Coverage   99.31%   99.31%   +<.01%     
==========================================
  Files          93       95       +2     
  Lines       10975    11028      +53     
  Branches      785      786       +1     
==========================================
+ Hits        10900    10953      +53     
  Misses         56       56              
  Partials       19       19
Impacted Files Coverage Δ
trio/tests/subprocess/test_waitpid_linux.py 100% <100%> (ø)
trio/_subprocess/linux_waitpid.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 081a45b...e8b8861.

@njsmith (Member) commented Sep 4, 2018

Still getting those coverage combine failures though :-(. What on earth...

Lacking better ideas, at this point I'd probably try debugging by trying to isolate what exactly is triggering that, e.g. by temporarily pushing a commit that turns off all the tests added in this PR, and by opening a trivial no-op PR to confirm whether the issue even is specific to this PR.

@pquentin (Member) commented Sep 5, 2018

OK, the coverage combine failures were sorted out in #647, thanks @njsmith!

Does that mean we need another rebase?

@njsmith (Member) left a review comment:

This is looking good, one small change requested below. (And the nice thing is, that should trigger a new run of the tests and untangle the CI mess.)

_core.spawn_system_task(_task, waiter)

await waiter.event.wait()
return waiter.outcome.unwrap()
Inline review comment (Member):

I guess we'll eventually need to split this up into two functions, but that's fine, no particular reason to do that now.

pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
result = await waitpid(pid)
# exit code is a 16-bit int: (code, signal)
assert os.WIFEXITED(result[1]) and os.WEXITSTATUS(result[1]) == 1
Inline review comment (Member):

We should probably assert result[0] == pid as well

@njsmith (Member) commented Sep 5, 2018

Shouldn't need a rebase – the CI systems don't test the head of the PR branch, they make a temporary merge between the PR branch and master and then test the merge. So the next time they run on this PR, they should pick up the fix from #647.

@njsmith (Member) commented Sep 5, 2018

Tests are failing (typo in the new asserts)

@njsmith (Member) commented Sep 5, 2018

Jenkins seems to be confused... Maybe this will tickle it?

@njsmith closed this Sep 5, 2018
@njsmith reopened this Sep 5, 2018
@njsmith (Member) commented Sep 5, 2018

That was a new one: apparently Jenkins just didn't notice the latest push or something; it didn't even create a job record, never mind actually run anything. But close/reopen seems to have fixed it. (Which is funny because usually Jenkins ignores close/reopen.)

@njsmith (Member) commented Sep 5, 2018

Ok, code looks good and CI is green; I'm going to merge this PR quick before it gets hit by another bizarre mishap.

@njsmith merged commit ee45fab into python-trio:master Sep 5, 2018
5 participants