
Add support for asynchronous waitpid on Linux systems. #622

Merged · 4 commits · Sep 5, 2018

Conversation

@Fuyukai (Member) commented Aug 22, 2018

This completes step 6 of #4 (comment).

@codecov bot commented Aug 22, 2018

Codecov Report

Merging #622 into master will decrease coverage by 0.01%.
The diff coverage is 91.35%.

@@            Coverage Diff            @@
##           master    #622      +/-   ##
=========================================
- Coverage   99.31%   99.3%   -0.02%     
=========================================
  Files          93      95       +2     
  Lines       10954   12022    +1068     
  Branches      782    1018     +236     
=========================================
+ Hits        10879   11938    +1059     
- Misses         56      63       +7     
- Partials       19      21       +2
Impacted Files Coverage Δ
trio/tests/subprocess/test_waitpid_linux.py 100% <100%> (ø)
trio/_subprocess/linux_waitpid.py 86.27% <86.27%> (ø)
trio/_core/_run.py 99.66% <0%> (-0.03%) ⬇️
trio/_core/tests/test_run.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fbb4543...cfbedd0.

@njsmith (Member) commented Aug 23, 2018

High-level comment on both this and #621: #621 (comment)

@njsmith (Member) commented Sep 2, 2018

High-level review:

Looking at this again, I think this is actually working harder than it has to! :-) I made it too complicated in my sample code. (Also, I think maybe I was still imagining we would expose a public API for waitpid? That seems overambitious for right now.)

On Unix, we know that we'll eventually be calling waitpid exactly once for each process: you can't call it multiple times because the PID gets reassigned after you call it once, and you have to call it once to avoid leaking zombies. So in our Popen class, on Linux, I figure the simplest thing to do will be to spawn the waitpid thread immediately when we start the subprocess, and then our Popen.wait method will have to check whether it's already gotten a result, or else wait for that thread to finish.

So for this private API the operations we need are:

  • Spin up the waitpid thread
  • Wait for the waitpid thread to finish
  • Once it's finished, find out what happened to it (exactly once, since then we'll save the result onto the Popen object)

So:

  • The _cached_result stuff to work around Outcome.unwrap consuming the outcome is unnecessary, because we'll only need to look at the outcome once. (This is also good because this reintroduces the bug that Outcome.unwrap is trying to protect you against by consuming the object :-/. Since Python 3 exceptions carry their traceback on them, you get corrupted tracebacks if you raise the same exception object in multiple contexts.)

  • We don't need the pending_waitpids global dict: in our "Spin up the waitpid thread" operation we can create a WaitpidResult, hand it off to the system task to fill in, and then return it. Then our "Wait for the waitpid thread to finish" operation operates on that result object, so... no need to look it up.
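The traceback-corruption hazard that Outcome.unwrap guards against can be seen with a few lines of plain Python (a standalone illustration, not trio code): re-raising the same exception object makes its traceback accumulate frames from every raise site.

```python
import traceback

exc = ValueError("boom")

try:
    raise exc
except ValueError:
    # After the first raise, the traceback has frames from one raise site
    frames_first = len(traceback.extract_tb(exc.__traceback__))

try:
    raise exc
except ValueError:
    # Re-raising the same object appends new frames to the old traceback
    frames_second = len(traceback.extract_tb(exc.__traceback__))

# The traceback grew: raising one object in multiple contexts garbles it
assert frames_second > frames_first
```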

Sorry for sending you astray...
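As a rough illustration of the operations described above, here is a plain-threads sketch. The names WaitpidResult, spawn_waitpid_thread, and wait_for_waitpid are made up for this example; trio's actual implementation goes through a system task, run_sync_in_worker_thread, and trio.Event rather than raw threading.

```python
import os
import threading

class WaitpidResult:
    """Filled in exactly once by the background thread."""
    def __init__(self, pid):
        self.pid = pid
        self.event = threading.Event()  # stand-in for trio.Event
        self.outcome = None             # (pid, status) tuple, or an exception

def spawn_waitpid_thread(pid):
    """Start a thread that blocks in os.waitpid(pid) and records the result.

    Intended to be called once, right when the child is spawned; returns the
    handle that a later wait operation consumes exactly once.
    """
    result = WaitpidResult(pid)

    def _task():
        try:
            result.outcome = os.waitpid(pid, 0)
        except BaseException as exc:    # e.g. ChildProcessError
            result.outcome = exc
        finally:
            result.event.set()          # unblock waiters no matter what

    threading.Thread(target=_task, daemon=True).start()
    return result

def wait_for_waitpid(result):
    """Block until the waitpid thread finishes, then return or raise (once)."""
    result.event.wait()
    if isinstance(result.outcome, BaseException):
        raise result.outcome
    return result.outcome
```

Note how there is no global dict: the result object is created by the spawn operation, handed to the background task to fill in, and returned directly to the caller.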

result = _pending_waitpids.pop(pid)
result.outcome = outcome.Error(e)
result.event.set()
raise
Inline review comment (Member):

The only way to hit this except block (modulo bugs) is if the run_sync_in_worker_thread gets cancelled. Cancellation is always a BaseException, not an Exception, so right now I think this is a no-op?

It's really not clear whether we even care about cancellation here... since this is a system task, the only way it can get cancelled is if the entire run is shutting down, either because the main task has already finished, or because there was an internal error and we're crashing. So... I guess the only time it matters whether we call result.event.set() here is when someone is doing a cancel-shielded wait for a subprocess to finish? And if they do, we won't handle it correctly anyway – the user would expect a cancel-shielded wait to actually wait for the subprocess to exit, but we'll still abort the run_sync_in_worker_task, so we won't know when the subprocess exits. Maybe we need to resurrect #303 to handle this corner case? Or toggle the system task's shielding on and off depending on whether anyone is currently blocked in Popen.wait? Or use threading.Thread directly instead of trying to go through run_sync_in_worker_task? (...can we even reliably re-enter the trio thread once the system nursery is cancelled? The re-entry queue processor is also a system task...)

Man, I hate waitpid.

This is such an obscure use case that I don't think we need to worry about it right now. Trying to do a cancel-shielded wait for a subprocess isn't too ridiculous – nursery.__aexit__ is one of the standard use cases for cancel-shielding, and people will want to do nursery-like things for managing subprocesses. So we might end up caring eventually. But right now we should just shrug and accept that it won't do exactly the right thing here. And if we don't care about that, we can make this whole function a lot simpler, like...

try:
    result.outcome = await run_sync_in_worker_thread(...)
finally:
    # Paranoia to make sure we unblock any waiters even if something goes horribly wrong
    result.event.set()

Inline review comment (Member):

Trying to do a cancel-shielded wait for a subprocess isn't too ridiculous – nursery.__aexit__ is one of the standard use cases for cancel-shielding, and people will want to do nursery-like things for managing subprocesses.

Actually, on further thought, this is still overstating the importance of this edge case. If someone does a cancel-shielded wait for a subprocess, that will mostly work fine, even with my "simplified" code above. The only way this task gets cancelled is if we're crashing, or if the main task has already exited. That means that the only ways a cancel-shielded Popen.wait can fail to actually wait are:

  • We're in the middle of crashing with a TrioInternalError: well, sorry, that means internal invariants have been violated and we're blowing up the world, so it's OK if your task manager doesn't wait for child processes correctly. At this point all guarantees are void.

  • If we're not crashing with TrioInternalError, and the system task is cancelled, and someone is doing a cancel-shielded Popen.wait, then the main task has already exited, so they must be doing it from inside a system task. But doing anything inside a cancel-shield in a system task is highly dubious, because by the time a system task gets cancelled the world is being torn down around you. Don't do that please.

async def test_waitpid():
    pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
    result = await waitpid(pid)
    # exit code is a 16-bit int: (code, signal)
Inline review comment (Member):

In theory it's a bit more complicated than that: there's a "was a core dumped?" flag in there, plus the ability to distinguish between stop signals and termination signals.

In practice this is a pretty pedantic distinction. If we want to be really POSIX-ly correct, though, I guess the tests should make assertions like

assert os.WIFEXITED(code) and os.WEXITSTATUS(code) == 1
assert os.WIFEXITED(code) and os.WEXITSTATUS(code) == 0
assert os.WIFSIGNALED(code) and os.WTERMSIG(code) == signal.SIGKILL

?
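The distinction can be exercised directly against the blocking os.waitpid (a standalone, Unix-only sketch, separate from trio's async wrapper): a normal exit and a signal death produce status words that decode through different macros.

```python
import os
import signal

# Normal exit: /bin/false exits with code 1
pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 1

# Death by signal: WIFEXITED is false here; WTERMSIG carries the signal number
pid = os.spawnvp(os.P_NOWAIT, "/bin/sleep", ("sleep", "60"))
os.kill(pid, signal.SIGKILL)
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
```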

async def test_waitpid_no_process():
    with pytest.raises(ChildProcessError):
        # this PID probably doesn't exist
        await waitpid(100000)
Inline review comment (Member):

oo I thought of a trick to make this deterministic.

You can only wait for your child processes. So we need the pid of a process that we know is not a child process. How about... waitpid(os.getpid())? I think that deterministically raises the error.

Or I guess waitpid(1) would also work, since init has to exist and is never a child of any other process.
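The trick is easy to check against the blocking os.waitpid (a standalone sketch, independent of trio's wrapper): our own PID is never one of our children, so waiting on it deterministically fails with ECHILD.

```python
import os

# os.getpid() is never a child of this process, so waitpid on it always
# fails with ECHILD, which Python raises as ChildProcessError
try:
    os.waitpid(os.getpid(), 0)
    raised = False
except ChildProcessError:
    raised = True

assert raised
```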

@Fuyukai (Member, Author) commented Sep 4, 2018

I have no clue what the hell happened to the tests here.

@njsmith (Member) commented Sep 4, 2018

Huh, bizarre. Could be a change in a third-party dependency, like pytest-cov or coverage...?

@njsmith (Member) commented Sep 4, 2018

There's something weird with your rebase too... did you accidentally rebase master onto this, or something? Somehow the commits in this branch that came from master have different hashes than they do on master...

@smurfix (Contributor) commented Sep 4, 2018

Owch. You've been rebasing in the wrong direction, i.e. instead of rebasing your work on top of the release you've been rebasing the release tree on top of your work. :-(

@smurfix (Contributor) commented Sep 4, 2018

Frankly I don't like rebasing anyway, just reset to 133e46b, merge up, and be done with it.

@smurfix (Contributor) commented Sep 4, 2018

(and then cherry-pick cfbedd0)

@Fuyukai (Member, Author) commented Sep 4, 2018

Oh, I think I accidentally rebased this onto master, and then rebased master from my fork onto this. Whoops.

@codecov-io commented Sep 4, 2018

Codecov Report

Merging #622 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #622      +/-   ##
==========================================
+ Coverage   99.31%   99.31%   +<.01%     
==========================================
  Files          93       95       +2     
  Lines       10975    11028      +53     
  Branches      785      786       +1     
==========================================
+ Hits        10900    10953      +53     
  Misses         56       56              
  Partials       19       19
Impacted Files Coverage Δ
trio/tests/subprocess/test_waitpid_linux.py 100% <100%> (ø)
trio/_subprocess/linux_waitpid.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 081a45b...e8b8861.

@njsmith (Member) commented Sep 4, 2018

Still getting those coverage combine failures though :-(. What on earth...

Lacking better ideas, at this point I'd probably try debugging by trying to isolate what exactly is triggering that, e.g. by temporarily pushing a commit that turns off all the tests added in this PR, and by opening a trivial no-op PR to confirm whether the issue even is specific to this PR.

@pquentin (Member) commented Sep 5, 2018

OK, the coverage combine failures were sorted out in #647, thanks @njsmith!

Does that mean we need another rebase?

@njsmith (Member) left a review comment:

This is looking good, one small change requested below. (And the nice thing is, that should trigger a new run of the tests and untangle the CI mess.)

_core.spawn_system_task(_task, waiter)

await waiter.event.wait()
return waiter.outcome.unwrap()
Inline review comment (Member):

I guess we'll eventually need to split this up into two functions, but that's fine, no particular reason to do that now.

pid = os.spawnvp(os.P_NOWAIT, "/bin/false", ("false",))
result = await waitpid(pid)
# exit code is a 16-bit int: (code, signal)
assert os.WIFEXITED(result[1]) and os.WEXITSTATUS(result[1]) == 1
Inline review comment (Member):

We should probably assert result[0] == pid as well

@njsmith (Member) commented Sep 5, 2018

Shouldn't need a rebase – the CI systems don't test the head of the PR branch, they make a temporary merge between the PR branch and master and then test the merge. So the next time they run on this PR, they should pick up the fix from #647.

@njsmith (Member) commented Sep 5, 2018

Tests are failing (typo in the new asserts)

@njsmith (Member) commented Sep 5, 2018

Jenkins seems to be confused... Maybe this will tickle it?

@njsmith closed this Sep 5, 2018
@njsmith reopened this Sep 5, 2018
@njsmith (Member) commented Sep 5, 2018

That was a new one: apparently Jenkins just didn't notice the latest push or something; it didn't even create a job record, never mind actually run anything. But close/reopen seems to have fixed it. (Which is funny because usually Jenkins ignores close/reopen.)

@njsmith (Member) commented Sep 5, 2018

Ok, code looks good and CI is green; I'm going to merge this PR quick before it gets hit by another bizarre mishap.

@njsmith merged commit ee45fab into python-trio:master Sep 5, 2018
5 participants