Should run_sync_in_worker_thread count as a blocked task for wait_all_tasks_blocked / autojump_clock? #456

njsmith opened this issue Feb 23, 2018 · 5 comments

njsmith commented Feb 23, 2018

See some discussion here: https://gitter.im/python-trio/general?at=5a8f0962c3c5f8b90df08c95

CC: @sorcio


sorcio commented Feb 23, 2018

Here's the minimal example from the discussion. It uses pytest-trio to run the test function with a MockClock configured to autojump:

```python
import pytest
import trio


@pytest.mark.trio
async def test_worker_thread_autojump(nursery, autojump_clock):
    def slowish_task():
        # CPU-bound busy-work; the Trio scheduler can't see this thread
        for i in range(1000000):
            pass
        return 'foo'

    async def async_run_task(queue):
        result = await trio.run_sync_in_worker_thread(slowish_task)
        await queue.put(result)

    queue = trio.Queue(1)
    nursery.start_soon(async_run_task, queue)
    with trio.fail_after(300):
        assert await queue.get() == 'foo'
```

This test can fail with a trio.TooSlowError if autojump_threshold is set to 0 (the default for autojump_clock) or to some other value that is too small. The failure is not deterministic: in some runs, the worker thread finishes and reports back to the Trio thread before wait_all_tasks_blocked resolves, so the clock never jumps. Setting a high enough autojump_threshold fixed the test (although it still depends on timing, so I guess there is a very small chance it fails on occasion).
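For reference, the workaround looks roughly like this, assuming pytest-trio's convention of using a fixture whose value is a trio.abc.Clock instance as the run clock (the 0.1s threshold here is an arbitrary guess, not a guaranteed-safe value):

```python
import pytest
import trio.testing


@pytest.fixture
def autojump_clock():
    # A nonzero threshold gives the worker thread some real time to finish
    # before the clock jumps; 0.1s is a guess, not a guarantee.
    return trio.testing.MockClock(autojump_threshold=0.1)
```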

Tricky part: the code I was exercising through this test didn't make it immediately obvious that it was spawning a worker thread. My expectation was that code that doesn't depend on I/O would interact well with MockClock, but synchronizing with a thread is in fact a form of I/O that crosses the Trio boundary. Unfortunately this is not always obvious unless you inspect the code.

@njsmith mentioned that Trio itself implicitly waits for a worker thread in trio.socket.getaddrinfo(); in that case, though, the user probably already expects the network to be involved (getaddrinfo is blocking, and an async interface to it can be expected to block). My case, on the other hand, involved relatively trivial operations, with no syscalls, running on the order of a few microseconds.


Should wait_all_tasks_blocked() (or MockClock) wait for all worker threads to be done?

Pros:

  • tests like the one in the example would not fail even in their naive formulation
  • refactoring to delegate work to a worker thread would not cause a previously passing test to fail
  • no need to guess a threshold time, no non-deterministic failures
  • the magic combo of pytest-trio with autojump_clock would stay magic ✨

Cons:

  • tests that expect tasks to block while waiting for a thread would require separate machinery (probably a different MockClock configuration or different arguments to wait_all_tasks_blocked()) from the tests described above
  • implicit waiting means that tests can unexpectedly become slower
  • documenting this fact could raise the barrier to entry for these tools, cancelling out the benefit of the less surprising behavior
  • is run_sync_in_worker_thread() special enough to deserve special casing? (*)

The last question (*) makes me wonder whether this issue points to there being different possible semantics for "running a function synchronously in a worker thread" in general. Maybe it's up to the implementation to specify that the thread's running state is bound to the task that is waiting for it (i.e. the fact that we spawn a thread to do the job should be opaque to users).


Simpler alternatives: add a note to the documentation for autojump; add a trio.testing function to wait for all worker threads to be done.
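As a rough sketch of that second alternative, assuming all thread work is funneled through one wrapper (ThreadTracker and wait_all_threads_done are invented names, and this uses the 2018-era trio.run_sync_in_worker_thread spelling):

```python
import trio


class ThreadTracker:
    """Illustrative only: count in-flight worker threads so a test can
    explicitly wait for them instead of guessing a threshold."""

    def __init__(self):
        self._count = 0
        self._idle = trio.Event()
        self._idle.set()  # no threads running yet

    async def run_sync(self, fn, *args):
        if self._count == 0:
            self._idle = trio.Event()  # reset on the 0 -> 1 transition
        self._count += 1
        try:
            return await trio.run_sync_in_worker_thread(fn, *args)
        finally:
            self._count -= 1
            if self._count == 0:
                self._idle.set()  # last thread finished

    async def wait_all_threads_done(self):
        # Returns immediately if no threads are in flight
        await self._idle.wait()
```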


njsmith commented Feb 23, 2018

If we do do something like this, then I think it'll be by exposing some function in hazmat that lets you declare that a task should be considered "not blocked" even if it looks like it is (and of course a way to set it back afterwards). Then run_sync_in_worker_thread would use that API, instead of being special-cased.
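To make the shape concrete, something like the following, where declare_task_not_blocked is entirely invented and not a real trio.hazmat API:

```python
import trio


async def run_sync_in_worker_thread(fn, *args):
    # Hypothetical: tell the run loop that this task should not count as
    # blocked for wait_all_tasks_blocked / autojump-clock purposes.
    token = trio.hazmat.declare_task_not_blocked()  # invented API
    try:
        ...  # hand fn off to the thread and wait for its result
    finally:
        token.undo()  # invented API: restore normal bookkeeping
```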

Ideally, worker threads should only be used for blocking (I/O-bound) code. If you try to use them for CPU-bound code, you run into bugs in CPython's GIL, where it ends up starving one of the threads. So maybe the getaddrinfo case is the typical one, and it's fair to treat calls to run_sync_in_worker_thread as blocking, on the assumption that the code in the thread will be blocking.

On the other hand, when using autojump_clock, there's something weird about the idea that part of your program just jumped forward in time, but the getaddrinfo didn't get the memo and is still running. It is a general limitation of autojump_clock that it only works on stuff actually running inside trio; it can't control threads, or that memcached server you spun up to run your tests against, etc. And when trio is running something in a worker thread, it means trio knows that it can't pull its magic trick and make it seamless, so maybe it shouldn't try.

Hmm hmm, this is tricky!

We should definitely add a note to the docs in any case :-)


njsmith commented Jul 26, 2018

Additional wrinkle: the idea in python-trio/pytest-trio#53 (comment) depends on the current behavior, where worker threads don't count as running for autojump_clock. (A test runner wants a way to track real time independently of any fake clock, so it can kill deadlocked tests; the idea there is to use a thread for that, without interfering with the code under test or adding extensions to Trio itself.)
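The thread trick referenced there is roughly this shape (a sketch only, using the trio API names of the time; run_with_real_time_deadline is an invented helper, and a real implementation would also handle the test finishing before the deadline fires):

```python
import threading
import time

import trio


async def run_with_real_time_deadline(async_fn, seconds):
    token = trio.hazmat.current_trio_token()
    with trio.open_cancel_scope() as cancel_scope:
        def watchdog():
            # Wall-clock sleep: a MockClock in the test can't touch this.
            time.sleep(seconds)
            # Re-enter Trio safely from the foreign thread; the cancel
            # callback then runs in Trio context.
            token.run_sync_soon(cancel_scope.cancel)

        threading.Thread(target=watchdog, daemon=True).start()
        return await async_fn()
```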


njsmith commented Jun 7, 2019

#1085 suggests another possibility: Trio could keep track of whether it expects any run_sync_soon calls to arrive. If we used this to block resolution of wait_all_tasks_blocked, it would automatically make run_sync_in_worker_thread count as a running task. It would also mean that any task listening for signals (via open_signal_receiver) counted as always running, which might not be what we want. (At the least, we'd probably want to rename wait_all_tasks_blocked!)

Maybe the open_run_sync_soon operation (or whatever we call it) should just take an argument to let you say whether to track this handle? That seems potentially useful both for wait_all_tasks_blocked and for the deadlock detector ... except that I'm not sure whether you always want the flag set the same way for both of those. Like, maybe a task blocked waiting for a signal should count as blocked for wait_all_tasks_blocked, but count as potentially-runnable for a deadlock detector. (Or maybe not? If the only way your program can ever do anything again is by someone sending it a SIGINT, then maybe you do want to count that as deadlocked? It should certainly count as blocked for autojump_clock though...)
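Spelled out, the speculative shape being discussed might look like this (nothing here exists; the function and both flag names are invented):

```python
import trio


async def listen_for_wakeups():
    # Purely speculative: separate "runnable" bookkeeping per consumer.
    async with trio.hazmat.open_run_sync_soon(      # invented API
        counts_for_wait_all_tasks_blocked=False,    # invented flag
        counts_for_deadlock_detector=True,          # invented flag
    ) as handle:
        ...  # receive wakeups via the handle
```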

Relevant golang bug: golang/go#21576


njsmith commented Jun 7, 2019

trio.hazmat.WaitForSingleObject and trio.Process.wait are interesting cases too: they use threads internally, but they definitely should count as "blocked on I/O".

If we did use a thread to implement timeouts in pytest-trio, that would also be an interesting case, because it should not count as an I/O source for the deadlock detector. Note that there are potentially other ways to implement that, though.

OK let's work through some intuitions:

  • wait_all_tasks_blocked and the autojump clock both assume that there could be external I/O sources that trio can't see, and their strategy for coping with that is to wait a configurable amount of real-time, and if no I/O arrives during that time then they conclude that it's not going to happen. This makes sense for their use in tests. If you're not in a situation where this makes sense, then you just don't use them.

    • I'm not sure whether anyone actually uses these in program fragments that can do I/O though... in my experience the real-time timeout is always set to zero.

    • For this purpose, threads are a special case, because they're external to the Trio event loop, but otherwise feel like part of the program, in a way that other external I/O sources do not. Except that we mostly use threads to perform blocking operations (WaitForSingleObject and Process.wait being notable examples!)

    • It's kind of OK if these don't work with run_sync_in_worker_thread (or open_signal_receiver, for that matter), because they're specialized tools that you can just not use in those cases

  • a global deadlock detector handles external I/O very differently: it wants to run all the time, so it can't rely on heuristics; if it makes a mistake, it has to err on the side of "well, maybe this could wake up again". So it definitely can't conclude that things are deadlocked based on a timeout, the way wait_all_tasks_blocked does. Instead it wants to peek into the I/O code directly. For this, threads should definitely count as still-running.

    • But maybe in special cases it makes sense to opt out of this, like a pytest-trio timeout thread.
