Deadlock detection #182
This library also has an extremely simple feature: print a stack trace whenever taking a lock takes too long, period. Which is surprisingly reasonable...
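The "dump a stack trace when a lock acquisition blocks for too long" idea above can be sketched with nothing but the stdlib. This is an illustrative sketch, not the library referenced in the comment; `watchdog_acquire` and the timeout value are made-up names.

```python
# Sketch: acquire a lock, and if it blocks for more than `timeout` seconds,
# dump the stacks of all threads so you can see who is holding what.
import sys
import threading
import traceback
from contextlib import contextmanager

@contextmanager
def watchdog_acquire(lock, timeout=1.0):
    def dump_stacks():
        print(f"lock acquisition blocked for more than {timeout}s; thread stacks:")
        for tid, frame in sys._current_frames().items():
            print(f"--- thread {tid} ---")
            traceback.print_stack(frame)

    timer = threading.Timer(timeout, dump_stacks)
    timer.start()
    try:
        lock.acquire()
        timer.cancel()  # acquired quickly enough: no report
        try:
            yield
        finally:
            lock.release()
    finally:
        timer.cancel()

# Uncontended case: acquires immediately, so no stack dump is printed.
lk = threading.Lock()
with watchdog_acquire(lk, timeout=0.5):
    pass
```

The appeal of this approach is exactly its simplicity: it needs no wait-for graph, just a timer per acquisition.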
In working on #181 (and #156, sorta), I realized that threads make this a bit more complicated. In particular, here's the problem I ran into: … Conceptually, I think the way we would want to model this for deadlock-detection purposes is:

So... I was initially thinking in terms of complicated APIs that let you transfer "ownership" of a lock/token to another task, by materializing it in an object and then passing that object over, etc., but this suggests something much simpler. Maybe all we need is something like:

```python
thread_id = new_thread_id()
# While blocked here, this task is *waiting-for* the limit
# But afterwards, the thread_id *holds* the limit
await limit.acquire_on_behalf_of(thread_id)
spawn_thread_with_id(...)
```

(If we start re-using threads, #6, then the "thread id" here would become more like a "job id", i.e. it refers to a particular chunk of work done by a thread corresponding to a single call to …. And eventually we'd want to stash the …)
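To make the ownership-transfer idea concrete, here is a toy, synchronous sketch: capacity is held by an arbitrary "borrower" token (a thread or job id) rather than by the acquiring task, so a task can acquire on a thread's behalf and the thread can release later. `ToyCapacityLimiter` is a made-up illustration of the proposal's shape, not trio's implementation.

```python
# Toy model: tokens are owned by borrower ids, not by the acquiring task.
class ToyCapacityLimiter:
    def __init__(self, total_tokens):
        self.total_tokens = total_tokens
        self._borrowers = set()  # who currently *holds* a token

    def acquire_on_behalf_of(self, borrower):
        if borrower in self._borrowers:
            raise RuntimeError("this borrower already holds a token")
        if len(self._borrowers) >= self.total_tokens:
            # A real limiter would block here; the blocked task is
            # *waiting-for* the limit, per the model above.
            raise RuntimeError("no tokens free")
        self._borrowers.add(borrower)

    def release_on_behalf_of(self, borrower):
        self._borrowers.discard(borrower)

limit = ToyCapacityLimiter(2)
limit.acquire_on_behalf_of("thread-1")  # task acquires for the thread
limit.acquire_on_behalf_of("thread-2")
limit.release_on_behalf_of("thread-1")  # the thread releases when done
```

The point for deadlock detection is that the bookkeeping cleanly separates "who is blocked waiting" (a task) from "who holds the token" (a borrower id).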
- New synchronization primitive: CapacityLimiter. Like a Semaphore but more specialized. See python-triogh-182 for rationale.
- Add limiter= argument to run_in_worker_thread, that allows one to correctly (modulo python-trio#6 (comment)) control the number of active threads.
- Added new function current_default_worker_thread_limiter(), which creates or returns a run-local CapacityLimiter, and made run_in_worker_thread use it by default when no other limiter= is given.

Closes: python-triogh-10, python-triogh-57, python-triogh-156
While #1085 will take care of global deadlock detection, real-life code probably needs some more intelligence than that, as otherwise no server will ever have a deadlock (its …). I don't think that any timeout-based solution is sufficient, but some "this lock stays locked for more than N seconds" instrumentation (restarting the timer when it's immediately re-acquired by a task that's waiting on it) on selected … could help.
Our `Lock` objects enforce the constraint that they can only be released by the same task that acquired them. We might (probably should) add a version of `Semaphore` that enforces the same constraint (see discussion in #156). `CapacityLimit` objects are like `Semaphore`s that enforce the same constraint. We should take advantage of these constraints to provide better error detection. In particular:

Detecting never-released locks
This is pretty trivial, yet useful: if a Task exits while holding a `Lock`, then this is a bug (that `Lock` can never be released), and we should alert the user. (I guess technically it's not a bug if this is the only Task that knows about the Lock, but I don't think requiring people to issue a spurious release in this corner case is too onerous if it helps catch errors in general.) This would be pretty straightforward – just keep a record of the lock-or-lock-like objects held by each Task, and then when a task exits, if this set is non-empty, whine loudly.

Not entirely sure if this means "print a warning" or "raise an exception". There's also the option of forcibly releasing the lock, though I'm dubious about the wisdom of carrying on once your distributed system has lost internal consistency... maybe poison the lock so anyone who waits on it gets an exception, and if people are following good Erlang style then they'll all get rebooted and all may be well?
Full-fledged deadlock detection
We can also detect classic deadlocks by tracing the wait-for graph.
This scope would be limited, e.g. we can't detect deadlocks caused by two Tasks that are using two Queues for communication and are both blocked, because we don't have any way to know that the task that's blocked is the only one that can send on this Queue. Maybe we should! (E.g., by having the concept of binding a Queue to a Task so only that Task can send on it.) But even if all we can detect are mutex-based deadlocks, that's still potentially very useful!
There are a lot of possible variants on this. We don't want to add too much overhead; I don't know if tracing the wait-for graph on every contended acquire is fast enough to be acceptable. (It worked for these people, but that's a pretty different context.) It's probably a reasonable thing to try first, though. (One can also do stuff like trigger a trace only after a task has been blocked for N seconds. Though in practice keeping track of this on every contended acquire might be just as expensive as tracing the graph immediately, if the graph trace usually finishes almost immediately.)
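The "trace the wait-for graph on a contended acquire" variant can be sketched as follows: follow "task waits for lock, lock is held by task" edges and report a deadlock if the chain leads back to the task about to block. `holder_of` and `waiting_for` are illustrative bookkeeping names, not trio APIs.

```python
# Sketch: cycle detection in the wait-for graph at acquire time.
holder_of = {}    # lock -> task currently holding it
waiting_for = {}  # task -> lock it is currently blocked on

def would_deadlock(task, lock):
    """If `task` blocking on `lock` would close a cycle in the wait-for
    graph, return the cycle as a list of tasks; otherwise return None."""
    cycle = [task]
    while lock is not None:
        holder = holder_of.get(lock)
        if holder is None:
            return None      # lock is free: no edge to follow
        if holder == task:
            return cycle     # found our way back around: deadlock
        cycle.append(holder)
        lock = waiting_for.get(holder)
    return None

# Classic two-lock deadlock: T1 holds A and wants B, T2 holds B and wants A.
holder_of.update({"A": "T1", "B": "T2"})
waiting_for["T2"] = "A"
print(would_deadlock("T1", "B"))  # ['T1', 'T2']
```

Each trace walks a chain of blocked tasks, so its cost is bounded by the length of that chain; in the common case (the holder isn't blocked) it terminates after one hop, which is why tracing on every contended acquire may well be cheap enough in practice.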
Another thing that can be useful to detect is when we see one Task acquire A-then-B while another Task acquires B-then-A. This by itself doesn't prove that there's a deadlock, but it's certainly suspicious, and can help catch latent bugs before they hit.
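The lock-ordering check can be sketched by remembering every "acquired X while already holding Y" pair and flagging the reverse pair when it later appears. As the text says, this flags *potential* deadlocks, not certain ones; `acquired_after` and `note_acquire` are made-up names.

```python
# Sketch: detect A-then-B vs B-then-A lock-order inversions.
acquired_after = set()  # (first, then) orderings observed across all tasks

def note_acquire(held, new):
    """Record that `new` was acquired while the locks in `held` were held;
    return any inverted (held, new) pairs this acquisition reveals."""
    suspicious = set()
    for h in held:
        if (new, h) in acquired_after:
            suspicious.add((h, new))  # someone else took them the other way
        acquired_after.add((h, new))
    return suspicious

# Task 1 takes A then B; later Task 2 takes B then A.
note_acquire(set(), "A")
note_acquire({"A"}, "B")         # records the A-then-B ordering
note_acquire(set(), "B")
print(note_acquire({"B"}, "A"))  # {('B', 'A')}: inversion of A-then-B
```

Unlike the wait-for trace, this catches the bug even on runs where the timing never actually produces a deadlock, which is what makes it useful for latent bugs.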
Again, the basic data needed for all of these is the ability to look at a Task and get a list of all the lock and lock-like objects it's waiting on.