Add unstable on_thread_park_id() to runtime Builder (for stuck task watchdog) #6370
Conversation
It's possible to find out from WorkerMetrics which workers are not actively polling for tasks. However, it's not possible to tell whether non-polling workers are merely idle (parked) or stuck. By adding the worker id (the same usize as used in WorkerMetrics calls) to the on_thread_park() and on_thread_unpark() callbacks, it is possible to track which specific workers are parked. Any worker that is neither polling tasks nor parked is stuck. With this information it's possible to create a watchdog that alerts or kills the process if a worker is stuck for too long.
Maybe a better option would be to add a new metric for this. It could count the number of parks/unparks. When the number is odd, the worker is parked. When it is even, the worker is active.
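A minimal sketch of how such a metric could be consumed, assuming a hypothetical combined counter `worker_park_unpark_count()` on `RuntimeMetrics` (the name is illustrative of the suggestion above, not an existing API; `num_workers()` is the existing accessor):

```rust
// Sketch only: `worker_park_unpark_count()` stands in for the combined
// park/unpark counter suggested above; it is not assumed to exist yet.
fn parked_workers(metrics: &tokio::runtime::RuntimeMetrics) -> Vec<bool> {
    (0..metrics.num_workers())
        .map(|worker| {
            // Odd count: the last transition was a park, so the worker is
            // currently parked. Even count: the worker is active.
            metrics.worker_park_unpark_count(worker) % 2 == 1
        })
        .collect()
}
```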
Note that you can push an empty commit when you want to trigger a new build. You will probably need to merge in master to get the build to work. There were some changes to the CI setup.
Ah thanks, I was puzzled what was going wrong. I hope to get back to making more progress on this in the next couple of days.
It did occur to me to add a metric for "is currently parked" or something similar, but it seemed more useful to have the generic callback mechanism (with the proper worker id). With that it's possible for anyone to collect their own set of metrics, instead of relying on the built-in WorkerMetrics.
I think a new worker metric makes sense.
Closing as this seems like it should be a metric instead.
Hi, I have a similar problem and would like to integrate stuck worker detection in my watchdog system. Have you made any progress with this approach or found an alternate solution?
Someone put together this tool: https://github.com/facebookexperimental/rust-shed/tree/main/shed/tokio-detectors
This seems to follow a completely different approach. It submits tasks and measures the delay until the task is executed. If I understand correctly, it will not detect a single stuck worker, but only trigger once all worker threads are blocked or saturated. Also, it is not as lightweight, since it requires periodically submitting tasks to the scheduler to perform its probing.

Adding a metric that allows determining the parking state of a worker thread seems to be a much better solution to me. However, for diagnostic purposes it would also be necessary to know which worker thread got stuck, i.e. we would need a way to get the native thread id of each worker thread. This would probably be impossible to provide via worker metrics, since it would require exposing the native thread id through the metrics API.

Therefore, I suggest reconsidering the approach presented in this PR. By providing the Tokio worker id via the callback mechanism here, the lightweight watchdog can build a mapping between the native thread id (obtained from the OS inside the callback) and the Tokio worker id. An alternative would be to combine both approaches.
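For illustration, a rough sketch of building that mapping, assuming the per-worker park callback proposed in this PR (here called `on_thread_park_id`, not a shipped API) and Linux's `gettid()` from the `libc` crate as the source of the native thread id:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    // Native (OS) thread id -> Tokio worker id, filled in the first time
    // each worker parks. `libc::gettid()` is Linux-only; the callback name
    // `on_thread_park_id` is the one proposed in this PR, not a shipped API.
    let thread_to_worker: Arc<Mutex<HashMap<libc::pid_t, usize>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let map = thread_to_worker.clone();
    let on_park = move |worker: usize| {
        let tid = unsafe { libc::gettid() };
        map.lock().unwrap().entry(tid).or_insert(worker);
    };

    // `on_park` would be installed via Builder::on_thread_park_id(on_park);
    // a watchdog can then report which OS thread a stuck worker id maps to.
    let _ = on_park;
}
```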
It's entirely normal for us to provide per-worker metrics, so I don't think there is any blocker to going the metric route. |
I can submit a PR, but how should I provide the thread id in the worker metrics? EDIT: Nevermind.
Thanks! I just took a look. |
Motivation
Fixes #6353. I'm looking for a way to detect when a task is not yielding back to tokio properly. Currently such a task can cause other random tasks to not get run. We had an incident where a buggy task went into a busy loop and put a service into a zombie state: it still was responding to most requests, but some background tasks were not running as expected.
If there is an existing way to detect a stuck task, I'd be delighted to be enlightened.
Solution
Add a new method `on_thread_park_id()` that includes the worker id being parked. This allows a watchdog process to determine which workers are not parked and also not polling (available already from `RuntimeMetrics::worker_poll_count()`). The watchdog code would look something like this:
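A minimal sketch, under these assumptions: the proposed `on_thread_park_id()`/`on_thread_unpark_id()` callbacks take the worker id as a `usize` (names and signatures are not final), and the runtime is built with the `tokio_unstable` cfg flag so `RuntimeMetrics::worker_poll_count()` is available.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::Duration;

fn main() {
    // Worker ids that are currently parked, updated from the proposed
    // park/unpark callbacks (names and `usize` argument are assumptions).
    let parked: Arc<Mutex<HashSet<usize>>> = Arc::new(Mutex::new(HashSet::new()));

    let on_park = parked.clone();
    let on_unpark = parked.clone();

    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .on_thread_park_id(move |worker: usize| {
            on_park.lock().unwrap().insert(worker);
        })
        .on_thread_unpark_id(move |worker: usize| {
            on_unpark.lock().unwrap().remove(&worker);
        })
        .build()
        .unwrap();

    // `worker_poll_count()` is an unstable metric (requires `tokio_unstable`).
    let metrics = runtime.metrics();
    let workers = metrics.num_workers();
    let watchdog_parked = parked.clone();

    std::thread::spawn(move || {
        let mut last_polls = vec![0u64; workers];
        loop {
            std::thread::sleep(Duration::from_secs(1));
            for worker in 0..workers {
                let polls = metrics.worker_poll_count(worker);
                let is_parked = watchdog_parked.lock().unwrap().contains(&worker);
                // A worker that made no polling progress and is not parked
                // has most likely been stuck in a non-yielding task.
                if polls == last_polls[worker] && !is_parked {
                    eprintln!("worker {worker} appears to be stuck");
                }
                last_polls[worker] = polls;
            }
        }
    });

    runtime.block_on(async {
        // application workload goes here
    });
}
```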
Rather than printing, a real watchdog implementation could call `std::process::abort()` if the task stays stuck past some duration (e.g. 10 seconds).