Add unstable on_thread_park_id() to runtime Builder (for stuck task watchdog) #6370
Conversation
It's possible to find out from WorkerMetrics which workers are not actively polling for tasks. However, it's not possible to tell whether non-polling workers are merely idle (parked) or stuck. By adding the worker id (the same usize as used in WorkerMetrics calls) to the on_thread_park() and on_thread_unpark() callbacks, it is possible to track which specific workers are parked. Any worker that is neither polling tasks nor parked is stuck. With this information it's possible to create a watchdog that alerts or kills the process if a worker is stuck for too long.
Maybe a better option would be to add a new metric for this. It could count the number of parks/unparks. When the number is odd, the worker is parked. When it is even, the worker is active.
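A minimal sketch of how such a metric could be consumed, assuming a hypothetical combined counter `worker_park_unpark_count()` on `RuntimeMetrics` (the name is illustrative of the suggestion above, not an existing API; `num_workers()` is the existing accessor):

```rust
// Sketch only: `worker_park_unpark_count()` stands in for the combined
// park/unpark counter suggested above; it is not assumed to exist yet.
fn parked_workers(metrics: &tokio::runtime::RuntimeMetrics) -> Vec<bool> {
    (0..metrics.num_workers())
        .map(|worker| {
            // Odd count: the last transition was a park, so the worker is
            // currently parked. Even count: the worker is active.
            metrics.worker_park_unpark_count(worker) % 2 == 1
        })
        .collect()
}
```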
Note that you can push an empty commit when you want to trigger a new build. You will probably need to merge in master to get the build to work. There were some changes to the CI setup.
Ah thanks, I was puzzled what was going wrong. I hope to get back to making more progress on this in the next couple of days.
It did occur to me to add a metric for "is currently parked" or something similar, but it seemed more useful to have the generic callback mechanism (with the proper worker id). With that it's possible for anyone to collect their own set of metrics, instead of relying on the built-in WorkerMetrics.
I think a new worker metric makes sense.
Closing as this seems like it should be a metric instead.
Hi, I have a similar problem and would like to integrate stuck worker detection in my watchdog system. Have you made any progress with this approach or found an alternate solution?
Someone put together this tool: https://github.com/facebookexperimental/rust-shed/tree/main/shed/tokio-detectors
This seems to follow a completely different approach. It submits tasks and measures the delay until the task is executed. If I understand correctly, it will not detect a single stuck worker, but only trigger once all worker threads are blocked or saturated. Also, it is not as lightweight, since it requires periodically submitting tasks to the scheduler to perform its probing.

Adding a metric that allows determining the parking state of a worker thread seems to be a much better solution to me. However, for diagnostic purposes it would also be necessary to know which worker thread got stuck, i.e. we would need a way to get the native thread id of each worker thread. This would probably be impossible to provide via worker metrics, since it would require exposing the native thread id through the metrics API.

Therefore, I suggest reconsidering the approach presented in this PR. By providing the Tokio worker id via the callback mechanism here, the lightweight watchdog can build a mapping between the native thread id (obtained from the OS inside the callback) and the Tokio worker id. An alternative would be to combine both approaches.
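For illustration, a rough sketch of building that mapping, assuming the per-worker park callback proposed in this PR (here called `on_thread_park_id`, not a shipped API) and Linux's `gettid()` from the `libc` crate as the source of the native thread id:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    // Native (OS) thread id -> Tokio worker id, filled in the first time
    // each worker parks. `libc::gettid()` is Linux-only; the callback name
    // `on_thread_park_id` is the one proposed in this PR, not a shipped API.
    let thread_to_worker: Arc<Mutex<HashMap<libc::pid_t, usize>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let map = thread_to_worker.clone();
    let on_park = move |worker: usize| {
        let tid = unsafe { libc::gettid() };
        map.lock().unwrap().entry(tid).or_insert(worker);
    };

    // `on_park` would be installed via Builder::on_thread_park_id(on_park);
    // a watchdog can then report which OS thread a stuck worker id maps to.
    let _ = on_park;
}
```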
It's entirely normal for us to provide per-worker metrics, so I don't think there is any blocker to going the metric route. |
I can submit a PR, but how should I provide the thread id in the worker metrics? EDIT: Nevermind.
Thanks! I just took a look. |
Motivation
Fixes #6353. I'm looking for a way to detect when a task is not yielding back to tokio properly. Currently such a task can cause other random tasks to not get run. We had an incident where a buggy task went into a busy loop and put a service into a zombie state: it still was responding to most requests, but some background tasks were not running as expected.
If there is an existing way to detect a stuck task, I'd be delighted to be enlightened.
Solution
Add a new method `on_thread_park_id()` that includes the worker id being parked. This allows a watchdog process to determine which workers are not parked and also not polling (available already from `RuntimeMetrics::worker_poll_count()`). The watchdog code would look something like this:
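A minimal sketch, under these assumptions: the proposed `on_thread_park_id()`/`on_thread_unpark_id()` callbacks take the worker id as a `usize` (names and signatures are not final), and the runtime is built with the `tokio_unstable` cfg flag so `RuntimeMetrics::worker_poll_count()` is available.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::Duration;

fn main() {
    // Worker ids that are currently parked, updated from the proposed
    // park/unpark callbacks (names and `usize` argument are assumptions).
    let parked: Arc<Mutex<HashSet<usize>>> = Arc::new(Mutex::new(HashSet::new()));

    let on_park = parked.clone();
    let on_unpark = parked.clone();

    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .on_thread_park_id(move |worker: usize| {
            on_park.lock().unwrap().insert(worker);
        })
        .on_thread_unpark_id(move |worker: usize| {
            on_unpark.lock().unwrap().remove(&worker);
        })
        .build()
        .unwrap();

    // `worker_poll_count()` is an unstable metric (requires `tokio_unstable`).
    let metrics = runtime.metrics();
    let workers = metrics.num_workers();
    let watchdog_parked = parked.clone();

    std::thread::spawn(move || {
        let mut last_polls = vec![0u64; workers];
        loop {
            std::thread::sleep(Duration::from_secs(1));
            for worker in 0..workers {
                let polls = metrics.worker_poll_count(worker);
                let is_parked = watchdog_parked.lock().unwrap().contains(&worker);
                // A worker that made no polling progress and is not parked
                // has most likely been stuck in a non-yielding task.
                if polls == last_polls[worker] && !is_parked {
                    eprintln!("worker {worker} appears to be stuck");
                }
                last_polls[worker] = polls;
            }
        }
    });

    runtime.block_on(async {
        // application workload goes here
    });
}
```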
Rather than printing, a real watchdog implementation could call `std::process::abort()` if the task stays stuck past some duration (e.g. 10 seconds).