Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unstable on_thread_park_id() to runtime Builder (for stuck task watchdog) #6370

Conversation

theron-eg
Copy link

Motivation

Fixes #6353. I'm looking for a way to detect when a task is not yielding back to tokio properly. Currently such a task can cause other random tasks to not get run. We had an incident where a buggy task went into a busy loop and put a service into a zombie state: it still was responding to most requests, but some background tasks were not running as expected.

If there is an existing way to detect a stuck task, I'd be delighted to be enlightened.

Solution

Add a new method on_thread_park_id() that includes the worker id being parked. This allows a watchdog process to determine which workers are not parked and also not polling (available already from RuntimeMetrics::worker_poll_count()). The watchdog code would look something like this:

fn main() {
    const WORKERS: usize = 4;
    const UNPARKED: AtomicBool = AtomicBool::new(false);
    static PARKED: [AtomicBool; WORKERS] = [UNPARKED; WORKERS];

    let runtime = runtime::Builder::new_multi_thread()
        .worker_threads(WORKERS)
        .on_thread_park_id(|id| PARKED[id].store(true, Ordering::Release))
        .on_thread_unpark_id(|id| PARKED[id].store(false, Ordering::Release))
        .build()
        .unwrap();

    let metrics = runtime.handle().metrics();
    thread::spawn(move || {
        let mut stuck_secs = [0; WORKERS];
        let mut prev_poll_counts = [None; WORKERS];
        loop {
            thread::sleep(time::Duration::from_secs(1));
            for ii in 0..WORKERS {
                if PARKED[ii].load(Ordering::Acquire) {
                    prev_poll_counts[ii] = None;
                } else {
                    let poll_count = metrics.worker_poll_count(ii);
                    if Some(poll_count) == prev_poll_counts[ii] {
                        stuck_secs[ii] += 1;
                        println!("*** worker {} is stuck >= {} secs ***", ii, stuck_secs[ii]);
                    } else {
                        prev_poll_counts[ii] = Some(poll_count);
                        stuck_secs[ii] = 0;
                    }
                }
            }
        }
    });

    runtime.spawn(async { loop {} });
    runtime.block_on(async {
        loop {
            tokio::task::yield_now().await
        }
    });
}

Rather than printing, a real watchdog implementation could call std::process::abort() if the task stays stuck past some duration (e.g. 10 seconds).

It's possible to find out from WorkerMetrics which workers are not
actively polling for tasks.  However it's not possible to tell if
non-polling workers are merely idle (parked) or if they are stuck.
By adding the worker id (same usize as used in WorkerMetrics calls)
to the on_thread_park() and on_thread_unpark() callbacks it is
possible to track which specific workers are parked.  Then any
worker that is not polling tasks and is not parked is a worker
that is stuck.

With this information it's possible to create a watchdog that alerts
or kills the process if a worker is stuck for too long.
@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Feb 29, 2024
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime M-metrics Module: tokio/runtime/metrics labels Feb 29, 2024
@theron-eg theron-eg marked this pull request as draft February 29, 2024 22:36
@Darksonn
Copy link
Contributor

Darksonn commented Mar 5, 2024

Maybe a better option would be add a new metric for this. It could count the number of parks/unparks. When the number is odd, the worker is parked. When it is even, the worker is active.

@Darksonn
Copy link
Contributor

Darksonn commented Mar 5, 2024

Note that you can push an empty commit when you want to trigger a new build.

You will probably need to merge in master to get the build to work. There were some changes to the CI setup.

@theron-eg
Copy link
Author

You will probably need to merge in master to get the build to work. There were some changes to the CI setup.

Ah thanks, I was puzzled what was going wrong. I hope to get back to making more progress on this in the next couple of days.

Maybe a better option would be add a new metric for this. It could count the number of parks/unparks. When the number is odd, the worker is parked. When it is even, the worker is active.

It did occur to me to add a metric for "is currently parked" or something similar but it seemed more useful to have the generic callback mechanism (with the proper worker id). With that it's possible to for anyone to collect their own set of metrics, instead of relying on WorkerMetrics having exactly the right set of info (and incurring the overhead to accurately / atomically count things that 99.9% of people don't care about). But if you'd prefer a change to WorkerMetrics instead I'm happy to go down that route. Let me at least get this ready for official review and then I'm happy to take feedback.

@Darksonn
Copy link
Contributor

I think a new worker metric makes sense.

@Darksonn
Copy link
Contributor

Darksonn commented Jun 4, 2024

Closing as this seems like it should be a metric instead.

@Darksonn Darksonn closed this Jun 4, 2024
@surban
Copy link
Contributor

surban commented Jul 17, 2024

Hi,

I have a similar problem and would like to integrate stuck worker detection in my watchdog system.

Have you made any progress with this approach or found an alternate solution?

@Darksonn
Copy link
Contributor

@surban
Copy link
Contributor

surban commented Jul 19, 2024

Someone put together this tool: https://github.com/facebookexperimental/rust-shed/tree/main/shed/tokio-detectors

This seems to follow a completely different approach. It submits tasks and measures the delay until the task is executed. If I understand correctly, it will not detect a single stuck worker, but only trigger once all worker threads are blocked or saturated. Also, it is not as lightweight, since it requires periodic submission of tasks to the scheduler to perform its probing.

Adding a metric that allows to determine the parking state of a worker thread seems to be a much better solution to me. However, for diagnostic purposes it would also be necessary to know which worker thread got stuck, i.e. we would need a way to get the native thread id of each worker thread. This would probably be impossible to provide via worker metrics, since it would require exposing std::thread::JoinHandle.

Therefore, I suggest to reconsider the approach presented in this PR. By providing the Tokio worker id via the callback mechanism here, the lightweight watchdog can build a mapping between native thread id (obtained via libc::gettid or similar) and Tokio worker id while keeping track of the park/unpark state of each worker.

An alternative would be to combine both approaches:

  1. add a worker metric that provides the current park state of the worker, as you proposed, and,
  2. add a callback on_worker_start(id: usize) that is called once for each started worker and allows an observer to build a mapping between Tokio worker ids and native threads.

@Darksonn
Copy link
Contributor

It's entirely normal for us to provide per-worker metrics, so I don't think there is any blocker to going the metric route.

@surban
Copy link
Contributor

surban commented Jul 19, 2024

I can submit a PR, but how should I provide the thread id in the worker metrics? Providing std::thread::ThreadId would not be enough, since there is no way to get the native thread id from that. I could add OS-specific ids (HANDLE for Windows, pthread_t for UNIX, tid for Linux), but is that what you want?

EDIT: Nevermind, std::thread::ThreadId should be good enough. We can gather the remaining info in on_thread_start and match via the ThreadId.

@surban
Copy link
Contributor

surban commented Jul 19, 2024

I submitted #6695 and #6696 for review.

@Darksonn
Copy link
Contributor

Thanks! I just took a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-metrics Module: tokio/runtime/metrics M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog)
3 participants