Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocalSet, JoinHandle and budget coming together can lead to hangs with 100% CPU usage. #2460

Closed
najamelan opened this issue Apr 29, 2020 · 4 comments · Fixed by #2462
Closed
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-runtime Module: tokio/runtime

Comments

@najamelan
Copy link
Contributor

Affected Versions

0.2.14 - 0.2.19

Platform

arch linux, but no IO involved.

Description

Please see the issue reported first at futures-rs because I suspected JoinAll was maybe at fault.

JoinHandle often reports Pending even though it's task has already returned Ready. Over a collection of futures like in LocalSet, this can require several runs before all contained JoinHandles return Ready. When there are more tasks in the set than budget, this can lead to the JoinHandles never returning Ready and tasks awaiting them to hang.

It is possible that an inefficiency in the synchronization of JoinHandle also has an impact when using a threadpool, but it might not be so noticable since this just throws more resources at the problem.

@Darksonn Darksonn added A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-runtime Module: tokio/runtime T-performance Topic: performance and benchmarks I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. and removed T-performance Topic: performance and benchmarks labels Apr 29, 2020
hawkw added a commit that referenced this issue Apr 29, 2020
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw
Copy link
Member

hawkw commented Apr 29, 2020

I think I know what's up here.

LocalSet wraps each time a local task is run in budget:

Some(task) => crate::coop::budget(|| task.run()),

This is identical to what tokio's other schedulers do when running tasks, and in theory should give each task its own budget every time it's polled.

However, LocalSet is different from other schedulers. Unlike the runtime schedulers, a LocalSet is itself a future that's run on another scheduler, in block_on. block_on also sets a budget:

if let Ready(v) = crate::coop::budget(|| future.as_mut().poll(&mut cx)) {

The docs for budget state that:

/// Enabling budgeting when it is already enabled is a no-op.

This means that inside of a LocalSet, the calls to budget are no-ops. Instead, each future polled by the LocalSet is subtracting from a single global budget.

LocalSet's RunUntil future polls the provided future before polling any other tasks spawned on the local set:

if let Poll::Ready(output) = me.future.poll(cx) {
return Poll::Ready(output);
}
if me.local_set.tick() {
// If `tick` returns `true`, we need to notify the local future again:
// there are still tasks remaining in the run queue.
cx.waker().wake_by_ref();
}
Poll::Pending

In this case, the provided future is JoinAll. Unfortunately, every time a JoinAll is polled, it polls every joined future that has not yet completed. When the number of futures in the JoinAll is >= 128, this means that the JoinAll immediately exhausts the task budget. This would, in theory, be a good thing --- if the JoinAll had a huge number of JoinHandles in it and none of them are ready, it would limit the time we spend polling those join handles.

However, because the LocalSet actually has a single shared task budget, this means polling the JoinAll always exhausts the entire budget. There is now no budget remaining to poll any other tasks spawned on the LocalSet, and they are never able to complete.

There are two possible solutions: LocalSet could use coop::limit to poll each future with a smaller budget, ensuring no one future can exhaust the entire budget. However, this would mean that tasks on a LocalSet would be given a smaller budget than comparable tasks on a regular worker or basic_scheduler. Alternatively, LocalSet could call coop::stop to reset the budget when it is polled. That way, each task spawned on the LocalSet would get its own separate budget rather than starving other tasks. Since LocalSets can only be polled in block_on, and are not competing with other tasks, I think it's correct for it to exempt itself from budgeting.

@najamelan
Copy link
Contributor Author

najamelan commented Apr 29, 2020

That's some awesome debugging! Thanks for looking into that. There is one thing I don't understand though:

However, because the LocalSet actually has a single shared task budget, this means polling the JoinAll always exhausts the entire budget. There is now no budget remaining to poll any other tasks spawned on the LocalSet, and they are never able to complete.

From what I gather from the logs I posted, the inner tasks have in fact always run and completed before the outer ones. After they print, they immediately return Poll::Ready. However the joinhandles still said they hadn't. The outer tasks which were being polled would have completed had the JoinHandles not returned Pending.

It is very late here, so I have to sleep, but tomorrow I will run the test with your PR. I imagine it will solve the hang, but I suspect there is still an underlying issue that could have solved this from another angle and that may be desirable to solve as well. I'll have a look at the code of JoinHandle tomorrow as well.

@hawkw
Copy link
Member

hawkw commented Apr 29, 2020

When the JoinHandles are polled, the budget is 0. This causes them to return Pending immediately, without checking if the task has actually completed.

@najamelan
Copy link
Contributor Author

Yes, thanks. It dawned to me overnight that the budget was returning Pending, not the JoinHandle...

hawkw added a commit that referenced this issue Apr 30, 2020
## Motivation

Currently, an issue exists where a `LocalSet` has a single cooperative
task budget that's shared across all futures spawned on the `LocalSet`
_and_ by any future passed to `LocalSet::run_until` or
`LocalSet::block_on`. Because these methods will poll the `run_until`
future before polling spawned tasks, it is possible for that task to
_always_ deterministically starve the entire `LocalSet` so that no local
tasks can proceed. When the completion of that future _itself_ depends
on other tasks on the `LocalSet`, this will then result in a deadlock,
as in issue #2460.

A detailed description of why this is the case, taken from [this 
comment][1]:

`LocalSet` wraps each time a local task is run in `budget`:
https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/task/local.rs#L406

This is identical to what tokio's other schedulers do when running
tasks, and in theory should give each task its own budget every time
it's polled. 

_However_, `LocalSet` is different from other schedulers. Unlike the
runtime schedulers, a `LocalSet` is itself a future that's run on
another scheduler, in `block_on`.  `block_on` _also_ sets a budget:
https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/runtime/basic_scheduler.rs#L131

The docs for `budget` state that:
https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/coop.rs#L73

This means that inside of a `LocalSet`, the calls to `budget` are
no-ops. Instead, each future polled by the `LocalSet` is subtracting
from a single global budget.

`LocalSet`'s `RunUntil` future polls the provided future before polling
any other tasks spawned on the local set:
https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/task/local.rs#L525-L535

In this case, the provided future is `JoinAll`. Unfortunately, every
time a `JoinAll` is polled, it polls _every_ joined future that has not
yet completed. When the number of futures in the `JoinAll` is >= 128,
this means that the `JoinAll` immediately exhausts the task budget. This
would, in theory, be a _good_ thing --- if the `JoinAll` had a huge
number of `JoinHandle`s in it and none of them are ready, it would limit
the time we spend polling those join handles. 

However, because the `LocalSet` _actually_ has a single shared task
budget, this means polling the `JoinAll` _always_ exhausts the entire
budget. There is now no budget remaining to poll any other tasks spawned
on the `LocalSet`, and they are never able to complete.

[1]: #2460 (comment)

## Solution

This branch solves this issue by resetting the task budget when polling
a `LocalSet`. I've added a new function to `coop` for resetting the task
budget to `UNCONSTRAINED` for the duration of a closure, and thus
allowing the `budget` calls in `LocalSet` to _actually_ create a new
budget for each spawned local task. Additionally, I've changed
`LocalSet` to _also_ ensure that a separate task budget is applied to
any future passed to `block_on`/`run_until`.

Additionally, I've added a test reproducing the issue described in
#2460. This test fails prior to this change, and passes after it.

Fixes #2460

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-runtime Module: tokio/runtime
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants