Internalize task distinction into TaskPool #4740

james7132 · 2022-05-13T18:05:43Z

Objective

Fixes #1907. IO and Async Compute threads are sitting idle even when the compute threads are at capacity. This results in poor thread utilization.

Solution

Move the distinction between compute, async compute, and IO threads out of userspace, and internalize the distinction into TaskPool. This creates three tiers of threads and tasks: one for each group. Higher priority tasks can run on lower priority threads, if and only if they're currently idle, but the lower priority threads will prioritize their specific workload over higher priority ones. Priority goes compute > io > async compute.

For example, heavy per-frame compute workload, async compute and IO otherwise sitting idle: compute tasks will be scheduled onto async compute and IO threads whenever idle. Any IO task that is awoken will take precedence over the compute tasks.

In another case, a heavy IO workload will schedule onto idle async compute threads, but will never use a compute thread. This prevents lower priority tasks from starving out higher priority tasks.

The priority scheme chosen assumes well-behaved scheduling that adheres to the following principles:

Compute tasks are latency sensitive and generally do not hold the thread indefinitely or at the very minimum for time longer than the course of a frame.
IO tasks are not CPU intensive and will do minimal processing. Generally will yield readily at any incoming or outbound communication, and thus will only generally be under consistent heavy load only in specific circumstances.
Async compute tasks may run for a very long time, across multiple frames, and will generally not yield at any point during a given task's execution. Only under heavy load when a developer explicitly needs it, and can otherwise be used for IO or compute in the general case.

TODO:

Add two additional executors to TaskPool
Set up TaskPool threads to run tasks from all 3 with some prioritization
Add user-facing APIs for choosing which priority to spawn tasks under.
Remove the usages newtypes. Fix up dependent code.
Properly set up individual thread groups and their individual prioritization schemes.
Update docs.

Changelog

Changed: Newtypes of TaskPool (ComputeTaskPool, AsyncComputeTaskPool, and IoTaskPool) have been replaced with TaskPool::spawn_as with the appropriate TaskGroup. Thread utilization across these segments has been generally been increased via internal task prioritization. For convenience, TaskPool::spawn will always spawn on the compute task group.

Migration Guide

ComputeTaskPool, AsyncComputeTaskPool, and IoTaskPool have been removed. These individual TaskPools have been merged together, and can be accessed by Res<TaskPool>. To spawn a task on a specific segment of the threads, use TaskPool::spawn_as with the appropriate TaskGroup.

Before:

fn system(task_pool: Res<AsyncComputeTaskPool>) {
   task_pool.spawn(async {
       ...
   });
}

After:

fn system(task_pool: Res<TaskPool>) {
   task_pool.spawn(TaskGroup::AsyncCompute, async {
       ...
   });
}

hymm · 2022-05-13T18:31:15Z

I had a slightly different idea on how to combine the task pools. I wanted to give tasks spawned on a scope the highest priority; give tasks spawned onto the task pool a lower priority; and then add a new spawn_blocking function that works like it does in tokio where these tasks are spawned onto a different set of threads. It reuses an idle thread if there are any or spawns a new thread when there aren't any.

james7132 · 2022-05-13T18:41:20Z

I had a slightly different idea on how to combine the task pools. I wanted to give tasks spawned on a scope the highest priority; give tasks spawned onto the task pool a lower priority; and then add a new spawn_blocking function that works like it does in tokio where these tasks are spawned onto a different set of threads. It reuses an idle thread if there are any or spawns a new thread when there aren't any.

I think the first two parts are still reconcilable with the current proposed approach; however, the problem with spawn_blocking as a general approach is that if a compute thread is idle, you can easily starve out the compute tasks, which is unacceptable in a game engine. Spinning up and down threads is also an overhead we generally want to avoid too, particularly since it can cause even heavier thrashing on the CPU for the other threads.

hymm · 2022-05-13T18:46:52Z

however, the problem with spawn_blocking as a general approach is that if a compute thread is idle, you can easily starve out the compute tasks, which is unacceptable in a game engine.

isn't that a problem with this approach too. If a user starts too many blocking tasks the compute task pool won't have any threads to work with either.

edit: thinking on this more I guess. The approach here prevents the user from running more than x blocking tasks at the same time. So the surprise for the user is that when they run too many blocking tasks, they only get x running in parallel. Even if they were expecting them to all run in parallel. In the case I suggest, they run as many as they want, but suddenly their game starts stuttering because they've used all the threads.

james7132 · 2022-05-13T19:04:00Z

isn't that a problem with this approach too. If a user starts too many blocking tasks the compute task pool won't have any threads to work with either.

If and only if they start it as a normal compute task, something that would require spawn_blocking should be scheduled to the async compute segment, where under this approach, it'll be delayed until the thread(s) are available.

james7132 · 2022-05-14T06:14:58Z

bors try

bors · 2022-05-14T07:15:32Z

try

Build failed:

miri

# Objective Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization. Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307. ## Solution The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~ - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below). - ~~no error is logged of any kind~~ (See below) - ~~it's unclear if it drops other tasks in the executor~~ (it does not) - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread. ### Alternatives A final solution likely will incorporate elements of any or all of the following. #### ~~Log and Ignore~~ ~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~ Panics already do this by default, even when caught by `catch_unwind`. #### ~~`catch_unwind` in `ParallelExecutor`~~ ~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~ `async_task::Task` bubbles up panics already, this will transitively push panics all the way to the main thread. #### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~ ~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~ `async_task::Task` bubbles up panics already, this will transitively push panics all the way to the main thread. #### Abort on Panic The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of: ```rust struct AbortOnPanic; impl Drop for AbortOnPanic { fn drop(&mut self) { abort!(); } } let guard = AbortOnPanic; // Run task std::mem::forget(AbortOnPanic); ``` --- ## Changelog Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics. Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.

james7132 · 2022-11-05T09:28:13Z

@cart @hymm I've mostly updated the PR and this is once again up for review. #4466 added an unusual interaction since we need a temporary Executor, which I originally made with the assumption that it would be static and never be dropped. There were a few optimizations that definitely decrease overhead when (re)scheduling tasks. I can make another "SimpleExecutor" (as a Send-only MPSC executor) for this case that resolves this, but the problem remains that Executor is not meant to handle being dropped.

Should we:

Take the perf hit and revert those changes, supporting dropping Executor and by proxy TaskPool?
Or go all in on a global TaskPool and keep the optimizations?

2 would move us closer to a tokio-esque implementation where there's a single global runtime, while 1 would keep what we have and retain the approach that bevy_tasks has had thus far. Which one should we pursue? I'm personally inclined to push for 2 since:

it lets us make those assumptions and cut more overhead out
allows for more ergonomic initialization (i.e. #[tokio::main] style #[bevy::main]),
this PR already reduces/removes the need to pass around &TaskPool or some newtype of it.
provides a bit of familiarity for people using tokio's APIs if there's a clear mapping between the two.

hymm · 2022-11-07T06:04:37Z

We don't necessarily need a temporary executor. We just need a way of running tasks that only run on the thread the scope is running on. This could potentially be a thread local storage queue that can't be stolen from.

A consideration here is that in #6503, I removed the temporary executor for a global main thread executor that has run time checks to only allow it to be ticked on the main thread. But not sure if that is going to get merged or not.

hymm · 2022-11-07T06:43:08Z

I'm leaning towards option 2. Seems like there isn't much need for users to construct their own taskpools now that we're using global task pools. The one issue might be if users will want to reconfigure the number of threads or thread distribution during execution.

On another note, it might be a good idea to port the micro benchmarks over from async_executor to get a better idea of how the current executor compares to the original.

hymm · 2022-11-15T07:00:28Z

Will this completely block us off from using the tokio ecosystem forever on our task pools? Or would we still be able to switch the tokio executor into our task pools somehow even if it doesn't have all the niceties (priorities)?

hymm · 2022-11-15T07:23:37Z

My biggest problem with reviewing this PR is that I have no idea how to understand the changes that were made to the async executor. I can look at the diff https://github.com/james7132/bevy/compare/b5fc0d2..task-pool-internalization and figure out what it's doing, but I have no context for why some things were changed. If you could port over the tests and benches from async executor I would have a lot more confidence that things are working as they're supposed to. Bonus points if you can get loom working somehow.

inodentry · 2023-01-03T09:15:05Z

Just something to note: this sort of PR, where we have very tricky performance implications due to multithreading, different CPU cores being involved, etc, can be very sensitive to CPU hardware boost behaviors and such!

I am worried that there is a high risk of benchmarks being wildly inaccurate (when it comes to judging how much more/less efficient bevy's scheduling has become), due to different thread utilization, causing different boost clocks of different CPUs, etc. The benchmark results may therefore be skewed because of the particular CPU model of the computer they are run on.

I recommend, if possible, to benchmark this PR (and other multithreading/scheduling things) on a machine where all CPU cores can be forced to run at the same clock speeds (via BIOS settings, etc, to disable boost and clock frequency scaling).

Otherwise we can't really know how much we actually win/lose due to better multithreading / thread utilization, and how much of it was due to the particular CPU hardware boosting differently due to its cores being loaded differently.

This kind of thing is important in general, but for these sorts of PRs especially so. I would not trust benchmarks that were not done that way.

# Objective Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by bevyengine#2250 as these threads are global and cannot be replaced after initialization. Removes the need for temporary fixes like bevyengine#4998. Fixes bevyengine#4996. Fixes bevyengine#6081. Fixes bevyengine#5285. Fixes bevyengine#5054. Supersedes bevyengine#2307. ## Solution The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~ - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below). - ~~no error is logged of any kind~~ (See below) - ~~it's unclear if it drops other tasks in the executor~~ (it does not) - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread. ### Alternatives A final solution likely will incorporate elements of any or all of the following. #### ~~Log and Ignore~~ ~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~ Panics already do this by default, even when caught by `catch_unwind`. #### ~~`catch_unwind` in `ParallelExecutor`~~ ~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~ `async_task::Task` bubbles up panics already, this will transitively push panics all the way to the main thread. #### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~ ~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~ `async_task::Task` bubbles up panics already, this will transitively push panics all the way to the main thread. #### Abort on Panic The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la bevyengine#4740). Roughly takes the shape of: ```rust struct AbortOnPanic; impl Drop for AbortOnPanic { fn drop(&mut self) { abort!(); } } let guard = AbortOnPanic; // Run task std::mem::forget(AbortOnPanic); ``` --- ## Changelog Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics. Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.

goshhhy · 2023-02-02T04:53:12Z

I recommend, if possible, to benchmark this PR (and other multithreading/scheduling things) on a machine where all CPU cores can be forced to run at the same clock speeds (via BIOS settings, etc, to disable boost and clock frequency scaling).

i have a machine with 272 threads, where the cores are all running the same speed, and as long as at least 3 cores are active then there is only 100mhz difference between base and boost clocks (1400MHz vs 1500MHz; 1600 is possible with 2 or fewer cores active), and i may be able to disable boost entirely.

i also have a simulation workload for that machine, which is affected by #1907, with rather severe underutilization of threads compared to what i would hope for.

i will test this PR with that machine/workload and share the results.

james7132 · 2023-03-08T06:24:55Z

Going to return to the drawing board on this one. Taking on an entire fork of async_executor might not be in our interests. In the meantime, I'm potentially looking to experiment with switchyard, which provides better per-thread control over task allocation.

Internalize task distinction into TaskPool

aa75703

james7132 marked this pull request as draft May 13, 2022 18:05

james7132 added 15 commits May 13, 2022 12:53

Basic priortization strategy

3f97865

Add missing docs

f0a70b9

Formatting

dadfe15

Fix WASM

3f260b2

Fix local tests for bevy_task

e35e11a

Fix CI

2eab8fc

Fix wasm32

02a36b5

Fix miri

76a2da0

More wasm32 tomfoolery

2a0ae14

Derp

c32bea7

Formatting woes

7911617

Fix up TaskPoolBuilder for wasm32

c3e4cff

Remove unnecessary Send bounds

0b67525

More wasm woes

e87790e

Formatting

8c722e3

bors bot added a commit that referenced this pull request May 14, 2022

Try #4740:

362294f

Use run instead custom run_forever

6ad621e

james7132 marked this pull request as ready for review May 14, 2022 20:57

james7132 mentioned this pull request Nov 1, 2022

[Merged by Bors] - TaskPool Panic Handling #6443

Closed

james7132 mentioned this pull request Nov 5, 2022

Add default feature for pinning TaskPool threads to CPU cores #6471

Closed

james7132 added 7 commits November 4, 2022 21:51

Merge branch 'main' into task-pool-internalization

310c3b9

Attempt to fix CI

0ca76c9

Update st3

1378672

Use let-else

994a250

Fix locality tests

18d2a8f

Fix CI

9cb309b

Fix WASM

b4f700c

james7132 added 2 commits November 5, 2022 03:33

Make LocalExecutor not require atomics

30aad1e

Add simple executor to avoid using Executor in scope

2b0d2d7

hymm mentioned this pull request Nov 10, 2022

[Merged by Bors] - Pipelined Rendering #6503

Closed

2 tasks

james7132 added 3 commits November 15, 2022 00:39

Clean up thread spawn logic

c380b57

Merge branch 'main' into task-pool-internalization

9be6fe2

Formatting

07e1acf

james7132 mentioned this pull request Nov 26, 2022

Add tokio as an optional bevy_tasks backend #6762

Closed

alice-i-cecile added the D-Complex Quite challenging from either a design or technical perspective. Ask for help! label Jan 3, 2023

james7132 closed this Mar 8, 2023

james7132 mentioned this pull request Feb 24, 2024

Consolidate together Bevy's TaskPools #12090

Open

james7132 mentioned this pull request Apr 16, 2024

Optimize TaskPools for use in static variables. #12990

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Internalize task distinction into TaskPool #4740

Internalize task distinction into TaskPool #4740

james7132 commented May 13, 2022 •

edited

Loading

hymm commented May 13, 2022

james7132 commented May 13, 2022 •

edited

Loading

hymm commented May 13, 2022 •

edited

Loading

james7132 commented May 13, 2022 •

edited

Loading

james7132 commented May 14, 2022

bors bot commented May 14, 2022

james7132 commented Nov 5, 2022 •

edited

Loading

hymm commented Nov 7, 2022

hymm commented Nov 7, 2022

hymm commented Nov 15, 2022

hymm commented Nov 15, 2022

inodentry commented Jan 3, 2023

goshhhy commented Feb 2, 2023

james7132 commented Mar 8, 2023

Internalize task distinction into TaskPool #4740

Internalize task distinction into TaskPool #4740

Conversation

james7132 commented May 13, 2022 • edited Loading

Objective

Solution

Changelog

Migration Guide

hymm commented May 13, 2022

james7132 commented May 13, 2022 • edited Loading

hymm commented May 13, 2022 • edited Loading

james7132 commented May 13, 2022 • edited Loading

james7132 commented May 14, 2022

bors bot commented May 14, 2022

try

james7132 commented Nov 5, 2022 • edited Loading

hymm commented Nov 7, 2022

hymm commented Nov 7, 2022

hymm commented Nov 15, 2022

hymm commented Nov 15, 2022

inodentry commented Jan 3, 2023

goshhhy commented Feb 2, 2023

james7132 commented Mar 8, 2023

james7132 commented May 13, 2022 •

edited

Loading

james7132 commented May 13, 2022 •

edited

Loading

hymm commented May 13, 2022 •

edited

Loading

james7132 commented May 13, 2022 •

edited

Loading

james7132 commented Nov 5, 2022 •

edited

Loading