Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
executor: rewrite the work-stealing thread pool (tokio-rs#1657)
This patch is a ground up rewrite of the existing work-stealing thread pool. The goal is to reduce overhead while simplifying code when possible. At a high level, the following architectural changes were made: - The local run queues were switched for bounded circle buffer queues. - Reduce cross-thread synchronization. - Refactor task constructs to use a single allocation and always include a join handle (tokio-rs#887). - Simplify logic around putting workers to sleep and waking them up. **Local run queues** Move away from crossbeam's implementation of the Chase-Lev deque. This implementation included unnecessary overhead as it supported capabilities that are not needed for the work-stealing thread pool. Instead, a fixed size circle buffer is used for the local queue. When the local queue is full, half of the tasks contained in it are moved to the global run queue. **Reduce cross-thread synchronization** This is done via many small improvements. Primarily, an upper bound is placed on the number of concurrent stealers. Limiting the number of stealers results in lower contention. Secondly, the rate at which workers are notified and woken up is throttled. This also reduces contention by preventing many threads from racing to steal work. **Refactor task structure** Now that Tokio is able to target a rust version that supports `std::alloc` as well as `std::task`, the pool is able to optimize how the task structure is laid out. Now, a single allocation per task is required and a join handle is always provided enabling the spawner to retrieve the result of the task (tokio-rs#887). **Simplifying logic** When possible, complexity is reduced in the implementation. This is done by using locks and other simpler constructs in cold paths. The set of sleeping workers is now represented as a `Mutex<VecDeque<usize>>`. Instead of optimizing access to this structure, we reduce the amount the pool must access this structure. Secondly, we have (temporarily) removed `threadpool::blocking`. This capability will come back later, but the original implementation was way more complicated than necessary. **Results** The thread pool benchmarks have improved significantly: Old thread pool: ``` test chained_spawn ... bench: 2,019,796 ns/iter (+/- 302,168) test ping_pong ... bench: 1,279,948 ns/iter (+/- 154,365) test spawn_many ... bench: 10,283,608 ns/iter (+/- 1,284,275) test yield_many ... bench: 21,450,748 ns/iter (+/- 1,201,337) ``` New thread pool: ``` test chained_spawn ... bench: 147,943 ns/iter (+/- 6,673) test ping_pong ... bench: 537,744 ns/iter (+/- 20,928) test spawn_many ... bench: 7,454,898 ns/iter (+/- 283,449) test yield_many ... bench: 16,771,113 ns/iter (+/- 733,424) ``` Real-world benchmarks improve significantly as well. This is testing the hyper hello world server using: `wrk -t1 -c50 -d10`: Old scheduler: ``` Running 10s test @ http://127.0.0.1:3000 1 threads and 50 connections Thread Stats Avg Stdev Max +/- Stdev Latency 371.53us 99.05us 1.97ms 60.53% Req/Sec 114.61k 8.45k 133.85k 67.00% 1139307 requests in 10.00s, 95.61MB read Requests/sec: 113923.19 Transfer/sec: 9.56MB ``` New scheduler: ``` Running 10s test @ http://127.0.0.1:3000 1 threads and 50 connections Thread Stats Avg Stdev Max +/- Stdev Latency 275.05us 69.81us 1.09ms 73.57% Req/Sec 153.17k 10.68k 171.51k 71.00% 1522671 requests in 10.00s, 127.79MB read Requests/sec: 152258.70 Transfer/sec: 12.78MB ```
- Loading branch information