Draft: Add BWoS-queue backend to tokio #5283
This is really cool! I'll take a look in a bit.
I would also really like to take a look at this change some time in the next couple of days!
Looks cool. I have no bandwidth this week, and next week is Christmas, but I will definitely take a look eventually.
I investigated this and it turns out that the loom wrapper type (for non-loom builds) adds significant overhead. LTO seems to mostly eliminate that overhead, but it would be preferable if non-LTO builds didn't lose more than 50% of queue performance on x86.
[Benchmark results on commit 87cee67; result tables not preserved. Configurations measured: current version (without lto); lto = "fat"; lto = "thin"; std AtomicUsize type instead of loom wrapper type (no lto); std AtomicUsize type instead of loom wrapper type (lto = "thin")]
All the functions in that file (atomic_usize.rs) are non-generic and trivial, so try adding
@jschwe did you find it adding this overhead in vanilla tokio or just in this fork?
I didn't have time to investigate more before going on vacation, but in flamegraphs the deref operation of the tokio mock wrapper was quite visible in the microbenchmarks (bwosqueue/benches/bench.rs) of the queue (without lto). These microbenchmarks are currently standalone and the mock loom implementation was just copy-pasted from tokio (the
Adding the
Edit: I just reran the microbenchmarks and adding
I need to take a look through this again soon.
That would be great! If it would help, I'd also offer to discuss or walk through the queue in a call. The diff of this MR is quite big, but the important parts are ~1200 lines and are basically:
Most of the other stuff is tests or microbenchmarks, so their review and discussion can probably be delayed until you have decided whether you are interested in merging.
I've updated this branch with a draft implementation of selecting the queue backend via the Builder at runtime. CI failing is expected: I haven't updated all the tests yet, since I first wanted to get some feedback.
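To make the idea of selecting the queue backend through the Builder at runtime more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the trait `LocalQueue`, the enum `QueueBackend`, the method `queue_backend`, and both queue types are hypothetical names, and the two queues are plain `VecDeque` stand-ins rather than tokio's real queue or the real BWoS queue. It only shows the shape of choosing an implementation at runtime and calling it through a trait object.

```rust
use std::collections::VecDeque;

/// Hypothetical trait that every local-queue backend implements.
trait LocalQueue<T> {
    fn name(&self) -> &'static str;
    /// Hands the item back to the caller if the queue is full.
    fn push(&mut self, item: T) -> Result<(), T>;
    fn pop(&mut self) -> Option<T>;
}

/// Stand-in for the existing queue (a bounded VecDeque, not tokio's real queue).
struct DefaultQueue<T> {
    items: VecDeque<T>,
    cap: usize,
}

/// Stand-in for the BWoS queue (the real one is block-based and concurrent).
struct BwosQueue<T> {
    items: VecDeque<T>,
    cap: usize,
}

impl<T> LocalQueue<T> for DefaultQueue<T> {
    fn name(&self) -> &'static str { "default" }
    fn push(&mut self, item: T) -> Result<(), T> {
        if self.items.len() == self.cap { return Err(item); }
        self.items.push_back(item);
        Ok(())
    }
    fn pop(&mut self) -> Option<T> { self.items.pop_front() }
}

impl<T> LocalQueue<T> for BwosQueue<T> {
    fn name(&self) -> &'static str { "bwos" }
    fn push(&mut self, item: T) -> Result<(), T> {
        if self.items.len() == self.cap { return Err(item); }
        self.items.push_back(item);
        Ok(())
    }
    fn pop(&mut self) -> Option<T> { self.items.pop_front() }
}

/// Hypothetical runtime switch between the two backends.
#[derive(Clone, Copy)]
enum QueueBackend { Default, Bwos }

struct Builder { backend: QueueBackend }

impl Builder {
    fn new() -> Self { Builder { backend: QueueBackend::Default } }

    fn queue_backend(mut self, backend: QueueBackend) -> Self {
        self.backend = backend;
        self
    }

    /// The worker only sees a trait object, so calls are dynamically dispatched.
    fn build_queue<T: 'static>(&self, cap: usize) -> Box<dyn LocalQueue<T>> {
        match self.backend {
            QueueBackend::Default => Box::new(DefaultQueue { items: VecDeque::new(), cap }),
            QueueBackend::Bwos => Box::new(BwosQueue { items: VecDeque::new(), cap }),
        }
    }
}

fn main() {
    let mut q = Builder::new()
        .queue_backend(QueueBackend::Bwos)
        .build_queue::<u32>(2);
    q.push(1).unwrap();
    println!("backend = {}, popped = {:?}", q.name(), q.pop());
}
```

The trade-off in this shape is that every queue operation goes through a vtable call, which is the cost of being able to pick the backend without recompiling.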
c510c2b to 574a77b
This commit is a work-in-progress snapshot of BWoS for tokio, with the intention to get early feedback. Currently, the BWoS queue is just dropped in as a replacement, with the intention to make benchmarking easier (just patch downstream crates to use the modified version). Before merging the queue should be integrated as an alternate queue instead of replacing the current one.
The design of the BWoS queue was done by Jiawei Wang.
Co-authored-by: Jiawei Wang <jiawei.wang@huawei.com>
Designed-by: Jiawei Wang <jiawei.wang@huawei.com>
Signed-off-by: Jonathan Schwender <jonathan.schwender@huawei.com>
Signed-off-by: Jiawei Wang <jiawei.wang@huawei.com>
Signed-off-by: Ming Fu <ming.fu@huawei.com>
bwosqueue/src/lib.rs
Outdated
// From the statistics perspective we consider the reserved range to already be
// stolen, since it is not available for the consumer or other stealers anymore.
#[cfg(feature = "stats")]
self.queue.stats.increment_stolen(num_reserved);
I would like to inquire if it is necessary to adjust the value of stolen here?
blk.stolen.fetch_add(num_reserved, Release);
No, that would be wrong. At this point we have just reserved the entries, i.e. the consumer can't access them anymore, but the stealer has not finished copying the entries over to the stealer's queue. The Drop implementation for StealerBlockIter increments stolen once the (now empty) iterator is dropped.
The statistics feature is just an approximation, giving metrics on how many items are in the queue, so that the utilization could be tracked over time.
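The ordering described in this reply (reserve first, copy, and only then count the entries as stolen) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the PR: the `Block` layout, the `reserve` and `stolen` helper methods, and the fields of `StealerBlockIter` are hypothetical, the iterator yields entry indices instead of real tasks, and bounds checks against the block size are omitted.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Relaxed, Release}};

/// Hypothetical, simplified block metadata.
struct Block {
    reserved: AtomicUsize,
    stolen: AtomicUsize,
}

/// Iterator over a reserved range of entries (indices only in this sketch).
struct StealerBlockIter<'a> {
    blk: &'a Block,
    range: Range<usize>,
    num_reserved: usize,
}

impl Block {
    fn new() -> Self {
        Block { reserved: AtomicUsize::new(0), stolen: AtomicUsize::new(0) }
    }

    /// Reserving hides the entries from the consumer and from other stealers,
    /// but does not yet mark them as stolen.
    fn reserve(&self, n: usize) -> StealerBlockIter<'_> {
        let start = self.reserved.fetch_add(n, Relaxed);
        StealerBlockIter { blk: self, range: start..start + n, num_reserved: n }
    }

    fn stolen(&self) -> usize {
        self.stolen.load(Acquire)
    }
}

impl Iterator for StealerBlockIter<'_> {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        self.range.next()
    }
}

impl Drop for StealerBlockIter<'_> {
    /// Only when the stealer is done with the iterator is the whole
    /// reserved range published as stolen.
    fn drop(&mut self) {
        self.blk.stolen.fetch_add(self.num_reserved, Release);
    }
}

fn main() {
    let blk = Block::new();
    let mut iter = blk.reserve(3);
    let copied: Vec<usize> = iter.by_ref().collect();
    // The entries are copied, but `stolen` is still untouched.
    assert_eq!(blk.stolen(), 0);
    drop(iter);
    assert_eq!(blk.stolen(), copied.len());
}
```

Incrementing `stolen` at reservation time instead would tell the owner that the entries are free to reuse while the stealer may still be reading them.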
Unfortunately, I don't think we will have the review bandwidth to review a change like this anytime soon.
Motivation
This is the PR related to #5240, with the motivation to provide an alternate workstealing queue backend for the multithreaded runtime. The BWoS queue is based on the BBQ (Block-based Bounded Queue) and is specially designed for the workstealing scenario. Based on the real-world observation that the "stealing" operation is rare and most of the operations are local enqueues and dequeues, this queue implementation offers a single Owner which can enqueue and dequeue without any heavy synchronization mechanisms on the fast path (intra block) and thus offers very high performance for these operations.
Concurrent stealing is possible and does not slow down the Owner too much. The improved performance allows stealing policies which steal single items or small batches, which improves load balancing. Cache contention is reduced due to the split of Metadata into Global Metadata and Block local Metadata.
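The split between Global Metadata and Block local Metadata can be illustrated with a rough, owner-only sketch. All names and the layout here are assumptions made for illustration (`GlobalMeta`, `BlockMeta`, `Owner`, a block size of 4, 64-byte cache lines); item storage, stealers, and reuse of drained blocks are left out, so this is not the BWoS algorithm itself. It only shows that an enqueue inside a block writes block-local state, and that global state is written only when the owner moves to the next block.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::{Relaxed, Release}};

const BLOCK_SIZE: usize = 4; // hypothetical, tiny for demonstration

/// Global metadata: which block each side is working in.
/// Aligned to an assumed 64-byte cache line to keep it apart from block state.
#[allow(dead_code)] // `stealer_block` is unused in this owner-only sketch
#[derive(Default)]
#[repr(align(64))]
struct GlobalMeta {
    owner_block: AtomicUsize,
    stealer_block: AtomicUsize,
}

/// Block-local metadata: positions inside one block.
#[allow(dead_code)] // stealer-side fields are unused in this owner-only sketch
#[derive(Default)]
#[repr(align(64))]
struct BlockMeta {
    produced: AtomicUsize,
    consumed: AtomicUsize,
    reserved: AtomicUsize,
    stolen: AtomicUsize,
}

struct Queue {
    global: GlobalMeta,
    blocks: Vec<BlockMeta>,
}

#[derive(Debug, PartialEq)]
enum Enqueue { Fast, Slow, Full }

/// The single owner caches its current block index locally.
struct Owner<'a> {
    queue: &'a Queue,
    current: usize,
}

impl Queue {
    fn new(num_blocks: usize) -> Self {
        Queue {
            global: GlobalMeta::default(),
            blocks: (0..num_blocks).map(|_| BlockMeta::default()).collect(),
        }
    }
}

impl Owner<'_> {
    fn enqueue(&mut self) -> Enqueue {
        let blk = &self.queue.blocks[self.current];
        let pos = blk.produced.load(Relaxed);
        if pos < BLOCK_SIZE {
            // Fast path: only this block's metadata is written.
            blk.produced.store(pos + 1, Release);
            return Enqueue::Fast;
        }
        // Slow path: the block is full, so advance to the next one.
        // (This sketch never wraps around to reuse drained blocks.)
        let next = self.current + 1;
        if next == self.queue.blocks.len() {
            return Enqueue::Full;
        }
        self.queue.global.owner_block.store(next, Release);
        self.current = next;
        self.queue.blocks[next].produced.store(1, Release);
        Enqueue::Slow
    }
}

fn main() {
    let q = Queue::new(2);
    let mut owner = Owner { queue: &q, current: 0 };
    for i in 0..9 {
        println!("enqueue {} -> {:?}", i, owner.enqueue());
    }
}
```

With two blocks of four slots, only one of the eight successful enqueues writes to the global metadata; the rest stay within a single block's cache line.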
Remarks about the current status of this PR
The microbenchmarks are in the bwosqueue directory and use criterion. Currently, the BWoS queue requires some LTO optimizations for the best performance, but that can be fixed before merging.
trait based, and the worker uses dynamic dispatching on trait objects.
the downstream project to use this branch.
Evaluation scripts (using rust-web-benchmarks)
For easier evaluation of the queue changes in a hyper "hello-world" application scenario, feel free to use the bwos bench branch forked from rust-web-benchmarks.
The fork mainly differs in that applications which do not use the "rt-multithread" runtime were removed, and a metrics feature was added (which uses a forked version of tokio-metrics to expose both the number of stealing operations and total number of stolen tasks).
At the top level it has a bench_with_metrics.sh script which should be inspected and modified (adjust which cores are bound to and how many cores rewrk uses). This will benchmark 6 different Rust web frameworks, which all provide more or less similar results. The script can benchmark different branches of tokio. I created a number of those to investigate the influence of stealing strategies. I'll update this post with some insights later.