crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821
Spinlock was replaced with […]

Old results:

```
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 4.685 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 66.005 sec
```

New results (same benchmark, but rebased on the latest master at the moment (7070018)):

```
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 4.562 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 65.679 sec
```

Inspected with […]
Running the benchmarks (both old and new) using […]
I can confirm this causes severe CPU overhead when using the channel as a low-throughput spsc channel: the spinlocks eat away at the CPU. If you look at the channel […]

Furthermore, Backoff seems to spin a bit too much before yielding to the scheduler? For comparison, parking_lot's spin exponent is 3 vs. crossbeam's 6. This is a profile of a real application […]
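To make the "spin exponent" comparison concrete, here is a minimal sketch of the exponential-backoff pattern under discussion (my own simplification, not crossbeam's actual Backoff): each step spins 2^step iterations until the exponent limit is reached, then falls back to yielding.

```rust
use std::{hint, thread};

/// One backoff step: spin 2^step iterations while `step` is at or
/// below the exponent limit, then fall back to yielding.
fn snooze(step: &mut u32) {
    // crossbeam's spin exponent is 6 (up to 64 spins per step);
    // parking_lot's is 3 (up to 8 spins per step).
    const SPIN_LIMIT: u32 = 6;

    if *step <= SPIN_LIMIT {
        for _ in 0..(1u32 << *step) {
            hint::spin_loop();
        }
        *step += 1;
    } else {
        // On Linux this is the sched_yield loop visible in strace.
        thread::yield_now();
    }
}

fn main() {
    // Demo: walk through the spin phase and into the yield phase.
    let mut step = 0;
    for _ in 0..20 {
        snooze(&mut step);
    }
}
```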
I think this stems from (not ideal) micro-benchmarking and benchmark marketing. Reducing the spinning makes most channel benchmarks regress, because they typically run uncontended, with #actors <= #cpus available, and/or with no work between send/recv. But if you make it contended (e.g. using […]
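As an illustration of the kind of contended setup described above, here is a hedged sketch (the producer count, message count, and per-message "work" are my own choices, not from this thread; it requires the crossbeam-channel crate and Rust 1.66+ for std::hint::black_box):

```rust
use std::thread;

fn main() {
    let (s, r) = crossbeam_channel::unbounded::<u64>();

    // Oversubscribe: many more producers than CPUs, unlike the usual
    // #actors <= #cpus benchmark setup.
    let producers = 4 * thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let per_producer: u64 = 100_000;

    let mut handles = Vec::new();
    for _ in 0..producers {
        let s = s.clone();
        handles.push(thread::spawn(move || {
            for i in 0..per_producer {
                // Simulate some work between sends, so the benchmark is
                // not a pure ping-pong on the channel's hot path.
                std::hint::black_box((0..64u64).sum::<u64>());
                s.send(i).unwrap();
            }
        }));
    }
    drop(s); // recv() returns Err once all senders are gone

    let mut total: u64 = 0;
    while r.recv().is_ok() {
        total += 1;
    }
    assert_eq!(total, producers as u64 * per_producer);

    for h in handles {
        h.join().unwrap();
    }
}
```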
According to rust-lang/rust#93563 by @ibraheemdev, is the real problem here not the unbounded spinning in the unbounded queue (i.e., the wait_write pointed to by #366), but rather the amount of spin_loop_hint and yield_now spinning, and the fact that it is performed even where it is not needed?
With the changes from std applied, which remove all extra spinning on […]
Removing the spinning seems to affect throughput in this benchmark by a lot (which makes sense considering the benchmark), but as you can see […]
I can confirm that […]

My timing numbers (running on a different machine than before, still amd64 Linux; only one timed run of each, without any care to reduce outside interference, so don't trust small differences): […]
Some cautions about my test code, before I post it: […]
I posted my (somewhat hacky) test code at https://github.com/bsilver8192/mpsc-spin-test if anybody else wants to try it. There are several commented-out sections to switch between the various tests.
I really do not want to pile on, but currently the loop at the top of […]
This is from a profile of tantivy-cli. In the profile, most of the time appears to be spent in this spin:

```rust
for _ in 0..1 << self.step.get() {
    // TODO(taiki-e): once we bump the minimum required Rust version to 1.49+,
    // use [`core::hint::spin_loop`] instead.
    #[allow(deprecated)]
    atomic::spin_loop_hint();
}
```

The send flow uses mpmc channels with a capacity of 100 (mpmc, since the user can configure the number of threads, which in this case is set to 1). Indexing is slower, so the senders typically have to wait:

stdio lines -> 1 thread parsing JSON -> 1 thread indexing in tantivy
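For context, here is a rough sketch (my own construction, not std's or crossbeam's actual code) of the "spin briefly, then block in the kernel" pattern that the std change in rust-lang/rust#93563 moves toward: a short bounded spin for the uncontended fast path, then thread parking instead of a sched_yield loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let ready = Arc::new(AtomicBool::new(false));

    let waiter = {
        let ready = Arc::clone(&ready);
        thread::spawn(move || {
            // Fast path: a short, bounded spin in userspace.
            for _ in 0..16 {
                if ready.load(Ordering::Acquire) {
                    return;
                }
                std::hint::spin_loop();
            }
            // Slow path: block in the kernel. park() may return
            // spuriously, hence the re-check loop.
            while !ready.load(Ordering::Acquire) {
                thread::park();
            }
        })
    };

    // Producer side: publish, then wake the (possibly parked) waiter.
    // unpark() before park() just sets the token, so no wakeup is lost.
    ready.store(true, Ordering::Release);
    waiter.thread().unpark();
    waiter.join().unwrap();
}
```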
crossbeam-channel appears to use pure spinlocks (which never block in the kernel) from userspace, which is a bad idea. Looking at the behavior with `strace`, it appears to be `sched_yield`ing in a loop.

On my Linux system (Debian 5.15 kernel, standard configuration) with a simple little benchmark I wrote, I see this: […]
It takes over 20x longer with both threads pinned to the same CPU, because each thread repeatedly spins until its scheduler tick expires. In my full application, I'm using SCHED_RR in one thread, which makes this a full-on deadlock, because that thread will never yield and let the other thread run.
I'm running https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark, which is recent master (450d237) plus my benchmark.
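For anyone who doesn't want to check out that branch, here is a minimal sketch of an spsc benchmark of this shape (my own reconstruction, not the code in the linked branch; it requires the crossbeam-channel crate):

```rust
use std::thread;
use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;
    let (s, r) = crossbeam_channel::unbounded::<u64>();

    let start = Instant::now();
    let producer = thread::spawn(move || {
        for i in 0..N {
            s.send(i).unwrap();
        }
    });
    // Receive everything on the main thread, then report elapsed time.
    for _ in 0..N {
        r.recv().unwrap();
    }
    producer.join().unwrap();
    println!("unbounded_spsc_sync: {:?}", start.elapsed());
}
```

Running it once normally and once under `taskset 1` (pinning both threads to the same CPU) should reproduce the kind of difference shown above.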
This seems related to #366 and #447.
https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html goes into more detail about why this is not a good implementation strategy. It links to https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html, which has performance numbers indicating that this can be fixed at practically no cost in any case, with greatly increased speed in some cases. It also links to https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/, which shows some examples of just how bad this can get.