crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821

bsilver8192 · 2022-05-04T02:38:53Z

crossbeam-channel appears to use pure spinlocks (which never block in the kernel), from userspace, which is a bad idea. Looking at the behavior with strace, it appears to be sched_yielding in a loop.

On my Linux system (Debian 5.15 kernel, standard configuration) with a simple little benchmark I wrote, I see this:

brian[13116] rico (master) ~/Desktop/crossbeam
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/crossbeam-channel`
unbounded_spsc_sync       Rust crossbeam-channel   4.473 sec
brian[13117] rico (master) ~/Desktop/crossbeam
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/crossbeam-channel`
unbounded_spsc_sync       Rust crossbeam-channel 106.810 sec

It takes over 20x longer with both threads pinned to the same CPU, because each thread spins until its scheduler tick expires repeatedly. In my full application, I'm using SCHED_RR in one thread, which makes this a full-on deadlock because it will never yield and let the other thread run.

I'm running https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark, which is recent master (450d237) + my benchmark.

This seems related to #366 and #447.

https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html goes into more detail about why this is not a good implementation. It links to https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html which has some performance numbers indicating that this can be fixed with practically no cost in any case, and greatly increased speeds in some cases. It also links to https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/ which shows some examples of just how bad this can be.

The text was updated successfully, but these errors were encountered:

yjhn · 2022-08-31T13:29:35Z

Spinlock was replaced with Mutex from std in #835, but that somehow has not improved things at all. Old results on my laptop (from https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark):

$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel   4.685 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel  66.005 sec

New results (same benchmark, but rebased on latest master at the moment (7070018)):

$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel   4.562 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel  65.679 sec

Inspected with strace, both versions look mostly the same (a lot of sched_yield() with some futex(...) calls mixed in).
I am using rustc 1.64.0-beta.3 (82bf34178 2022-08-18).

yjhn · 2022-08-31T14:03:55Z

Running the benchmarks (both old and new) using taskset 1 with gdb and setting a breakpoint on sched_yield shows that most calls to sched_yield() are from Backoff::snooze(). Backoff is used in some of channel's internal data structures. This somewhat explains why there is almost no improvement from replacing the lock.

arthurprs · 2022-11-10T10:06:40Z

I can confirm this causes severe cpu overhead when using channel as a low throughput spsc channel, the spin locks eat away the cpu.

If you look at the channel ::recv source code you can tell it uses Backoff itself and Context::wait_until uses it again.

Furthermore Backoff seems to spin a bit too much before yielding to the scheduler? For comparison parking lot spin exponent is 3 vs crossbeam 6.

This is a profiling of a real application

arthurprs · 2022-11-11T10:54:50Z

I think this stems from (not ideal) micro benchmarking and benchmark marketing. By reducing spinning most channel benchmarks regress, because they are often running non-contented, with #actors <= #cpus available, and/or with no work between send/recv.

But if you make it contented (e.g. using taskset) or have significant work outside of recv/send calls the situation often reverses.

taiki-e · 2022-11-20T08:58:36Z

According to rust-lang/rust#93563 by @ibraheemdev, is the real problem here not the unbounded spinning in unbounded queue (i.e., wait_write pointed by #366), but the amount of spin_loop_hint and yield_now and the fact that it is being called even where it is not needed?

ibraheemdev · 2022-11-20T18:55:09Z

With the changes from std applied, which removes all extra spinning on recv, these are the new results on my machine:

before:
unbounded_spsc_sync       Rust crossbeam-channel   5.869 sec

before (taskset 1):
unbounded_spsc_sync       Rust crossbeam-channel  52.105 sec

after:
unbounded_spsc_sync       Rust crossbeam-channel  30.612 sec

after (taskset 1):
unbounded_spsc_sync       Rust crossbeam-channel  31.899 sec

Removing the spinning seems to affect throughput in this benchmark by a lot (which makes sense considering the benchmark), but as you can see taskset 1 had almost no effect. So it does look like the issue here is the extra spinning, not the blocking queue implementation.

bsilver8192 · 2022-11-20T22:33:51Z

I can confirm that std::sync::mpsc in Rust nightly 1.67.0-nightly (c5d82ed7a 2022-11-19) does not have this problem, and crossbeam-channel 0.5.6 does still have the worse performance. Also, crossbeam-channel 0.5.6 has already fixed the deadlock I originally posted.

My timing numbers (running on a different machine than before, still amd64 Linux, only one timed run of each without any care to reduce outside interference so don't trust small differences):

	unpinned	pinned	pinned equal realtime	pinned unequal realtime
crossbeam-channel `450d237`	2.505	63.061	66.994	deadlock
crossbeam-channel 0.5.6	2.454	63.629	66.208	102.530
Rust 1.65.0	17.778	11.566	11.292	8.133
Rust 1.67.0-nightly (c5d82ed7a 2022-11-19)	20.636	16.873	16.151	34.160

Some cautions about my test code, before I post it:

Do NOT run it on a single core machine, it WILL wedge your terminal.
It has some chance of breaking device drivers, with a slim chance of that leading to hardware damage. Rebooting to get certain devices working again is a very real possibility.
Running it on a CONFIG_PREEMPT_RT kernel is probably going to create problems.
Don't do it on a core that handles important interrupts, in my experience core 0 is usually the only one with dedicated interrupts but it depends.

I posted my (somewhat hacky) test code at https://github.com/bsilver8192/mpsc-spin-test if anybody else wants to try it. There are several commented-out sections to switch between the various tests.

OlegOAndreev · 2023-01-12T22:36:20Z

I really do not want to pile on, but currently the loop at the top of recv which boils down to
while !backoff.is_completed() { backoff.snooze(); }
takes ~70-90 microseconds on recent Intel CPU, which is 20-50 greater than the cost of context switch. This really seems a bit too much for polling an empty queue.

PSeitz · 2023-10-24T04:33:36Z

This is from a profile from tantivy-cli. In the profile it seems most time is spent in the spin:

            for _ in 0..1 << self.step.get() {
                // TODO(taiki-e): once we bump the minimum required Rust version to 1.49+,
                // use [`core::hint::spin_loop`] instead.
                #[allow(deprecated)]
                atomic::spin_loop_hint();
            }

The send flow is using mpmc channels with 100 capacity. mpmc, since the user can configure the number of threads, which in this case is set to 1. Indexing is slower, so senders have to wait typically.

stdio lines -> 1 thread parse JSON -> 1 thread index in tantivy

perf record --call-graph dwarf 
perf report --stdio

    46.78%     0.00%  tantivy          libc.so.6             [.] 0x00007f98910037cb
            |
            ---0x7f98910037cb
               0x7f9890f7f9ea
               std::sys::unix::thread::Thread::new::thread_start
               |          
               |--24.35%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1f6f2d69b5f287ba
               |          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h73769fc82e80e4bb (inlined)
               |          std::panic::catch_unwind (inlined)
               |          std::panicking::try (inlined)
               |          std::panicking::try::do_call (inlined)
               |          <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
               |          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h93815b136742a096 (inlined)
               |          std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
               |          tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h3facae76dbfd38bf (inlined)
               |          |          
               |          |--16.98%--crossbeam_channel::channel::Sender<T>::send
               |          |          |          
               |          |           --16.93%--crossbeam_channel::flavors::array::Channel<T>::send
               |          |                     |          
               |          |                     |--12.35%--crossbeam_utils::backoff::Backoff::snooze (inlined)
               |          |                     |          |          
               |          |                     |          |--10.68%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
               |          |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
               |          |                     |          |          |          
               |          |                     |          |           --10.05%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
               |          |                     |          |          
               |          |                     |           --1.64%--__sched_yield
               |          |                     |                     |          
               |          |                     |                      --1.42%--entry_SYSCALL_64_after_hwframe
               |          |                     |                                |          
               |          |                     |                                 --1.40%--do_syscall_64
               |          |                     |                                           |          
               |          |                     |                                            --1.24%--__x64_sys_sched_yield
               |          |                     |                                                      |          
               |          |                     |                                                       --1.01%--schedule
               |          |                     |                                                                 |          
               |          |                     |                                                                  --0.95%--__schedule
               |          |                     |                                                                            |          
               |          |                     |                                                                             --0.54%--pick_next_task_fair
               |          |                     |          
               |          |                     |--3.15%--crossbeam_channel::context::Context::with (inlined)
               |          |                     |          |          
               |          |                     |           --3.15%--std::thread::local::LocalKey<T>::try_with (inlined)
               |          |                     |                     |          
               |          |                     |                      --3.14%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hd64289cd2a2278bd (inlined)
               |          |                     |                                |          
               |          |                     |                                 --3.12%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h0b76ad33a3783422
               |          |                     |                                           |          
               |          |                     |                                            --3.12%--crossbeam_channel::flavors::array::Channel$LT$T$GT$::send::_$u7b$$u7b$closure$u7d$$u7d$::h00a9cec38e2e6332 (inlined)
               |          |                     |                                                      |          
               |          |                     |                                                       --3.00%--crossbeam_channel::context::Context::wait_until
               |          |                     |                                                                 |          
               |          |                     |                                                                  --2.42%--crossbeam_utils::backoff::Backoff::snooze (inlined)
               |          |                     |                                                                            |          
               |          |                     |                                                                             --2.24%--core::sync::atomic::spin_loop_hint (inlined)
               |          |                     |                                                                                       core::hint::spin_loop (inlined)
               |          |                     |                                                                                       core::core_arch::x86::sse2::_mm_pause (inlined)
               |          |                     |          
               |          |                      --1.32%--crossbeam_channel::flavors::array::Channel<T>::start_send (inlined)
               |          |          
               |          |--6.45%--tantivy::index_document::IndexDocument::parse_json
               |          |          |          
               |          |           --6.03%--tantivy::index_document::to_value (inlined)
               |          |                     |          
               |          |                     |--4.22%--simd_json::Deserializer::from_slice
               |          |                     |          |          
               |          |                     |           --3.68%--simd_json::Deserializer::from_slice_with_buffers
               |          |                     |                     |          
               |          |                     |                     |--2.45%--simd_json::Deserializer::find_structural_bits (inlined)
               |          |                     |                     |          |          
               |          |                     |                     |           --2.45%--simd_json::Deserializer::find_structural_bits_avx
               |          |                     |                     |                     |          
               |          |                     |                     |                      --2.44%--simd_json::Deserializer::_find_structural_bits (inlined)
               |          |                     |                     |                                |          
               |          |                     |                     |                                 --1.81%--<simd_json::avx2::stage1::SimdInputAVX as simd_json::Stage1Parse<core::core_arch::x86::__m256i>>::flatten_bits >
               |          |                     |                     |                                           |          
               |          |                     |                     |                                            --1.51%--alloc::vec::Vec<T,A>::reserve (inlined)
               |          |                     |                     |                                                      alloc::raw_vec::RawVec<T,A>::reserve (inlined)
               |          |                     |                     |                                                      |          
               |          |                     |                     |                                                       --1.44%--alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
               |          |                     |                     |                                                                 |          
               |          |                     |                     |                                                                  --1.44%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
               |          |                     |                     |                                                                            |          
               |          |                     |                     |                                                                             --1.42%--alloc::raw_vec::finish_grow (inlined)
               |          |                     |                     |                                                                                       |          
               |          |                     |                     |                                                                                        --1.41%--<alloc::alloc::Global as core::alloc::Allocator>::grow (>
               |          |                     |                     |                                                                                                  alloc::alloc::Global::grow_impl (inlined)
               |          |                     |                     |                                                                                                  alloc::alloc::realloc (inlined)
               |          |                     |                     |                                                                                                  |          
               |          |                     |                     |                                                                                                   --1.40%--realloc
               |          |                     |                     |          
               |          |                     |                      --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
               |          |                     |          
               |          |                      --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
               |          |                                |          
               |          |                                 --1.57%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                           |          
               |          |                                            --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
               |          |                                                      |          
               |          |                     |                     |                                                                                                   --1.40%--realloc
               |          |                     |                     |          
               |          |                     |                      --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
               |          |                     |          
               |          |                      --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
               |          |                                |          
               |          |                                 --1.57%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                           |          
               |          |                                            --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
               |          |                                                      |          
               |          |                                                      |--0.75%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                                      |          |          
               |          |                                                      |           --0.71%--tantivy::index_document::BorrowDeserializer::parse::_$u7b$$u7b$closure$u7d$$u7d$::he8db8863125cdbbb
               |          |                                                      |                     |          
               |          |                                                      |                      --0.55%--alloc::vec::Vec<T,A>::push (inlined)
               |          |                                                      |                                |          
               |          |                                                      |                                 --0.53%--alloc::raw_vec::RawVec<T,A>::reserve_for_push
               |          |                                                      |                                           |          
               |          |                                                      |                                            --0.53%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
               |          |                                                      |                                                      |          
               |          |                                                      |                                                       --0.52%--alloc::raw_vec::finish_grow (inlined)
               |          |                                                      |                                                                 |          
               |          |                                                      |                                                                  --0.50%--malloc
               |          |                                                      |          
               |          |                                                       --0.56%--tantivy::schema::schema::Schema::get_field
               |          |          
               |           --0.63%--<crossbeam_channel::channel::IntoIter<T> as core::iter::traits::iterator::Iterator>::next
               |                     |          
               |                      --0.62%--crossbeam_channel::channel::Receiver<T>::recv (inlined)
               |                                |          
               |                                 --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
               |          
                --22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
                          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
                          |          
                           --22.43%--std::panic::catch_unwind (inlined)
                                     std::panicking::try (inlined)
                                     std::panicking::try::do_call (inlined)
                                     <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
                                     std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
                                     std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
               |                                |          
               |                                 --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
               |          
                --22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
                          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
                          |          
                           --22.43%--std::panic::catch_unwind (inlined)
                                     std::panicking::try (inlined)
                                     std::panicking::try::do_call (inlined)
                                     <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
                                     std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
                                     std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
                                     |           --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
                                     |                     |          
                                     |                     |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
                                     |                     |          |          
                                     |                     |          |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
                                     |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
                                     |           --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
                                     |                     |          
                                     |                     |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
                                     |                     |          |          
                                     |                     |          |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
                                     |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
                                     |                     |          |          |          
                                     |                     |          |           --10.48%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
                                     |                     |          |          
                                     |                     |           --2.43%--__sched_yield

This was referenced May 9, 2022

Document that SegQueue is not completely lock free #675

Open

Add a ping-pong benchmark to crossbeam-channel #492

Open

arthurprs mentioned this issue Nov 13, 2022

Slight improvements to channel benchmark #925

Open

taiki-e mentioned this issue Nov 22, 2022

Replace custom spinlock with spin smol-rs/event-listener#41

Closed

matklad mentioned this issue Mar 31, 2023

On a wip branch, there are some subtle soundness issues with Lazy BurntSushi/regex-automata#30

Closed

Boscop mentioned this issue May 2, 2023

Multithreaded sequential programs spend most their active time waiting HigherOrderCO/HVM#184

Closed

uklotzde mentioned this issue Jun 19, 2023

Prevent out-of-order arrival of audio node settings orottier/web-audio-api-rs#297

Closed

orottier mentioned this issue Jun 21, 2023

Do not use unbounded channels in online audio processing orottier/web-audio-api-rs#304

Closed

matklad mentioned this issue Oct 17, 2023

crossbeam channel is very slow if restricted to a single core #1031

Open

ibraheemdev mentioned this issue Nov 7, 2023

Reduce backoff spinning/usage #1038

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821

crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821

bsilver8192 commented May 4, 2022

yjhn commented Aug 31, 2022 •

edited

Loading

yjhn commented Aug 31, 2022

arthurprs commented Nov 10, 2022 •

edited

Loading

arthurprs commented Nov 11, 2022 •

edited

Loading

taiki-e commented Nov 20, 2022 •

edited

Loading

ibraheemdev commented Nov 20, 2022

bsilver8192 commented Nov 20, 2022

OlegOAndreev commented Jan 12, 2023

PSeitz commented Oct 24, 2023

crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821

crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821

Comments

bsilver8192 commented May 4, 2022

yjhn commented Aug 31, 2022 • edited Loading

yjhn commented Aug 31, 2022

arthurprs commented Nov 10, 2022 • edited Loading

arthurprs commented Nov 11, 2022 • edited Loading

taiki-e commented Nov 20, 2022 • edited Loading

ibraheemdev commented Nov 20, 2022

bsilver8192 commented Nov 20, 2022

OlegOAndreev commented Jan 12, 2023

PSeitz commented Oct 24, 2023

yjhn commented Aug 31, 2022 •

edited

Loading

arthurprs commented Nov 10, 2022 •

edited

Loading

arthurprs commented Nov 11, 2022 •

edited

Loading

taiki-e commented Nov 20, 2022 •

edited

Loading