Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations #821

Open
bsilver8192 opened this issue May 4, 2022 · 9 comments

Comments

@bsilver8192
Copy link

crossbeam-channel appears to use pure spinlocks (which never block in the kernel), from userspace, which is a bad idea. Looking at the behavior with strace, it appears to be sched_yielding in a loop.

On my Linux system (Debian 5.15 kernel, standard configuration) with a simple little benchmark I wrote, I see this:

brian[13116] rico (master) ~/Desktop/crossbeam
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/crossbeam-channel`
unbounded_spsc_sync       Rust crossbeam-channel   4.473 sec
brian[13117] rico (master) ~/Desktop/crossbeam
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/crossbeam-channel`
unbounded_spsc_sync       Rust crossbeam-channel 106.810 sec

It takes over 20x longer with both threads pinned to the same CPU, because each thread spins until its scheduler tick expires repeatedly. In my full application, I'm using SCHED_RR in one thread, which makes this a full-on deadlock because it will never yield and let the other thread run.

I'm running https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark, which is recent master (450d237) + my benchmark.

This seems related to #366 and #447.

https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html goes into more detail about why this is not a good implementation. It links to https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html which has some performance numbers indicating that this can be fixed with practically no cost in any case, and greatly increased speeds in some cases. It also links to https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/ which shows some examples of just how bad this can be.

@yjhn
Copy link
Contributor

yjhn commented Aug 31, 2022

Spinlock was replaced with Mutex from std in #835, but that somehow has not improved things at all. Old results on my laptop (from https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark):

$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel   4.685 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel  66.005 sec

New results (same benchmark, but rebased on latest master at the moment (7070018)):

$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel   4.562 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel  65.679 sec

Inspected with strace, both versions look mostly the same (a lot of sched_yield() with some futex(...) calls mixed in).
I am using rustc 1.64.0-beta.3 (82bf34178 2022-08-18).

@yjhn
Copy link
Contributor

yjhn commented Aug 31, 2022

Running the benchmarks (both old and new) using taskset 1 with gdb and setting a breakpoint on sched_yield shows that most calls to sched_yield() are from Backoff::snooze(). Backoff is used in some of channel's internal data structures. This somewhat explains why there is almost no improvement from replacing the lock.

@arthurprs
Copy link

arthurprs commented Nov 10, 2022

I can confirm this causes severe cpu overhead when using channel as a low throughput spsc channel, the spin locks eat away the cpu.

If you look at the channel ::recv source code you can tell it uses Backoff itself and Context::wait_until uses it again.

Furthermore Backoff seems to spin a bit too much before yielding to the scheduler? For comparison parking lot spin exponent is 3 vs crossbeam 6.

This is a profiling of a real application

image

@arthurprs
Copy link

arthurprs commented Nov 11, 2022

I think this stems from (not ideal) micro benchmarking and benchmark marketing. By reducing spinning most channel benchmarks regress, because they are often running non-contented, with #actors <= #cpus available, and/or with no work between send/recv.

But if you make it contented (e.g. using taskset) or have significant work outside of recv/send calls the situation often reverses.

@taiki-e
Copy link
Member

taiki-e commented Nov 20, 2022

According to rust-lang/rust#93563 by @ibraheemdev, is the real problem here not the unbounded spinning in unbounded queue (i.e., wait_write pointed by #366), but the amount of spin_loop_hint and yield_now and the fact that it is being called even where it is not needed?

@ibraheemdev
Copy link
Contributor

With the changes from std applied, which removes all extra spinning on recv, these are the new results on my machine:

before:
unbounded_spsc_sync       Rust crossbeam-channel   5.869 sec

before (taskset 1):
unbounded_spsc_sync       Rust crossbeam-channel  52.105 sec

after:
unbounded_spsc_sync       Rust crossbeam-channel  30.612 sec

after (taskset 1):
unbounded_spsc_sync       Rust crossbeam-channel  31.899 sec

Removing the spinning seems to affect throughput in this benchmark by a lot (which makes sense considering the benchmark), but as you can see taskset 1 had almost no effect. So it does look like the issue here is the extra spinning, not the blocking queue implementation.

@bsilver8192
Copy link
Author

I can confirm that std::sync::mpsc in Rust nightly 1.67.0-nightly (c5d82ed7a 2022-11-19) does not have this problem, and crossbeam-channel 0.5.6 does still have the worse performance. Also, crossbeam-channel 0.5.6 has already fixed the deadlock I originally posted.

My timing numbers (running on a different machine than before, still amd64 Linux, only one timed run of each without any care to reduce outside interference so don't trust small differences):

unpinned pinned pinned equal realtime pinned unequal realtime
crossbeam-channel 450d237 2.505 63.061 66.994 deadlock
crossbeam-channel 0.5.6 2.454 63.629 66.208 102.530
Rust 1.65.0 17.778 11.566 11.292 8.133
Rust 1.67.0-nightly (c5d82ed7a 2022-11-19) 20.636 16.873 16.151 34.160

Some cautions about my test code, before I post it:

  • Do NOT run it on a single core machine, it WILL wedge your terminal.
  • It has some chance of breaking device drivers, with a slim chance of that leading to hardware damage. Rebooting to get certain devices working again is a very real possibility.
  • Running it on a CONFIG_PREEMPT_RT kernel is probably going to create problems.
  • Don't do it on a core that handles important interrupts, in my experience core 0 is usually the only one with dedicated interrupts but it depends.

I posted my (somewhat hacky) test code at https://github.com/bsilver8192/mpsc-spin-test if anybody else wants to try it. There are several commented-out sections to switch between the various tests.

@OlegOAndreev
Copy link

I really do not want to pile on, but currently the loop at the top of recv which boils down to
while !backoff.is_completed() { backoff.snooze(); }
takes ~70-90 microseconds on recent Intel CPU, which is 20-50 greater than the cost of context switch. This really seems a bit too much for polling an empty queue.

@PSeitz
Copy link

PSeitz commented Oct 24, 2023

This is from a profile from tantivy-cli. In the profile it seems most time is spent in the spin:

            for _ in 0..1 << self.step.get() {
                // TODO(taiki-e): once we bump the minimum required Rust version to 1.49+,
                // use [`core::hint::spin_loop`] instead.
                #[allow(deprecated)]
                atomic::spin_loop_hint();
            }

The send flow is using mpmc channels with 100 capacity. mpmc, since the user can configure the number of threads, which in this case is set to 1. Indexing is slower, so senders have to wait typically.

stdio lines -> 1 thread parse JSON -> 1 thread index in tantivy

perf record --call-graph dwarf 
perf report --stdio
    46.78%     0.00%  tantivy          libc.so.6             [.] 0x00007f98910037cb
            |
            ---0x7f98910037cb
               0x7f9890f7f9ea
               std::sys::unix::thread::Thread::new::thread_start
               |          
               |--24.35%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1f6f2d69b5f287ba
               |          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h73769fc82e80e4bb (inlined)
               |          std::panic::catch_unwind (inlined)
               |          std::panicking::try (inlined)
               |          std::panicking::try::do_call (inlined)
               |          <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
               |          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h93815b136742a096 (inlined)
               |          std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
               |          tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h3facae76dbfd38bf (inlined)
               |          |          
               |          |--16.98%--crossbeam_channel::channel::Sender<T>::send
               |          |          |          
               |          |           --16.93%--crossbeam_channel::flavors::array::Channel<T>::send
               |          |                     |          
               |          |                     |--12.35%--crossbeam_utils::backoff::Backoff::snooze (inlined)
               |          |                     |          |          
               |          |                     |          |--10.68%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
               |          |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
               |          |                     |          |          |          
               |          |                     |          |           --10.05%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
               |          |                     |          |          
               |          |                     |           --1.64%--__sched_yield
               |          |                     |                     |          
               |          |                     |                      --1.42%--entry_SYSCALL_64_after_hwframe
               |          |                     |                                |          
               |          |                     |                                 --1.40%--do_syscall_64
               |          |                     |                                           |          
               |          |                     |                                            --1.24%--__x64_sys_sched_yield
               |          |                     |                                                      |          
               |          |                     |                                                       --1.01%--schedule
               |          |                     |                                                                 |          
               |          |                     |                                                                  --0.95%--__schedule
               |          |                     |                                                                            |          
               |          |                     |                                                                             --0.54%--pick_next_task_fair
               |          |                     |          
               |          |                     |--3.15%--crossbeam_channel::context::Context::with (inlined)
               |          |                     |          |          
               |          |                     |           --3.15%--std::thread::local::LocalKey<T>::try_with (inlined)
               |          |                     |                     |          
               |          |                     |                      --3.14%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hd64289cd2a2278bd (inlined)
               |          |                     |                                |          
               |          |                     |                                 --3.12%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h0b76ad33a3783422
               |          |                     |                                           |          
               |          |                     |                                            --3.12%--crossbeam_channel::flavors::array::Channel$LT$T$GT$::send::_$u7b$$u7b$closure$u7d$$u7d$::h00a9cec38e2e6332 (inlined)
               |          |                     |                                                      |          
               |          |                     |                                                       --3.00%--crossbeam_channel::context::Context::wait_until
               |          |                     |                                                                 |          
               |          |                     |                                                                  --2.42%--crossbeam_utils::backoff::Backoff::snooze (inlined)
               |          |                     |                                                                            |          
               |          |                     |                                                                             --2.24%--core::sync::atomic::spin_loop_hint (inlined)
               |          |                     |                                                                                       core::hint::spin_loop (inlined)
               |          |                     |                                                                                       core::core_arch::x86::sse2::_mm_pause (inlined)
               |          |                     |          
               |          |                      --1.32%--crossbeam_channel::flavors::array::Channel<T>::start_send (inlined)
               |          |          
               |          |--6.45%--tantivy::index_document::IndexDocument::parse_json
               |          |          |          
               |          |           --6.03%--tantivy::index_document::to_value (inlined)
               |          |                     |          
               |          |                     |--4.22%--simd_json::Deserializer::from_slice
               |          |                     |          |          
               |          |                     |           --3.68%--simd_json::Deserializer::from_slice_with_buffers
               |          |                     |                     |          
               |          |                     |                     |--2.45%--simd_json::Deserializer::find_structural_bits (inlined)
               |          |                     |                     |          |          
               |          |                     |                     |           --2.45%--simd_json::Deserializer::find_structural_bits_avx
               |          |                     |                     |                     |          
               |          |                     |                     |                      --2.44%--simd_json::Deserializer::_find_structural_bits (inlined)
               |          |                     |                     |                                |          
               |          |                     |                     |                                 --1.81%--<simd_json::avx2::stage1::SimdInputAVX as simd_json::Stage1Parse<core::core_arch::x86::__m256i>>::flatten_bits >
               |          |                     |                     |                                           |          
               |          |                     |                     |                                            --1.51%--alloc::vec::Vec<T,A>::reserve (inlined)
               |          |                     |                     |                                                      alloc::raw_vec::RawVec<T,A>::reserve (inlined)
               |          |                     |                     |                                                      |          
               |          |                     |                     |                                                       --1.44%--alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
               |          |                     |                     |                                                                 |          
               |          |                     |                     |                                                                  --1.44%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
               |          |                     |                     |                                                                            |          
               |          |                     |                     |                                                                             --1.42%--alloc::raw_vec::finish_grow (inlined)
               |          |                     |                     |                                                                                       |          
               |          |                     |                     |                                                                                        --1.41%--<alloc::alloc::Global as core::alloc::Allocator>::grow (>
               |          |                     |                     |                                                                                                  alloc::alloc::Global::grow_impl (inlined)
               |          |                     |                     |                                                                                                  alloc::alloc::realloc (inlined)
               |          |                     |                     |                                                                                                  |          
               |          |                     |                     |                                                                                                   --1.40%--realloc
               |          |                     |                     |          
               |          |                     |                      --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
               |          |                     |          
               |          |                      --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
               |          |                                |          
               |          |                                 --1.57%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                           |          
               |          |                                            --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
               |          |                                                      |          
               |          |                     |                     |                                                                                                   --1.40%--realloc
               |          |                     |                     |          
               |          |                     |                      --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
               |          |                     |          
               |          |                      --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
               |          |                                |          
               |          |                                 --1.57%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                           |          
               |          |                                            --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
               |          |                                                      |          
               |          |                                                      |--0.75%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                                      |          |          
               |          |                                                      |           --0.71%--tantivy::index_document::BorrowDeserializer::parse::_$u7b$$u7b$closure$u7d$$u7d$::he8db8863125cdbbb
               |          |                                                      |                     |          
               |          |                                                      |                      --0.55%--alloc::vec::Vec<T,A>::push (inlined)
               |          |                                                      |                                |          
               |          |                                                      |                                 --0.53%--alloc::raw_vec::RawVec<T,A>::reserve_for_push
               |          |                                                      |                                           |          
               |          |                                                      |                                            --0.53%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
               |          |                                                      |                                                      |          
               |          |                                                      |                                                       --0.52%--alloc::raw_vec::finish_grow (inlined)
               |          |                                                      |                                                                 |          
               |          |                                                      |                                                                  --0.50%--malloc
               |          |                                                      |          
               |          |                                                       --0.56%--tantivy::schema::schema::Schema::get_field
               |          |          
               |           --0.63%--<crossbeam_channel::channel::IntoIter<T> as core::iter::traits::iterator::Iterator>::next
               |                     |          
               |                      --0.62%--crossbeam_channel::channel::Receiver<T>::recv (inlined)
               |                                |          
               |                                 --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
               |          
                --22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
                          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
                          |          
                           --22.43%--std::panic::catch_unwind (inlined)
                                     std::panicking::try (inlined)
                                     std::panicking::try::do_call (inlined)
                                     <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
                                     std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
                                     std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
               |                                |          
               |                                 --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
               |          
                --22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
                          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
                          |          
                           --22.43%--std::panic::catch_unwind (inlined)
                                     std::panicking::try (inlined)
                                     std::panicking::try::do_call (inlined)
                                     <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
                                     std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
                                     std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
                                     |           --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
                                     |                     |          
                                     |                     |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
                                     |                     |          |          
                                     |                     |          |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
                                     |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
                                     |           --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
                                     |                     |          
                                     |                     |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
                                     |                     |          |          
                                     |                     |          |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
                                     |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
                                     |                     |          |          |          
                                     |                     |          |           --10.48%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
                                     |                     |          |          
                                     |                     |           --2.43%--__sched_yield

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

7 participants