# std_spsc_is_slow

As of rustc 1.20.0 (f3d6973f4 2017-08-27), the standard library's mpsc has an optimization for the case where there is exactly one sender and one receiver. Unfortunately, this optimized version currently seems to be slower than the regular multi-producer code. This repository contains investigations into that slowdown.

## Basic Benchmark

The basic benchmark, based on one from crossbeam, is:

```rust
// Assumed to be in scope from the rest of the crate and its dependencies:
// `channel`, `Sender`, and `Receiver` from `std::sync::mpsc`; `scope` for
// scoped threads (e.g. crossbeam's); `black_box` to keep the compiler from
// optimizing the sends and receives away; and `nanos`, which converts a
// `Duration` to nanoseconds as an `f64`.

fn bench_mpsc_stream() -> f64 {
    let (sender, receiver) = channel();
    bench_spsc(sender, receiver)
}

fn bench_mpsc_shared() -> f64 {
    let (sender, receiver) = channel();
    // this clone forces the queue into shared mode and makes the benchmark faster
    let _clone = sender.clone();
    bench_spsc(sender, receiver)
}

const COUNT: u64 = 10_000_000;

fn bench_spsc(tx: Sender<u64>, rx: Receiver<u64>) -> f64 {
    // ensure that the channel has been upgraded out of its initial oneshot mode
    tx.send(0).unwrap();
    tx.send(0).unwrap();
    rx.recv().unwrap();
    rx.recv().unwrap();

    let start = ::std::time::Instant::now();
    scope(|scope| {
        scope.spawn(move || {
            for x in 0..(COUNT * 2) {
                let _ = black_box(tx.send(x));
            }
        });

        for _i in 0..(COUNT * 2) {
            let _ = black_box(rx.recv().unwrap());
        }
    });
    let d = start.elapsed();

    // average time per send, in nanoseconds
    nanos(d) / ((COUNT * 2) as f64)
}
```
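The `nanos` helper above is defined elsewhere in the repository; a plausible minimal version (an assumption for illustration, not necessarily the repository's exact code) is:

```rust
// Hypothetical helper: total duration in nanoseconds as an f64.
fn nanos(d: std::time::Duration) -> f64 {
    d.as_secs() as f64 * 1_000_000_000.0 + f64::from(d.subsec_nanos())
}
```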

It can be run on stable with `cargo run --release`, and, on a recent Intel machine, typically produces output like:

```
spsc stream        185 ns/send
spsc shared        112 ns/send
```

## Other Investigations

The repo also contains investigations into what may be causing this slowdown, along with other explorations of mpsc performance that occurred along the way, currently focusing on the underlying data structures.

These benchmarks can be run with `cargo +nightly run --release --features "queue_experiments"`, and the results from a typical run are:

```
spsc stream        201 ns/send
spsc shared        134 ns/send
----
mpmc baseline       59 ns/send
aligned             46 ns/send
----
spsc baseline       75 ns/send
bigger cache        67 ns/send
aligned             97 ns/send
unbounded           30 ns/send
no cache            46 ns/send
unbounded, aligned  12 ns/send
no cache, aligned   47 ns/send
----
less contention spsc  35 ns/send
aligned                9 ns/send
aligned, size =    1   9 ns/send
aligned, size =    8  10 ns/send
aligned, size =   16   9 ns/send
aligned, size =   32  11 ns/send
aligned, size =   64  10 ns/send
aligned, size =  128  13 ns/send
aligned, size =  256  12 ns/send
aligned, size =  512  13 ns/send
aligned, size = 1024   9 ns/send
----
stream baseline      205 ns/send
aligned              212 ns/send
no cache             103 ns/send
aligned, no cache     91 ns/send
less contend         179 ns/send
less contend aligned 102 ns/send
----
stream2 baseline     131 ns/send
aligned              112 ns/send
no cache              91 ns/send
aligned, no cache     58 ns/send
less contend         103 ns/send
less contend aligned  27 ns/send
```

From this I draw the following tentative conclusions:

  1. There is false sharing happening in `mpsc_queue` (see the padding sketch after this list).
  2. The current fixed-size node cache in `spsc_queue` appears to be a pessimization, mainly due to contention over the counters. A version that keeps counters only on the consumer side (shown in the last set above) can perform on par with an unbounded cache.
  3. False sharing may also become an issue for `spsc_queue`, but it is currently hidden by other overheads.
  4. The remaining slowdown seems to be due to the shared counter in `stream`, though I am unsure whether this is due to the counter itself or to it simply sharing a cache line with something else.
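For reference, the kind of cache-line padding the "aligned" variants refer to can be sketched as follows. This is only an illustration of the technique on current Rust (using `#[repr(align(64))]`); the actual layout used in the experiments may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Force the wrapped value onto its own 64-byte cache line so that two
/// counters written by different threads never share a line.
#[repr(align(64))]
struct CachePadded<T>(T);

struct Indices {
    head: CachePadded<AtomicUsize>, // advanced by the producer
    tail: CachePadded<AtomicUsize>, // advanced by the consumer
}

fn main() {
    let idx = Indices {
        head: CachePadded(AtomicUsize::new(0)),
        tail: CachePadded(AtomicUsize::new(0)),
    };
    idx.head.0.fetch_add(1, Ordering::Release);
    assert_eq!(idx.tail.0.load(Ordering::Acquire), 0);
    // Each field starts on its own cache line, so producer updates to `head`
    // do not invalidate the line holding `tail`.
    assert_eq!(std::mem::align_of::<CachePadded<AtomicUsize>>(), 64);
    assert!(std::mem::size_of::<Indices>() >= 128);
}
```

crossbeam provides a ready-made `CachePadded` type for the same purpose.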
