
mpsc stream is currently a pessimization #44512

Closed
JLockerman opened this issue Sep 12, 2017 · 3 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@JLockerman (Contributor)

TL;DR
As of rustc 1.22.0-nightly (981ce7d8d 2017-09-03), cloning an `mpsc::Sender` speeds up spsc workloads. My benchmarking seems to indicate that this is due to:

  1. Contention on shared counters in the node cache (push, pop).

  2. False sharing between the producer and consumer fields in the underlying queue.

  3. False sharing or contention in the wakeup code.

2 can be fixed by simply changing the alignment of the offending members so that the producer and consumer parts are on separate cache lines. 1 can be fixed with a small rewrite so that the queue only tracks its cache size at the consumer (a version of this code can be found here). 3 can be mitigated by reworking the alignment, but I am not sure that is a full fix; rewriting the send and recv logic so that no shared counter is needed may also fix the issue (a version of the code that does this can be found here and here, but it is not complete).
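The false-sharing fix in point 2 amounts to padding the producer and consumer halves onto separate cache lines. A minimal sketch, assuming a 64-byte line (the `CacheAligned` and `Queue` names here are illustrative, not the actual std internals):

```rust
use std::mem;

// Illustrative only: force a value onto its own 64-byte cache line so the
// producer's and consumer's hot fields cannot falsely share a line.
#[repr(align(64))]
struct CacheAligned<T>(T);

// Sketch of a queue whose producer and consumer halves live on separate lines.
struct Queue {
    head: CacheAligned<u64>, // producer side
    tail: CacheAligned<u64>, // consumer side
}

fn main() {
    // Each wrapped field starts on a 64-byte boundary, so the struct
    // spans at least two cache lines.
    assert_eq!(mem::align_of::<CacheAligned<u64>>(), 64);
    assert!(mem::size_of::<Queue>() >= 128);
    println!("ok");
}
```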


In the following benchmark:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// `scope` is crossbeam's scoped-thread API; `black_box` and `nanos` are
// benchmark helpers from the linked repo.

fn bench_mpsc_stream() -> f64 {
    let (sender, receiver) = channel();
    bench_spsc(sender, receiver)
}

fn bench_mpsc_shared() -> f64 {
    let (sender, receiver) = channel();
    // this clone forces the queue into shared mode and makes the benchmark faster
    let _clone = sender.clone();
    bench_spsc(sender, receiver)
}

const COUNT: u64 = 10_000_000;

fn bench_spsc(tx: Sender<u64>, rx: Receiver<u64>) -> f64 {
    // ensure that the channel is not in oneshot mode
    tx.send(0).unwrap();
    tx.send(0).unwrap();
    rx.recv().unwrap();
    rx.recv().unwrap();

    let start = ::std::time::Instant::now();
    scope(|scope| {
        scope.spawn(move || {
            for x in 0..(COUNT * 2) {
                let _ = black_box(tx.send(x));
            }
        });

        for _i in 0..(COUNT * 2) {
            let _ = black_box(rx.recv().unwrap());
        }
    });
    let d = start.elapsed();

    nanos(d) / ((COUNT * 2) as f64)
}
```

(derived from crossbeam, full code can be found in this repo)

on an early 2014 Intel i5 running macOS, using rustc 1.22.0-nightly (981ce7d8d 2017-09-03) or rustc 1.20.0 (f3d6973f4 2017-08-27), `stream` runs at roughly 201 ns/send while `shared` runs at 134 ns/send.
Running on Linux on EC2 and on a Raspberry Pi shows similar behavior.

The underlying data structures show some difference in performance (spsc queue 75 ns/send, mpsc 59 ns/send), though not enough to fully explain the gap. While I have not yet looked into the mpsc code enough to be sure of the cause, I did find potential improvements for spsc:

| SPSC Performance | Unaligned | Cache Aligned |
|---|---|---|
| Default Cache | 75 ns/send | 97 ns/send |
| No Cache | 46 ns/send | 47 ns/send |
| Unbounded Cache | 30 ns/send | 12 ns/send |
| Low Contention Cache | 35 ns/send | 9 ns/send |

Where Unaligned is the current struct layout, and Cache Aligned places the consumer fields and producer fields each on their own cache line (code is here).
Default Cache is the current node-cache implementation in std, No Cache disables the node cache entirely (producer code, consumer code), and Unbounded Cache should be self-explanatory.

Low Contention Cache rewrites the cache-bounding logic to run entirely on the consumer side.
Instead of keeping a count of how many nodes are in the cache, it tracks how many nodes
are marked as eligible to be cached, and caches exactly those marked nodes (code for this can be found here; my experimental implementation stores the eligible flag in the node, though it could also be done by stealing a bit from the pointer).

Some of these performance improvements translate to the full stream structure:

| Stream Performance | Unaligned | Cache Aligned |
|---|---|---|
| Default Cache | 205 ns/send | 212 ns/send |
| No Cache | 103 ns/send | 91 ns/send |
| Low Contention Cache | 179 ns/send | 102 ns/send |

But to fully see the benefits, contention with the wakeup state needs to be removed

| Stream2 Performance | Unaligned | Cache Aligned |
|---|---|---|
| Default Cache | 131 ns/send | 112 ns/send |
| No Cache | 91 ns/send | 58 ns/send |
| Low Contention Cache | 103 ns/send | 27 ns/send |

(These numbers were collected from a version of stream that does not use a counter at all. I've gotten similar numbers from simply putting every field in stream on its own cache line. I think there should be a layout that uses exactly two lines, one for the producer and one for the consumer, with similar performance, but I have not done enough benchmarking to confirm it yet.)

All the code to reproduce these numbers can be found in this repo, along with the numbers for a Raspberry Pi.
Note that the Raspberry Pi seemed to be throttling as the benchmark ran, so numbers gathered later in the run are significantly worse than those gathered at the beginning.

My apologies for the length and quality of writing.

cc @alexcrichton I believe.

@alexcrichton (Member)

Holy cow, this is some awesome and thorough investigation, thanks so much @JLockerman! These would certainly be some more-than-welcome improvements, and everything here makes sense to me. Would you be interested in sending a PR?

@JLockerman (Contributor, Author)

Thanks 😊 .
I'm interested in submitting a PR, but I'm not going to have time to do so until the 27th.
If no one else does it before me I'll submit one around then.

(I'm also planning to perform a similar analysis on MPSC shared sometime next month.
Benchmarking it will be a bit more complicated, but I suspect contention on its shared counter might be a scalability bottleneck.)

@alexcrichton (Member)

Ok sure thing and no worries! I suspect Rust will for sure still be there on the 27th!

@Mark-Simulacrum added the C-enhancement and I-slow labels Sep 17, 2017
bors added a commit that referenced this issue Oct 11, 2017
Improve performance of spsc_queue and stream.

This PR makes two main changes:

1. It switches the `spsc_queue` node-caching strategy from keeping a shared
counter of the number of nodes in the cache to keeping a consumer-only counter
of the number of nodes eligible to be cached.
2. It separates the producer and consumer fields of `spsc_queue` and `stream` into
a producer cache line and a consumer cache line.

Overall, it speeds up `mpsc` in `spsc` mode by 2-10x.
Variance is higher than I'd like (that 2-10x speedup is on one benchmark); I believe this is due to the drop check in `send` (`fn stream::Queue::send:107`). I think this check can be combined with the sleep-detection code into a version that uses only one shared variable, and only one atomic access per `send`, but I haven't looked through the select implementation enough to be sure.

The code currently assumes a cache line size of 64 bytes. I added a CacheAligned newtype in `mpsc` which I expect to reuse for `shared`. It doesn't really belong there, it would probably be best put in `core::sync::atomic`, but putting it in `core` would involve making it public, which I thought would require an RFC.

Benchmark runner is [here](https://github.com/JLockerman/queues/tree/3eca46279c53eb75833c5ecd416de2ac220bd022/shootout), benchmarks [here](https://github.com/JLockerman/queues/blob/3eca46279c53eb75833c5ecd416de2ac220bd022/queue_bench/src/lib.rs#L170-L293).

Fixes #44512.