Rayon uses a lot of CPU when there's not a lot of work to do #642
A profile on macOS shows most of the time being spent in `sched_yield`.

Also, a similar program written using the …

@lukewagner here's an example of some of the bad.
Rewriting this using crossbeam-channel produces an executable that only uses 4% CPU:

```rust
use crossbeam_channel::unbounded;
use std::thread;
use std::time::Duration;

fn main() {
    let (s, r) = unbounded();
    for _ in 0..8 {
        let r = r.clone();
        thread::spawn(move || loop {
            // recv() blocks until a message arrives; no busy-waiting while idle.
            let _m: i32 = r.recv().unwrap();
        });
    }
    loop {
        thread::sleep(Duration::from_millis(1));
        s.send(0).unwrap();
    }
}
```

Perhaps rayon should use the same wakeup infrastructure that crossbeam-channel uses?
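For reference, crossbeam-channel's blocking receive is event-driven: an idle receiver parks its thread and is woken only when a sender hands it a message. Below is a minimal sketch of that park/unpark pattern; it is illustrative only, not crossbeam's actual internals.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let queue: Arc<Mutex<VecDeque<i32>>> = Arc::new(Mutex::new(VecDeque::new()));

    // Consumer: parks when the queue is empty, so it burns no CPU while idle.
    let q = Arc::clone(&queue);
    let consumer = thread::spawn(move || loop {
        let item = q.lock().unwrap().pop_front();
        match item {
            Some(_m) => { /* process the message */ }
            None => thread::park(), // sleep until the producer unparks us
        }
    });

    // Producer: pushes an item and wakes exactly the thread that is waiting for it.
    let waker = consumer.thread().clone();
    loop {
        thread::sleep(Duration::from_millis(1));
        queue.lock().unwrap().push_back(0);
        waker.unpark();
    }
}
```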
From https://github.com/rayon-rs/rayon/blob/master/rayon-core/src/sleep/README.md, it seems rayon will repeatedly yield to the operating system before a thread goes to sleep. How does this work with respect to syscall overhead? I think yielding to the scheduler is almost as expensive as blocking on a condition variable?
The net performance of blocking has to include the cost of the syscall to wake the thread up, too. Even if the raw cost were the same, there's also a latency difference in fully blocking. The …
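For illustration only, here is a minimal sketch of the yield-then-block pattern the sleep README describes, not rayon's actual implementation; the structure and the names `Sleeper` and `YIELD_ROUNDS` are assumptions made for this example.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

const YIELD_ROUNDS: u32 = 64; // hypothetical tuning knob

struct Sleeper {
    has_work: Mutex<bool>, // set to true once work has been published
    cv: Condvar,
}

impl Sleeper {
    fn wait_for_work(&self) {
        // Phase 1: stay runnable and yield, hoping work shows up quickly.
        for _ in 0..YIELD_ROUNDS {
            if *self.has_work.lock().unwrap() {
                return;
            }
            thread::yield_now();
        }
        // Phase 2: block on the condvar; waking us now costs the waker a syscall.
        let mut guard = self.has_work.lock().unwrap();
        while !*guard {
            guard = self.cv.wait(guard).unwrap();
        }
    }

    fn publish_work(&self) {
        *self.has_work.lock().unwrap() = true;
        self.cv.notify_one();
    }
}

fn main() {
    let sleeper = Arc::new(Sleeper {
        has_work: Mutex::new(false),
        cv: Condvar::new(),
    });
    let s = sleeper.clone();
    let worker = thread::spawn(move || s.wait_for_work());
    thread::sleep(Duration::from_millis(5));
    sleeper.publish_work();
    worker.join().unwrap();
}
```

The trade-off discussed above lives in `YIELD_ROUNDS`: more rounds keep wake-up latency low when work arrives quickly but burn CPU in the scheduler, while fewer rounds mean the publisher pays the wake-up syscall more often.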
If I change …

So not as much speedup, but lower CPU usage.
A few links to what might be making the problem worse: …
I've been working on a rewrite of the sleep system that I think will address this problem. I opened #691 with an immediate refactoring, but I'm working on a more extensive rewrite. I hope to open an RFC in a few days with a sketch of the algorithm plus a few alternatives I'd like to evaluate. The gist of it is this: …

We'll have to benchmark it, of course, and we'll probably want to tune things.
Opened: rayon-rs/rfcs#5 with a fairly complete description of what I have in mind. I've not implemented all of that yet, but I'm getting there.
Using the algorithm from the RFC seems to reduce this example to negligible CPU time: …
I've been experimenting with this case where work is spawned every millisecond:

```rust
fn main() {
    loop {
        std::thread::sleep(std::time::Duration::from_millis(1));
        rayon::spawn(move || {});
    }
}
```

With the current release of … With the … Excited to see where this goes.
695: a few tweaks to the demos r=cuviper a=nikomatsakis

A few changes that I am making on my branch that I would like applied to master. This renames some benchmarks to make them uniquely identifiable and also adds a test to measure #642.

Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
Yeah. I'm trying that now.
We do ultimately wait on condvars, but there's some spinning on …
I think I might have encountered this issue. It involves work that is easy to parallelize (essentially order-insensitive stream processing using …). I was unable to reproduce the issue with simpler minimal examples, where Rayon seemed to distribute the load nicely. So I rewrote the code to use a …

I suspect that the issue is that the work-giving thread (the one that reads the stream) is limited by IO and is not providing work fast enough to saturate 64 CPUs. The workload will never utilize all 64 CPUs - and that's fine. However, with Rayon it might happen that it provides just enough work to hit this bug and make all cores 100% busy, but with so much of the time spent in busy-wait that it actually prevents them from doing useful work. Reducing the number of threads fixes the issue because the bug doesn't appear when there is enough work. The bug doesn't appear on my laptop for the same reason - it doesn't have enough CPU horsepower to outmatch the IO bandwidth of the reader thread.

Sadly, upgrading Rayon to 1.4.0, which as I understand it should include #746, didn't seem to make a difference. Is there some way to test whether my program triggers the issue reported here?

(All of the above experiments were performed with release builds.)
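For concreteness, here is a hypothetical sketch of the workload shape described above (the actual code isn't included in this report): a single IO-bound reader feeding an order-insensitive Rayon pipeline through `par_bridge()`. The file name and per-item work are placeholders.

```rust
use rayon::iter::{ParallelBridge, ParallelIterator};
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // "input.txt" is a placeholder; the point is that this reader is IO-bound.
    let reader = BufReader::new(File::open("input.txt")?);

    reader
        .lines()
        .filter_map(Result::ok)
        // Hand items to the global Rayon pool as they arrive. If the reader
        // can't produce items fast enough to keep 64 workers busy, the pool
        // may spend its time looking for work rather than doing it.
        .par_bridge()
        .for_each(|line| {
            // Placeholder for the real per-item processing.
            let _ = line.len();
        });

    Ok(())
}
```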
I'm going to close this issue now that 1.4.0 is published. @hniksic could you open a new issue for your case? There are definitely some scalability issues with …
@cuviper I was loath to open an issue since I couldn't provide a minimal example that reproduces the bug, and I tried. If you think the above description would be useful enough as an issue of its own, I'll gladly make one.
@hniksic Issues are not a scarce resource. 🙂 We can at least discuss the problem in the abstract, and there might be other people who want to chime in and may have something more reproducible.
The following program uses about 30% of the CPU on 4-core (8 HT) Linux and Mac machines.
Reducing the sleep duration to 1 ms pushes the CPU usage up to 200%.
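The program itself didn't survive in this copy of the issue; a minimal sketch consistent with the description, assuming the original submitted a trivial task to the global Rayon pool every ~10 ms (mirroring the 1 ms variant shown earlier in the thread), might look like this:

```rust
// Hypothetical reconstruction, not the reporter's original program.
fn main() {
    loop {
        // Sleep ~10 ms between submissions; dropping this to 1 ms is what
        // pushes the reported CPU usage from ~30% up to ~200%.
        std::thread::sleep(std::time::Duration::from_millis(10));
        rayon::spawn(|| {
            // Trivial work; the pool is idle almost all of the time.
        });
    }
}
```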