Grab bag of runtime optimizations #8734

thestinger · 2013-08-24T06:52:12Z

Here are a bunch of small optimizations that add up to a 36% improvement on one particular message-passing benchmark.

After this and @toddaaro's optimizations from #8566, the next biggest wins are probably going to be avoiding the event loop, which is worth another 25%, and using a less-allocating channel implementation (not sure how much this wins, but it should be a lot). Beyond that, there is still the important optimization of using the stack pointer for TLS instead of the TLS API, as well as reducing lock contention; identifying and reducing other syscalls, page faults, context switches, and allocations; and recycling tasks. Codegen improvements may help too, as there appears to be some nonsense in the assembly that one wouldn't write by hand.

There are a couple of notable changes here:

- 9692175 Turns `rtassert!` off for optimized builds, adding a new constant that can be queried before running expensive sanity checks: `pub static ENFORCE_SANITY: bool = !cfg!(rtopt) || cfg!(rtdebug) || cfg!(rtassert)`. `cfg(rtopt)` is turned on by makefiles.
- The two commits that reduce lock contention do so by adding unsynchronized checks of the length of queues in strategic places. Under load this makes the `SleeperList` in particular virtually uncontended (and it was extremely contended before).
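
As a rough illustration of the first change, here is a minimal sketch in present-day Rust (not the 2013 syntax used in the PR) of a compile-time constant gating an expensive check. The `rtassert!` macro below is a simplified stand-in rather than the runtime's actual macro, and `pop_checked` is a hypothetical function invented for the example.

```rust
/// Compile-time switch modeled on the `ENFORCE_SANITY` constant described above.
/// `rtopt`, `rtdebug`, and `rtassert` are custom cfg flags (for example passed via
/// `--cfg rtopt`); none of them is set by default, so this is `true` unless
/// the build opts out.
#[allow(unknown_lints, unexpected_cfgs)]
pub const ENFORCE_SANITY: bool = !cfg!(rtopt) || cfg!(rtdebug) || cfg!(rtassert);

/// Simplified stand-in for `rtassert!` (an assumption, not the real macro):
/// the condition is only evaluated when sanity enforcement is on.
macro_rules! rtassert {
    ($cond:expr) => {
        if ENFORCE_SANITY && !$cond {
            panic!("runtime assertion failed: {}", stringify!($cond));
        }
    };
}

/// Hypothetical queue operation guarded by an O(n) invariant check.
fn pop_checked(queue: &mut Vec<u32>) -> Option<u32> {
    // Because ENFORCE_SANITY is a constant, the compiler can drop this
    // whole check when the constant is false.
    rtassert!(queue.windows(2).all(|w| w[0] <= w[1]));
    queue.pop()
}
```

Because the `&&` short-circuits, the sortedness scan is not evaluated at all when `ENFORCE_SANITY` is false, which is the point of querying the constant before an expensive check.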

Before (just with #8566):

brian@brian-X1:~/dev/rust-sched-bench$ RUST_THREADS=4 perf stat -- ./pingpong-rust
 Performance counter stats for './pingpong-rust':

       5873.315864 task-clock                #    3.882 CPUs utilized
             9,861 context-switches          #    0.002 M/sec
                23 cpu-migrations            #    0.004 K/sec
             1,438 page-faults               #    0.245 K/sec
    17,321,368,036 cycles                    #    2.949 GHz
    11,409,886,127 stalled-cycles-frontend   #   65.87% frontend cycles idle
   <not supported> stalled-cycles-backend
    13,352,527,802 instructions              #    0.77  insns per cycle
                                             #    0.85  stalled cycles per insn
     3,013,346,012 branches                  #  513.057 M/sec
        47,494,193 branch-misses             #    1.58% of all branches

       1.512948956 seconds time elapsed

After (these opts + #8566):

brian@brian-X1:~/dev/rust-sched-bench$ RUST_THREADS=4 perf stat -- ./pingpong-rust
Performance counter stats for './pingpong-rust':

       3682.534190 task-clock                #    3.863 CPUs utilized
             5,622 context-switches          #    0.002 M/sec
                24 cpu-migrations            #    0.007 K/sec
             1,432 page-faults               #    0.389 K/sec
    10,979,235,304 cycles                    #    2.981 GHz
     7,260,103,584 stalled-cycles-frontend   #   66.13% frontend cycles idle
   <not supported> stalled-cycles-backend
     8,557,987,829 instructions              #    0.78  insns per cycle
                                             #    0.85  stalled cycles per insn
     1,766,533,692 branches                  #  479.706 M/sec
        38,529,078 branch-misses             #    2.18% of all branches

       0.953177796 seconds time elapsed

And here is how Go does on the same benchmark:

brian@brian-X1:~/dev/rust-sched-bench$ GOMAXPROCS=4 perf stat ./pingpong-go
 Performance counter stats for './pingpong':

        990.779438 task-clock                #    3.874 CPUs utilized          
               517 context-switches          #    0.522 K/sec                  
                12 cpu-migrations            #    0.012 K/sec                  
               219 page-faults               #    0.221 K/sec                  
     2,949,207,854 cycles                    #    2.977 GHz                    
     1,913,919,758 stalled-cycles-frontend   #   64.90% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     2,672,127,904 instructions              #    0.91  insns per cycle        
                                             #    0.72  stalled cycles per insn
       609,300,846 branches                  #  614.971 M/sec                  
         9,815,707 branch-misses             #    1.61% of all branches        

       0.255722866 seconds time elapsed
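
For a quick comparison of the three runs, the elapsed times can be turned into ratios. This is only a sketch over the numbers printed above; the constant and function names are made up for the example.

```rust
/// Elapsed wall-clock seconds copied from the three `perf stat` runs above.
const RUST_BEFORE_S: f64 = 1.512948956;
const RUST_AFTER_S: f64 = 0.953177796;
const GO_S: f64 = 0.255722866;

/// Fractional reduction in elapsed time going from `before` to `after`.
fn reduction(before: f64, after: f64) -> f64 {
    1.0 - after / before
}

/// How many times longer `a` takes than `b`.
fn slowdown(a: f64, b: f64) -> f64 {
    a / b
}
```

With these inputs, `reduction(RUST_BEFORE_S, RUST_AFTER_S)` comes out at about 0.37, in line with the roughly 36% quoted in the description, and `slowdown(RUST_AFTER_S, GO_S)` at about 3.7, so the Rust version still takes several times as long as the Go one on this benchmark.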

Here's what the profile looks like after these optimizations, those in #8566, and a hacked up optimization to not hit epoll (not included in this PR):

+   8.85%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::try_send_inner_6667::_a19fb8784b62bf94::_0$x2e0
+   7.09%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::try_recv_6722::_9022caef57e17a4::_0$x2e0
+   5.15%  pingpong-rust  libc-2.17.so                        [.] malloc
+   4.45%  pingpong-rust  libc-2.17.so                        [.] _int_free
+   3.39%  pingpong-rust  pingpong-rust                       [.] std..rt..comm..ChanOne$LT$std..rt..comm..StreamPayload$LT$$LP$$RP$$GT$$GT$::_a4c4e2b19dcf3b21::glue_drop_6336
+   3.30%  pingpong-rust  libpthread-2.17.so                  [.] pthread_getspecific
+   3.08%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29695::change_task_context::_e41693fc69da04e::_0$x2e8$x2dpre
+   2.62%  pingpong-rust  libc-2.17.so                        [.] _int_malloc
+   2.53%  pingpong-rust  libpthread-2.17.so                  [.] pthread_mutex_lock
+   2.31%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::switch_running_tasks_and_then::anon::expr_fn_29766
+   2.30%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] unstable::atomics::__extensions__::meth_20980::new::_cc85e4d2d8dd7469::_0$x2e8$x2dpre
+   2.12%  pingpong-rust  libpthread-2.17.so                  [.] __pthread_mutex_unlock_usercnt
+   2.04%  pingpong-rust  librustrt.so                        [.] swap_registers
+   1.97%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29405::run_sched_once::_f3525925b944a51::_0$x2e8$x2dpre
+   1.74%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::drop_3677::_ae547a17a7ff6ea9::_0$x2e0
+   1.67%  pingpong-rust  libc-2.17.so                        [.] free
+   1.56%  pingpong-rust  pingpong-rust                       [.] rt::comm::oneshot_3615::_5a94a9b33e484f55::_0$x2e0
+   1.54%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_21144::enqueue_task::_ce3b62147171303e::_0$x2e8$x2dpre
+   1.48%  pingpong-rust  libpthread-2.17.so                  [.] pthread_setspecific
+   1.47%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::thread_local_storage::pthread_getspecific::_741ee83c8a1aea64::_0$x2e8$x2dpre
+   1.43%  pingpong-rust  libc-2.17.so                        [.] pthread_mutex_lock
+   1.39%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29730::schedule_task::_373b465acd34aec7::_0$x2e8$x2dpre
+   1.36%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::local_ptr::unsafe_borrow_32251::_32439a71381e9371::_0$x2e8$x2dpre
+   1.34%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::task::__extensions__::meth_28172::take_unwrap_home::_8ff7f630b5c4cb79::_0$x2e8$x2dpre
+   1.34%  pingpong-rust  pingpong-rust                       [.] _$UP$std..rt..task..Task::_6757b7c895de82::glue_drop_3896
+   1.33%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::try_recv_6760::anon::expr_fn_6765
+   1.29%  pingpong-rust  pingpong-rust                       [.] ping_pong_bench::run_pair::anon::anon::expr_fn_6583
+   1.25%  pingpong-rust  pingpong-rust                       [.] std..option..Option$LT$$UP$std..rt..sched..Scheduler$GT$::_ef4d43fe1137a657::glue_drop_5755
+   1.22%  pingpong-rust  pingpong-rust                       [.] rt::local::__extensions__::meth_5148::take::_87eba8df7b0468::_0$x2e0
+   1.20%  pingpong-rust  libc-2.17.so                        [.] __memmove_ssse3_back
+   1.19%  pingpong-rust  pingpong-rust                       [.] std..rt..kill..BlockedTask::_89842139b9381497::glue_drop_4308
+   1.16%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::task::__extensions__::meth_28209::is_home_no_tls::_ae80b16046b0aa7d::_0$x2e8$x2dpre
+   1.09%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] libc::funcs::c95::stdlib::free::_f269ec056c867a::_0$x2e8$x2dpre
+   1.09%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_15900::run_task::_ce3b62147171303e::_0$x2e8$x2dpre
+   0.93%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::task::__extensions__::meth_28168::give_home::_13e078f88f4164a::_0$x2e8$x2dpre
+   0.86%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_15988::deschedule_running_task_and_then::_6dd6f0e0a6b272fe::_0$x2e8$x2dpre
+   0.86%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] unstable::sync::__extensions__::get_29463::_35e2217cc3c2555b::_0$x2e8$x2dpre
+   0.83%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::thread_local_storage::set::_1d5c2ea6f9affc47::_0$x2e8$x2dpre
+   0.80%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::resume_task_immediately::anon::expr_fn_29762
+   0.80%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29407::interpret_message_queue::_e6191113fdd2ee81::_0$x2e8$x2dpre
+   0.74%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::run_task::anon::expr_fn_29741
+   0.74%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::work_queue::__extensions__::pop_29570::anon::expr_fn_29596
+   0.72%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::kill::__extensions__::meth_15714::wake::_3ccd357d3a209434::_0$x2e8$x2dpre

alexcrichton · 2013-08-24T08:09:05Z

Removing approval because @bors is going crazy right now

Naturally, and sadly, turning off sanity checks in the runtime is a noticeable performance win. The particular test I'm running goes from ~1.5 s to ~1.3 s. Sanity checks are turned *on* when not optimizing, or when cfg includes `rtdebug` or `rtassert`.

Add some testcases for recent rustfix update

changelog: none

This adds a testcase for a bug that has been fixed by https://github.com/rust-lang/rustfix/tree/v0.6.1

`rustfix` is pulled in by `compiletest_rs`. So to test that the correct rustfix version is used, I added one (and a half) testcase.

I tried to add a testcase for rust-lang#8734 as well, but interestingly enough the rustfix is wrong:

```diff
 fn issue8734() {
     let _ = [0u8, 1, 2, 3]
         .into_iter()
-        .and_then(|n| match n {
+        .flat_map(|n| match n {
+            1 => [n
+                .saturating_add(1)
             1 => [n
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)],
             n => [n],
         });
 }
```

This needs some investigation, and then this testcase needs to be enabled by uncommenting it.

closes rust-lang#8878
related to rust-lang#8734

Uncomment test for rust-lang#8734

I believe the issue was an interaction between rustfix and `span_lint_and_sugg_for_edges`, so this would've been fixed by rust-lang#98261 (Thanks, @WaffleLapkin!)

Closes rust-lang#8734

changelog: none

thestinger mentioned this pull request Aug 24, 2013

Grab bag of runtime optimizations #8599

Closed

Merge pull request #8738 from mukilan/master

59ca7a8

Force line ending of '.in' files in jemalloc to LF

thestinger closed this Aug 24, 2013

thestinger deleted the rt-opt branch August 24, 2013 18:18

brson added 13 commits August 24, 2013 14:43

std: Reduce TLS access

fd1aa0e

std: Convert some assert!s to rtassert!

aad5e4f

std: More TLS micro-optimization

1c0a7da

std::rt: Optimize TLS use in change_task_context

310e757

std::rt: Remove extra boxes from MessageQueue and SleeperList

fdbcda0

std::rt: Reduce SleeperList contention

b2c1832

This makes the lock much less contended. In the test I'm running the number of times it's contended goes from ~100000 down to ~1000.
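
A minimal sketch of the idea in present-day Rust, assuming a lock-protected stack with a separately readable length. The type and method names here are illustrative only, and a relaxed atomic stands in for the plain unsynchronized read described in the PR, since a non-atomic racy read is not permitted in current Rust.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Illustrative stand-in for a lock-protected list of sleeping schedulers.
pub struct SleeperList<T> {
    /// Approximate length, readable without taking the lock.
    len: AtomicUsize,
    stack: Mutex<Vec<T>>,
}

impl<T> SleeperList<T> {
    pub fn new() -> Self {
        SleeperList { len: AtomicUsize::new(0), stack: Mutex::new(Vec::new()) }
    }

    pub fn push(&self, item: T) {
        let mut stack = self.stack.lock().unwrap();
        stack.push(item);
        // Publish the new length while still holding the lock.
        self.len.store(stack.len(), Ordering::Relaxed);
    }

    /// Fast path: give up without locking when the list looks empty.
    /// This may miss an element pushed concurrently, so callers must
    /// tolerate a spurious `None`.
    pub fn try_pop_fast(&self) -> Option<T> {
        if self.len.load(Ordering::Relaxed) == 0 {
            return None;
        }
        let mut stack = self.stack.lock().unwrap();
        let item = stack.pop();
        self.len.store(stack.len(), Ordering::Relaxed);
        item
    }
}
```

The trade-off is that the cheap check can be stale. That is acceptable only where a missed element is picked up later by another path, which is presumably why the PR describes adding these checks in "strategic places" rather than everywhere.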

std::rt: Reduce MessageQueue contention

3730e79

It's not a huge win, but it does reduce the amount of time spent contesting the message queue when the schedulers are under load.

std::rt: Remove metrics for perf

9b50db0

These aren't used for anything at the moment and cause some TLS hits on some perf-critical code paths. Will need to put better thought into it in the future.

std: Convert the runtime TLS key to a Rust global to avoid FFI

a4bf1be

std::rt: Remove an unnecessary allocation from the main sched loop

b68075e

std: Make vec::push_all_move call reserve_at_least

7abdde8

`vec::unshift` uses this to add elements, and the scheduler queues use `unshift`, so this was causing a lot of reallocation.
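
To show the shape of that fix, here is a hedged sketch in present-day Rust of reserving capacity once before moving elements over. `push_all_move` below is a hypothetical free function named after the 2013 method, and `Vec::reserve` plays the role that `reserve_at_least` played then.

```rust
/// Hypothetical modern analogue of `push_all_move`: move every element of
/// `src` onto the end of `dst`.
fn push_all_move<T>(dst: &mut Vec<T>, src: Vec<T>) {
    // Reserve room for all incoming elements up front, so the loop below
    // does not have to grow the buffer while pushing.
    dst.reserve(src.len());
    for item in src {
        dst.push(item);
    }
}
```

In current Rust there is little reason to write this by hand, since `Vec::append` already moves all elements across in one step; the sketch is only meant to make the reallocation argument concrete.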

std::rt: Enforce sanity a while longer

1afb803

I'm not comfortable turning off rtassert! yet

thestinger reopened this Aug 24, 2013

thestinger closed this Aug 24, 2013

thestinger deleted the rt-opt branch August 24, 2013 18:53

flip1995 pushed a commit to flip1995/rust that referenced this pull request Jul 18, 2022

Uncomment test for rust-lang#8734

6c61f71
