Grab bag of runtime optimizations #8734

Closed
wants to merge 14 commits into from

thestinger
Contributor

Here are a bunch of small optimizations that add up to a 36% improvement on one particular message passing benchmark.

After this and @toddaaro's optimizations from #8566 the next biggest wins are probably going to be avoiding the event loop, which is another 25%, and using a less-allocating channel implementation (not sure how much this wins but it should be a lot). Beyond that, there are still the important optimizations of using the stack pointer for TLS instead of the TLS API, reducing lock contention, identifying and reducing other syscalls, page faults, context switches and allocations, and recycling tasks. Codegen improvements may help as well, as there appears to be some nonsense in the assembly that one wouldn't write by hand.

There are a couple of notable changes here:

  • 9692175 Turns `rtassert!` off for optimized builds, adding a new constant that can be queried before running expensive sanity checks: `pub static ENFORCE_SANITY: bool = !cfg!(rtopt) || cfg!(rtdebug) || cfg!(rtassert)`. `cfg(rtopt)` is turned on by makefiles.
  • The two commits that reduce lock contention do so by adding unsynchronized checks of the length of queues in strategic places. Under load this makes the SleeperList in particular virtually uncontended (and it was extremely contended before).
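
The lock-contention change can be pictured roughly as follows. This is a minimal sketch in present-day Rust, not the code from the PR: `CheckedQueue`, its fields, and its methods are hypothetical names, and an atomic counter stands in for the unsynchronized length read, since safe Rust today does not allow a plain unsynchronized read of shared data.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a scheduler queue such as the SleeperList.
struct CheckedQueue<T> {
    /// Length as of the last locked update; readable without taking the lock.
    len: AtomicUsize,
    items: Mutex<VecDeque<T>>,
}

impl<T> CheckedQueue<T> {
    fn new() -> Self {
        CheckedQueue {
            len: AtomicUsize::new(0),
            items: Mutex::new(VecDeque::new()),
        }
    }

    /// Adds an item under the lock and publishes the new length.
    fn push(&self, item: T) {
        let mut items = self.items.lock().unwrap();
        items.push_back(item);
        self.len.store(items.len(), Ordering::Release);
    }

    /// Removes an item, skipping the lock entirely when the queue looks empty.
    fn pop(&self) -> Option<T> {
        // Cheap check first: if nothing appears to be queued, do not contend.
        if self.len.load(Ordering::Acquire) == 0 {
            return None;
        }
        let mut items = self.items.lock().unwrap();
        let item = items.pop_front();
        self.len.store(items.len(), Ordering::Release);
        item
    }
}

fn main() {
    let queue: CheckedQueue<i32> = CheckedQueue::new();
    println!("{:?}", queue.pop()); // empty: returns without locking
    queue.push(7);
    println!("{:?}", queue.pop()); // non-empty: takes the lock and pops
}
```

The length check can be stale, so a caller may see `None` just before another thread pushes. A design like this only fits callers that tolerate that and simply try again later.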

Before (just with #8566):

brian@brian-X1:~/dev/rust-sched-bench$ RUST_THREADS=4 perf stat -- ./pingpong-rust
 Performance counter stats for './pingpong-rust':

       5873.315864 task-clock                #    3.882 CPUs utilized
             9,861 context-switches          #    0.002 M/sec
                23 cpu-migrations            #    0.004 K/sec
             1,438 page-faults               #    0.245 K/sec
    17,321,368,036 cycles                    #    2.949 GHz
    11,409,886,127 stalled-cycles-frontend   #   65.87% frontend cycles idle
   <not supported> stalled-cycles-backend
    13,352,527,802 instructions              #    0.77  insns per cycle
                                             #    0.85  stalled cycles per insn
     3,013,346,012 branches                  #  513.057 M/sec
        47,494,193 branch-misses             #    1.58% of all branches

       1.512948956 seconds time elapsed

After (these opts + #8566):

brian@brian-X1:~/dev/rust-sched-bench$ RUST_THREADS=4 perf stat -- ./pingpong-rust
Performance counter stats for './pingpong-rust':

       3682.534190 task-clock                #    3.863 CPUs utilized
             5,622 context-switches          #    0.002 M/sec
                24 cpu-migrations            #    0.007 K/sec
             1,432 page-faults               #    0.389 K/sec
    10,979,235,304 cycles                    #    2.981 GHz
     7,260,103,584 stalled-cycles-frontend   #   66.13% frontend cycles idle
   <not supported> stalled-cycles-backend
     8,557,987,829 instructions              #    0.78  insns per cycle
                                             #    0.85  stalled cycles per insn
     1,766,533,692 branches                  #  479.706 M/sec
        38,529,078 branch-misses             #    2.18% of all branches

       0.953177796 seconds time elapsed

And here is how Go does on the same benchmark:

brian@brian-X1:~/dev/rust-sched-bench$ GOMAXPROCS=4 perf stat ./pingpong-go
 Performance counter stats for './pingpong':

        990.779438 task-clock                #    3.874 CPUs utilized          
               517 context-switches          #    0.522 K/sec                  
                12 cpu-migrations            #    0.012 K/sec                  
               219 page-faults               #    0.221 K/sec                  
     2,949,207,854 cycles                    #    2.977 GHz                    
     1,913,919,758 stalled-cycles-frontend   #   64.90% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     2,672,127,904 instructions              #    0.91  insns per cycle        
                                             #    0.72  stalled cycles per insn
       609,300,846 branches                  #  614.971 M/sec                  
         9,815,707 branch-misses             #    1.61% of all branches        

       0.255722866 seconds time elapsed
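
As a rough cross-check, the three runs above can be compared with a little arithmetic. The sketch below only reuses figures copied from the `perf stat` output; the helper names `percent_reduction` and `ratio` are made up for illustration.

```rust
/// Percentage by which `after` is smaller than `before`.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// How many times larger `a` is than `b`.
fn ratio(a: f64, b: f64) -> f64 {
    a / b
}

fn main() {
    // Elapsed seconds, copied from the perf stat output above.
    let rust_before_s = 1.512948956;
    let rust_after_s = 0.953177796;
    let go_s = 0.255722866;

    // Context switches for the two Rust runs.
    let ctx_before = 9861.0;
    let ctx_after = 5622.0;

    println!(
        "elapsed time reduction: {:.1}%",
        percent_reduction(rust_before_s, rust_after_s)
    );
    println!(
        "context switch reduction: {:.1}%",
        percent_reduction(ctx_before, ctx_after)
    );
    println!(
        "optimized Rust vs Go elapsed time: {:.2}x",
        ratio(rust_after_s, go_s)
    );
}
```

On these figures the elapsed time drops by roughly 37%, close to the 36% quoted in the description, and the optimized Rust run still takes roughly 3.7 times as long as the Go run.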

Here's what the profile looks like after these optimizations, those in #8566, and a hacked up optimization to not hit epoll (not included in this PR):

+   8.85%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::try_send_inner_6667::_a19fb8784b62bf94::_0$x2e0
+   7.09%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::try_recv_6722::_9022caef57e17a4::_0$x2e0
+   5.15%  pingpong-rust  libc-2.17.so                        [.] malloc
+   4.45%  pingpong-rust  libc-2.17.so                        [.] _int_free
+   3.39%  pingpong-rust  pingpong-rust                       [.] std..rt..comm..ChanOne$LT$std..rt..comm..StreamPayload$LT$$LP$$RP$$GT$$GT$::_a4c4e2b19dcf3b21::glue_drop_6336
+   3.30%  pingpong-rust  libpthread-2.17.so                  [.] pthread_getspecific
+   3.08%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29695::change_task_context::_e41693fc69da04e::_0$x2e8$x2dpre
+   2.62%  pingpong-rust  libc-2.17.so                        [.] _int_malloc
+   2.53%  pingpong-rust  libpthread-2.17.so                  [.] pthread_mutex_lock
+   2.31%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::switch_running_tasks_and_then::anon::expr_fn_29766
+   2.30%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] unstable::atomics::__extensions__::meth_20980::new::_cc85e4d2d8dd7469::_0$x2e8$x2dpre
+   2.12%  pingpong-rust  libpthread-2.17.so                  [.] __pthread_mutex_unlock_usercnt
+   2.04%  pingpong-rust  librustrt.so                        [.] swap_registers
+   1.97%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29405::run_sched_once::_f3525925b944a51::_0$x2e8$x2dpre
+   1.74%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::drop_3677::_ae547a17a7ff6ea9::_0$x2e0
+   1.67%  pingpong-rust  libc-2.17.so                        [.] free
+   1.56%  pingpong-rust  pingpong-rust                       [.] rt::comm::oneshot_3615::_5a94a9b33e484f55::_0$x2e0
+   1.54%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_21144::enqueue_task::_ce3b62147171303e::_0$x2e8$x2dpre
+   1.48%  pingpong-rust  libpthread-2.17.so                  [.] pthread_setspecific
+   1.47%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::thread_local_storage::pthread_getspecific::_741ee83c8a1aea64::_0$x2e8$x2dpre
+   1.43%  pingpong-rust  libc-2.17.so                        [.] pthread_mutex_lock
+   1.39%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29730::schedule_task::_373b465acd34aec7::_0$x2e8$x2dpre
+   1.36%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::local_ptr::unsafe_borrow_32251::_32439a71381e9371::_0$x2e8$x2dpre
+   1.34%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::task::__extensions__::meth_28172::take_unwrap_home::_8ff7f630b5c4cb79::_0$x2e8$x2dpre
+   1.34%  pingpong-rust  pingpong-rust                       [.] _$UP$std..rt..task..Task::_6757b7c895de82::glue_drop_3896
+   1.33%  pingpong-rust  pingpong-rust                       [.] rt::comm::__extensions__::try_recv_6760::anon::expr_fn_6765
+   1.29%  pingpong-rust  pingpong-rust                       [.] ping_pong_bench::run_pair::anon::anon::expr_fn_6583
+   1.25%  pingpong-rust  pingpong-rust                       [.] std..option..Option$LT$$UP$std..rt..sched..Scheduler$GT$::_ef4d43fe1137a657::glue_drop_5755
+   1.22%  pingpong-rust  pingpong-rust                       [.] rt::local::__extensions__::meth_5148::take::_87eba8df7b0468::_0$x2e0
+   1.20%  pingpong-rust  libc-2.17.so                        [.] __memmove_ssse3_back
+   1.19%  pingpong-rust  pingpong-rust                       [.] std..rt..kill..BlockedTask::_89842139b9381497::glue_drop_4308
+   1.16%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::task::__extensions__::meth_28209::is_home_no_tls::_ae80b16046b0aa7d::_0$x2e8$x2dpre
+   1.09%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] libc::funcs::c95::stdlib::free::_f269ec056c867a::_0$x2e8$x2dpre
+   1.09%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_15900::run_task::_ce3b62147171303e::_0$x2e8$x2dpre
+   0.93%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::task::__extensions__::meth_28168::give_home::_13e078f88f4164a::_0$x2e8$x2dpre
+   0.86%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_15988::deschedule_running_task_and_then::_6dd6f0e0a6b272fe::_0$x2e8$x2dpre
+   0.86%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] unstable::sync::__extensions__::get_29463::_35e2217cc3c2555b::_0$x2e8$x2dpre
+   0.83%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::thread_local_storage::set::_1d5c2ea6f9affc47::_0$x2e8$x2dpre
+   0.80%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::resume_task_immediately::anon::expr_fn_29762
+   0.80%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::meth_29407::interpret_message_queue::_e6191113fdd2ee81::_0$x2e8$x2dpre
+   0.74%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::sched::__extensions__::run_task::anon::expr_fn_29741
+   0.74%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::work_queue::__extensions__::pop_29570::anon::expr_fn_29596
+   0.72%  pingpong-rust  libstd-6c65cf4b443341b1-0.8-pre.so  [.] rt::kill::__extensions__::meth_15714::wake::_3ccd357d3a209434::_0$x2e8$x2dpre

@alexcrichton
Member

Removing approval because @bors is going crazy right now

Force line ending of '.in' files in jemalloc to LF
@thestinger thestinger closed this Aug 24, 2013
@thestinger thestinger deleted the rt-opt branch August 24, 2013 18:18
brson added 13 commits August 24, 2013 14:43
Naturally, and sadly, turning off sanity checks in the runtime is
a noticeable performance win. The particular test I'm running goes from
~1.5 s to ~1.3 s.

Sanity checks are turned *on* when not optimizing, or when cfg
includes `rtdebug` or `rtassert`.
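
The rule in this commit message matches the `ENFORCE_SANITY` expression quoted in the PR description. Below is a minimal sketch of how such a constant might gate checks in present-day Rust. It assumes the custom flags are passed as `--cfg rtopt` and so on, uses `const` where the 2013 code used `static`, and the `rtassert!` macro and `queue_is_sorted` helper are simplified, hypothetical stand-ins rather than the runtime's own definitions.

```rust
/// True when expensive runtime sanity checks should run: always when not
/// optimizing, and also when `rtdebug` or `rtassert` is set explicitly.
#[allow(unexpected_cfgs)]
pub const ENFORCE_SANITY: bool = !cfg!(rtopt) || cfg!(rtdebug) || cfg!(rtassert);

/// Hypothetical stand-in for the runtime's `rtassert!`: the condition is
/// only evaluated when sanity checks are enforced.
macro_rules! rtassert {
    ($cond:expr) => {
        if ENFORCE_SANITY && !$cond {
            panic!("runtime assertion failed: {}", stringify!($cond));
        }
    };
}

/// Hypothetical expensive invariant check over a queue.
fn queue_is_sorted(queue: &[u32]) -> bool {
    queue.windows(2).all(|pair| pair[0] <= pair[1])
}

fn main() {
    let queue = [1, 2, 3];

    // Skipped entirely in a build compiled with `--cfg rtopt` alone.
    rtassert!(queue_is_sorted(&queue));

    // The constant can also be queried directly before costly work.
    if ENFORCE_SANITY {
        println!("sanity checks are enabled");
    }
}
```

Because the condition is a compile-time constant, an optimizing compiler can normally drop the disabled branch, so a check gated this way costs little or nothing in `rtopt` builds.
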
This makes the lock much less contended. In the test I'm running the
number of times it's contended goes from ~100000 down to ~1000.
It's not a huge win but it does reduce the amount of time spent
contesting the message queue when the schedulers are under load
These aren't used for anything at the moment and cause some TLS hits
on some perf-critical code paths. Will need to put better thought into
it in the future.
vec::unshift uses this to add elements, scheduler queues use unshift,
and this was causing a lot of reallocation
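
The commit message does not say which function "this" refers to, so the following is only a general illustration of the cost it describes, written in present-day Rust. `Vec::insert(0, ..)` stands in for the old `vec::unshift`, and `count_reallocations` is a hypothetical helper that counts how often the buffer's capacity changes while elements are added at the front.

```rust
/// Inserts `n` elements at the front of `v` and counts how many times the
/// vector's capacity changed, which is a proxy for reallocations.
fn count_reallocations(mut v: Vec<u32>, n: u32) -> usize {
    let mut reallocations = 0;
    let mut capacity = v.capacity();
    for i in 0..n {
        v.insert(0, i); // front insertion, like the old `unshift`
        if v.capacity() != capacity {
            reallocations += 1;
            capacity = v.capacity();
        }
    }
    reallocations
}

fn main() {
    let n = 1000;
    let grown = count_reallocations(Vec::new(), n);
    let reserved = count_reallocations(Vec::with_capacity(n as usize), n);
    println!("starting empty: {} capacity changes", grown);
    println!("reserved up front: {} capacity changes", reserved);
}
```

Reserving enough space ahead of time, or growing by larger steps, keeps the number of reallocations down. Which of those the commit actually did is not stated here.
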
I'm not comfortable turning off rtassert! yet
@thestinger thestinger reopened this Aug 24, 2013
@thestinger thestinger closed this Aug 24, 2013
@thestinger thestinger deleted the rt-opt branch August 24, 2013 18:53
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jun 4, 2022
Add some testcases for recent rustfix update

changelog: none

This adds a testcase for a bug that has been fixed by https://github.com/rust-lang/rustfix/tree/v0.6.1

`rustfix` is pulled in by `compiletest_rs`. So to test that the correct rustfix version is used, I added one (and a half) testcase.

I tried to add a testcase for rust-lang#8734 as well, but interestingly enough the fix applied by rustfix is wrong:

```diff
 fn issue8734() {
     let _ = [0u8, 1, 2, 3]
         .into_iter()
-        .and_then(|n| match n {
+        .flat_map(|n| match n {
+            1 => [n
+                .saturating_add(1)
             1 => [n
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)
                 .saturating_add(1)],
             n => [n],
         });
 }
```

This needs some investigation, and then this testcase needs to be enabled by uncommenting it.

closes rust-lang#8878
related to rust-lang#8734
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jul 18, 2022
Uncomment test for rust-lang#8734

I believe the issue was an interaction between rustfix and `span_lint_and_sugg_for_edges`, so this would've been fixed by rust-lang#98261 (Thanks, @WaffleLapkin!)

Closes rust-lang#8734

changelog: none
4 participants