Issue: Parallel executor performance is mostly linear #326

Open · byroot opened this issue Feb 2, 2016 · 20 comments

byroot (Contributor) commented Feb 2, 2016

We use Capistrano to deploy to several hundred servers at once, and we noticed that its performance was heavily tied to the number of servers.

To prove it I ran the following benchmark:

require 'benchmark'
require 'csv'

task :bench do
  servers = roles(:borg).to_a
  timings = {}

  # Optionally establish all connections up front, so the benchmark measures
  # command execution only rather than connection setup.
  if ENV['PREWARM']
    on servers do
      execute :echo, 'pong > /dev/null'
    end
  end

  # Run the same trivial command against batches of increasing size.
  # `pop(count)` removes servers from the list, so each batch uses distinct servers.
  [1, 10, 50, 100].each do |count|
    timings[count] = Benchmark.measure do
      on servers.pop(count) do
        execute :echo, 'pong > /dev/null'
      end
    end
  end

  res = [%w(count user system real)]
  timings.each do |count, timing|
    res << [count, timing.utime, timing.stime, timing.real]
  end
  puts res.map(&:to_csv).join
end

Here are the results: Spreadsheet

[Screenshot: two graphs of benchmark wall time vs. server count, cold vs. pre-established connections]

The first graph is with "cold" connections, meaning it measures connection establishment plus command execution. In the second graph, all the connections were pre-established before the benchmark.

I'm still investigating to figure out exactly where the bottleneck (or bottlenecks) is. I know the GIL plays a part, but since Capistrano / SSHKit is IO heavy, I think there are other reasons.

cc @kirs as well as @csfrancis

byroot (Contributor, Author) commented Feb 3, 2016

Oh, and I forgot: here's a profile of a similar benchmark with about 180 servers:

Cold:

==================================
  Mode: wall(1000)
  Samples: 3592 (6.04% miss rate)
  GC: 390 (10.86%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       626  (17.4%)         620  (17.3%)     SSHKit::Runner::Parallel#execute
       511  (14.2%)         511  (14.2%)     block (3 levels) in Net::SSH::KnownHosts#keys_for
       884  (24.6%)         306   (8.5%)     block (2 levels) in Net::SSH::KnownHosts#keys_for
       158   (4.4%)         111   (3.1%)     Net::SSH::Transport::Kex::DiffieHellmanGroup1SHA1#send_kexinit
       203   (5.7%)         109   (3.0%)     OpenSSL::PKey::DH#valid?
        94   (2.6%)          94   (2.6%)     block in OpenSSL::PKey::DH#valid?
        89   (2.5%)          89   (2.5%)     Net::SSH::Compat.io_select
        60   (1.7%)          60   (1.7%)     block (2 levels) in Net::SSH::Transport::ServerVersion#negotiate!
        55   (1.5%)          55   (1.5%)     block (2 levels) in Net::SSH::Connection::Channel#forward_local_env
        52   (1.4%)          52   (1.4%)     Net::SSH::KnownHosts#known_host_hash?
       956  (26.6%)          51   (1.4%)     Net::SSH::KnownHosts#keys_for
        93   (2.6%)          49   (1.4%)     block in Net::SSH::Config.load
        59   (1.6%)          46   (1.3%)     Net::SSH::Buffer#read
       127   (3.5%)          34   (0.9%)     Net::SSH::Config.load
        33   (0.9%)          33   (0.9%)     Net::SSH::Buffer#write_long
       181   (5.0%)          33   (0.9%)     Net::SSH::Transport::PacketStream#poll_next_packet
        64   (1.8%)          32   (0.9%)     Net::SSH::BufferedIo#fill
        52   (1.4%)          31   (0.9%)     Net::SSH::BufferedIo#send_pending
        30   (0.8%)          30   (0.8%)     SSHKit::Backend::ConnectionPool::Entry#expired?
        29   (0.8%)          29   (0.8%)     Logger#debug?
        33   (0.9%)          28   (0.8%)     Net::SSH::Transport::State#update_next_iv
        27   (0.8%)          27   (0.8%)     Net::SSH::Buffer#length
        27   (0.8%)          27   (0.8%)     Net::SSH::Config.pattern2regex
        26   (0.7%)          26   (0.7%)     Net::SSH::Transport::HMAC::Abstract.mac_length
        85   (2.4%)          23   (0.6%)     Net::SSH::Buffer.from
        22   (0.6%)          22   (0.6%)     OpenSSL::PKey::RSA#ssh_do_verify
      1374  (38.3%)          21   (0.6%)     block in Net::SSH::Transport::Session#poll_message
       905  (25.2%)          21   (0.6%)     block in Net::SSH::KnownHosts#keys_for
        20   (0.6%)          20   (0.6%)     Net::SSH::Buffer#initialize
        17   (0.5%)          17   (0.5%)     Net::SSH::KnownHosts#initialize

Warmed:

==================================
  Mode: wall(1000)
  Samples: 450 (6.83% miss rate)
  GC: 98 (21.78%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
        54  (12.0%)          54  (12.0%)     SSHKit::Backend::ConnectionPool::Entry#expired?
        50  (11.1%)          50  (11.1%)     Net::SSH::Compat.io_select
        28   (6.2%)          28   (6.2%)     block (2 levels) in Net::SSH::Connection::Channel#forward_local_env
        11   (2.4%)          11   (2.4%)     Net::SSH::Transport::HMAC::Abstract.mac_length
        11   (2.4%)          10   (2.2%)     Net::SSH::Buffer#read
        22   (4.9%)          10   (2.2%)     Net::SSH::BufferedIo#fill
         9   (2.0%)           9   (2.0%)     Net::SSH::Transport::State#update_next_iv
        12   (2.7%)           9   (2.0%)     Net::SSH::BufferedIo#send_pending
        63  (14.0%)           8   (1.8%)     block (2 levels) in SSHKit::Backend::ConnectionPool#prune_expired
         8   (1.8%)           8   (1.8%)     Net::SSH::Buffer#initialize
         7   (1.6%)           7   (1.6%)     Net::SSH::Buffer#append
         7   (1.6%)           7   (1.6%)     Thread.create_with_logging_context
         6   (1.3%)           6   (1.3%)     Net::SSH::Buffer#write_long
         5   (1.1%)           5   (1.1%)     Logger#debug?
         5   (1.1%)           5   (1.1%)     block in Net::SSH::Transport::PacketStream#enqueue_packet
         5   (1.1%)           5   (1.1%)     Net::SSH::Transport::HMAC::Abstract.digest_class
         5   (1.1%)           5   (1.1%)     #<Module:0x007fafe25598f0>.reporter
         9   (2.0%)           4   (0.9%)     Net::SSH::Transport::State#update_cipher
         9   (2.0%)           4   (0.9%)     SSHKit::Color#colorize
         4   (0.9%)           4   (0.9%)     Net::SSH::BufferedIo#input
         4   (0.9%)           4   (0.9%)     block (3 levels) in <class:Digest>
       271  (60.2%)           4   (0.9%)     Net::SSH::Connection::Session#dispatch_incoming_packets
         4   (0.9%)           4   (0.9%)     SSHKit::Color#colorize?
        10   (2.2%)           4   (0.9%)     Net::SSH::Buffer#read_long
         8   (1.8%)           4   (0.9%)     block in Net::SSH::Packet#instantiate!
         3   (0.7%)           3   (0.7%)     Net::SSH::Buffer#length
        14   (3.1%)           3   (0.7%)     Capistrano::DSL#on
        13   (2.9%)           3   (0.7%)     Net::SSH::Buffer.from
         3   (0.7%)           3   (0.7%)     SSHKit::Backend::Abstract#initialize
         3   (0.7%)           3   (0.7%)     Net::SSH::Buffer#to_s

mattbrictson (Member) commented:

Interesting! I am also surprised that the results are linear. Please keep us updated on your findings. I wonder if JRuby results are any different.

byroot (Contributor, Author) commented Feb 3, 2016

Well, while posting the profile I noticed that this line:

54  (12.0%)          54  (12.0%)     SSHKit::Backend::ConnectionPool::Entry#expired?

looked very sketchy. It turns out that, according to our monitoring, running migrations very often takes more than 30 seconds (we use Docker, so with docker pull plus the Rails boot it's easy to cross that mark).
Which means all the connections established before deploy:migrate have to be re-established afterwards.

So I'll see tomorrow, but I think SSHKit.config.backend.pool.idle_timeout = 120 should help quite a bit in our real-world situation.
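
For reference, that setting would go wherever the deploy is configured; a minimal sketch, assuming a standard config/deploy.rb (the exact file is up to you):

# config/deploy.rb (assumed location)
# Keep pooled SSH connections alive for 120 seconds instead of the 30-second
# default, so they survive a long-running step such as deploy:migrate.
SSHKit.config.backend.pool.idle_timeout = 120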

But it still doesn't explain the benchmark results so I'll keep digging.

byroot (Contributor, Author) commented Feb 3, 2016

Based on @csfrancis's input and also on net-ssh/net-ssh#303, it seems likely that a good part of the linear performance comes from IO.select, which gets slower the more file descriptors it has to watch.

If net-ssh/net-ssh#303 goes through, it would be very interesting to experiment with an evented backend.
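
To make the select() scaling concrete, here is a rough standalone micro-benchmark (not Capistrano or SSHKit code, just an illustration of how the per-call cost grows with the number of watched descriptors):

require 'benchmark'

# Create N pipes and measure how long IO.select takes to poll their read
# ends. select() scans every descriptor on each call, so the per-call cost
# grows with the number of connections being watched.
[10, 50, 100].each do |n|
  pipes   = Array.new(n) { IO.pipe }
  readers = pipes.map(&:first)
  time = Benchmark.realtime do
    1_000.times { IO.select(readers, nil, nil, 0) }
  end
  puts format('%4d fds: %.1fms for 1000 IO.select calls', n, time * 1000)
  pipes.flatten.each(&:close)
end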

However, the profile clearly shows that a huge amount of time is spent in SSHKit::Backend::ConnectionPool::Entry#expired?. After investigation it's clear that the ConnectionPool puts way too much zeal into pruning expired connections.

The prune_expired method, which iterates over all the connections, is called both on checkin and on checkout, so it's executed twice per host per command.

e.g. if you execute 5 commands on 10 servers, it will be executed 2 * 5 * 10 = 100 times! Also note that it synchronizes on a mutex, so it very probably limits concurrency.

I'll start working on a PR to see if those prunes could be batched somehow, e.g. triggered after every X checkins/checkouts.
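
Something along these lines, maybe (only a sketch with a hypothetical pool class and interval, not the actual SSHKit API):

# Hypothetical: only scan the pool for expired entries every PRUNE_INTERVAL
# checkouts, instead of on every checkin and checkout.
class BatchedPruningPool
  PRUNE_INTERVAL = 100

  def initialize
    @mutex = Mutex.new
    @entries = []
    @checkouts_since_prune = 0
  end

  def checkout
    @mutex.synchronize do
      @checkouts_since_prune += 1
      prune_expired if @checkouts_since_prune >= PRUNE_INTERVAL
      @entries.pop # simplified: reuse an idle entry if there is one
    end
  end

  def checkin(entry)
    @mutex.synchronize { @entries.push(entry) }
  end

  private

  # Walk all entries once per batch rather than twice per command per host.
  # Assumes entries respond to #expired?.
  def prune_expired
    @entries.reject!(&:expired?)
    @checkouts_since_prune = 0
  end
end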

I'd also like to be able to disable the idle_timeout check without losing the connection pooling entirely.

byroot (Contributor, Author) commented Feb 3, 2016

~~Quick update: it seems that idle_timeout disabling connection pooling entirely is a regression from #298.~~ Sorry, I was wrong.

camilo commented Feb 3, 2016

Based on @csfrancis's input and also on net-ssh/net-ssh#303, it seems likely that a good part of the linear performance comes from IO.select, which gets slower the more file descriptors it has to watch.

Maybe strace -C would tell, since it would measure the time spent on just that syscall.

byroot (Contributor, Author) commented Feb 3, 2016

#327 allows disabling pruning entirely, for those interested. I'm still unsure how to reduce the prune_expired calls. Maybe with a counter?

byroot (Contributor, Author) commented Feb 3, 2016

By inspecting the profiles, I also discovered that net/ssh wastes a lot of CPU parsing the same known_hosts files on every connection establishment.

I'll try to see if they could be cached, ref: net-ssh/net-ssh#307

HoneyryderChuck commented:

Just to add to this, since you've mentioned that my pull request is still being evaluated: the known hosts parsing is a huge bottleneck in net-ssh. As you correctly mentioned, the parsing occurs for every host you connect to. The performance of this operation depends on the file system you're running it from (if you're on a networked FS, as I usually am, this kills your performance).

Opening a file descriptor on the file system will also kill any chance of toying around with an evented framework, as the Ruby implementations I know of (nio4r, eventmachine) famously only listen on network sockets, which means that every file I/O will block the event loop.

IO.select calls do get expensive, but only because they're spread across the library for read/write checks. If the pull request gets accepted, the number of direct IO.select calls (io/wait still falls back to select() when no poll() is implemented) will be reduced to 1 per session traffic processing. For this there is currently no optimal solution in sight.

byroot (Contributor, Author) commented Feb 3, 2016

@TiagoCardoso1983 I tried net-ssh/net-ssh#303 but only got very marginal performance improvements. Is it because the main select in #process is still not patched?

HoneyryderChuck commented:

@byroot which OS are you testing it on? Unless you're on a poll()-enabled one, it's the same thing. And even in a poll()-enabled environment, the bulk of the work is done by the #process call, which is unpatchable in its current state and uses IO.select. Just so you understand what I mean: net-ssh has a loop, but it is far from being truly non-blocking. What it does is check whether there is something to read/write, then proceed to read/write with the blocking Socket APIs (limiting the number of bytes transmitted, yes). Truly non-blocking would be: (1) try to read/write non-blocking, (2a) succeed, or (2b) wait for the read/write and go back to (1). Improving this bit would mean rewriting the library; I don't know if that's feasible.
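
For illustration, that non-blocking pattern looks roughly like this in plain Ruby (a generic sketch, not net-ssh code; the helper name is made up):

require 'socket'

# Generic non-blocking read loop: attempt the read first, and only wait on
# the socket when the kernel reports it would block.
def read_available(socket, max_bytes = 64 * 1024)
  begin
    socket.read_nonblock(max_bytes)    # (1) try to read without blocking, (2a) succeed
  rescue IO::WaitReadable
    IO.select([socket])                # (2b) wait until readable...
    retry                              # ...and go back to (1)
  rescue EOFError
    nil                                # peer closed the connection
  end
end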

byroot (Contributor, Author) commented Feb 3, 2016

OS are you testing it on?

OSX, so it's poll-enabled.

Improving this bit would mean rewriting the library

Yeah, that's what I figured 😞

HoneyryderChuck commented:

OSX's poll implementation is broken. I would have to look at the code, but I think that Ruby falls back to select on OSX. You can check it with strace, though. As for the rest, it's mitigated with a proper fallback to a proper event loop. I've been working on something on the side. It will not help your keys_for issue though. 😞

leehambley (Member) commented:

Hi guys, excellent work digging into this. I'm going on the road for a week on Friday, and I don't have time to get too deep into this with you in the next 24h, but you are in good hands with @mattbrictson, and I'll be reading my emails during CW6 if either of us can do anything to support you. We can always escalate things up to net/ssh if we need to; just let us know and we'll help prepare PRs or issues there.

mattbrictson (Member) commented:

@byroot It looks like with #327 and net-ssh/net-ssh#308 both merged, this issue is now largely resolved.

I did some benchmarking of my own, and while the current SSHKit ConnectionPool does have some unnecessary overhead, in the big picture it is pretty negligible. I think we have reached the point of diminishing returns. Agreed?

byroot (Contributor, Author) commented Feb 5, 2016

net-ssh/net-ssh#308 by itself doesn't fix anything. I'd like to propose one last feature in SSHKit: replacing Net::SSH's KnownHosts with an alternative implementation that caches lookups in memory.

Would you agree with such a feature?
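
Roughly, the idea would be something like this (a hedged sketch only; it assumes net-ssh accepts a :known_hosts object responding to #search_for, and a real implementation would parse the files once for all hosts rather than memoize per host):

require 'net/ssh'

# Hypothetical sketch: serve repeated known_hosts lookups from memory instead
# of re-parsing the files on every connection.
class CachedKnownHosts
  def initialize
    @mutex = Mutex.new
    @cache = {}
  end

  # Memoize per host; the first lookup still pays the parsing cost.
  def search_for(host, options = {})
    @mutex.synchronize do
      @cache[host] ||= Net::SSH::KnownHosts.search_for(host, options)
    end
  end
end

# Usage sketch (assumes the :known_hosts option is honored by net-ssh):
# Net::SSH.start(host, user, known_hosts: CachedKnownHosts.new)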

I'm ok with closing this issue. I got sidetracked, but I'll try to see if there are other performance improvements I could bring; I don't need this issue open to do so.

mattbrictson (Member) commented:

@byroot Yes, I'm open to having SSHKit ship with a better/faster KnownHosts implementation. Let's keep this issue open until that is complete.

Also: during the past couple of days I worked on a rewrite of the SSHKit ConnectionPool that uses much less mutex locking and moves stale connection eviction to a background thread. If I open a PR, would you be willing to test my branch against your 180 servers? 😁

My hunch is that the mutex overhead is not really a big deal, but I'm curious to see the results.
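
Not the actual branch, but a minimal sketch of the background-eviction idea (hypothetical class, interval, and entry layout):

# Hypothetical: evict stale connections from a periodic background thread so
# that checkout/checkin never pay for a full scan of the pool.
class EvictingPool
  def initialize(idle_timeout: 30, sweep_interval: 5)
    @idle_timeout = idle_timeout
    @mutex = Mutex.new
    @entries = []
    @sweeper = Thread.new do
      loop do
        sleep sweep_interval
        sweep
      end
    end
  end

  def checkin(connection)
    @mutex.synchronize { @entries << { conn: connection, at: Time.now } }
  end

  def checkout
    @mutex.synchronize { (@entries.pop || {})[:conn] }
  end

  private

  # Close and drop entries that have been idle longer than idle_timeout.
  def sweep
    @mutex.synchronize do
      @entries.reject! do |entry|
        next false unless Time.now - entry[:at] > @idle_timeout
        entry[:conn].close rescue nil
        true
      end
    end
  end
end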

byroot (Contributor, Author) commented Feb 5, 2016

If I open a PR would you be willing to test my branch against your 180 servers?

Of course, I'd be happy to.

byroot added a commit to byroot/sshkit that referenced this issue Feb 11, 2016
…ptions

As discussed in capistrano#326 (comment)
Net::SSH re-parses the known_hosts files every time it needs to look up a known key.
This alternative implementation parses them once and for all, and caches the result.
byroot added a commit to byroot/sshkit that referenced this issue Feb 17, 2016
leehambley changed the title from "Parallel executor performance is mostly linear" to "Issue: Parallel executor performance is mostly linear" on Feb 21, 2016
byroot added a commit to byroot/sshkit that referenced this issue Mar 11, 2016
byroot (Contributor, Author) commented Mar 16, 2016

Just wanted to update on this. Today I updated our Capistrano recipe to use all the improvements we made so far (including the yet-unreleased #330), and as far as I'm concerned this issue is solved.

The reason I opened this issue initially is that when I attempted to deploy both our datacenters with the same recipe, it ended up being ~50% slower than doing it in 2 concurrent processes like we used to.

But on the second attempt there is no noticeable difference anymore.

There is likely still room for improvement, especially around epoll and friends, but at this point Capistrano's performance is satisfactory at our scale (300-400 servers).

@mattbrictson let me know if you need me to do anything else to get #330 merged; other than that, I'm good with closing this issue.

leehambley (Member) commented:

Thanks for the feedback @byroot, and for your time and contributions. I'll merge #330 in a day or two unless @mattbrictson chimes in in the meantime. I didn't see anything blocking, but Matt is taking a short break from FOSS and I still want to give him the chance, without pressure, to have a say.

byroot added a commit to byroot/sshkit that referenced this issue Apr 22, 2016