Issue: Parallel executor performance is mostly linear #326
Comments
Oh, and I forgot: here's a profile of a similar benchmark with about 180 servers:
Cold:
Warmed:
Interesting! I am also surprised that the results are linear. Please keep us updated on your findings. I wonder if JRuby results are any different.
Well, while posting the profile I noticed something very sketchy. Turns out, according to our monitoring, running migrations very often takes more than 30 seconds (we use Docker, so with docker pull plus the Rails boot it's easy to cross that mark). I'll check tomorrow, but that still doesn't explain the benchmark results, so I'll keep digging.
Based on @csfrancis's input and also on net-ssh/net-ssh#303, it seems likely that a good part of the linear performance comes from IO.select calls. If net-ssh/net-ssh#303 goes through, it would be very interesting to experiment with an evented backend. However, the profile also clearly shows that a huge amount of time is spent pruning the connection pool, which currently happens on every checkin/checkout: e.g. if you execute 5 commands, the pruning runs at least 5 times per server. I'll start working on a PR to see if those prunes could be batched somehow, e.g. triggered after X checkins/checkouts. I'd also like to be able to disable the pruning entirely.
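For illustration, here is a minimal sketch of that batching idea; the class and method names are hypothetical, not SSHKit's actual ConnectionPool API. The pool keeps an operation counter and only pays for the expiry scan every N checkins/checkouts:

```ruby
# Hypothetical sketch, not SSHKit's real ConnectionPool: amortize the expiry
# scan by running it every PRUNE_EVERY operations instead of on each one.
class BatchedPruningPool
  PRUNE_EVERY = 100 # assumption: tune to whatever amortizes the scan

  def initialize(idle_timeout: 30)
    @idle_timeout = idle_timeout
    @idle  = {} # key => [connection, last_used_at]
    @ops   = 0
    @mutex = Mutex.new
  end

  # Returns a pooled connection for `key`, or yields to build a fresh one.
  def checkout(key)
    @mutex.synchronize do
      maybe_prune
      entry = @idle.delete(key)
      entry ? entry[0] : yield
    end
  end

  def checkin(key, connection)
    @mutex.synchronize do
      maybe_prune
      @idle[key] = [connection, Time.now]
    end
  end

  private

  # Only every PRUNE_EVERY-th operation pays for the full scan.
  def maybe_prune
    @ops += 1
    return unless (@ops % PRUNE_EVERY).zero?
    cutoff = Time.now - @idle_timeout
    @idle.delete_if { |_key, (_conn, last_used)| last_used < cutoff }
  end
end
```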
~~Quick update, it seems like …~~
Maybe strace -C would tell, as it would measure the time spent on just that syscall.
#327 allows disabling the pruning entirely, for those interested. I'm still unsure how to reduce the remaining overhead.
By inspecting the profiles, I also discovered that net-ssh wastes a lot of CPU parsing the same known_hosts files over and over. I'll try to see if they could be cached, ref: net-ssh/net-ssh#307
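For illustration, a minimal sketch of the caching idea, not the implementation that eventually shipped: memoize the parsed known_hosts lookup per host so the files are only read and parsed once per process. The class name is hypothetical, and it assumes recent net-ssh versions accept a :known_hosts option whose value responds to #search_for:

```ruby
require "net/ssh"

# Hypothetical caching wrapper around Net::SSH's known-hosts lookup.
class CachingKnownHosts
  def initialize
    @cache = {}
    @mutex = Mutex.new
  end

  # Same interface as Net::SSH::KnownHosts.search_for, but memoized per host.
  def search_for(host, options = {})
    @mutex.synchronize do
      @cache[host] ||= Net::SSH::KnownHosts.search_for(host, options)
    end
  end
end

# Usage sketch (assumption: the :known_hosts option is honored by net-ssh):
#   Net::SSH.start("app-1.example.com", "deploy",
#                  known_hosts: CachingKnownHosts.new)
```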
Just to add to this, since you've mentioned my pull request still being evaluated: the known hosts parsing is a huge bottleneck in net-ssh. As you correctly mentioned, the parsing occurs for every host you connect to. Performance of this operation depends on the file system you're running it from (if you're on a networked FS, as I usually am, this kills your performance). Opening a file descriptor will also kill any chance of toying around with an evented framework, as the Ruby implementations I know of (nio4r, eventmachine) famously only listen on network sockets, which means that every file I/O will block the event loop. IO.select calls do get expensive, but only because they're spread across the library for read/write checks. If the pull request gets accepted, the number of direct IO.select calls (io/wait still falls back to select() when no poll() is implemented) will be reduced to 1 per session traffic processing. For the file I/O there is currently no optimal solution in sight.
@TiagoCardoso1983 I tried net-ssh/net-ssh#303 but only got very marginal performance improvements. Is it because the main processing loop still goes through IO.select?
@byroot which OS are you testing it on? Unless you're on a poll()-enabled one, it's the same thing. Even if you are in a poll()-enabled environment, the bulk of the work is done by the #process call, which is unpatchable in its current state and uses IO.select. Just so you understand what I mean: net-ssh has a loop, but it is far from being truly non-blocking. What it does is check whether there is something to read/write, then proceed to read/write with the blocking Socket APIs (limiting the number of bytes transmitted, yes). Truly non-blocking would be: (1) try to read/write non-blocking, (2a) succeed, or (2b) wait for readability/writability and go back to (1). Improving this bit would mean rewriting the library; I don't know if that's feasible.
OSX, so it's poll-enabled.
Yeah, that's what I figured 😞
The OSX poll() implementation is broken. I would have to look at the code, but I think Ruby falls back to select() on OSX. You can check it with strace, though. As for the rest, it can be mitigated with a fallback to a proper event loop; I've been working on something on the side. It won't help your keys_for issue, though. 😞
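For illustration, a minimal sketch of the truly non-blocking pattern described above, independent of net-ssh; the method name and byte count are arbitrary:

```ruby
require "socket"

# (1) try the read without blocking; (2a) return the data, or
# (2b) wait until the socket is readable and go back to (1).
def read_available(socket, max_bytes = 16_384)
  begin
    socket.read_nonblock(max_bytes)
  rescue IO::WaitReadable
    IO.select([socket]) # block only until this socket is readable
    retry
  rescue EOFError
    nil # peer closed the connection
  end
end
```

The difference with net-ssh's current loop is the order of operations: here the read is attempted first and the caller only waits when the socket genuinely has nothing to offer, instead of paying for an IO.select round-trip before every blocking read.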
Hi guys, excellent work digging into this. I'm going on the road for a week on Friday, and I don't have time to get too deep into this with you in the next 24h, but you are in good hands with @mattbrictson, and I'll be reading my emails during CW6 if either of us can do anything to support you. We can always escalate things if needed.
@byroot It looks like with #327 and net-ssh/net-ssh#308 both merged, this issue is now largely resolved. I did some benchmarking of my own, and while the current SSHKit ConnectionPool does have some unnecessary overhead, in the big picture it is pretty negligible. I think we have reached the point of diminishing returns. Agreed?
net-ssh/net-ssh#308 in itself doesn't fix anything. I'd like to propose one last feature in SSHKit: replacing Net::SSH's KnownHosts with an alternative implementation that caches the lookups in memory. Would you agree with such a feature? I'm OK with closing this issue. I got sidetracked, but I'll try to see if there are other performance improvements I could bring; I don't need this issue open to do so.
@byroot Yes, I'm open to having SSHKit ship with a better/faster KnownHosts implementation. Let's keep this issue open until that is complete. Also: over the past couple of days I worked on a rewrite of the SSHKit ConnectionPool that uses much less mutex locking and moves stale connection eviction to a background thread. If I open a PR, would you be willing to test my branch against your 180 servers? 😁 My hunch is that the mutex overhead is not really a big deal, but I'm curious to see the results.
Of course, I'd be happy to.
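For illustration, a hypothetical sketch of the background-eviction design described above, not the actual SSHKit rewrite: the checkout/checkin hot path never scans for stale connections, and a reaper thread drops idle ones periodically. Names and intervals are assumptions:

```ruby
# Hypothetical pool with a reaper thread; not SSHKit's implementation.
class ReapedPool
  Entry = Struct.new(:connection, :last_used)

  def initialize(idle_timeout: 30, reap_interval: 5)
    @idle_timeout = idle_timeout
    @entries = {}
    @mutex   = Mutex.new
    @reaper  = Thread.new do
      loop do
        sleep reap_interval
        cutoff = Time.now - @idle_timeout
        @mutex.synchronize do
          @entries.delete_if { |_key, entry| entry.last_used < cutoff }
        end
      end
    end
  end

  def checkin(key, connection)
    @mutex.synchronize { @entries[key] = Entry.new(connection, Time.now) }
  end

  def checkout(key)
    @mutex.synchronize do
      entry = @entries.delete(key)
      entry && entry.connection
    end
  end
end
```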
…ptions As discussed in capistrano#326 (comment), Net::SSH re-parses the known_hosts files every time it needs to look up a known key. This alternative implementation parses them once and caches the result.
Just wanted to give an update on this. Today I updated our Capistrano recipe to use all the improvements we've made so far (including the yet unreleased #330), and as far as I'm concerned this issue is solved. The reason I opened this issue initially is that when I attempted to deploy both our datacenters with the same recipe, it ended up being ~50% slower than doing it in 2 concurrent processes like we used to. On the second attempt there is no noticeable difference anymore. There is likely still room for improvement. @mattbrictson, let me know if you need me to do anything else to get #330 merged; other than that I'm good with closing this issue.
Thanks for the feedback @byroot, and for your time and contributions. I'll merge #330 in a day or two unless @mattbrictson chimes in in the meantime. I didn't see anything blocking; Matt is taking a short break from FOSS, but I still want to give him the chance, without pressure, to have a say.
We use Capistrano to deploy to several hundred servers at once, and we noticed Capistrano's performance was heavily tied to the number of servers.
To prove it, I ran a benchmark against a varying number of servers.
Here are the results: Spreadsheet
The first graph is with "cold" connections, meaning it measures connection establishment plus the command execution. In the second graph, all the connections were pre-established before the benchmark.
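For reference, a minimal sketch of this kind of cold-vs-warm measurement using SSHKit's DSL; the host names, server counts, command, and use of Benchmark are assumptions rather than the original script:

```ruby
require "benchmark"
require "sshkit"
require "sshkit/dsl"
include SSHKit::DSL

# Placeholder host names; the real benchmark ran against production servers.
all_hosts = (1..180).map { |i| "app-#{i}.example.com" }

[10, 45, 90, 180].each do |count|
  hosts = all_hosts.first(count)

  # Cold: includes SSH connection establishment.
  cold = Benchmark.realtime { on(hosts, in: :parallel) { execute :true } }

  # Warmed: connections now sit in SSHKit's connection pool,
  # so this mostly measures command execution.
  warm = Benchmark.realtime { on(hosts, in: :parallel) { execute :true } }

  puts format("%3d hosts  cold: %6.2fs  warm: %6.2fs", count, cold, warm)
end
```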
I'm still investigating to figure out where exactly the bottleneck (or bottlenecks) is. I know the GIL plays its part, but Capistrano / SSHKit being IO-heavy, I think there are other reasons.
cc @kirs as well as @csfrancis