Unix domain sockets & findings from offcputime analysis #1109
Implements a Postgres-compatible Unix domain socket protocol. This time, there is no attempt to use libev to port everything over from epoll to io_uring.
You can now change the Unix domain socket settings.
Further updates: epoll overhead is real and fairly significant--I think it's responsible for an avoidable ~5-10% loss in localhost throughput--but designing around it is extremely painful, and of course, the alternatives have overhead of their own.

From investigations of macOS vs. Linux (thanks Matt), Postgres in a VM vs. Postgres on bare metal, and Noisepage in a VM vs. Noisepage on bare metal, it seems that our TPC-C throughput is very heavily constrained by networking and other kernel/OS-related overhead. While efforts like this PR can reduce that overhead substantially, a huge amount remains. I believe that the extended query protocol, and to a lesser extent its implementation (through epoll), presents a major system performance bottleneck on benchmarks like TPC-C.

I have also been plagued while profiling by variance--in both the statistical and the literal sense--in network stack performance across platforms, kernel versions, and even CPU microcode versions(!). This makes me question whether network-IO-intensive workloads such as TPC-C are a good fit for modeling and sanity-checking our performance data.
A few small notes to address, then LGTM pending CI
Adds Doxygen comments to ConnectionDispatcherTask::ConnectionDispatcherTask (connection_dispatcher_task.h) and NetworkLayer::NetworkLayer (db_main.h)
Adds documentation to the test case in order to explain how the test connects to the domain socket.
Addresses code review feedback. Removes an unnecessary memset that was left in from an early development stage. I never would've caught this, so thanks Deepayan!
LGTM
So, I can confirm that Postgres is also affected by the same TCP issue. It gains a lot from using Unix domain sockets. Benched with Postgres running on Optane (NVMe, not pmem):
Results are reproducible: Postgres regularly gains around 25-30% from using domain sockets.
Renames SimpleQueryTest to UnixDomainSocketTest.
Improves portability of pathname validation, particularly with regards to macOS. Thanks, @gengkev!
Requested a review from @lmwnshn so he can take a quick pass to make sure it doesn't conflict with anything he's poking at in the network layer.
I think error handling would be detectable if it were made an assert. Additionally, some changes to commenting would be nice. I am in the process of refactoring the network layer and will probably change stuff in terrier_server in a couple of days. Overall, LGTM though! Would be fine with merging as-is.
Found a bug while merging it into my own branch.
Keeps sun/sin initialization the same for now.
LGTM.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1109      +/-   ##
==========================================
- Coverage   82.63%   81.72%   -0.91%
==========================================
  Files         653      650       -3
  Lines       46429    43784    -2645
==========================================
- Hits        38366    35784    -2582
+ Misses       8063     8000      -63
```
Co-authored-by: Andy Pavlo <pavlo@cs.brown.edu>
Co-authored-by: Wan Shen Lim <wanshen.lim@gmail.com>
Unix Domain Sockets
just ctrl-f "results"
Background
While investigating our offcputime, I was struck by how much time was spent strictly on network tasks. EC2 numbers pointed to about 8% being spent on just flushing write buffers, while my own profiles from VMware Workstation suggested that about 80% of off-CPU time was spent on flushing write buffers. (Small update: restarting the VM brought that number down drastically; faster writes took me from ~120 t/s to ~220+ t/s on 1 terminal!)
I wanted to see if Postgres was similarly affected on a VM. As it turns out, it absolutely is, and it is affected to the exact same extent. However, Postgres can cheat and use Unix domain sockets (think named pipes, but accessed via the filesystem) on supported tests. I sought to evaluate the extent to which Unix domain sockets could impact our performance by reducing offcputime.
Procedure
Disclaimer: I can't run oltpbench with more than about 10 terminals without the DB dying. This is the case for master and my code. I get these errors on the server:
All benchmarks were performed with 200 connection threads on a machine with twenty-four free cores. The benchmark used was tpcc from oltpbench. tpcc is a nice workload for this PR, since it really exaggerates protocol overhead. The terrier server was restarted prior to each run of the benchmark. See the section titled "The PR" for my configuration file. The virtual machine software used is VMware Workstation 15.
The virtual machine is running Ubuntu 18.04 LTS with kernel version 5.4.0-42-generic.
The bare metal machine is running Arch Linux with kernel version 5.4.57-1-lts.
Results
Discussion
To be honest, I'm pretty confused. The results go entirely against my expectations. I expected the gains from Unix domain sockets on bare metal to be a small fraction of the gains from a virtual machine. My interpretation of bare metal profiling data suggested that I had at most ~10% to gain from this PR. I was wrong on both counts. I can consistently reproduce the speedups that I have measured here. I'll revisit this with VTune again, profiling the code with and without Unix domain sockets, and see what's up.
The PR and its future
This PR creates a new network test, two new settings, and a "new" (it extracts out existing code) method in terrier_server. It supports simultaneous usage of both Unix and networked/INET sockets, with logs displaying relevant socket information, and with graceful handling and recovery from errors.
Since this adds a potentially useful feature from two big DBs (Postgres and MySQL) with next to no overhead, it's nice from a features perspective. That said, I don't think it makes a ton of sense to merge this PR. It (1) doesn't bring us much closer to project goals, and (2) doesn't actually close any IO-protocol-induced gap between us and Postgres/MySQL in benchmarks, since, as it turns out, oltpbench can't actually use their Unix domain sockets.
In light of the last comment, I should note that to benchmark this PR, you would want to use my fork of Matt's fork of oltpbench. The real change is just two lines, though for others' convenience and for the sake of reproducibility, I've also changed the default settings of the benchmark to mirror what I've used. Instructions for running the benchmark are available here.
Further work
Investigating the source of the speedup
At this point, this is my main goal. I'd like to try to reconcile my interpretation of previous profiling data with the results that I've collected. This will involve more profiling, but I expect it to be straightforward.
Investigating libevent/epoll
This seems like a dead end.
After running oltpbench with Wireshark and being stunned by the number of packets sent by tpcc, I wondered whether libevent and epoll could be sources of overhead. I've never been a fan of epoll--the only decent API that libevent supports on Linux--so I tried porting things over to libev. The idea was that I'd be able to use io_uring, a much more performant (and, if implemented properly, probably much cleaner to use) kernel API. After our meeting, this didn't seem worth it, so for now, work on libev and io_uring support is on pause.
In any case, I would hope that libevent's epoll backend would be more than fast enough for a mere ten terminals. epoll is plenty fast for our purposes, but that assumes it is used well, and it's very hard to tell from libevent's codebase whether that's the case. Attempts to update the libevent version and to use edge-triggered epoll had no impact on performance, so I think we're still far from any theoretical epoll limits.
Investigating loopback and TCP overhead
This is probably also a dead end, but less of a dead end than the libevent/epoll stuff.
I'd like to work on improving our INET socket overhead, but at this point, that might not be doable without being a kernel engineer. Nothing I tried measurably reduced loopback and/or INET overhead, and I tried a lot. Still, I think it's worth digging some more.
One question I have is how much of the speedup comes from avoiding TCP, and how much comes from using a glorified pipe instead of loopback. This would be interesting to know, but I have no idea how I'd measure, and I'm not convinced that I'd be able to make anything useful from the answer.
If I don't think this should be merged, then why did I make this giant PR?
Good question. Four reasons:
I think that there is potential for good discussion from the findings here.
I've spent way too much time investigating off-CPU time and don't want it to go to waste. I want the information I've collected to be accessible to others. I needed to write this all down anyway, so it only made sense to clean up my notes, put them in markdown, and share them.
While Unix domain sockets aren't particularly practical, they do demonstrate a serious source of overhead in a way that anyone can verify--we no longer need to speculate, for example, as to whether or not (and/or, to what extent) loopback overhead might be costing us. Hence, as an academic exercise, I found writing this PR to be extremely useful as a way to gain a deeper insight into the performance characteristics of the system.
By sharing the findings from my investigation in this PR, I'm hoping that someone smart might realize a way to claw back some of the overhead losses.