This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Unix domain sockets & findings from offcputime analysis #1109

Merged: 27 commits merged into cmu-db:master from the unix_domain_sockets branch on Sep 16, 2020

Conversation

gonzalezjo
Collaborator

Unix Domain Sockets

just ctrl-f "results"

Background

While investigating our offcputime, I was struck by how much time was spent strictly on network tasks. EC2 numbers pointed to about 8% of offcputime being spent just flushing write buffers, while my own profiles from VMware Workstation suggested that about 80% was spent flushing write buffers. (Small update: restarting the VM brought that number down drastically. I went from ~120 t/s to ~220+ t/s on 1 terminal thanks to faster writes!)

I wanted to see if Postgres was similarly affected on a VM. As it turns out, it absolutely is, and it is affected to the exact same extent. However, Postgres can cheat and use Unix domain sockets (think named pipes, but accessed via the filesystem) on supported tests. I sought to evaluate the extent to which Unix domain sockets could impact our performance by reducing offcputime.
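For readers less familiar with the mechanism, the client-side difference between the two transports is tiny. Below is a minimal sketch (my own illustration, not code from this PR) showing that a Unix domain socket connection just swaps the address family and replaces the IP:port pair with a filesystem path; the `/tmp/.s.PGSQL.5432` path in the comment is the conventional default location of Postgres's socket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Connect over TCP/INET: address family AF_INET, addressed by IP and port.
int ConnectInet(const char *ip, uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  inet_pton(AF_INET, ip, &addr.sin_addr);
  if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// Connect over a Unix domain socket: address family AF_UNIX, addressed by a
// filesystem path, e.g. "/tmp/.s.PGSQL.5432" on a default Postgres install.
int ConnectUnix(const char *path) {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```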

Procedure

Disclaimer: I can't run oltpbench with more than about 10 terminals without the DB dying. This is the case for master and my code. I get these errors on the server:

[2020-08-22 17:29:34.801] [network_logger] [error] Error writing: Connection reset by peer
[2020-08-22 17:29:34.801] [network_logger] [error] Error writing: Connection reset by peer
[2020-08-22 17:29:34.801] [network_logger] [error] Error writing: Connection reset by peer
[2020-08-22 17:29:34.801] [network_logger] [error] Error when filling read buffer
[2020-08-22 17:29:34.801] [network_logger] [error] Error when filling read buffer
[2020-08-22 17:29:34.801] [network_logger] [error] Error when filling read buffer
[2020-08-22 17:29:34.802] [network_logger] [error] Error when filling read buffer

All benchmarks were performed with 200 connection threads on a machine with twenty-four free cores. The benchmark used was tpcc from oltpbench. tpcc is a nice workload for this PR, since it really exaggerates protocol overhead. The terrier server was restarted prior to each run of the benchmark. See the section titled "The PR" for my configuration file. The virtual machine software used is VMware Workstation 15.

The virtual machine is running Ubuntu 18.04 LTS with kernel version 5.4.0-42-generic.

The bare metal machine is running Arch Linux with kernel version 5.4.57-1-lts.

Results

| Machine Type | Terminals | Scale Factor | Socket Type | Transactions / Second |
| --- | --- | --- | --- | --- |
| Bare metal | 10 | 10 | UNIX | 8346 (+34%) |
| Bare metal | 10 | 10 | INET | 6214 |
| Bare metal | 1 | 1 | UNIX | 914 (+37%) |
| Bare metal | 1 | 1 | INET | 668 |
| VMware | 10 | 10 | UNIX | 3728 (+24%) |
| VMware | 10 | 10 | INET | 3005 |
| VMware | 1 | 1 | UNIX | 289 (+28%) |
| VMware | 1 | 1 | INET | 226 |

Discussion

To be honest, I'm pretty confused. The results go entirely against my expectations. I expected the gains from Unix domain sockets on bare metal to be a small fraction of the gains on a virtual machine. My interpretation of bare metal profiling data suggested that I had at most ~10% to gain from this PR. I was wrong on both counts. I can consistently reproduce the speedups measured here. I'll revisit this with VTune, profiling the code with and without Unix domain sockets, to see what's up.

The PR and its future

This PR creates a new network test, two new settings, and a "new" method in terrier_server (it extracts existing code). It supports simultaneous use of both Unix domain and networked/INET sockets, with logs displaying relevant socket information and with graceful handling of and recovery from errors.
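As a rough illustration of the simultaneous-listener idea (function and variable names here are mine, not the PR's; this is a sketch on top of libevent's evconnlistener API rather than the actual terrier_server code), the approach boils down to registering one listener per socket type on the same event base:

```cpp
#include <event2/event.h>
#include <event2/listener.h>
#include <netinet/in.h>
#include <sys/un.h>
#include <cstdint>
#include <cstring>

// Both listeners funnel accepted connections into the same callback; the
// protocol logic downstream never needs to know which transport was used.
static void OnAccept(evconnlistener *, evutil_socket_t fd, sockaddr *, int, void *) {
  // Hand fd off to a connection handler task here.
  (void)fd;
}

void ListenOnBoth(event_base *base, uint16_t port, const char *socket_path) {
  // INET/TCP listener.
  sockaddr_in inet_addr{};
  inet_addr.sin_family = AF_INET;
  inet_addr.sin_addr.s_addr = htonl(INADDR_ANY);
  inet_addr.sin_port = htons(port);
  evconnlistener_new_bind(base, OnAccept, nullptr,
                          LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
                          reinterpret_cast<sockaddr *>(&inet_addr), sizeof(inet_addr));

  // Unix domain socket listener, bound to a filesystem path.
  // (A real server would unlink(socket_path) first to clear a stale socket file.)
  sockaddr_un unix_addr{};
  unix_addr.sun_family = AF_UNIX;
  std::strncpy(unix_addr.sun_path, socket_path, sizeof(unix_addr.sun_path) - 1);
  evconnlistener_new_bind(base, OnAccept, nullptr,
                          LEV_OPT_CLOSE_ON_FREE, -1,
                          reinterpret_cast<sockaddr *>(&unix_addr), sizeof(unix_addr));
}
```

Once accepted, both kinds of file descriptors can flow through the same handler, which is why a feature like this can add next to no overhead.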

Since this adds a potentially useful feature from two big DBs (Postgres and MySQL) with next to no overhead, it's nice from a features perspective. That said, I don't think it makes a ton of sense to merge this PR. It (1) doesn't bring us much closer to project goals, and (2) doesn't actually close any I/O-protocol-induced gap between us and Postgres/MySQL in benchmarks, since, as it turns out, oltpbench can't actually use their Unix domain sockets.

In light of that last point, I should note that to benchmark this PR, you would want to use my fork of Matt's fork of oltpbench. The real change is just two lines, though for others' convenience and for the sake of reproducibility, I've also changed the default settings of the benchmark to mirror what I used. Instructions for running the benchmark are available here.

Further work

Investigating the source of the speedup

At this point, this is my main goal. I'd like to try to reconcile my interpretation of previous profiling data with the results that I've collected. This will involve more profiling, but I expect it to be straightforward.

Investigating libevent/epoll

This seems like a dead end.

After running oltpbench with Wireshark and being stunned by the number of packets sent by tpcc, I wondered whether libevent and epoll could be potential sources of overhead. I've never been a fan of epoll--the only decent API that libevent supports on Linux--so I tried porting things over to libev. The idea was that I'd be able to use io_uring, a much more performant (and, if implemented properly, probably much cleaner to use) kernel API. After our meeting, this didn't seem worth it, so for now, work on libev and io_uring support is on pause.

In any case, I would hope that libevent's epoll backend would be more than fast enough for a mere ten terminals. epoll is plenty fast for our purposes, assuming it's used well, but it's very hard to tell from libevent's codebase whether that's the case. Attempts to update the libevent version and to use edge-triggered epoll had no impact on performance, so I think we're still far from any theoretical epoll limits.
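For concreteness, here is roughly what the edge-triggered attempt looks like through libevent (a hedged sketch with my own names, not the code from this branch): the EV_ET flag requests EPOLLET semantics from the epoll backend, which shifts the burden onto the read callback to drain the socket until EAGAIN.

```cpp
#include <event2/event.h>
#include <unistd.h>
#include <cerrno>

// With edge-triggered epoll, a readable notification fires when new data
// arrives, so the callback must keep reading until the kernel reports
// EAGAIN/EWOULDBLOCK rather than relying on being woken again for leftovers.
static void OnReadable(evutil_socket_t fd, short /*what*/, void * /*arg*/) {
  char buf[4096];
  for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
      // ...hand buf[0..n) to the protocol interpreter...
      continue;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;  // drained
    break;  // EOF or hard error; connection teardown would happen elsewhere
  }
}

// EV_ET asks libevent's epoll backend for edge-triggered (EPOLLET) semantics.
event *MakeEdgeTriggeredReadEvent(event_base *base, evutil_socket_t fd) {
  return event_new(base, fd, EV_READ | EV_ET | EV_PERSIST, OnReadable, nullptr);
}
```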

Investigating loopback and TCP overhead

This is probably also a dead end, but less of a dead end than the libevent/epoll stuff.

I'd like to work on reducing our INET socket overhead, but at this point, that might not be doable without being a kernel engineer. Nothing I tried measurably reduced loopback and/or INET overhead, and I tried a lot. Still, I think it's worth digging some more.

One question I have is how much of the speedup comes from avoiding TCP, and how much comes from using a glorified pipe instead of loopback. This would be interesting to know, but I have no idea how I'd measure it, and I'm not convinced I'd be able to make anything useful of the answer.
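One crude way to at least approximate the split (purely a sketch on my part, with a hypothetical helper name) would be a one-byte ping-pong micro-benchmark over an already-connected descriptor: run it against loopback TCP (with and without TCP_NODELAY) and against a Unix domain socket, and compare per-round-trip cost. It assumes the peer echoes each byte straight back.

```cpp
#include <unistd.h>
#include <chrono>

// Time `iterations` one-byte ping-pong round trips over a connected socket fd.
// Assumes the peer echoes every byte back. Returns the average round-trip time
// in microseconds, or -1.0 on error. (Hypothetical helper, not part of this PR.)
double MeasureRoundTripMicros(int fd, int iterations = 100000) {
  char byte = 'x';
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    if (write(fd, &byte, 1) != 1) return -1.0;
    if (read(fd, &byte, 1) != 1) return -1.0;
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}
```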

If I don't think this should be merged, then why did I make this giant PR?

Good question. Four reasons:

  1. I think that there is potential for good discussion from the findings here.

  2. I've spent way too much time investigating off CPU time and don't want it to go to waste. I want the information I've collected to be accessible to others. I needed to write this all down anyway. It only made sense to clean up my notes, put them in markdown, and share them.

  3. While Unix domain sockets aren't particularly practical, they do demonstrate a serious source of overhead in a way that anyone can verify--we no longer need to speculate, for example, about whether (or to what extent) loopback overhead might be costing us. Hence, as an academic exercise, I found writing this PR extremely useful for gaining deeper insight into the performance characteristics of the system.

  4. By sharing the findings from my investigation in this PR, I'm hoping that someone smart might realize a way to claw back some of the overhead losses.

Implements a Postgres-compatible Unix domain socket protocol. This time, without using libev to try to port everything over from epoll to io_uring.
You can now change the Unix domain socket settings.
@gonzalezjo gonzalezjo added question/discussion Further information is requested. performance Performance related issues or changes. do-not-merge This PR is either not intended to merge, or did not pass review. labels Aug 22, 2020
@gonzalezjo gonzalezjo force-pushed the unix_domain_sockets branch from 2d5af11 to 91324ad Compare August 24, 2020 17:17
@gonzalezjo
Collaborator Author

Further updates: epoll overhead is real and fairly significant--I think it's responsible for an avoidable ~5-10% loss in localhost throughput--but designing around it is extremely painful. And of course, the alternatives have overhead of their own.

From investigations of macOS vs. Linux (thanks, Matt), Postgres in a VM vs. Postgres on bare metal, and NoisePage in a VM vs. NoisePage on bare metal, it seems like our TPC-C throughput is very heavily constrained by networking and other kernel/OS-related overhead. While efforts like this PR can reduce that overhead substantially, a huge amount remains.

I believe that the extended query protocol, and to a lesser extent its implementation (through epoll), presents a major system performance bottleneck on benchmarks like TPC-C. While profiling, I have also been plagued by variance--in both the statistical and literal sense--in network stack performance across platforms, kernel versions, and even CPU microcode versions(!)

This makes me question whether network-I/O-intensive workloads such as TPC-C are a good fit for modeling and sanity-checking our performance data.

@apavlo apavlo requested a review from thepinetree August 28, 2020 20:57
@lmwnshn lmwnshn added the ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. label Aug 29, 2020
Contributor

@thepinetree thepinetree left a comment


A few small notes to address, then LGTM pending CI

Review comments (now resolved) on src/network/terrier_server.cpp and test/network/network_test.cpp.
@lmwnshn lmwnshn added in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. and removed ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. labels Sep 1, 2020
Adds Doxygen comments to ConnectionDispatcherTask::ConnectionDispatcherTask (connection_dispatcher_task.h) and NetworkLayer::NetworkLayer (db_main.h)
Adds documentation to the test case in order to explain how the test connects to the domain socket.
Addresses code review feedback. Removes an unnecessary memset that was left in from an early development stage. I never would've caught this, so thanks Deepayan!
Contributor

@thepinetree thepinetree left a comment


LGTM

@gonzalezjo gonzalezjo added ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. and removed do-not-merge This PR is either not intended to merge, or did not pass review. in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. labels Sep 4, 2020
@gonzalezjo
Collaborator Author

gonzalezjo commented Sep 4, 2020

So, I can confirm that Postgres is also affected by the same TCP issue. It gains a lot from using Unix domain sockets. Benched with postgres running on Optane (NVMe, not pmem):

jordig@dev5:~/oltpbench/oltpbench$ pgbench -c 10 -T 60 -j 10 -M extended -P 1 -h 127.0.0.1 -p 5433
Password: 
starting vacuum...end.
progress: 1.0 s, 12806.3 tps, lat 0.775 ms stddev 0.139
progress: 2.0 s, 13542.9 tps, lat 0.738 ms stddev 0.115
progress: 3.0 s, 13529.2 tps, lat 0.739 ms stddev 0.110
progress: 4.0 s, 13707.6 tps, lat 0.730 ms stddev 0.127
progress: 5.0 s, 13835.9 tps, lat 0.723 ms stddev 0.119
progress: 6.0 s, 13776.8 tps, lat 0.726 ms stddev 0.112
progress: 7.0 s, 13784.6 tps, lat 0.725 ms stddev 0.116
progress: 8.0 s, 13637.0 tps, lat 0.733 ms stddev 0.117
progress: 9.0 s, 13154.1 tps, lat 0.760 ms stddev 0.189

^C

jordig@dev5:~/oltpbench/oltpbench$ pgbench -c 10 -T 60 -j 10 -M extended -P 1
starting vacuum...end.
progress: 1.0 s, 17512.7 tps, lat 0.567 ms stddev 0.099
progress: 2.0 s, 17673.7 tps, lat 0.566 ms stddev 0.090
progress: 3.0 s, 17570.8 tps, lat 0.569 ms stddev 0.095
progress: 4.0 s, 17414.5 tps, lat 0.574 ms stddev 0.097
progress: 5.0 s, 17401.7 tps, lat 0.575 ms stddev 0.098
progress: 6.0 s, 17128.6 tps, lat 0.584 ms stddev 0.098
progress: 7.0 s, 17420.6 tps, lat 0.574 ms stddev 0.095
progress: 8.0 s, 17277.9 tps, lat 0.579 ms stddev 0.097
progress: 9.0 s, 17086.5 tps, lat 0.585 ms stddev 0.096

Results are reproducible: Postgres regularly gains around 25-30% from using domain sockets (in the runs above, ~17,400 tps over the domain socket vs. ~13,700 tps over loopback TCP, roughly a 27% improvement).

@apavlo apavlo added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. labels Sep 4, 2020
Renames SimpleQueryTest to UnixDomainSocketTest.
Improves portability of pathname validation, particularly with regard to macOS. Thanks, @gengkev!
@mbutrovich
Contributor

Requested a review from @lmwnshn so he can take a quick pass to make sure it doesn't conflict with anything he's poking at in the network layer.

Contributor

@lmwnshn lmwnshn left a comment


I think error handling would be detectable if it were made an assert. Additionally, some changes to commenting would be nice. I am in the process of refactoring the network layer and will probably change stuff in terrier_server in a couple of days. Overall, LGTM though! Would be fine with merging as-is.

Review comments (resolved) on src/network/terrier_server.cpp, src/network/connection_dispatcher_task.cpp, and src/include/network/terrier_server.h.
@lmwnshn lmwnshn added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. labels Sep 13, 2020
Contributor

@lmwnshn lmwnshn left a comment


Found a bug while merging it into my own branch.

Review comment (outdated, resolved) on src/network/terrier_server.cpp.
@lmwnshn lmwnshn added in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. and removed ready-to-merge This PR is ready to be merged. Mark PRs with this. labels Sep 15, 2020
Keeps sun/sin initialization the same for now.
Contributor

@lmwnshn lmwnshn left a comment


LGTM.

@lmwnshn lmwnshn added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. labels Sep 15, 2020
@codecov

codecov bot commented Sep 16, 2020

Codecov Report

Merging #1109 into master will decrease coverage by 0.90%.
The diff coverage is 76.56%.


@@            Coverage Diff             @@
##           master    #1109      +/-   ##
==========================================
- Coverage   82.63%   81.72%   -0.91%     
==========================================
  Files         653      650       -3     
  Lines       46429    43784    -2645     
==========================================
- Hits        38366    35784    -2582     
+ Misses       8063     8000      -63     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| src/include/network/connection_dispatcher_task.h | 100.00% <ø> | (ø) |
| src/include/network/terrier_server.h | 100.00% <ø> | (ø) |
| src/network/network_io_wrapper.cpp | 75.92% <ø> | (+7.40%) ⬆️ |
| src/network/terrier_server.cpp | 76.47% <72.22%> | (-6.46%) ⬇️ |
| src/include/main/db_main.h | 94.04% <100.00%> | (+1.01%) ⬆️ |
| src/include/settings/settings_defs.h | 100.00% <100.00%> | (ø) |
| src/network/connection_dispatcher_task.cpp | 93.33% <100.00%> | (+0.47%) ⬆️ |
| src/optimizer/rules/unnesting_rules.cpp | 11.36% <0.00%> | (-63.02%) ⬇️ |
| src/include/optimizer/rules/unnesting_rules.h | 50.00% <0.00%> | (-50.00%) ⬇️ |
| src/optimizer/input_column_deriver.cpp | 56.84% <0.00%> | (-18.00%) ⬇️ |
| ... and 71 more | | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 70f7e9e...bef37c0.

@lmwnshn lmwnshn merged commit b68a4a8 into cmu-db:master Sep 16, 2020
xinzhu-cai pushed a commit to xinzhu-cai/terrier that referenced this pull request Sep 22, 2020
Co-authored-by: Andy Pavlo <pavlo@cs.brown.edu>
Co-authored-by: Wan Shen Lim <wanshen.lim@gmail.com>
@jkosh44 jkosh44 mentioned this pull request Oct 25, 2020
@preetansh preetansh mentioned this pull request Oct 29, 2020
@rabbit721 rabbit721 mentioned this pull request Nov 8, 2020
@tpan496 tpan496 mentioned this pull request Nov 14, 2020
@thepinetree thepinetree mentioned this pull request Jan 13, 2021