This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Unix domain sockets & findings from offcputime analysis #1109

Merged: 27 commits merged into cmu-db:master from the unix_domain_sockets branch on Sep 16, 2020

Conversation

gonzalezjo
Collaborator

Unix Domain Sockets

just ctrl-f "results"

Background

While investigating our offcputime, I was struck by how much time was spent strictly on network tasks. EC2 numbers pointed to about 8% of offcputime being spent just flushing write buffers, while my own profiles from VMware Workstation suggested that about 80% was spent flushing write buffers. (Small update: restarting the VM brought that number down drastically. I went from ~120 t/s to ~220+ t/s on 1 terminal thanks to faster writes!)

I wanted to see if Postgres was similarly affected on a VM. As it turns out, it absolutely is, and it is affected to the exact same extent. However, Postgres can cheat and use Unix domain sockets (think named pipes, but accessed via the filesystem) on supported tests. I sought to evaluate the extent to which Unix domain sockets could impact our performance by reducing offcputime.
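For readers less familiar with the mechanism, the client-side difference between the two transports is tiny. Below is a minimal sketch (my own illustration, not code from this PR) showing that a Unix domain socket connection just swaps the address family and replaces the IP:port pair with a filesystem path; the `/tmp/.s.PGSQL.5432` path in the comment is the conventional default location of Postgres's socket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Connect over TCP/INET: address family AF_INET, addressed by IP and port.
int ConnectInet(const char *ip, uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  inet_pton(AF_INET, ip, &addr.sin_addr);
  if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// Connect over a Unix domain socket: address family AF_UNIX, addressed by a
// filesystem path, e.g. "/tmp/.s.PGSQL.5432" on a default Postgres install.
int ConnectUnix(const char *path) {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```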

Procedure

Disclaimer: I can't run oltpbench with more than about 10 terminals without the DB dying. This is the case for master and my code. I get these errors on the server:

[2020-08-22 17:29:34.801] [network_logger] [error] Error writing: Connection reset by peer
[2020-08-22 17:29:34.801] [network_logger] [error] Error writing: Connection reset by peer
[2020-08-22 17:29:34.801] [network_logger] [error] Error writing: Connection reset by peer
[2020-08-22 17:29:34.801] [network_logger] [error] Error when filling read buffer
[2020-08-22 17:29:34.801] [network_logger] [error] Error when filling read buffer
[2020-08-22 17:29:34.801] [network_logger] [error] Error when filling read buffer
[2020-08-22 17:29:34.802] [network_logger] [error] Error when filling read buffer

All benchmarks were performed with 200 connection threads on a machine with twenty-four free cores. The benchmark used was tpcc from oltpbench. tpcc is a nice workload for this PR, since it really exaggerates protocol overhead. The terrier server was restarted prior to each run of the benchmark. See the section titled "The PR" for my configuration file. The virtual machine software used is VMware Workstation 15.

The virtual machine is running Ubuntu 18.04 LTS with kernel version 5.4.0-42-generic.

The bare metal machine is running Arch Linux with kernel version 5.4.57-1-lts.

Results

| Machine Type | Terminals | Scale Factor | Socket Type | Transactions / Second |
| --- | --- | --- | --- | --- |
| Bare metal | 10 | 10 | UNIX | 8346 (+34%) |
| Bare metal | 10 | 10 | INET | 6214 |
| Bare metal | 1 | 1 | UNIX | 914 (+37%) |
| Bare metal | 1 | 1 | INET | 668 |
| VMware | 10 | 10 | UNIX | 3728 (+24%) |
| VMware | 10 | 10 | INET | 3005 |
| VMware | 1 | 1 | UNIX | 289 (+28%) |
| VMware | 1 | 1 | INET | 226 |

Discussion

To be honest, I'm pretty confused. The results go entirely against my expectations. I expected the gains from Unix domain sockets on bare metal to be a small fraction of the gains on a virtual machine. My interpretation of bare metal profiling data suggested that I had at most ~10% to gain from this PR. I was wrong on both counts. I can consistently reproduce the speedups measured here. I'll revisit this with VTune, profiling the code with and without Unix domain sockets, to see what's up.

The PR and its future

This PR creates a new network test, two new settings, and a "new" method in terrier_server (it extracts existing code). It supports simultaneous use of both Unix domain and networked/INET sockets, with logs displaying relevant socket information and with graceful handling of and recovery from errors.
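As a rough illustration of the simultaneous-listener idea (function and variable names here are mine, not the PR's; this is a sketch on top of libevent's evconnlistener API rather than the actual terrier_server code), the approach boils down to registering one listener per socket type on the same event base:

```cpp
#include <event2/event.h>
#include <event2/listener.h>
#include <netinet/in.h>
#include <sys/un.h>
#include <cstdint>
#include <cstring>

// Both listeners funnel accepted connections into the same callback; the
// protocol logic downstream never needs to know which transport was used.
static void OnAccept(evconnlistener *, evutil_socket_t fd, sockaddr *, int, void *) {
  // Hand fd off to a connection handler task here.
  (void)fd;
}

void ListenOnBoth(event_base *base, uint16_t port, const char *socket_path) {
  // INET/TCP listener.
  sockaddr_in inet_addr{};
  inet_addr.sin_family = AF_INET;
  inet_addr.sin_addr.s_addr = htonl(INADDR_ANY);
  inet_addr.sin_port = htons(port);
  evconnlistener_new_bind(base, OnAccept, nullptr,
                          LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
                          reinterpret_cast<sockaddr *>(&inet_addr), sizeof(inet_addr));

  // Unix domain socket listener, bound to a filesystem path.
  // (A real server would unlink(socket_path) first to clear a stale socket file.)
  sockaddr_un unix_addr{};
  unix_addr.sun_family = AF_UNIX;
  std::strncpy(unix_addr.sun_path, socket_path, sizeof(unix_addr.sun_path) - 1);
  evconnlistener_new_bind(base, OnAccept, nullptr,
                          LEV_OPT_CLOSE_ON_FREE, -1,
                          reinterpret_cast<sockaddr *>(&unix_addr), sizeof(unix_addr));
}
```

Once accepted, both kinds of file descriptors can flow through the same handler, which is why a feature like this can add next to no overhead.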

Since this adds a potentially useful feature from two big DBs (Postgres and MySQL) with next to no overhead, it's nice from a features perspective. That said, I don't think it makes a ton of sense to merge this PR. It (1) doesn't bring us much closer to project goals, and (2) doesn't actually close any I/O-protocol-induced gap between us and Postgres/MySQL in benchmarks, since, as it turns out, oltpbench can't actually use their Unix domain sockets.

In light of that last point, I should note that to benchmark this PR, you would want to use my fork of Matt's fork of oltpbench. The real change is just two lines, though for others' convenience and for the sake of reproducibility, I've also changed the default settings of the benchmark to mirror what I used. Instructions for running the benchmark are available here.

Further work

Investigating the source of the speedup

At this point, this is my main goal. I'd like to try to reconcile my interpretation of previous profiling data with the results that I've collected. This will involve more profiling, but I expect it to be straightforward.

Investigating libevent/epoll

This seems like a dead end.

After running oltpbench with Wireshark and being stunned by the number of packets sent by tpcc, I wondered whether libevent and epoll could be potential sources of overhead. I've never been a fan of epoll--the only decent API that libevent supports on Linux--so I tried porting things over to libev. The idea was that I'd be able to use io_uring, a much more performant (and, if implemented properly, probably much cleaner to use) kernel API. After our meeting, this didn't seem worth it, so for now, work on libev and io_uring support is on pause.

In any case, I would hope that libevent's epoll backend would be more than fast enough for a mere ten terminals. epoll is plenty fast for our purposes, assuming it's used well, but it's very hard to tell from libevent's codebase whether that's the case. Attempts to update the libevent version and to use edge-triggered epoll had no impact on performance, so I think we're still far from any theoretical epoll limits.
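For concreteness, here is roughly what the edge-triggered attempt looks like through libevent (a hedged sketch with my own names, not the code from this branch): the EV_ET flag requests EPOLLET semantics from the epoll backend, which shifts the burden onto the read callback to drain the socket until EAGAIN.

```cpp
#include <event2/event.h>
#include <unistd.h>
#include <cerrno>

// With edge-triggered epoll, a readable notification fires when new data
// arrives, so the callback must keep reading until the kernel reports
// EAGAIN/EWOULDBLOCK rather than relying on being woken again for leftovers.
static void OnReadable(evutil_socket_t fd, short /*what*/, void * /*arg*/) {
  char buf[4096];
  for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
      // ...hand buf[0..n) to the protocol interpreter...
      continue;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;  // drained
    break;  // EOF or hard error; connection teardown would happen elsewhere
  }
}

// EV_ET asks libevent's epoll backend for edge-triggered (EPOLLET) semantics.
event *MakeEdgeTriggeredReadEvent(event_base *base, evutil_socket_t fd) {
  return event_new(base, fd, EV_READ | EV_ET | EV_PERSIST, OnReadable, nullptr);
}
```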

Investigating loopback and TCP overhead

This is probably also a dead end, but less of a dead end than the libevent/epoll stuff.

I'd like to work on reducing our INET socket overhead, but at this point, that might not be doable without being a kernel engineer. Nothing I tried measurably reduced loopback and/or INET overhead, and I tried a lot. Still, I think it's worth digging some more.

One question I have is how much of the speedup comes from avoiding TCP, and how much comes from using a glorified pipe instead of loopback. This would be interesting to know, but I have no idea how I'd measure it, and I'm not convinced I'd be able to make anything useful of the answer.
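One crude way to at least approximate the split (purely a sketch on my part, with a hypothetical helper name) would be a one-byte ping-pong micro-benchmark over an already-connected descriptor: run it against loopback TCP (with and without TCP_NODELAY) and against a Unix domain socket, and compare per-round-trip cost. It assumes the peer echoes each byte straight back.

```cpp
#include <unistd.h>
#include <chrono>

// Time `iterations` one-byte ping-pong round trips over a connected socket fd.
// Assumes the peer echoes every byte back. Returns the average round-trip time
// in microseconds, or -1.0 on error. (Hypothetical helper, not part of this PR.)
double MeasureRoundTripMicros(int fd, int iterations = 100000) {
  char byte = 'x';
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    if (write(fd, &byte, 1) != 1) return -1.0;
    if (read(fd, &byte, 1) != 1) return -1.0;
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}
```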

If I don't think this should be merged, then why did I make this giant PR?

Good question. Four reasons:

  1. I think that there is potential for good discussion from the findings here.

  2. I've spent way too much time investigating off CPU time and don't want it to go to waste. I want the information I've collected to be accessible to others. I needed to write this all down anyway. It only made sense to clean up my notes, put them in markdown, and share them.

  3. While Unix domain sockets aren't particularly practical, they do demonstrate a serious source of overhead in a way that anyone can verify--we no longer need to speculate, for example, about whether (or to what extent) loopback overhead might be costing us. Hence, as an academic exercise, I found writing this PR extremely useful for gaining deeper insight into the performance characteristics of the system.

  4. By sharing the findings from my investigation in this PR, I'm hoping that someone smart might realize a way to claw back some of the overhead losses.

Implements a Postgres-compatible Unix domain socket protocol. This time, without using libev to try to port everything over from epoll to io_uring.
You can now change the Unix domain socket settings.
@gonzalezjo gonzalezjo added question/discussion Further information is requested. performance Performance related issues or changes. do-not-merge This PR is either not intended to merge, or did not pass review. labels Aug 22, 2020
@gonzalezjo gonzalezjo force-pushed the unix_domain_sockets branch from 2d5af11 to 91324ad Compare August 24, 2020 17:17
@gonzalezjo
Collaborator Author

Further updates: epoll overhead is real and fairly significant--I think it's responsible for an avoidable ~5-10% loss in localhost throughput--but designing around it is extremely painful. And of course, the alternatives have overhead of their own.

From investigations of macOS vs. Linux (thanks, Matt), Postgres in a VM vs. Postgres on bare metal, and NoisePage in a VM vs. NoisePage on bare metal, it seems like our TPC-C throughput is very heavily constrained by networking and other kernel/OS-related overhead. While efforts like this PR can reduce that overhead substantially, a huge amount remains.

I believe that the extended query protocol, and to a lesser extent its implementation (through epoll), presents a major system performance bottleneck on benchmarks like TPC-C. While profiling, I have also been plagued by variance--in both the statistical and literal sense--in network stack performance across platforms, kernel versions, and even CPU microcode versions(!)

This makes me question whether network-I/O-intensive workloads such as TPC-C are a good fit for modeling and sanity-checking our performance data.

@apavlo apavlo requested a review from thepinetree August 28, 2020 20:57
@lmwnshn lmwnshn added the ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. label Aug 29, 2020
Contributor

@thepinetree thepinetree left a comment


A few small notes to address, then LGTM pending CI

Review comments (now resolved) on src/network/terrier_server.cpp and test/network/network_test.cpp.
@lmwnshn lmwnshn added in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. and removed ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. labels Sep 1, 2020
Adds Doxygen comments to ConnectionDispatcherTask::ConnectionDispatcherTask (connection_dispatcher_task.h) and NetworkLayer::NetworkLayer (db_main.h)
Adds documentation to the test case in order to explain how the test connects to the domain socket.
Addresses code review feedback. Removes an unnecessary memset that was left in from an early development stage. I never would've caught this, so thanks Deepayan!
Contributor

@thepinetree thepinetree left a comment


LGTM

@gonzalezjo gonzalezjo added ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. and removed do-not-merge This PR is either not intended to merge, or did not pass review. in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. labels Sep 4, 2020
@gonzalezjo
Collaborator Author

gonzalezjo commented Sep 4, 2020

So, I can confirm that Postgres is also affected by the same TCP issue. It gains a lot from using Unix domain sockets. Benched with postgres running on Optane (NVMe, not pmem):

jordig@dev5:~/oltpbench/oltpbench$ pgbench -c 10 -T 60 -j 10 -M extended -P 1 -h 127.0.0.1 -p 5433
Password: 
starting vacuum...end.
progress: 1.0 s, 12806.3 tps, lat 0.775 ms stddev 0.139
progress: 2.0 s, 13542.9 tps, lat 0.738 ms stddev 0.115
progress: 3.0 s, 13529.2 tps, lat 0.739 ms stddev 0.110
progress: 4.0 s, 13707.6 tps, lat 0.730 ms stddev 0.127
progress: 5.0 s, 13835.9 tps, lat 0.723 ms stddev 0.119
progress: 6.0 s, 13776.8 tps, lat 0.726 ms stddev 0.112
progress: 7.0 s, 13784.6 tps, lat 0.725 ms stddev 0.116
progress: 8.0 s, 13637.0 tps, lat 0.733 ms stddev 0.117
progress: 9.0 s, 13154.1 tps, lat 0.760 ms stddev 0.189

^C

jordig@dev5:~/oltpbench/oltpbench$ pgbench -c 10 -T 60 -j 10 -M extended -P 1
starting vacuum...end.
progress: 1.0 s, 17512.7 tps, lat 0.567 ms stddev 0.099
progress: 2.0 s, 17673.7 tps, lat 0.566 ms stddev 0.090
progress: 3.0 s, 17570.8 tps, lat 0.569 ms stddev 0.095
progress: 4.0 s, 17414.5 tps, lat 0.574 ms stddev 0.097
progress: 5.0 s, 17401.7 tps, lat 0.575 ms stddev 0.098
progress: 6.0 s, 17128.6 tps, lat 0.584 ms stddev 0.098
progress: 7.0 s, 17420.6 tps, lat 0.574 ms stddev 0.095
progress: 8.0 s, 17277.9 tps, lat 0.579 ms stddev 0.097
progress: 9.0 s, 17086.5 tps, lat 0.585 ms stddev 0.096

Results are reproducible: Postgres regularly gains around 25-30% from using domain sockets (in the runs above, ~17,400 tps over the domain socket vs. ~13,700 tps over loopback TCP, roughly a 27% improvement).

@apavlo apavlo added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. labels Sep 4, 2020
Renames SimpleQueryTest to UnixDomainSocketTest.
Improves portability of pathname validation, particularly with regard to macOS. Thanks, @gengkev!
@mbutrovich
Contributor

Requested a review from @lmwnshn so he can take a quick pass to make sure it doesn't conflict with anything he's poking at in the network layer.

Contributor

@lmwnshn lmwnshn left a comment


I think error handling would be detectable if it were made an assert. Additionally, some changes to commenting would be nice. I am in the process of refactoring the network layer and will probably change stuff in terrier_server in a couple of days. Overall, LGTM though! Would be fine with merging as-is.

Review comments (resolved) on src/network/terrier_server.cpp, src/network/connection_dispatcher_task.cpp, and src/include/network/terrier_server.h.
@lmwnshn lmwnshn added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed ready-for-review This PR passes all checks and is ready to be reviewed. Mark PRs with this. labels Sep 13, 2020
Contributor

@lmwnshn lmwnshn left a comment


Found a bug while merging it into my own branch.

Review comment (outdated, resolved) on src/network/terrier_server.cpp.
@lmwnshn lmwnshn added in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. and removed ready-to-merge This PR is ready to be merged. Mark PRs with this. labels Sep 15, 2020
Keeps sun/sin initialization the same for now.
Contributor

@lmwnshn lmwnshn left a comment


LGTM.

@lmwnshn lmwnshn added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. labels Sep 15, 2020
@codecov

codecov bot commented Sep 16, 2020

Codecov Report

Merging #1109 into master will decrease coverage by 0.90%.
The diff coverage is 76.56%.


@@            Coverage Diff             @@
##           master    #1109      +/-   ##
==========================================
- Coverage   82.63%   81.72%   -0.91%     
==========================================
  Files         653      650       -3     
  Lines       46429    43784    -2645     
==========================================
- Hits        38366    35784    -2582     
+ Misses       8063     8000      -63     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| src/include/network/connection_dispatcher_task.h | 100.00% <ø> | (ø) |
| src/include/network/terrier_server.h | 100.00% <ø> | (ø) |
| src/network/network_io_wrapper.cpp | 75.92% <ø> | (+7.40%) ⬆️ |
| src/network/terrier_server.cpp | 76.47% <72.22%> | (-6.46%) ⬇️ |
| src/include/main/db_main.h | 94.04% <100.00%> | (+1.01%) ⬆️ |
| src/include/settings/settings_defs.h | 100.00% <100.00%> | (ø) |
| src/network/connection_dispatcher_task.cpp | 93.33% <100.00%> | (+0.47%) ⬆️ |
| src/optimizer/rules/unnesting_rules.cpp | 11.36% <0.00%> | (-63.02%) ⬇️ |
| src/include/optimizer/rules/unnesting_rules.h | 50.00% <0.00%> | (-50.00%) ⬇️ |
| src/optimizer/input_column_deriver.cpp | 56.84% <0.00%> | (-18.00%) ⬇️ |
| ... and 71 more | | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 70f7e9e...bef37c0.

@lmwnshn lmwnshn merged commit b68a4a8 into cmu-db:master Sep 16, 2020
xinzhu-cai pushed a commit to xinzhu-cai/terrier that referenced this pull request Sep 22, 2020
Co-authored-by: Andy Pavlo <pavlo@cs.brown.edu>
Co-authored-by: Wan Shen Lim <wanshen.lim@gmail.com>
@jkosh44 jkosh44 mentioned this pull request Oct 25, 2020
@preetansh preetansh mentioned this pull request Oct 29, 2020
@rabbit721 rabbit721 mentioned this pull request Nov 8, 2020
@tpan496 tpan496 mentioned this pull request Nov 14, 2020
@thepinetree thepinetree mentioned this pull request Jan 13, 2021