perf: investigate gRPC packeting optimizations #17370

Closed

petermattis opened this issue Aug 1, 2017 · 62 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-performance Perf of queries or internals. Solution not expected to change functional behavior. X-stale

@petermattis
Collaborator

petermattis commented Aug 1, 2017

gRPC creates unnecessary packets during unary response processing. An upstream PR (grpc/grpc-go#1373) to fix this was buggy and was rejected, so we're waiting on a real fix from the gRPC folks. But above and beyond reducing the number of packets per response to 1, we can look at reducing it below 1 by combining multiple responses into a single packet. Additionally, flushing the write buffer to the connection currently blocks other goroutines. We should investigate a pipelined approach in which the write buffer is filled and, when it needs to be flushed, a fresh buffer is swapped into place while the full buffer is written to the connection.
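
A minimal sketch of that pipelined approach (illustrative only, not gRPC code; a real version would also need an explicit flush path, frame ordering guarantees, and error propagation):

package pipeflush

import (
	"io"
	"sync"
)

// pipelinedWriter sketches the idea above: writers append to the active
// buffer under a mutex, and when it fills up a fresh buffer is swapped in
// while a background goroutine writes the full one to the connection, so
// callers rarely block on the network syscall itself.
type pipelinedWriter struct {
	mu   sync.Mutex
	buf  []byte
	full chan []byte // full buffers handed to the flusher goroutine
}

func newPipelinedWriter(conn io.Writer) *pipelinedWriter {
	w := &pipelinedWriter{
		buf:  make([]byte, 0, 16<<10),
		full: make(chan []byte, 1),
	}
	go func() {
		for b := range w.full {
			conn.Write(b) // error handling omitted in this sketch
		}
	}()
	return w
}

func (w *pipelinedWriter) Write(p []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.buf = append(w.buf, p...)
	if len(w.buf) >= 16<<10 {
		// Hand off the full buffer and swap in a fresh one. This blocks only
		// if the previous flush is still being written to the connection,
		// not on every flush.
		w.full <- w.buf
		w.buf = make([]byte, 0, 16<<10)
	}
}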

@petermattis petermattis added this to the 1.1 milestone Aug 1, 2017
@irfansharif
Contributor

Could you add more explicit details here? I'm not yet familiar with the code paths where this would have a measurable effect, or with how I'd go about measuring it. Exec latencies? RTT latencies?

I also recall some prior experiments you ran investigating this that reduced the 95th/99th-percentile latencies (?). If there's anything relevant from those, it'd be nice to add it here.

@petermattis
Collaborator Author

See https://github.com/grpc/grpc-go/blob/master/transport/http2_server.go#L816 and https://github.com/grpc/grpc-go/blob/master/transport/http2_client.go#L681. Reducing the number of packets (or perhaps the benefit comes from reducing the number of system calls) gives measurably lower latencies, which directly impacts performance. Also see grpc/grpc-go#1373, which demonstrated some of the benefits but had a deadlock scenario. What I'd like to explore here is even more aggressively reducing flushes when there are multiple goroutines sending concurrently. The tricky part is how to do this without introducing deadlock scenarios caused by the application-level flow control. But perhaps experimental code doesn't need to be perfect in order to show whether further investment is worthwhile.

@petermattis
Collaborator Author

In addition to the network latencies measured by Cockroach, I also look at the packets/sec numbers that the Prometheus node exporter provides. For a given queries/sec, lower packets/sec is usually better.
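
For a quick local sanity check without Prometheus, something like the following reads the transmit-packet counter straight from /proc/net/dev on Linux (a sketch; the interface name and sampling interval are up to the caller), and dividing the resulting rate by queries/sec gives packets per query:

package pktrate

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// txPackets returns the cumulative transmitted-packet count for iface, read
// from /proc/net/dev (the same counters the node exporter scrapes).
func txPackets(iface string) (uint64, error) {
	f, err := os.Open("/proc/net/dev")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := strings.TrimSpace(s.Text())
		if !strings.HasPrefix(line, iface+":") {
			continue
		}
		// Fields after "<iface>:" are 8 receive counters followed by 8
		// transmit counters; index 9 is transmitted packets.
		fields := strings.Fields(strings.TrimPrefix(line, iface+":"))
		if len(fields) < 10 {
			return 0, fmt.Errorf("unexpected /proc/net/dev line: %q", line)
		}
		return strconv.ParseUint(fields[9], 10, 64)
	}
	if err := s.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("interface %s not found", iface)
}

// txPacketsPerSec samples the counter twice and reports the rate; divide by
// queries/sec to estimate packets per query.
func txPacketsPerSec(iface string, interval time.Duration) (float64, error) {
	before, err := txPackets(iface)
	if err != nil {
		return 0, err
	}
	time.Sleep(interval)
	after, err := txPackets(iface)
	if err != nil {
		return 0, err
	}
	return float64(after-before) / interval.Seconds(), nil
}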

@irfansharif
Contributor

Simply aggregating the flushes for a given request into one:

___ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)
  +5.81%   +5.67%  -34.65%   -3.98%   -5.44%

Before:
[screenshot omitted]

After:
[screenshot omitted]

Note the reduced number of syscalls, now present only in WriteStatus{,Optimized}, which runs once for every unary RPC.
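
In simplified form, the change amounts to something like the following (illustrative types and names, not the actual gRPC code paths): header and data writes only fill the shared bufio.Writer, and the single flush, and with it the single write syscall, happens when the status is written.

package onesyscall

import (
	"bufio"
	"net"
	"sync"
)

// serverConn sketches the write side of a connection where each unary RPC
// results in exactly one flush: headers and data only fill the buffer, and
// the flush (the write syscall) happens when the status trailer goes out.
type serverConn struct {
	mu sync.Mutex
	bw *bufio.Writer
}

func newServerConn(c net.Conn) *serverConn {
	return &serverConn{bw: bufio.NewWriterSize(c, 32<<10)}
}

func (c *serverConn) writeHeaders(b []byte) error { return c.buffered(b) }
func (c *serverConn) writeData(b []byte) error    { return c.buffered(b) }

// buffered appends to the bufio.Writer without flushing; no syscall here
// unless the buffer itself overflows.
func (c *serverConn) buffered(b []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, err := c.bw.Write(b)
	return err
}

// writeStatus appends the trailer and performs the single flush for the RPC.
func (c *serverConn) writeStatus(trailer []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, err := c.bw.Write(trailer); err != nil {
		return err
	}
	return c.bw.Flush()
}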

@petermattis
Collaborator Author

This matches what I saw previously. It will be interesting to see what happens if we perform fewer than one syscall per RPC.

@irfansharif
Contributor

There's a bunch of other low-hanging fruit I'm picking through. Here Proto() does a deep (defensive) copy when it doesn't really need to (marshalling out to an io.Writer doesn't/shouldn't mutate the original proto struct). This deep copy took about as much time as the batched flush itself.

         .          .    876:
         .      910ms    877:   if p := st.Proto(); p != nil && len(p.Details) > 0 {
         .          .    878:           stBytes, err := proto.Marshal(p)
         .          .    879:           if err != nil {
         .          .    880:                   // TODO: return error instead, when callers are able to handle it.
         .          .    881:                   panic(err)
         .          .    882:           }
         .          .    883:
         .          .    884:           t.hEnc.WriteField(hpack.HeaderField{Name: "grpc-status-details-bin", Value: encodeBinHeader(stBytes)})
         .          .    885:   }
         .          .    886:
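
To put a number on the copy itself, a rough benchmark sketch (not from the profile above) that uses proto.Clone to stand in for the defensive copy Proto() makes before marshalling:

package statuscopy

import (
	"testing"

	"github.com/golang/protobuf/proto"
	anypb "github.com/golang/protobuf/ptypes/any"
	spb "google.golang.org/genproto/googleapis/rpc/status"
)

var st = &spb.Status{
	Code:    5,
	Message: "example status",
	Details: []*anypb.Any{{TypeUrl: "type.example.com/Example", Value: make([]byte, 64)}},
}

// BenchmarkMarshalDirect marshals the existing status proto as-is.
func BenchmarkMarshalDirect(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := proto.Marshal(st); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkMarshalAfterClone adds a defensive deep copy before marshalling,
// mimicking what calling Proto() does on this path.
func BenchmarkMarshalAfterClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := proto.Marshal(proto.Clone(st)); err != nil {
			b.Fatal(err)
		}
	}
}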

@irfansharif
Contributor

As an aside, gRPC makes for another example where increasing Go's minimum stack size helps considerably:
[screenshot omitted]

@irfansharif
Contributor

beyond reducing the number of packets per response to 1, we can look at reducing it below 1 by combining multiple responses into a single packet

It seems that this is already more or less the case. It's done via framer.{adjust}numWriters, flushing explicitly only when you're (roughly) the last writer. See here. Unless I'm grossly misunderstanding something, there's a potential starvation issue right there, although it's unlikely in practice: writers can keep coming in such that the adjustNumWriters({0,-1}) == {1,0} conditionals never become true, and therefore nothing ever gets flushed.
Still confused.
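
A stripped-down sketch of the pattern as I read it, with illustrative names rather than the real framer code, and the starvation window called out in a comment:

package flushsketch

import (
	"bufio"
	"io"
	"sync"
	"sync/atomic"
)

// sharedFramer sketches the "last writer flushes" pattern: every writer bumps
// a counter before buffering its frame, and only the goroutine whose
// decrement brings the counter back to zero issues the explicit flush.
type sharedFramer struct {
	mu         sync.Mutex
	bw         *bufio.Writer
	numWriters int32
}

func newSharedFramer(w io.Writer) *sharedFramer {
	return &sharedFramer{bw: bufio.NewWriter(w)}
}

func (f *sharedFramer) writeFrame(frame []byte) error {
	atomic.AddInt32(&f.numWriters, 1)

	f.mu.Lock()
	_, err := f.bw.Write(frame)
	f.mu.Unlock()

	// If another writer arrived in the meantime, leave the flush to it. This
	// is the potential starvation window: with writers continuously arriving,
	// the count may never reach zero here, and data only goes out when the
	// bufio.Writer fills up and flushes on its own.
	if atomic.AddInt32(&f.numWriters, -1) == 0 && err == nil {
		f.mu.Lock()
		err = f.bw.Flush()
		f.mu.Unlock()
	}
	return err
}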

@petermattis
Collaborator Author

When I looked, the adjustNumWriters stuff was intended to avoid some flushes, but grpc still always flushed in WriteStatus and after sending the body of a unary response. I think adjustNumWriters is at the wrong level and it would need to be called before the headers for a unary response are sent and after the status is written. Note that the underlying bufio writer has a limited size. Ideally, we never flush manually and let the bufio flush when it reaches capacity.

PS I see a lot of similarities between flushing the network connection and syncing the WAL.

@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 29, 2017
@irfansharif
Contributor

irfansharif commented Aug 31, 2017

heh, grpc/grpc-go#1498 goes some way towards having a dedicated flushing goroutine.

@petermattis
Collaborator Author

@irfansharif Yeah, I noticed that too.

@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Sep 27, 2017
@petermattis
Collaborator Author

TL;DR? gRPC has very bad scalability with concurrent RPCs. We need to figure out what is going on and either fix it or replace it. Cc @spencerkimball, @bdarnell, @tschottdorf, @a-robinson, @nvanbenschoten

I extended the pinger test program, which can be used to send simple "ping" requests and responses over either grpc or another protocol. The following shows grpc with 1000 concurrent client workers sending 200-byte requests and receiving 200-byte responses as fast as possible (note that these tests are running on Linux VMs on Azure):

~ ./pinger -c 200 -p 1000 -n 1 -d 10s -t grpc cockroach-denim-0001:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  40836.9     15.6     28.3     35.7     41.9     75.5

Here is the equivalent workload with the "x" (eXperimental) protocol:

~ ./pinger -c 200 -p 1000 -n 1 -d 10s -t x cockroach-denim-0001:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 335770.9    129.0      6.0      5.8      8.4     18.9

I don't have an adequate explanation for the performance difference. "x" is much simpler than grpc and I took some pains to avoid unnecessary flushes of network data. But even accounting for that and trying to mimic the grpc behavior, "x" is still much faster.

Note that 1000 concurrent client workers is extreme and the performance discrepancy doesn't scale down linearly. The following shows grpc scaling from 1 to 512 clients (in powers of 2):

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s    869.7      0.3      0.0      1.5      2.2      8.1
     10s   1838.5      0.7      0.0      1.6      2.2      7.6
     10s   4006.3      1.5      1.0      1.5      2.0     10.0
     10s   7969.3      3.0      0.8      1.6      2.2      8.9
     10s  15422.5      5.9      0.7      1.8      2.6     10.5
     10s  25620.2      9.8      2.1      2.4      3.4     12.1
     10s  32319.0     12.3      2.5      3.7      5.5     15.2
     10s  38490.1     14.7      2.5      6.3      9.4     28.3
     10s  49670.8     18.9      5.0     10.0     13.6     27.3
     10s  45189.6     17.2     12.1     18.9     25.2     44.0

And here is the same scaling with "x":

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s    953.4      0.4      0.0      1.4      2.0     10.5
     10s   2060.3      0.8      0.0      1.6      2.5     11.0
     10s   3959.5      1.5      0.9      1.6      2.4     10.5
     10s   8082.6      3.1      0.0      1.6      2.4      8.9
     10s  18925.1      7.3      0.7      1.3      2.2     10.5
     10s  42534.8     16.3      0.7      1.2      2.0      7.1
     10s  79104.9     30.4      0.8      1.3      2.1     11.0
     10s 138325.6     53.2      0.9      1.6      2.6     11.5
     10s 213606.0     82.1      1.0      2.1      3.5     50.3
     10s 279076.3    107.3      1.8      3.4      5.2    113.2

NB: I used 200-byte requests and responses because that is approximately the size of a ScanRequest and ScanResponse for kv read operations. Looking at top during these runs, I noticed that "x" used ~1/2 the CPU of grpc.

@tbg
Member

tbg commented Oct 4, 2017

Are you testing that against tip? I haven't checked our vendored grpc version, but I thought they had done some performance work (also doubt it'd catch up to X, though).

Are you using single connections for both protocols?

Is your pinger repo up to date?

@petermattis
Collaborator Author

Are you testing that against tip? I haven't checked our vendored grpc version, but I thought they had done some performance work (also doubt it'd catch up to X, though).

Yes, I'm testing against gRPC tip. The recent performance work provides a very small performance improvement.

Are you using single connections for both protocols?

Single connections for both protocols. There is a flag to allow using multiple connections, but gRPC doesn't show any benefit from doing that. Before doing this work on pinger, I whacked something into CockroachDB to allow using multiple gRPC connections and that also didn't show any benefit.

Is your pinger repo up to date?

It is now (just pushed).

@petermattis
Collaborator Author

Hmm, a small tweak to how "x" performs synchronization to more closely mimic grpc causes its performance to drop to almost exactly the same level as grpc...but only on my laptop. Testing between 2 linux boxes over a real network still shows "x" to significantly outperform grpc.

The synchronization difference is in how grpc and "x" notify the "write loop". In grpc (on tip), this looks like:

func loopyWriter(ctx context.Context, cbuf *controlBuffer, handler func(item) error) {
	for {
		select {
		case i := <-cbuf.get():
			cbuf.load()
			if err := handler(i); err != nil {
				return
			}
		case <-ctx.Done():
			return
		}
	hasData:
		for {
			select {
			case i := <-cbuf.get():
				cbuf.load()
				if err := handler(i); err != nil {
					return
				}
			case <-ctx.Done():
				return
			default:
				if err := handler(&flushIO{}); err != nil {
					return
				}
				break hasData
			}
		}
	}
}

Essentially, grpc is using channels for synchronization; cbuf is effectively an unbounded queue of items delivered through a channel. In comparison, "x" uses a mutex, a condition variable, and a slice. Pseudo-code looks like:

func (s *xServerConn) writeLoop() {
	for {
		s.Lock()
		for len(s.pending) == 0 {
			s.cond.Wait()
		}
		pending := s.pending
		s.pending = nil
		s.Unlock()

		// Write everything that accumulated while we were busy, then
		// flush once for the whole batch.
		for _, p := range pending {
			if err := handler(p); err != nil {
				log.Fatal(err)
			}
		}

		if err := handler(&flushIO{}); err != nil {
			log.Fatal(err)
		}
	}
}

Switching "x" to use controlBuffer resulted in a dramatic slowdown on my laptop, but only on my laptop.
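
For reference, this is roughly the shape of a channel-fed unbounded buffer with the get()/load() API that loopyWriter selects on (a reconstruction for illustration, not the actual grpc source). Every item handoff costs a mutex acquisition plus a channel operation, versus one cond-var wakeup per batch in the slice-based approach above:

package ctrlbuf

import "sync"

// item stands in for the control/data items the transport's write loop handles.
type item interface{}

// controlBuffer feeds items to the write loop through a 1-slot channel,
// spilling into an unbounded backlog when the channel is occupied.
type controlBuffer struct {
	c       chan item
	mu      sync.Mutex
	backlog []item
}

func newControlBuffer() *controlBuffer {
	return &controlBuffer{c: make(chan item, 1)}
}

// put delivers directly into the channel when it is empty and there is no
// backlog; otherwise it appends to the backlog.
func (b *controlBuffer) put(i item) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- i:
			return
		default:
		}
	}
	b.backlog = append(b.backlog, i)
}

// load is called by the consumer after it takes an item off the channel; it
// moves the next backlog entry (if any) into the channel.
func (b *controlBuffer) load() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) > 0 {
		select {
		case b.c <- b.backlog[0]:
			b.backlog[0] = nil
			b.backlog = b.backlog[1:]
		default:
		}
	}
}

// get returns the channel the write loop selects on.
func (b *controlBuffer) get() <-chan item {
	return b.c
}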

@petermattis
Collaborator Author

Update: I've been able to slow down "x" between 2 Linux machines through the combination of switching to a grpc-style controlBuffer and quotaPool:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  71760.4     27.6     15.7     22.0     27.3     44.0

This is still faster than grpc, but I'm likely just missing some additional quota stuff (grpc has transport and stream level quota as well as quota on the number of streams).

For reference, this is the performance I was seeing earlier with the cond-var based synchronization and no quota pool:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 335770.9    129.0      6.0      5.8      8.4     18.9
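
For reference, a minimal sketch of the kind of channel-based quota pool being discussed, where quota is acquired before each send and returned as window updates arrive. This is illustrative, assumes a single acquirer per pool, and is not grpc's actual implementation:

package quotasketch

import (
	"errors"
	"sync"
)

// quotaPool sketches channel-based flow-control quota: a sender acquires
// bytes before writing and window updates add quota back. Every acquire and
// add is a channel operation, which is part of the synchronization cost
// being measured here.
type quotaPool struct {
	mu sync.Mutex // serializes add so the 1-slot channel send never blocks
	c  chan int
}

func newQuotaPool(q int) *quotaPool {
	qp := &quotaPool{c: make(chan int, 1)}
	if q > 0 {
		qp.c <- q
	}
	return qp
}

// add returns n bytes of quota to the pool (e.g. on a window update).
func (qp *quotaPool) add(n int) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	select {
	case q := <-qp.c:
		qp.c <- q + n
	default:
		qp.c <- n
	}
}

// acquire blocks until n bytes of quota are available, assuming a single
// goroutine acquires from this pool at a time.
func (qp *quotaPool) acquire(n int, done <-chan struct{}) error {
	have := 0
	for {
		select {
		case q := <-qp.c:
			have += q
			if have >= n {
				if have > n {
					qp.add(have - n) // return the excess
				}
				return nil
			}
			// Keep what we have and wait for more quota to be added.
		case <-done:
			if have > 0 {
				qp.add(have)
			}
			return errors.New("cancelled while waiting for quota")
		}
	}
}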

@tbg
Member

tbg commented Oct 4, 2017

The quota pool corresponds to (stream and/or connection-level) window sizes, right? I saw that you have both at 65k initially. Does anything change if you up that to a lot more (especially the connection based one) or is the culprit just the internal overhead of having the quota pools in the first place?

@petermattis
Collaborator Author

I'm not sure what the culprit is. We're never getting close to the stream quota as we're sending 200 byte requests. We do seem to be hitting the connection quota at high concurrency, but alleviating the limit doesn't provide a significant performance boost:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  52903.9     20.2     16.3     28.3     35.7     58.7

I also tried disabling all of the quota code in grpc by commenting it out:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  58959.2     22.5     16.8     25.2     32.5     56.6

So clearly something else is still going on inside grpc that is limiting its performance.

@tbg
Member

tbg commented Oct 4, 2017

CPU profile isn't useful?

@petermattis
Collaborator Author

Heh, I'm looping through all my tools. A CPU profile didn't show anything interesting earlier; I'll look again.

@tbg
Member

tbg commented Oct 4, 2017

If that doesn't help, you could consider (and you probably have) strategic commenting out of parts of the quota handling code 🥇

@petermattis
Collaborator Author

If that doesn't help, you could consider (and you probably have) strategic commenting out of parts of the quota handling code 🥇

You might have missed above, but I already did exactly that.

@tbg
Member

tbg commented Oct 4, 2017

Oh, (completely) misread that as "I also set the window size to basically infinity".

@MakMukhi

Yeah that'll be a good start. If we get a good enough picture of the workload/scenario we can update pinger accordingly.
Thanks for the appreciation. I feel very optimistic about gRPC's future.

a-robinson pushed a commit to a-robinson/cockroach that referenced this issue Apr 18, 2018
The cRPC protocol is built on top of http2 using only http2 data
frames. The core read and write loops are similar to gRPC, but trimmed
down to their essence. A ton of stuff is missing before this could
actually be used, such as context cancellation propagation, trace
propagation, chunking of large requests/responses into frames, and
proper error handling. This PR was done as a proof of concept to see
what performance impact replacing gRPC with something else could
achieve.

On a single-node cluster with the local server optimization disabled,
gRPC on a read-mostly workload shows:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0         111904        11189.4      1.4      1.2      3.0      5.8     18.9

cRPC:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0         183923        18389.3      0.9      0.7      1.8      3.7     62.9

Unfortunately, the effects on a multi-node cluster are less dramatic.

I think it would be possible to make cRPC wire compatible with
gRPC. This would require sending a headers frame with every RPC, but the
headers are small (7 bytes) and constant for this use case. We'd also
have to send a settings frame as the first frame on the connection, but
that is trivial.

See cockroachdb#17370.
@nvanbenschoten nvanbenschoten added the A-kv-client Relating to the KV client and the KV interface. label Apr 24, 2018
craig bot pushed a commit that referenced this issue May 17, 2018
24883: dep: Bump grpc-go version r=a-robinson a=a-robinson

And pull in new packages -- in particular, the encoding/proto package
isn't needed for compilation but is needed at runtime.

Release note: None

---------------------

We should wait to merge this until they've cut the 1.12 release tag, so that we aren't just at an arbitrary commit in their git history, but I'm sending out the PR now so that I (or whoever would have done this) don't have to deal with debugging the missing encoding/proto package when it comes time to merge this.

As tested in #17370 (comment), this gives a 5-10% boost in whole-cluster throughput and improved tail latencies when run with a highly concurrent workload. It appears to have little performance effect for lower-concurrency workloads.

25410:  sql: run schema changes after CREATE TABLE in txn r=vivekmenezes a=vivekmenezes

Includes a commit from #25362 and should be reviewed after that change.

25612: util: fix `retry.WithMaxAttempts` context cancelled before run. r=windchan7 a=windchan7

If the context gets cancelled right after `retry.WithMaxAttempts` starts, the function
passed to it will never get run. Now `retry.WithMaxAttempts` will run the function
at least once; otherwise an error will be returned.

Making this change because there are places, such as `show_cluster_setting.go`, that
require the passed-in function to be run at least once. Otherwise there will
be a segfault.

Fixes: #25600. 
Fixes: #25603.
Fixes: #25570. 
Fixes: #25567.
Fixes: #25566.
Fixes: #25511.
Fixes: #25485.
Release note: None

25625: storage: Adding testing knob to disable automatic lease renewals r=a-robinson a=a-robinson

In order to fix the test flakes caused by automatic lease renewals

Fixes #25537
Fixes #25540
Fixes #25568
Fixes #25573
Fixes #25576
Fixes #25589
Fixes #25594
Fixes #25599
Fixes #25605
Fixes #25620

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
Co-authored-by: Victor Chen <victor@cockroachlabs.com>
@petermattis petermattis modified the milestones: 2.1, 2.2 Jul 19, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@petermattis petermattis removed their assignment Oct 6, 2018
@nvanbenschoten nvanbenschoten added A-kv-client Relating to the KV client and the KV interface. and removed A-coreperf labels Oct 16, 2018
@a-robinson a-robinson assigned petermattis and unassigned a-robinson Jan 11, 2019