perf: investigate gRPC packeting optimizations #17370

Closed

petermattis opened this issue Aug 1, 2017 · 62 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-performance Perf of queries or internals. Solution not expected to change functional behavior. X-stale

@petermattis
Collaborator

petermattis commented Aug 1, 2017

gRPC creates unnecessary packets during unary response processing. An upstream PR (grpc/grpc-go#1373) to fix this was buggy and was rejected, so we're waiting on a real fix from the gRPC folks. But above and beyond reducing the number of packets per response to 1, we can look at reducing it below 1 by combining multiple responses into a single packet. Additionally, flushing the write buffer to the connection currently blocks other goroutines. We should investigate a pipelined approach in which the write buffer is filled and, when it needs to be flushed, a fresh buffer is swapped into place while the full buffer is written to the connection.
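
A minimal sketch of that pipelined approach (illustrative only, not gRPC code; a real version would also need an explicit flush path, frame ordering guarantees, and error propagation):

package pipeflush

import (
	"io"
	"sync"
)

// pipelinedWriter sketches the idea above: writers append to the active
// buffer under a mutex, and when it fills up a fresh buffer is swapped in
// while a background goroutine writes the full one to the connection, so
// callers rarely block on the network syscall itself.
type pipelinedWriter struct {
	mu   sync.Mutex
	buf  []byte
	full chan []byte // full buffers handed to the flusher goroutine
}

func newPipelinedWriter(conn io.Writer) *pipelinedWriter {
	w := &pipelinedWriter{
		buf:  make([]byte, 0, 16<<10),
		full: make(chan []byte, 1),
	}
	go func() {
		for b := range w.full {
			conn.Write(b) // error handling omitted in this sketch
		}
	}()
	return w
}

func (w *pipelinedWriter) Write(p []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.buf = append(w.buf, p...)
	if len(w.buf) >= 16<<10 {
		// Hand off the full buffer and swap in a fresh one. This blocks only
		// if the previous flush is still being written to the connection,
		// not on every flush.
		w.full <- w.buf
		w.buf = make([]byte, 0, 16<<10)
	}
}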

@petermattis petermattis added this to the 1.1 milestone Aug 1, 2017
@irfansharif
Contributor

Could you add more explicit details here? I'm not yet familiar with the code paths where this would have a measurable effect, or with how I'd go about measuring it. Exec latencies? RTT latencies?

I also recall some prior experiments you ran investigating this that reduced the 95th/99th-percentile latencies (?). If there's anything relevant from those, it'd be nice to add it here.

@petermattis
Collaborator Author

See https://github.com/grpc/grpc-go/blob/master/transport/http2_server.go#L816 and https://github.com/grpc/grpc-go/blob/master/transport/http2_client.go#L681. Reducing the number of packets (or perhaps the benefit comes from reducing the number of system calls) gives measurably lower latencies, which directly impacts performance. Also see grpc/grpc-go#1373, which demonstrated some of the benefits but had a deadlock scenario. What I'd like to explore here is even more aggressively reducing flushes when there are multiple goroutines sending concurrently. The tricky part is how to do this without introducing deadlock scenarios caused by the application-level flow control. But perhaps experimental code doesn't need to be perfect in order to show whether further investment is worthwhile.

@petermattis
Collaborator Author

In addition to the network latencies measured by Cockroach, I also look at the packets/sec numbers that the Prometheus node exporter provides. For a given queries/sec, lower packets/sec is usually better.
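
For a quick local sanity check without Prometheus, something like the following reads the transmit-packet counter straight from /proc/net/dev on Linux (a sketch; the interface name and sampling interval are up to the caller), and dividing the resulting rate by queries/sec gives packets per query:

package pktrate

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// txPackets returns the cumulative transmitted-packet count for iface, read
// from /proc/net/dev (the same counters the node exporter scrapes).
func txPackets(iface string) (uint64, error) {
	f, err := os.Open("/proc/net/dev")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := strings.TrimSpace(s.Text())
		if !strings.HasPrefix(line, iface+":") {
			continue
		}
		// Fields after "<iface>:" are 8 receive counters followed by 8
		// transmit counters; index 9 is transmitted packets.
		fields := strings.Fields(strings.TrimPrefix(line, iface+":"))
		if len(fields) < 10 {
			return 0, fmt.Errorf("unexpected /proc/net/dev line: %q", line)
		}
		return strconv.ParseUint(fields[9], 10, 64)
	}
	if err := s.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("interface %s not found", iface)
}

// txPacketsPerSec samples the counter twice and reports the rate; divide by
// queries/sec to estimate packets per query.
func txPacketsPerSec(iface string, interval time.Duration) (float64, error) {
	before, err := txPackets(iface)
	if err != nil {
		return 0, err
	}
	time.Sleep(interval)
	after, err := txPackets(iface)
	if err != nil {
		return 0, err
	}
	return float64(after-before) / interval.Seconds(), nil
}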

@irfansharif
Contributor

Simply aggregating the flushes for a given request into one:

___ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)
  +5.81%   +5.67%  -34.65%   -3.98%   -5.44%

Before:
[screenshot omitted]

After:
[screenshot omitted]

Note the reduced number of syscalls, now present only in WriteStatus{,Optimized}, which runs once for every unary RPC.
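
In simplified form, the change amounts to something like the following (illustrative types and names, not the actual gRPC code paths): header and data writes only fill the shared bufio.Writer, and the single flush, and with it the single write syscall, happens when the status is written.

package onesyscall

import (
	"bufio"
	"net"
	"sync"
)

// serverConn sketches the write side of a connection where each unary RPC
// results in exactly one flush: headers and data only fill the buffer, and
// the flush (the write syscall) happens when the status trailer goes out.
type serverConn struct {
	mu sync.Mutex
	bw *bufio.Writer
}

func newServerConn(c net.Conn) *serverConn {
	return &serverConn{bw: bufio.NewWriterSize(c, 32<<10)}
}

func (c *serverConn) writeHeaders(b []byte) error { return c.buffered(b) }
func (c *serverConn) writeData(b []byte) error    { return c.buffered(b) }

// buffered appends to the bufio.Writer without flushing; no syscall here
// unless the buffer itself overflows.
func (c *serverConn) buffered(b []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, err := c.bw.Write(b)
	return err
}

// writeStatus appends the trailer and performs the single flush for the RPC.
func (c *serverConn) writeStatus(trailer []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, err := c.bw.Write(trailer); err != nil {
		return err
	}
	return c.bw.Flush()
}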

@petermattis
Collaborator Author

This matches what I saw previously. It will be interesting to see what happens if we perform fewer than one syscall per RPC.

@irfansharif
Contributor

There's a bunch of other low-hanging fruit I'm picking through. Here Proto() does a deep (defensive) copy when it doesn't really need to (marshalling out to an io.Writer doesn't/shouldn't mutate the original proto struct). This deep copy took about as much time as the batched flush itself.

         .          .    876:
         .      910ms    877:   if p := st.Proto(); p != nil && len(p.Details) > 0 {
         .          .    878:           stBytes, err := proto.Marshal(p)
         .          .    879:           if err != nil {
         .          .    880:                   // TODO: return error instead, when callers are able to handle it.
         .          .    881:                   panic(err)
         .          .    882:           }
         .          .    883:
         .          .    884:           t.hEnc.WriteField(hpack.HeaderField{Name: "grpc-status-details-bin", Value: encodeBinHeader(stBytes)})
         .          .    885:   }
         .          .    886:
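
To put a number on the copy itself, a rough benchmark sketch (not from the profile above) that uses proto.Clone to stand in for the defensive copy Proto() makes before marshalling:

package statuscopy

import (
	"testing"

	"github.com/golang/protobuf/proto"
	anypb "github.com/golang/protobuf/ptypes/any"
	spb "google.golang.org/genproto/googleapis/rpc/status"
)

var st = &spb.Status{
	Code:    5,
	Message: "example status",
	Details: []*anypb.Any{{TypeUrl: "type.example.com/Example", Value: make([]byte, 64)}},
}

// BenchmarkMarshalDirect marshals the existing status proto as-is.
func BenchmarkMarshalDirect(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := proto.Marshal(st); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkMarshalAfterClone adds a defensive deep copy before marshalling,
// mimicking what calling Proto() does on this path.
func BenchmarkMarshalAfterClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := proto.Marshal(proto.Clone(st)); err != nil {
			b.Fatal(err)
		}
	}
}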

@irfansharif
Contributor

As an aside, gRPC makes for another example where increasing Go's minimum stack size helps considerably:
[screenshot omitted]

@irfansharif
Contributor

beyond reducing the number of packets per response to 1, we can look at reducing it below 1 by combining multiple responses into a single packet

It seems that this is already more or less the case. It's done via framer.{adjust}numWriters, flushing explicitly only when you're (roughly) the last writer. See here. Unless I'm grossly misunderstanding something, there's a potential starvation issue right there, although it's unlikely in practice: writers can keep coming in such that the adjustNumWriters({0,-1}) == {1,0} conditionals never become true, and therefore nothing ever gets flushed.
Still confused.
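
A stripped-down sketch of the pattern as I read it, with illustrative names rather than the real framer code, and the starvation window called out in a comment:

package flushsketch

import (
	"bufio"
	"io"
	"sync"
	"sync/atomic"
)

// sharedFramer sketches the "last writer flushes" pattern: every writer bumps
// a counter before buffering its frame, and only the goroutine whose
// decrement brings the counter back to zero issues the explicit flush.
type sharedFramer struct {
	mu         sync.Mutex
	bw         *bufio.Writer
	numWriters int32
}

func newSharedFramer(w io.Writer) *sharedFramer {
	return &sharedFramer{bw: bufio.NewWriter(w)}
}

func (f *sharedFramer) writeFrame(frame []byte) error {
	atomic.AddInt32(&f.numWriters, 1)

	f.mu.Lock()
	_, err := f.bw.Write(frame)
	f.mu.Unlock()

	// If another writer arrived in the meantime, leave the flush to it. This
	// is the potential starvation window: with writers continuously arriving,
	// the count may never reach zero here, and data only goes out when the
	// bufio.Writer fills up and flushes on its own.
	if atomic.AddInt32(&f.numWriters, -1) == 0 && err == nil {
		f.mu.Lock()
		err = f.bw.Flush()
		f.mu.Unlock()
	}
	return err
}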

@petermattis
Collaborator Author

When I looked, the adjustNumWriters stuff was intended to avoid some flushes, but grpc still always flushed in WriteStatus and after sending the body of a unary response. I think adjustNumWriters is at the wrong level and it would need to be called before the headers for a unary response are sent and after the status is written. Note that the underlying bufio writer has a limited size. Ideally, we never flush manually and let the bufio flush when it reaches capacity.

PS I see a lot of similarities between flushing the network connection and syncing the WAL.

@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 29, 2017
@irfansharif
Contributor

irfansharif commented Aug 31, 2017

heh, grpc/grpc-go#1498 goes some way towards having a dedicated flushing goroutine.

@petermattis
Collaborator Author

@irfansharif Yeah, I noticed that too.

@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Sep 27, 2017
@petermattis
Collaborator Author

TL;DR? gRPC has very bad scalability with concurrent RPCs. We need to figure out what is going on and either fix it or replace it. Cc @spencerkimball, @bdarnell, @tschottdorf, @a-robinson, @nvanbenschoten

I extended the pinger test program, which can be used to send simple "ping" requests and responses over either grpc or another protocol. The following shows grpc with 1000 concurrent client workers sending 200-byte requests and receiving 200-byte responses as fast as possible (note that these tests are running on Linux VMs on Azure):

~ ./pinger -c 200 -p 1000 -n 1 -d 10s -t grpc cockroach-denim-0001:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  40836.9     15.6     28.3     35.7     41.9     75.5

Here is the equivalent workload with the "x" (eXperimental) protocol:

~ ./pinger -c 200 -p 1000 -n 1 -d 10s -t x cockroach-denim-0001:50051
...
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 335770.9    129.0      6.0      5.8      8.4     18.9

I don't have an adequate explanation for the performance difference. "x" is much simpler than grpc and I took some pains to avoid unnecessary flushes of network data. But even accounting for that and trying to mimic the grpc behavior, "x" is still much faster.

Note that 1000 concurrent client workers is extreme and the performance discrepancy doesn't scale down linearly. The following shows grpc scaling from 1 to 512 clients (in powers of 2):

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s    869.7      0.3      0.0      1.5      2.2      8.1
     10s   1838.5      0.7      0.0      1.6      2.2      7.6
     10s   4006.3      1.5      1.0      1.5      2.0     10.0
     10s   7969.3      3.0      0.8      1.6      2.2      8.9
     10s  15422.5      5.9      0.7      1.8      2.6     10.5
     10s  25620.2      9.8      2.1      2.4      3.4     12.1
     10s  32319.0     12.3      2.5      3.7      5.5     15.2
     10s  38490.1     14.7      2.5      6.3      9.4     28.3
     10s  49670.8     18.9      5.0     10.0     13.6     27.3
     10s  45189.6     17.2     12.1     18.9     25.2     44.0

And here is the same scaling with "x":

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s    953.4      0.4      0.0      1.4      2.0     10.5
     10s   2060.3      0.8      0.0      1.6      2.5     11.0
     10s   3959.5      1.5      0.9      1.6      2.4     10.5
     10s   8082.6      3.1      0.0      1.6      2.4      8.9
     10s  18925.1      7.3      0.7      1.3      2.2     10.5
     10s  42534.8     16.3      0.7      1.2      2.0      7.1
     10s  79104.9     30.4      0.8      1.3      2.1     11.0
     10s 138325.6     53.2      0.9      1.6      2.6     11.5
     10s 213606.0     82.1      1.0      2.1      3.5     50.3
     10s 279076.3    107.3      1.8      3.4      5.2    113.2

NB: I used 200-byte requests and responses because that is approximately the size of a ScanRequest and ScanResponse for kv read operations. Looking at top during these runs, I noticed that "x" used ~1/2 the CPU of grpc.

@tbg
Member

tbg commented Oct 4, 2017

Are you testing that against tip? I haven't checked our vendored grpc version, but I thought they had done some performance work (also doubt it'd catch up to X, though).

Are you using single connections for both protocols?

Is your pinger repo up to date?

@petermattis
Collaborator Author

Are you testing that against tip? I haven't checked our vendored grpc version, but I thought they had done some performance work (also doubt it'd catch up to X, though).

Yes, I'm testing against gRPC tip. The recent performance work provides a very small performance improvement.

Are you using single connections for both protocols?

Single connections for both protocols. There is a flag to allow using multiple connections, but gRPC doesn't show any benefit from doing that. Before doing this work on pinger, I whacked something into CockroachDB to allow using multiple gRPC connections and that also didn't show any benefit.

Is your pinger repo up to date?

It is now (just pushed).

@petermattis
Collaborator Author

Hmm, a small tweak to how "x" performs synchronization to more closely mimic grpc causes its performance to drop to almost exactly the same level as grpc...but only on my laptop. Testing between 2 linux boxes over a real network still shows "x" to significantly outperform grpc.

The synchronization difference is in how grpc and "x" notify the "write loop". In grpc (on tip), this looks like:

func loopyWriter(ctx context.Context, cbuf *controlBuffer, handler func(item) error) {
	for {
		select {
		case i := <-cbuf.get():
			cbuf.load()
			if err := handler(i); err != nil {
				return
			}
		case <-ctx.Done():
			return
		}
	hasData:
		for {
			select {
			case i := <-cbuf.get():
				cbuf.load()
				if err := handler(i); err != nil {
					return
				}
			case <-ctx.Done():
				return
			default:
				if err := handler(&flushIO{}); err != nil {
					return
				}
				break hasData
			}
		}
	}
}

Essentially, grpc is using channels for synchronization; cbuf is effectively an unbounded queue of items delivered through a channel. In comparison, "x" uses a mutex, a condition variable, and a slice. Pseudo-code looks like:

func (s *xServerConn) writeLoop() {
	for {
		s.Lock()
		for len(s.pending) == 0 {
			s.cond.Wait()
		}
		pending := s.pending
		s.pending = nil
		s.Unlock()

		// Write everything that accumulated while we were busy, then
		// flush once for the whole batch.
		for _, p := range pending {
			if err := handler(p); err != nil {
				log.Fatal(err)
			}
		}

		if err := handler(&flushIO{}); err != nil {
			log.Fatal(err)
		}
	}
}

Switching "x" to use controlBuffer resulted in a dramatic slowdown on my laptop, but only on my laptop.
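
For reference, this is roughly the shape of a channel-fed unbounded buffer with the get()/load() API that loopyWriter selects on (a reconstruction for illustration, not the actual grpc source). Every item handoff costs a mutex acquisition plus a channel operation, versus one cond-var wakeup per batch in the slice-based approach above:

package ctrlbuf

import "sync"

// item stands in for the control/data items the transport's write loop handles.
type item interface{}

// controlBuffer feeds items to the write loop through a 1-slot channel,
// spilling into an unbounded backlog when the channel is occupied.
type controlBuffer struct {
	c       chan item
	mu      sync.Mutex
	backlog []item
}

func newControlBuffer() *controlBuffer {
	return &controlBuffer{c: make(chan item, 1)}
}

// put delivers directly into the channel when it is empty and there is no
// backlog; otherwise it appends to the backlog.
func (b *controlBuffer) put(i item) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- i:
			return
		default:
		}
	}
	b.backlog = append(b.backlog, i)
}

// load is called by the consumer after it takes an item off the channel; it
// moves the next backlog entry (if any) into the channel.
func (b *controlBuffer) load() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) > 0 {
		select {
		case b.c <- b.backlog[0]:
			b.backlog[0] = nil
			b.backlog = b.backlog[1:]
		default:
		}
	}
}

// get returns the channel the write loop selects on.
func (b *controlBuffer) get() <-chan item {
	return b.c
}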

@petermattis
Collaborator Author

Update: I've been able to slow down "x" between 2 Linux machines through the combination of switching to a grpc-style controlBuffer and quotaPool:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  71760.4     27.6     15.7     22.0     27.3     44.0

This is still faster than grpc, but I'm likely just missing some additional quota stuff (grpc has transport and stream level quota as well as quota on the number of streams).

For reference, this is the performance I was seeing earlier with the cond-var based synchronization and no quota pool:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s 335770.9    129.0      6.0      5.8      8.4     18.9
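
For reference, a minimal sketch of the kind of channel-based quota pool being discussed, where quota is acquired before each send and returned as window updates arrive. This is illustrative, assumes a single acquirer per pool, and is not grpc's actual implementation:

package quotasketch

import (
	"errors"
	"sync"
)

// quotaPool sketches channel-based flow-control quota: a sender acquires
// bytes before writing and window updates add quota back. Every acquire and
// add is a channel operation, which is part of the synchronization cost
// being measured here.
type quotaPool struct {
	mu sync.Mutex // serializes add so the 1-slot channel send never blocks
	c  chan int
}

func newQuotaPool(q int) *quotaPool {
	qp := &quotaPool{c: make(chan int, 1)}
	if q > 0 {
		qp.c <- q
	}
	return qp
}

// add returns n bytes of quota to the pool (e.g. on a window update).
func (qp *quotaPool) add(n int) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	select {
	case q := <-qp.c:
		qp.c <- q + n
	default:
		qp.c <- n
	}
}

// acquire blocks until n bytes of quota are available, assuming a single
// goroutine acquires from this pool at a time.
func (qp *quotaPool) acquire(n int, done <-chan struct{}) error {
	have := 0
	for {
		select {
		case q := <-qp.c:
			have += q
			if have >= n {
				if have > n {
					qp.add(have - n) // return the excess
				}
				return nil
			}
			// Keep what we have and wait for more quota to be added.
		case <-done:
			if have > 0 {
				qp.add(have)
			}
			return errors.New("cancelled while waiting for quota")
		}
	}
}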

@tbg
Member

tbg commented Oct 4, 2017

The quota pool corresponds to (stream and/or connection-level) window sizes, right? I saw that you have both at 65k initially. Does anything change if you up that to a lot more (especially the connection based one) or is the culprit just the internal overhead of having the quota pools in the first place?

@petermattis
Collaborator Author

I'm not sure what the culprit is. We're never getting close to the stream quota as we're sending 200 byte requests. We do seem to be hitting the connection quota at high concurrency, but alleviating the limit doesn't provide a significant performance boost:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  52903.9     20.2     16.3     28.3     35.7     58.7

I also tried disabling all of the quota code in grpc by commenting it out:

_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  58959.2     22.5     16.8     25.2     32.5     56.6

So clearly something else is still going on inside grpc that is limiting its performance.

@tbg
Member

tbg commented Oct 4, 2017

CPU profile isn't useful?

@petermattis
Collaborator Author

Heh, I'm looping through all my tools. A CPU profile didn't show anything interesting earlier; I'll look again.

@tbg
Member

tbg commented Oct 4, 2017

If that doesn't help, you could consider (and you probably have) strategic commenting out of parts of the quota handling code 🥇

@petermattis
Collaborator Author

If that doesn't help, you could consider (and you probably have) strategic commenting out of parts of the quota handling code 🥇

You might have missed above, but I already did exactly that.

@tbg
Member

tbg commented Oct 4, 2017

Oh, (completely) misread that as "I also set the window size to basically infinity".

@MakMukhi

Yeah that'll be a good start. If we get a good enough picture of the workload/scenario we can update pinger accordingly.
Thanks for the appreciation. I feel very optimistic about gRPC's future.

a-robinson pushed a commit to a-robinson/cockroach that referenced this issue Apr 18, 2018
The cRPC protocol is built on top of http2 using only http2 data
frames. The core read and write loops are similar to gRPC, but trimmed
down to their essence. A ton of stuff is missing before this could
actually be used, such as context cancellation propagation, trace
propagation, chunking of large requests/responses into frames, and
proper error handling. This PR was done as a proof of concept to see
what performance impact replacing gRPC with something else could
achieve.

On a single-node cluster with the local server optimization disabled,
gRPC on a read-mostly workload shows:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0         111904        11189.4      1.4      1.2      3.0      5.8     18.9

cRPC:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0         183923        18389.3      0.9      0.7      1.8      3.7     62.9

Unfortunately, the effects on a multi-node cluster are less dramatic.

I think it would be possible to make cRPC wire compatible with
gRPC. This would require sending a headers frame with every RPC, but the
headers are small (7 bytes) and constant for this use case. We'd also
have to send a settings frame as the first frame on the connection, but
that is trivial.

See cockroachdb#17370.
@nvanbenschoten nvanbenschoten added the A-kv-client Relating to the KV client and the KV interface. label Apr 24, 2018
craig bot pushed a commit that referenced this issue May 17, 2018
24883: dep: Bump grpc-go version r=a-robinson a=a-robinson

And pull in new packages -- in particular, the encoding/proto package
isn't needed for compilation but is needed at runtime.

Release note: None

---------------------

We should wait to merge this until they've cut the 1.12 release tag, so that we aren't just at an arbitrary commit in their git history, but I'm sending out the PR now so that I (or whoever would have done this) don't have to deal with debugging the missing encoding/proto package when it comes time to merge this.

As tested in #17370 (comment), this gives a 5-10% boost in whole-cluster throughput and improved tail latencies when run with a highly concurrent workload. It appears to have little performance effect for lower-concurrency workloads.

25410:  sql: run schema changes after CREATE TABLE in txn r=vivekmenezes a=vivekmenezes

Includes a commit from #25362 and should be reviewed after that change.

25612: util: fix `retry.WithMaxAttempts` context cancelled before run. r=windchan7 a=windchan7

If the context gets cancelled right after `retry.WithMaxAttempts` starts, the function
passed to it will never get run. Now `retry.WithMaxAttempts` will run the function
at least once; otherwise an error will be returned.

Making this change because there are places, such as `show_cluster_setting.go`, that
require the passed-in function to be run at least once. Otherwise there will
be a segfault.

Fixes: #25600. 
Fixes: #25603.
Fixes: #25570. 
Fixes: #25567.
Fixes: #25566.
Fixes: #25511.
Fixes: #25485.
Release note: None

25625: storage: Adding testing knob to disable automatic lease renewals r=a-robinson a=a-robinson

In order to fix the test flakes caused by automatic lease renewals

Fixes #25537
Fixes #25540
Fixes #25568
Fixes #25573
Fixes #25576
Fixes #25589
Fixes #25594
Fixes #25599
Fixes #25605
Fixes #25620

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
Co-authored-by: Victor Chen <victor@cockroachlabs.com>
@petermattis petermattis modified the milestones: 2.1, 2.2 Jul 19, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@petermattis petermattis removed their assignment Oct 6, 2018
@nvanbenschoten nvanbenschoten added A-kv-client Relating to the KV client and the KV interface. and removed A-coreperf labels Oct 16, 2018
@a-robinson a-robinson assigned petermattis and unassigned a-robinson Jan 11, 2019