transport: refactor to reduce lock contention and improve performance #1962

Merged: 1 commit merged into grpc:master from the sdlr_oss branch on Apr 5, 2018

Conversation

@MakMukhi (Contributor) commented Apr 2, 2018

This change frees stream (RPC) goroutines from the onus of checking transport-level quotas, such as flow control, and removes the global locks. Streams instead schedule their messages and headers by adding them to controlBuf, which is read by a dedicated goroutine that checks these quotas and writes the bytes on the wire. This eliminates contention in highly concurrent environments.
Furthermore, the dedicated writer goroutine is tuned to maximize the batch size per syscall: when there isn't much data to write, it yields its processor (runtime.Gosched()) so that stream goroutines get scheduled and add more data to controlBuf, eventually increasing the amount written per syscall. This also improves QPS by about 20% (benchmark not included below).
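For intuition, here is a minimal, self-contained sketch of the pattern; controlBuffer, writerLoop, and the byte-slice framing are simplified stand-ins, not the actual grpc-go internals:

```go
package transport

import (
	"runtime"
	"sync"
)

// controlBuffer is a simplified stand-in for grpc-go's controlBuf: a
// mutex-guarded queue that stream goroutines fill and a single writer
// goroutine drains. Stream goroutines never touch the wire or the
// flow-control quotas; they only enqueue and move on.
type controlBuffer struct {
	mu    sync.Mutex
	items [][]byte
}

// put is called by stream goroutines; the only lock it takes is the
// queue's own mutex, held for a few instructions.
func (cb *controlBuffer) put(frame []byte) {
	cb.mu.Lock()
	cb.items = append(cb.items, frame)
	cb.mu.Unlock()
}

// tryGet pops the next queued frame, or returns nil if the queue is empty.
func (cb *controlBuffer) tryGet() []byte {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if len(cb.items) == 0 {
		return nil
	}
	frame := cb.items[0]
	cb.items = cb.items[1:]
	return frame
}

// writerLoop is the dedicated goroutine: it alone would check quotas
// (elided here) and write to the wire. When the queue runs dry while
// data is pending, it yields the processor once so producers can
// enqueue more, then flushes one larger batch in a single write.
func writerLoop(cb *controlBuffer, write func([]byte)) {
	var batch []byte
	for {
		if frame := cb.tryGet(); frame != nil {
			batch = append(batch, frame...)
			continue
		}
		if len(batch) == 0 {
			runtime.Gosched() // idle: let stream goroutines run
			continue
		}
		runtime.Gosched() // pending data: give producers one more turn
		if frame := cb.tryGet(); frame != nil {
			batch = append(batch, frame...)
			continue
		}
		write(batch) // one syscall for the whole accumulated batch
		batch = batch[:0]
	}
}
```

In the real transport the writer blocks rather than spinning when there is truly nothing to do; the spin-and-yield above only keeps the sketch short while still showing how yielding grows the batch per write syscall.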

Following are the benchmark results:

Pinger benchmark:
mmukhi@mmukhi:~/sandbox/pinger$ ./pinger -p 100 -n 1 -d 10s -t grpc

Before
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  82081.1      1.6      1.6      2.1      3.4     11.5

After
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  99347.8      1.9      0.7      1.8      2.5      8.4

Internal benchmarks:

                    Before     After
Multi QPS           467971    553490
Multi throughput     15068     14771
Single throughput     7891      7896

gRPC-Go OSS benchmarks:

  • Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 1554.5760 µs	90_Latency: 1860.5830 µs	99_Latency: 2400.5140 µs	Avg latency: 1602.7460 µs	Count: 3742535	8059 Bytes/op	146 Allocs/op
Histogram (unit: µs)
Count: 3742535  Min: 220.7  Max: 12036.0  Avg: 1602.75

After
50_Latency: 1056.6540 µs	90_Latency: 1392.0010 µs	99_Latency: 1933.4470 µs	Avg latency: 1096.5640 µs	Count: 5469064	8424 Bytes/op	158 Allocs/op
Histogram (unit: µs)
Count: 5469064  Min: 161.8  Max: 9567.4  Avg: 1096.56

  • Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 285.8070 µs	90_Latency: 399.3940 µs	99_Latency: 574.6630 µs	Avg latency: 298.8770 µs	Count: 20034781	714 Bytes/op	31 Allocs/op
Histogram (unit: µs)
Count: 20034781  Min:  53.7  Max: 30833.8  Avg: 298.88

After
50_Latency: 241.8850 µs	90_Latency: 329.1310 µs	99_Latency: 444.5360 µs	Avg latency: 250.7900 µs	Count: 23848730	768 Bytes/op	34 Allocs/op
Histogram (unit: µs)
Count: 23848730  Min:  50.6  Max: 28181.1  Avg: 250.79

  • Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 16.7664 ms	90_Latency: 18.1225 ms	99_Latency: 19.6502 ms	Avg latency: 16.7603 ms	Count: 3579757	8043 Bytes/op	146 Allocs/op
Histogram (unit: ms)
Count: 3579757  Min:   5.1  Max:  30.3  Avg: 16.76

After
50_Latency: 10285.5940 µs	90_Latency: 12159.6960 µs	99_Latency: 13847.5580 µs	Avg latency: 10339.7370 µs	Count: 5801603	8406 Bytes/op	156 Allocs/op
Histogram (unit: µs)
Count: 5801603  Min: 348.0  Max: 25294.0  Avg: 10339.74

  • Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 2036.5920 µs	90_Latency: 2858.7270 µs	99_Latency: 5129.1910 µs	Avg latency: 2209.2680 µs	Count: 27136942	712 Bytes/op	30 Allocs/op
Histogram (unit: µs)
Count: 27136942  Min:  81.2  Max: 58660.0  Avg: 2209.27

After
50_Latency: 1652.7570 µs	90_Latency: 2263.1300 µs	99_Latency: 3555.5170 µs	Avg latency: 1750.0020 µs	Count: 34260286	758 Bytes/op	33 Allocs/op
Histogram (unit: µs)
Count: 34260286  Min: 115.0  Max: 52355.5  Avg: 1750.00

@MakMukhi requested a review from dfawley April 2, 2018 22:03
@MakMukhi force-pushed the sdlr_oss branch 3 times, most recently from 98874de to 8dd7549, April 3, 2018 02:25
@MakMukhi added the "Type: Performance" label Apr 3, 2018
@dfawley changed the title from "Scheduler" to "transport: refactor to reduce lock contention and improve performance" Apr 4, 2018
@dfawley (Member) left a comment

Please add benchmark results before/after to the PR description.

// In case this is triggered because clientConn.Close()
// was called, we want to immediately close the transport
// since no other goroutine might notice it for a while.
t.Close()

@jadekler note this new requirement for the grpc layer to manually close the transport. This will need to be done by the onError callback in your PR.


Thanks! I'll add it.
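For illustration, a minimal sketch of what such a callback might look like; ClientTransport is pared down to one method and the onError name is hypothetical, since the actual callback lives in the PR referenced above:

```go
package main

import "fmt"

// ClientTransport is reduced to the one method that matters for this
// illustration; the real interface in the transport package is larger.
type ClientTransport interface {
	Close() error
}

type stubTransport struct{}

func (stubTransport) Close() error {
	fmt.Println("transport closed by the grpc layer")
	return nil
}

// onError is a hypothetical error callback: when the transport reports
// an error (e.g. because clientConn.Close() was called), the grpc
// layer must now close the transport itself, since no other goroutine
// might notice the failure for a while.
func onError(t ClientTransport) {
	// ...record the error, kick off reconnection, etc....
	_ = t.Close()
}

func main() {
	onError(stubTransport{})
}
```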

@MakMukhi merged commit d0a21a3 into grpc:master Apr 5, 2018
@dfawley added this to the 1.12 Release milestone Apr 5, 2018
jeanbza added a commit to jeanbza/grpc-go that referenced this pull request Apr 6, 2018
@MakMukhi deleted the sdlr_oss branch May 4, 2018 02:10
@lock bot locked as resolved and limited conversation to collaborators Oct 31, 2018