transport: refactor to reduce lock contention and improve performance #1962

Merged: 1 commit merged into grpc:master from the sdlr_oss branch on Apr 5, 2018

Conversation

@MakMukhi (Contributor) commented Apr 2, 2018

This change frees stream (RPC) goroutines from the onus of checking transport-level quotas, such as flow control, and removes the global locks. Streams instead schedule their messages and headers by adding them to controlBuf, which is read by a dedicated goroutine that checks these quotas and writes the bytes on the wire. This eliminates contention in highly concurrent environments.
Furthermore, the dedicated writer goroutine is tuned to maximize the batch size per syscall: when there isn't much data to write, it yields its processor (runtime.Gosched()) so that stream goroutines get scheduled and add more data to controlBuf, eventually increasing the amount written per syscall. This also improves QPS by about 20% (benchmark not included below).
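For intuition, here is a minimal, self-contained sketch of the pattern; controlBuffer, writerLoop, and the byte-slice framing are simplified stand-ins, not the actual grpc-go internals:

```go
package transport

import (
	"runtime"
	"sync"
)

// controlBuffer is a simplified stand-in for grpc-go's controlBuf: a
// mutex-guarded queue that stream goroutines fill and a single writer
// goroutine drains. Stream goroutines never touch the wire or the
// flow-control quotas; they only enqueue and move on.
type controlBuffer struct {
	mu    sync.Mutex
	items [][]byte
}

// put is called by stream goroutines; the only lock it takes is the
// queue's own mutex, held for a few instructions.
func (cb *controlBuffer) put(frame []byte) {
	cb.mu.Lock()
	cb.items = append(cb.items, frame)
	cb.mu.Unlock()
}

// tryGet pops the next queued frame, or returns nil if the queue is empty.
func (cb *controlBuffer) tryGet() []byte {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if len(cb.items) == 0 {
		return nil
	}
	frame := cb.items[0]
	cb.items = cb.items[1:]
	return frame
}

// writerLoop is the dedicated goroutine: it alone would check quotas
// (elided here) and write to the wire. When the queue runs dry while
// data is pending, it yields the processor once so producers can
// enqueue more, then flushes one larger batch in a single write.
func writerLoop(cb *controlBuffer, write func([]byte)) {
	var batch []byte
	for {
		if frame := cb.tryGet(); frame != nil {
			batch = append(batch, frame...)
			continue
		}
		if len(batch) == 0 {
			runtime.Gosched() // idle: let stream goroutines run
			continue
		}
		runtime.Gosched() // pending data: give producers one more turn
		if frame := cb.tryGet(); frame != nil {
			batch = append(batch, frame...)
			continue
		}
		write(batch) // one syscall for the whole accumulated batch
		batch = batch[:0]
	}
}
```

In the real transport the writer blocks rather than spinning when there is truly nothing to do; the spin-and-yield above only keeps the sketch short while still showing how yielding grows the batch per write syscall.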

Following are the benchmark results:

Pinger benchmark:
mmukhi@mmukhi:~/sandbox/pinger$ ./pinger -p 100 -n 1 -d 10s -t grpc

Before
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  82081.1      1.6      1.6      2.1      3.4     11.5

After
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     10s  99347.8      1.9      0.7      1.8      2.5      8.4

Internal benchmarks:

                    Before     After
Multi QPS           467971    553490
Multi throughput     15068     14771
Single throughput     7891      7896

gRPC-Go OSS benchmarks:

  • Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 1554.5760 µs	90_Latency: 1860.5830 µs	99_Latency: 2400.5140 µs	Avg latency: 1602.7460 µs	Count: 3742535	8059 Bytes/op	146 Allocs/op
Histogram (unit: µs)
Count: 3742535  Min: 220.7  Max: 12036.0  Avg: 1602.75

After
50_Latency: 1056.6540 µs	90_Latency: 1392.0010 µs	99_Latency: 1933.4470 µs	Avg latency: 1096.5640 µs	Count: 5469064	8424 Bytes/op	158 Allocs/op
Histogram (unit: µs)
Count: 5469064  Min: 161.8  Max: 9567.4  Avg: 1096.56

  • Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 285.8070 µs	90_Latency: 399.3940 µs	99_Latency: 574.6630 µs	Avg latency: 298.8770 µs	Count: 20034781	714 Bytes/op	31 Allocs/op
Histogram (unit: µs)
Count: 20034781  Min:  53.7  Max: 30833.8  Avg: 298.88

After
50_Latency: 241.8850 µs	90_Latency: 329.1310 µs	99_Latency: 444.5360 µs	Avg latency: 250.7900 µs	Count: 23848730	768 Bytes/op	34 Allocs/op
Histogram (unit: µs)
Count: 23848730  Min:  50.6  Max: 28181.1  Avg: 250.79

  • Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 16.7664 ms	90_Latency: 18.1225 ms	99_Latency: 19.6502 ms	Avg latency: 16.7603 ms	Count: 3579757	8043 Bytes/op	146 Allocs/op
Histogram (unit: ms)
Count: 3579757  Min:   5.1  Max:  30.3  Avg: 16.76

After
50_Latency: 10285.5940 µs	90_Latency: 12159.6960 µs	99_Latency: 13847.5580 µs	Avg latency: 10339.7370 µs	Count: 5801603	8406 Bytes/op	156 Allocs/op
Histogram (unit: µs)
Count: 5801603  Min: 348.0  Max: 25294.0  Avg: 10339.74

  • Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1B-respSize_1B-Compressor_false:
Before
50_Latency: 2036.5920 µs	90_Latency: 2858.7270 µs	99_Latency: 5129.1910 µs	Avg latency: 2209.2680 µs	Count: 27136942	712 Bytes/op	30 Allocs/op
Histogram (unit: µs)
Count: 27136942  Min:  81.2  Max: 58660.0  Avg: 2209.27

After
50_Latency: 1652.7570 µs	90_Latency: 2263.1300 µs	99_Latency: 3555.5170 µs	Avg latency: 1750.0020 µs	Count: 34260286	758 Bytes/op	33 Allocs/op
Histogram (unit: µs)
Count: 34260286  Min: 115.0  Max: 52355.5  Avg: 1750.00

@MakMukhi requested a review from dfawley April 2, 2018 22:03
@MakMukhi force-pushed the sdlr_oss branch 3 times, most recently from 98874de to 8dd7549, April 3, 2018 02:25
@MakMukhi added the "Type: Performance" label Apr 3, 2018
@dfawley changed the title from "Scheduler" to "transport: refactor to reduce lock contention and improve performance" Apr 4, 2018
@dfawley (Member) left a comment

Please add benchmark results before/after to the PR description.

// In case this is triggered because clientConn.Close()
// was called, we want to immediately close the transport
// since no other goroutine might notice it for a while.
t.Close()

@jadekler note this new requirement for the grpc layer to manually close the transport. This will need to be done by the onError callback in your PR.


Thanks! I'll add it.
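For illustration, a minimal sketch of what such a callback might look like; ClientTransport is pared down to one method and the onError name is hypothetical, since the actual callback lives in the PR referenced above:

```go
package main

import "fmt"

// ClientTransport is reduced to the one method that matters for this
// illustration; the real interface in the transport package is larger.
type ClientTransport interface {
	Close() error
}

type stubTransport struct{}

func (stubTransport) Close() error {
	fmt.Println("transport closed by the grpc layer")
	return nil
}

// onError is a hypothetical error callback: when the transport reports
// an error (e.g. because clientConn.Close() was called), the grpc
// layer must now close the transport itself, since no other goroutine
// might notice the failure for a while.
func onError(t ClientTransport) {
	// ...record the error, kick off reconnection, etc....
	_ = t.Close()
}

func main() {
	onError(stubTransport{})
}
```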

@MakMukhi merged commit d0a21a3 into grpc:master Apr 5, 2018
@dfawley added this to the 1.12 Release milestone Apr 5, 2018
jeanbza added a commit to jeanbza/grpc-go that referenced this pull request Apr 6, 2018
@MakMukhi deleted the sdlr_oss branch May 4, 2018 02:10
@lock bot locked as resolved and limited conversation to collaborators Oct 31, 2018