
server.go: use worker goroutines for fewer stack allocations #3204

Merged: 6 commits from the adtac/workers branch into grpc:master on Apr 23, 2020

Conversation

@adtac (Contributor) commented Nov 21, 2019:

Currently (go1.13.4), the default stack size for newly spawned
goroutines is 2048 bytes. This is insufficient when processing gRPC
requests, as we often require more than 4 KiB stacks. This causes the
Go runtime to call runtime.morestack at least twice per RPC, which
causes performance to suffer needlessly as stack reallocations require
all sorts of internal work such as changing pointers to point to new
addresses.

Since this stack growth is guaranteed to happen at least twice per RPC,
reusing goroutines gives us two wins:

  1. The stack is already grown to 8 KiB after the first RPC, so
    subsequent RPCs do not call runtime.morestack.
  2. We eliminate the need to spawn a new goroutine for each request
    (even though they're relatively inexpensive).

Performance improves across the board. The improvement is especially
visible in small, unary requests as the overhead of stack reallocation
is higher, percentage-wise. QPS is up anywhere between 3% and 5%
depending on the number of concurrent RPC requests in flight. Latency is
down ~3%. There is even a ~1% decrease in memory footprint in some cases,
though that is an unintended but happy coincidence.

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_8-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps      2613512      2701705     3.37%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      8657.00      8654.17    -0.03%
           Allocs/op       173.37       173.28     0.00%
             ReqT/op    348468.27    360227.33     3.37%
            RespT/op    348468.27    360227.33     3.37%
            50th-Lat    174.601µs    167.378µs    -4.14%
            90th-Lat    233.132µs    229.087µs    -1.74%
            99th-Lat     438.98µs    441.857µs     0.66%
             Avg-Lat    183.263µs     177.26µs    -3.28%
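The pattern, reduced to a minimal, self-contained sketch (illustrative names such as dispatch, workerChannels and rrCounter, not the PR's actual identifiers): a fixed pool of long-lived worker goroutines, one channel per worker, round-robin dispatch, and a fallback to a plain go statement when the chosen worker is busy.

import "sync/atomic"

const numWorkers = 16

type server struct {
    workerChannels [numWorkers]chan func()
    rrCounter      uint32
}

// startWorkers spawns the long-lived workers. Each worker keeps its
// already-grown stack across requests, so runtime.morestack is not paid
// on every RPC.
func (s *server) startWorkers() {
    for i := range s.workerChannels {
        s.workerChannels[i] = make(chan func())
        go s.worker(s.workerChannels[i])
    }
}

func (s *server) worker(ch chan func()) {
    for task := range ch {
        task()
    }
}

// dispatch hands the request to the next worker in round-robin order; if
// that worker cannot receive immediately, it falls back to spawning a
// fresh goroutine, as before.
func (s *server) dispatch(task func()) {
    idx := atomic.AddUint32(&s.rrCounter, 1) % numWorkers
    select {
    case s.workerChannels[idx] <- task:
    default:
        go task()
    }
}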

@adtac force-pushed the adtac/workers branch 2 times, most recently from bf9688f to 532ec54, on November 22, 2019 00:27
@adtac changed the title from "[WIP] server.go: use worker goroutines for fewer stack allocations" to "server.go: use worker goroutines for fewer stack allocations" on Nov 22, 2019
@adtac force-pushed the adtac/workers branch 2 times, most recently from f1bbea6 to 7fc9be1, on November 22, 2019 00:42
@adtac force-pushed the adtac/workers branch 3 times, most recently from 9cda117 to 7d808bc, on November 22, 2019 00:45
@adtac (Contributor, Author) commented Nov 22, 2019:

Unary RPCs (1-byte req/resp). The impact is similar but smaller with larger payloads and streaming RPCs.

[benchmark charts: QPS, 50th-percentile latency, bytes/op]

@adtac force-pushed the adtac/workers branch 2 times, most recently from edced23 to cb330a0, on November 22, 2019 20:42
@menghanl menghanl self-requested a review December 5, 2019 22:50
@menghanl menghanl self-assigned this Dec 5, 2019
@dfawley (Member) commented Dec 18, 2019:

Per offline discussions:

  • Please add a ServerOption to configure this value, disabled by default
  • Benchmark the difference between using % instead of &.
  • Use streamID >> 1 instead of streamID since the client always uses odd stream IDs.
  • Run benchmarks to compare this approach with the more obvious one that uses a single channel with multiple listeners and report the results.

@adtac (Contributor, Author) commented Dec 18, 2019:

Please add a ServerOption to configure this value, disabled by default

Done.

Benchmark the difference between using % instead of &.

Turns out to be insignificant. Also, since we're now allowing the user to set the number of stream workers, requiring it to be a power of two is probably not a good idea. Switched to modulo.
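For reference, a tiny illustration of the trade-off (hypothetical helper names): the bitmask trick is only valid when the worker count is a power of two, whereas modulo works for any count.

// indexMask is only correct when numWorkers is a power of two (e.g. 16):
// counter & (numWorkers-1) keeps exactly the low log2(numWorkers) bits.
func indexMask(counter, numWorkers uint32) uint32 {
    return counter & (numWorkers - 1)
}

// indexMod works for any numWorkers > 0, at the cost of an integer division.
func indexMod(counter, numWorkers uint32) uint32 {
    return counter % numWorkers
}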

Use streamID >> 1 instead of streamID since the client always uses odd stream IDs.

I just realised that there's a slight issue with using the stream ID -- it's biased towards lower numbers. If a server receives one unary request from each of several clients (a common workload), they'll all go to the same channel, because the first client-initiated stream ID on every connection is 1, so streamID >> 1 is always 0 for that first request.

Switched to a round-robin method that's actually fairer and ~1% faster (compared to master, round-robin is a ~5.33% improvement while stream ID is a ~4.54% improvement).
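To make the comparison concrete, a sketch of the two selection strategies (hypothetical helper names): client-initiated HTTP/2 streams use odd IDs starting at 1, so a stream-ID-based index maps every connection's first RPC to worker 0, while a shared atomic counter spreads requests evenly.

import "sync/atomic"

// pickByStreamID: the first client-initiated stream on every connection has
// ID 1, so streamID>>1 is 0 and each connection's first RPC lands on worker 0.
func pickByStreamID(streamID, numWorkers uint32) uint32 {
    return (streamID >> 1) % numWorkers
}

// pickRoundRobin: a shared atomic counter distributes requests evenly,
// independent of which connection or stream they arrive on.
func pickRoundRobin(counter *uint32, numWorkers uint32) uint32 {
    return atomic.AddUint32(counter, 1) % numWorkers
}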

Run benchmarks to compare this approach with the more obvious one that uses a single channel with multiple listeners and report the results.

See previous comment.

Pushed V2.

Adhityaa Chandrasekar added 2 commits December 18, 2019 15:40
@dfawley dfawley added the Type: Performance Performance improvements (CPU, network, memory, etc) label Dec 19, 2019
@dfawley dfawley added this to the 1.27 Release milestone Dec 19, 2019
@dfawley (Member) left a comment:
After these changes, can you re-run some benchmarks and provide the results here? Thanks!

Comment on lines +417 to +422
// serverWorkerResetThreshold defines how often the stack must be reset. Every
// N requests, by spawning a new goroutine in its place, a worker can reset its
// stack so that large stacks don't live in memory forever. 2^16 should allow
// each goroutine stack to live for at least a few seconds in a typical
// workload (assuming a QPS of a few thousand requests/sec).
const serverWorkerResetThreshold = 1 << 16
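For context, a sketch (not the PR's exact code) of how such a threshold can be used: after serverWorkerResetThreshold tasks, the worker spawns a replacement goroutine (which starts with a small, fresh stack) and returns, so its own grown stack can be reclaimed.

// resettingWorker serves up to serverWorkerResetThreshold tasks, then
// replaces itself with a fresh goroutine so its grown stack is not kept
// alive forever.
func resettingWorker(ch chan func()) {
    for completed := 0; completed < serverWorkerResetThreshold; completed++ {
        task, ok := <-ch
        if !ok {
            return // channel closed: the server is shutting down
        }
        task()
    }
    // Hand the channel to a replacement goroutine with a small, fresh stack.
    go resettingWorker(ch)
}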
A member commented:
Brainstorming: should this be time-based instead of request-based? Or some combination of both?

If a server goes idle, we may want to restart the threads.

@adtac (Contributor, Author) replied:

Possibly yes, but I expect the overhead introduced by that to be non-negligible. It also depends heavily on how Go's runtime manages shrinking stacks during GC pauses, I think.

The member replied:

I think we could do it with a Timer checked in the same select as the workload without being too expensive. But this is fine for now - we can work on this more in the future.

@CAFxX commented May 19, 2020:

a late comment just to point out that the runtime shrinks the stacks of goroutines during GC (or at the next safepoint), so this may not be needed at all (and is potentially counter-productive): https://github.com/golang/go/blob/9d812cfa5cbb1f573d61c452c864072270526753/src/runtime/mgcmark.go#L781-L783

Dropping this part entirely would have the benefit of making it much easier to implement adaptive workers (discussed below).

@dfawley (Member) commented Dec 20, 2019:

(Still interested in seeing some new results after the latest changes, otherwise LGTM.)

@adtac (Contributor, Author) commented Dec 20, 2019:

ok this is weird, I'm now getting a 15% perf improvement -- why's there a sudden jump at 16? I think it's because I was using 16 worker goroutines.

[benchmark charts: 50th-percentile latency, QPS]

@menghanl menghanl assigned dfawley and unassigned adtac Jan 9, 2020
@menghanl menghanl modified the milestones: 1.27 Release, 1.28 Release Jan 28, 2020
@dfawley dfawley modified the milestones: 1.28 Release, 1.29 Release Mar 5, 2020
@dfawley dfawley removed their assignment Mar 20, 2020
@easwars easwars modified the milestones: 1.29 Release, 1.30 Release Apr 8, 2020
@menghanl menghanl merged commit a0cdc21 into grpc:master Apr 23, 2020
Comment on lines +803 to +804
s.handleStream(st, stream, s.traceInfo(st, stream))
wg.Done()

this should have been

defer wg.Done()
s.handleStream(st, stream, s.traceInfo(st, stream))

as it is on line 809 and as it was before the change

if s.opts.numServerWorkers > 0 {
data := &serverWorkerData{st: st, wg: &wg, stream: stream}
select {
case s.serverWorkerChannels[atomic.AddUint32(&roundRobinCounter, 1)%s.opts.numServerWorkers] <- data:

this is potentially pretty suboptimal in case some workers are busy with requests that take significant time.

select {
case s.serverWorkerChannels[atomic.AddUint32(&roundRobinCounter, 1)%s.opts.numServerWorkers] <- data:
default:
// If all stream workers are busy, fallback to the default code path.
@CAFxX commented May 19, 2020:

I think we should add a worker in this case, not a one-off goroutine.

Extra workers created over numServerWorkers should linger for a short period of time waiting for new work once they're done, and then shut down if no new work arrives. This would in turn make it possible to remove the numServerWorkers knob, as new workers would be created as needed.
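A rough sketch of what such a lingering worker could look like (hypothetical names and an assumed idleTimeout parameter; this is not part of the PR): the extra worker keeps draining its channel and exits only after sitting idle for a while.

import "time"

// lingerWorker serves tasks from ch and exits after idleTimeout passes with
// no new work, so extra workers created under load eventually disappear.
func lingerWorker(ch chan func(), idleTimeout time.Duration) {
    idle := time.NewTimer(idleTimeout)
    defer idle.Stop()
    for {
        select {
        case task, ok := <-ch:
            if !ok {
                return // channel closed: the server is shutting down
            }
            task()
            // Restart the idle countdown after each completed task.
            if !idle.Stop() {
                <-idle.C
            }
            idle.Reset(idleTimeout)
        case <-idle.C:
            return // no work within idleTimeout: retire this extra worker
        }
    }
}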

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 25, 2021