
server.go: use worker goroutines for fewer stack allocations #3204

Merged: 6 commits from the adtac/workers branch into grpc:master on Apr 23, 2020

Conversation

@adtac (Contributor) commented Nov 21, 2019:

Currently (go1.13.4), the default stack size for newly spawned
goroutines is 2048 bytes. This is insufficient when processing gRPC
requests, as we often require more than 4 KiB stacks. This causes the
Go runtime to call runtime.morestack at least twice per RPC, which
causes performance to suffer needlessly as stack reallocations require
all sorts of internal work such as changing pointers to point to new
addresses.

Since this stack growth is guaranteed to happen at least twice per RPC,
reusing goroutines gives us two wins:

  1. The stack is already grown to 8 KiB after the first RPC, so
    subsequent RPCs do not call runtime.morestack.
  2. We eliminate the need to spawn a new goroutine for each request
    (even though they're relatively inexpensive).

Performance improves across the board. The improvement is especially
visible in small, unary requests as the overhead of stack reallocation
is higher, percentage-wise. QPS is up anywhere between 3% and 5%
depending on the number of concurrent RPC requests in flight. Latency is
down ~3%. There is even a ~1% decrease in memory footprint in some cases,
though that is an unintended but happy coincidence.

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_8-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps      2613512      2701705     3.37%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      8657.00      8654.17    -0.03%
           Allocs/op       173.37       173.28     0.00%
             ReqT/op    348468.27    360227.33     3.37%
            RespT/op    348468.27    360227.33     3.37%
            50th-Lat    174.601µs    167.378µs    -4.14%
            90th-Lat    233.132µs    229.087µs    -1.74%
            99th-Lat     438.98µs    441.857µs     0.66%
             Avg-Lat    183.263µs     177.26µs    -3.28%
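The pattern, reduced to a minimal, self-contained sketch (illustrative names such as dispatch, workerChannels and rrCounter, not the PR's actual identifiers): a fixed pool of long-lived worker goroutines, one channel per worker, round-robin dispatch, and a fallback to a plain go statement when the chosen worker is busy.

import "sync/atomic"

const numWorkers = 16

type server struct {
    workerChannels [numWorkers]chan func()
    rrCounter      uint32
}

// startWorkers spawns the long-lived workers. Each worker keeps its
// already-grown stack across requests, so runtime.morestack is not paid
// on every RPC.
func (s *server) startWorkers() {
    for i := range s.workerChannels {
        s.workerChannels[i] = make(chan func())
        go s.worker(s.workerChannels[i])
    }
}

func (s *server) worker(ch chan func()) {
    for task := range ch {
        task()
    }
}

// dispatch hands the request to the next worker in round-robin order; if
// that worker cannot receive immediately, it falls back to spawning a
// fresh goroutine, as before.
func (s *server) dispatch(task func()) {
    idx := atomic.AddUint32(&s.rrCounter, 1) % numWorkers
    select {
    case s.workerChannels[idx] <- task:
    default:
        go task()
    }
}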

@adtac force-pushed the adtac/workers branch 2 times, most recently from bf9688f to 532ec54, on November 22, 2019 00:27
@adtac changed the title from "[WIP] server.go: use worker goroutines for fewer stack allocations" to "server.go: use worker goroutines for fewer stack allocations" on Nov 22, 2019
@adtac force-pushed the adtac/workers branch 2 times, most recently from f1bbea6 to 7fc9be1, on November 22, 2019 00:42
@adtac force-pushed the adtac/workers branch 3 times, most recently from 9cda117 to 7d808bc, on November 22, 2019 00:45
@adtac (Contributor, Author) commented Nov 22, 2019:

Unary RPCs (1-byte req/resp). The impact is similar but smaller with larger payloads and streaming RPCs.

[benchmark charts: QPS, 50th-percentile latency, bytes/op]

@adtac force-pushed the adtac/workers branch 2 times, most recently from edced23 to cb330a0, on November 22, 2019 20:42
@menghanl menghanl self-requested a review December 5, 2019 22:50
@menghanl menghanl self-assigned this Dec 5, 2019
@dfawley (Member) commented Dec 18, 2019:

Per offline discussions:

  • Please add a ServerOption to configure this value, disabled by default
  • Benchmark the difference between using % instead of &.
  • Use streamID >> 1 instead of streamID since the client always uses odd stream IDs.
  • Run benchmarks to compare this approach with the more obvious one that uses a single channel with multiple listeners and report the results.

@adtac (Contributor, Author) commented Dec 18, 2019:

Please add a ServerOption to configure this value, disabled by default

Done.

Benchmark the difference between using % instead of &.

Turns out to be insignificant. Also, since we're now allowing the user to set the number of stream workers, requiring it to be a power of two is probably not a good idea. Switched to modulo.
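For reference, a tiny illustration of the trade-off (hypothetical helper names): the bitmask trick is only valid when the worker count is a power of two, whereas modulo works for any count.

// indexMask is only correct when numWorkers is a power of two (e.g. 16):
// counter & (numWorkers-1) keeps exactly the low log2(numWorkers) bits.
func indexMask(counter, numWorkers uint32) uint32 {
    return counter & (numWorkers - 1)
}

// indexMod works for any numWorkers > 0, at the cost of an integer division.
func indexMod(counter, numWorkers uint32) uint32 {
    return counter % numWorkers
}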

Use streamID >> 1 instead of streamID since the client always uses odd stream IDs.

I just realised that there's a slight issue with using the stream ID -- it's biased towards lower numbers. If a server receives one unary request from each of several clients (a common workload), they'll all go to the same channel, because the first client-initiated stream ID on every connection is 1, so streamID >> 1 is always 0 for that first request.

Switched to a round-robin method that's actually fairer and ~1% faster (compared to master, round-robin is a ~5.33% improvement while stream ID is a ~4.54% improvement).
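To make the comparison concrete, a sketch of the two selection strategies (hypothetical helper names): client-initiated HTTP/2 streams use odd IDs starting at 1, so a stream-ID-based index maps every connection's first RPC to worker 0, while a shared atomic counter spreads requests evenly.

import "sync/atomic"

// pickByStreamID: the first client-initiated stream on every connection has
// ID 1, so streamID>>1 is 0 and each connection's first RPC lands on worker 0.
func pickByStreamID(streamID, numWorkers uint32) uint32 {
    return (streamID >> 1) % numWorkers
}

// pickRoundRobin: a shared atomic counter distributes requests evenly,
// independent of which connection or stream they arrive on.
func pickRoundRobin(counter *uint32, numWorkers uint32) uint32 {
    return atomic.AddUint32(counter, 1) % numWorkers
}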

Run benchmarks to compare this approach with the more obvious one that uses a single channel with multiple listeners and report the results.

See previous comment.

Pushed V2.

Adhityaa Chandrasekar added 2 commits December 18, 2019 15:40
@dfawley dfawley added the Type: Performance Performance improvements (CPU, network, memory, etc) label Dec 19, 2019
@dfawley dfawley added this to the 1.27 Release milestone Dec 19, 2019
@dfawley (Member) left a comment:
After these changes, can you re-run some benchmarks and provide the results here? Thanks!

Comment on lines +417 to +422
// serverWorkerResetThreshold defines how often the stack must be reset. Every
// N requests, by spawning a new goroutine in its place, a worker can reset its
// stack so that large stacks don't live in memory forever. 2^16 should allow
// each goroutine stack to live for at least a few seconds in a typical
// workload (assuming a QPS of a few thousand requests/sec).
const serverWorkerResetThreshold = 1 << 16
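For context, a sketch (not the PR's exact code) of how such a threshold can be used: after serverWorkerResetThreshold tasks, the worker spawns a replacement goroutine (which starts with a small, fresh stack) and returns, so its own grown stack can be reclaimed.

// resettingWorker serves up to serverWorkerResetThreshold tasks, then
// replaces itself with a fresh goroutine so its grown stack is not kept
// alive forever.
func resettingWorker(ch chan func()) {
    for completed := 0; completed < serverWorkerResetThreshold; completed++ {
        task, ok := <-ch
        if !ok {
            return // channel closed: the server is shutting down
        }
        task()
    }
    // Hand the channel to a replacement goroutine with a small, fresh stack.
    go resettingWorker(ch)
}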
A member commented:
Brainstorming: should this be time-based instead of request-based? Or some combination of both?

If a server goes idle, we may want to restart the threads.

@adtac (Contributor, Author) replied:

Possibly yes, but I expect the overhead introduced by that to be non-negligible. It also depends heavily on how Go's runtime manages shrinking stacks during GC pauses, I think.

The member replied:

I think we could do it with a Timer checked in the same select as the workload without being too expensive. But this is fine for now - we can work on this more in the future.

@CAFxX commented May 19, 2020:

a late comment just to point out that the runtime shrinks the stacks of goroutines during GC (or at the next safepoint), so this may not be needed at all (and is potentially counter-productive): https://github.com/golang/go/blob/9d812cfa5cbb1f573d61c452c864072270526753/src/runtime/mgcmark.go#L781-L783

Dropping this part entirely would have the benefit of making it much easier to implement adaptive workers (discussed below).

@dfawley (Member) commented Dec 20, 2019:

(Still interested in seeing some new results after the latest changes, otherwise LGTM.)

@adtac (Contributor, Author) commented Dec 20, 2019:

ok this is weird, I'm now getting a 15% perf improvement -- why's there a sudden jump at 16? I think it's because I was using 16 worker goroutines.

[benchmark charts: 50th-percentile latency, QPS]

@menghanl menghanl assigned dfawley and unassigned adtac Jan 9, 2020
@menghanl menghanl modified the milestones: 1.27 Release, 1.28 Release Jan 28, 2020
@dfawley dfawley modified the milestones: 1.28 Release, 1.29 Release Mar 5, 2020
@dfawley dfawley removed their assignment Mar 20, 2020
@easwars easwars modified the milestones: 1.29 Release, 1.30 Release Apr 8, 2020
@menghanl menghanl merged commit a0cdc21 into grpc:master Apr 23, 2020
Comment on lines +803 to +804
s.handleStream(st, stream, s.traceInfo(st, stream))
wg.Done()

this should have been

defer wg.Done()
s.handleStream(st, stream, s.traceInfo(st, stream))

as it is on line 809 and as it was before the change

if s.opts.numServerWorkers > 0 {
data := &serverWorkerData{st: st, wg: &wg, stream: stream}
select {
case s.serverWorkerChannels[atomic.AddUint32(&roundRobinCounter, 1)%s.opts.numServerWorkers] <- data:

this is potentially pretty suboptimal in case some workers are busy with requests that take significant time.

select {
case s.serverWorkerChannels[atomic.AddUint32(&roundRobinCounter, 1)%s.opts.numServerWorkers] <- data:
default:
// If all stream workers are busy, fallback to the default code path.
@CAFxX commented May 19, 2020:

I think we should add a worker in this case, not a one-off goroutine.

Extra workers created over numServerWorkers should linger for a short period of time waiting for new work once they're done, and then shut down if no new work arrives. This would in turn make it possible to remove the numServerWorkers knob, as new workers would be created as needed.
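A rough sketch of what such a lingering worker could look like (hypothetical names and an assumed idleTimeout parameter; this is not part of the PR): the extra worker keeps draining its channel and exits only after sitting idle for a while.

import "time"

// lingerWorker serves tasks from ch and exits after idleTimeout passes with
// no new work, so extra workers created under load eventually disappear.
func lingerWorker(ch chan func(), idleTimeout time.Duration) {
    idle := time.NewTimer(idleTimeout)
    defer idle.Stop()
    for {
        select {
        case task, ok := <-ch:
            if !ok {
                return // channel closed: the server is shutting down
            }
            task()
            // Restart the idle countdown after each completed task.
            if !idle.Stop() {
                <-idle.C
            }
            idle.Reset(idleTimeout)
        case <-idle.C:
            return // no work within idleTimeout: retire this extra worker
        }
    }
}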

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 25, 2021