Description
We recently started using profile-guided optimization (PGO) for our Go gRPC services, and in some cases saw a significant increase in memory usage from the optimized binaries.
The details of the investigation can be found in golang/go#65532. To summarize, PGO may inline `internal/transport.(*loopyWriter).processData`, which is called 3 times in `internal/transport.(*loopyWriter).run`. This is the goroutine that schedules writes of HTTP/2 frames on TCP connections. `processData` allocates a 16KiB array on the stack to construct a frame, so inlining it in `loopyWriter.run` results in a fixed total of 48KiB of memory allocated per connection, instead of 16KiB (which may even be released while loopy is blocked). When there are many connections, this can add up to a lot of memory. One of our production services saw a 20% memory increase after building with PGO due to this issue.
There are ways to keep using PGO (which otherwise provides worthwhile gains) while avoiding this undesirable side effect, but they require changes to grpc-go:
1. As suggested in the issue, simply add a `go:noinline` pragma to `loopyWriter.processData` to avoid any memory increase.
2. Move allocation of the local frame byte array directly inside `loopyWriter.run`. The downside is that when the connection is idle, the 16KiB cannot be reclaimed.
3. A variation of option 2 where the array allocation happens in a subroutine of `loopyWriter.run`, so that when loopy blocks because the connection is idle, the array is not allocated.
Of these, options 1 and 3 seem the most compelling to me. Would you be willing to accept a patch for this?