Description
Go version
go1.21.5
Output of go env in your module/workspace:
GOARCH=amd64
GOOS=linux
GOAMD64=v1
What did you do?
Built my application using a default.pgo CPU profile from production.
What did you see happen?
Go memory usage (/memory/classes/total:bytes − /memory/classes/heap/released:bytes) increased from 720 MB to 850 MB (18%) until rollback, see below.
This increase in memory usage seems to have been caused by an increase in goroutine stack size (/memory/classes/heap/stacks:bytes) from 207 MB to 280 MB (35%).
This increase was not due to an increase in the number of active goroutines, but to an increase in the average stack size (/memory/classes/heap/stacks:bytes / /sched/goroutines:goroutines).
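For reference, these numbers come from the runtime/metrics package. A minimal sketch of how the quoted metrics can be read and combined (my own illustration, not the monitoring setup used in production):
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},
		{Name: "/memory/classes/heap/released:bytes"},
		{Name: "/memory/classes/heap/stacks:bytes"},
		{Name: "/sched/goroutines:goroutines"},
	}
	metrics.Read(samples)

	total := samples[0].Value.Uint64()
	released := samples[1].Value.Uint64()
	stacks := samples[2].Value.Uint64()
	goroutines := samples[3].Value.Uint64()

	fmt.Printf("go memory usage:          %d bytes (total - released)\n", total-released)
	fmt.Printf("stack memory:             %d bytes\n", stacks)
	fmt.Printf("avg stack per goroutine:  %d bytes\n", stacks/goroutines)
}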
To debug this further, I built a hacky goroutine stack frame profiler. This pointed me to google.golang.org/grpc/internal/transport.(*loopyWriter).run.
For the binary compiled without pgo, my tool estimated 2 MB of stack usage for ~1000 goroutines. For the binary compiled with pgo, it estimated 71 MB of stack usage for ~1000 goroutines.
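The profiler itself is not included here, but as a rough stand-in, the sketch below (an approximation I'm adding for illustration, not the actual tool) counts how many goroutines currently have a given function on their stack; multiplying that count by a frame or stack size estimate gives numbers like the ones above:
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// goroutinesIn counts goroutines whose stack trace mentions fn.
func goroutinesIn(fn string) int {
	buf := make([]byte, 8<<20)
	n := runtime.Stack(buf, true) // true: dump all goroutines
	count := 0
	for _, g := range strings.Split(string(buf[:n]), "\n\n") {
		if strings.Contains(g, fn) {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println(goroutinesIn("transport.(*loopyWriter).run"))
}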
Looking at the assembly, it becomes clear that this is due to the frame size increasing from 0x50 (80) bytes to 0xc1f8 (49656) bytes.
assembly
before pgo:
TEXT google.golang.org/grpc/internal/transport.(*loopyWriter).run(SB) /go/pkg/mod/google.golang.org/grpc@v1.58.2/internal/transport/controlbuf.go
0x8726e0 493b6610 CMPQ SP, 0x10(R14) // cmp 0x10(%r14),%rsp
0x8726e4 0f86ab020000 JBE 0x872995 // jbe 0x872995
0x8726ea 55 PUSHQ BP // push %rbp
0x8726eb 4889e5 MOVQ SP, BP // mov %rsp,%rbp
0x8726ee 4883ec50 SUBQ $0x50, SP // sub $0x50,%rsp
after pgo:
TEXT google.golang.org/grpc/internal/transport.(*loopyWriter).run(SB) /go/pkg/mod/google.golang.org/grpc@v1.58.2/internal/transport/controlbuf.go
0x8889a0 4989e4 MOVQ SP, R12 // mov %rsp,%r12
0x8889a3 4981ec80c10000 SUBQ $0xc180, R12 // sub $0xc180,%r12
0x8889aa 0f82c0300000 JB 0x88ba70 // jb 0x88ba70
0x8889b0 4d3b6610 CMPQ R12, 0x10(R14) // cmp 0x10(%r14),%r12
0x8889b4 0f86b6300000 JBE 0x88ba70 // jbe 0x88ba70
0x8889ba 55 PUSHQ BP // push %rbp
0x8889bb 4889e5 MOVQ SP, BP // mov %rsp,%rbp
0x8889be 4881ecf8c10000 SUBQ $0xc1f8, SP // sub $0xc1f8,%rsp
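For reference, this kind of per-function disassembly can be reproduced with go tool objdump (the binary path below is a placeholder):
go tool objdump -s 'loopyWriter\)\.run' ./mybinary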
And the root cause for this appears to be the inlining of 3 calls to processData, each of which allocates a 16KiB byte array on its stack.
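A standalone illustration of the effect (hypothetical code, not the gRPC implementation): if a callee with a large stack-allocated buffer is inlined at several call sites, each inlined copy can get its own slot in the caller's frame, so the caller's frame may grow by roughly callSites × bufferSize.
package main

// handle is a stand-in for a callee with a 16 KiB stack buffer.
func handle(b []byte) int {
	var buf [16 * 1024]byte
	return copy(buf[:], b)
}

// caller has three call sites; if all three are inlined, its frame has to
// reserve space for three copies of buf (~48 KiB), similar to the 0xc1f8
// frame seen above.
func caller(a, b, c []byte) int {
	return handle(a) + handle(b) + handle(c)
}

func main() {
	println(caller(make([]byte, 1), make([]byte, 2), make([]byte, 3)))
}
Inlining decisions for a build can be inspected with go build -gcflags=-m.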
What did you expect to see?
No significant increase in memory usage.
Maybe PGO could take frame sizes into account for inlining, especially if multiple calls are being made to a function that has a large frame size.
Meanwhile, maybe we should send a PR that adds a //go:noinline pragma to the processData func in gRPC. Given the current code structure, it seems highly undesirable to inline this function up to 3 times in the run method.
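For illustration, the directive would look roughly like this (hypothetical function, not the actual gRPC patch):
package main

// Hypothetical illustration of the directive (not the actual gRPC change):
// placed in the comment group immediately above the function declaration,
// it prevents the compiler from inlining the function, so its 16 KiB buffer
// only occupies stack space while the function is actually executing.
//
//go:noinline
func processData(b []byte) int {
	var buf [16 * 1024]byte
	return copy(buf[:], b)
}

func main() {
	println(processData([]byte("hello")))
}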
cc @prattmic