proposal: runtime/pprof: add goroutine stack memory usage profile

### Proposal Details

## Summary

I'm proposing to implement a new profile type that allows to break down goroutine stack space usage by frame.

* No new API. Added as a new sample type to the `goroutine` profile
* The value for each stack trace is the sum of space for the leaf frame
* Free stack space is indicated via a virtual `runtime._FreeStack` leaf node
* The grand total should be equal (or close) to `/memory/classes/heap/stacks:bytes`
* A rough prototype CL is available here: https://go-review.googlesource.com/c/go/+/574795 (200 LoC excluding tests)

Given the above, perhaps this is small and simple enough to skip the official proposal process. But since the design includes a few opinionated choices, it's probably best to have some discussions and consensus upfront.

## Motivation

My main motivation for this came from debugging a case of dramatic stack space growth while deploying PGO to production (#65532) which I was able to root cause using a hacky stack space profiler that I implemented in userland ([repo](https://github.com/felixge/go-stack-profiler)).

Additionally I imagine this profile type will be useful for other scenarios, e.g. high cpu usage in `morestack` (#18138).

## Implementation

I'm proposing to implement this profile type by taking each stack trace in the goroutine profile and looking up it's frame size (❶ shows this for a single stack trace). Then each stack trace is broken into one stack trace per prefix (from the root), and these stack traces are assigned the frame size of their leaf frames as values (❷). This will produce a flame graph where the "self value" of each frame corresponds to its frame size, and its total value corresponds to its frame size plus the frame sizes of its children (❸).

![image](https://github.com/golang/go/assets/15000/d03fd262-809a-4d03-878f-8b0ae9ffb93a)

These values are then multiplied by the number of goroutines that were captured for the given stack trace, resulting in the sum of stack space usage.

Last but not least, a `runtime._FreeStack` leaf node is added to capture the delta between the stack space used by frames, and the total size of the stack allocated for the goroutine. Additionally a root-level `runtime._FreeStack` is used to show the amount of memory reserved for goroutine stacks that is currently not in use. These virtual frames are motivated by producing a profile that adds up to `/memory/classes/heap/stacks:bytes` as well as giving the user the ability to reason about potential `morestack` issues.

## Prototype

I have uploaded a rough CL for a prototype here: https://go-review.googlesource.com/c/go/+/574795 (200 LoC excluding tests).

Using this prototype we can look at a real stack profile for a program with the following goroutines:

* 1 goroutine with a oneThousand byte frame
* 1 goroutine with a twoThousand byte frame 
* 2 goroutines with a threeThousand byte frame

<details>
<summary>Code Snippet</summary>

```go
func launchGoroutinesWithKnownStacks() func() {
	c1 := make(chan struct{})
	c2 := make(chan struct{})
	c3 := make(chan struct{})
	c4 := make(chan struct{})

	go oneThousand(c1)
	go twoThousand(c2)
	go threeThousand(c3)
	go threeThousand(c4)
	<-c1
	<-c2
	<-c3
	<-c4
	// hacky way to ensure all goroutines reach the same <-ch statement
	// TODO(fg) make caller retry in the rare case this could go wrong
	time.Sleep(10 * time.Millisecond)
	return func() {
		c1 <- struct{}{}
		c2 <- struct{}{}
		c3 <- struct{}{}
		c4 <- struct{}{}
	}
}

//go:noinline
func oneThousand(ch chan struct{}) [1000]byte {
	var a [1000]byte
	ch <- struct{}{}
	<-ch
	return a
}

//go:noinline
func twoThousand(ch chan struct{}) [2000]byte {
	var a [2000]byte
	ch <- struct{}{}
	<-ch
	return a
}

//go:noinline
func threeThousand(ch chan struct{}) [3000]byte {
	var a [3000]byte
	ch <- struct{}{}
	<-ch
	return a
}
```

</details>

![2024-03-27 pprof test stack_space at 20 27 34@2x](https://github.com/golang/go/assets/15000/98eae095-0048-45db-a9f8-4ab8628cf2df)

Note: The prototype doesn't implement the proposed root-level `runtime._FreeStack` frame yet.

## Performance

I not measure this yet, but I suspect all of this can be done with negligible impact on the overhead of the goroutine profile.

## Next Steps

Please let me know what you think. cc @prattmic @mknyszek @nsrip-dd @rhysh (this was previously discussed in a recent runtime diagnostics sync, see [notes](https://github.com/golang/go/issues/57175#issuecomment-1986600809)).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

proposal: runtime/pprof: add goroutine stack memory usage profile #66566

Proposal Details

Summary

Motivation

Implementation

Prototype

Performance

Next Steps

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

proposal: runtime/pprof: add goroutine stack memory usage profile #66566

Description

Proposal Details

Summary

Motivation

Implementation

Prototype

Performance

Next Steps

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions