proposal: runtime/pprof: add goroutine stack memory usage profile #66566
Comments
I think the overall idea, thanks for filing the proposal. One minor "knee-jerk" reaction I have from looking at the example image is that the display of the That said, I can't immediately think of how to display it better. The fundamental problem here seems to be the aggregation by common stack frames complicating things. Ways I can think of to "fix" that would be:
Thanks for the feedback @prattmic. I've also added this as a potential agenda item for tomorrow's runtime diagnostics sync as it might be easier to hash out some of these ideas in real time.
Yeah, I agree. It is confusing. (But I disagree about this being misleading about reality. The free stack memory truly sits below
Good ideas. What about a variant of version 1 where the What do you think?
Hm, I do like that better, but the ordering still feels slightly inverted to me (it seems odd that the free stack is the second frame, when the first frame is using real stack space). But this gives me another idea: What if we invert the idea of "free stack space" and instead display total allocated stack size? Concretely, each stack ends at a "[total]" frame, whose value is the total allocated stack size. In a flame graph, you'd end up with something that looks like this (pardon the ASCII art):
The empty space in the bottom left is the total free stack space in the program. This sounds a bit like my suggestion (2) above because we can't see the free space per goroutine type, but I'm not certain we have to lose this information. In theory, running with I'm not 100% sure this would work out of the box though given the slightly strange way we are exploding samples. We may need to use unique fake PCs per goroutine ( |
I think if I saw a profile like that, I'd soon want to travel back in time to see which call stack the goroutine had when the runtime determined the stack needed to be grown. But it looks like the current proposal is to tie information to the goroutine's current stack. I'd expect a CPU profile looking at To be concrete, consider an HTTP server where: each request is handled quickly, some requests result in deep call stacks, and all arrive on often-idle http/1.1 keep-alive connections. It sounds like the current proposal would show a lot of goroutines with
@prattmic I like this idea, but I don't think The problem is that the "empty space" below a node is created by a stack trace that only goes up to that node. In my example from above, the "empty space" below So given the definition of focus below, I think
Looking at the other pprof options neither Anyway, I also realized a few other areas where stack memory can hide:
Tracking this as Of course we could also consider just not dealing with any of this, and only showing stack memory usage by the frames that are currently on the stack. But this would leave an annoyingly large gap with
@rhysh I see your point. I'm implicitly biased towards continuous profiling where I'm assuming the user has access to many stack profiles and the ability to aggregate them. In practice this should allow the user to see stack profiles that include the frames below Do you think that's good enough? I mean I could also see the value in a dedicated |
The CPU profile can show what triggers stack growth in the same way that the I've got three kinds of memory profiles in mind:
For heap memory, we get the first from the For stack memory, we're able to skip right to the third style (it's not a graph, so calculating the dominators is easy). And a CPU profile focused on |
Proposal Details
Summary
I'm proposing to implement a new profile type that makes it possible to break down goroutine stack space usage by frame.
goroutine
profileruntime._FreeStack
leaf node/memory/classes/heap/stacks:bytes
Given the above, perhaps this is small and simple enough to skip the official proposal process. But since the design includes a few opinionated choices, it's probably best to have some discussions and consensus upfront.
Motivation
My main motivation for this came from debugging a case of dramatic stack space growth while deploying PGO to production (#65532) which I was able to root cause using a hacky stack space profiler that I implemented in userland (repo).
Additionally, I imagine this profile type will be useful for other scenarios, e.g. high CPU usage in morestack (#18138).
Implementation
I'm proposing to implement this profile type by taking each stack trace in the goroutine profile and looking up its frame sizes (❶ shows this for a single stack trace). Then each stack trace is broken into one stack trace per prefix (from the root), and these stack traces are assigned the frame size of their leaf frames as values (❷). This will produce a flame graph where the "self value" of each frame corresponds to its frame size, and its total value corresponds to its frame size plus the frame sizes of its children (❸).
These values are then multiplied by the number of goroutines that were captured for the given stack trace, resulting in the sum of stack space usage.
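To make the two steps above concrete, here is a minimal sketch of the prefix explosion and the multiplication by goroutine count. It is not taken from the prototype CL: the `frame` and `sample` types, the `explode` and `total` helpers, and the frame sizes in `main` are all hypothetical and only illustrate the arithmetic.

```go
package main

import "fmt"

// frame is a hypothetical stand-in for one function on a goroutine stack.
type frame struct {
	fn   string
	size int64 // frame size in bytes (made-up values in this sketch)
}

// sample is one profile entry: a stack (root first) and a value in bytes.
type sample struct {
	stack []string
	bytes int64
}

// explode emits one sample per prefix of the stack, starting at the root.
// Each sample's value is the frame size of the prefix's leaf frame,
// multiplied by the number of goroutines that share this stack trace.
func explode(stack []frame, goroutines int64) []sample {
	out := make([]sample, 0, len(stack))
	for i := range stack {
		prefix := make([]string, i+1)
		for j := 0; j <= i; j++ {
			prefix[j] = stack[j].fn
		}
		out = append(out, sample{stack: prefix, bytes: stack[i].size * goroutines})
	}
	return out
}

// total sums all sample values, i.e. the stack space used by frames.
func total(samples []sample) int64 {
	var sum int64
	for _, s := range samples {
		sum += s.bytes
	}
	return sum
}

func main() {
	// Three goroutines parked on the same (hypothetical) stack.
	stack := []frame{{"main.main", 64}, {"main.serve", 128}, {"main.handle", 256}}
	samples := explode(stack, 3)
	for _, s := range samples {
		fmt.Println(s.stack, s.bytes)
	}
	fmt.Println("total bytes:", total(samples))
}
```

Because every prefix carries only its leaf frame's size, a flame graph built from these samples shows each frame's own size as its self value, and the sum over a subtree as its total value.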
Last but not least, a runtime._FreeStack leaf node is added to capture the delta between the stack space used by frames and the total size of the stack allocated for the goroutine. Additionally, a root-level runtime._FreeStack is used to show the amount of memory reserved for goroutine stacks that is currently not in use. These virtual frames are motivated by producing a profile that adds up to /memory/classes/heap/stacks:bytes as well as giving the user the ability to reason about potential morestack issues.
Prototype
I have uploaded a rough CL for a prototype here: https://go-review.googlesource.com/c/go/+/574795 (200 LoC excluding tests).
Using this prototype we can look at a real stack profile for a program with the following goroutines:
Code Snippet
Note: The prototype doesn't implement the proposed root-level runtime._FreeStack frame yet.
Performance
I have not measured this yet, but I suspect all of this can be done with negligible impact on the overhead of the goroutine profile.
Next Steps
Please let me know what you think. cc @prattmic @mknyszek @nsrip-dd @rhysh (this was previously discussed in a recent runtime diagnostics sync, see notes).