runtime/metrics: provide histogram of all STW events #63340
CC @golang/runtime
It could be nice to have a full breakdown of each STW type, but I'm not sure the value is worth littering the metric namespace with so many new metrics. I also wonder if we should have a metric for the time it takes to stop the world (i.e., the runtime of the stop operation itself, as opposed to the time spent stopped).

cc @mknyszek
I'd be happy with a pair of histogram metrics (how long to stop the world, and how long it remains stopped). That's usually the first way I'd need to split an investigation; the raw number of STW events (and how it grows over time) should give a good hint at which kinds of STW are prevalent in the app.

To the extent that #45894 is explained by the spinning M mechanism being slow to restart the world, it could be interesting to also have a third "cost of STW" histogram that would shed light on that. Defining it would be tricky, something like "from the start of the call to …". As of go1.21.1 the runtime/metrics package has 18 entries under ….

Execution traces give great detail on STW events, so long as the STW events aren't themselves correlated / anti-correlated with the collection of an execution trace. It's very easy to see (most of) how long the world remains stopped at the beginning of an execution trace. But it's impossible to see how long that STW takes to bring the world to a stop. The mechanism I use for most of my profiling collects goroutine profiles immediately before taking any execution traces, so the work of gathering the goroutine profile (including its STW events) never has a chance to show up in the execution trace. Maybe the flight recording API #63185 can change that, by allowing us to look back in time to see interesting STWs that we otherwise would have missed.
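As background on what consuming these histograms looks like: the runtime/metrics package exposes histogram metrics as Float64Histogram values (a Counts slice plus Buckets boundaries). Below is a minimal sketch, using only the existing /gc/pauses:seconds metric, of how such a histogram might be summarized when splitting an investigation; approxQuantile is a hypothetical helper, and its estimate simply attributes each bucket's count to the bucket's upper bound.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// approxQuantile returns an upper-bound estimate for quantile q (0..1),
// attributing each bucket's count to that bucket's upper bound.
func approxQuantile(h *metrics.Float64Histogram, q float64) float64 {
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	target := uint64(q * float64(total))
	if target == 0 {
		target = 1
	}
	var seen uint64
	for i, c := range h.Counts {
		seen += c
		if seen >= target {
			// Buckets has one more element than Counts; Buckets[i+1] is
			// the upper bound of bucket i (it may be +Inf for the last one).
			return h.Buckets[i+1]
		}
	}
	return h.Buckets[len(h.Buckets)-1]
}

func main() {
	s := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64Histogram {
		fmt.Println("/gc/pauses:seconds not available")
		return
	}
	h := s[0].Value.Float64Histogram()
	fmt.Printf("approx p99 GC pause: %v seconds\n", approxQuantile(h, 0.99))
}
```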
Often enough I find myself wondering whether struggles to stop the world are causing latency issues, so I think they are worth measuring. Also, rather than measuring "GC" and "all" pauses, IMO it would be cleaner to measure "GC" and "everything else" separately, to make the non-GC effects more obvious. Concretely, I propose that we add:

/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds
/sched/pauses/start/gc:seconds
/sched/pauses/start/other:seconds
With this scheme in mind, I have written a prototype CL implementing it for reference, which I will push shortly.
Change https://go.dev/cl/534161 mentions this issue:
FWIW I support this. Non-GC STWs have been a gap in our metrics for a long time. +1 to making the metrics orthogonal (GC and then everything else), and +1 to breaking out "time to stop" (listed as "start") as a metric.

I also think it's fine not to have the same kind of orthogonality for "total" vs. "start" that we have for "gc" vs. "other"; histograms can't be combined in quite the same way. I agree "total" and "start" make more sense, as they're the most individually useful.

Bikeshedding just a little, maybe "start" could be a bit less opaque? Maybe "bring-to-stop"? And on that note, perhaps "total" could be "end-to-end" or something? I don't know; "total" is probably fine. I don't feel strongly about this.
Based on the discussion above, this proposal seems like a likely accept.

Add four new runtime/metrics histograms for tracking stop-the-world pauses:

/sched/pauses/start/gc:seconds
/sched/pauses/start/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

Mark /gc/pauses:seconds as deprecated in favor of /sched/pauses/total/gc:seconds (which is exactly the same).

Perhaps a better name than “start” can be found to indicate the time it takes the world to actually stop (“settle”? “quiesce”? “freeze”?).
s/start/stopping/ |
Maybe |
No change in consensus, so accepted. 🎉

Add four new runtime/metrics histograms for tracking stop-the-world pauses:

/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

Mark /gc/pauses:seconds as deprecated in favor of /sched/pauses/total/gc:seconds (which is exactly the same).
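Once a Go release that includes the implementation (CL 547095 below) is in use, sampling the four accepted metrics looks roughly like the sketch below. It only counts the recorded pause events per metric; the graceful handling of older runtimes (which report KindBad for unknown metric names) is one way a consumer might degrade.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	names := []string{
		"/sched/pauses/stopping/gc:seconds",
		"/sched/pauses/stopping/other:seconds",
		"/sched/pauses/total/gc:seconds",
		"/sched/pauses/total/other:seconds",
	}
	samples := make([]metrics.Sample, len(names))
	for i, n := range names {
		samples[i].Name = n
	}
	metrics.Read(samples)

	for _, s := range samples {
		// Runtimes that predate these metrics report KindBad for the names.
		if s.Value.Kind() != metrics.KindFloat64Histogram {
			fmt.Printf("%s: not supported by this runtime\n", s.Name)
			continue
		}
		h := s.Value.Float64Histogram()
		var events uint64
		for _, c := range h.Counts {
			events += c
		}
		fmt.Printf("%s: %d pause events recorded\n", s.Name, events)
	}
}
```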
Change https://go.dev/cl/547095 mentions this issue: |
For #63340.
For #61422.

Change-Id: Ib74bb54b0450e96b7f4b7eb7ba2ae7ac2d40171a
Reviewed-on: https://go-review.googlesource.com/c/go/+/547095
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
The /gc/pauses:seconds metric gives a count of GC-related stop-the-world events (and the distribution of their durations), but there are several other ways for an application to trigger a STW. The prevalence and duration of those is also worth tracking: sometimes the work done while the world is stopped is expensive (#33250), and sometimes the stopping itself is expensive (#31222).

In the runtime today, I see runtime.GOMAXPROCS, runtime.ReadMemStats, runtime/pprof.Lookup("goroutine").WriteTo, runtime/trace.{Start,Stop} (runtime.{Start,Stop}Trace), runtime.Stack(..., true), runtime/debug.WriteHeapDump, and syscall.AllThreadsSyscall{,6}, plus GC phase transitions, as being able to cause a STW. It's not unusual for latency-sensitive apps to call one or more of those on a regular basis: in my own experience that's usually goroutine profiles, ReadMemStats (via the expvar package), and execution traces.

I propose adding a new histogram metric that reports on the full duration of all STW events, including the time to bring the world to a stop and the time with the world stopped, tentatively named /sched/pauses:seconds. This will make it easier to identify the presence and impact of stop-the-world pauses, even when they're not related to the GC.

Existing:

/gc/pauses:seconds

Proposed addition:

/sched/pauses:seconds
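To make the list of non-GC STW triggers concrete, here is an illustrative sketch (not taken from the issue or any real service) of the kind of periodic collection a latency-sensitive app might already do. The one-minute interval and the logging are assumptions; the point is that both runtime.ReadMemStats and the goroutine profile write involve stop-the-world pauses that the proposed metric would surface.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	ticker := time.NewTicker(time.Minute) // illustrative interval
	defer ticker.Stop()

	for range ticker.C {
		// runtime.ReadMemStats stops the world while the stats are collected.
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("heap in use: %d bytes", ms.HeapInuse)

		// Gathering a goroutine profile also involves stop-the-world pauses.
		if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 1); err != nil {
			log.Printf("goroutine profile: %v", err)
		}
	}
}
```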