-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/trace: execution tracer overhaul #60773
Comments
Change https://go.dev/cl/503038 mentions this issue: |
Change https://go.dev/cl/494187 mentions this issue: |
Change https://go.dev/cl/515635 mentions this issue: |
Change https://go.dev/cl/538515 mentions this issue: |
Change https://go.dev/cl/540256 mentions this issue: |
Change https://go.dev/cl/540778 mentions this issue: |
Currently the execution tracer synchronizes with itself using very heavyweight operations. As a result, it's totally fine for most of the tracer code to look like: if traceEnabled() { traceXXX(...) } However, if we want to make that synchronization more lightweight (as issue #60773 proposes), then this is insufficient. In particular, we need to make sure the tracer can't observe an inconsistency between g atomicstatus and the event that would be emitted for a particular g transition. This means making the g status change appear to happen atomically with the corresponding trace event being written out from the perspective of the tracer. This requires a change in API to something more like a lock. While we're here, we might as well make sure that trace events can *only* be emitted while this lock is held. This change introduces such an API: traceAcquire, which returns a value that can emit events, and traceRelease, which requires the value that was returned by traceAcquire. In practice, this won't be a real lock, it'll be more like a seqlock. For the current tracer, this API is completely overkill and the value returned by traceAcquire basically just checks trace.enabled. But it's necessary for the tracer described in #60773 and we can implement that more cleanly if we do this refactoring now instead of later. For #60773. Change-Id: Ibb9ff5958376339fafc2b5180aef65cf2ba18646 Reviewed-on: https://go-review.googlesource.com/c/go/+/515635 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This change mostly implements the design described in #60773 and includes a new scalable parser for the new trace format, available in internal/trace/v2. I'll leave this commit message short because this is clearly an enormous CL with a lot of detail. This change does not hook up the new tracer into cmd/trace yet. A follow-up CL will handle that. For #60773. Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-linux-amd64-longtest-race Change-Id: I5d2aca2cc07580ed3c76a9813ac48ec96b157de0 Reviewed-on: https://go-review.googlesource.com/c/go/+/494187 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
This change adds a new MutatorUtilization for traces for Go 1.22+. To facilitate testing, it also generates a short trace with the gc-stress.go test program (shortening its duration to 10ms) and adds it to the tests for the internal/trace/v2 package. Notably, we make sure this trace has a GCMarkAssistActive event to test that codepath. For #63960. For #60773. Change-Id: I2e61f545988677be716818e2a08641c54c4c201f Reviewed-on: https://go-review.googlesource.com/c/go/+/540256 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/541695 mentions this issue: |
Change https://go.dev/cl/541696 mentions this issue: |
Change https://go.dev/cl/541257 mentions this issue: |
Change https://go.dev/cl/541259 mentions this issue: |
Change https://go.dev/cl/541258 mentions this issue: |
Change https://go.dev/cl/541999 mentions this issue: |
The last change made the MMU rendering code common and introduced a new API, but it was kind of messy. Part of the problem was that some of the Javascript in the template for the main page referred to specific endpoints on the server. Fix this by having the Javascript access the same endpoint but with a different query variable. Now the Javascript code doesn't depend on specific endpoints, just on query variables for the current endpoint. For #60773. For #63960. Change-Id: I1c559d9859c3a0d62e2094c9d4ab117890b63b31 Reviewed-on: https://go-review.googlesource.com/c/go/+/541259 Auto-Submit: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
For #60773. For #63960. Change-Id: Id97380f19267ec765b25a703ea3e2f284396ad75 Reviewed-on: https://go-review.googlesource.com/c/go/+/541998 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This change adds support for the pprof endpoints to cmd/trace/v2. In the process, I realized we need to pass the goroutine summaries to more places, and previous CLs had already done the goroutine analysis during cmd/trace startup. This change thus refactors the goroutine analysis API once again to operate in a streaming manner, and to run at the same time as the initial trace parsing. Now we can include it in the parsedTrace type and pass that around as the de-facto global trace context. Note: for simplicity, this change redefines "syscall" profiles to capture *all* syscalls, not just syscalls that block. IIUC, this choice was partly the result of a limitation in the previous trace format that syscalls don't all have complete durations and many short syscalls are treated as instant. To this end, this change modifies the text on the main trace webpage to reflect this change. For #60773. For #63960. Change-Id: I601d9250ab0849a0bfaef233fd9b1e81aca9a22a Reviewed-on: https://go-review.googlesource.com/c/go/+/541999 Auto-Submit: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
The v2 trace parser currently handles task inheritance and region task association incorrectly. It assumes that a TaskID of 0 means that there is no task. However, this is only true for task events. A TaskID of 0 means that a region gets assigned to the "background task." The parser currently has no concept of a "background task." Fix this by defining the background task as task ID 0 and redefining NoTask to ^uint64(0). This aligns the TaskID values more closely with other IDs in the parser and also enables disambiguating these two cases. For #60773. For #63960. Change-Id: I09c8217b33b87c8f8f8ea3b0203ed83fd3b61e11 Reviewed-on: https://go-review.googlesource.com/c/go/+/543019 Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> TryBot-Bypass: Michael Knyszek <mknyszek@google.com>
For v1 traces, cmd/trace contains code for analyzing tasks separately from the goroutine analysis code present in internal/trace. As I started to look into porting that code to v2 traces, I noticed that it wouldn't be too hard to just generalize the existing v2 goroutine summary code to generate exactly the same information. This change does exactly that. For #60773. For #63960. Change-Id: I0cdd9bf9ba11fb292a9ffc37dbf18c2a6a2483b8 Reviewed-on: https://go-review.googlesource.com/c/go/+/542076 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This code will be useful for the new tracer, and there's no need to duplicate it. This change copies it to internal/trace/traceviewer, adds some comments, and renames it to TimeHistogram. While we're here, let's get rid of the unused String method which has a comment talking about how awful the rendering is. Also, let's get rid of uses of niceDuration. We'd have to bring it with us in the move and I don't think it's worth it. The difference between the default time.Duration rendering and the niceDuration rendering is usually a few extra digits of precision. Yes, it's noisier, but AFAICT it's not substantially worse. It doesn't seem worth the new API, even if it's just internal. We can also always bring it back later. For #60773. For #63960. Change-Id: I795f58f579f1d503c540c3a40bed12e52bce38e2 Reviewed-on: https://go-review.googlesource.com/c/go/+/542001 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This change fills out the last of cmd/trace's subpages for v2 traces by adding support for task and region endpoints. For #60773. For #63960. Change-Id: Ifc4c660514b3904788785a1b20e3abc3bb9e55f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/542077 Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
This change adds support for the trace?goid=<goid> endpoint to the trace tool for v2 traces. In effect, this change actually implements a per-goroutine view. I tried to add a link to the main page to enable a "view by goroutines" view without filtering, but the web trace viewer broke the browser tab when there were a few hundred goroutines. The risk of a browser hang probably isn't worth the cases where this is nice, especially since filtering by goroutine already works. Unfortunate, but c'est l'vie. Might be worth revisiting if we change out the web viewer in the future. For #60773. For #63960. Change-Id: I8e29f4ab8346af6708fd8824505c30f2c43db796 Reviewed-on: https://go-review.googlesource.com/c/go/+/543595 TryBot-Bypass: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Knyszek <mknyszek@google.com>
This change implements support for the trace?focustask=<taskid> endpoint in the trace tool for v2 traces. Note: the one missing feature in v2 vs. v1 is that the "irrelevant" (but still rendered) events are not grayed out. This basically includes events that overlapped with events that overlapped with other events that were in the task time period, but aren't themselves directly associated. This is probably fine -- the UI already puts a very obvious focus on the period of time the selected task was running. For #60773. For #63960. Change-Id: I5c78a220ae816e331b74cb67c01c5cd98be40dd4 Reviewed-on: https://go-review.googlesource.com/c/go/+/543596 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This is a nice-to-have that's now straightforward to do with the new trace format. This change adds a new query variable passed to the /trace endpoint called "view," which indicates the type of view to use. It is orthogonal with task-related views. Unfortunately a goroutine-based view isn't included because it's too likely to cause the browser tab to crash. For #60773. For #63960. Change-Id: Ifbcb8f2d58ffd425819bdb09c586819cb786478d Reviewed-on: https://go-review.googlesource.com/c/go/+/543695 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Knyszek <mknyszek@google.com>
This change adds support for a goroutine-oriented task view via the /trace?taskid=<taskid> endpoint. This works but it's missing regions. That will be implemented in a follow-up CL. For #60773. For #63960. Change-Id: I086694143e5c71596ac22fc551416868f0b80923 Reviewed-on: https://go-review.googlesource.com/c/go/+/543937 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This change emits regions in the goroutine-oriented task view (the /trace endpoint with the taskid query variable set) in the same way the old cmd/trace does. For #60773. Fixes #63960. Change-Id: If6c3e7072c694c84a7d2d6c34df668f48d3acc2a Reviewed-on: https://go-review.googlesource.com/c/go/+/543995 Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Change https://go.dev/cl/546026 mentions this issue: |
For #60773. For #62627. For #63960. For #61422. Change-Id: I3c933f7522f65cd36d11d38a268556d92c8053f9 Reviewed-on: https://go-review.googlesource.com/c/go/+/546026 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Knyszek <mknyszek@google.com>
@mknyszek the Future work section of the proposal mentions the following:
Does the timestamp delta have 20 bits or 24? At 20 bits and 64 ns resolution, it'd only be 67 ms, not 1s. Furthermore, I'm not convinced that 64 ns granularity is sufficient. In a trace produced by |
@dominikh Unfortunately that document is out of date; I plan to rewrite a good chunk of it in the new year. There are things I didn't get to, and things that had to be implemented differently. (The core goals and ideas still apply, but the implementation was different, generally to avoid unforeseen limitations and/or bugs in the proposed changes.) Timestamps are encoded the same way they were before, but the new timestamp resolution is indeed 64 ns. The parser will avoid equal timestamps by incrementing by 1 ns; it will do the same thing if timestamps are out-of-order, which is why you see these odd-looking deltas. Increasing the resolution is easy, it's just one constant and changing it is backwards compatible. If you're finding that 64 ns is not good enough, we can change it. |
For golang#60773. For golang#62627. For golang#63960. For golang#61422. Change-Id: I3c933f7522f65cd36d11d38a268556d92c8053f9 Reviewed-on: https://go-review.googlesource.com/c/go/+/546026 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Knyszek <mknyszek@google.com>
For golang/go#60773. Change-Id: I0f35c07941b5ebafc4e57db037a4a92092d6c7b0 Reviewed-on: https://go-review.googlesource.com/c/proposal/+/503038 Reviewed-by: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/576256 mentions this issue: |
Change https://go.dev/cl/576257 mentions this issue: |
Change https://go.dev/cl/576258 mentions this issue: |
This change makes the new execution tracer described in #60773, the default tracer. This change attempts to make the smallest amount of changes for a single CL. Updates #66703 For #60773 Change-Id: I3742f3419c54f07d7c020ae5e1c18d29d8bcae6d Reviewed-on: https://go-review.googlesource.com/c/go/+/576256 Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
This change renames the v2 execution tracer files created as part of Updates #66703 For #60773 Change-Id: I91bfdc08fec4ec68ff3a6e8b5c86f6f8bcae6e6d Reviewed-on: https://go-review.googlesource.com/c/go/+/576257 Auto-Submit: Carlos Amedee <carlos@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com>
This change removes the ExecTracer2 experiment flag. This flag was created as part of the tracer overhaul described in #60773. Updates #66703 For #60773 Change-Id: Ib95b9dd8bb68b48e7543c4d28d853c6a518438f6 Reviewed-on: https://go-review.googlesource.com/c/go/+/576258 Auto-Submit: Carlos Amedee <carlos@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com>
Execution tracer overhaul
Authored by mknyszek@google.com with a mountain of input from others.
In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan Boreham for your invaluable feedback.
Background
Original design document.
Go execution traces provide a moment-to-moment view of what happens in a Go program over some duration. This information is invaluable in understanding program behavior over time and can be leveraged to achieve significant performance improvements. Because Go has a runtime, it can provide deep information about program execution without any external dependencies, making traces particularly attractive for large deployments.
Unfortunately limitations in the trace implementation prevent widespread use.
For example, the process of analyzing execution traces scales poorly with the size of the trace. Traces need to be parsed in their entirety to do anything useful with them, making them impossible to stream. As a result, trace parsing and validation has very high memory requirements for large traces.
Also, Go execution traces are designed to be internally consistent, but don't provide any way to align with other kinds of traces, for example OpenTelemetry traces and Linux sched traces. Alignment with higher level tracing mechanisms is critical to connecting business-level tasks with resource costs. Meanwhile alignment with lower level traces enables a fully vertical view of application performance to root out the most difficult and subtle issues.
Lastly, the implementation of the execution tracer has evolved organically over time and it shows. The codebase also has many old warts and some age-old bugs that make collecting traces difficult, and seem broken. Furthermore, many significant decision decisions were made over the years but weren't thoroughly documented; those decisions largely exist solely in old commit messages and breadcrumbs left in comments within the codebase itself.
Thanks to work in Go 1.21 cycle, the execution tracer's run-time overhead was reduced from about -10% throughput and +10% request latency in web services to about 1% in both for most applications. This reduced overhead in conjunction with making traces more scalable enables some exciting and powerful new opportunities for traces.
Goals
The goal of this document is to define an alternative implementation for Go execution traces that scales up to large Go deployments.
Specifically, the design presented aims to achieve:
This document also describes the existing state of the tracer in detail and explains how we got there.
Design
Link to design document.
CC @felixge @nsrip-dd @prattmic @aclements @rhysh @dominikh @bboreham @thepudds
The text was updated successfully, but these errors were encountered: