
*: conclusively attribute CPU usage to SQL queries and sessions #60508


Closed
tbg wants to merge 5 commits

Conversation

tbg
Member

@tbg commented Feb 11, 2021

This prototype demonstrates how, with a modest technical investment, we can
make CPU utilization observable at a granular level (SQL sessions, statements,
etc.) using goroutine labels and the Go CPU profiler.

Before we dive into what the prototype does, here's the status quo: we already support profiler labels to some degree, though in an unpolished form that is hard to capitalize on. Concretely, we set profiler labels here:

```go
if ex.server.cfg.Settings.CPUProfileType() == cluster.CPUProfileWithLabels {
	remoteAddr := "internal"
	if rAddr := ex.sessionData.RemoteAddr; rAddr != nil {
		remoteAddr = rAddr.String()
	}
	labels := pprof.Labels(
		"appname", ex.sessionData.ApplicationName,
		"addr", remoteAddr,
		"stmt.tag", ast.StatementTag(),
		"stmt.anonymized", anonymizeStmt(ast),
	)
	pprof.Do(ctx, labels, func(ctx context.Context) {
		ev, payload, err = ex.execStmtInOpenState(ctx, parserStmt, prepared, pinfo, res)
	})
} else {
	ev, payload, err = ex.execStmtInOpenState(ctx, parserStmt, prepared, pinfo, res)
}
```

This snippet also shows that we have infrastructure that tells us when to
enable goroutine labels, so that we only do so if the process is actively being
profiled (with labels requested). This is relevant because goroutine labels are
far from free. The implementation in the runtime is not very allocation-efficient:
you need to wrap a map in a Context, and the runtime makes an additional copy of
this map internally. The internal copy cannot be avoided, but many of the
surrounding allocations could be, with appropriate pooling and perhaps a custom
implementation of context.Context used for the sole purpose of passing the map
into the runtime. Either way, the takeaway is that getting the overhead down will
be important if we are to rely on this mechanism more; it would be a shame to
make the system a lot slower while trying to find out why it is so slow in the
first place.
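To put a rough number on that overhead, a micro-benchmark along these lines (illustrative only, not part of this PR; the package and label values are made up) would measure the allocations incurred by attaching labels around a unit of work:

```go
package labelcost_test

import (
	"context"
	"runtime/pprof"
	"testing"
)

// BenchmarkPprofDo measures the per-call allocation cost of attaching
// profiler labels via pprof.Do, i.e. the overhead discussed above.
func BenchmarkPprofDo(b *testing.B) {
	ctx := context.Background()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		labels := pprof.Labels(
			"appname", "bench",
			"stmt.tag", "SELECT",
		)
		pprof.Do(ctx, labels, func(ctx context.Context) {
			// The statement execution being attributed would go here.
		})
	}
}
```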
To round out the status quo: debug.zip already collects label-enabled profiles
from all nodes in the cluster, and we ship a script that shows a breakdown by
tags. As far as I know, these have never really been used.

The goal:

Given a cluster in which (one or many) nodes experience high CPU usage, a TSE/SRE
can easily identify

  1. whether the high CPU usage is primarily driven by query load, and if so,
  2. which user/session/query is responsible for that load, so that they can
  3. alert the user and/or throttle or blocklist it (not included in the prototype)

Stretch goals (not included, but worth thinking about):

  • reduce the overhead of the proposed mechanism so that we can periodically enable
    it (at low profiling frequency), and
  • also offer a CPU breakdown per query/txn. If we additionally persisted historical
    statement/txn stats (maybe just storing them before the hourly reset), this would
    be pretty powerful.

Concretely, this prototype focuses only on 1) and 2), with no attempt to
optimize performance. It does the following:

  • Try to ensure we always attach the proper labels to everything. In the process of that:
  • Strip out the "only if requested" logic, to simplify things
  • Add a dummy marker label in cli/start, which lets us look at a profile and check whether
    any work is attributed only to the dummy marker; that would indicate that we were not
    labelling properly everywhere.
  • Hoist the tracing in connExecutor.execStmt into the connExecutor main loop. I did this
    because I realized that significant chunks of CPU time are spent outside of execStmt,
    and most of the work in the prototype was to arrive at a system that captured enough of
    the work so that we could comfortably "deduce" that a query is responsible for overload.
  • Write gRPC interceptors that propagate labels across RPC boundaries (a sketch follows
    this list). This is mostly for distributed SQL flows: if the profile stopped at the
    gateway but the CPU was spent on another node, that time wouldn't be associated with
    the query. Note that all nodes need to actively profile (or at least attach goroutine
    labels) at around the same time, or the links will be lost.
  • Disable the "local node" fastpath for KV RPCs because it's ad-hoc and bypasses the
    interceptors; it would need a more holistic solution before shipping.
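To give an idea of what such interceptors could look like, here is a minimal sketch with hypothetical names and a made-up metadata key convention; a production version would restrict which labels cross the wire, add stream variants, and come with tests:

```go
package labelrpc

import (
	"context"
	"runtime/pprof"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// mdPrefix namespaces profiler labels inside gRPC metadata (a made-up convention).
const mdPrefix = "pprof-label-"

// UnaryClientInterceptor copies the caller's profiler labels from the context
// into outgoing gRPC metadata so the server side can re-apply them.
func UnaryClientInterceptor(
	ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
) error {
	var kvs []string
	pprof.ForLabels(ctx, func(k, v string) bool {
		kvs = append(kvs, mdPrefix+k, v)
		return true // keep iterating
	})
	if len(kvs) > 0 {
		ctx = metadata.AppendToOutgoingContext(ctx, kvs...)
	}
	return invoker(ctx, method, req, reply, cc, opts...)
}

// UnaryServerInterceptor re-applies profiler labels found in incoming metadata
// for the duration of the handler, so that CPU samples taken on this node are
// attributed to the originating session/statement as well.
func UnaryServerInterceptor(
	ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return handler(ctx, req)
	}
	var kvs []string
	for k, vs := range md {
		if strings.HasPrefix(k, mdPrefix) && len(vs) > 0 {
			kvs = append(kvs, strings.TrimPrefix(k, mdPrefix), vs[0])
		}
	}
	if len(kvs) == 0 {
		return handler(ctx, req)
	}
	var resp interface{}
	var err error
	pprof.Do(ctx, pprof.Labels(kvs...), func(ctx context.Context) {
		resp, err = handler(ctx, req)
	})
	return resp, err
}
```

Something like this would be wired up via grpc.WithUnaryInterceptor on the client side and grpc.UnaryInterceptor on the server side.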

Learnings from the prototype:

  • Go intentionally made the labels API such that you can't derive goroutine-local storage from it. On top of the allocation inefficiency, we also need to keep the context.Context and the actual labels in sync manually at all times (the gRPC interceptors can only get the labels from the context; they can't read them off the goroutine). Labels automatically (via the runtime) get propagated to child goroutines, but we often don't do the same thing for the Context, because doing so would also propagate cancellation. The prime example of this is stopper.RunAsyncTask, where we often derive the task context from context.Background. We need to change the Stopper to automatically put the labels of the parent goroutine into the children it starts, which is not too difficult. This project would also lend further weight to banning the go keyword; the Stopper should be our authority here.
  • In a fully loaded system, the contributions from "our" goroutines usually aren't 100%. There is one clear outlier which we can fix: pebble. It has its own goroutine label, which is fine, but we should massage it so that it fits in with whatever labelling categories we want to productionize here (if any). Since pebble is third party, we'll likely just want it to give us a hook to determine the labels rather than hard-coding its own. But even pebble aside, runtime-internal goroutines tend to eat a good chunk of CPU when busy, for example for GC scans. This raises an interesting question about 1) at the top of this message: given the CPU time associated with our labels, how can we determine whether it's enough to claim that it's overloading the machine? Clearly, if a 30s sample shows only 1s of total CPU, the system is not busy. A 30s sample on a 4-vcpu machine could top out at around 120s of total CPU time. If we account for 60s of that, the system seems very busy, but we only account for 50%; what does that mean? There will need to be some guidance around this to help SRE/TSE make judgment calls. I think it will boil down to: if the total samples account for close to num_vcpus * profile_duration, then the system is pretty much overloaded (they should also see that from the runtime metrics). Since we promise that we are putting labels on all moving parts, we can assume that the seconds accrued by those labels are "100% of the productive work" and that this is the whole category to look at. But this needs to be verified experimentally. It is possible for a SQL query to be low-CPU but extremely alloc-heavy, so that it drives the garbage collector nuts.
  • The goroutine profile also uses labels, so it can be used to check whether we're labelling all of the goroutines, simply by restricting to goroutines which don't have a label that we expect to be present on all of them.
  • It's possible to grab the profile in-process and then evaluate it, so it shouldn't be too hard to have a button in the UI that populates the stmt table with a CPU percentage breakdown.
  • We need a bit of convention around these labels, i.e. we should wrap the labels API in our own, more strongly typed helper (see the sketch after this list).
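As a sketch of what such a wrapper could look like (all names here are hypothetical, not existing CockroachDB APIs), including the label-copying helper the Stopper would need for tasks whose context is derived from context.Background:

```go
package proflabels

import (
	"context"
	"runtime/pprof"
)

// StmtLabels is a strongly typed version of the ad-hoc label set used in the
// prototype; the field set is illustrative.
type StmtLabels struct {
	AppName    string
	RemoteAddr string
	StmtTag    string
	StmtAnon   string
}

// Do runs f with the given labels attached to both the context and the
// current goroutine, so CPU samples taken while f runs are attributed to them.
func Do(ctx context.Context, l StmtLabels, f func(context.Context)) {
	pprof.Do(ctx, pprof.Labels(
		"appname", l.AppName,
		"addr", l.RemoteAddr,
		"stmt.tag", l.StmtTag,
		"stmt.anonymized", l.StmtAnon,
	), f)
}

// CopyTo copies any profiler labels stored in `from` into `to`. A Stopper-style
// task runner would use this when it derives a task context from
// context.Background, so that the child's context carries the parent's labels.
// (Labels on the goroutine itself are inherited automatically by child
// goroutines; the context copy is what the gRPC interceptors rely on.)
func CopyTo(from, to context.Context) context.Context {
	var kvs []string
	pprof.ForLabels(from, func(k, v string) bool {
		kvs = append(kvs, k, v)
		return true
	})
	if len(kvs) == 0 {
		return to
	}
	return pprof.WithLabels(to, pprof.Labels(kvs...))
}
```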

@cockroach-teamcity
Member

This change is Reviewable

@awoods187
Contributor

awoods187 commented Feb 11, 2021 via email

tbg added a commit to tbg/cockroach that referenced this pull request Feb 19, 2021
Background profiling is an experimental patch that was floated as a way
to get granular CPU usage information in
golang/go#23458.

I decided to dust off the Golang patch that implements this
(https://go-review.googlesource.com/c/go/+/102755) and see for myself
what you actually get. The output is something like this:

```
bgprof_test.go:51: ts: 12h37m15.540022251s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.54801909s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.560015269s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.568026639s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.580029608s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
...
```

If we used profiler labels to identify queries (similar to what's shown
in cockroachdb#60508), I can see how you could somehow build a profile
from this stream and then use the tag breakdown of the profile to reason
about the CPU allocated to each query. It seems wildly experimental and
also unergonomic, though.

It seems like we'd get roughly the same thing, without as much bending
over backwards and wildly experimental bleeding edge, by relying on
periodic foreground sampling at a lower frequency
(`runtime.SetCPUProfileRate`) than the default 100Hz.

To run this PR (which you shouldn't need to do for anything, but still)
you'd clone `cockroachdb/go`, check out the [bgprof] branch, run
`./make.bash` in `src` and change your PATH so that `$(which go)` is
`bin/go`.

Release note: None
tbg and others added 4 commits February 22, 2021 12:49
When running experiment.sh with three nodes, there's always some
missing-client and missing-server; I think this must be some DistSQL
behavior breaking the context (since I do not see this issue with a
single node and have disabled the local internal client optimization).

TODO: return an error from node.batch in that case to see which sql
ops have it bubble up.

Release note: None
@tbg force-pushed the pprof-experiment branch from c477641 to 07d5a51 on February 22, 2021 11:49
@tbg force-pushed the pprof-experiment branch from 86bf7aa to 651b855 on February 22, 2021 12:34
@tbg
Member Author

tbg commented Feb 22, 2021

What needs to be done to productionize the prototype:

  • An owner needs to be found to take this on for 21.2. An RFC seems in order, or at least coordination with rfcs: Task groups in Go: per-session resource usage tracking #60589.
  • Determine the basic perf overhead of profiling without labels. (Recall that the overhead while not profiling can be zero if we so desire, even with labels).
  • Standardize on label fields, wrap them in a library, and set up the stopper to propagate them explicitly via contexts.
  • Productionize the gRPC interceptors (mostly, use the library and write tests)
  • Make the node local internal client optimization compliant with interceptors (independently valuable)
  • Introduce tooling/paradigms that ensure that all sources of work are labelled. Not sure exactly how! But should stay away from ad-hoc labels; this will rot as folks write new code.
    • assertions in test builds (all server endpoints refuse incoming requests w/o labels? Or maybe that's too restrictive?)
  • Figure out the UX. Various levels:
    • level 0: don't do anything, so it's in the profiles but that's it. Write documentation for our TSE/SRE on how to use them.
    • level 1: a manually triggered CPU profile is taken internally and a CPU breakdown per query is populated in stmt_statistics (or something similar); a sketch of this kind of in-process profile analysis follows this list
    • level 2: this gets hooked up in the UI and there's a button on the stmt/txn stats page that runs a short profile to update the table
    • level 3: profiles are kicked off automatically at a cadence
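To make level 1 concrete, here is a rough sketch (not part of the prototype) of taking a short CPU profile in-process and aggregating sampled CPU time per statement label. It assumes the github.com/google/pprof/profile package and the stmt.anonymized label key from the snippet above; handling of an already-running profile is omitted:

```go
package cpubreakdown

import (
	"bytes"
	"context"
	"errors"
	"runtime/pprof"
	"time"

	"github.com/google/pprof/profile"
)

// StatementCPU takes a CPU profile of the local process for the given duration
// and returns the sampled CPU time attributed to each value of the
// "stmt.anonymized" label.
func StatementCPU(ctx context.Context, dur time.Duration) (map[string]time.Duration, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	select {
	case <-time.After(dur):
	case <-ctx.Done():
	}
	pprof.StopCPUProfile()

	prof, err := profile.Parse(&buf)
	if err != nil {
		return nil, err
	}
	// Locate the "cpu" sample type, which is reported in nanoseconds.
	cpuIdx := -1
	for i, st := range prof.SampleType {
		if st.Type == "cpu" {
			cpuIdx = i
		}
	}
	if cpuIdx == -1 {
		return nil, errors.New("no cpu sample type in profile")
	}
	byStmt := make(map[string]time.Duration)
	for _, s := range prof.Sample {
		for _, stmt := range s.Label["stmt.anonymized"] {
			byStmt[stmt] += time.Duration(s.Value[cpuIdx])
		}
	}
	return byStmt, nil
}
```

The resulting per-statement durations could then be written into stmt_statistics or surfaced in the UI.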

@tbg closed this Apr 30, 2021
@michae2
Collaborator

michae2 commented Oct 27, 2021

@kevin-v-ngo this seems like something really useful. I wonder what you think of it?
