
*: conclusively attribute CPU usage to SQL queries and sessions #60508


Closed
tbg wants to merge 5 commits

Conversation

tbg
Member

@tbg commented Feb 11, 2021

This prototype demonstrates how, with a modest technical investment, we can
make CPU utilization observable at a granular level (SQL sessions, statements,
etc.) using goroutine labels and the Go CPU profiler.

Before we dive into what the prototype does, here's the status quo: we already support profiler labels to some degree, though in an unpolished form that is hard to capitalize on. Concretely, we set profiler labels here:

```go
if ex.server.cfg.Settings.CPUProfileType() == cluster.CPUProfileWithLabels {
	remoteAddr := "internal"
	if rAddr := ex.sessionData.RemoteAddr; rAddr != nil {
		remoteAddr = rAddr.String()
	}
	labels := pprof.Labels(
		"appname", ex.sessionData.ApplicationName,
		"addr", remoteAddr,
		"stmt.tag", ast.StatementTag(),
		"stmt.anonymized", anonymizeStmt(ast),
	)
	pprof.Do(ctx, labels, func(ctx context.Context) {
		ev, payload, err = ex.execStmtInOpenState(ctx, parserStmt, prepared, pinfo, res)
	})
} else {
	ev, payload, err = ex.execStmtInOpenState(ctx, parserStmt, prepared, pinfo, res)
}
```

This snippet also shows that we have infrastructure that tells us when to
enable goroutine labels, so that we only do so if the process is actively being
profiled (with labels requested). This is relevant because goroutine labels are
far from free. The implementation in the runtime is not very allocation-efficient:
you need to wrap a map in a Context, and the runtime makes an additional copy of
this map internally. The internal copy cannot be avoided, but many of the
surrounding allocations could be, with appropriate pooling and perhaps a custom
implementation of context.Context used for the sole purpose of passing the map
into the runtime. Either way, the takeaway is that getting the overhead down will
be important if we are to rely on this mechanism more; it would be a shame to
make the system a lot slower while trying to find out why it is so slow in the
first place.
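To put a rough number on that overhead, a micro-benchmark along these lines (illustrative only, not part of this PR; the package and label values are made up) would measure the allocations incurred by attaching labels around a unit of work:

```go
package labelcost_test

import (
	"context"
	"runtime/pprof"
	"testing"
)

// BenchmarkPprofDo measures the per-call allocation cost of attaching
// profiler labels via pprof.Do, i.e. the overhead discussed above.
func BenchmarkPprofDo(b *testing.B) {
	ctx := context.Background()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		labels := pprof.Labels(
			"appname", "bench",
			"stmt.tag", "SELECT",
		)
		pprof.Do(ctx, labels, func(ctx context.Context) {
			// The statement execution being attributed would go here.
		})
	}
}
```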
To round out the status quo: debug.zip already collects label-enabled profiles
from all nodes in the cluster, and we ship a script that shows a breakdown by
tags. As far as I know, these have never really been used.

The goal:

Given a cluster in which (one or many) nodes experience high CPU usage, a TSE/SRE
can easily identify

  1. whether the high CPU usage is primarily driven by query load, and if so,
  2. which user/session/query is responsible for that load, so that they can
  3. alert the user and/or throttle or blocklist it (not included in the prototype)

Stretch goals (not included, but worth thinking about):

  • reduce the overhead of the proposed mechanism so that we can periodically enable
    it (at low profiling frequency), and
  • also offer a CPU breakdown per query/txn. If we additionally persisted historical
    statement/txn stats (maybe just storing them before the hourly reset), this would
    be pretty powerful.

Concretely, this prototype focuses only on 1) and 2), with no attempt to
optimize performance. It does the following:

  • Try to ensure we always attach the proper labels to everything. In the process of that:
  • Strip out the "only if requested" logic, to simplify things
  • Add a dummy marker label in cli/start, which lets us look at a profile and check whether
    any work is attributed only to the dummy marker; that would indicate that we were not
    labelling properly everywhere.
  • Hoist the tracing in connExecutor.execStmt into the connExecutor main loop. I did this
    because I realized that significant chunks of CPU time are spent outside of execStmt,
    and most of the work in the prototype was to arrive at a system that captured enough of
    the work so that we could comfortably "deduce" that a query is responsible for overload.
  • Write gRPC interceptors that propagate labels across RPC boundaries (a sketch follows
    this list). This is mostly for distributed SQL flows: if the profile stopped at the
    gateway but the CPU was spent on another node, that time wouldn't be associated with
    the query. Note that all nodes need to actively profile (or at least attach goroutine
    labels) at around the same time, or the links will be lost.
  • Disable the "local node" fastpath for KV RPCs because it's ad-hoc and bypasses the
    interceptors; it would need a more holistic solution before shipping.
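To give an idea of what such interceptors could look like, here is a minimal sketch with hypothetical names and a made-up metadata key convention; a production version would restrict which labels cross the wire, add stream variants, and come with tests:

```go
package labelrpc

import (
	"context"
	"runtime/pprof"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// mdPrefix namespaces profiler labels inside gRPC metadata (a made-up convention).
const mdPrefix = "pprof-label-"

// UnaryClientInterceptor copies the caller's profiler labels from the context
// into outgoing gRPC metadata so the server side can re-apply them.
func UnaryClientInterceptor(
	ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
) error {
	var kvs []string
	pprof.ForLabels(ctx, func(k, v string) bool {
		kvs = append(kvs, mdPrefix+k, v)
		return true // keep iterating
	})
	if len(kvs) > 0 {
		ctx = metadata.AppendToOutgoingContext(ctx, kvs...)
	}
	return invoker(ctx, method, req, reply, cc, opts...)
}

// UnaryServerInterceptor re-applies profiler labels found in incoming metadata
// for the duration of the handler, so that CPU samples taken on this node are
// attributed to the originating session/statement as well.
func UnaryServerInterceptor(
	ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return handler(ctx, req)
	}
	var kvs []string
	for k, vs := range md {
		if strings.HasPrefix(k, mdPrefix) && len(vs) > 0 {
			kvs = append(kvs, strings.TrimPrefix(k, mdPrefix), vs[0])
		}
	}
	if len(kvs) == 0 {
		return handler(ctx, req)
	}
	var resp interface{}
	var err error
	pprof.Do(ctx, pprof.Labels(kvs...), func(ctx context.Context) {
		resp, err = handler(ctx, req)
	})
	return resp, err
}
```

Something like this would be wired up via grpc.WithUnaryInterceptor on the client side and grpc.UnaryInterceptor on the server side.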

Learnings from the prototype:

  • Go intentionally made the labels API such that you can't derive goroutine-local storage from it. On top of the allocation inefficiency, we also need to keep the context.Context and the actual labels in sync manually at all times (the gRPC interceptors can only get the labels from the context; they can't read them off the goroutine). Labels automatically (via the runtime) get propagated to child goroutines, but we often don't do the same thing for the Context, because doing so would also propagate cancellation. The prime example of this is stopper.RunAsyncTask, where we often derive the task context from context.Background. We need to change the Stopper to automatically put the labels of the parent goroutine into the children it starts, which is not too difficult. This project would also lend further weight to banning the go keyword; the Stopper should be our authority here.
  • In a fully loaded system, the contributions from "our" goroutines usually aren't 100%. There is one clear outlier which we can fix: pebble. It has its own goroutine label, which is fine, but we should massage it so that it fits in with whatever labelling categories we want to productionize here (if any). Since pebble is third party, we'll likely just want it to give us a hook to determine the labels rather than hard-coding its own. But even pebble aside, runtime-internal goroutines tend to eat a good chunk of CPU when busy, for example for GC scans. This raises an interesting question about 1) at the top of this message: given the CPU time associated with our labels, how can we determine whether it's enough to claim that it's overloading the machine? Clearly, if a 30s sample shows only 1s of total CPU, the system is not busy. A 30s sample on a 4-vcpu machine could top out at around 120s of total CPU time. If we account for 60s of that, the system seems very busy, but we only account for 50%; what does that mean? There will need to be some guidance around this to help SRE/TSE make judgment calls. I think it will boil down to: if the total samples account for close to num_vcpus * profile_duration, then the system is pretty much overloaded (they should also see that from the runtime metrics). Since we promise that we are putting labels on all moving parts, we can assume that the seconds accrued by those labels are "100% of the productive work" and that this is the whole category to look at. But this needs to be verified experimentally. It is possible for a SQL query to be low-CPU but extremely alloc-heavy, so that it drives the garbage collector nuts.
  • The goroutine profile also uses labels, so it can be used to check whether we're labelling all of the goroutines, simply by restricting to goroutines which don't have a label that we expect to be present on all of them.
  • It's possible to grab the profile in-process and then evaluate it, so it shouldn't be too hard to have a button in the UI that populates the stmt table with a CPU percentage breakdown.
  • We need a bit of convention around these labels, i.e. we should wrap the labels API in our own, more strongly typed helper (see the sketch after this list).
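As a sketch of what such a wrapper could look like (all names here are hypothetical, not existing CockroachDB APIs), including the label-copying helper the Stopper would need for tasks whose context is derived from context.Background:

```go
package proflabels

import (
	"context"
	"runtime/pprof"
)

// StmtLabels is a strongly typed version of the ad-hoc label set used in the
// prototype; the field set is illustrative.
type StmtLabels struct {
	AppName    string
	RemoteAddr string
	StmtTag    string
	StmtAnon   string
}

// Do runs f with the given labels attached to both the context and the
// current goroutine, so CPU samples taken while f runs are attributed to them.
func Do(ctx context.Context, l StmtLabels, f func(context.Context)) {
	pprof.Do(ctx, pprof.Labels(
		"appname", l.AppName,
		"addr", l.RemoteAddr,
		"stmt.tag", l.StmtTag,
		"stmt.anonymized", l.StmtAnon,
	), f)
}

// CopyTo copies any profiler labels stored in `from` into `to`. A Stopper-style
// task runner would use this when it derives a task context from
// context.Background, so that the child's context carries the parent's labels.
// (Labels on the goroutine itself are inherited automatically by child
// goroutines; the context copy is what the gRPC interceptors rely on.)
func CopyTo(from, to context.Context) context.Context {
	var kvs []string
	pprof.ForLabels(from, func(k, v string) bool {
		kvs = append(kvs, k, v)
		return true
	})
	if len(kvs) == 0 {
		return to
	}
	return pprof.WithLabels(to, pprof.Labels(kvs...))
}
```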

@cockroach-teamcity
Member

This change is Reviewable

@awoods187
Contributor

awoods187 commented Feb 11, 2021 via email

tbg added a commit to tbg/cockroach that referenced this pull request Feb 19, 2021
Background profiling is an experimental patch that was floated as a way
to get granular CPU usage information in
golang/go#23458.

I decided to dust off the Golang patch that implements this
(https://go-review.googlesource.com/c/go/+/102755) and see for myself
what you actually get. The output is something like this:

```
bgprof_test.go:51: ts: 12h37m15.540022251s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.54801909s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.560015269s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.568026639s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
bgprof_test.go:51: ts: 12h37m15.580029608s
    labels: map[foo:bar]
    fn: hash/crc32.ieeeCLMUL
...
```

If we used profiler labels to identify queries (similar to what's shown
in cockroachdb#60508), I can see how you could somehow build a profile
from this stream and then use the tag breakdown of the profile to reason
about the CPU allocated to each query. It seems wildly experimental and
also unergonomic, though.

It seems like we'd get roughly the same thing, without as much bending
over backwards and wildly experimental bleeding edge, by relying on
periodic foreground sampling at a lower frequency
(`runtime.SetCPUProfileRate`) than the default 100Hz.

To run this PR (which you shouldn't need to do for anything, but still)
you'd clone `cockroachdb/go`, check out the [bgprof] branch, run
`./make.bash` in `src` and change your PATH so that `$(which go)` is
`bin/go`.

Release note: None
tbg and others added 4 commits February 22, 2021 12:49
When running experiment.sh with three nodes, there's always some
missing-client and missing-server; I think this must be some DistSQL
behavior breaking the context (since I do not see this issue with a
single node and have disabled the local internal client optimization).

TODO: return an error from node.batch in that case to see which sql
ops have it bubble up.

Release note: None
@tbg force-pushed the pprof-experiment branch from c477641 to 07d5a51 on February 22, 2021 11:49
@tbg force-pushed the pprof-experiment branch from 86bf7aa to 651b855 on February 22, 2021 12:34
@tbg
Member Author

tbg commented Feb 22, 2021

What needs to be done to productionize the prototype:

  • An owner needs to be found to take this on for 21.2. An RFC seems in order, or at least coordination with rfcs: Task groups in Go: per-session resource usage tracking #60589.
  • Determine the basic perf overhead of profiling without labels. (Recall that the overhead while not profiling can be zero if we so desire, even with labels).
  • Standardize on label fields, wrap them in a library, and set up the stopper to propagate them explicitly via contexts.
  • Productionize the gRPC interceptors (mostly, use the library and write tests)
  • Make the node local internal client optimization compliant with interceptors (independently valuable)
  • Introduce tooling/paradigms that ensure that all sources of work are labelled. Not sure exactly how! But should stay away from ad-hoc labels; this will rot as folks write new code.
    • assertions in test builds (all server endpoints refuse incoming requests w/o labels? Or maybe that's too restrictive?)
  • Figure out the UX. Various levels:
    • level 0: don't do anything, so it's in the profiles but that's it. Write documentation for our TSE/SRE on how to use them.
    • level 1: a manually triggered CPU profile is taken internally and a CPU breakdown per query is populated in stmt_statistics (or something similar); a sketch of this kind of in-process profile analysis follows this list
    • level 2: this gets hooked up in the UI and there's a button on the stmt/txn stats page that runs a short profile to update the table
    • level 3: profiles are kicked off automatically at a cadence
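To make level 1 concrete, here is a rough sketch (not part of the prototype) of taking a short CPU profile in-process and aggregating sampled CPU time per statement label. It assumes the github.com/google/pprof/profile package and the stmt.anonymized label key from the snippet above; handling of an already-running profile is omitted:

```go
package cpubreakdown

import (
	"bytes"
	"context"
	"errors"
	"runtime/pprof"
	"time"

	"github.com/google/pprof/profile"
)

// StatementCPU takes a CPU profile of the local process for the given duration
// and returns the sampled CPU time attributed to each value of the
// "stmt.anonymized" label.
func StatementCPU(ctx context.Context, dur time.Duration) (map[string]time.Duration, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	select {
	case <-time.After(dur):
	case <-ctx.Done():
	}
	pprof.StopCPUProfile()

	prof, err := profile.Parse(&buf)
	if err != nil {
		return nil, err
	}
	// Locate the "cpu" sample type, which is reported in nanoseconds.
	cpuIdx := -1
	for i, st := range prof.SampleType {
		if st.Type == "cpu" {
			cpuIdx = i
		}
	}
	if cpuIdx == -1 {
		return nil, errors.New("no cpu sample type in profile")
	}
	byStmt := make(map[string]time.Duration)
	for _, s := range prof.Sample {
		for _, stmt := range s.Label["stmt.anonymized"] {
			byStmt[stmt] += time.Duration(s.Value[cpuIdx])
		}
	}
	return byStmt, nil
}
```

The resulting per-statement durations could then be written into stmt_statistics or surfaced in the UI.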

@tbg closed this Apr 30, 2021
@michae2
Collaborator

michae2 commented Oct 27, 2021

@kevin-v-ngo this seems like something really useful. I wonder what you think of it?
