tenant: add endpoint with instant metrics #70750

darinpp · 2021-09-27T01:43:44Z

Previously, the tenant process served all of its metrics on
/_status/vars. The metrics on this endpoint are updated every 10 seconds,
and many of them show a rate calculated over that 10-second interval.
The cockroach operator uses some of these metrics to monitor the CPU
workload of the tenant process and to drive automatic scaling. The
10-second interval is too long for this purpose and makes scaling up
slow: because a delta needs two samples, reporting high CPU utilization
can take up to 20 seconds. To resolve this, this PR adds a new endpoint,
/_status/load, that provides an instant reading of a very small subset
of the normal metrics (user and system CPU time for now). Because these
readings are instant, the client can retrieve consecutive snapshots in
quick succession and compute a precise CPU utilization. The client also
controls the interval between the two pulls (as opposed to having it
hard-coded to 10 seconds).

Release note: None
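
For illustration, the client-side computation described above could look roughly like the sketch below. It assumes the client records a wall-clock timestamp with each pull and that the endpoint reports cumulative user and system CPU time in nanoseconds. The `cpuSnapshot` type, its field names, and the `utilization` function are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"time"
)

// cpuSnapshot is a hypothetical client-side record of one pull from the
// instant endpoint: when it was taken and the cumulative CPU time reported.
type cpuSnapshot struct {
	at        time.Time
	userNanos int64 // cumulative user CPU time, in ns (assumed unit)
	sysNanos  int64 // cumulative system CPU time, in ns (assumed unit)
}

// utilization returns the average number of CPU cores kept busy between two
// snapshots: CPU time consumed divided by wall-clock time elapsed.
func utilization(prev, cur cpuSnapshot) (float64, error) {
	wall := cur.at.Sub(prev.at)
	if wall <= 0 {
		return 0, fmt.Errorf("snapshots are not in increasing time order (elapsed %v)", wall)
	}
	cpu := (cur.userNanos - prev.userNanos) + (cur.sysNanos - prev.sysNanos)
	return float64(cpu) / float64(wall.Nanoseconds()), nil
}

func main() {
	t0 := time.Unix(0, 0)
	prev := cpuSnapshot{at: t0, userNanos: 4_000_000_000, sysNanos: 1_000_000_000}
	// Two seconds later the process has used 1.2s more user and 0.3s more system time.
	cur := cpuSnapshot{at: t0.Add(2 * time.Second), userNanos: 5_200_000_000, sysNanos: 1_300_000_000}

	u, err := utilization(prev, cur)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f cores\n", u) // prints "0.75 cores"
}
```

Because the client chooses when to take the second snapshot, it also chooses the averaging window, which is the point of the last sentence of the description.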

cockroach-teamcity · 2021-09-27T01:43:50Z

This change is Reviewable.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)

-- commits, line 12 at r1:
NIT: Update the commit/PR message to be /_status/load.

pkg/server/tenant.go, line 321 at r1 (raw file):

	return func(w http.ResponseWriter, r *http.Request) {
		if err := cpuTime.Get(os.Getpid()); err != nil {

Should this CPU calculation code be an exported function on status.RuntimeStatSampler? Something like:

func (rsr *RuntimeStatSampler) SampleCPU() {
  ...
}

I believe this code should be thread-safe. You could call that code both from here and from the SampleEnvironment method. It would get the updated CPU time and then update the CPUUserNS and CPUSysNS gauges.

pkg/server/tenant.go, line 323 at r1 (raw file):

		if err := cpuTime.Get(os.Getpid()); err != nil {
			log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)

Don't we need a return here in the error code path?

pkg/server/tenant.go, line 334 at r1 (raw file):

		if err := exporter.PrintAsText(w); err != nil {
			log.Errorf(r.Context(), "%v", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)

And a return here as well.

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)

-- commits, line 12 at r1:

Previously, andy-kimball (Andy Kimball) wrote…

NIT: Update the commit/PR message to be /_status/load.

done

pkg/server/tenant.go, line 321 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Should this CPU calculation code be an exported function on status.RuntimeStatSampler? Something like:
func (rsr *RuntimeStatSampler) SampleCPU() {
  ...
}
I believe this code should be thread-safe. You could call that code both from here and from the SampleEnvironment method. It would get the updated CPU time and then update the CPUUserNS and CPUSysNS gauges.

changed

pkg/server/tenant.go, line 323 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Don't we need a return here in the error code path?

after the SampleCPU change - not needed.

pkg/server/tenant.go, line 334 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

And a return here as well.

after the SampleCPU change - not needed.

andy-kimball · 2021-09-28T00:39:23Z

@knz, feel free to add any other person(s) who might want to review this. We're not sure who the right people/team would be.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)

pkg/server/tenant.go, line 334 at r1 (raw file):

Previously, darinpp wrote…

after the SampleCPU change - not needed.

How come it's not still needed? If we added code after this block, we don't want to execute it in this error case.
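
To make the concern concrete, here is a minimal sketch of the pattern, not the code from the PR. `http.Error` only writes the response; it does not stop the handler, so anything added after the error block would still run unless the handler returns. `loadHandler`, `respond`, `errSample`, and the `cpu_user_ns` line are hypothetical names used only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errSample is a hypothetical sampling failure used by the example.
var errSample = errors.New("unable to get cpu usage")

// loadHandler is a hypothetical stand-in for the handler under review.
// sample stands in for whatever reads the CPU time.
func loadHandler(sample func() (int64, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		v, err := sample()
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return // without this, the success path below would also run
		}
		fmt.Fprintf(w, "cpu_user_ns %d\n", v)
	}
}

// respond runs the handler once against an in-memory recorder and returns
// the status code and body it produced.
func respond(h http.HandlerFunc) (int, string) {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/_status/load", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := respond(loadHandler(func() (int64, error) { return 0, errSample }))
	fmt.Printf("%d %q\n", code, body)

	code, body = respond(loadHandler(func() (int64, error) { return 42, nil }))
	fmt.Printf("%d %q\n", code, body)
}
```

With the return in place, the failing case produces only the 500 response and its error text; without it, the metric line would be appended to the error body.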

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)

pkg/server/tenant.go, line 334 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

How come it's not still needed? If we added code after this block, we don't want to execute it in this error case.

OK. Added a return.

andy-kimball

LGTM, but make sure to get a review from someone on the Server team before merging.

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)

knz

you need to update the PR description with the new contents of the commit message.

Reviewed 3 of 5 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)

darinpp

done

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)

knz · 2021-09-28T15:27:21Z

@dhartunian you may want to have a look too

andy-kimball · 2021-09-28T16:36:09Z

There's one potential problem I wanted to call out: because we're updating the server's instance of RuntimeStatSampler, which is shared with the regular vars endpoint, this will cause "de-sync'ing" of related metrics. For example, the instantaneous CPU won't "match" the CPU percentage. Generally, we want a reasonably consistent snapshot every 10 seconds, with each of the metrics roughly corresponding with one another by timestamp. By overwriting just some of these values every, say, 3 seconds, overlaying different metrics on top of one another by time won't work very well.

Perhaps we should use a different instance of RuntimeStatSampler for instant values so we don't mix with the background values we collect?

darinpp

Good point. There is actually nothing that requires the two sets to share metrics. I changed the new endpoint to use separate gauges.
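
A rough sketch of the separation discussed here, under the assumption that each path simply owns its own pair of gauges. The `gauge`, `cpuGauges`, and `server` types below are simplified stand-ins invented for this example, not the metric types used in the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge is a minimal stand-in for a metrics gauge that is safe for
// concurrent use.
type gauge struct{ v int64 }

func (g *gauge) Update(v int64) { atomic.StoreInt64(&g.v, v) }
func (g *gauge) Value() int64   { return atomic.LoadInt64(&g.v) }

// cpuGauges groups the user and system CPU time gauges.
type cpuGauges struct {
	userNS gauge
	sysNS  gauge
}

func (c *cpuGauges) record(userNS, sysNS int64) {
	c.userNS.Update(userNS)
	c.sysNS.Update(sysNS)
}

// server keeps two independent sets of gauges (hypothetical layout).
type server struct {
	background cpuGauges // refreshed by the periodic sampler, exported with all other metrics
	instant    cpuGauges // refreshed on every request to the instant endpoint
}

func main() {
	var s server
	s.background.record(1000, 200) // periodic sample
	s.instant.record(1600, 350)    // on-demand read a few seconds later

	// The on-demand read did not disturb the periodic snapshot.
	fmt.Println(s.background.userNS.Value(), s.instant.userNS.Value()) // prints "1000 1600"
}
```

Keeping the sets apart means the values exported with the periodic metrics still correspond to one sampling tick, whatever the polling rate on the instant endpoint.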

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @darinpp and @knz)

pkg/server/tenant.go, line 321 at r4 (raw file):

	return func(w http.ResponseWriter, r *http.Request) {
		if err := cpuTime.Get(os.Getpid()); err != nil {

We're back to not sharing this code, which I don't think is great. Can't you encapsulate this code in a little function, like we do with GetUserCPUSeconds, and then use it both here and from the sampler? Something like:

func GetCPUSeconds(ctx context.Context) (userTimeNs, sysTimeNs int64, err error)

I actually think we should move GetUserCPUSeconds to tenant.go and make it use the new GetCPUSeconds function. GetUserCPUSeconds is a very specialized function that shouldn't be in a shared file, IMO.
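
One possible shape for that refactoring, as a sketch only. It assumes the CPU time is read with getrusage(2) through the standard `syscall` package, which is Unix-only and is not the call the PR uses (the PR calls `cpuTime.Get(os.Getpid())`). The lowercase names, the millisecond unit, and the `cpu_user_ms`/`cpu_sys_ms` lines are hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// getCPUTime returns the cumulative user and system CPU time (in ms) of the
// current process. Assumption: values come from getrusage(2), Unix only.
func getCPUTime() (userTimeMs, sysTimeMs int64, err error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, 0, fmt.Errorf("unable to get cpu usage: %w", err)
	}
	return timevalToMs(ru.Utime), timevalToMs(ru.Stime), nil
}

// timevalToMs converts a seconds/microseconds pair to whole milliseconds.
func timevalToMs(tv syscall.Timeval) int64 {
	return int64(tv.Sec)*1000 + int64(tv.Usec)/1000
}

// getUserCPUSeconds is the specialized caller: it only needs user time,
// expressed in seconds.
func getUserCPUSeconds() (float64, error) {
	userMs, _, err := getCPUTime()
	if err != nil {
		return 0, err
	}
	return float64(userMs) / 1000, nil
}

// loadText is a second caller that reports both values, roughly as an
// instant endpoint might (metric names are hypothetical).
func loadText() (string, error) {
	userMs, sysMs, err := getCPUTime()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("cpu_user_ms %d\ncpu_sys_ms %d\n", userMs, sysMs), nil
}

func main() {
	sec, err := getUserCPUSeconds()
	if err != nil {
		panic(err)
	}
	text, err := loadText()
	if err != nil {
		panic(err)
	}
	fmt.Printf("user CPU seconds: %.3f\n%s", sec, text)
}
```

Both callers go through the same helper, so the way CPU time is read and converted lives in one place.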

dhartunian

I like Andy's suggestions re: code organization.

Reviewed 2 of 5 files at r1, 1 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @darinpp, @dhartunian, and @knz)

pkg/ccl/serverccl/tenant_vars_test.go, line 62 at r4 (raw file):

		"invalid non-200 status code %v from tenant", resp.StatusCode)

	prometheusMetricStringPattern := `^(?P<metric>\w+)(?:\{` +

Minor: We import the prometheus client library as a dependency so you might be able to use it to parse the metrics directly instead of manually via regex.

darinpp

done

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @dhartunian, and @knz)

pkg/ccl/serverccl/tenant_vars_test.go, line 62 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Minor: We import the prometheus client library as a dependency so you might be able to use it to parse the metrics directly instead of manually via regex.

switched to use Prometheus client

pkg/server/tenant.go, line 321 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

We're back to not sharing this code, which I don't think is great. Can't you encapsulate this code in a little function, like we do with GetUserCPUSeconds, and then use it both here and from the sampler? Something like:
func GetCPUSeconds(ctx context.Context) (userTimeNs, sysTimeNs int64, err error)
I actually think we should move GetUserCPUSeconds to tenant.go and make it use the new GetCPUSeconds function. GetUserCPUSeconds is a very specialized function that shouldn't be in a shared file, IMO.

Done

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @darinpp, @dhartunian, and @knz)

pkg/server/tenant.go, line 292 at r5 (raw file):

	getUserCPUSec := func(ctx context.Context) float64 {
		userTimeMS, _, err := status.GetCPUTime(ctx)
		if err != nil {

Don't need to log this error, as it's already logged in GetCPUTime.

pkg/server/status/runtime.go, line 450 at r5 (raw file):

	userTimeMS, sysTimeMS, err := GetCPUTime(ctx)
	if err != nil {
		log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)

This error has already been logged, you don't need to log again.

pkg/server/status/runtime.go, line 694 at r5 (raw file):

// GetCPUTime returns the cumulative user/system time (in ms) since the process start.
func GetCPUTime(ctx context.Context) (userTimeMS, sysTimeMS int64, err error) {

NIT: I think this should be userTimeMs rather than capitalizing as MS.

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @dhartunian, and @knz)

pkg/server/tenant.go, line 292 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Don't need to log this error, as it's already logged in GetCPUTime.

I removed the logging in GetCPUTime. Leaving the log here as it is higher up in the call stack and will log the caller.

pkg/server/status/runtime.go, line 450 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

This error has already been logged, you don't need to log again.

I removed the logging in GetCPUTime. Leaving the log here as it is higher up in the call stack and will log the caller.
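
The convention settled on in this thread (the helper returns the error, the caller logs it once) can be sketched as follows. This uses the standard `log` package rather than CockroachDB's logger, and `readCPUTime`, `userCPUSeconds`, `loggedLines`, and `errNoProc` are hypothetical names for this example.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
)

// errNoProc is a hypothetical low-level failure.
var errNoProc = errors.New("process not found")

// readCPUTime is a hypothetical low-level helper. It adds context to the
// error and returns it, but does not log.
func readCPUTime(fail bool) (userTimeMs int64, err error) {
	if fail {
		return 0, fmt.Errorf("unable to get cpu usage: %w", errNoProc)
	}
	return 1234, nil
}

// userCPUSeconds is the caller. It owns the logging, so the log line
// identifies which call site hit the error.
func userCPUSeconds(logger *log.Logger, fail bool) float64 {
	ms, err := readCPUTime(fail)
	if err != nil {
		logger.Printf("userCPUSeconds: %v", err)
		return 0
	}
	return float64(ms) / 1000
}

// loggedLines reports how many log lines one call produces.
func loggedLines(fail bool) int {
	var buf bytes.Buffer
	userCPUSeconds(log.New(&buf, "", 0), fail)
	return bytes.Count(buf.Bytes(), []byte("\n"))
}

func main() {
	fmt.Println(loggedLines(true), loggedLines(false)) // prints "1 0"
}
```

Each failure is reported exactly once, and the wrapped error still lets callers test for the underlying cause with `errors.Is`.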

pkg/server/status/runtime.go, line 694 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

NIT: I think this should be userTimeMs rather than capitalizing as MS.

Changed the abbreviations to mixed case.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian and @knz)

darinpp · 2021-10-01T00:33:12Z

bors r+

craig · 2021-10-01T02:20:55Z

Build succeeded:

GitHub CI (Cockroach)

blathers-crl · 2021-10-01T02:21:03Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 5b0bdb0 to blathers/backport-release-21.2-70750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

darinpp requested a review from andy-kimball September 27, 2021 01:49

darinpp added the backport-21.2.x label Sep 27, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch 3 times, most recently from fd7403d to e1dbe12 September 27, 2021 19:44

andy-kimball reviewed Sep 27, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch from e1dbe12 to 0f8f621 September 27, 2021 22:20

darinpp commented Sep 27, 2021

andy-kimball requested a review from knz September 28, 2021 00:38

andy-kimball requested a review from a team September 28, 2021 00:40

andy-kimball reviewed Sep 28, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch from 0f8f621 to 57cf942 September 28, 2021 02:08

darinpp commented Sep 28, 2021

andy-kimball reviewed Sep 28, 2021

knz reviewed Sep 28, 2021

darinpp commented Sep 28, 2021

darinpp requested a review from knz September 28, 2021 15:24

knz approved these changes Sep 28, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch from 57cf942 to 07205b8 September 28, 2021 18:23

darinpp commented Sep 28, 2021

andy-kimball reviewed Sep 29, 2021

dhartunian reviewed Sep 29, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch from 07205b8 to 90a2f80 September 29, 2021 23:56

darinpp commented Sep 29, 2021

andy-kimball reviewed Sep 30, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch from 90a2f80 to 08e6e0f September 30, 2021 18:22

darinpp force-pushed the add-instant-vars-to-tenant branch from 08e6e0f to 538a2f2 September 30, 2021 19:28

darinpp commented Sep 30, 2021

andy-kimball approved these changes Sep 30, 2021

darinpp force-pushed the add-instant-vars-to-tenant branch from 538a2f2 to 5b0bdb0 September 30, 2021 22:54

craig bot merged commit 5f53feb into cockroachdb:master Oct 1, 2021

darinpp mentioned this pull request Oct 3, 2021

release-21.2: tenant: add endpoint with instant metrics #71052

Merged
