tenant: add endpoint with instant metrics #70750

Merged
merged 1 commit into from
Oct 1, 2021

darinpp
Contributor

@darinpp darinpp commented Sep 27, 2021

Previously, the tenant process served its metrics only on
/_status/vars. That endpoint exposes all the available metrics, and these
are updated every 10 sec. Many of the metrics show a rate that is
calculated over the 10 sec interval. The cockroach operator uses some of
these metrics to monitor the CPU workload of the tenant process and to
drive automatic scaling. The 10 sec interval, however, is too long and
causes slow scale-up: reporting high CPU utilization can take up to
20 sec, because computing a delta needs two samples. To resolve this, the
PR adds a new endpoint, /_status/load, that provides an instant reading of
a very small subset of the normal metrics (user and system CPU time for
now). Because these readings are instant, the client can retrieve
consecutive snapshots in quick succession and compute a precise CPU
utilization. It also lets the client control the interval between the two
pulls (as opposed to having it hard-coded to 10 sec).
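
A minimal sketch of how a client might turn two pulls of /_status/load into a utilization figure. The cpuSnapshot type and cpuUtilization function are hypothetical names used only for illustration, and the sketch assumes the endpoint reports cumulative user and system CPU time in nanoseconds; parsing the HTTP response is left out.

```go
package main

import (
	"fmt"
	"time"
)

// cpuSnapshot is a hypothetical client-side record of one /_status/load pull.
type cpuSnapshot struct {
	at     time.Time // when the pull was made
	userNs int64     // cumulative user CPU time, in nanoseconds (assumed unit)
	sysNs  int64     // cumulative system CPU time, in nanoseconds (assumed unit)
}

// cpuUtilization returns the average number of CPU cores kept busy between
// two snapshots: the CPU time consumed divided by the wall-clock time elapsed.
func cpuUtilization(prev, cur cpuSnapshot) (float64, error) {
	wall := cur.at.Sub(prev.at)
	if wall <= 0 {
		return 0, fmt.Errorf("snapshots must be in increasing time order, got interval %v", wall)
	}
	busyNs := (cur.userNs - prev.userNs) + (cur.sysNs - prev.sysNs)
	return float64(busyNs) / float64(wall.Nanoseconds()), nil
}

func main() {
	t0 := time.Unix(0, 0)
	prev := cpuSnapshot{at: t0, userNs: 4_000_000_000, sysNs: 1_000_000_000}
	// The client chooses the interval; here the second pull is 2 sec later.
	cur := cpuSnapshot{at: t0.Add(2 * time.Second), userNs: 5_000_000_000, sysNs: 1_500_000_000}

	util, err := cpuUtilization(prev, cur)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f cores busy\n", util) // prints "0.75 cores busy"
}
```

With a 2 sec pull interval the client sees the new load after one interval, instead of waiting for two 10 sec refreshes of /_status/vars.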

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@darinpp darinpp force-pushed the add-instant-vars-to-tenant branch 3 times, most recently from fd7403d to e1dbe12 Compare September 27, 2021 19:44
Contributor

@andy-kimball andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


-- commits, line 12 at r1:
NIT: Update the commit/PR message to be /_status/load.


pkg/server/tenant.go, line 321 at r1 (raw file):

	return func(w http.ResponseWriter, r *http.Request) {
		if err := cpuTime.Get(os.Getpid()); err != nil {

Should this CPU calculation code be an exported function on status.RuntimeStatSampler? Something like:

func (rsr *RuntimeStatSampler) SampleCPU() {
  ...
}

I believe this code should be thread-safe. You could call that code both from here and from the SampleEnvironment method. It would get the updated CPU time and then update the CPUUserNS and CPUSysNS gauges.


pkg/server/tenant.go, line 323 at r1 (raw file):

		if err := cpuTime.Get(os.Getpid()); err != nil {
			log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)

Don't we need a return here in the error code path?


pkg/server/tenant.go, line 334 at r1 (raw file):

		if err := exporter.PrintAsText(w); err != nil {
			log.Errorf(r.Context(), "%v", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)

And a return here as well.

Contributor Author

@darinpp darinpp left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)


-- commits, line 12 at r1:

Previously, andy-kimball (Andy Kimball) wrote…

NIT: Update the commit/PR message to be /_status/load.

done


pkg/server/tenant.go, line 321 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Should this CPU calculation code be an exported function on status.RuntimeStatSampler? Something like:

func (rsr *RuntimeStatSampler) SampleCPU() {
  ...
}

I believe this code should be thread-safe. You could call that code both from here and from the SampleEnvironment method. It would get the updated CPU time and then update the CPUUserNS and CPUSysNS gauges.

changed


pkg/server/tenant.go, line 323 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Don't we need a return here in the error code path?

after the SampleCPU change - not needed.


pkg/server/tenant.go, line 334 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

And a return here as well.

after the SampleCPU change - not needed.

@andy-kimball
Contributor

@knz, feel free to add any other person(s) who might want to review this. We're not sure who the right people/team would be.

@andy-kimball andy-kimball requested a review from a team September 28, 2021 00:40
Contributor

@andy-kimball andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/server/tenant.go, line 334 at r1 (raw file):

Previously, darinpp wrote…

after the SampleCPU change - not needed.

How come it's not still needed? If we added code after this block, we don't want to execute it in this error case.
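
A small sketch of the point about the error path, under stated assumptions: loadHandler, probe, failingReader and okReader are hypothetical names, the CPU reader is injected so the failure can be triggered, and the metric lines written on success are placeholders rather than the endpoint's real output.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// loadHandler is a hypothetical stand-in for the /_status/load handler.
// readCPU is injected so that the error path can be exercised.
func loadHandler(readCPU func() (userNs, sysNs int64, err error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		userNs, sysNs, err := readCPU()
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			// Without this return, the write below would still run and
			// append metric lines to the body of the 500 response.
			return
		}
		fmt.Fprintf(w, "cpu_user_ns %d\ncpu_sys_ns %d\n", userNs, sysNs)
	}
}

// probe runs the handler once against an in-memory recorder and returns
// the status code and body it produced.
func probe(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/_status/load", nil))
	return rec.Code, rec.Body.String()
}

func failingReader() (int64, int64, error) {
	return 0, 0, errors.New("unable to get cpu usage")
}

func okReader() (int64, int64, error) { return 7, 3, nil }

func main() {
	code, body := probe(loadHandler(failingReader))
	fmt.Printf("%d %q\n", code, body)

	code, body = probe(loadHandler(okReader))
	fmt.Printf("%d %q\n", code, body)
}
```

The return costs nothing when the error branch is the last statement, and it keeps the handler correct if more code is later added after the branch.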

Contributor Author

@darinpp darinpp left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/server/tenant.go, line 334 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

How come it's not still needed? If we added code after this block, we don't want to execute it in this error case.

OK. Added a return.

Contributor

@andy-kimball andy-kimball left a comment

:lgtm:, but make sure to get a review from someone on the Server team before merging.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)

Contributor

@knz knz left a comment

You need to update the PR description with the new contents of the commit message.

Reviewed 3 of 5 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)

Contributor Author

@darinpp darinpp left a comment

done

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)

@darinpp darinpp requested a review from knz September 28, 2021 15:24
@knz
Contributor

knz commented Sep 28, 2021

@dhartunian you may want to have a look too

@andy-kimball
Contributor

There's one potential problem I wanted to call out: because we're updating the server's instance of RuntimeStatSampler, which is shared with the regular vars endpoint, this will cause "de-sync'ing" of related metrics. For example, the instantaneous CPU won't "match" the CPU percentage. Generally, we want a reasonably consistent snapshot every 10 seconds, with each of the metrics roughly corresponding with one another by timestamp. By overwriting just some of these values every, say, 3 seconds, overlaying different metrics on top of one another by time won't work very well.

Perhaps we should use a different instance of RuntimeStatSampler for instant values so we don't mix with the background values we collect?
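
To make the de-sync concern concrete, here is a rough sketch in which the background sampler and the instant endpoint write to separate gauge sets. All names (cpuGauges, server, sampleEnvironment, serveLoad) are hypothetical and the gauges are plain atomics rather than the project's metric types; it only illustrates the separation being proposed.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cpuGauges is a hypothetical pair of gauges holding cumulative CPU time.
type cpuGauges struct {
	userNs int64
	sysNs  int64
}

// update stores a new reading; atomics keep concurrent readers race-free.
func (g *cpuGauges) update(userNs, sysNs int64) {
	atomic.StoreInt64(&g.userNs, userNs)
	atomic.StoreInt64(&g.sysNs, sysNs)
}

// read returns the last stored reading.
func (g *cpuGauges) read() (userNs, sysNs int64) {
	return atomic.LoadInt64(&g.userNs), atomic.LoadInt64(&g.sysNs)
}

// server keeps two independent gauge sets so that a request to the instant
// endpoint never overwrites values belonging to the periodic snapshot.
type server struct {
	background cpuGauges // refreshed with all other metrics every 10 sec
	instant    cpuGauges // refreshed on every /_status/load request
}

// sampleEnvironment models the periodic sampler tick.
func (s *server) sampleEnvironment(userNs, sysNs int64) {
	s.background.update(userNs, sysNs)
}

// serveLoad models one request to the instant endpoint.
func (s *server) serveLoad(userNs, sysNs int64) {
	s.instant.update(userNs, sysNs)
}

func main() {
	var s server
	s.sampleEnvironment(100, 50) // periodic tick at t = 0
	s.serveLoad(400, 90)         // operator pulls /_status/load at t = 3 sec

	bu, bs := s.background.read()
	iu, is := s.instant.read()
	fmt.Println("background:", bu, bs) // still the t = 0 snapshot
	fmt.Println("instant:   ", iu, is)
}
```

Because the instant request touches only its own gauges, the values served on /_status/vars still all belong to the same 10 sec sample.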

Contributor Author

@darinpp darinpp left a comment

Good point. There is actually nothing that requires the two sets to share metrics. I changed the new endpoint to use separate gauges.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)

Contributor

@andy-kimball andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @darinpp and @knz)


pkg/server/tenant.go, line 321 at r4 (raw file):

	return func(w http.ResponseWriter, r *http.Request) {
		if err := cpuTime.Get(os.Getpid()); err != nil {

We're back to not sharing this code, which I don't think is great. Can't you encapsulate this code in a little function, like we do with GetUserCPUSeconds, and then use it both here and from the sampler? Something like:

func GetCPUSeconds(ctx context.Context) (userTimeNs, sysTimeNs int64, err error)

I actually think we should move GetUserCPUSeconds to tenant.go and make it use the new GetCPUSeconds function. GetUserCPUSeconds is a very specialized function that shouldn't be in a shared file, IMO.
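
One possible shape for the shared helper, sketched under assumptions: the PR reads CPU time through cpuTime.Get(os.Getpid()), whereas this sketch swaps in syscall.Getrusage from the standard library (Unix-like systems only) so that it is self-contained. The names getCPUTimeNs and getUserCPUSeconds are hypothetical, and the ctx parameter is kept only to mirror the suggested signature.

```go
package main

import (
	"context"
	"fmt"
	"syscall"
	"time"
)

// getCPUTimeNs returns the cumulative user and system CPU time of the
// current process, in nanoseconds. Both the sampler and the HTTP handler
// could call it, so the reading logic lives in one place.
func getCPUTimeNs(ctx context.Context) (userNs, sysNs int64, err error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, 0, fmt.Errorf("unable to get cpu usage: %w", err)
	}
	return syscall.TimevalToNsec(ru.Utime), syscall.TimevalToNsec(ru.Stime), nil
}

// getUserCPUSeconds is the specialized wrapper built on the shared helper;
// it converts the user CPU time to seconds.
func getUserCPUSeconds(ctx context.Context) (float64, error) {
	userNs, _, err := getCPUTimeNs(ctx)
	if err != nil {
		return 0, err
	}
	return float64(userNs) / float64(time.Second), nil
}

func main() {
	ctx := context.Background()

	// Burn a little CPU so the counters are visibly non-zero.
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i % 7
	}

	userNs, sysNs, err := getCPUTimeNs(ctx)
	if err != nil {
		panic(err)
	}
	userSec, err := getUserCPUSeconds(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println("checksum:", sum)
	fmt.Println("user ns:", userNs, "sys ns:", sysNs)
	fmt.Println("user sec:", userSec)
}
```

Returning the error instead of logging it inside the helper leaves each caller free to decide where the failure is reported.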

Collaborator

@dhartunian dhartunian left a comment

:lgtm: I like Andy's suggestions re: code organization.

Reviewed 2 of 5 files at r1, 1 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @darinpp, @dhartunian, and @knz)


pkg/ccl/serverccl/tenant_vars_test.go, line 62 at r4 (raw file):

		"invalid non-200 status code %v from tenant", resp.StatusCode)

	prometheusMetricStringPattern := `^(?P<metric>\w+)(?:\{` +

Minor: We import the prometheus client library as a dependency so you might be able to use it to parse the metrics directly instead of manually via regex.

Contributor Author

@darinpp darinpp left a comment

done

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @dhartunian, and @knz)


pkg/ccl/serverccl/tenant_vars_test.go, line 62 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Minor: We import the prometheus client library as a dependency so you might be able to use it to parse the metrics directly instead of manually via regex.

switched to use Prometheus client


pkg/server/tenant.go, line 321 at r4 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

We're back to not sharing this code, which I don't think is great. Can't you encapsulate this code in a little function, like we do with GetUserCPUSeconds, and then use it both here and from the sampler? Something like:

func GetCPUSeconds(ctx context.Context) (userTimeNs, sysTimeNs int64, err error)

I actually think we should move GetUserCPUSeconds to tenant.go and make it use the new GetCPUSeconds function. GetUserCPUSeconds is a very specialized function that shouldn't be in a shared file, IMO.

Done

Contributor

@andy-kimball andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @darinpp, @dhartunian, and @knz)


pkg/server/tenant.go, line 292 at r5 (raw file):

	getUserCPUSec := func(ctx context.Context) float64 {
		userTimeMS, _, err := status.GetCPUTime(ctx)
		if err != nil {

Don't need to log this error, as it's already logged in GetCPUTime.


pkg/server/status/runtime.go, line 450 at r5 (raw file):

	userTimeMS, sysTimeMS, err := GetCPUTime(ctx)
	if err != nil {
		log.Ops.Errorf(ctx, "unable to get cpu usage: %v", err)

This error has already been logged, you don't need to log again.


pkg/server/status/runtime.go, line 694 at r5 (raw file):

// GetCPUTime returns the cumulative user/system time (in ms) since the process start.
func GetCPUTime(ctx context.Context) (userTimeMS, sysTimeMS int64, err error) {

NIT: I think this should be userTimeMs rather than capitalizing as MS.

Contributor Author

@darinpp darinpp left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @dhartunian, and @knz)


pkg/server/tenant.go, line 292 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Don't need to log this error, as it's already logged in GetCPUTime.

I removed the logging in GetCPUTime. Leaving the log here as it is higher up in the call stack and will log the caller.


pkg/server/status/runtime.go, line 450 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

This error has already been logged, you don't need to log again.

I removed the logging in GetCPUTime. Leaving the log here as it is higher up in the call stack and will log the caller.


pkg/server/status/runtime.go, line 694 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

NIT: I think this should be userTimeMs rather than capitalizing as MS.

Changed the abbreviations to mixed case.

Contributor

@andy-kimball andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian and @knz)

@darinpp
Contributor Author

darinpp commented Oct 1, 2021

bors r+

@craig
Contributor

craig bot commented Oct 1, 2021

Build succeeded:

@craig craig bot merged commit 5f53feb into cockroachdb:master Oct 1, 2021
@blathers-crl

blathers-crl bot commented Oct 1, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5b0bdb0 to blathers/backport-release-21.2-70750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.

