
Allow debugging kubelet image pull times #96054

Merged (1 commit) on Nov 12, 2020

Conversation

@alvaroaleman (Member) commented Oct 30, 2020

This PR adds additional buckets of 60, 300, 600, 900 and 1200 seconds
to the kubelet_runtime_operations_duration_seconds metric in order to
allow debugging image pull times. Right now the biggest bucket is 10
seconds, which is an ordinary time frame in which to pull an image, making the
metric useless for the aforementioned use case.

I wrote this PR while waiting 10 minutes for a 90 MB image to get pulled, and I'd like to be able to find out whether pulls are always this slow or whether other factors impact pull time (time of day, different nodes, ...).

The kubelet_runtime_operations_duration_seconds metric buckets were set to 0.005, 0.0125, 0.03125, 0.078125, 0.1953125, 0.48828125, 1.220703125, 3.0517578125, 7.62939453125, 19.073486328125, 47.6837158203125, 119.20928955078125, 298.0232238769531 and 745.0580596923828 seconds
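
For reference, these boundaries form the series .005 * 2.5^i for i = 0..13, i.e. what the metrics.ExponentialBuckets(.005, 2.5, 14) call the conversation below settles on generates. A minimal sketch to reproduce them, assuming the upstream github.com/prometheus/client_golang helper that the component-base metrics wrapper delegates to:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Each boundary is .005 * 2.5^i for i = 0..13; this prints the 14
	// values listed in the release note above.
	for i, b := range prometheus.ExponentialBuckets(.005, 2.5, 14) {
		fmt.Printf("bucket %2d: %gs\n", i, b)
	}
}
```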

@k8s-ci-robot added the size/XS, release-note, needs-kind, needs-sig, needs-triage, needs-priority, cncf-cla: yes, area/kubelet, sig/instrumentation, and sig/node labels, and removed the needs-sig label, Oct 30, 2020
@alvaroaleman (Member, Author)

Test failure is unlikely to be related:
ERROR: An error occurred during the fetch of repository 'debian-iptables-amd64':

/retest

```diff
@@ -202,7 +202,7 @@ var (
 		Subsystem: KubeletSubsystem,
 		Name:      RuntimeOperationsDurationKey,
 		Help:      "Duration in seconds of runtime operations. Broken down by operation type.",
-		Buckets:   metrics.DefBuckets,
+		Buckets:   []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 60, 300, 600, 900, 1200},
```
Contributor:

We could define a variable for this custom bucket.

@alvaroaleman (Member, Author):

Sure. Where would that variable live, here or in the metrics package?
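
For illustration, a minimal sketch of what the suggested variable could look like; the name and placement are hypothetical, nothing in this thread settled them:

```go
// In pkg/kubelet/metrics (hypothetical placement): name the custom
// boundaries once so the histogram definition can reference them.
package metrics

// runtimeOperationsBuckets is a hypothetical name; this PR never landed on one.
var runtimeOperationsBuckets = []float64{
	.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 60, 300, 600, 900, 1200,
}
```

The histogram definition would then reference `Buckets: runtimeOperationsBuckets` instead of the inline literal.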

@tangenti left a comment:

The metric is also used for other runtime ops. Would this change add too much cardinality to kubelet_runtime_operations_duration_seconds?
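
To make the cardinality concern concrete: each added boundary becomes one more `_bucket` time series per `operation_type` label value. A rough sketch of the arithmetic; the operation-type count is a hypothetical placeholder, not a number from this PR:

```go
package main

import "fmt"

func main() {
	const (
		defaultBuckets = 11 // prometheus.DefBuckets has 11 boundaries
		customBuckets  = 16 // the list proposed in this PR
		operationTypes = 20 // hypothetical: distinct operation_type label values
	)
	// Each boundary is one *_bucket series per operation type (the +Inf
	// bucket and the _sum/_count series exist either way).
	fmt.Println("extra series per kubelet:", (customBuckets-defaultBuckets)*operationTypes)
}
```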

On the same diff hunk as above:
Contributor:

Would it be clearer (and look less arbitrary) if we just used exponential buckets here? I.e.

```go
...
Buckets: metrics.ExponentialBuckets(.005, 2, 18),
...
```

to make a list of .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 20, 40, 80, 160, 320, 640, 1280

@alvaroaleman (Member, Author):

Yeah, works for me too. I don't have a very strong opinion on the exact bucket sizes; I only want a set of bigger buckets to exist.

Member:

@alvaroaleman: +1 to changing this PR to metrics.ExponentialBuckets(.005, 2, 18)

@alvaroaleman (Member, Author):

updated, ptal

@dashpole (Contributor) commented Nov 4, 2020

/assign
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 4, 2020
@spiffxp added the do-not-merge/needs-kind label and removed the needs-kind label Nov 5, 2020
@alvaroaleman (Member, Author)

/retest

@rphillips (Member)

/kind feature
/lgtm

@k8s-ci-robot added the kind/feature label and removed the do-not-merge/needs-kind label Nov 11, 2020
@dashpole (Contributor)

I think this should produce buckets of

.005, .01, .02, .04, .08, .16, .32, .64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92, 163.84, 327.68, 655.36

not what is listed above. Is that still large enough for your use case? If you are changing bucket boundaries anyway, should we make the growth factor larger so we don't need to add as many streams?
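
For reference, the corrected series is easy to check with the upstream Prometheus helper that metrics.ExponentialBuckets wraps (a minimal, standalone sketch):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Doubling 18 times from .005 yields .005, .01, .02, .04, ..., 655.36:
	// the factor-2 series above, not the .025/.05/... steps of DefBuckets.
	fmt.Println(prometheus.ExponentialBuckets(.005, 2, 18))
}
```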

@alvaroaleman (Member, Author)

> not what is listed above. Is that still large enough for your use case?

Thanks for pointing that out; it's not. The issue we are seeing is that the pull takes more than 15 minutes.

> If you are changing bucket boundaries anyway, should we make the growth factor larger so we don't need to add as many streams?

TBH I would actually prefer having it smaller at the upper end (that's why the PR initially proposed 600, 900, 1200). The underlying problem is that this metric is used both for operations that are very quick and for some, like pulling, that are very slow. WDYT about what I initially proposed, still visible above this comment?

@dashpole (Contributor)

I suspect you are interested in buckets clustered around ~15 minutes because that is about what your image takes to pull. I prefer exponential buckets because they are equally good for debugging X% deviations from expectation, regardless of what the expected value is.

It's unfortunate that image pulls are included with the other operations; otherwise I'd say we should remove many of the smaller buckets. Looking at some data I have, it is unusual for an image pull to take less than a second. `time docker pull busybox` takes ~1.3 seconds.

It does look like it is common for stopping a container or sandbox to take longer than 10 seconds. That might just be because some containers don't respond to SIGTERM... But still, it means additional buckets would likely be useful for other operations.

@k8s-ci-robot removed the lgtm label Nov 12, 2020
@alvaroaleman (Member, Author)

> I suspect you are interested in buckets clustered around ~15 minutes because that is about what your image takes to pull. I prefer exponential buckets because they are equally good for debugging X% deviations from expectation, regardless of what the expected value is.

Hm, yeah. To be fair, the expectation for pretty much all pulls is low single-digit minutes at most, so it probably doesn't matter much.
I've updated the PR to use metrics.ExponentialBuckets(.005, 2, 20), is that OK for you?

@alvaroaleman (Member, Author)

The test failures seem to be due to kubernetes/test-infra#19910

@dashpole (Contributor)

ack
/retest

@alvaroaleman (Member, Author)

The CI reports `merge: 04e3aaea31fc9805aae21daa1b81fad85d3067de - not something we can merge`, but I have that commit locally and the UI also referenced it: 04e3aae

Rebased in the hope that it will make a difference.

@dashpole (Contributor)

Doubling the number of buckets feels like a lot of additional streams. Does something like (.005, 2.5, 14) seem like a reasonable compromise? That puts the top buckets at ~2, ~5, ~12, and ~30 minutes.

@alvaroaleman (Member, Author)

> Doubling the number of buckets feels like a lot of additional streams. Does something like (.005, 2.5, 14) seem like a reasonable compromise? That puts the top buckets at ~2, ~5, ~12, and ~30 minutes.

sgtm, updated

@dashpole (Contributor)

/lgtm
/approve

@k8s-ci-robot added the lgtm label Nov 12, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, dashpole, ruiwen-zhao

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Nov 12, 2020
This PR changes the buckets of the
kubelet_runtime_operations_duration_seconds metric to be
metrics.ExponentialBuckets(.005, 2.5, 14) in order to
allow debugging image pull times. Right now the biggest bucket is 10
seconds, which is an ordinary time frame in which to pull an image, making the
metric useless for the aforementioned use case.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2020
@alvaroaleman (Member, Author) commented Nov 12, 2020

Same `merge: 04e3aaea31fc9805aae21daa1b81fad85d3067de - not something we can merge` issue. Could you re-add the lgtm? I just did a no-op amend to work around it.

@dashpole (Contributor)

/lgtm

@k8s-ci-robot added the lgtm label Nov 12, 2020
@alvaroaleman (Member, Author)

/retest
