Allow debugging kubelet image pull times #96054
Conversation
Test failure is unlikely to be related: /retest
pkg/kubelet/metrics/metrics.go (outdated diff)
@@ -202,7 +202,7 @@ var (
 	Subsystem: KubeletSubsystem,
 	Name:      RuntimeOperationsDurationKey,
 	Help:      "Duration in seconds of runtime operations. Broken down by operation type.",
-	Buckets:   metrics.DefBuckets,
+	Buckets:   []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 60, 300, 600, 900, 1200},
We could define a variable for this custom bucket.
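A minimal sketch of the kind of variable being suggested; the name runtimeOperationsBuckets and its placement next to the histogram in pkg/kubelet/metrics are assumptions for illustration, not code from the PR:

```go
// Hypothetical sketch only: hoist the custom buckets into a package-level
// variable so the histogram definition stays readable and the list can be
// reused in tests.
package metrics

var runtimeOperationsBuckets = []float64{
	.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 60, 300, 600, 900, 1200,
}
```

The histogram option would then read Buckets: runtimeOperationsBuckets instead of repeating the literal.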
Sure. Where would that variable live, here or in the metrics package?
The metric is also used for other runtime ops. Would this change add too much cardinality to kubelet_runtime_operations_duration_seconds?
Would it be clearer (and look less arbitrary) if we just use exponential buckets here? i.e.
...
Buckets: metrics.ExponentialBuckets(.005, 2, 18),
...
to make a list of .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 20, 40, 80, 160, 320, 640, 1280
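For reference, a small hand-rolled equivalent of the Prometheus ExponentialBuckets helper, which, as I understand it, simply makes bucket i equal to start * factor^i. This is illustration only, not code from the PR:

```go
package main

import "fmt"

// exponentialBuckets mirrors what prometheus.ExponentialBuckets(start, factor, count)
// computes: count boundaries, each one factor times the previous.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// With (.005, 2, 18) the boundaries double at every step:
	// 0.005, 0.01, 0.02, 0.04, ... up to roughly 655 seconds.
	fmt.Println(exponentialBuckets(0.005, 2, 18))
}
```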
Yeah, that works for me too. I don't have a very strong opinion on the exact bucket sizes; I only want a set of bigger buckets to exist.
@alvaroaleman: +1 to changing this PR to metrics.ExponentialBuckets(.005, 2, 18)
updated, ptal
/assign
Force-pushed from 83dacd5 to 1e4894a.
/retest
/kind feature
I think this should produce buckets of .005, .01, .02, .04, .08, .16, .32, .64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92, 163.84, 327.68, 655.36, not what is listed above. Is that still large enough for your use case? If you are changing bucket boundaries anyway, should we make the growth factor larger so we don't need to add as many streams?
Thanks for pointing that out. It's not; the issue we are seeing is that this takes more than 15 minutes.
TBH I would actually prefer having it smaller at the upper end (that's why the PR initially proposed the explicit bucket list topping out at 900 and 1200 seconds).
I suspect you are interested in buckets clustered around ~15 minutes because that is about what your image takes to pull. I prefer exponential buckets because they are equally good for debugging X% deviations from expectation, regardless of what the expected value is. It's unfortunate that image pulls are included with the other metrics, otherwise I'd say we should remove many of the smaller buckets. Looking at some data I have, it is unusual for an image pull to take less than a second. It does look like it is common for stopping a container or sandbox to take longer than 10 seconds. That might just be because some containers don't respond to SIGTERM... But still, it means additional buckets would likely be useful for other operations.
Force-pushed from 1e4894a to 04e3aae.
Hm, yeah. To be fair, the expectation for pretty much all pulls is low single-digit minutes at most, so it probably doesn't matter much.
The test failures seem to be due to kubernetes/test-infra#19910
ack
Force-pushed from 04e3aae to 6092f48.
I have that commit locally and the UI also referenced it: 04e3aae. Rebased in the hope that it will make a difference.
Doubling the number of buckets feels like a lot of additional streams. Does something like (.005, 2.5, 14) seem like a reasonable compromise? That puts the top buckets at ~2, ~5, ~12, ~30 minutes.
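A quick sanity check of that compromise, using the client_golang helper directly; this snippet is not part of the PR:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// ExponentialBuckets(.005, 2.5, 14): the largest finite boundaries land
	// near 2, 5, and 12 minutes (~119s, ~298s, ~745s); anything slower falls
	// into the implicit +Inf bucket.
	for _, b := range prometheus.ExponentialBuckets(0.005, 2.5, 14) {
		fmt.Printf("%8.3fs  (%.2f min)\n", b, b/60)
	}
}
```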
Force-pushed from 6092f48 to 6f3ee1a.
sgtm, updated
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, dashpole, ruiwen-zhao

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
This PR changes the buckets of the kubelet_runtime_operations_duration_seconds metric to metrics.ExponentialBuckets(.005, 2.5, 14) in order to allow debugging image pull times. Right now the biggest bucket is 10 seconds, which is an ordinary time frame to pull an image, making the metric useless for the aforementioned use case.
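Pieced together from the diff hunk shown earlier, the final definition in pkg/kubelet/metrics/metrics.go would presumably look roughly like the fragment below; the variable name, the NewHistogramVec wrapper, the label name, and the StabilityLevel field are assumptions about the surrounding code rather than lines quoted in this PR:

```go
// Sketch of the var block entry after the change (names outside the diff are assumed).
RuntimeOperationsDuration = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Subsystem:      KubeletSubsystem,
		Name:           RuntimeOperationsDurationKey,
		Help:           "Duration in seconds of runtime operations. Broken down by operation type.",
		Buckets:        metrics.ExponentialBuckets(.005, 2.5, 14),
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"operation_type"},
)
```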
Force-pushed from 6f3ee1a to 801a52c.
Same
/lgtm
/retest
This PR adds additional buckets of 60, 300, 600, 900 and 1200 seconds to the kubelet_runtime_operations_duration_seconds metric in order to allow debugging image pull times. Right now the biggest bucket is 10 seconds, which is an ordinary time frame to pull an image, making the metric useless for the aforementioned use case.
I wrote this PR while waiting 10 minutes for a 90 MB image to get pulled, and I'd like to be able to find out if pulls are always this slow or if there are other factors that impact pull time (time of day, different nodes, ...).