
Allow debugging kubelet image pull times #96054

Merged (1 commit) on Nov 12, 2020

Conversation

@alvaroaleman (Member) commented Oct 30, 2020

This PR adds additional buckets of 60, 300, 600, 900 and 1200 seconds
to the kubelet_runtime_operations_duration_seconds metric in order to
allow debugging image pull times. Right now the biggest bucket is 10
seconds, which is an ordinary time frame in which to pull an image, making the
metric useless for the aforementioned use case.

I wrote this PR while waiting 10 minutes for a 90 MB image to get pulled, and I'd like to be able to find out whether pulls are always this slow or whether other factors impact pull time (time of day, different nodes, ...).

The kubelet_runtime_operations_duration_seconds metric buckets were set to 0.005, 0.0125, 0.03125, 0.078125, 0.1953125, 0.48828125, 1.220703125, 3.0517578125, 7.62939453125, 19.073486328125, 47.6837158203125, 119.20928955078125, 298.0232238769531 and 745.0580596923828 seconds
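
For reference, these boundaries form the series .005 * 2.5^i for i = 0..13, i.e. what the metrics.ExponentialBuckets(.005, 2.5, 14) call the conversation below settles on generates. A minimal sketch to reproduce them, assuming the upstream github.com/prometheus/client_golang helper that the component-base metrics wrapper delegates to:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Each boundary is .005 * 2.5^i for i = 0..13; this prints the 14
	// values listed in the release note above.
	for i, b := range prometheus.ExponentialBuckets(.005, 2.5, 14) {
		fmt.Printf("bucket %2d: %gs\n", i, b)
	}
}
```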

@k8s-ci-robot added the size/XS, release-note, needs-kind, needs-sig, needs-triage, needs-priority, cncf-cla: yes, area/kubelet, sig/instrumentation, and sig/node labels, and removed the needs-sig label, Oct 30, 2020
@alvaroaleman (Member, Author)

Test failure is unlikely to be related:
ERROR: An error occurred during the fetch of repository 'debian-iptables-amd64':

/retest

```diff
@@ -202,7 +202,7 @@ var (
 		Subsystem: KubeletSubsystem,
 		Name:      RuntimeOperationsDurationKey,
 		Help:      "Duration in seconds of runtime operations. Broken down by operation type.",
-		Buckets:   metrics.DefBuckets,
+		Buckets:   []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 60, 300, 600, 900, 1200},
```
Contributor:

We could define a variable for this custom bucket.

@alvaroaleman (Member, Author):

Sure. Where would that variable live, here or in the metrics package?
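
For illustration, a minimal sketch of what the suggested variable could look like; the name and placement are hypothetical, nothing in this thread settled them:

```go
// In pkg/kubelet/metrics (hypothetical placement): name the custom
// boundaries once so the histogram definition can reference them.
package metrics

// runtimeOperationsBuckets is a hypothetical name; this PR never landed on one.
var runtimeOperationsBuckets = []float64{
	.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 60, 300, 600, 900, 1200,
}
```

The histogram definition would then reference `Buckets: runtimeOperationsBuckets` instead of the inline literal.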

@tangenti left a comment:

The metric is also used for other runtime ops. Would this change add too much cardinality to kubelet_runtime_operations_duration_seconds?
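
To make the cardinality concern concrete: each added boundary becomes one more `_bucket` time series per `operation_type` label value. A rough sketch of the arithmetic; the operation-type count is a hypothetical placeholder, not a number from this PR:

```go
package main

import "fmt"

func main() {
	const (
		defaultBuckets = 11 // prometheus.DefBuckets has 11 boundaries
		customBuckets  = 16 // the list proposed in this PR
		operationTypes = 20 // hypothetical: distinct operation_type label values
	)
	// Each boundary is one *_bucket series per operation type (the +Inf
	// bucket and the _sum/_count series exist either way).
	fmt.Println("extra series per kubelet:", (customBuckets-defaultBuckets)*operationTypes)
}
```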

On the same diff hunk as above:
Contributor:

Would it be clearer (and look less arbitrary) if we just used exponential buckets here? I.e.

```go
...
Buckets: metrics.ExponentialBuckets(.005, 2, 18),
...
```

to make a list of .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 20, 40, 80, 160, 320, 640, 1280

@alvaroaleman (Member, Author):

Yeah, works for me too. I don't have a very strong opinion on the exact bucket sizes; I only want a set of bigger buckets to exist.

Member:

@alvaroaleman: +1 to changing this PR to metrics.ExponentialBuckets(.005, 2, 18)

@alvaroaleman (Member, Author):

updated, ptal

@dashpole (Contributor) commented Nov 4, 2020

/assign
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 4, 2020
@spiffxp added the do-not-merge/needs-kind label and removed the needs-kind label Nov 5, 2020
@alvaroaleman (Member, Author)

/retest

@rphillips (Member)

/kind feature
/lgtm

@k8s-ci-robot added the kind/feature label and removed the do-not-merge/needs-kind label Nov 11, 2020
@dashpole (Contributor)

I think this should produce buckets of

.005, .01, .02, .04, .08, .16, .32, .64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92, 163.84, 327.68, 655.36

not what is listed above. Is that still large enough for your use case? If you are changing bucket boundaries anyway, should we make the growth factor larger so we don't need to add as many streams?
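
For reference, the corrected series is easy to check with the upstream Prometheus helper that metrics.ExponentialBuckets wraps (a minimal, standalone sketch):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Doubling 18 times from .005 yields .005, .01, .02, .04, ..., 655.36:
	// the factor-2 series above, not the .025/.05/... steps of DefBuckets.
	fmt.Println(prometheus.ExponentialBuckets(.005, 2, 18))
}
```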

@alvaroaleman (Member, Author)

> not what is listed above. Is that still large enough for your use case?

Thanks for pointing that out; it's not. The issue we are seeing is that the pull takes more than 15 minutes.

> If you are changing bucket boundaries anyway, should we make the growth factor larger so we don't need to add as many streams?

TBH I would actually prefer having it smaller at the upper end (that's why the PR initially proposed 600, 900, 1200). The underlying problem is that this metric is used both for operations that are very quick and for some, like pulling, that are very slow. WDYT about what I initially proposed, still visible above this comment?

@dashpole (Contributor)

I suspect you are interested in buckets clustered around ~15 minutes because that is about what your image takes to pull. I prefer exponential buckets because they are equally good for debugging X% deviations from expectation, regardless of what the expected value is.

It's unfortunate that image pulls are included with the other operations; otherwise I'd say we should remove many of the smaller buckets. Looking at some data I have, it is unusual for an image pull to take less than a second. `time docker pull busybox` takes ~1.3 seconds.

It does look like it is common for stopping a container or sandbox to take longer than 10 seconds. That might just be because some containers don't respond to SIGTERM... But still, it means additional buckets would likely be useful for other operations.

@k8s-ci-robot removed the lgtm label Nov 12, 2020
@alvaroaleman (Member, Author)

> I suspect you are interested in buckets clustered around ~15 minutes because that is about what your image takes to pull. I prefer exponential buckets because they are equally good for debugging X% deviations from expectation, regardless of what the expected value is.

Hm, yeah. To be fair, the expectation for pretty much all pulls is low single-digit minutes at most, so it probably doesn't matter much.
I've updated the PR to use metrics.ExponentialBuckets(.005, 2, 20), is that OK for you?

@alvaroaleman (Member, Author)

The test failures seem to be due to kubernetes/test-infra#19910

@dashpole (Contributor)

ack
/retest

@alvaroaleman (Member, Author)

The CI reports `merge: 04e3aaea31fc9805aae21daa1b81fad85d3067de - not something we can merge`, but I have that commit locally and the UI also referenced it: 04e3aae

Rebased in the hope that it will make a difference.

@dashpole (Contributor)

Doubling the number of buckets feels like a lot of additional streams. Does something like (.005, 2.5, 14) seem like a reasonable compromise? That puts the top buckets at ~2, ~5, ~12, and ~30 minutes.

@alvaroaleman (Member, Author)

> Doubling the number of buckets feels like a lot of additional streams. Does something like (.005, 2.5, 14) seem like a reasonable compromise? That puts the top buckets at ~2, ~5, ~12, and ~30 minutes.

sgtm, updated

@dashpole (Contributor)

/lgtm
/approve

@k8s-ci-robot added the lgtm label Nov 12, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, dashpole, ruiwen-zhao

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Nov 12, 2020
This PR changes the buckets of the
kubelet_runtime_operations_duration_seconds metric to be
metrics.ExponentialBuckets(.005, 2.5, 14) in order to
allow debugging image pull times. Right now the biggest bucket is 10
seconds, which is an ordinary time frame in which to pull an image, making the
metric useless for the aforementioned use case.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2020
@alvaroaleman (Member, Author) commented Nov 12, 2020

Same `merge: 04e3aaea31fc9805aae21daa1b81fad85d3067de - not something we can merge` issue. Could you re-add the lgtm? I just did a no-op amend to work around it.

@dashpole (Contributor)

/lgtm

@k8s-ci-robot added the lgtm label Nov 12, 2020
@alvaroaleman (Member, Author)

/retest
