Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows kubelet stats timeout updates #87730

Merged

Conversation

marosset
Copy link
Contributor

@marosset marosset commented Jan 31, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR addresses an issue where kubelet metrics call take a very long time on Windows nodes if more than a handful of containers are running.

Which issue(s) this PR fixes:

Fixes Stats performance is slow on Windows #74991

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Kubelet metrics gathered through metrics-server or prometheus should no longer timeout for Windows nodes running more than 3 pods.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 31, 2020
@marosset
Copy link
Contributor Author

/sig windows
cc @PatrickLang

/test pull-kubernetes-e2e-aks-engine-azure-windows

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 31, 2020
@k8s-ci-robot k8s-ci-robot added area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. kind/bug Categorizes issue or PR as related to a bug. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jan 31, 2020
@PatrickLang
Copy link
Contributor

/retest pull-kubernetes-integration

@PatrickLang
Copy link
Contributor

/milestone v1.18

@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 1, 2020
@PatrickLang
Copy link
Contributor

I'm testing a custom build with these changes plus #86101 since it also depends on metrics :)

@PatrickLang
Copy link
Contributor

/assign @yliaog

@PatrickLang
Copy link
Contributor

/assign @benmoss

@PatrickLang
Copy link
Contributor

/lgtm
I applied both this metrics fix and the limits fix, built, and metrics & limits tests all pass.

}

if foundNode == false {
framework.Skipf("Could not find and ready and schedulable Windows nodes")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking

Copy link
Contributor Author

@marosset marosset Feb 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like skipf got moved from framework to framework/skipper with this commit
641321c

I'll rebase and push an update

@liggitt
Copy link
Member

liggitt commented Feb 6, 2020

/test pull-kubernetes-bazel-build

@marosset marosset force-pushed the windows-kubelet-stats-timeout-updates branch from e54fd36 to 999fdfa Compare February 6, 2020 18:17
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2020
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Feb 6, 2020

@marosset: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-aks-engine-azure-windows e5f6c6b5e205f369895d47d5858ce7eb2c2d5165 link /test pull-kubernetes-e2e-aks-engine-azure-windows

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@marosset
Copy link
Contributor Author

marosset commented Feb 6, 2020

/retest

@PatrickLang
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit 6a92f19 into kubernetes:master Feb 6, 2020
@marosset marosset deleted the windows-kubelet-stats-timeout-updates branch February 6, 2020 22:19
k8s-ci-robot added a commit that referenced this pull request Apr 7, 2020
…30-upstream-release-1.15

Automated cherry pick of #87730 upstream release 1.15
k8s-ci-robot added a commit that referenced this pull request Apr 7, 2020
…30-upstream-release-1.17

Automated cherry pick of #87730 upstream release 1.17
k8s-ci-robot added a commit that referenced this pull request Apr 7, 2020
…30-upstream-release-1.16

Automated cherry pick of #87730 upstream release 1.16
vboulineau added a commit to vboulineau/kubernetes that referenced this pull request May 4, 2020
…s are present on the node

Following changes in kubernetes#87730, Kubelet is directly hcsshim to gather stats.
However, unlike `docker stats` API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.

When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.

When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
vboulineau added a commit to vboulineau/kubernetes that referenced this pull request May 14, 2020
…s are present on the node

Following changes in kubernetes#87730, Kubelet is directly hcsshim to gather stats.
However, unlike `docker stats` API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.

When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.

When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
vboulineau added a commit to vboulineau/kubernetes that referenced this pull request May 14, 2020
…s are present on the node

Following changes in kubernetes#87730, Kubelet is directly hcsshim to gather stats.
However, unlike `docker stats` API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.

When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.

When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
vboulineau added a commit to vboulineau/kubernetes that referenced this pull request May 14, 2020
…s are present on the node

Following changes in kubernetes#87730, Kubelet is directly hcsshim to gather stats.
However, unlike `docker stats` API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.

When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.

When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants