Skip to content
This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

adding in priority value in podSpec for controller, scheduler and apiserver #1631

Merged

Conversation

erleene
Copy link
Contributor

@erleene erleene commented Jun 18, 2019

We have found a bug in 1.13.x where mirror pods get evicted when the node is under pressure. This PR sets the priority directly so that those pods won't get evicted. We don't believe that this is required with 1.14+.

Relates to this issue kubernetes/kubernetes#73572

@k8s-ci-robot
Copy link
Contributor

Welcome @erleene!

It looks like this is your first PR to kubernetes-incubator/kube-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-incubator/kube-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 18, 2019
@k8s-ci-robot k8s-ci-robot requested review from mumoshu and redbaron June 18, 2019 15:07
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 18, 2019
@codecov-io
Copy link

codecov-io commented Jun 18, 2019

Codecov Report

Merging #1631 into v0.13.x will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff            @@
##           v0.13.x    #1631   +/-   ##
========================================
  Coverage    25.49%   25.49%           
========================================
  Files           98       98           
  Lines         5049     5049           
========================================
  Hits          1287     1287           
  Misses        3619     3619           
  Partials       143      143

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 289cd68...9d45c31. Read the comment docs.

@erleene erleene force-pushed the bugfix/mirror-pod-priorities branch 2 times, most recently from 7880864 to fd364ab Compare June 18, 2019 15:50
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 18, 2019
@davidmccormick
Copy link
Contributor

Please test before we merge

@davidmccormick davidmccormick changed the title adding in priority value in podSpec for controller, scheduler and apiserver WIP: adding in priority value in podSpec for controller, scheduler and apiserver Jun 19, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2019
@erleene erleene force-pushed the bugfix/mirror-pod-priorities branch from fd364ab to 4581ead Compare June 19, 2019 10:30
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 19, 2019
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 19, 2019
@dominicgunn
Copy link
Contributor

This looks to have been fixed for Kubernetes 1.13.5+, the default for kube-aws is now 1.13.7 so this shouldn't impact newly created clusters.

Kubernetes upstream has fixed this for versions as far back as v1.11.x, do we want to do the same in kube-aws?

@davidmccormick
Copy link
Contributor

Yes, that has got us confused somewhat as we observed the behavior on a 1.13.7 cluster without this work-around! It shouldn’t happen.

@erleene
Copy link
Contributor Author

erleene commented Jun 20, 2019

@dominicgunn

The problem still occurs in 1.13.7, where no controllers are running and getting evicted. We have the api server and the scheduler up, and kubelet shows the following:


11:01:46.328766    1525 certificate_manager.go:378] Certificate request was not signed: timed out waiting for the condition
11:01:46.328804    1525 certificate_manager.go:269] Reached backoff limit, still unable to rotate certs: timed out waiting for the condition
11:10:49.829455    1525 log.go:172] http: multiple response.WriteHeader calls
11:14:54.907091    1525 certificate_manager.go:378] Certificate request was not signed: timed out waiting for the condition




some more interesting output:

10:49:05.306713    1525 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?li>
10:49:05.307641    1525 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fi>
10:49:05.308780    1525 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSel>

@dominicgunn
Copy link
Contributor

Does this only happen in clusters created with kube-aws, or is it a problem in upstream k8s?

@davidmccormick
Copy link
Contributor

davidmccormick commented Jun 20, 2019

Thanks, it was also confirmed that the controller-manager pods had been evicted - this is an upstream kubernetes issue which may also affect other releases, e.g. 1.14.x, 1.15.x (although it is supposed to have been patched). It affects kube-aws customers deploying smaller controllers that are pushed near or over capacity and won't affect well sized nodes. What it has shown us though is that when resources are under pressure some of the most critical components were chosen for eviction.

I suggest we merge this as a work-around on 1.13.x and test the behaviour on 1.14.x and apply similar work-around if observed on that branch.

Good work Erleen!

@dominicgunn
Copy link
Contributor

Cool, i'm behind that. Code looks good.

@dominicgunn
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dominicgunn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2019
@davidmccormick davidmccormick changed the title WIP: adding in priority value in podSpec for controller, scheduler and apiserver adding in priority value in podSpec for controller, scheduler and apiserver Jun 20, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 20, 2019
@davidmccormick
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2019
@davidmccormick davidmccormick merged commit ca91b7f into kubernetes-retired:v0.13.x Jun 20, 2019
@davidmccormick
Copy link
Contributor

Thanks for your first contribution! 🙏

@davidmccormick davidmccormick added this to the v0.13.0-rc.3 milestone Jun 20, 2019
@erleene erleene deleted the bugfix/mirror-pod-priorities branch June 20, 2019 12:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants