Improvement: fluentd-gcp to get same toleration as kube-proxy #44445

Closed

JorritSalverda opened this issue Apr 13, 2017 · 26 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

JorritSalverda commented Apr 13, 2017

When adding a NoExecute taint to a node, most kube-system pods get evicted, except for kube-proxy, which has the toleration =:Exists:NoExecute.

It would be nice if fluentd-gcp and heapster got the same =:Exists:NoExecute toleration, to make sure they keep running on that node.

Currently in GKE 1.6.0 they have the following tolerations:

fluentd-gcp

  Tolerations:  
    node.alpha.kubernetes.io/ismaster=:NoSchedule
    node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
    node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s

kube-proxy

  Tolerations:  
    =:Exists:NoExecute

heapster

  Tolerations:  
    CriticalAddonsOnly=:Exists
    node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
    node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s
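
For reference, kube-proxy's =:Exists:NoExecute toleration would be expressed in a pod spec roughly like this (a minimal sketch; leaving out the key makes it match NoExecute taints with any key and value):

```yaml
tolerations:
- operator: Exists    # no key: matches taints with any key and value
  effect: NoExecute   # but only those with the NoExecute effect
```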
@JorritSalverda (Author)

Perhaps the even more liberal toleration below would be best for all 3, although in that case there's no way to prevent those applications from being scheduled on your node.

      tolerations:
      - operator: Exists
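
(With operator: Exists and no key or effect specified, the toleration matches every taint, whatever its key, value, or effect.)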

davidopp (Member) commented May 5, 2017

@kubernetes/sig-scheduling-feature-requests
@piosz @vishh @gmarek

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 5, 2017
davidopp (Member) commented May 5, 2017

I assume you mean "When adding a noexecute taint to a node"

FYI the node.alpha.kubernetes.io/{notReady,unreachable} tolerations don't do anything in 1.6 unless you have the alpha feature enabled (I forgot the exact flag).

But can you explain what behavior you want? Do you want fluentd and heapster to never be evicted due to taints, under any circumstances? I think that makes sense for fluentd, since it runs on every node, but not for heapster.

gmarek (Contributor) commented May 5, 2017

I'm not sure about heapster (@piosz), and the fact that fluentd doesn't have an infinite toleration is a bug that's getting fixed in #45349 - I hope I'll manage to get it into the next patch release.

@JorritSalverda (Author)

@davidopp I fixed the typo (taint instead of toleration).

When filing this ticket I was still under the impression that heapster was also a DaemonSet, but realized afterwards that it isn't. So yes, this only applies to fluentd.

@JorritSalverda JorritSalverda changed the title Improvement: fluentd-gcp and heapster to get same toleration as kube-proxy Improvement: fluentd-gcp to get same toleration as kube-proxy May 5, 2017
gmarek (Contributor) commented May 5, 2017

All DaemonSets should have an infinite toleration, and we actually tried to do that but failed. The fix is already merged at head and I'm working on a cherry-pick.

gmarek (Contributor) commented May 5, 2017

Cherry-pick: #45401

@nimeshksingh

I think I'm missing something, but how does #45349 add the =:Exists:NoExecute toleration to fluentd-gcp?

gmarek (Contributor) commented May 11, 2017

It's not for all NoExecutes, only for the ones created by the NodeController (i.e. when a Node is unresponsive). I don't think it's OK to add a general toleration for all NoExecute taints, even user-specified ones. We do want users to be able to set a Taint that will remove all Pods from a given Node, even ones created by Daemons.

If you want your particular Daemon to tolerate all NoExecute Taints, you can add this toleration to its spec. I don't think the system should do it automatically, though. @kubernetes/sig-scheduling-feature-requests
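
For a Daemon you own, that would look roughly like the sketch below (names, labels, and image are placeholders; I'm using the apps/v1 API for illustration):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemon                # placeholder
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: my-daemon
  template:
    metadata:
      labels:
        app: my-daemon
    spec:
      tolerations:
      - operator: Exists         # no key: any NoExecute taint
        effect: NoExecute
      containers:
      - name: my-daemon
        image: example.com/my-daemon:latest   # placeholder
```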

@nimeshksingh

Oh, my understanding of this issue was that currently, in GKE, it's not possible to use any user-defined NoExecute taints because fluentd-gcp gets evicted. fluentd-gcp is from a DaemonSet that users don't own, so we can't add arbitrary tolerations to it, but it is intended to be on every GKE node, just like kube-proxy.

@davidopp (Member)

@nimeshksingh I guess your point is that user-added taints are not usable if users can't adjust the set of default tolerations on system-generated pods, which is the case on GKE. That seems like a valid point. We will eventually have a user-configurable admission controller (kubernetes/community#132 will enable this), but that won't be soon.

I need to think about the best solution for this. For runs-on-every-node system pods we probably need to tolerate all taints by default. For runs-on-one-or-a-few-nodes system pods (like Heapster or DNS) I'm not sure what to do.

@davidopp (Member)

cc/ @mikedanese

@aveshagarwal (Member)

@davidopp I wonder why runs-on-one-or-a-few-nodes system pods would be different from runs-on-every-node system pods. Those few nodes could be any nodes, which means those pods would have to tolerate all taints by default too. And if those few nodes are special nodes, then I don't think a user-defined taint would be allowed by default anyway?

gmarek (Contributor) commented May 11, 2017

We certainly need to discuss this, but I think the only problem left is dealing with system pods that users can't modify (e.g. fluentd on GKE @crassirostris). For this particular case I think we just need to add a 'tolerate all' Toleration to it and call it a day (we should probably do the same thing for kube-proxy and NPD if we haven't already - @bowei @dchen1107 @Random-Liu).

For "run one" things (like Heapster or DNS), we can't add them. System depends on them running reasonably well, and Taints will be used as a method for Pod evictions in case of Node problems. I.e. when Node dies you really want your DNS to die and be scheduled somewhere else. Tolerations currently don't support negations IIRC, so we can't specify to tolerate all Taints except ones created by NC. @kubernetes/sig-scheduling-misc

k8s-github-robot pushed a commit that referenced this issue May 12, 2017
Automatic merge from submit-queue (batch tested with PRs 45691, 45667, 45698, 45715)

Add general NoExecute Toleration to fluentd in gcp configuration

Ref #44445

Once merged I'll create a cherry-pick that will be picked up in GKE together with the next patch release.

cc @JorritSalverda @davidopp @aveshagarwal @nimeshksingh @piosz 

```release-note
fluentd will tolerate all NoExecute Taints when run in gcp configuration.
```
@dchen1107 (Member)

re: #44445 (comment)
PR #43116 applied NoExecute taint tolerations to all static pods, including kube-proxy.

k8s-github-robot pushed a commit that referenced this issue May 29, 2017
Automatic merge from submit-queue

Add generic NoExecute Toleration to NPD

Ref. #44445

cc @davidopp 

```release-note
Add generic Toleration for NoExecute Taints to NodeProblemDetector
```
@nimeshksingh

I hate to keep this going, but somehow, even after k8s 1.6.6, which includes PR #45715 and therefore the NoExecute toleration on fluentd-gcp, I ran into the following behavior:

  1. Add a new node. kube-proxy, fluentd-gcp, and my own daemonset pod with a universal 'Exists' toleration start running.
  2. Taint the node with a NoSchedule effect:
    kubectl taint nodes my-node-name mykey=myvalue:NoSchedule
  3. The fluentd-gcp pod gets evicted, but kube-proxy and my own pod stay.
  4. Remove the taint with:
    kubectl taint nodes my-node-name mykey:NoSchedule-
  5. The fluentd-gcp pod comes back.

If I understand the effects correctly, the NoSchedule taint should not affect the fluentd-gcp pod, as it was already running on the node, but it did for some reason. But given that fluentd should probably come back if it somehow dies anyway, should it have a 'NoSchedule' toleration in addition to the 'NoExecute' toleration?
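
If it helps with debugging, the tolerations the running pods actually carry can be dumped with something like this (assuming the k8s-app=fluentd-gcp label that the GKE manifests use):

```
kubectl get pods -n kube-system -l k8s-app=fluentd-gcp \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations}{"\n"}{end}'
```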

@nimeshksingh

Just to clarify, using a NoExecute effect is a reasonable workaround for me, but the current behavior for NoSchedule still strikes me as strange.

davidopp (Member) commented Jun 28, 2017

If I understand the effects correctly, the NoSchedule taint should not affect the fluentd-gcp pod, as it was already running on the node,

Your understanding is correct. Is the behavior you described repeatable? What PodStatus does the fluentd-gcp pod show when it is evicted? It would be useful to see the kubelet and NodeController logs, I guess. Most likely the fluentd-gcp pod is getting evicted for some other reason. (Maybe you could try the same scenario you described, but without tainting the node, and see if the fluentd-gcp pod gets evicted anyway.)

given that fluentd should probably come back if it somehow dies anyways, should it have a 'NoSchedule' toleration in addition to the 'NoExecute' toleration?

Yeah, it seems like fluentd-gcp should tolerate all taint effects, not just NoExecute.
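
In manifest terms that just means dropping the effect field from the wildcard toleration (a sketch, not the actual fluentd-gcp manifest):

```yaml
tolerations:
- operator: Exists   # no key and no effect: tolerates every taint
```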

cc/ @kubernetes/sig-cluster-lifecycle-bugs
cc/ @kubernetes/sig-scheduling-bugs

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. kind/bug Categorizes issue or PR as related to a bug. labels Jun 28, 2017
gmarek (Contributor) commented Jun 28, 2017

This is actually strange - the TaintController isn't even looking at NoSchedule Taints. If adding a NoSchedule taint evicts a Daemon, it sounds like a DaemonSetController bug. @kubernetes/sig-apps-bugs

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/bug Categorizes issue or PR as related to a bug. labels Jun 28, 2017
gmarek (Contributor) commented Jun 28, 2017

I looked into this problem, and indeed it looks like the DSC is deleting the daemon pod:

I0628 07:56:56.634089       5 wrap.go:75] DELETE /api/v1/namespaces/kube-system/pods/fluentd-gcp-v2.0-n6lm7: (3.09306ms) 200 [[kube-controller-manager/v1.6.6 (linux/amd64) kubernetes/7fa1c17/system:serviceaccount:kube-system:daemon-set-controller] [::1]:36844]

So it's certainly a DaemonSet bug - it shouldn't remove a Daemon from the Node even if the Daemon doesn't tolerate a NoSchedule taint that was put on it.

I'll write a quick fix that adds a general NoSchedule toleration for fluentd.

@mikedanese (Member)

This is a regression in the daemonset controller. See #46577 (comment)

gmarek (Contributor) commented Jun 28, 2017

Assigned @erictune as the @kubernetes/sig-apps-bugs lead, but it's very likely that @mikedanese will fix it before Eric wakes up :)

k8s-github-robot pushed a commit that referenced this issue Jun 28, 2017
Automatic merge from submit-queue (batch tested with PRs 48192, 48182)

Add generic NoSchedule toleration to fluentd in gcp config as a quick-fix for #44445
@davidopp (Member)

Thanks for the debugging, @gmarek!

@nimeshksingh

Looks like this is already all figured out, so just let me know if anyone needs me to provide more info.

@davidopp (Member)

Thanks a lot for reporting the problem, @nimeshksingh

@mikedanese (Member)

The original issue was resolved, along with another that got rolled up into it. What remains unresolved:

@gmarek

I don't think it's OK to add general toleration for all NoExecute taints, even user-specified ones. We do want user to be able to set a Taint that will remove all Pods from the given Node, even ones created by Daemons.

@davidopp

For runs-on-every-node system pods we probably need to tolerate all taints by default. For runs-on-one-or-a-few-nodes system pods (like Heapster or DNS) I'm not sure what to do.

But these were side conversations to the original issue. If you think they're important to continue discussing, can you break them out into a dedicated issue?

k8s-github-robot pushed a commit that referenced this issue Nov 30, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add wildcard tolerations to kube-proxy

- Add wildcard tolerations to kube-proxy.
- Add `nvidia.com/gpu` toleration to nvidia-gpu-device-plugin.

Related to #55080 and #44445.

/kind bug
/priority critical-urgent
/sig scheduling

**Release note**:
```release-note
kube-proxy addon tolerates all NoExecute and NoSchedule taints by default.
```

/assign @davidopp @bsalamat @vishh @jiayingz