This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Kube node drainer service fails to start. #40

Closed
camilb opened this issue Nov 7, 2016 · 16 comments

camilb (Contributor) commented Nov 7, 2016

I think the line Restart=on-failure from kube-node-drainer.service should be removed.

Getting these errors:

    Failed to restart kubelet.service: Unit kube-node-drainer.service is not loaded properly: Invalid argument.
    kube-node-drainer.service: Service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
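
For reference, a minimal sketch of a unit shape that systemd accepts (illustrative only, not the actual fix in #41): for Type=oneshot services systemd permits only Restart=no, which is the default, so the Restart=on-failure line simply has to be dropped.

    [Service]
    Type=oneshot
    RemainAfterExit=true
    # Restart=on-failure   # rejected: Type=oneshot units only allow Restart=no (the default)
    ExecStart=/bin/true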

camilb (Contributor, Author) commented Nov 7, 2016

#41

pieterlange (Contributor) commented:

Easy mistake to slip in.

I think I've spotted another one:

        ExecStop=/bin/sh -c '/usr/bin/docker run --rm -v /etc/kubernetes:/etc/kubernetes {{.HyperkubeImageRepo}}:{{.K8sVer}} \
          /hyperkube kubectl \
          --server=https://{{.ExternalDNSName}}:443 \
          --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
          drain $$(hostname) \
          --ignore-daemonsets \
          --force'

Unless I'm missing something here, the double $$ seems to be an error. Is this some escaping trick? I think we should just be able to run $(hostname), right?
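
Side note, purely illustrative and not taken from the repo: in systemd unit files "$$" is substituted with a single literal "$" before the command line is executed, so $$(hostname) is how the template passes $(hostname) through to /bin/sh intact. A minimal example:

    [Service]
    Type=oneshot
    # systemd turns "$$" into "$", so /bin/sh receives: echo "draining $(hostname)"
    ExecStart=/bin/sh -c 'echo "draining $$(hostname)"'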

camilb (Contributor, Author) commented Nov 7, 2016

@pieterlange I see that it's working with double $$ as well as with a single $.

camilb (Contributor, Author) commented Nov 7, 2016

Related to this, but it looks like it affects other services too: using ExecStartPre=/usr/bin/systemctl is-active kubelet.service will check whether the service is active and then exit (code=exited, status=3). Since this is a oneshot service, it will not be restarted. I also observed this on install-calico-system.service and install-kube-system.service. Using ExecStartPre=/usr/bin/systemctl is-active service.name generates a lot of errors before kubelet.service becomes active.

In this case I think it is better to use something like:

    After=multi-user.target

    [Install]
    WantedBy=node-drain.target

This way we make sure that all the services are running before we start this one.

Here is a proposal:

     [Unit]
     Description=drain this k8s node to make running pods time to gracefully shut down before stopping kubelet
     After=multi-user.target
     Wants=decrypt-tls-assets.service kubelet.service docker.service

     [Service]
     Type=oneshot
     RemainAfterExit=true
     ExecStart=/bin/sh -c '/usr/bin/docker run --rm -v /etc/kubernetes:/etc/kubernetes {{.HyperkubeImageRepo}}:{{.K8sVer}} \
       /hyperkube kubectl \
       --server=https://{{.ExternalDNSName}}:443 \
       --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
       uncordon $(hostname)'
     ExecStop=/bin/sh -c '/usr/bin/docker run --rm -v /etc/kubernetes:/etc/kubernetes {{.HyperkubeImageRepo}}:{{.K8sVer}} \
       /hyperkube kubectl \
       --server=https://{{.ExternalDNSName}}:443 \
       --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
       drain $(hostname) \
       --ignore-daemonsets \
       --force'

     [Install]
     WantedBy=node-drain.target

mumoshu (Contributor) commented Nov 8, 2016

@pieterlange I replied to you in #41 (comment) about $$.

mumoshu (Contributor) commented Nov 8, 2016

@camilb Thanks for your feedback and proposal!

I don't intend to just stick with the is-active method, but anyway it seems we may have 3 issues here:

  1. a lot of errors before starting kubelet, flanneld, etc.
  2. the node-drainer isn't coming up at all(!)
  3. un-cordoning nodes

For 1, I believe we can use RestartSec as used in https://github.com/coreos/coreos-baremetal/blob/master/examples/ignition/bootkube-controller.yaml#L66 to alleviate the issue. Should we start with RestartSec=10 anyway?
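
A rough sketch of that idea (the RestartSec=10 and Restart=on-failure lines reflect the suggestion above; the ExecStart path is only a placeholder, not the actual kube-aws unit):

    [Service]
    Restart=on-failure
    # Pause between retries so the is-active check doesn't flood the journal
    # with failures while kubelet.service is still starting up.
    RestartSec=10
    ExecStartPre=/usr/bin/systemctl is-active kubelet.service
    # Placeholder command standing in for the real install script.
    ExecStart=/opt/bin/install-kube-system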

For 2, part of the issue is resolved thanks to your PR #41. To tackle the other part, I've come to believe your proposal, which uses After=multi-user.target to control the order of service startup, is the only way to go!

For 3, though I'm rather looking forward to it, I'm not familiar with its use-case!

mumoshu added this to the v0.9.1-rc.2 milestone Nov 8, 2016
mumoshu (Contributor) commented Nov 8, 2016

pieterlange (Contributor) commented:

Uncordoning will be necessary when the node was restarted due to manual operator intervention instead of a rolling upgrade. Not sure how often we'd see that in practice (who reboots nodes in an ASG?) but it might cause some 'funny' side-effects.

camilb (Contributor, Author) commented Nov 8, 2016

@mumoshu

  1. I will do some testing today with RestartSec and also try to set an ordering for the services, to see which is more effective.
  2. I'm already using After=multi-user.target and it works fine.
  3. We could stick with ExecStart=/bin/true in this case. @pieterlange I actually had to reboot the nodes several times on a staging cluster due to the Docker 10.3 problems under heavy load. I know that you can lose the node in an ASG, but it's faster.

mumoshu (Contributor) commented Nov 8, 2016

@camilb Thanks for your cooperation here.

Just a quick reply to 1.: FYI, we've already been hit by the 51200-byte limit in the master branch.
If you are going to start testing on top of it, I'd like to encourage you to try #45 as the base branch for testing!

camilb (Contributor, Author) commented Nov 8, 2016

@mumoshu Thanks, I will try it. I hit the limit several times and was using kube-aws up --export with S3.

camilb (Contributor, Author) commented Nov 8, 2016

@mumoshu Finished testing.

  1. kube-aws up --s3-uri s3://my_bucket/kube-aws works fine

  2. kube-aws update --s3-uri s3://my_bucket/kube-aws always shows this error:

    Error: Error updating cluster: error updating cloudformation stack: ValidationError: Template format error: unsupported structure.
    status code: 400, request id: 0d04f84f-a607-11e6-b3c5-2548ad1d19cd
    

But it works fine with aws cloudformation update-stack.

  3. Regarding kube-node-drainer.service

docker.service is stopped at the same time as kube-node-drainer.service, so it worked less than 10% of the time. I tried several configurations like WantedBy= or RequiredBy=poweroff.target reboot.target halt.target etc... Then I switched to rkt, and after a few attempts I found a solution that works fine with rolling updates, manual shutdown, reboot, etc. I tested it several times and it didn't fail.

- name: kube-node-drainer.service
  enable: true
  command: start
  runtime: true
  content: |
    [Unit]
    Description=drain this k8s node to make running pods time to gracefully shut down before stopping kubelet
    After=multi-user.target

    [Service]
    Type=oneshot
    RemainAfterExit=true
    ExecStart=/bin/true
    TimeoutStopSec=30s
    ExecStop=/bin/sh -c '/usr/bin/rkt run \
    --volume=kube,kind=host,source=/etc/kubernetes,readOnly=true \
    --mount=volume=kube,target=/etc/kubernetes \
    --net=host \
    quay.io/coreos/hyperkube:v1.4.5_coreos.0 \
      --exec=/kubectl -- \
      --server=https://{{.ExternalDNSName}}:443 \
      --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
      drain $(hostname) \
      --ignore-daemonsets \
      --force'

    [Install]
    WantedBy=multi-user.target

mumoshu (Contributor) commented Nov 9, 2016

@camilb

  • Glad to hear that kube-aws up --s3-uri worked for you!
  • For 2., it turned out that I had made the worst mistake I've ever made in the last commit 😈 It's already fixed and pushed, and I'm going to test it out soon. Thank you for the feedback 🙇
  • For 3., honestly I had never noticed that instability, which is something we'd like to fix ASAP! I'm going to look into your PR 👍

mumoshu (Contributor) commented Nov 9, 2016

FYI, I have just merged #48 which, in combination with #41, fixes what is stated in the title of this issue.

mumoshu (Contributor) commented Nov 9, 2016

Revisiting & thinking about @pieterlange's comment at #40 (comment) and @camilb's comment at #40 (comment).

In addition to what you've mentioned, would the uncordon feature + the node drainer allow us to automatically upgrade the CoreOS version, which implies automatic rebooting, hopefully without downtime?
If that is the case, as there are already several possible use-cases, I thought it would be nice to keep discussing it in a separate issue.

camilb (Contributor, Author) commented Nov 11, 2016

@mumoshu I will close this for now. It's still not perfect, but it works better now. The request to drain the node is properly sent and the pods are started on other nodes. The problem is that the containers are killed quickly on the drained node, and for pods that use bigger images or need a longer time to start/stop, there is not enough time for them to be started on other nodes. I'm looking for a good method to delay stopping some services on shutdown or reboot. I saw some examples from Red Hat and want to test them. I will open another issue with a proposal for improvements soon.
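
One possible direction, sketched here only as an illustration of the systemd mechanics rather than a tested fix: a unit ordered After= the services that must keep running is stopped before them at shutdown (stop order is the reverse of start order), so its ExecStop can hold them up while pods are rescheduled.

    [Unit]
    Description=hold shutdown briefly so drained pods have time to be rescheduled
    After=docker.service kubelet.service

    [Service]
    Type=oneshot
    RemainAfterExit=true
    ExecStart=/bin/true
    TimeoutStopSec=120s
    # This unit stops before docker.service and kubelet.service, so the wait runs
    # while they are still up; replace the plain sleep with a real readiness check.
    ExecStop=/bin/sleep 90

    [Install]
    WantedBy=multi-user.target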

camilb closed this as completed Nov 11, 2016