This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

master: flanneld seems to be timing out while running decrypt-tls-asset in ExecStartPre #65

Closed
mumoshu opened this issue Nov 17, 2016 · 5 comments · Fixed by #73
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


mumoshu commented Nov 17, 2016

With 7ea5f6b, I've seen an error message like:

Nov 17 08:48:10 ip-10-0-0-216.ap-northeast-1.compute.internal systemd[1]: flanneld.service: Start-pre operation timed out. Terminating.
Nov 17 08:48:11 ip-10-0-0-216.ap-northeast-1.compute.internal etcdctl[8586]: open /etc/kubernetes/ssl/etcd-client.pem: no such file or directory
Nov 17 08:48:11 ip-10-0-0-216.ap-northeast-1.compute.internal systemd[1]: flanneld.service: Control process exited, code=exited status=1

Full log can be seen at https://gist.github.com/mumoshu/6f9fe119f882d3fcda40322d209123d8

It seems that after decrypt-tls-assets times out, systemd continues to run the next ExecStartPre, which also ends up with an error like etcd-client.pem: no such file or directory (presumably because systemd terminated decrypt-tls-assets, the step that is supposed to generate that file!)

It seems to take about 3 min 30 sec until flanneld is fully up and running.
Could we shorten that by removing unnecessary timeouts like this one?
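For context, here is a rough sketch of how I read the unit's pre-start sequence. The paths and arguments are my assumptions for illustration, not the actual kube-aws unit:

[Service]
# decrypt-tls-assets is supposed to write /etc/kubernetes/ssl/etcd-client.pem (assumed path)
ExecStartPre=/opt/bin/decrypt-tls-assets
# the next pre-start step reads that file, so it fails with "no such file or directory"
# when the previous step was terminated before it could finish
ExecStartPre=/usr/bin/etcdctl --cert-file=/etc/kubernetes/ssl/etcd-client.pem cluster-health
ExecStart=/usr/bin/flanneld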


mumoshu commented Nov 17, 2016

I guess timing out and terminating decrypt-tls-assets like this needlessly makes flanneld's startup take longer.

The timeout seems to be 10 seconds according to the timestamps.
Should we make it sufficiently longer, maybe 60 sec?

mumoshu modified the milestones: v0.9.1-rc.3, v0.9.1-rc.4 on Nov 17, 2016

mumoshu commented Nov 17, 2016

According to the systemd docs, there seems to be no configuration specifically for ExecStartPre timeouts.
The possibly relevant settings are TimeoutSec and TimeoutStartSec.
I'm going to try the latter.
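For a quick manual test on a node, I expect a drop-in like the following to work (this is just a sketch; the real change would go into the flanneld unit that kube-aws renders via cloud-config):

# /etc/systemd/system/flanneld.service.d/10-timeout.conf (hypothetical drop-in path)
[Service]
TimeoutStartSec=60

followed by systemctl daemon-reload and a restart of flanneld.service.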


mumoshu commented Nov 18, 2016

Currently testing with TimeoutStartSec=60:

core@ip-10-0-0-40 ~ $ systemctl show flanneld.service | grep Timeout
TimeoutStartUSec=1min
TimeoutStopUSec=1min 30s
JobTimeoutUSec=infinity
JobTimeoutAction=none
core@ip-10-0-0-40 ~ $ systemctl show kubelet.service | grep Timeout
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
JobTimeoutUSec=infinity
JobTimeoutAction=none
core@ip-10-0-0-40 ~ $ systemctl show docker.service | grep Timeout
TimeoutStartUSec=infinity
TimeoutStopUSec=1min 30s
JobTimeoutUSec=infinity
JobTimeoutAction=none

Now it takes 2 min until flanneld fully starts up:
https://gist.github.com/mumoshu/71e7c1858ef439197360121e4aaac1d9

However, 60 sec doesn't seem to be sufficient:

Nov 18 00:02:38 ip-10-0-0-40.ap-northeast-1.compute.internal systemd[1]: Starting Network fabric for containers...
...
Nov 18 00:03:38 ip-10-0-0-40.ap-northeast-1.compute.internal systemd[1]: flanneld.service: Start-pre operation timed out. Terminating.


mumoshu commented Nov 18, 2016

With TimeoutStartSec=120:

Nov 18 00:31:08 ip-10-0-0-133.ap-northeast-1.compute.internal systemd[1]: Starting Network fabric for containers...
Nov 18 00:32:40 ip-10-0-0-133.ap-northeast-1.compute.internal systemd[1]: Started Network fabric for containers.

https://gist.github.com/mumoshu/763efc6c923c966323c6d9757425f738

There are no timeouts, and it takes only 1 min 32 sec until it's up 🎉
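So the change that worked boils down to this single setting (shown here as a drop-in sketch; in kube-aws it would be set directly in the flanneld unit in cloud-config):

[Service]
TimeoutStartSec=120

systemd then reports it back as TimeoutStartUSec=2min in systemctl show flanneld.service, matching the output format above.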


mumoshu commented Nov 18, 2016

This slowness almost certainly slipped into v0.9.1-rc.1 via #34 and has been present since then.

mumoshu added a commit to mumoshu/kube-aws that referenced this issue Nov 18, 2016
mumoshu added the kind/bug label on Nov 18, 2016