This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Bootkube prematurely exits if scheduler/controller-manager lose leader-election #372

Closed
aaronlevy opened this issue Mar 10, 2017 · 10 comments
Labels
kind/bug, priority/P0

Comments

@aaronlevy
Contributor

aaronlevy commented Mar 10, 2017

This is most easily surfaced when using self-hosted etcd, because there is a period of time while pivoting from boot-etcd to self-hosted etcd when the compiled-in control plane is unable to contact the etcd cluster (and therefore gives up leader status).

When either the scheduler or the controller-manager gives up leadership, it calls os.Exit(), which kills bootkube entirely - even though the bootstrap process may not be complete (e.g. boot-etcd still running, or some other incomplete state).
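To make the failure mode concrete, here is a minimal toy Go sketch (not bootkube's actual code) of why an in-process os.Exit() is fatal to the whole bootstrap:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Toy stand-in for the compiled-in controller-manager/scheduler: when its
// leader-election lease can no longer be renewed, it exits fatally, which
// terminates the entire process it is linked into.
func runControllerManager(lostLease <-chan struct{}) {
	<-lostLease
	fmt.Println("controller-manager: lost leader-election lease")
	os.Exit(255) // same effect as glog.Fatalf in the real components
}

func main() {
	lostLease := make(chan struct{})
	go runControllerManager(lostLease)

	// Simulate the etcd pivot: partway through, boot-etcd becomes unreachable
	// and the lease cannot be renewed.
	go func() {
		time.Sleep(1 * time.Second)
		close(lostLease)
	}()

	fmt.Println("bootkube: bootstrap in progress (boot-etcd not cleaned up yet)")
	time.Sleep(5 * time.Second)                 // bootstrap work still running...
	fmt.Println("bootkube: bootstrap complete") // ...never printed: the os.Exit above killed the process
}
```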

One fix might be to just provide the bootkube api-server with the expected addresses of both etcds (boot-etcd and the self-hosted etcd service IP) when running bootkube start. For example: https://github.com/kubernetes-incubator/bootkube/blob/master/hack/multi-node/bootkube-up#L16

Another solution in the future might be to use static manifests for control plane components, such that if they exit they won't affect bootkube.

/cc @hongchaodeng @xiang90

@aaronlevy aaronlevy added kind/bug and priority/P0 labels Mar 10, 2017
@peebs

peebs commented Mar 10, 2017

I see this occasionally in the CI. For example: https://jenkins-tectonic.prod.coreos.systems/job/bootkube-dev/454/console

@aaronlevy
Contributor Author

We may need to turn off the self-hosted etcd tests running automatically (but allow them to be manually triggered) - they seem to be failing PRs that shouldn't fail.

@aaronlevy
Contributor Author

Or we deal with the known flakes

@peebs

peebs commented Mar 10, 2017

The main flakes I observe these days seem to be the leader-election one and the etcd scaling test not completing its scale-down phase. I bumped the time we wait for the pods to scale to 200 seconds and still see flakes sometimes, so I suspect it's getting wedged.

Shall I move the etcd tests all back into the optional job? The upside is that the green checkmark is easier to get. The downside is that we will probably run the optional tests less often and have to remember to run them. Either way it means more robot interactions for us.

@aaronlevy
Contributor Author

This one might not be a very complex fix (assuming it just means an extra flag provided to bootkube start). Maybe just leave them for now - and if this isn't an easy resolution we can revisit.

@Quentin-M
Contributor

The leader-election lease is 15 seconds by default for the CM and Sched. However, as soon as the init container of the new etcd node, started by the etcd operator, adds a member to the cluster, the cluster loses quorum. If by any chance the CM or Sched then tries to renew its lock - which can happen anytime between 0.01s and 15s - before the etcd instance itself is up (which is distinct from the init container), the leader election fails and the CM/Sched dies, killing bootkube with it.
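For reference, this is roughly the shape of that election, sketched with the current client-go leaderelection API (the lock name, identity, and kubeconfig handling here are illustrative; the 2017 component code differs in detail but uses the same 15s default lease and a fatal exit on lost leadership):

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "kube-controller-manager", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "bootstrap-cm"}, // illustrative identity
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // the 15s default mentioned above
		RenewDeadline: 10 * time.Second, // a renewal must land within this window
		RetryPeriod:   2 * time.Second,  // so a renew attempt can fall anywhere inside the quorum outage
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				<-ctx.Done() // controller loops would run here
			},
			OnStoppedLeading: func() {
				// The fatal path: if a renewal fails while etcd has no quorum
				// (or is just slow), the in-process component exits and takes
				// bootkube down with it.
				os.Exit(255)
			},
		},
	})
}
```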

@Quentin-M
Contributor

It seems, however, that bootkube died before the migration even started. It appears that boot-etcd experienced request/sync timeouts, maybe related to slow disk on AWS (EBS), which might have failed the lease renewal and killed the CM/Sched with glog.Fatalf() -> os.Exit(255). Additionally, the TPR was shown to be ready but getBootEtcdPodIP() never succeeded. That function calls a List() operation, which is served by etcd - another hint that etcd might have been too slow. This means the issue could actually happen regardless of whether you use self-hosted etcd, if your etcd cluster is slow enough. It is exacerbated by the fact that we run the etcd instance on the same node, download containers (disk bandwidth, ..), etc. Running the CM/Sched as static pods so they can restart seems like the easiest solution - if not trapping the syscall :D
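For context, a getBootEtcdPodIP()-style lookup boils down to roughly the following sketch (the namespace and label selector are assumptions for illustration, not bootkube's exact query); the List() has to be answered out of etcd, which is why a slow etcd stalls it:

```go
package bootstrap

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bootEtcdPodIP is a rough stand-in for getBootEtcdPodIP(): a pod List()
// through the API server, which in turn has to read from etcd, so a slow or
// quorum-less etcd stalls or fails it.
func bootEtcdPodIP(ctx context.Context, client kubernetes.Interface) (string, error) {
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "app=boot-etcd", // assumed label, for illustration only
	})
	if err != nil {
		return "", err // the call that "never succeeded" when etcd was too slow
	}
	for _, p := range pods.Items {
		if p.Status.PodIP != "" {
			return p.Status.PodIP, nil
		}
	}
	return "", fmt.Errorf("boot-etcd pod has no IP assigned yet")
}
```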

@Quentin-M
Contributor

Using SSD-backed huge machines and disabling the GC on the CM allowed it to work at least once. That does seem to work around the lease renewal failure caused by etcd being too slow.

However, we might still hit a renewal failure when the etcd quorum is broken (a member added but not yet started/synced), especially given the very short lease renewal interval.
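To make that quorum window concrete, here is a sketch of the kind of member add the etcd operator's init container performs (the endpoints and peer URL are made up); between this call returning and the new etcd process actually starting and syncing, a formerly single-member cluster needs 2 of 2 votes and cannot serve quorum requests:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:12379"}, // hypothetical boot-etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Adding a member to a single-node cluster raises the quorum requirement
	// from 1/1 to 2/2 immediately, but the new member only counts once its
	// etcd process starts and syncs. Until then, every quorum read/write --
	// including leader-election lease renewals going through the API server --
	// can time out.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := cli.MemberAdd(ctx, []string{"http://10.2.0.5:2380"}) // hypothetical peer URL
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member %x; cluster now needs 2/2 members for quorum", resp.Member.ID)
}
```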

@aaronlevy
Contributor Author

On the bootkube side there are two things we should do to mitigate this issue:

• Allow bootkube start to reference multiple etcd endpoints (opened an issue to specifically track that fix: #411)
• Move to using static manifests for the temp control plane: #168

@aaronlevy
Contributor Author

This should be resolved by #425 (static manifests in the control plane) and #418 (the temp control plane should reference both the temp and self-hosted etcd).
