Simple scheduler & controller-manager disaster recovery #112
Thinking longer term, rather than a separate
👍 I think an interesting idea is to play with re-running. In the static manifest case this isn't really a problem, because the pods which run aren't scheduled, so the scheduler failing in any scenario always results in it being restarted and able to run (assuming it can grab its lease). I wonder if there's a good way to do this, perhaps by checkpointing the scheduler to nodes which have
Also, definitely agree that a kubelet-pod-api that skips scheduling makes this story way better.
Hi guys, I was trying bootkube and I had a similar case. In my case, I figured that the controller had no
One low-hanging fruit is that we should be deploying multiple copies of the controller-manager/scheduler. In that case you would be doing a rolling-update of the component, verifying that the new functionality works before destroying all of the old copies. However, there are still situations where we have a loss of all schedulers and/or controller-manager (e.g. maybe a flag change is subtly broken, but the pod is still running so the deployment manager rolls out all broken pods). You could launch a new master as an option, but if you still have an api-server/etcd running you should be able to recover. Essentially you would need to inject a controller-manager pod into the cluster, then delete it as soon as your existing controller-manager deployment has been scheduled. For example:
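A minimal sketch of that first step, assuming the controller-manager runs as a deployment named kube-controller-manager in the kube-system namespace (adjust the names to match your cluster):

```sh
# Dump the existing controller-manager deployment from the api-server.
kubectl -n kube-system get deployment kube-controller-manager -o yaml > controller-manager.yaml
```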
Then take the podSpec section (the second indented spec, the one with a containers field). Something like:
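Roughly along these lines (the image and flags below are placeholders, not taken from a real render):

```yaml
# Illustrative podSpec extracted from the deployment; image/flags are placeholders.
containers:
- name: kube-controller-manager
  image: quay.io/coreos/hyperkube:<version>   # placeholder image
  command:
  - ./hyperkube
  - controller-manager
  - --leader-elect=true
  - --kubeconfig=/etc/kubernetes/kubeconfig
```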
Then wrap that in a pod header, and specify the nodeName it should run on:
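For example (pod name and node name are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: recovery-controller-manager
  namespace: kube-system
spec:
  nodeName: <target-node>   # pre-assigning the node bypasses the scheduler
  containers:
  - name: kube-controller-manager
    image: quay.io/coreos/hyperkube:<version>   # placeholder image
    command:
    - ./hyperkube
    - controller-manager
    - --leader-elect=true
    - --kubeconfig=/etc/kubernetes/kubeconfig
```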
Then inject it into the cluster. What will happen is that this pod will act as the controller-manager and convert your existing deployment/controller-manager into pods, which will then be scheduled. After that you can just delete the recovery pod:
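A sketch of those two steps, using the placeholder names from above:

```sh
# Inject the pre-assigned recovery pod (nodeName is set, so no scheduler is needed).
kubectl -n kube-system create -f recovery-controller-manager.yaml

# Once the real controller-manager pods exist and have been scheduled,
# remove the temporary recovery pod.
kubectl -n kube-system delete pod recovery-controller-manager
```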
Just hit this as well, as a result of a Container Linux reboot.
In addition, if your scheduler is down, recovering the scheduler itself seems to require ssh-ing into a machine and creating a static manifest for it.
You can do the same steps I outlined above for the scheduler as well (and don't need to actually ssh into a machine to create the static manifest).
But who schedules the recovery scheduler if the scheduler is dead? :-)
Per the steps outlined above, you would populate the pod's nodeName field directly, so the recovery scheduler never needs to be scheduled itself.
Oh right!! That's great. Thanks :-)
Is this documented anywhere? Could help our users!
Working on it @mfburnett, as I hit the same issue today while upgrading. Should I create a doc defect for this, or will you do it for me? cc @aaronlevy
Thanks @radhikapc!
Just to track some internal discussions -- another option might be to propose a sub-command to kubectl upstream. Not sure of UX specifics, but maybe something like:
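Purely hypothetical UX; no such sub-command exists today, and the name and flags below are illustrative only:

```sh
# Hypothetical: read the deployment's podSpec and create a pod pre-bound to a node,
# bypassing both the controller-manager and the scheduler.
kubectl recover deployment/kube-scheduler -n kube-system --node=<target-node>
```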
Another simple way to hook back a Pod to a Node, when Scheduler + Controller-manager are dead:
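A sketch of that approach, posting a Binding against the pending pod's binding subresource (the apiserver address, namespace, and pod name are assumptions; adjust auth and names for your cluster):

```sh
# Bind an existing pending pod to a node by creating its Binding subresource directly.
curl -H "Content-Type: application/json" -X POST \
  --data @binding.json \
  http://127.0.0.1:8080/api/v1/namespaces/kube-system/pods/<pending-pod-name>/binding
```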
with this content:
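For example, a binding.json along these lines (pod and node names are placeholders):

```json
{
  "apiVersion": "v1",
  "kind": "Binding",
  "metadata": { "name": "<pending-pod-name>" },
  "target": {
    "apiVersion": "v1",
    "kind": "Node",
    "name": "<target-node>"
  }
}
```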
A Binding is the object injected into the K8s cluster when the Scheduler makes a scheduling decision. You can do the same manually.
Thanks! This prevented me from restarting the whole cluster (again). Someone should probably add this to: https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/disaster-recovery.md
Been thinking about this: can't the checkpointer help here? Mark the controller-manager as a pod to be checkpointed; the checkpointer would then recover it whenever it can't find it running on the node. There is an edge case where it recovers too many pods, but given that they do leader election it is fine to have a few extra running. As for the scheduler, making it a DaemonSet would mean it is enough for just the controller-manager & apiserver to be alive, since DaemonSet pods are scheduled by their controller, at least in the current 1.8.x release.
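A minimal sketch of that DaemonSet idea (API version, image, flags, and the master node selector are assumptions; a real manifest would also need the usual volumes, tolerations, and kubeconfig wiring):

```yaml
apiVersion: apps/v1beta2          # or apps/v1 on newer clusters
kind: DaemonSet
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-scheduler
  template:
    metadata:
      labels:
        k8s-app: kube-scheduler
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""   # assumed master label
      containers:
      - name: kube-scheduler
        image: quay.io/coreos/hyperkube:<version>   # placeholder image
        command:
        - ./hyperkube
        - scheduler
        - --leader-elect=true
```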
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
There are potential failure cases where you have permanently lost all schedulers and/or all controller-managers, and recovery leaves you in a chicken-and-egg state:
For example, assume you have lost all schedulers - but still have a functioning api-server which contains the scheduler deployment object:
You need a controller-manager to convert the deployment into unscheduled pods, and you need a scheduler to then schedule those pods to nodes (scheduler to schedule the scheduler, if you will).
While these types of situations should be mitigated by deploying across failure domains, it is still something we need to cover.
In the short term this could mean documenting, for example, how to create a temporary scheduler pod pre-assigned to a node (once a single scheduler exists, it will then schedule the rest of the scheduler pods).
Another option might be to build a tool which knows how to do this for you based on parsing an existing object:
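Hypothetical sketch; no such tool exists yet, and the name and flags are illustrative only:

```sh
# Hypothetical: parse the scheduler deployment and create a pod pre-assigned to a node.
kube-recover --object deployment/kube-scheduler --namespace kube-system --node <target-node>
```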
Where the kube-recover tool could read the object from the api-server (or from disk), parse out the podSpec, and pre-assign a pod to the target node (bypassing the need for both the controller-manager and the scheduler).