Support deploying self-hosted etcd #31
Comments
We need the data that ends up in etcd to persist with the cluster that is launched. If that data lives in the bootkube, then bootkube must continue to run for the lifecycle of the cluster. Alternatively, we need a way to pivot the etcd data injected during the bootstrap process to the "long-lived" etcd cluster. The long-lived cluster would essentially be a self-hosted etcd cluster launched as k8s components just like the rest of the control plane. What I'd probably like to see is something along the lines of:
Another option might be trying to copy the etcd keys from the in-process/local node to the self-hosted node, but this can get a little messy because we would be trying to manually copy (and mirror) data of a live cluster. Some concerns with this approach:
|
@philips what do you think about changing this issue to "support self-hosted etcd" and dropping it from the 0.1.0 milestone? |
Adding notes from a side-discussion: Another option that was mentioned is just copying keys from bootkube-etcd to cluster-etcd. This would require some coordination points in the bootkube process:
|
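The key-copy idea mentioned above, and the concern about copying from a live cluster, can be illustrated with a small sketch. This is not bootkube code: the two stores are plain dictionaries standing in for bootkube-etcd and cluster-etcd, the function names `copy_keys` and `verify_copy` are hypothetical, and the key paths are only illustrative. A real migration would go through an etcd client and would need the coordination points discussed in this thread.

```python
# Hypothetical in-memory stand-ins for the two etcd instances; a real
# migration would use an etcd client, which is not shown here.
def copy_keys(source: dict, target: dict) -> int:
    """Copy every key from source to target and return the number copied."""
    snapshot = dict(source)  # take a point-in-time view before copying
    target.update(snapshot)
    return len(snapshot)


def verify_copy(source: dict, target: dict) -> bool:
    """Return True when every source key exists in target with the same value."""
    return all(key in target and target[key] == value
               for key, value in source.items())


# Illustrative keys only; the values are placeholders.
boot_etcd = {
    "/registry/pods/kube-system/kube-apiserver": "spec-a",
    "/registry/services/default/kubernetes": "spec-b",
}
cluster_etcd = {}

copied = copy_keys(boot_etcd, cluster_etcd)
consistent = verify_copy(boot_etcd, cluster_etcd)

# A write that lands on the boot instance after the snapshot is missed,
# which is why writes would have to be paused (or mirrored) during the copy.
boot_etcd["/registry/pods/default/late-pod"] = "spec-c"
stale = not verify_copy(boot_etcd, cluster_etcd)
```

The last three lines show the failure mode: unless the API server stops writing to the boot instance before the copy starts, the target can silently fall behind.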
How do you want the self-hosted apiserver to discover the location of self-hosted etcd?
v1.3.5 talking to etcd v3.0.3 edit: |
I've managed to get this done using a separate etcd cluster where each k8s node (master/minion) runs etcd in proxy mode. I'm using Terraform to configure both. The etcd module is available here and the k8s module is available here. P.S: The master is not volatile and cannot be scaled. If the master node reboots it will not start any of the components again, not sure why but P.P.S: I had a few issues doing that but mostly related to me adding |
@xiang90 and @hongchaodeng can you put some thoughts together on this in relation to having an etcd controller. I think there are essentially two paths:
I think option 2 is better because it means we don't have to worry about cutting over and having split brain. But! How do we do 2 if the cluster only intends to have one etcd member (say in AWS, because you will have a single-machine cluster backed by EBS)? I think we should try to prototype this ASAP, as this is the last remaining component that hasn't been proven to be self-hostable. |
@philips I have thought about this a little bit. And here is the workflow in my mind:
... k8s is ready...
Now the etcd controller fully controls the etcd cluster and can grow it to the desired size. |
Started some work on this - got a bootkube-hosted etcd cluster up, now working on migrating from the bootkube instance to the etcd-controller managed instance |
@philips what happens if the self-hosted etcd cluster (or the control plane behind it) dies? I believe this is why @aaronlevy mentioned it is:
This is exactly the concern I shared in the design proposal. Can this issue clarify if this concept is simply meant for non-production use-cases? |
@pires if the self-hosted etcd cluster dies you need to recover using bootkube from a backup. This is really no different than if it died normally and you would have to redeploy the cluster from a backup and restart the API servers again. |
@philips can you point me to the backup strategy you guys are designing or already implementing? |
I believe the backup @philips mentioned is actually the etcd backup. For the etcd-controller, we do a backup:
|
I understand the concept and it should work as you say, I'm just looking for more details on:
Don't get me wrong, I find this really cool and I'm trying to grasp it as much as possible as sig-cluster-lifecycle looks into HA. |
The data is stored on local storage. etcd has a built-in recovery mechanism. When you have a 3-member etcd cluster, you already have 3 local copies.
Backup is for extra safety. It helps with rollback and disaster recovery.
If there is a disaster case or bad upgrade, we recover the cluster from the backup. |
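To make the "3 local copies" point concrete: etcd replicates through Raft, which needs a majority of members to agree, so the number of members that can be lost without losing availability follows from simple arithmetic. The helper names below are hypothetical and only illustrate that calculation.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


for size in (1, 3, 5):
    print(size, quorum(size), tolerated_failures(size))
```

A 3-member cluster keeps working with one member down, while a single-member cluster tolerates no failures at all, which is where the backup becomes the only recovery path.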
As an update on the etcd and self-hosted plan we have merged support behind an experimental flag in bootkube: https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/bootkube/start.go#L37 This is self-hosted and self-healing etcd on top of Kubernetes. |
@philips |
@gitoverflow That is the plan. |
Although bootkube now supports a self-hosted etcd pod for bootstrapping, I can't find any documentation which explains:
|
@jamiehannaford You're right - and we do need to catch up on Documentation. Some tracking issues: Regarding your questions:
|
@aaronlevy Thanks for the links. I'm still wrapping my head around the boot-up procedure. It seems the chronology for a self-hosted etcd cluster is:
My question is, why does the self-hosted etcd need to wait for certain pods to exist before the data migration happens? I thought the data migration would happen first, then all the final control plane elements would be created. I looked at the init args for kube-apiserver, and it has the eventual IPv4 of the real etcd ( |
It could likely work in this order as well - but there could be more coordination points (vs just "everything is running - so let the etcd-operator take over"). For example, we would need to make sure to deploy kube-proxy & etcd-operator, then do the etcd pivot, then create the rest of the cluster components. Where right now it's just "create all components that exist in the /manifest dir, wait for some of them, do etcd-pivot" - which initially is easier. Are there any issues particular to the current order that you've found?
Sort of. Really everything pivots around etcd / apiserver addressability. The "real" api-server doesn't immediately take over, because it is unable to bind on 8080/443 (bootkube apiserver is still listening on those ports). The rest of the components don't know if they're talking to bootkube-apiserver or "real" apiserver. It's just an address they expect to reach. So when we're ready to pivot to the self-hosted control-plane, it's simply a matter of exiting the bootkube-apiserver so the ports free up. You're right that there will be a moment where no api-server is bound to the ports - but it's actually fine in most cases for components to fail/retry - much of Kubernetes is designed this way (including its core components).
However, there currently is an issue where the bootkube-apiserver is still "active" but expects to talk only to the static/boot etcd node, and that node may already have been removed as a member of the etcd cluster. This puts us in a state where the "active" bootkube apiserver can no longer reach the data-store and essentially becomes inactive. See #372 for more info.
The above issue might be as simple as adding both the boot-etcd address and the service IP for self-hosted etcd to the bootkube api-server; I just haven't had a chance to test that assumption. |
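The fail/retry behaviour described above can be sketched in a few lines. This is only an illustration of the general pattern, not how any Kubernetes component implements it: `FlakyEndpoint` is a hypothetical stand-in for the apiserver address during the window where nothing is bound to the port, and the attempt count and delays are arbitrary.

```python
import time


class FlakyEndpoint:
    """Hypothetical endpoint that refuses connections a few times, then answers."""

    def __init__(self, failures_before_ready: int):
        self.remaining = failures_before_ready
        self.calls = 0

    def request(self) -> str:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionRefusedError("nothing is listening on the port yet")
        return "ok"


def call_with_retry(endpoint, attempts: int = 5, base_delay: float = 0.01) -> str:
    """Retry on connection errors with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return endpoint.request()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


# Two refused connections model the gap between bootkube-apiserver exiting
# and the self-hosted apiserver binding to the same address.
endpoint = FlakyEndpoint(failures_before_ready=2)
result = call_with_retry(endpoint)
```

Because clients only know an address, a short gap during the pivot looks like any other transient failure and is absorbed by the retry loop.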
From the README:
In the original prototype we had a built in etcd. Why is that no longer part of this?