self hosted etcd: checkpoint iptables on master nodes #284
Comments
FWIW, checkpointing iptables would also enable the API server to become a deployment instead of a daemonset, which would let it move around. @xiang90 it seems to me that there is a chance the iptables restored from a checkpoint might be stale. I think that could lead to the cluster not coming back up in some cases.
Yes, I am aware. We are not trying to solve this problem 100%. The problem we are trying to solve here is the case where all API components are down. This is usually caused by a power failure of a DC or a rack, or by someone playing around with self-hosted clusters to see their recovery ability. So the chance of hitting it is already pretty low. Moreover, stale iptables are an issue for the etcd case only if ALL etcd members have moved; in other words, the service IP in iptables points to no valid etcd member. Assume we checkpoint iptables every minute. The chance of all etcd members moving around within one minute is VERY low. Combining the two very-low-probability events, I would not worry about it. If it does happen, we need a full backup/recovery. Maybe I am missing something you are concerned about?
@xiang90 the case I'm considering (for our own scenarios) is a full power-off and power-on of a self-hosted cluster. I think you are making a good point -- if the checkpointing interval is small and you're running HA, then it's a low-probability event for the cluster to not come back up after a power failure. What I'm trying to get my head around is: if the cluster comes back up but the checkpoint information is stale, will it converge on the correct authoritative state in time? For example, say one (out of three) etcd members moved from 10.0.0.5 to 10.0.0.6 but we did not checkpoint the iptables (and pod manifest) in time and a power-off happened. The cluster comes back up, the etcd member will still run on 10.0.0.5 (that's where the last checkpointed manifest and iptables had it), but etcd's own state has it at 10.0.0.6. So it's effectively a lost member (yet it is still running). How does this cluster converge and clean up?
Let's assume the other members are 10.0.0.1 and 10.0.0.2, besides 10.0.0.5 (which moved to 10.0.0.6 later on).
The API server should be able to serve requests since it can still reach 10.0.0.1 and 10.0.0.2. The kube-proxy should be able to pull the latest service IP mapping from the running API server (backed by 10.0.0.1, 10.0.0.2, and 10.0.0.6). The kube-proxy will set up the service mapping and take over control. It will do so because it has no idea about the iptables restore we just did; it assumes the iptables rules are empty and does its initial sync. This is my understanding of how kube-proxy should work. I might be wrong, but based on my limited experiments, it works this way. /cc @aaronlevy might know more details than me.
/cc @bassam
What is the service IP used for? Can we make the service IP static, as it is for DNS? Would that help? Also, can we remove the service IP requirement somehow to just side-step this whole issue? What is the service IP fixing?
Load balancing + hiding the actual etcd pod IPs, which are subject to change. Otherwise you would have to restart the API server to track etcd member changes.
The service IP is static; the mapping is dynamic.
Same as question 1.
@xiang90 sounds like you're saying kube-proxy would overwrite the stale rules and things would converge. @philips I think a service for etcd enables the members to float. I was hoping we could do the same for the API server as well. I experimented with it a bit (bassam@52c0499) but didn't get far.
Yes. After it starts, it should. @aaronlevy, right?
I would assume that's its behavior, but I have not personally verified this. E.g., does it behave differently if there happens to be existing state? I'd be surprised if it did change its behavior, but it's not something I've looked at directly. @philips does checkpointing the iptables rules concern you? Operationally, this isn't something I've had much experience with -- so if this seems like a bad idea, we can revisit more closely. The options we've discussed so far were: just using the systemd units that already ship with the OS (CoreOS specific, but easy enough to add elsewhere), then validating that this works as expected; and then maybe even checkpointing the rules ourselves in the pod-checkpointer code (and restoring on reboot).
From a UX perspective, I feel it is better to do it in the pod-checkpointer. A systemd unit is an extra step for the user.
After a few rounds of internal (CoreOS) discussion, we decided to try out iptables checkpointing with a user-space program first. We wrote a small tool (kenc) that can checkpoint/restore iptables to/from a given file. It can help a self-hosted-etcd Kubernetes deployment recover from a case where all API components are down. I propose to have kenc run in recovery mode as an init container, and to add kenc running in checkpointing mode to the API server pod. @aaronlevy opinion?
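To make the proposal concrete, here is a rough sketch of what that API server pod could look like. Everything kenc-specific below (image name, command-line flags, checkpoint path) is an assumption for illustration rather than kenc's actual CLI, and the API server container is abridged:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true                 # iptables live in the host network namespace
  initContainers:
  # Hypothetical: restore the last iptables checkpoint before the API server
  # starts, so the etcd service IP is reachable even with no API server up yet.
  - name: kenc-restore
    image: quay.io/coreos/kenc:latest          # placeholder image/tag
    command: ["/usr/bin/kenc", "-r", "-f", "/var/lib/kenc/iptables.checkpoint"]  # flags are assumptions
    securityContext:
      privileged: true              # needed to run iptables-restore against the host
    volumeMounts:
    - name: checkpoint
      mountPath: /var/lib/kenc
  containers:
  - name: kube-apiserver
    image: quay.io/coreos/hyperkube:v1.6.4_coreos.0   # placeholder
    command: ["/hyperkube", "apiserver", "--etcd-servers=https://10.3.0.15:2379"]  # abridged; etcd service IP is illustrative
  # Hypothetical sidecar: periodically checkpoint the service iptables rules.
  - name: kenc-checkpoint
    image: quay.io/coreos/kenc:latest          # placeholder image/tag
    command: ["/usr/bin/kenc", "-c", "-f", "/var/lib/kenc/iptables.checkpoint"]  # flags are assumptions
    securityContext:
      privileged: true
    volumeMounts:
    - name: checkpoint
      mountPath: /var/lib/kenc
  volumes:
  - name: checkpoint
    hostPath:
      path: /var/lib/kenc           # kept on the host so the checkpoint survives reboots (path is an assumption)
```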
Link to the kenc project: https://github.com/coreos/kenc/
From a previous offline discussion, the general architecture discussed:
Can we expand on this / write up a design document so we can make sure we're all on the same page in terms of implementation? For example, I would prefer that the "checkpointing" aspect of this tool run as a separate daemonset (not as part of the api-server). Similarly, I would initially think it makes more sense for the "restore" init container to run as part of kube-proxy, not the api-server (that enforces that our restore occurs before kube-proxy starts, and they will have less chance of conflict when operating on the same rules).
I do not want this tool to run on non-master nodes. Can you explain the conflict part? I think if we run the init phase before the API server gets started, there should not be any conflict. kube-proxy should do nothing until it can reach the API server.
The restore or checkpoint or both?
There could be multiple api-servers; there's no guarantee that the local kube-proxy will not speak to a different api-server before the local api-server is started.
both.
OK. Fair. I want to avoid the initial unnecessary timeouts: components -> local API server -> etcd VIP -> times out. Even in this case, there should be no conflicts. kube-proxy never tries to interpret the existing rules; it simply flushes the KUBE chains atomically. The worst case is that the iptables rules get rewritten from a stale checkpoint, but that should be OK since kube-proxy will eventually update them back. I can double-check this.
What are the benefits of only checkpointing parts of the services? Why not checkpoint the entire NAT table initially? The latter is significantly simpler to get started with.
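For concreteness, the two options roughly amount to something like the following sketch (the checkpoint path is arbitrary, and a real tool would also need to carry over the PREROUTING/OUTPUT jumps into the KUBE chains):

```sh
# Option A: checkpoint the entire NAT table (simpler to start with).
iptables-save -t nat > /var/lib/kenc/iptables.checkpoint

# Option B: checkpoint only the kube-proxy service chains
# (KUBE-SERVICES / KUBE-SVC-* / KUBE-SEP-* plus the jumps into them).
iptables-save -t nat \
  | grep -E '^(\*nat|COMMIT|:KUBE-|-A KUBE-|-A (PREROUTING|OUTPUT) .*-j KUBE-)' \
  > /var/lib/kenc/iptables.checkpoint

# On reboot, restore without flushing whatever rules already exist.
iptables-restore --noflush < /var/lib/kenc/iptables.checkpoint
```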
If you want to restrict where this runs, that depends on how we deploy the tools. If the "checkpoint" process is deployed as its own daemonset, it's as simple as adding a nodeSelector to the manifest (see the sketch below). For the restore process, if we just have it run alongside kube-proxy and only continue down the restore path if the checkpoint file exists, we're not really adding any overhead or additional risk of timeout - and we shouldn't assume there will be a locally available api-server. My concern with this approach is that we would need to be careful to protect that checkpoint file location and/or only allow validated https traffic, so a bad actor couldn't just put in a checkpoint for their own backend. I'm a bit open on what we do here - not immediately sure of the best option.
Given the option, I would much prefer that we don't risk the conflict and just enforce the ordering if we can.
This greatly reduces the risk / footprint of what we are checkpointing. My initial thinking is that it would be worth the effort to only checkpoint the rules we really care about.
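For reference, a minimal sketch of the daemonset option with the nodeSelector restriction; the image, flags, checkpoint path, and master node label are illustrative assumptions, not the manifests this repo ships:

```yaml
apiVersion: extensions/v1beta1        # era-appropriate; apps/v1 today
kind: DaemonSet
metadata:
  name: kenc-checkpointer
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kenc-checkpointer
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""    # only run the checkpointer on master nodes
      hostNetwork: true                       # needs the host's iptables
      containers:
      - name: kenc
        image: quay.io/coreos/kenc:latest     # placeholder image/tag
        # Hypothetical flags: checkpoint the service rules periodically, less
        # often than kube-proxy's own sync so the overhead stays negligible.
        command: ["/usr/bin/kenc", "-c", "-interval=1m", "-f", "/var/lib/kenc/iptables.checkpoint"]
        securityContext:
          privileged: true                    # iptables-save needs host privileges
        volumeMounts:
        - name: checkpoint
          mountPath: /var/lib/kenc
      volumes:
      - name: checkpoint
        hostPath:
          path: /var/lib/kenc                 # keep checkpoints on the host across reboots
```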
One of the reasons I want this tool co-located with the API server is to ease deployment. Conceptually, this tool is complementary to the API server, not to kube-proxy. Kube-proxy should not rely on this tool to operate.
OK. I can add logic to the tool so it does not overwrite iptables rules in the restore path. So if kube-proxy has already populated the table, it won't restore at all. (The initial detection logic might still allow a super-tight window where kube-proxy flushes the table right after the check, but I guess that is OK for now.)
As long as we checkpoint less frequently than kube-proxy syncs, we will not impose a huge overhead from a performance standpoint.
Can you write up a design doc PR? I feel like we need to have a pretty well-defined process prior to committing to code. While this is complementary to the api-server - it does not function, nor is it useful, in the absence of the proxy. And it is much easier to enforce ordering with an init container in the proxy. If we add this to the api-server - we risk conflicting with kube-proxy much more easily. I could be convinced otherwise - but I don't really see any benefit to tying this to the api-server, while I do see benefit to tying it to kube-proxy.
We discussed the risk of the API changing - by only checkpointing the services we explicitly outline for this, we greatly reduce the risk. I would highly recommend we implement this and not just checkpoint everything blindly. I think we need to have a solid design / process that is agreed upon before moving forward with code.
OK. Sure. Will do.
This should be closed by #380
Self-hosted etcd relies on a service IP to work correctly. The Kubernetes API server contacts the etcd pods by service IP (load balancing + hiding the actual etcd pod IPs, which are subject to change).
The service IP relies on the API server to be restored after a machine reboot. If we restart all API servers at the same time, the service IP is not recoverable.
To solve this chicken-and-egg issue, we have to checkpoint the iptables rules (which do all the heavy lifting for service IPs).
I have tried the iptables checkpoint approach; it works well, at least for the hack/multi-node example.
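For context, the heavy lifting is the set of NAT rules kube-proxy programs for the etcd service. In iptables-save form they look roughly like this (chain names and addresses are illustrative; real kube-proxy chains use hashed suffixes):

```
# abridged iptables-save output from the nat table (illustrative names/IPs)
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.3.0.15/32 -p tcp -m tcp --dport 2379 -j KUBE-SVC-ETCD
-A KUBE-SVC-ETCD -m statistic --mode random --probability 0.333 -j KUBE-SEP-ETCD1
-A KUBE-SVC-ETCD -m statistic --mode random --probability 0.5 -j KUBE-SEP-ETCD2
-A KUBE-SVC-ETCD -j KUBE-SEP-ETCD3
-A KUBE-SEP-ETCD1 -p tcp -m tcp -j DNAT --to-destination 10.2.1.5:2379
-A KUBE-SEP-ETCD2 -p tcp -m tcp -j DNAT --to-destination 10.2.2.7:2379
-A KUBE-SEP-ETCD3 -p tcp -m tcp -j DNAT --to-destination 10.2.3.9:2379
```

Without these rules, and with no reachable API server to let kube-proxy regenerate them, the etcd service IP the API server points at goes nowhere; that is the chicken-and-egg described above.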
/cc @aaronlevy