self hosted etcd: checkpoint iptables on master nodes #284
Comments
FWIW, checkpointing iptables would also enable the API server to become a deployment instead of a daemonset, which would let it move around. @xiang90 it seems to me that there is a chance the iptables restored from a checkpoint might be stale. I think that could lead to the cluster not coming back up in some cases.
Yes, I am aware. We are not trying to solve this problem 100%. The problem we are trying to solve here is the case where all API components are down. This is usually caused by a power failure of a DC or a rack, or by someone playing around with self-hosted clusters to see their recovery ability. So the chance of hitting it is already pretty low. Moreover, stale iptables are an issue for the etcd case only if ALL etcd members have moved; in other words, the service IP in iptables points to no valid etcd member. Assume we checkpoint iptables every minute. The chance of all etcd members moving around within one minute is VERY low. Combining the two very-low-probability events, I would not worry about it. If it does happen, we need a full backup/recovery. Maybe I am missing something you are concerned about?
@xiang90 the case I'm considering (for our own scenarios) is a full power-off and power-on of a self-hosted cluster. I think you are making a good point -- if the checkpointing interval is small and you're running HA, then it's a low-probability event for the cluster to not come back up after a power failure. What I'm trying to get my head around is: if the cluster comes back up but the checkpoint information is stale, will it converge on the correct authoritative state in time? For example, say one (out of three) etcd members moved from 10.0.0.5 to 10.0.0.6 but we did not checkpoint the iptables (and pod manifest) in time and a power-off happened. The cluster comes back up, the etcd member will still run on 10.0.0.5 (that's where the last checkpointed manifest and iptables had it), but etcd's own state has it at 10.0.0.6. So it's effectively a lost member (yet it is still running). How does this cluster converge and clean up?
Let's assume the other members are 10.0.0.1 and 10.0.0.2, besides 10.0.0.5 (which moved to 10.0.0.6 later on).
The API server should be able to serve requests since it can still reach 10.0.0.1 and 10.0.0.2. The kube-proxy should be able to pull the latest service IP mapping from the running API server (backed by 10.0.0.1, 10.0.0.2, and 10.0.0.6). The kube-proxy will set up the service mapping and take over control. It will do so because it has no idea about the iptables restore we just did; it assumes the iptables rules are empty and does its initial sync. This is my understanding of how kube-proxy should work. I might be wrong, but based on my limited experiments, it works this way. /cc @aaronlevy might know more details than me.
/cc @bassam
What is the service IP used for? Can we make the service IP static, as it is for DNS? Would that help? Also, can we remove the service IP requirement somehow to just side-step this whole issue? What is the service IP fixing?
Load balancing + hiding the actual etcd pod IPs, which are subject to change. Otherwise you would have to restart the API server to track etcd member changes.
The service IP is static; the mapping is dynamic.
Same as question 1.
@xiang90 sounds like you're saying kube-proxy would overwrite the stale rules and things would converge. @philips I think a service for etcd enables the members to float. I was hoping we could do the same for the API server as well. I experimented with it a bit (bassam@52c0499) but didn't get far.
Yes. After it starts, it should. @aaronlevy, right?
I would assume that's its behavior, but I have not personally verified this. E.g., does it behave differently if there happens to be existing state? I'd be surprised if it did change its behavior, but it's not something I've looked at directly. @philips does checkpointing the iptables rules concern you? Operationally, this isn't something I've had much experience with -- so if this seems like a bad idea, we can revisit more closely. The options we've discussed so far were: just using the systemd units that already ship with the OS (CoreOS specific, but easy enough to add elsewhere), then validating that this works as expected; and then maybe even checkpointing the rules ourselves in the pod-checkpointer code (and restoring on reboot).
From a UX perspective, I feel it is better to do it in the pod-checkpointer. A systemd unit is an extra step for the user.
After a few rounds of internal (CoreOS) discussion, we decided to try out iptables checkpointing with a user-space program first. We wrote a small tool (kenc) that can checkpoint/restore iptables to/from a given file. It can help a self-hosted-etcd Kubernetes deployment recover from a case where all API components are down. I propose to have kenc run in recovery mode as an init container, and to add kenc running in checkpointing mode to the API server pod. @aaronlevy opinion?
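To make the proposal concrete, here is a rough sketch of what that API server pod could look like. Everything kenc-specific below (image name, command-line flags, checkpoint path) is an assumption for illustration rather than kenc's actual CLI, and the API server container is abridged:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true                 # iptables live in the host network namespace
  initContainers:
  # Hypothetical: restore the last iptables checkpoint before the API server
  # starts, so the etcd service IP is reachable even with no API server up yet.
  - name: kenc-restore
    image: quay.io/coreos/kenc:latest          # placeholder image/tag
    command: ["/usr/bin/kenc", "-r", "-f", "/var/lib/kenc/iptables.checkpoint"]  # flags are assumptions
    securityContext:
      privileged: true              # needed to run iptables-restore against the host
    volumeMounts:
    - name: checkpoint
      mountPath: /var/lib/kenc
  containers:
  - name: kube-apiserver
    image: quay.io/coreos/hyperkube:v1.6.4_coreos.0   # placeholder
    command: ["/hyperkube", "apiserver", "--etcd-servers=https://10.3.0.15:2379"]  # abridged; etcd service IP is illustrative
  # Hypothetical sidecar: periodically checkpoint the service iptables rules.
  - name: kenc-checkpoint
    image: quay.io/coreos/kenc:latest          # placeholder image/tag
    command: ["/usr/bin/kenc", "-c", "-f", "/var/lib/kenc/iptables.checkpoint"]  # flags are assumptions
    securityContext:
      privileged: true
    volumeMounts:
    - name: checkpoint
      mountPath: /var/lib/kenc
  volumes:
  - name: checkpoint
    hostPath:
      path: /var/lib/kenc           # kept on the host so the checkpoint survives reboots (path is an assumption)
```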
Link to the kenc project: https://github.com/coreos/kenc/
From a previous offline discussion, the general architecture discussed:
Can we expand on this / write up a design document so we can make sure we're all on the same page in terms of implementation? For example, I would prefer that the "checkpointing" aspect of this tool run as a separate daemonset (not as part of the api-server). Similarly, I would initially think it makes more sense for the "restore" init container to run as part of kube-proxy, not the api-server (that enforces that our restore occurs before kube-proxy starts, and they will have less chance of conflict when operating on the same rules).
I do not want this tool to run on non-master nodes. Can you explain the conflict part? I think if we run the init phase before the API server gets started, there should not be any conflict. kube-proxy should do nothing until it can reach the API server.
The restore or checkpoint or both?
There could be multiple api-servers; there's no guarantee that the local kube-proxy will not speak to a different api-server before the local api-server is started.
both.
OK. Fair. I want to avoid the initial unnecessary timeouts: components -> local API server -> etcd VIP -> times out. Even in this case, there should be no conflicts. kube-proxy never tries to interpret the existing rules; it simply flushes the KUBE chains atomically. The worst case is that the iptables rules get rewritten from a stale checkpoint, but that should be OK since kube-proxy will eventually update them back. I can double-check this.
What are the benefits of only checkpointing parts of the services? Why not checkpoint the entire NAT table initially? The latter is significantly simpler to get started with.
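For concreteness, the two options roughly amount to something like the following sketch (the checkpoint path is arbitrary, and a real tool would also need to carry over the PREROUTING/OUTPUT jumps into the KUBE chains):

```sh
# Option A: checkpoint the entire NAT table (simpler to start with).
iptables-save -t nat > /var/lib/kenc/iptables.checkpoint

# Option B: checkpoint only the kube-proxy service chains
# (KUBE-SERVICES / KUBE-SVC-* / KUBE-SEP-* plus the jumps into them).
iptables-save -t nat \
  | grep -E '^(\*nat|COMMIT|:KUBE-|-A KUBE-|-A (PREROUTING|OUTPUT) .*-j KUBE-)' \
  > /var/lib/kenc/iptables.checkpoint

# On reboot, restore without flushing whatever rules already exist.
iptables-restore --noflush < /var/lib/kenc/iptables.checkpoint
```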
If you want to restrict where this runs, that depends on how we deploy the tools. If the "checkpoint" process is deployed as its own daemonset, it's as simple as adding a nodeSelector to the manifest (see the sketch below). For the restore process, if we just have it run alongside kube-proxy and only continue down the restore path if the checkpoint file exists, we're not really adding any overhead or additional risk of timeout - and we shouldn't assume there will be a locally available api-server. My concern with this approach is that we would need to be careful to protect that checkpoint file location and/or only allow validated https traffic, so a bad actor couldn't just put in a checkpoint for their own backend. I'm a bit open on what we do here - not immediately sure of the best option.
Given the option, I would much prefer that we don't risk the conflict and just enforce the ordering if we can.
This greatly reduces the risk / footprint of what we are checkpointing. My initial thinking is that it would be worth the effort to only checkpoint the rules we really care about.
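For reference, a minimal sketch of the daemonset option with the nodeSelector restriction; the image, flags, checkpoint path, and master node label are illustrative assumptions, not the manifests this repo ships:

```yaml
apiVersion: extensions/v1beta1        # era-appropriate; apps/v1 today
kind: DaemonSet
metadata:
  name: kenc-checkpointer
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kenc-checkpointer
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""    # only run the checkpointer on master nodes
      hostNetwork: true                       # needs the host's iptables
      containers:
      - name: kenc
        image: quay.io/coreos/kenc:latest     # placeholder image/tag
        # Hypothetical flags: checkpoint the service rules periodically, less
        # often than kube-proxy's own sync so the overhead stays negligible.
        command: ["/usr/bin/kenc", "-c", "-interval=1m", "-f", "/var/lib/kenc/iptables.checkpoint"]
        securityContext:
          privileged: true                    # iptables-save needs host privileges
        volumeMounts:
        - name: checkpoint
          mountPath: /var/lib/kenc
      volumes:
      - name: checkpoint
        hostPath:
          path: /var/lib/kenc                 # keep checkpoints on the host across reboots
```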
One of the reasons I want this tool co-located with the API server is to ease deployment. Conceptually, this tool is complementary to the API server, not to kube-proxy. Kube-proxy should not rely on this tool to operate.
OK. I can add logic to the tool so it does not overwrite iptables rules in the restore path. So if kube-proxy has already populated the table, it won't restore at all. (The initial detection logic might still allow a super-tight window where kube-proxy flushes the table right after the check, but I guess that is OK for now.)
As long as we checkpoint less frequently than kube-proxy syncs, we will not impose a huge overhead from a performance standpoint.
Can you write up a design doc PR? I feel like we need to have a pretty well-defined process prior to committing to code. While this is complementary to the api-server - it does not function, nor is it useful, in the absence of the proxy. And it is much easier to enforce ordering with an init container in the proxy. If we add this to the api-server - we risk conflicting with kube-proxy much more easily. I could be convinced otherwise - but I don't really see any benefit to tying this to the api-server, while I do see benefit to tying it to kube-proxy.
We discussed the risk of the API changing - by only checkpointing the services we explicitly outline for this, we greatly reduce the risk. I would highly recommend we implement this and not just checkpoint everything blindly. I think we need to have a solid design / process that is agreed upon before moving forward with code.
OK. Sure. Will do.
This should be closed by #380
Self-hosted etcd relies on a service IP to work correctly. The Kubernetes API server contacts the etcd pods by service IP (load balancing + hiding the actual etcd pod IPs, which are subject to change).
The service IP relies on the API server to be restored after a machine reboot. If we restart all API servers at the same time, the service IP is not recoverable.
To solve this chicken-and-egg issue, we have to checkpoint the iptables rules (which do all the heavy lifting for service IPs).
I have tried the iptables checkpoint approach; it works well, at least for the hack/multi-node example.
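For context, the heavy lifting is the set of NAT rules kube-proxy programs for the etcd service. In iptables-save form they look roughly like this (chain names and addresses are illustrative; real kube-proxy chains use hashed suffixes):

```
# abridged iptables-save output from the nat table (illustrative names/IPs)
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.3.0.15/32 -p tcp -m tcp --dport 2379 -j KUBE-SVC-ETCD
-A KUBE-SVC-ETCD -m statistic --mode random --probability 0.333 -j KUBE-SEP-ETCD1
-A KUBE-SVC-ETCD -m statistic --mode random --probability 0.5 -j KUBE-SEP-ETCD2
-A KUBE-SVC-ETCD -j KUBE-SEP-ETCD3
-A KUBE-SEP-ETCD1 -p tcp -m tcp -j DNAT --to-destination 10.2.1.5:2379
-A KUBE-SEP-ETCD2 -p tcp -m tcp -j DNAT --to-destination 10.2.2.7:2379
-A KUBE-SEP-ETCD3 -p tcp -m tcp -j DNAT --to-destination 10.2.3.9:2379
```

Without these rules, and with no reachable API server to let kube-proxy regenerate them, the etcd service IP the API server points at goes nowhere; that is the chicken-and-egg described above.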
/cc @aaronlevy