
Add disaster recovery documentation. #584

Merged
merged 5 commits into master from recovery-docs
Jun 19, 2017

Conversation

diegs
Contributor

@diegs diegs commented Jun 15, 2017

Fixes #432, also one way to address issues raised in #112.

@diegs diegs requested review from aaronlevy and xiang90 June 15, 2017 01:07
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 15, 2017
@diegs diegs self-assigned this Jun 15, 2017
@diegs
Contributor Author

diegs commented Jun 15, 2017

cc @coresolve

```
bootkube recover --asset-dir=recovered --etcd-backup-file=backup --kubeconfig=/etc/kubernetes/kubeconfig
```
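
The thread doesn't say how the `backup` file above is produced, and the exact format `--etcd-backup-file` expects isn't specified here. Purely as an illustration, a snapshot of an etcd v3 cluster can be taken with `etcdctl`; the endpoint and output path below are assumptions, and TLS flags are omitted:

```
# Illustrative only: produce an etcd v3 snapshot to use as a backup file.
# Endpoint and output path are assumptions; add TLS flags as needed.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 snapshot save backup
```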

Contributor

@xiang90 xiang90 Jun 16, 2017

we tried this with our field engineers yesterday.

there are a few things we need to make sure of before running this script (a rough pre-check sketch follows this list):

  1. kubelet is running on the machine
  2. no related containers are running (old etcd, old api server, etc.; this also applies to the other recovery cases, i believe)
  3. docker state is clean (`docker ps -a` does not contain old state for the relevant containers). kubelet has bugs where it might incorrectly believe a static pod has died when old state exists.
  4. the /var/etcd dir is clean on ALL master nodes
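
A minimal shell sketch of those pre-checks, assuming a systemd-managed kubelet and Docker as the container runtime; the container name patterns below are illustrative, not something bootkube defines:

```
#!/usr/bin/env bash
# Hypothetical pre-flight checks before running `bootkube recover`.

# 1. kubelet must be running on this machine.
systemctl is-active --quiet kubelet || { echo "kubelet is not running"; exit 1; }

# 2. and 3. no leftover control-plane containers, running or exited; stale
# docker state can make the kubelet wrongly treat a static pod as already dead.
if docker ps -a --format '{{.Names}}' | grep -E 'etcd|kube-apiserver'; then
  echo "remove the leftover containers above first (docker rm -f <name>)"
  exit 1
fi

# 4. /var/etcd must be clean on ALL master nodes (run this check on each one).
if [ -n "$(ls -A /var/etcd 2>/dev/null)" ]; then
  echo "/var/etcd is not clean"
  exit 1
fi

echo "preconditions look OK"
```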

Contributor Author

@diegs diegs Jun 16, 2017

Cool, do you want me to add this directly to the documentation?

Also this is not really true of the other recovery situations. This makes it sound like you should basically destroy and recreate all your master nodes before using this recovery approach.

Contributor

@diegs just fyi. we can address them later.

control plane can be extracted directly from the api-server:

```
bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
```
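
The thread stops at the recovery step; as a sketch of the usual bootkube flow, the recovered assets would then be fed back to the standard bootstrap command on a master node:

```
# Sketch only: re-bootstrap the control plane from the recovered assets.
bootkube start --asset-dir=recovered
```
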
Contributor

one relevant issue: the field engineers suggest renaming --asset-dir to --output-asset-dir. when they first tried this without our help, they passed the old asset-dir in here.

Contributor Author

Thanks for the feedback! Created #589

@xiang90
Contributor

xiang90 commented Jun 16, 2017

Awesome start!

@xiang90
Contributor

xiang90 commented Jun 16, 2017

LGTM

Contributor

@aaronlevy aaronlevy left a comment

Two minor notes that I wouldn't block on; both are more open-ended. LGTM


To minimize the likelihood of any of the these scenarios, production
self-hosted clusters should always run in a [high-availability
configuration](https://kubernetes.io/docs/admin/high-availability/).
Contributor

I'm on the fence about linking to those docs, as they're pretty different from how self-hosted HA works (which we need docs for: #311). It does touch on some important topics like leader-election, but even then we already have that, and all we care about is scaling replica counts (for example).
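
As a rough illustration of that last point, scaling a self-hosted control-plane component is just a replica-count change; the deployment name and namespace below are assumptions for the example, not something this PR defines:

```
# Illustrative only: bump the replica count of a self-hosted control-plane component.
kubectl --namespace=kube-system scale deployment kube-controller-manager --replicas=3
```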

Contributor Author

Ok, added a TODO linking to the issue instead for now.

For more information, see the [Pod Checkpointer
README](https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/checkpoint/README.md).

## Bootkube Recover
Contributor

We may want to have some kind of versioning convention. I'm assuming right now it's: you should always use the latest bootkube release when running recover. This may not be a confusion point, but I wonder if users will try to use the same bootkube release that they installed with (which is probably fine in most cases, unless there are new bug fixes they should have).

Contributor Author

Added a note recommending always using the latest version.
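
For reference, fetching a recent release before running recover might look like the following; the version number and tarball layout here are assumptions, so check the project's releases page for the actual latest asset:

```
# Hypothetical example of grabbing a bootkube release; version and paths are assumptions.
BOOTKUBE_VERSION=v0.5.1
wget "https://github.com/kubernetes-incubator/bootkube/releases/download/${BOOTKUBE_VERSION}/bootkube.tar.gz"
tar xzf bootkube.tar.gz
./bin/linux/bootkube --help
```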

@coresolve
Contributor

We need to test this against a Tectonic cluster, not just a bootkube-rendered cluster. Looks really good though.

@diegs diegs merged commit 8b303d7 into kubernetes-retired:master Jun 19, 2017
@diegs diegs deleted the recovery-docs branch June 19, 2017 20:22
@diegs
Copy link
Contributor Author

diegs commented Jun 19, 2017

@coresolve sgtm, we should go through that (and especially with self-hosted etcd) next.
