Conversation
cc @coresolve
Documentation/disaster-recovery.md (outdated)
```
bootkube recover --asset-dir=recovered --etcd-backup-file=backup --kubeconfig=/etc/kubernetes/kubeconfig
```
We tried this with our field engs yesterday.
There are a few things we need to make sure of before running this script (see the pre-flight sketch below):
- kubelet is running on the machine
- no related containers are running (old etcd, old api-server, etc.; this also applies to the other recovery cases, I believe)
- the docker state is clean (`docker ps -a` does not contain old state for the relevant containers); the kubelet has bugs where it may incorrectly believe a static pod has died when old state exists
- the /var/etcd dir is clean on ALL master nodes
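A minimal pre-flight sketch along these lines; the container-name patterns are assumptions and will vary by cluster, so treat it as illustrative rather than as the documented procedure:

```sh
# Hypothetical pre-flight checks before running `bootkube recover`.
# Container name patterns below are assumptions; adjust for your cluster.

# 1. The kubelet must be running on this machine.
systemctl is-active kubelet

# 2. No related containers should be running, and the docker state should be
#    clean: `docker ps -a` should show no leftover (running or exited)
#    etcd / apiserver containers that could confuse the kubelet's
#    static-pod handling.
docker ps -a | grep -E 'etcd|apiserver'
# Remove any leftovers listed above, e.g.:
#   docker rm -f <container-id> ...

# 3. /var/etcd must be clean on ALL master nodes.
sudo rm -rf /var/etcd/*
```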
Cool, do you want me to add this directly to the documentation?
Also, this is not really true of the other recovery situations; this makes it sound like you should basically destroy and recreate all your master nodes before using this recovery approach.
@diegs just fyi. we can address them later.
Documentation/disaster-recovery.md (outdated)
control plane can be extracted directly from the api-server:

```
bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
```
One relevant issue: the field engs suggest renaming --asset-dir to --output-asset-dir. When they first tried this without our help, they passed the old asset dir in here.
Thanks for the feedback! Created #589
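For context, a hedged sketch of the intended flow (directory names are just examples): the recovered assets go to a fresh output directory, not the asset dir from the original installation, and the control plane is then re-bootstrapped from it:

```sh
# Hypothetical flow; "recovered" is a new, empty output directory,
# not the asset dir used for the original installation.
bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
bootkube start --asset-dir=recovered
```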
Awesome start!
LGTM
Two minor notes that I wouldn't block on - more open-ended. LGTM
Documentation/disaster-recovery.md (outdated)
To minimize the likelihood of any of these scenarios, production self-hosted clusters should always run in a [high-availability configuration](https://kubernetes.io/docs/admin/high-availability/).
I'm on the fence about linking to those docs, as they're pretty different from how self-hosted HA works (which we need docs for: #311). They do touch on some important topics like leader election, but even then we already have that, and all we care about is scaling replica counts (for example).
Ok, added a TODO linking to the issue instead for now.
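For what it's worth, a minimal sketch of what "scaling replica counts" means in a self-hosted cluster; the Deployment names are assumptions based on a typical bootkube-rendered control plane:

```sh
# Hypothetical example: scale self-hosted control-plane components.
# Deployment names are assumptions and may differ in your cluster.
kubectl -n kube-system scale deployment kube-scheduler --replicas=3
kubectl -n kube-system scale deployment kube-controller-manager --replicas=3
```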
For more information, see the [Pod Checkpointer README](https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/checkpoint/README.md).

## Bootkube Recover
We may want to have some kind of versioning convention. I'm assuming right now it's: you should always use the latest bootkube release when running recover. This may not be a confusion point, but I wonder if users will try to use the same bootkube release that they installed with (which is probably fine in most cases, unless there are new bug fixes they should have).
Added a note recommending always using the latest version.
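A quick sketch of how that plays out on a node (assuming a `bootkube version` subcommand is available; otherwise check the binary against the project's releases page):

```sh
# Check which bootkube binary is on the node before recovering; the
# recommendation is to use the latest release, not necessarily the
# version the cluster was originally installed with.
bootkube version

# Latest releases:
#   https://github.com/kubernetes-incubator/bootkube/releases
```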
We need to test this against a Tectonic cluster, not just the bootkube-rendered cluster. Looks really good though.
@coresolve sgtm, we should go through that (and especially with self-hosted etcd) next.
Fixes #432, also one way to address issues raised in #112.