From 4a184b67d6ae25b21b35373e7dd6eab41b042c96 Mon Sep 17 00:00:00 2001
From: Andrey Smirnov
Date: Fri, 16 Apr 2021 22:40:51 +0300
Subject: [PATCH] docs: add etcd backup and restore guide

Describe the full procedure, from backups to disaster recovery.

Signed-off-by: Andrey Smirnov
Co-authored-by: Spencer Smith
---
 .../docs/v0.10/Guides/disaster-recovery.md | 147 ++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 website/content/docs/v0.10/Guides/disaster-recovery.md

diff --git a/website/content/docs/v0.10/Guides/disaster-recovery.md b/website/content/docs/v0.10/Guides/disaster-recovery.md
new file mode 100644
index 0000000000..286b70080f
--- /dev/null
+++ b/website/content/docs/v0.10/Guides/disaster-recovery.md
@@ -0,0 +1,147 @@
---
title: "Disaster Recovery"
description: "Procedure for snapshotting etcd database and recovering from catastrophic control plane failure."
---

The `etcd` database backs the Kubernetes control plane state, so if the `etcd` service is unavailable,
the Kubernetes control plane goes down, and the cluster is not recoverable until `etcd` is restored with its contents.
The `etcd` consistency model is built around the Raft consensus protocol, so for highly-available control plane clusters,
the loss of one control plane node doesn't impact cluster health.
In general, `etcd` stays up as long as enough nodes are up to maintain quorum.
For a three control plane node Talos cluster, this means that the cluster tolerates the failure of any single node,
but losing more than one node at the same time leads to a complete loss of service.
Because of that, it is important to take routine backups of the `etcd` state, so that there is a snapshot to recover the cluster from
in case of catastrophic failure.

## Backup

### Snapshotting the `etcd` Database

Create a consistent snapshot of the `etcd` database with the `talosctl etcd snapshot` command:

```bash
$ talosctl -n <IP> etcd snapshot db.snapshot
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
```

> Note: the filename `db.snapshot` is arbitrary.

This database snapshot can be taken on any healthy control plane node (with IP address `<IP>` in the example above),
as all `etcd` instances contain exactly the same data.
It is recommended to take `etcd` snapshots on a schedule to allow point-in-time recovery using the latest snapshot.

### Disaster Database Snapshot

If the `etcd` cluster is not healthy, the `talosctl etcd snapshot` command might fail.
In that case, copy the database snapshot directly from the control plane node:

```bash
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
```

This snapshot might not be fully consistent (if the `etcd` process is running), but it allows
for disaster recovery when the latest regular snapshot is not available.

### Machine Configuration

The machine configuration might be required to recover a node after hardware failure.
Back up the Talos node machine configuration with the following command:

```bash
talosctl -n IP get mc v1alpha1 -o yaml | yq eval '.spec' -
```

## Recovery

Before starting the disaster recovery procedure, make sure that the `etcd` cluster can't be recovered:

* get the `etcd` cluster member list on all healthy control plane nodes with the `talosctl -n IP etcd members` command and compare the output across all nodes.
* query `etcd` health across the control plane nodes with `talosctl -n IP service etcd` (see the example below).
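
For example, on a three-node control plane these checks might look like the following (the `172.20.0.x` addresses are illustrative, matching the examples elsewhere in this guide; substitute your own node addresses):

```bash
# Compare the member list as reported by every control plane node that still responds:
talosctl -n 172.20.0.2 etcd members
talosctl -n 172.20.0.3 etcd members
talosctl -n 172.20.0.4 etcd members

# Check the etcd service state on all control plane nodes at once:
talosctl -n 172.20.0.2,172.20.0.3,172.20.0.4 service etcd
```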

If the quorum can be restored, restoring it might be a better strategy than performing the full disaster recovery
procedure.

### Latest Etcd Snapshot

Get hold of the latest `etcd` database snapshot.
If the latest available snapshot is not fresh enough, create a new database snapshot (see above), even if the `etcd` cluster is unhealthy.

### Init Node

Make sure that there are no control plane nodes with the machine type `init`:

```bash
$ talosctl -n <IP1>,<IP2>,... get machinetype
NODE         NAMESPACE   TYPE          ID             VERSION   TYPE
172.20.0.2   config      MachineType   machine-type   2         controlplane
172.20.0.4   config      MachineType   machine-type   2         controlplane
172.20.0.3   config      MachineType   machine-type   2         controlplane
```

Nodes with the `init` type are incompatible with the `etcd` recovery procedure.
An `init` node can be converted to the `controlplane` type with the `talosctl edit mc --on-reboot` command, followed
by a node reboot with the `talosctl reboot` command.

### Preparing Control Plane Nodes

If some control plane nodes experienced hardware failure, replace them with new nodes.
Use the machine configuration backup to re-create the nodes with the same secret material and control plane settings,
so that the workers can join the recovered control plane.

If a control plane node is healthy but `etcd` isn't, wipe the node's `EPHEMERAL` partition to remove the `etcd`
data directory (make sure a database snapshot is taken before doing this):

```bash
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
```

At this point, all control plane nodes should boot up, and the `etcd` service should be in the `Preparing` state.

The Kubernetes control plane endpoint should be pointed to the new control plane nodes if there were
any changes to the node addresses.

### Recovering from the Backup

Make sure all `etcd` service instances are in the `Preparing` state:

```bash
$ talosctl -n <IP> service etcd
NODE     172.20.0.2
ID       etcd
STATE    Preparing
HEALTH   ?
EVENTS   [Preparing]: Running pre state (17s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync (18s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", time sync (20s ago)
```

Execute the bootstrap command against any control plane node, passing the path to the `etcd` database snapshot:

```bash
$ talosctl -n <IP> bootstrap --recover-from=./db.snapshot
recovering from snapshot "./db.snapshot": hash c25fd181, revision 4193, total keys 1287, total size 3035136
```

> Note: if the database snapshot was copied out directly from the `etcd` data directory using `talosctl cp`,
> add the flag `--recover-skip-hash-check` to skip the integrity check on restore.
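
As a concrete sketch, recovering from a raw snapshot copied out of `/var/lib/etcd/member/snap/db` (see "Disaster Database Snapshot" above) might look like this; the node address and the local file name `./db` are illustrative:

```bash
# The raw snapshot was copied with `talosctl cp`, so the integrity check must be skipped:
talosctl -n 172.20.0.2 bootstrap --recover-from=./db --recover-skip-hash-check
```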

In either case, the Talos node should print matching information in the kernel log:

```log
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restoring snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/li}
{"level":"info","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":3360}
{"level":"info","msg":"added member","cluster-id":"a3390e43eb5274e2","local-member-id":"0","added-peer-id":"eb4f6f534361855e","added-peer-peer-urls":["https:/}
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
```

Now the `etcd` service should become healthy on the bootstrap node, the Kubernetes control plane components
should start, and the control plane endpoint should become available.
The remaining control plane nodes join the `etcd` cluster once the control plane endpoint is up.

## Single Control Plane Node Cluster

This guide applies to single control plane node clusters as well.
In fact, it is even more important to take regular snapshots of the `etcd` database in the single control plane node
case, as the loss of the control plane node might render the whole cluster irrecoverable without a backup.
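
As a minimal sketch of such a routine backup (the node address `172.20.0.2` and the `backups/` directory are assumptions for illustration, not part of the guide), a timestamped snapshot can be taken with a single command and run from cron or any other scheduler:

```bash
# Take a timestamped etcd snapshot of the (single) control plane node.
talosctl -n 172.20.0.2 etcd snapshot "backups/db-$(date +%Y%m%d-%H%M%S).snapshot"
```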