dev-docs: etcd disaster recovery #1544

File: `dev-docs/howto/etcd-disaster-recovery.md`

# etcd disaster recovery
When more than (N-1)/2 of the control-plane nodes are lost, etcd loses quorum:
it can no longer perform any transactions, so every request stalls and eventually times out.
This makes the Constellation control plane unusable, resulting in a frozen cluster. The worker nodes will continue to function for a while,
but since they can no longer communicate with the control plane, they will eventually also cease to function correctly.

If the missing control-plane nodes still exist and their state disks are intact, you likely do not need this guide.
If the missing nodes are irrecoverably lost (e.g. because the control-plane instance set was scaled down and back up), follow this guide to bring the cluster back up.

## General concept
1. Create a snapshot of the state disk of a remaining control-plane node.
2. Download the snapshot and decrypt it locally.
3. Follow the [Restoring a cluster](https://etcd.io/docs/v3.5/op-guide/recovery/#restoring-a-cluster) guide from etcd.
4. Save the modified virtual disk and upload it back to the CSP.
5. Modify the scale set (or the remaining VM individually, if you can) to use the newly uploaded state disk.
6. Reboot and wait a few minutes.
7. Pray it worked ;)

## How I did it once (Azure)

1. If the VM has never been rebooted after initialization, reboot it once to sync any LUKS passphrase changes to disk (not 100% sure this is necessary; this should be double-checked with an experimental cluster).
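If you prefer the CLI over the Azure portal for the reboot, something along these lines should work (VMSS name and instance ID are the ones used in the later steps; adjust to your cluster):
```bash
az vmss restart --resource-group dogfooding --name constell-f2332c74-control-plane --instance-ids 1
```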

2. Create a snapshot from the disk using the CLI:
```bash
az snapshot create --resource-group dogfooding --name dogfooding-3 --source /subscriptions/0d202bbb-4fa7-4af8-8125-58c269a05435/resourceGroups/dogfooding/providers/Microsoft.Compute/disks/constell-f2332c74-coconstell-f2332c74-condisk2_dd460a6ae3124aa3a4c23be0ab39634e --location northeurope
```

3. Look up the snapshot in the Azure portal, export it as a VHD, and download it (a CLI alternative is sketched after the next command).
Attach the downloaded VHD as a local block device:
```bash
sudo modprobe nbd && sudo qemu-nbd -c /dev/nbd0 /home/nils/Downloads/constellation-disk.vhd
```
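As an alternative to exporting the snapshot through the portal, you can generate a temporary read SAS via the CLI and download the VHD with azcopy (a sketch, using the snapshot name from step 2):
```bash
az snapshot grant-access --resource-group dogfooding --name dogfooding-3 --access-level Read --duration-in-seconds 3600
azcopy copy "<accessSas URL from the previous command>" ./constellation-disk.vhd
```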

4. Get the UUID of the disk:
```bash
sudo cryptsetup luksDump /dev/nbd0
```

5. Re-derive the passphrase needed to unlock the disk (the [code snippet below](#get-disk-decryption-key) might be useful for this)

6. Decrypt the disk:
```bash
sudo cryptsetup luksOpen /dev/nbd0 constellation-state --key-file passphrase
```

7. Mount the decrypted disk (I just did this via Nautilus)
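If you'd rather mount it from the shell, something like this should do (assuming the filesystem sits directly on the mapper device, which is also what Nautilus sees):
```bash
sudo mkdir -p /mnt/state
sudo mount /dev/mapper/constellation-state /mnt/state
```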

8. Find the etcd database file at `/var/lib/etcd/member/snap/db` on the mounted disk
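A quick sanity check that the file is there and non-empty (path assumes the mount point from the previous step):
```bash
ls -lh /mnt/state/var/lib/etcd/member/snap/db
```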

9. Perform the etcd [Restoring a cluster](https://etcd.io/docs/v3.5/op-guide/recovery/#restoring-a-cluster) step:

```bash
./etcdutl snapshot restore db --initial-cluster constell-f2332c74-control-plane000001=https://10.9.126.0:2380 --initial-advertise-peer-urls https://10.9.126.0:2380 --data-dir recovery --name constell-f2332c74-control-plane000001 --skip-hash-check=true
```
*(replace name & IP with the name and the private IP of the remaining control plane VM you are to perform the restore process on - this information can be found in the Azure portal)*

10. Copy the contents of the newly created `recovery` directory into the etcd data directory on the mounted state disk and remove any leftover old files.
**Make sure permissions and ownership are the same as before!**
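A sketch of that copy, assuming the state disk is mounted at `/mnt/state` and the restore was written to `recovery/` (check ownership and permissions of the old directory first, e.g. with `ls -ln`, and restore them afterwards):
```bash
sudo rm -rf /mnt/state/var/lib/etcd/member
sudo cp -a recovery/member /mnt/state/var/lib/etcd/
# adjust owner, group and mode to match what `ls -ln` showed before the copy
```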

11. Unmount the partition:
```bash
sudo umount /your/mount/path
sudo cryptsetup luksClose constellation-state
sudo qemu-nbd -d /dev/nbd0
```

12. Upload the modified VHD back to Azure (I just used Azure Storage Explorer for this).
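If you want to script this instead of using Storage Explorer, one possible (untested here) route is to copy the VHD to a blob container and create a managed disk from it; the storage account and container names below are placeholders:
```bash
# authenticate azcopy first, e.g. with `azcopy login`
azcopy copy ./constellation-disk.vhd "https://<storageaccount>.blob.core.windows.net/<container>/constellation-disk-recovered.vhd"
# create a managed disk from the blob so it can be attached in step 15
az disk create --resource-group dogfooding --name constell-recovered-state \
  --source "https://<storageaccount>.blob.core.windows.net/<container>/constellation-disk-recovered.vhd"
```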

13. Patch the whole control-plane VMSS to remove LUN 0 from the VMs:
```bash
az vmss disk detach --lun 0 --resource-group dogfooding --vmss-name constell-f2332c74-control-plane
```

14. Update the VM:
```bash
az vmss update-instances -g dogfooding --name constell-f2332c74-control-plane --instance-ids 1
```

15. Attach the uploaded disk as LUN 0 (either via the CLI or the Azure portal; I just used the portal).
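A hedged CLI variant of this step (the disk name assumes the managed disk created from the uploaded VHD, e.g. `constell-recovered-state` from the sketch in step 12):
```bash
az vmss disk attach --resource-group dogfooding --vmss-name constell-f2332c74-control-plane \
  --instance-id 1 --lun 0 --disk constell-recovered-state
```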

16. Start the VM and pray it works ;) It can take a few minutes before etcd becomes fully alive again.
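A few checks that should tell you whether it worked (run from your workstation and on the control-plane node respectively):
```bash
# does the API server answer again?
kubectl get nodes
# is the etcd pod running on the control-plane node?
sudo crictl ps --name etcd
```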

17. Patch the state disk definition back onto the VMSS (no idea how, haven't done this yet) so newly created VMs in the VMSS get a clean state disk again.

## Get disk decryption key
```golang
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"os"

	"golang.org/x/crypto/hkdf"
)

// MasterSecret mirrors the layout of constellation-mastersecret.json.
type MasterSecret struct {
	Key  []byte `json:"key"`
	Salt []byte `json:"salt"`
}

func main() {
	uuid := "4ae66293-57aa-4821-b99c-ebc598a6e5a8" // replace with the LUKS UUID from step 4

	masterSecretRaw, err := os.ReadFile("constellation-mastersecret.json")
	if err != nil {
		panic(err)
	}

	var masterSecret MasterSecret
	if err := json.Unmarshal(masterSecretRaw, &masterSecret); err != nil {
		panic(err)
	}

	// Derive the disk encryption key via HKDF-SHA256 with info "key-"+<UUID>.
	dek, err := DeriveKey(masterSecret.Key, masterSecret.Salt, []byte("key-"+uuid), 32)
	if err != nil {
		panic(err)
	}

	fmt.Println(hex.EncodeToString(dek))

	// Write the raw key bytes so the file can be passed to cryptsetup via --key-file.
	if err := os.WriteFile("passphrase", dek, 0o644); err != nil {
		panic(err)
	}
}

// DeriveKey derives a key from a secret.
func DeriveKey(secret, salt, info []byte, length uint) ([]byte, error) {
	hkdfReader := hkdf.New(sha256.New, secret, salt, info)
	key := make([]byte, length)
	if _, err := io.ReadFull(hkdfReader, key); err != nil {
		return nil, err
	}
	return key, nil
}
```