doc/clustering: Better document healing

Closes #1032

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
lxc · Aug 5, 2024 · 12e10a1
1 parent 51413bd

Showing 2 changed files with 19 additions and 3 deletions.
diff --git a/doc/.wordlist.txt b/doc/.wordlist.txt
@@ -26,6 +26,7 @@ BitLocker
 bool
 bootable
 BPF
+BMC
 Btrfs
 bugfix
 bugfixes
@@ -189,6 +190,7 @@ OVS
 Pbit
 PCI
 PCIe
+PDU
 peerings
 Permalink
 PFs

diff --git a/doc/howto/cluster_manage.md b/doc/howto/cluster_manage.md
@@ -77,11 +77,25 @@ When the evacuated server is available again, use the [`incus cluster restore`](
 This command also moves the evacuated instances back from the servers that were temporarily holding them.
 
 (cluster-automatic-evacuation)=
-### Automatic evacuation
+### Cluster healing
 
-If you set the {config:option}`server-cluster:cluster.healing_threshold` configuration to a non-zero value, instances are automatically evacuated if a cluster member goes offline.
+Incus can automatically detect and recover from a broken server. This is done by setting the {config:option}`server-cluster:cluster.healing_threshold` configuration to a non-zero value.
+Instances are automatically evacuated to other servers after the leader has marked a cluster member as offline.
 
-When the evacuated server is available again, you must manually restore it.
+When the broken server is available again, you must manually restore it as if it had been manually evacuated.
+
+```{note}
+This automatic cluster healing only applies to instances that are on shared storage and don't use any local devices.
+```
+
+```{warning}
+Enabling this feature carries a risk of data corruption should a server be deemed offline as a result of partial connectivity issues.
+Incus considers a server to be offline when it fails to respond to heartbeat packets and when it also fails to respond to ICMP packets.
+
+It's critical to ensure that a server which is considered offline is in fact offline and isn't still running its instances.
+One way to automatically achieve this is to have a piece of software monitor Incus for a `cluster-member-healed` event and promptly cut the
+power to the server in question by interacting with its BMC or PDU.
+```
 
 (cluster-manage-delete-members)=
 ## Delete cluster members
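
The warning added above suggests having a piece of software watch for the `cluster-member-healed` event and cut power to the affected server. A minimal Python sketch of that idea follows. It rests on several assumptions that the commit does not confirm: that events are read from `incus monitor --type=lifecycle --format=json` as one JSON object per line, that the event name appears under `metadata.action`, and that the member URL appears under `metadata.source`. The `power_off` callback is hypothetical and stands in for whatever BMC or PDU integration a site actually uses.

```python
import json
import subprocess
from typing import Callable, Iterable, List, Optional

# Event name taken from the documentation text in this commit.
HEALED_ACTION = "cluster-member-healed"


def healed_member(line: str) -> Optional[str]:
    """Return the cluster member name if `line` is a healing event, else None.

    Assumption: lifecycle events carry the action in metadata.action and a
    URL such as /1.0/cluster/members/<name> in metadata.source.
    """
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or event.get("type") != "lifecycle":
        return None
    metadata = event.get("metadata")
    if not isinstance(metadata, dict) or metadata.get("action") != HEALED_ACTION:
        return None
    source = metadata.get("source")
    if not isinstance(source, str):
        return None
    # The member name is the last path component of the source URL.
    name = source.rstrip("/").rsplit("/", 1)[-1]
    return name or None


def fence_members(lines: Iterable[str], power_off: Callable[[str], None]) -> List[str]:
    """Call `power_off` for every healed member seen in `lines`.

    `power_off` is a hypothetical hook; a real one would talk to a BMC or PDU.
    Returns the list of member names that were passed to the hook.
    """
    fenced = []
    for line in lines:
        member = healed_member(line)
        if member is not None:
            power_off(member)
            fenced.append(member)
    return fenced


def watch(power_off: Callable[[str], None]) -> None:
    """Stream lifecycle events from the Incus CLI and fence healed members.

    Assumption: the command and flags below emit one JSON event per line.
    """
    cmd = ["incus", "monitor", "--type=lifecycle", "--format=json"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        fence_members(proc.stdout, power_off)
```

Calling `watch(my_power_off)` would block and react to events as they arrive. Parsing is kept separate from the subprocess call so the matching logic can be checked against sample event lines without a running cluster.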