From 6c612d12fb4e387bef3bb0a23cb5431a35cb53f5 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma <ewout@prangsma.net>
Date: Thu, 10 May 2018 10:32:04 +0200
Subject: [PATCH 1/4] Added a spec regarding the rules for eviction &
 replacement of pods

---
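Note (not part of the commit): a minimal Go sketch of how the toleration
times in this spec could be turned into NoExecute tolerations. It is an
illustration under stated assumptions, not the operator's implementation.
The `Toleration` struct is a simplified stand-in for the Kubernetes
core/v1 type, and the pod type keys, `tolerationTimes`, `newToleration`
and `noExecuteTolerations` are hypothetical names. Leaving
`TolerationSeconds` unset is how Kubernetes expresses "tolerate forever".
The values are those of this patch; patch 4/4 changes agents to 5min.

```go
package main

import (
	"fmt"
	"time"
)

// Toleration is a simplified, hypothetical stand-in for the Kubernetes
// core/v1 Toleration type; only the fields this sketch needs are modelled.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64 // nil means "tolerate forever"
}

// forever marks pod types whose toleration time is deliberately not set.
const forever time.Duration = -1

// tolerationTimes lists the per-pod-type durations from the spec.
// The keys are illustrative labels, not identifiers used by the operator.
var tolerationTimes = map[string]time.Duration{
	"imageid":     5 * time.Second,
	"coordinator": 15 * time.Second,
	"dbserver":    5 * time.Minute,
	"agent":       forever, // changed to 5min by patch 4/4 of this series
	"single":      forever,
	"failover":    5 * time.Minute,
	"syncmaster":  15 * time.Second,
	"syncworker":  time.Minute,
}

// newToleration builds one NoExecute toleration for the given taint key.
func newToleration(key string, d time.Duration) Toleration {
	t := Toleration{Key: key, Operator: "Exists", Effect: "NoExecute"}
	if d != forever {
		secs := int64(d / time.Second)
		t.TolerationSeconds = &secs
	}
	return t
}

// noExecuteTolerations returns the unreachable and not-ready tolerations
// for a pod type, or an error when the pod type is not in the table.
func noExecuteTolerations(podType string) ([]Toleration, error) {
	d, ok := tolerationTimes[podType]
	if !ok {
		return nil, fmt.Errorf("unknown pod type %q", podType)
	}
	return []Toleration{
		newToleration("node.kubernetes.io/unreachable", d),
		newToleration("node.kubernetes.io/not-ready", d),
	}, nil
}

func main() {
	for _, podType := range []string{"coordinator", "dbserver", "single"} {
		ts, err := noExecuteTolerations(podType)
		if err != nil {
			fmt.Println(err)
			continue
		}
		for _, t := range ts {
			if t.TolerationSeconds == nil {
				fmt.Printf("%s %s: forever\n", podType, t.Key)
			} else {
				fmt.Printf("%s %s: %ds\n", podType, t.Key, *t.TolerationSeconds)
			}
		}
	}
}
```

An unknown pod type returns an error rather than silently falling back to
"forever", so a typo in a key cannot pin a pod to an unreachable node.
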
 docs/design/pod_evication_and_replacement.md | 123 +++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 docs/design/pod_evication_and_replacement.md

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
new file mode 100644
index 000000000..567298463
--- /dev/null
+++ b/docs/design/pod_evication_and_replacement.md
@@ -0,0 +1,123 @@
+# Pod Eviction & Replacement
+
+This chapter specifies the rules around evicting pods from nodes and
+restarting or replacing them.
+
+## Eviction
+
+Eviction is the process of removing a pod that is running on a node from that node.
+
+This is typically the result of a drain action (`kubectl drain`) or
+from a taint being added to a node (either automatically by Kubernetes or manually by an operator).
+
+## Replacement
+
+Replacement is the process of replacing a pod an another pod that takes over the responsibilities
+of the original pod.
+
+The replacement pod has a new ID and new (read empty) persistent data.
+
+Note that replacing a pod is different from restarting a pod. A pod is restarted when it has been reported
+to have terminated.
+
+## NoExecute Tolerations
+
+NoExecute tolerations are used to control the behavior of Kubernetes (with respect to a pod) when the node
+that the pod is running on is no longer reachable or becomes not-ready.
+
+See the applicable [Kubernetes documentation](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) for more info.
+
+## Rules
+
+The rules for eviction & replacement are specified per type of pod.
+
+### Image ID Pods
+
+The Image ID pods are starter to fetch the ArangoDB version of a specific
+ArangoDB image and fetch the docker sha256 of that image.
+They have no persistent state.
+
+- Image ID pods can always be evicted from any node
+- Image ID pods can always be restarted on a different node.
+  There is no need to replace an image ID pod.
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set very low (5sec)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set very low (5sec)
+
+### Coordinator Pods
+
+Coordinator pods run an ArangoDB coordinator as part of an ArangoDB cluster.
+They have no persistent state, but do have a unique ID.
+
+- Coordinator pods can always be evicted from any node
+- Coordinator pods can always be replaced with another coordinator pod with a different ID on a different node
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)
+
+### DBServer Pods
+
+DBServer pods run an ArangoDB dbserver as part of an ArangoDB cluster.
+It has persistent state potentially tight to the node it runs on and it has a unique ID.
+
+- DBServer pods can be evicted from any node as soon as:
+  - It has been completely drained AND
+  - It is no longer the shard master for any shard
+- DBServer pods can be replaced with another dbserver pod with a different ID on a different node when:
+  - It is not the shard master for any shard OR
+  - For every shard it is the master for, there is an in-sync follower
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
+
+### Agent Pods
+
+Agent pods run an ArangoDB agent as part of an ArangoDB agency.
+It has persistent state potentially tied to the node it runs on and it has a unique ID.
+
+- Agent pods can be evicted from any node as soon as:
+  - It is no longer the agency leader AND
+  - There is at least an agency leader that is responding AND
+  - There is at least an agency follower that is responding
+- Agent pods can be replaced with another agent pod with the same ID but whiped persistent state on a different node when:
+  - The old pod is known to be deleted (e.g. explicit eviction)
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"
+
+### Single Server Pods
+
+Single server pods run an ArangoDB server as part of an ArangoDB single server deployment.
+It has persistent state potentially tight to the node.
+
+- Single server pods cannot be evicted from any node.
+- Single server pods cannot be replaced with another pod.
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set to "wait it out forever"
+
+### Single Pods in Active Failover Deployment
+
+Single pods run an ArangoDB single server as part of an ArangoDB active failover deployment.
+It has persistent state potentially tight to the node it runs on and it has a unique ID.
+
+- Single pods can be evicted from any node as soon as:
+  - It is a follower of an active-failover deployment (Q: can we trigger this failover to another server?)
+- Single pods can always be replaced with another single pod with a different ID on a different node.
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
+
+### SyncMaster Pods
+
+SyncMaster pods run ArangoSync as a master, as part of an ArangoDB DC2DC cluster.
+They have no persistent state, but do have a unique address.
+
+- SyncMaster pods can always be evicted from any node
+- SyncMaster pods can always be replaced with another syncmaster pod on a different node
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)
+
+### SyncWorker Pods
+
+SyncWorker pods run ArangoSync as a worker, as part of an ArangoDB DC2DC cluster.
+They have no persistent state, but do have in-memory state and a unique address.
+
+- SyncWorker pods can always be evicted from any node
+- SyncWorker pods can always be replaced with another syncworker pod on a different node
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)

From ab28b5007e011d84fb6ebc0913561436784e4755 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma <ewout@prangsma.net>
Date: Mon, 14 May 2018 14:58:39 +0200
Subject: [PATCH 2/4] Typos

---
 docs/design/pod_evication_and_replacement.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
index 567298463..9c91592be 100644
--- a/docs/design/pod_evication_and_replacement.md
+++ b/docs/design/pod_evication_and_replacement.md
@@ -12,7 +12,7 @@ from a taint being added to a node (either automatically by Kubernetes or manual
 
 ## Replacement
 
-Replacement is the process of replacing a pod an another pod that takes over the responsibilities
+Replacement is the process of replacing a pod by another pod that takes over the responsibilities
 of the original pod.
 
 The replacement pod has a new ID and new (read empty) persistent data.
@@ -33,13 +33,14 @@ The rules for eviction & replacement are specified per type of pod.
 
 ### Image ID Pods
 
-The Image ID pods are starter to fetch the ArangoDB version of a specific
+The Image ID pods are started to fetch the ArangoDB version of a specific
 ArangoDB image and fetch the docker sha256 of that image.
 They have no persistent state.
 
 - Image ID pods can always be evicted from any node
 - Image ID pods can always be restarted on a different node.
-  There is no need to replace an image ID pod.
+  There is no need to replace an image ID pod, nor will it cause problems when
+  2 image ID pods run at the same time.
 - `node.kubernetes.io/unreachable:NoExecute` toleration time is set very low (5sec)
 - `node.kubernetes.io/not-ready:NoExecute` toleration time is set very low (5sec)
 
@@ -56,7 +57,7 @@ They have no persistent state, but do have a unique ID.
 ### DBServer Pods
 
 DBServer pods run an ArangoDB dbserver as part of an ArangoDB cluster.
-It has persistent state potentially tight to the node it runs on and it has a unique ID.
+It has persistent state potentially tied to the node it runs on and it has a unique ID.
 
 - DBServer pods can be evicted from any node as soon as:
   - It has been completely drained AND
@@ -84,7 +85,7 @@ It has persistent state potentially tight to the node it runs on and it has a un
 ### Single Server Pods
 
 Single server pods run an ArangoDB server as part of an ArangoDB single server deployment.
-It has persistent state potentially tight to the node.
+It has persistent state potentially tied to the node.
 
 - Single server pods cannot be evicted from any node.
 - Single server pods cannot be replaced with another pod.
@@ -94,7 +95,7 @@ It has persistent state potentially tight to the node.
 ### Single Pods in Active Failover Deployment
 
 Single pods run an ArangoDB single server as part of an ArangoDB active failover deployment.
-It has persistent state potentially tight to the node it runs on and it has a unique ID.
+It has persistent state potentially tied to the node it runs on and it has a unique ID.
 
 - Single pods can be evicted from any node as soon as:
   - It is a follower of an active-failover deployment (Q: can we trigger this failover to another server?)

From d7f2ccb6495be4e1bc88a5caa2499c26e6166741 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma <ewout@prangsma.net>
Date: Mon, 14 May 2018 15:01:07 +0200
Subject: [PATCH 3/4] Typo

---
 docs/design/pod_evication_and_replacement.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
index 9c91592be..988d5284e 100644
--- a/docs/design/pod_evication_and_replacement.md
+++ b/docs/design/pod_evication_and_replacement.md
@@ -77,7 +77,7 @@ It has persistent state potentially tight to the node it runs on and it has a un
   - It is no longer the agency leader AND
   - There is at least an agency leader that is responding AND
   - There is at least an agency follower that is responding
-- Agent pods can be replaced with another agent pod with the same ID but whiped persistent state on a different node when:
+- Agent pods can be replaced with another agent pod with the same ID but wiped persistent state on a different node when:
   - The old pod is known to be deleted (e.g. explicit eviction)
 - `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
 - `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"

From cddb02b3144db58f47adc1402fdb234db3f90d21 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma <ewout@prangsma.net>
Date: Mon, 14 May 2018 15:01:37 +0200
Subject: [PATCH 4/4] Changed agent tolerations

---
 docs/design/pod_evication_and_replacement.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
index 988d5284e..8a5fa5e94 100644
--- a/docs/design/pod_evication_and_replacement.md
+++ b/docs/design/pod_evication_and_replacement.md
@@ -79,8 +79,8 @@ It has persistent state potentially tight to the node it runs on and it has a un
   - There is at least an agency follower that is responding
 - Agent pods can be replaced with another agent pod with the same ID but wiped persistent state on a different node when:
   - The old pod is known to be deleted (e.g. explicit eviction)
-- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
-- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
 
 ### Single Server Pods