From 6c612d12fb4e387bef3bb0a23cb5431a35cb53f5 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma
Date: Thu, 10 May 2018 10:32:04 +0200
Subject: [PATCH 1/4] Added a spec regarding the rules for eviction & replacement of pods

---
 docs/design/pod_evication_and_replacement.md | 123 +++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 docs/design/pod_evication_and_replacement.md

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
new file mode 100644
index 000000000..567298463
--- /dev/null
+++ b/docs/design/pod_evication_and_replacement.md
@@ -0,0 +1,123 @@
+# Pod Eviction & Replacement
+
+This chapter specifies the rules around evicting pods from nodes and
+restarting or replacing them.
+
+## Eviction
+
+Eviction is the process of removing a pod that is running on a node from that node.
+
+This is typically the result of a drain action (`kubectl drain`) or
+from a taint being added to a node (either automatically by Kubernetes or manually by an operator).
+
+## Replacement
+
+Replacement is the process of replacing a pod an another pod that takes over the responsibilities
+of the original pod.
+
+The replacement pod has a new ID and new (read empty) persistent data.
+
+Note that replacing a pod is different from restarting a pod. A pod is restarted when it has been reported
+to have termined.
+
+## NoExecute Tolerations
+
+NoExecute tolerations are used to control the behavior of Kubernetes (wrt. to a Pod) when the node
+that the pod is running on is no longer reachable or becomes not-ready.
+
+See the applicable [Kubernetes documentation](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) for more info.
+
+## Rules
+
+The rules for eviction & replacement are specified per type of pod.
+
+### Image ID Pods
+
+The Image ID pods are starter to fetch the ArangoDB version of a specific
+ArangoDB image and fetch the docker sha256 of that image.
+They have no persistent state.
+
+- Image ID pods can always be evicted from any node
+- Image ID pods can always be restarted on a different node.
+  There is no need to replace an image ID pod.
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set very low (5sec)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set very low (5sec)
+
+### Coordinator Pods
+
+Coordinator pods run an ArangoDB coordinator as part of an ArangoDB cluster.
+They have no persistent state, but do have a unique ID.
+
+- Coordinator pods can always be evicted from any node
+- Coordinator pods can always be replaced with another coordinator pod with a different ID on a different node
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)
+
+### DBServer Pods
+
+DBServer pods run an ArangoDB dbserver as part of an ArangoDB cluster.
+It has persistent state potentially tight to the node it runs on and it has a unique ID.
+
+- DBServer pods can be evicted from any node as soon as:
+  - It has been completely drained AND
+  - It is no longer the shard master for any shard
+- DBServer pods can be replaced with another dbserver pod with a different ID on a different node when:
+  - It is not the shard master for any shard OR
+  - For every shard it is the master for, there is an in-sync follower
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
+
+### Agent Pods
+
+Agent pods run an ArangoDB dbserver as part of an ArangoDB agency.
+It has persistent state potentially tight to the node it runs on and it has a unique ID.
+
+- Agent pods can be evicted from any node as soon as:
+  - It is no longer the agency leader AND
+  - There is at least an agency leader that is responding AND
+  - There is at least an agency follower that is responding
+- Agent pods can be replaced with another agent pod with the same ID but whiped persistent state on a different node when:
+  - The old pod is known to be deleted (e.g. explicit eviction)
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"
+
+### Single Server Pods
+
+Single server pods run an ArangoDB server as part of an ArangoDB single server deployment.
+It has persistent state potentially tight to the node.
+
+- Single server pods cannot be evicted from any node.
+- Single server pods cannot be replaced with another pod.
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"
+
+### Single Pods in Active Failover Deployment
+
+Single pods run an ArangoDB single server as part of an ArangoDB active failover deployment.
+It has persistent state potentially tight to the node it runs on and it has a unique ID.
+
+- Single pods can be evicted from any node as soon as:
+  - It is a follower of an active-failover deployment (Q: can we trigger this failover to another server?)
+- Single pods can always be replaced with another single pod with a different ID on a different node.
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
+
+### SyncMaster Pods
+
+SyncMaster pods run an ArangoSync as master as part of an ArangoDB DC2DC cluster.
+They have no persistent state, but do have a unique address.
+
+- SyncMaster pods can always be evicted from any node
+- SyncMaster pods can always be replaced with another syncmaster pod on a different node
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)
+
+### SyncWorker Pods
+
+SyncWorker pods run an ArangoSync as worker as part of an ArangoDB DC2DC cluster.
+They have no persistent state, but do have in-memory state and a unique address.
+
+- SyncWorker pods can always be evicted from any node
+- SyncWorker pods can always be replaced with another syncworker pod on a different node
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)

From ab28b5007e011d84fb6ebc0913561436784e4755 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma
Date: Mon, 14 May 2018 14:58:39 +0200
Subject: [PATCH 2/4] Typos

---
 docs/design/pod_evication_and_replacement.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
index 567298463..9c91592be 100644
--- a/docs/design/pod_evication_and_replacement.md
+++ b/docs/design/pod_evication_and_replacement.md
@@ -12,7 +12,7 @@ from a taint being added to a node (either automatically by Kubernetes or manual
 
 ## Replacement
 
-Replacement is the process of replacing a pod an another pod that takes over the responsibilities
+Replacement is the process of replacing a pod by another pod that takes over the responsibilities
 of the original pod.
 
 The replacement pod has a new ID and new (read empty) persistent data.
@@ -33,13 +33,14 @@ The rules for eviction & replacement are specified per type of pod.
 
 ### Image ID Pods
 
-The Image ID pods are starter to fetch the ArangoDB version of a specific
+The Image ID pods are started to fetch the ArangoDB version of a specific
 ArangoDB image and fetch the docker sha256 of that image.
 They have no persistent state.
 
 - Image ID pods can always be evicted from any node
 - Image ID pods can always be restarted on a different node.
-  There is no need to replace an image ID pod.
+  There is no need to replace an image ID pod, nor will it cause problems when
+  2 image ID pods run at the same time.
 - `node.kubernetes.io/unreachable:NoExecute` toleration time is set very low (5sec)
 - `node.kubernetes.io/not-ready:NoExecute` toleration time is set very low (5sec)
 
@@ -56,7 +57,7 @@ They have no persistent state, but do have a unique ID.
 ### DBServer Pods
 
 DBServer pods run an ArangoDB dbserver as part of an ArangoDB cluster.
-It has persistent state potentially tight to the node it runs on and it has a unique ID.
+It has persistent state potentially tied to the node it runs on and it has a unique ID.
 
 - DBServer pods can be evicted from any node as soon as:
   - It has been completely drained AND
@@ -84,7 +85,7 @@ It has persistent state potentially tight to the node it runs on and it has a un
 ### Single Server Pods
 
 Single server pods run an ArangoDB server as part of an ArangoDB single server deployment.
-It has persistent state potentially tight to the node.
+It has persistent state potentially tied to the node.
 
 - Single server pods cannot be evicted from any node.
 - Single server pods cannot be replaced with another pod.
@@ -94,7 +95,7 @@ It has persistent state potentially tight to the node.
 ### Single Pods in Active Failover Deployment
 
 Single pods run an ArangoDB single server as part of an ArangoDB active failover deployment.
-It has persistent state potentially tight to the node it runs on and it has a unique ID.
+It has persistent state potentially tied to the node it runs on and it has a unique ID.
 
 - Single pods can be evicted from any node as soon as:
   - It is a follower of an active-failover deployment (Q: can we trigger this failover to another server?)

From d7f2ccb6495be4e1bc88a5caa2499c26e6166741 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma
Date: Mon, 14 May 2018 15:01:07 +0200
Subject: [PATCH 3/4] Typo

---
 docs/design/pod_evication_and_replacement.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
index 9c91592be..988d5284e 100644
--- a/docs/design/pod_evication_and_replacement.md
+++ b/docs/design/pod_evication_and_replacement.md
@@ -77,7 +77,7 @@ It has persistent state potentially tight to the node it runs on and it has a un
   - It is no longer the agency leader AND
   - There is at least an agency leader that is responding AND
   - There is at least an agency follower that is responding
-- Agent pods can be replaced with another agent pod with the same ID but whiped persistent state on a different node when:
+- Agent pods can be replaced with another agent pod with the same ID but wiped persistent state on a different node when:
   - The old pod is known to be deleted (e.g. explicit eviction)
 - `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
 - `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"

From cddb02b3144db58f47adc1402fdb234db3f90d21 Mon Sep 17 00:00:00 2001
From: Ewout Prangsma
Date: Mon, 14 May 2018 15:01:37 +0200
Subject: [PATCH 4/4] Changed agent tolerations

---
 docs/design/pod_evication_and_replacement.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/design/pod_evication_and_replacement.md b/docs/design/pod_evication_and_replacement.md
index 988d5284e..8a5fa5e94 100644
--- a/docs/design/pod_evication_and_replacement.md
+++ b/docs/design/pod_evication_and_replacement.md
@@ -79,8 +79,8 @@ It has persistent state potentially tight to the node it runs on and it has a un
   - There is at least an agency follower that is responding
 - Agent pods can be replaced with another agent pod with the same ID but wiped persistent state on a different node when:
   - The old pod is known to be deleted (e.g. explicit eviction)
-- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set to "wait it out forever"
-- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set "wait it out forever"
+- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
+- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
 
 ### Single Server Pods
 
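
The per-pod-type NoExecute toleration times specified by this series, as they stand after PATCH 4/4, can be expressed as a small table of data. The Go code here is a minimal sketch, not the operator's implementation.

- `Toleration` mirrors a subset of the fields of `k8s.io/api/core/v1.Toleration`, so the example compiles without Kubernetes dependencies.
- `PodRole`, the role names, `policies` and `tolerationsFor` are hypothetical.
- It assumes that "not set" for single server pods means the toleration is added without `tolerationSeconds`. Kubernetes treats such a toleration as tolerating the taint indefinitely.

```go
package main

import (
	"fmt"
	"time"
)

// Toleration mirrors a subset of k8s.io/api/core/v1.Toleration so that this
// sketch builds without Kubernetes dependencies.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64 // nil means "tolerate the taint forever"
}

// Taint keys named in the spec.
const (
	TaintNodeUnreachable = "node.kubernetes.io/unreachable"
	TaintNodeNotReady    = "node.kubernetes.io/not-ready"
)

// PodRole is a hypothetical enumeration of the pod types in the spec.
type PodRole string

const (
	RoleImageID        PodRole = "imageid"
	RoleCoordinator    PodRole = "coordinator"
	RoleDBServer       PodRole = "dbserver"
	RoleAgent          PodRole = "agent"
	RoleSingleServer   PodRole = "single"
	RoleActiveFailover PodRole = "activefailover"
	RoleSyncMaster     PodRole = "syncmaster"
	RoleSyncWorker     PodRole = "syncworker"
)

// policy holds the NoExecute toleration time for one role;
// forever=true means TolerationSeconds is left unset.
type policy struct {
	wait    time.Duration
	forever bool
}

// policies encodes the toleration times as specified after PATCH 4/4.
var policies = map[PodRole]policy{
	RoleImageID:        {wait: 5 * time.Second},
	RoleCoordinator:    {wait: 15 * time.Second},
	RoleDBServer:       {wait: 5 * time.Minute},
	RoleAgent:          {wait: 5 * time.Minute},
	RoleSingleServer:   {forever: true},
	RoleActiveFailover: {wait: 5 * time.Minute},
	RoleSyncMaster:     {wait: 15 * time.Second},
	RoleSyncWorker:     {wait: time.Minute},
}

// tolerationsFor builds the unreachable and not-ready NoExecute
// tolerations for a pod of the given role.
func tolerationsFor(role PodRole) ([]Toleration, error) {
	p, ok := policies[role]
	if !ok {
		return nil, fmt.Errorf("unknown pod role %q", role)
	}
	result := make([]Toleration, 0, 2)
	for _, key := range []string{TaintNodeUnreachable, TaintNodeNotReady} {
		t := Toleration{Key: key, Operator: "Exists", Effect: "NoExecute"}
		if !p.forever {
			secs := int64(p.wait / time.Second) // fresh variable per iteration
			t.TolerationSeconds = &secs
		}
		result = append(result, t)
	}
	return result, nil
}

func main() {
	for _, role := range []PodRole{RoleImageID, RoleAgent, RoleSingleServer} {
		ts, err := tolerationsFor(role)
		if err != nil {
			panic(err)
		}
		for _, t := range ts {
			if t.TolerationSeconds == nil {
				fmt.Printf("%s %s: forever\n", role, t.Key)
			} else {
				fmt.Printf("%s %s: %ds\n", role, t.Key, *t.TolerationSeconds)
			}
		}
	}
}
```

Keeping the times in one table makes the spread in the spec easy to review: seconds for stateless pods, minutes for stateful ones, and no limit for single server pods. The `Exists` operator is an assumption here, chosen because only the taint key needs to match.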