Added a spec regarding the rules for eviction & replacement of pods #133
@@ -0,0 +1,124 @@

# Pod Eviction & Replacement

This chapter specifies the rules around evicting pods from nodes and
restarting or replacing them.

## Eviction

Eviction is the process of removing a pod that is running on a node from that node.

This is typically the result of a drain action (`kubectl drain`) or
of a taint being added to a node (either automatically by Kubernetes or manually by an operator).
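
For illustration, the sketch below shows one way such an eviction can be requested programmatically with client-go, which is what `kubectl drain` does per pod. This is not code from this operator; the namespace and pod name are placeholders, and signatures vary between client-go releases (newer releases use `policy/v1` and take a `context`).

```go
package main

import (
	"fmt"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// evictPod asks the API server to evict the given pod. Unlike a plain delete,
// an eviction respects PodDisruptionBudgets.
func evictPod(client kubernetes.Interface, namespace, name string) error {
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
	}
	return client.PolicyV1beta1().Evictions(namespace).Evict(eviction)
}

func main() {
	// Load the local kubeconfig; an operator running in-cluster would use
	// the in-cluster config instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// "default" and "example-pod" are placeholders.
	if err := evictPod(client, "default", "example-pod"); err != nil {
		fmt.Println("eviction failed:", err)
	}
}
```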

## Replacement

Replacement is the process of replacing a pod with another pod that takes over the responsibilities
of the original pod.

The replacement pod has a new ID and new (read: empty) persistent data.

Note that replacing a pod is different from restarting a pod. A pod is restarted when it has been reported
to have terminated.

## NoExecute Tolerations

NoExecute tolerations are used to control the behavior of Kubernetes (with respect to a pod) when the node
that the pod is running on is no longer reachable or becomes not-ready.

See the applicable [Kubernetes documentation](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) for more info.
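
For illustration, such a toleration with a bounded toleration time can be built with the upstream Kubernetes API types as sketched below. This is not the operator's actual code; the 5-second value is simply the "very low" setting used for image ID pods further down.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// noExecuteTolerations returns NoExecute tolerations for the "unreachable"
// and "not-ready" taints. A nil 'seconds' means "tolerate forever"; otherwise
// the pod is evicted once the taint has been present for that many seconds.
func noExecuteTolerations(seconds *int64) []corev1.Toleration {
	keys := []string{
		"node.kubernetes.io/unreachable",
		"node.kubernetes.io/not-ready",
	}
	tolerations := make([]corev1.Toleration, 0, len(keys))
	for _, key := range keys {
		tolerations = append(tolerations, corev1.Toleration{
			Key:               key,
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: seconds,
		})
	}
	return tolerations
}

func main() {
	// Example: the very low (5sec) setting used for image ID pods.
	five := int64(5)
	fmt.Printf("%+v\n", noExecuteTolerations(&five))
}
```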

## Rules

The rules for eviction & replacement are specified per type of pod.

### Image ID Pods

Image ID pods are started to fetch the ArangoDB version of a specific
ArangoDB image and the Docker sha256 of that image.
They have no persistent state.

- Image ID pods can always be evicted from any node
- Image ID pods can always be restarted on a different node.
  There is no need to replace an image ID pod, nor will it cause problems when
  two image ID pods run at the same time.
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set very low (5sec)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set very low (5sec)

### Coordinator Pods

Coordinator pods run an ArangoDB coordinator as part of an ArangoDB cluster.
They have no persistent state, but do have a unique ID.

- Coordinator pods can always be evicted from any node
- Coordinator pods can always be replaced with another coordinator pod with a different ID on a different node
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)

### DBServer Pods

DBServer pods run an ArangoDB dbserver as part of an ArangoDB cluster.
They have persistent state potentially tied to the node they run on and have a unique ID.

- DBServer pods can be evicted from any node as soon as:
  - It has been completely drained AND
  - It is no longer the shard master for any shard
- DBServer pods can be replaced with another dbserver pod with a different ID on a different node (see the sketch below) when:
  - It is not the shard master for any shard OR
  - For every shard it is the master for, there is an in-sync follower
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)

> Review comment: This relates to the remark about "replacing" above: if the dbserver gets a new ID, it will not really be used for anything without user intervention, unless we put some kind of rebalancing of shards in.
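
The conditions above can be read as simple predicates. The sketch below is purely illustrative; `dbserverState` and its fields are hypothetical placeholders for cluster information the operator would have to obtain, not an actual API.

```go
// Sketch of the eviction/replacement decision for dbserver pods.
package rules

type dbserverState struct {
	CompletelyDrained      bool // no shards are left on this dbserver
	ShardMasterForAnyShard bool // it currently leads at least one shard
	AllLedShardsInSync     bool // every shard it leads has an in-sync follower
}

// canEvictDBServer: the dbserver must be completely drained AND must no
// longer be the shard master for any shard.
func canEvictDBServer(s dbserverState) bool {
	return s.CompletelyDrained && !s.ShardMasterForAnyShard
}

// canReplaceDBServer: it is not the shard master for any shard OR every
// shard it is the master for has an in-sync follower.
func canReplaceDBServer(s dbserverState) bool {
	return !s.ShardMasterForAnyShard || s.AllLedShardsInSync
}
```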

### Agent Pods

Agent pods run an ArangoDB agent as part of an ArangoDB agency.
They have persistent state potentially tied to the node they run on and have a unique ID.

> Review comment: "tight" -> "tied"

- Agent pods can be evicted from any node as soon as:
  - It is no longer the agency leader AND
  - There is at least one agency leader that is responding AND
  - There is at least one agency follower that is responding
- Agent pods can be replaced with another agent pod with the same ID but wiped persistent state on a different node (see the sketch below) when:
  - The old pod is known to be deleted (e.g. explicit eviction)
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
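
As with dbservers, these conditions can be expressed as predicates. Again this is purely illustrative; `agencyState` and its fields are hypothetical placeholders, not an actual operator or agency API.

```go
// Sketch of the eviction/replacement decision for agent pods.
package rules

type agencyState struct {
	PodIsLeader        bool // this agent is currently the agency leader
	LeaderResponding   bool // an agency leader is responding
	FollowerResponding bool // at least one agency follower is responding
	OldPodDeleted      bool // the old pod is known to be deleted
}

// canEvictAgent: the agent must not be the leader, and the agency must still
// have a responding leader and at least one responding follower.
func canEvictAgent(s agencyState) bool {
	return !s.PodIsLeader && s.LeaderResponding && s.FollowerResponding
}

// canReplaceAgent: replacement (same ID, wiped persistent state, different
// node) is only allowed once the old pod is known to be deleted.
func canReplaceAgent(s agencyState) bool {
	return s.OldPodDeleted
}
```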

### Single Server Pods

Single server pods run an ArangoDB server as part of an ArangoDB single server deployment.
They have persistent state potentially tied to the node they run on.

- Single server pods cannot be evicted from any node.
- Single server pods cannot be replaced with another pod.
- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set, in order to "wait it out forever"
- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set, in order to "wait it out forever"

### Single Pods in Active Failover Deployment

Single pods run an ArangoDB single server as part of an ArangoDB active failover deployment.
They have persistent state potentially tied to the node they run on and have a unique ID.

- Single pods can be evicted from any node as soon as:
  - It is a follower of an active-failover deployment (Q: can we trigger this failover to another server?)
- Single pods can always be replaced with another single pod with a different ID on a different node.
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)

> Review comment: Need to check this, do not know by heart.

### SyncMaster Pods

SyncMaster pods run an ArangoSync master as part of an ArangoDB DC2DC cluster.
They have no persistent state, but do have a unique address.

- SyncMaster pods can always be evicted from any node
- SyncMaster pods can always be replaced with another syncmaster pod on a different node
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)

> Review comment: Is there any requirement about the same network endpoint or an internal k8s service being set up in case of a replacement?
> Reply: no

### SyncWorker Pods

SyncWorker pods run an ArangoSync worker as part of an ArangoDB DC2DC cluster.
They have no persistent state, but do have in-memory state and a unique address.

- SyncWorker pods can always be evicted from any node
- SyncWorker pods can always be replaced with another syncworker pod on a different node
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)

> Review comment: Same here about the network endpoint.
> Reply: no
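
To summarize the per-type settings listed in this chapter, the NoExecute toleration times could be collected as sketched below. This is illustrative only; the map keys are descriptive labels rather than the operator's group identifiers, and `nil` stands for "toleration time not set, wait it out forever".

```go
// Illustrative summary of the NoExecute toleration times from this chapter,
// applied to both the "node.kubernetes.io/unreachable" and
// "node.kubernetes.io/not-ready" taints.
package rules

import "time"

func tolerationSeconds(d time.Duration) *int64 {
	s := int64(d / time.Second)
	return &s
}

var tolerationTimes = map[string]*int64{
	"imageid":                tolerationSeconds(5 * time.Second),  // very low
	"coordinator":            tolerationSeconds(15 * time.Second), // low
	"dbserver":               tolerationSeconds(5 * time.Minute),  // "wait it out a while"
	"agent":                  tolerationSeconds(5 * time.Minute),  // "wait it out a while"
	"single":                 nil,                                 // not set: wait it out forever
	"single-active-failover": tolerationSeconds(5 * time.Minute),  // "wait it out a while"
	"syncmaster":             tolerationSeconds(15 * time.Second), // low
	"syncworker":             tolerationSeconds(1 * time.Minute),  // avoid resynchronization
}
```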

> Review comment (Coordinator Pods): Add? "There is no danger at all if two coordinator pods with different ID run concurrently."
> Reply: Done (a bit different)