
Added a spec regarding the rules for eviction & replacement of pods #133


Merged: 4 commits, May 14, 2018
124 changes: 124 additions & 0 deletions docs/design/pod_evication_and_replacement.md
# Pod Eviction & Replacement

This chapter specifies the rules around evicting pods from nodes and
restarting or replacing them.

## Eviction

Eviction is the process of removing a pod from the node it is running on.

This is typically the result of a drain action (`kubectl drain`) or
from a taint being added to a node (either automatically by Kubernetes or manually by an operator).

## Replacement

Replacement is the process of replacing a pod with another pod that takes over the responsibilities
of the original pod.

The replacement pod has a new ID and new (read: empty) persistent data.

Note that replacing a pod is different from restarting a pod. A pod is restarted when it has been reported
to have terminated.

## NoExecute Tolerations

NoExecute tolerations are used to control the behavior of Kubernetes (with respect to a Pod) when the node
that the pod is running on is no longer reachable or becomes not-ready.

See the applicable [Kubernetes documentation](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) for more info.
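
The toleration times mentioned in the rules below end up in the `tolerationSeconds` field of a `NoExecute` toleration on the pod. As a minimal sketch of how such a toleration could be built (assuming the Go types from `k8s.io/api/core/v1`; the helper name is hypothetical and not the operator's actual code):

```go
package main

import (
	"fmt"

	core "k8s.io/api/core/v1"
)

// noExecuteToleration builds a NoExecute toleration for the given taint key that
// keeps the pod on the node for at most `seconds` after the taint appears.
// (Illustrative helper only.)
func noExecuteToleration(key string, seconds int64) core.Toleration {
	return core.Toleration{
		Key:               key,
		Operator:          core.TolerationOpExists,
		Effect:            core.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
}

func main() {
	// Example: the 15sec window used for coordinator pods below.
	t := noExecuteToleration("node.kubernetes.io/unreachable", 15)
	fmt.Printf("tolerate %s for %ds\n", t.Key, *t.TolerationSeconds)
}
```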

## Rules

The rules for eviction & replacement are specified per type of pod.

### Image ID Pods

Image ID pods are started to fetch the ArangoDB version and the Docker sha256
of a specific ArangoDB image.
They have no persistent state.

- Image ID pods can always be evicted from any node
- Image ID pods can always be restarted on a different node.
There is no need to replace an image ID pod, nor will it cause problems when
2 image ID pods run at the same time.
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set very low (5sec)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set very low (5sec)
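
As an illustration of where these windows live, a sketch of an image ID pod spec with the 5sec tolerations attached (this extends the previous sketch and reuses its hypothetical `noExecuteToleration` helper; the container details are made up):

```go
// imageIDPodSpec sketches how the very short tolerations could be attached to
// an image ID pod spec. Container name and image are illustrative only.
func imageIDPodSpec() core.PodSpec {
	return core.PodSpec{
		Containers: []core.Container{
			{Name: "server", Image: "arangodb/arangodb:latest"},
		},
		Tolerations: []core.Toleration{
			noExecuteToleration("node.kubernetes.io/unreachable", 5),
			noExecuteToleration("node.kubernetes.io/not-ready", 5),
		},
	}
}
```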

### Coordinator Pods

Coordinator pods run an ArangoDB coordinator as part of an ArangoDB cluster.
They have no persistent state, but do have a unique ID.

- Coordinator pods can always be evicted from any node
- Coordinator pods can always be replaced with another coordinator pod with a different ID on a different node
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)
Member:
Add? "There is no danger at all if two coordinator pods with different ID run concurrently."

Contributor Author:
Done (a bit different)


### DBServer Pods

DBServer pods run an ArangoDB dbserver as part of an ArangoDB cluster.
They have persistent state, potentially tied to the node they run on, and a unique ID.

- DBServer pods can be evicted from any node as soon as (see the sketch after this list):
  - The pod has been completely drained AND
  - The pod is no longer the shard master for any shard
- DBServer pods can be replaced with another dbserver pod with a different ID on a different node when:
  - The pod is not the shard master for any shard OR
  - For every shard the pod is the master for, there is an in-sync follower
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
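
The eviction and replacement conditions above, expressed as boolean predicates (a sketch only; the types and helper names are hypothetical, and a real implementation would have to query the ArangoDB cluster for shard state):

```go
// ShardInfo describes one shard this dbserver participates in (illustrative type).
type ShardInfo struct {
	IsLeader          bool // this dbserver is the shard master
	HasInSyncFollower bool // at least one follower of this shard is in sync
}

// canEvictDBServer: the pod must be completely drained AND must no longer be
// the shard master for any shard.
func canEvictDBServer(drained bool, shards []ShardInfo) bool {
	if !drained {
		return false
	}
	for _, s := range shards {
		if s.IsLeader {
			return false
		}
	}
	return true
}

// canReplaceDBServer: the pod is not the shard master for any shard, OR every
// shard it is the master for has an in-sync follower to take over.
func canReplaceDBServer(shards []ShardInfo) bool {
	for _, s := range shards {
		if s.IsLeader && !s.HasInSyncFollower {
			return false
		}
	}
	return true
}
```
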
Member:
This comment has to do with the comment about "replacing" above: If the dbserver has a new ID, it will not really be used for anything without user intervention, unless we put some kind of rebalancing of shards in.


### Agent Pods

Agent pods run an ArangoDB agent as part of an ArangoDB agency.
They have persistent state, potentially tied to the node they run on, and a unique ID.
Member:
"tight" -> "tied"


- Agent pods can be evicted from any node as soon as (see the sketch after this list):
  - The pod is no longer the agency leader AND
  - There is an agency leader that is responding AND
  - There is at least one agency follower that is responding
- Agent pods can be replaced with another agent pod with the same ID but wiped persistent state on a different node when:
  - The old pod is known to be deleted (e.g. explicit eviction)
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
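
The agent eviction condition above, as a predicate over an illustrative snapshot of agency health (hypothetical names; the real check would query the agency):

```go
// AgencyView is an illustrative snapshot of agency health as seen by the operator.
type AgencyView struct {
	PodIsLeader         bool // the agent pod under consideration is the current leader
	LeaderResponding    bool // some agency leader is responding
	FollowersResponding int  // number of agency followers that are responding
}

// canEvictAgent: the pod must not be the agency leader, a leader must be
// responding, and at least one follower must be responding.
func canEvictAgent(v AgencyView) bool {
	return !v.PodIsLeader && v.LeaderResponding && v.FollowersResponding >= 1
}
```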

### Single Server Pods

Single server pods run an ArangoDB server as part of an ArangoDB single server deployment.
They have persistent state, potentially tied to the node they run on.

- Single server pods cannot be evicted from any node.
- Single server pods cannot be replaced with another pod.
- `node.kubernetes.io/unreachable:NoExecute` toleration time is not set, to "wait it out forever"
- `node.kubernetes.io/not-ready:NoExecute` toleration time is not set, to "wait it out forever"
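
Assuming "not set" means the NoExecute toleration is added without a `tolerationSeconds` value (in Kubernetes such a toleration tolerates the taint indefinitely), a sketch reusing the core/v1 types from the earlier sketches:

```go
// foreverToleration tolerates the given NoExecute taint indefinitely:
// leaving TolerationSeconds nil means the pod is never evicted for this taint.
// (Hypothetical helper, illustrative only.)
func foreverToleration(key string) core.Toleration {
	return core.Toleration{
		Key:      key,
		Operator: core.TolerationOpExists,
		Effect:   core.TaintEffectNoExecute,
		// TolerationSeconds intentionally left nil.
	}
}
```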

### Single Pods in Active Failover Deployment

Single pods run an ArangoDB single server as part of an ArangoDB active failover deployment.
They have persistent state, potentially tied to the node they run on, and a unique ID.

- Single pods can be evicted from any node as soon as:
  - The pod is a follower of the active-failover deployment (Q: can we trigger this failover to another server?)
- Single pods can always be replaced with another single pod with a different ID on a different node.
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set high to "wait it out a while" (5min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set high to "wait it out a while" (5min)
Member:
Need to check this, do not know by heart.


### SyncMaster Pods

SyncMaster pods run an ArangoSync master as part of an ArangoDB DC2DC cluster.
They have no persistent state, but do have a unique address.

- SyncMaster pods can always be evicted from any node
- SyncMaster pods can always be replaced with another syncmaster pod on a different node
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set low (15sec)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set low (15sec)
Member:
Is there any requirement about the same network endpoint or an internal k8s service being set up in case of a replacement?

Contributor Author:
no


### SyncWorker Pods

SyncWorker pods run an ArangoSync worker as part of an ArangoDB DC2DC cluster.
They have no persistent state, but do have in-memory state and a unique address.

- SyncWorker pods can always be evicted from any node
- SyncWorker pods can always be replaced with another syncworker pod on a different node
- `node.kubernetes.io/unreachable:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)
- `node.kubernetes.io/not-ready:NoExecute` toleration time is set a bit higher to try to avoid resynchronization (1min)
Member:
Same here about network endpoint.

Contributor Author:
no
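
Taken together, the toleration windows from the sections above amount to a small per-pod-type table. A sketch of that mapping (group names and the helper are illustrative only; a nil entry stands for "not set, wait it out forever"):

```go
package main

import "fmt"

// tolerationSeconds collects the NoExecute toleration windows from the spec
// above, per pod type (the same value applies to the unreachable and
// not-ready taints). A nil entry means the toleration time is not set.
func tolerationSeconds() map[string]*int64 {
	sec := func(s int64) *int64 { return &s }
	return map[string]*int64{
		"imageid":               sec(5),
		"coordinator":           sec(15),
		"dbserver":              sec(5 * 60),
		"agent":                 sec(5 * 60),
		"single":                nil, // single server deployment: wait it out forever
		"activefailover-single": sec(5 * 60),
		"syncmaster":            sec(15),
		"syncworker":            sec(60),
	}
}

func main() {
	for group, s := range tolerationSeconds() {
		if s == nil {
			fmt.Printf("%-22s no toleration time (tolerate forever)\n", group)
			continue
		}
		fmt.Printf("%-22s %ds\n", group, *s)
	}
}
```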