This is a Kubernetes controller for automated Node operations. In general, if we perform a Node operation that affects running Pods, we need to do the following steps:
- Make the Node unschedulable.
- Evict running Pods on the Node and wait for all of them to be evicted.
- Perform the operation.
- Make the Node schedulable again.
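Done by hand, this sequence looks roughly like the following `kubectl` commands (a sketch only; the node name and the operation command are placeholders, and the commands require access to a real cluster):

```sh
# Make the Node unschedulable and evict running Pods.
kubectl cordon node1
kubectl drain node1 --ignore-daemonsets --delete-emptydir-data

# Perform the operation (placeholder for the actual maintenance work).
ssh node1 'echo do some operation here'

# Make the Node schedulable again.
kubectl uncordon node1
```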
The node operation controller automates these steps. In addition, this controller:
- watches NodeConditions and performs an arbitrary operation in response, and
- keeps track of the number of Nodes made unavailable by operations.
When a NodeOperation resource is created, the controller proceeds as follows:

1. Confirm that the NodeOperation does not violate any NodeDisruptionBudget.
   - If it does, wait for other NodeOperations to finish.
2. Taint the target Node specified in the NodeOperation.
   - The taint is `nodeops.k8s.preferred.jp/operating=:NoSchedule`.
3. Evict all running Pods on the Node.
   - By default, this uses the Pod eviction API, so eviction respects PodDisruptionBudgets.
   - This behavior can be configured by the `evictionStrategy` option of the NodeOperation.
4. After the eviction, run the Job configured in the NodeOperation.
   - The Pod created by the Job has the `nodeops.k8s.preferred.jp/nodename` annotation, which indicates the target Node.
5. Wait for the Job to reach the Completed or Failed phase.
6. Untaint the Node.
Most operation teams have their own secret sauce for daily operations, meaning that typical node failures can be cured by common recipes shared among the team. `NodeRemediation`, `NodeRemediationTemplate`, and `NodeOperationTemplate` enable us to automate these common operations for known node issues.

A `NodeOperationTemplate` represents a template of a common node operation.

A `NodeRemediation` defines:
- the target Node to apply the remediation to,
- a known failure, expressed by Node conditions, and
- the corresponding `nodeOperationTemplate` to fix the failure.

A `NodeRemediationTemplate` defines:
- the target Nodes to apply the remediation to, selected by `nodeSelector`, and
- a template of `NodeRemediation`.

The node operation controller watches Nodes, and if it detects a failure that matches some `NodeRemediation`, it automatically creates a `NodeOperation` from the specified `NodeOperationTemplate`.
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperation
metadata:
  name: example
spec:
  nodeName: "<operation target node>"
  jobTemplate:
    metadata:
      namespace: default
    spec: # batchv1.JobSpec
      template:
        spec:
          containers:
          - name: operation
            image: busybox
            command: ["sh", "-c", "echo Do some operation for $TARGET_NODE && sleep 60 && echo Done"]
            env:
            - name: TARGET_NODE
              valueFrom:
                fieldRef:
                  fieldPath: "metadata.annotations['nodeops.k8s.preferred.jp/nodename']"
          restartPolicy: Never
  evictionStrategy: Evict # optional
  nodeDisruptionBudgetSelector: {} # optional
  skipWaitingForEviction: false # optional
```
This controller supports several ways to evict Pods:

- `evictionStrategy: Evict`: evicts Pods via the Pod eviction API, which respects PodDisruptionBudgets.
- `evictionStrategy: Delete`: evicts Pods by deleting them.
- `evictionStrategy: ForceDelete`: evicts Pods by deleting them forcibly.
- `evictionStrategy: None`: does not evict Pods; it just waits for all Pods to finish.
By default, a NodeOperation respects all NodeDisruptionBudgets (NDBs), but in some cases (e.g. urgent operations) some NDBs need to be ignored. If `nodeDisruptionBudgetSelector` is set, only NDBs whose labels match the selector are respected.

By default, a NodeOperation waits for all Pods to be drained by the eviction. If `skipWaitingForEviction` is true, the NodeOperation skips waiting for the eviction to finish, i.e. it ignores Pods that have not been drained.
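Combining the two options, an urgent operation might respect only a subset of NDBs and skip waiting for eviction. The manifest below is a hypothetical sketch; the `severity: critical` label is a made-up placeholder, not a convention of this controller:

```yaml
# Hypothetical urgent operation: only NDBs labeled severity=critical are
# respected, and the controller does not wait for eviction to finish.
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperation
metadata:
  name: urgent-example
spec:
  nodeName: "<operation target node>"
  nodeDisruptionBudgetSelector:
    severity: critical # placeholder label
  skipWaitingForEviction: true
  jobTemplate:
    metadata:
      namespace: default
    spec:
      template:
        spec:
          containers:
          - name: operation
            image: busybox
            command: ["echo", "urgent operation"]
          restartPolicy: Never
```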
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeDisruptionBudget
metadata:
  name: example
spec:
  selector: # nodeSelector for Nodes that this NodeDisruptionBudget affects
    nodeLabelKey: nodeLabelValue
  maxUnavailable: 1 # optional
  minAvailable: 1 # optional
  taintTargets: [] # optional
```
- `minAvailable`: the minimum number of available Nodes
- `maxUnavailable`: the maximum number of unavailable Nodes
By default, this controller treats Nodes with a specific taint as "unavailable". The taint is `nodeops.k8s.preferred.jp/operating=:NoSchedule`, and it is added to Nodes while this controller is processing NodeOperations.
In addition to the default taint, Nodes with taints that match `taintTargets` are also treated as "unavailable".
```yaml
taintTargets:
- key: 'k1'
  operator: 'Equal'
  value: 'v1'
  effect: 'NoSchedule'
```
For instance, if the above `taintTargets` are set, Nodes with the `k1=v1:NoSchedule` taint are "unavailable".
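The availability accounting described above can be sketched as follows. This is an illustration of the rules, not the controller's actual implementation; the function and variable names are made up:

```python
# Illustrative sketch of NodeDisruptionBudget accounting (not the real controller code).
OPERATING_TAINT = ("nodeops.k8s.preferred.jp/operating", "", "NoSchedule")

def is_unavailable(node_taints, taint_targets):
    """A Node is 'unavailable' if it carries the controller's operating taint
    or any taint matching the NDB's taintTargets."""
    for key, value, effect in node_taints:
        if (key, value, effect) == OPERATING_TAINT:
            return True
        for t in taint_targets:
            value_matches = t["operator"] == "Exists" or value == t.get("value")
            if key == t["key"] and value_matches and effect == t["effect"]:
                return True
    return False

def may_start_operation(nodes, taint_targets, max_unavailable=None, min_available=None):
    """Check whether tainting one more Node would still satisfy the budget."""
    unavailable = sum(1 for taints in nodes if is_unavailable(taints, taint_targets))
    available = len(nodes) - unavailable
    if max_unavailable is not None and unavailable + 1 > max_unavailable:
        return False
    if min_available is not None and available - 1 < min_available:
        return False
    return True

# Three Nodes: one already being operated on, two healthy.
nodes = [
    [("nodeops.k8s.preferred.jp/operating", "", "NoSchedule")],
    [],
    [],
]
print(may_start_operation(nodes, [], max_unavailable=1))  # False: one Node is already unavailable
print(may_start_operation(nodes, [], min_available=1))    # True: two available, one would remain
```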
A NodeRemediation watches the conditions of a Node and creates a NodeOperation from a NodeOperationTemplate to remediate the failure.
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperationTemplate
metadata:
  name: optemplate1
spec:
  template:
    metadata: {}
    spec: # NodeOperationSpec
      job:
        metadata:
          namespace: default
        spec: # batchv1.JobSpec
          template:
            spec:
              containers:
              - name: operation
                image: busybox
                command: ["echo", "Do some operation here"]
              restartPolicy: Never
```
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeRemediation
metadata:
  name: remediation1
spec:
  nodeName: node1
  nodeOperationTemplateName: 'optemplate1'
  rule:
    conditions:
    - type: PIDPressure
      status: "True"
    - type: OtherCondition
      status: "Unknown"
```
A NodeRemediationTemplate creates a NodeRemediation for each Node selected by `nodeSelector`.
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeRemediationTemplate
metadata:
  name: remediationtemplate1
spec:
  nodeSelector:
    'kubernetes.io/os': 'linux'
  template:
    spec:
      nodeOperationTemplateName: 'optemplate1'
      rule:
        conditions:
        - type: PIDPressure
          status: "True"
        - type: OtherCondition
          status: "Unknown"
```
The release process is fully automated by tagpr. To release, just merge the latest release PR.