Commit b12c545

Add docs for extended tolerations operators
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
1 parent 0c94ca5 commit b12c545

3 files changed

+135
-28
lines changed

content/en/docs/concepts/scheduling-eviction/taint-and-toleration.md

Lines changed: 102 additions & 28 deletions
@@ -11,7 +11,7 @@ weight: 50
1111

1212
<!-- overview -->
1313
[_Node affinity_](/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity)
14-
is a property of {{< glossary_tooltip text="Pods" term_id="pod" >}} that *attracts* them to
14+
is a property of {{< glossary_tooltip text="Pods" term_id="pod" >}} that _attracts_ them to
1515
a set of {{< glossary_tooltip text="nodes" term_id="node" >}} (either as a preference or a
1616
hard requirement). _Taints_ are the opposite -- they allow a node to repel a set of pods.
1717

@@ -39,6 +39,7 @@ places a taint on node `node1`. The taint has key `key1`, value `value1`, and ta
3939
This means that no pod will be able to schedule onto `node1` unless it has a matching toleration.
4040

4141
To remove the taint added by the command above, you can run:
42+
4243
```shell
4344
kubectl taint nodes node1 key1=value1:NoSchedule-
4445
```
@@ -81,6 +82,24 @@ A toleration "matches" a taint if the keys are the same and the effects are the
8182
* the `operator` is `Exists` (in which case no `value` should be specified), or
8283
* the `operator` is `Equal` and the values should be equal.
8384

85+
{{< feature-state feature_gate_name="TaintTolerationComparisonOperators" >}}
86+
87+
You can also use numeric comparison operators for threshold-based matching:
88+
89+
* the `operator` is `Gt` (greater than) and the taint value is greater than the toleration value, or
90+
* the `operator` is `Lt` (less than) and the taint value is less than the toleration value.
91+
92+
For numeric operators, both the toleration and taint values must be valid integers.
93+
If either value cannot be parsed as an integer, the toleration does not match.
94+
95+
{{< note >}}
96+
When you create a Pod that uses the `Gt` or `Lt` toleration operators, the API server
97+
validates that the toleration values are valid integers. Taint values on nodes are not
98+
validated at node registration time. If a node has a non-numeric taint value
99+
(for example, `node.kubernetes.io/sla=high:NoSchedule`), pods with numeric comparison
100+
operators will not match that taint and cannot schedule on that node.
101+
{{< /note >}}
102+
84103
{{< note >}}
85104

86105
There are two special cases:
@@ -93,15 +112,15 @@ An empty `effect` matches all effects with key `key1`.
93112

94113
The above example used the `effect` of `NoSchedule`. Alternatively, you can use the `effect` of `PreferNoSchedule`.
95114

96-
97115
The allowed values for the `effect` field are:
98116

99117
`NoExecute`
100118
: This affects pods that are already running on the node as follows:
101-
* Pods that do not tolerate the taint are evicted immediately
102-
* Pods that tolerate the taint without specifying `tolerationSeconds` in
119+
120+
* Pods that do not tolerate the taint are evicted immediately
121+
* Pods that tolerate the taint without specifying `tolerationSeconds` in
103122
their toleration specification remain bound forever
104-
* Pods that tolerate the taint with a specified `tolerationSeconds` remain
123+
* Pods that tolerate the taint with a specified `tolerationSeconds` remain
105124
bound for the specified amount of time. After that time elapses, the node
106125
lifecycle controller evicts the Pods from the node.
107126

@@ -111,7 +130,7 @@ The allowed values for the `effect` field are:
111130

112131
`PreferNoSchedule`
113132
: `PreferNoSchedule` is a "preference" or "soft" version of `NoSchedule`.
114-
The control plane will *try* to avoid placing a Pod that does not tolerate
133+
The control plane will _try_ to avoid placing a Pod that does not tolerate
115134
the taint on the node, but it is not guaranteed.
116135

117136
You can put multiple taints on the same node and multiple tolerations on the same pod.
@@ -122,7 +141,7 @@ remaining un-ignored taints have the indicated effects on the pod. In particular
122141
* if there is at least one un-ignored taint with effect `NoSchedule` then Kubernetes will not schedule
123142
the pod onto that node
124143
* if there is no un-ignored taint with effect `NoSchedule` but there is at least one un-ignored taint with
125-
effect `PreferNoSchedule` then Kubernetes will *try* to not schedule the pod onto the node
144+
effect `PreferNoSchedule` then Kubernetes will _try_ to not schedule the pod onto the node
126145
* if there is at least one un-ignored taint with effect `NoExecute` then the pod will be evicted from
127146
the node (if it is already running on the node), and will not be
128147
scheduled onto the node (if it is not yet running on the node).
@@ -173,9 +192,62 @@ means that if this pod is running and a matching taint is added to the node, the
173192
the pod will stay bound to the node for 3600 seconds, and then be evicted. If the
174193
taint is removed before that time, the pod will not be evicted.
175194

195+
## Numeric comparison operators {#numeric-comparison-operators}
196+
197+
{{< feature-state feature_gate_name="TaintTolerationComparisonOperators" >}}
198+
199+
In addition to the `Equal` and `Exists` operators, you can use numeric comparison
200+
operators (`Gt` and `Lt`) to match taints with integer values. This is useful for
201+
threshold-based scheduling scenarios, such as matching nodes based on reliability
202+
levels or SLA requirements.
203+
204+
For example, if nodes are tainted with an SLA value:
205+
206+
```shell
207+
kubectl taint nodes node1 node.kubernetes.io/sla=950:NoSchedule
208+
```
209+
210+
A pod can tolerate nodes with SLA greater than 900:
211+
212+
{{% code_sample file="pods/pod-with-numeric-toleration.yaml" %}}
213+
214+
This toleration matches the taint on `node1` because the taint value (`950`) is greater than
215+
the toleration value (`900`), which is what the `Gt` operator requires.
216+
217+
Similarly, you can use the `Lt` operator to match taints where the toleration value
218+
is greater than the taint value:
219+
220+
```yaml
221+
tolerations:
222+
- key: "node.kubernetes.io/sla"
223+
  operator: "Lt"
224+
  value: "1000"
225+
  effect: "NoSchedule"
226+
```
227+
228+
{{< note >}}
229+
When using numeric comparison operators:
230+
231+
* Both the toleration and taint values must be valid integers (signed 64-bit).
232+
* If a value cannot be parsed as an integer, the toleration does not match.
233+
* Numeric operators work with all taint effects: `NoSchedule`, `PreferNoSchedule`, and `NoExecute`.
234+
{{< /note >}}
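
The matching rules above can be modelled in a few lines of Python. This is only an illustrative sketch, not the scheduler's implementation: the names `parse_int64` and `numeric_toleration_matches` are hypothetical, the accepted integer syntax (optional sign followed by decimal digits) is an assumption, and the comparison direction follows the examples in this section (`Gt` matches when the taint value is greater than the toleration value).

```python
import re

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def parse_int64(text):
    """Return text as a signed 64-bit integer, or None if it is not one.

    Assumption: a valid value is an optional sign followed by decimal digits.
    """
    if not isinstance(text, str) or not re.fullmatch(r"[+-]?[0-9]+", text):
        return None
    number = int(text)
    if INT64_MIN <= number <= INT64_MAX:
        return number
    return None


def numeric_toleration_matches(operator, toleration_value, taint_value):
    """Model of Gt/Lt matching between one toleration and one taint value."""
    toleration = parse_int64(toleration_value)
    taint = parse_int64(taint_value)
    # If either side is not a valid integer, the toleration does not match.
    if toleration is None or taint is None:
        return False
    if operator == "Gt":
        return taint > toleration
    if operator == "Lt":
        return taint < toleration
    # Equal and Exists are outside the scope of this sketch.
    return False
```

With the `node1` example, `numeric_toleration_matches("Gt", "900", "950")` is true, while a non-numeric taint value such as `high` never matches.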
235+
236+
{{< warning >}}
237+
238+
Before disabling the `TaintTolerationComparisonOperators` feature gate, you should identify
239+
all workloads using the `Gt` or `Lt` operators to avoid controller hot-loops.
240+
241+
Before disabling the feature gate:
242+
243+
* Update all workload controller templates to use `Equal` or `Exists` operators instead
244+
* Delete any pending pods that use `Gt` or `Lt` operators
245+
* Monitor the `apiserver_request_total` metric for spikes in validation errors
246+
{{< /warning >}}
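
One way to identify affected pods is to scan the JSON that `kubectl get pods --all-namespaces -o json` prints. The sketch below assumes that output has already been parsed into a Python dictionary (for example with `json.load`); the function name `pods_with_numeric_tolerations` is hypothetical and not part of any Kubernetes tooling.

```python
NUMERIC_OPERATORS = {"Gt", "Lt"}


def pods_with_numeric_tolerations(pod_list):
    """Return (namespace, name, key, operator) for each Gt/Lt toleration.

    pod_list is the parsed JSON of a Pod list, with pods under "items".
    """
    found = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        # A Pod without tolerations may omit the field or set it to null.
        for toleration in pod.get("spec", {}).get("tolerations") or []:
            operator = toleration.get("operator")
            if operator in NUMERIC_OPERATORS:
                found.append((
                    metadata.get("namespace", ""),
                    metadata.get("name", ""),
                    toleration.get("key", ""),
                    operator,
                ))
    return found
```

This covers Pods only. Workload controllers such as Deployments keep their tolerations under `spec.template.spec.tolerations`, so their templates need a similar check.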
247+
176248
## Example Use Cases
177249

178-
Taints and tolerations are a flexible way to steer pods *away* from nodes or evict
250+
Taints and tolerations are a flexible way to steer pods _away_ from nodes or evict
179251
pods that shouldn't be running. A few of the use cases are
180252

181253
* **Dedicated Nodes**: If you want to dedicate a set of nodes for exclusive use by
@@ -184,8 +256,8 @@ a particular set of users, you can add a taint to those nodes (say,
184256
toleration to their pods (this would be done most easily by writing a custom
185257
[admission controller](/docs/reference/access-authn-authz/admission-controllers/)).
186258
The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as
187-
well as any other nodes in the cluster. If you want to dedicate the nodes to them *and*
188-
ensure they *only* use the dedicated nodes, then you should additionally add a label similar
259+
well as any other nodes in the cluster. If you want to dedicate the nodes to them _and_
260+
ensure they _only_ use the dedicated nodes, then you should additionally add a label similar
189261
to the taint to the same set of nodes (e.g. `dedicated=groupName`), and the admission
190262
controller should additionally add a node affinity to require that the pods can only schedule
191263
onto nodes labeled with `dedicated=groupName`.
@@ -215,25 +287,28 @@ manually add tolerations to your pods.
215287
* **Taint based Evictions**: A per-pod-configurable eviction behavior
216288
when there are node problems, which is described in the next section.
217289

290+
* **SLA-based Scheduling**: In clusters with mixed node types (for example, spot instances),
291+
you can taint nodes with numeric SLA or reliability values. Pods can then
292+
use numeric comparison operators to opt in to nodes meeting specific reliability thresholds,
293+
while the cluster's default policy keeps most workloads away from lower-SLA nodes.
294+
218295
## Taint based Evictions
219296

220297
{{< feature-state for_k8s_version="v1.18" state="stable" >}}
221298

222-
223-
224299
The node controller automatically taints a Node when certain conditions
225300
are true. The following taints are built in:
226301

227-
* `node.kubernetes.io/not-ready`: Node is not ready. This corresponds to
302+
* `node.kubernetes.io/not-ready`: Node is not ready. This corresponds to
228303
the NodeCondition `Ready` being "`False`".
229-
* `node.kubernetes.io/unreachable`: Node is unreachable from the node
304+
* `node.kubernetes.io/unreachable`: Node is unreachable from the node
230305
controller. This corresponds to the NodeCondition `Ready` being "`Unknown`".
231-
* `node.kubernetes.io/memory-pressure`: Node has memory pressure.
232-
* `node.kubernetes.io/disk-pressure`: Node has disk pressure.
233-
* `node.kubernetes.io/pid-pressure`: Node has PID pressure.
234-
* `node.kubernetes.io/network-unavailable`: Node's network is unavailable.
235-
* `node.kubernetes.io/unschedulable`: Node is unschedulable.
236-
* `node.cloudprovider.kubernetes.io/uninitialized`: When the kubelet is started
306+
* `node.kubernetes.io/memory-pressure`: Node has memory pressure.
307+
* `node.kubernetes.io/disk-pressure`: Node has disk pressure.
308+
* `node.kubernetes.io/pid-pressure`: Node has PID pressure.
309+
* `node.kubernetes.io/network-unavailable`: Node's network is unavailable.
310+
* `node.kubernetes.io/unschedulable`: Node is unschedulable.
311+
* `node.cloudprovider.kubernetes.io/uninitialized`: When the kubelet is started
237312
with an "external" cloud provider, this taint is set on a node to mark it
238313
as unusable. After a controller from the cloud-controller-manager initializes
239314
this node, the kubelet removes this taint.
@@ -284,8 +359,8 @@ Nodes for 5 minutes after one of these problems is detected.
284359
[DaemonSet](/docs/concepts/workloads/controllers/daemonset/) pods are created with
285360
`NoExecute` tolerations for the following taints with no `tolerationSeconds`:
286361

287-
* `node.kubernetes.io/unreachable`
288-
* `node.kubernetes.io/not-ready`
362+
* `node.kubernetes.io/unreachable`
363+
* `node.kubernetes.io/not-ready`
289364

290365
This ensures that DaemonSet pods are never evicted due to these problems.
291366

@@ -320,11 +395,11 @@ onto the affected node.
320395
The DaemonSet controller automatically adds the following `NoSchedule`
321396
tolerations to all daemons, to prevent DaemonSets from breaking.
322397

323-
* `node.kubernetes.io/memory-pressure`
324-
* `node.kubernetes.io/disk-pressure`
325-
* `node.kubernetes.io/pid-pressure` (1.14 or later)
326-
* `node.kubernetes.io/unschedulable` (1.10 or later)
327-
* `node.kubernetes.io/network-unavailable` (*host network only*)
398+
* `node.kubernetes.io/memory-pressure`
399+
* `node.kubernetes.io/disk-pressure`
400+
* `node.kubernetes.io/pid-pressure` (1.14 or later)
401+
* `node.kubernetes.io/unschedulable` (1.10 or later)
402+
* `node.kubernetes.io/network-unavailable` (_host network only_)
328403

329404
Adding these tolerations ensures backward compatibility. You can also add
330405
arbitrary tolerations to DaemonSets.
@@ -343,4 +418,3 @@ devices. Like taints they apply to all pods which share the same allocated devic
343418
and how you can configure it
344419
* Read about [Pod Priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
345420
* Read about [device taints and tolerations](/docs/concepts/scheduling-eviction/dynamic-resource-allocation#device-taints-and-tolerations)
346-
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
---
2+
title: TaintTolerationComparisonOperators
3+
content_type: feature_gate
4+
_build:
5+
  list: never
6+
  render: false
7+
8+
stages:
9+
  - stage: alpha
10+
    defaultValue: false
11+
    fromVersion: "1.35"
12+
---
13+
Enables numeric comparison operators (`Lt` and `Gt`) for
14+
[tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/),
15+
allowing pods to match taints using threshold-based comparisons.
16+
This is useful for scenarios like SLA-based scheduling where nodes are
17+
tainted with numeric values representing reliability levels.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
apiVersion: v1
2+
kind: Pod
3+
metadata:
4+
  name: nginx-numeric-toleration
5+
  labels:
6+
    env: test
7+
spec:
8+
  containers:
9+
  - name: nginx
10+
    image: nginx
11+
    imagePullPolicy: IfNotPresent
12+
  tolerations:
13+
  - key: "node.kubernetes.io/sla"
14+
    operator: "Gt"
15+
    value: "900"
16+
    effect: "NoSchedule"
