128 changes: 99 additions & 29 deletions content/en/docs/concepts/scheduling-eviction/taint-and-toleration.md
@@ -11,7 +11,7 @@ weight: 50

<!-- overview -->
[_Node affinity_](/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity)
is a property of {{< glossary_tooltip text="Pods" term_id="pod" >}} that _attracts_ them to
a set of {{< glossary_tooltip text="nodes" term_id="node" >}} (either as a preference or a
hard requirement). _Taints_ are the opposite -- they allow a node to repel a set of pods.

@@ -39,6 +39,7 @@ places a taint on node `node1`. The taint has key `key1`, value `value1`, and taint effect `NoSchedule`.
This means that no pod will be able to schedule onto `node1` unless it has a matching toleration.

To remove the taint added by the command above, you can run:

```shell
kubectl taint nodes node1 key1=value1:NoSchedule-
```
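
One way to confirm that the taint was removed is to inspect the node's `.spec.taints` field, for example:

```shell
kubectl get node node1 -o jsonpath='{.spec.taints}'
```
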
@@ -81,37 +82,56 @@ A toleration "matches" a taint if the keys are the same and the effects are the same, and:
* the `operator` is `Exists` (in which case no `value` should be specified), or
* the `operator` is `Equal` and the values should be equal.

{{< feature-state feature_gate_name="TaintTolerationComparisonOperators" >}}

> **Member:** Since this feature gate reference is only related to a few paragraphs below, won't this cause confusion for the reader as to where the description of this feature ends?
>
> **Member (Author):** But if we didn't mention it before the new operator, it would be more confusing.
>
> **Contributor:** Maybe we can add a sub(sub)section for this feature.

You can also use numeric comparison operators for threshold-based matching:

* the `operator` is `Gt` (greater than) and the taint value is greater than the toleration value, or
* the `operator` is `Lt` (less than) and the taint value is less than the toleration value.

For numeric operators, both the toleration and taint values must be valid integers.
If either value cannot be parsed as an integer, the toleration does not match.

{{< note >}}
When you create a Pod that uses the `Gt` or `Lt` toleration operators, the API server validates that
the toleration values are valid integers. Taint values on nodes are not validated at node
registration time. If a node has a non-numeric taint value (for example,
`servicelevel.organization.example/agreed-service-level=high:NoSchedule`),
pods with numeric comparison operators will not match that taint and cannot schedule on that node.
{{< /note >}}

> **@lmktfy (Member), Nov 21, 2025:** nit: could mention that people can use admission-time checks to block node registrations with invalid taints. It's a very strict choice but you can do it if you need to.
>
> **Member (Author):** imho, this is not the main point from this part of documentation and it can be misleading in this context.

{{< note >}}

There are two special cases:

If the `key` is empty, then the `operator` must be `Exists`, which matches all keys and values.
Note that the `effect` still needs to be matched at the same time.

An empty `effect` matches all effects with key `key1`.

{{< /note >}}
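
For example, a toleration with an empty `key`, the `Exists` operator, and no `effect` matches
(and therefore tolerates) every taint:

```yaml
tolerations:
- operator: "Exists"
```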

The above example used the `effect` of `NoSchedule`. Alternatively, you can use the `effect` of `PreferNoSchedule`.

The allowed values for the `effect` field are:

`NoExecute`
: This affects pods that are already running on the node as follows:

* Pods that do not tolerate the taint are evicted immediately
* Pods that tolerate the taint without specifying `tolerationSeconds` in
their toleration specification remain bound forever
* Pods that tolerate the taint with a specified `tolerationSeconds` remain
bound for the specified amount of time. After that time elapses, the node
lifecycle controller evicts the Pods from the node.

`NoSchedule`
: No new Pods will be scheduled on the tainted node unless they have a matching
toleration. Pods currently running on the node are **not** evicted.

`PreferNoSchedule`
: `PreferNoSchedule` is a "preference" or "soft" version of `NoSchedule`.
The control plane will _try_ to avoid placing a Pod that does not tolerate
the taint on the node, but it is not guaranteed.

You can put multiple taints on the same node and multiple tolerations on the same pod.
@@ -122,7 +142,7 @@ remaining un-ignored taints have the indicated effects on the pod. In particular,
* if there is at least one un-ignored taint with effect `NoSchedule` then Kubernetes will not schedule
the pod onto that node
* if there is no un-ignored taint with effect `NoSchedule` but there is at least one un-ignored taint with
effect `PreferNoSchedule` then Kubernetes will _try_ to not schedule the pod onto the node
* if there is at least one un-ignored taint with effect `NoExecute` then the pod will be evicted from
the node (if it is already running on the node), and will not be
scheduled onto the node (if it is not yet running on the node).
@@ -173,9 +193,62 @@ means that if this pod is running and a matching taint is added to the node, then
the pod will stay bound to the node for 3600 seconds, and then be evicted. If the
taint is removed before that time, the pod will not be evicted.

## Numeric comparison operators {#numeric-comparison-operators}

{{< feature-state feature_gate_name="TaintTolerationComparisonOperators" >}}

In addition to the `Equal` and `Exists` operators, you can use numeric comparison operators
(`Gt` and `Lt`) to match taints with integer values. This is useful for threshold-based scheduling
scenarios, such as matching nodes based on reliability levels or SLA requirements.

For example, if nodes are tainted with a value representing a service level agreement (SLA):

```shell
kubectl taint nodes node1 servicelevel.organization.example/agreed-service-level=950:NoSchedule
```

A pod can tolerate nodes with SLA greater than 900:

{{% code_sample file="pods/pod-with-numeric-toleration.yaml" %}}

This toleration matches the taint on `node1` because `950 > 900` (the taint value
is greater than the toleration value for the `Gt` operator).
Similarly, you can use the `Lt` operator to match taints where the taint value is
less than the toleration value:

```yaml
tolerations:
- key: "servicelevel.organization.example/agreed-service-level"
operator: "Lt"
value: "1000"
effect: "NoSchedule"
```
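
With the taint from the example above (`agreed-service-level=950`), this toleration also matches, because `950 < 1000`.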

{{< note >}}
When using numeric comparison operators:

* Both the toleration and taint values must be valid signed 64-bit integers
  (values with leading zeros, such as "0550", are not allowed).
* If a value cannot be parsed as an integer, the toleration does not match.
* Numeric operators work with all taint effects: `NoSchedule`, `PreferNoSchedule`, and `NoExecute`.

> **Member:** Could you explain somewhere how the PreferNoSchedule works for numeric operators?
>
> **Member (Author):** Added. PTAL.

* For `PreferNoSchedule` with numeric operators: if a pod's toleration doesn't satisfy the numeric comparison
  (for example, the taint value is not greater than the toleration value when using `Gt`), the `TaintToleration`
  plugin gives the node a lower score but may still schedule the Pod there if no better options exist.
{{< /note >}}

{{< warning >}}

Before disabling the `TaintTolerationComparisonOperators` feature gate:

* Identify all workloads that use the `Gt` or `Lt` operators, to avoid controller hot-loops
  (see the example after this warning).
* Update all workload controller templates to use the `Equal` or `Exists` operators instead.
* Delete any pending Pods that use the `Gt` or `Lt` operators.
* Monitor the `apiserver_request_total` metric for spikes in validation errors.
{{< /warning >}}
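
One way to find the Pods that use these operators is to filter on `.spec.tolerations[].operator`; the
following is a sketch that assumes `kubectl` and `jq` are available:

```shell
# List every Pod (across all namespaces) that has at least one toleration
# using the Gt or Lt operator.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.tolerations[]?; .operator == "Gt" or .operator == "Lt"))
      | "\(.metadata.namespace)/\(.metadata.name)"'
```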

## Example Use Cases

Taints and tolerations are a flexible way to steer pods _away_ from nodes or evict
pods that shouldn't be running. A few of the use cases are

* **Dedicated Nodes**: If you want to dedicate a set of nodes for exclusive use by
Expand All @@ -184,8 +257,8 @@ a particular set of users, you can add a taint to those nodes (say,
toleration to their pods (this would be done most easily by writing a custom
[admission controller](/docs/reference/access-authn-authz/admission-controllers/)).
The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as
well as any other nodes in the cluster. If you want to dedicate the nodes to them _and_
ensure they _only_ use the dedicated nodes, then you should additionally add a label similar
to the taint to the same set of nodes (e.g. `dedicated=groupName`), and the admission
controller should additionally add a node affinity to require that the pods can only schedule
onto nodes labeled with `dedicated=groupName`.
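
A sketch of how the matching toleration and node affinity could look in a Pod spec, assuming the taint
and label both use `dedicated=groupName` and the taint effect is `NoSchedule`:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "groupName"
  effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "dedicated"
          operator: "In"
          values: ["groupName"]
```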
@@ -219,21 +292,19 @@ when there are node problems, which is described in the next section.

{{< feature-state for_k8s_version="v1.18" state="stable" >}}

The node controller automatically taints a Node when certain conditions
are true. The following taints are built in:

* `node.kubernetes.io/not-ready`: Node is not ready. This corresponds to
  the NodeCondition `Ready` being "`False`".
* `node.kubernetes.io/unreachable`: Node is unreachable from the node
  controller. This corresponds to the NodeCondition `Ready` being "`Unknown`".
* `node.kubernetes.io/memory-pressure`: Node has memory pressure.
* `node.kubernetes.io/disk-pressure`: Node has disk pressure.
* `node.kubernetes.io/pid-pressure`: Node has PID pressure.
* `node.kubernetes.io/network-unavailable`: Node's network is unavailable.
* `node.kubernetes.io/unschedulable`: Node is unschedulable.
* `node.cloudprovider.kubernetes.io/uninitialized`: When the kubelet is started
with an "external" cloud provider, this taint is set on a node to mark it
as unusable. After a controller from the cloud-controller-manager initializes
this node, the kubelet removes this taint.
@@ -284,8 +355,8 @@ Nodes for 5 minutes after one of these problems is detected.
[DaemonSet](/docs/concepts/workloads/controllers/daemonset/) pods are created with
`NoExecute` tolerations for the following taints with no `tolerationSeconds`:

* `node.kubernetes.io/unreachable`
* `node.kubernetes.io/not-ready`

This ensures that DaemonSet pods are never evicted due to these problems.

@@ -320,11 +391,11 @@ onto the affected node.
The DaemonSet controller automatically adds the following `NoSchedule`
tolerations to all daemons, to prevent DaemonSets from breaking.

* `node.kubernetes.io/memory-pressure`
* `node.kubernetes.io/disk-pressure`
* `node.kubernetes.io/pid-pressure` (1.14 or later)
* `node.kubernetes.io/unschedulable` (1.10 or later)
* `node.kubernetes.io/network-unavailable` (_host network only_)

Adding these tolerations ensures backward compatibility. You can also add
arbitrary tolerations to DaemonSets.
@@ -343,4 +414,3 @@ devices. Like taints they apply to all pods which share the same allocated devic
and how you can configure it
* Read about [Pod Priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
* Read about [device taints and tolerations](/docs/concepts/scheduling-eviction/dynamic-resource-allocation#device-taints-and-tolerations)

@@ -0,0 +1,14 @@
---
title: TaintTolerationComparisonOperators
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---
Enables numeric comparison operators (`Lt` and `Gt`) for
[tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/).
16 changes: 16 additions & 0 deletions content/en/examples/pods/pod-with-numeric-toleration.yaml
@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
  name: nginx-numeric-toleration
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "servicelevel.organization.example/agreed-service-level"
    operator: "Gt"
    value: "900"
    effect: "NoSchedule"