---
layout: blog
title: "Kubernetes 1.27: More fine-grained pod topology spread policies reached beta"
date: 2023-04-17
slug: fine-grained-pod-topology-spread-features-beta
---

**Authors:** [Alex Wang](https://github.com/denkensk) (Shopee), [Kante Yin](https://github.com/kerthcet) (DaoCloud), [Kensei Nakada](https://github.com/sanposhiho) (Mercari)

In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA.
It is a feature that controls how Pods are spread across the cluster topology or failure domains (regions, zones, nodes, etc).
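
As a quick refresher, a topology spread constraint is declared on the Pod spec. The following is a minimal sketch that spreads replicas evenly across zones; the Pod name, label, and image are illustrative, not taken from this post:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod        # hypothetical name
  labels:
    app: foo               # hypothetical label
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                 # Pod counts may differ by at most 1 between zones
    topologyKey: topology.kubernetes.io/zone   # spread across zones
    whenUnsatisfiable: DoNotSchedule           # keep the Pod Pending rather than violate the constraint
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9           # placeholder image
```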

As time passed, we - SIG Scheduling - received feedback from users,
and, as a result, we're actively working on improving the Topology Spread feature via three KEPs.
All of these features have reached beta in Kubernetes v1.27 and are enabled by default.

This blog post introduces each feature and the use case behind each of them.

## KEP-3022: min domains in Pod Topology Spread

Pod Topology Spread has the `maxSkew` parameter to define the degree to which Pods may be unevenly distributed.

But, there wasn't a way to control the number of domains over which we should spread.
Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them.

Kubernetes v1.24 introduced the `minDomains` parameter for pod topology spread constraints, as an alpha feature.
Via the `minDomains` parameter, you can define the minimum number of domains.

For example, assume there are 3 Nodes with enough capacity,
and a newly created ReplicaSet has the following `topologySpreadConstraints` in its Pod template.

```yaml
...
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 5 # requires 5 Nodes at least (because each Node has a unique hostname)
  whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used.
  topologyKey: kubernetes.io/hostname
  labelSelector:
    matchLabels:
      foo: bar
```

In this case, 3 Pods will be scheduled to those 3 Nodes,
but the other 2 Pods from this ReplicaSet will be unschedulable until more Nodes join the cluster.

You can imagine that the cluster autoscaler provisions new Nodes based on these unschedulable Pods,
and as a result, the replicas are finally spread over 5 Nodes.
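
For context, a complete ReplicaSet for this scenario might look like the sketch below; the name, replica count, and container image are illustrative assumptions, not part of the original example:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: example-replicaset          # hypothetical name
spec:
  replicas: 5                       # 5 replicas, but only 3 Nodes exist at first
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        minDomains: 5                     # requires at least 5 Nodes, since each Node is its own domain
        whenUnsatisfiable: DoNotSchedule
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            foo: bar
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9  # placeholder image
```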

## KEP-3094: Take taints/tolerations into consideration when calculating podTopologySpread skew

Before this enhancement, when you deploy a pod with `podTopologySpread` configured, kube-scheduler would
take the Nodes that satisfy the Pod's `nodeAffinity` and `nodeSelector` into consideration
in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not.
This may lead to a node with an untolerated taint being the only candidate for spreading, and as a result,
the pod will be stuck in Pending if it doesn't tolerate the taint.
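
As an illustration of that situation, imagine the only Node in one zone carries a taint the Pod does not tolerate; the Node name and taint key below are hypothetical:

```yaml
# The only Node in zone "zone-b", tainted so that ordinary Pods cannot run on it.
apiVersion: v1
kind: Node
metadata:
  name: node-b-1                           # hypothetical Node
  labels:
    topology.kubernetes.io/zone: zone-b
spec:
  taints:
  - key: dedicated                         # hypothetical taint
    value: special-workload
    effect: NoSchedule
```

Before this enhancement, zone-b still counted as a candidate domain when spreading across `topology.kubernetes.io/zone`, so a constraint with `whenUnsatisfiable: DoNotSchedule` could leave the incoming Pod Pending, because the only zone that would reduce the skew is one the Pod cannot actually run in.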

To allow more fine-grained decisions about which Nodes to account for when calculating spreading skew,
Kubernetes v1.25 introduced two new fields within `topologySpreadConstraints` to define node inclusion policies:
`nodeAffinityPolicy` and `nodeTaintsPolicy`.

A manifest that applies these policies looks like the following:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Configure a topology spread constraint
  topologySpreadConstraints:
  - maxSkew: <integer>
    # ...
    nodeAffinityPolicy: [Honor|Ignore]
    nodeTaintsPolicy: [Honor|Ignore]
  # other Pod fields go here
```

The `nodeAffinityPolicy` field indicates how Kubernetes treats a Pod's `nodeAffinity` or `nodeSelector` for
pod topology spreading.

If `Honor`, kube-scheduler filters out nodes not matching `nodeAffinity`/`nodeSelector` in the calculation of
spreading skew.
If `Ignore`, all nodes will be included, regardless of whether they match the Pod's `nodeAffinity`/`nodeSelector`
or not.

For backwards compatibility, `nodeAffinityPolicy` defaults to `Honor`.

The `nodeTaintsPolicy` field defines how Kubernetes considers node taints for pod topology spreading.

If `Honor`, only tainted nodes for which the incoming pod has a toleration will be included in the calculation of spreading skew.
If `Ignore`, kube-scheduler will not consider the node taints at all in the calculation of spreading skew, so a node with
a taint the pod does not tolerate will also be included.

For backwards compatibility, `nodeTaintsPolicy` defaults to `Ignore`.
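
Putting both fields together, a constraint that only counts Nodes the Pod could actually land on might look like the sketch below; the label, zone key, and image are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: foo                        # hypothetical label
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    nodeAffinityPolicy: Honor       # only count Nodes matching this Pod's nodeAffinity/nodeSelector
    nodeTaintsPolicy: Honor         # only count Nodes whose taints this Pod tolerates
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```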

The feature was introduced in v1.25 as alpha. By default, it was disabled, so if you wanted to use this feature in v1.25,
you had to explicitly enable the feature gate `NodeInclusionPolicyInPodTopologySpread`. In the following v1.26
release, this feature graduated to beta and has been enabled by default since.

## KEP-3243: Respect Pod topology spread after rolling upgrades

Pod Topology Spread uses the field `labelSelector` to identify the group of pods over which
spreading will be calculated. When using topology spreading with Deployments, it is common
practice to use the `labelSelector` of the Deployment as the `labelSelector` in the topology
spread constraints. However, this implies that all pods of a Deployment are part of the spreading
calculation, regardless of whether they belong to different revisions. As a result, when a new revision
is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the
time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading
we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause
skewed distribution for the remaining pods. To avoid this problem, in the past users needed to add a
revision label to Deployment and update it manually at each rolling upgrade (both the label on the
Pod template and the `labelSelector` in the `topologySpreadConstraints`).

To solve this problem with a simpler API, Kubernetes v1.25 introduced a new field named
`matchLabelKeys` to `topologySpreadConstraints`. `matchLabelKeys` is a list of pod label keys to select
the pods over which spreading will be calculated. The keys are used to look up values from the labels of
the Pod being scheduled, and those key-value labels are ANDed with `labelSelector` to select the group of
existing pods over which spreading will be calculated for the incoming pod.

With `matchLabelKeys`, you don't need to update the `pod.spec` between different revisions.
The controller or operator managing rollouts just needs to set different values to the same label key for different revisions.
The scheduler will assume the values automatically based on `matchLabelKeys`.
For example, if you are configuring a Deployment, you can use the label keyed with
[pod-template-hash](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template-hash-label),
which is added automatically by the Deployment controller, to distinguish between different
revisions in a single Deployment.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: foo
  matchLabelKeys:
  - pod-template-hash
```
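
In the context of a complete Deployment, that constraint sits in the Pod template; the sketch below uses an illustrative name, replica count, and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo                         # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo                    # pod-template-hash is added automatically by the Deployment controller
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: foo
        matchLabelKeys:
        - pod-template-hash         # each revision gets a new hash, so skew is computed per revision
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
```

Because the Deployment controller stamps a fresh `pod-template-hash` value onto each revision's Pods, the scheduler calculates the skew only among Pods of the same revision.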

## Getting involved

These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).

Please join us and share your feedback. We look forward to hearing from you!

## How can I learn more?

- [Pod Topology Spread Constraints | Kubernetes](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/)
- [KEP-3022: min domains in Pod Topology Spread](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3022-min-domains-in-pod-topology-spread)
- [KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3094-pod-topology-spread-considering-taints)
- [KEP-3243: Respect PodTopologySpread after rolling upgrades](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades)