---
layout: blog
title: "Kubernetes 1.27: More fine-grained pod topology spread policies reached beta"
date: 2023-04-11
slug: fine-grained-pod-topology-spread-features-beta
---

**Authors:** [Alex Wang](https://github.com/denkensk) (Shopee), [Kante Yin](https://github.com/kerthcet) (DaoCloud), [Kensei Nakada](https://github.com/sanposhiho) (Mercari)

In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA.
It is a feature that controls how Pods are spread across the cluster topology or failure domains (regions, zones, nodes, etc.).

Since then, we have received feedback from users,
and, as a result, we're actively working on improving the Topology Spread feature via three KEPs.
All of these features have reached beta in Kubernetes v1.27 and are enabled by default.

This blog post introduces each feature and the use case behind it.

## KEP-3022: min domains in Pod Topology Spread

Pod Topology Spread has the `maxSkew` parameter to define the degree to which Pods may be unevenly distributed.

However, there wasn't a way to control the number of domains over which Pods should be spread.
Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them.

To address this, we introduced the `minDomains` parameter in Pod Topology Spread.
Via the `minDomains` parameter, you can define the minimum number of domains.

For example, assume there are 3 Nodes with enough capacity,
and a newly created ReplicaSet has the following `topologySpreadConstraints` in its Pod template.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 5 # requires 5 Nodes at least.
  whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used.
  topologyKey: kubernetes.io/hostname
  labelSelector:
    matchLabels:
      foo: bar
```

In this case, 3 Pods will be scheduled to those 3 Nodes,
but the other 2 Pods from this ReplicaSet will be unschedulable until more Nodes join the cluster.

The cluster autoscaler provisions new Nodes based on these unschedulable Pods,
and as a result, the replicas are finally spread over 5 Nodes.

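For context, here is a minimal ReplicaSet sketch showing where such a constraint sits in the Pod template; the name, labels, and container image are illustrative assumptions, not part of the original example:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: example-replicaset # illustrative name
spec:
  replicas: 5
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        minDomains: 5 # requires 5 Nodes at least.
        whenUnsatisfiable: DoNotSchedule
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            foo: bar
      containers:
      - name: app # illustrative container
        image: registry.k8s.io/pause:3.9 # illustrative image
```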

## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew

Before this enhancement, when you deploy a pod with `podTopologySpread` configured, kube-scheduler would
take the Nodes that satisfy the Pod's nodeAffinity and nodeSelector into consideration
in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not.
This may lead to a node with an untolerated taint being the only candidate for spreading, and as a result,
the pod will get stuck in Pending if it doesn't tolerate the taint.

To allow more fine-grained decisions about which Nodes to account for when calculating spreading skew, we introduced
two new fields in `TopologySpreadConstraint` to define node inclusion policies: `nodeAffinityPolicy` and `nodeTaintsPolicy`.

A manifest that applies these policies looks like the following:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Configure a topology spread constraint
  topologySpreadConstraints:
  - maxSkew: <integer>
    # ...
    nodeAffinityPolicy: [Honor|Ignore]
    nodeTaintsPolicy: [Honor|Ignore]
  # other Pod fields go here
```

**nodeAffinityPolicy** indicates how we'll treat a Pod's nodeAffinity/nodeSelector in pod topology spreading.
If `Honor`, kube-scheduler will filter out nodes not matching the nodeAffinity/nodeSelector in the calculation of spreading skew.
If `Ignore`, all nodes will be included, regardless of whether they match the Pod's nodeAffinity/nodeSelector or not.

For backwards-compatibility, `nodeAffinityPolicy` defaults to `Honor`.

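As an illustrative sketch (the `node-pool` label, names, and image below are assumptions, not part of the API), a Pod that targets a node pool via `nodeSelector` but wants skew computed over every node could set the policy to `Ignore`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    foo: bar
spec:
  nodeSelector:
    node-pool: workers # illustrative label
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    nodeAffinityPolicy: Ignore # count every node, not just those matching nodeSelector
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9 # illustrative image
```
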
**nodeTaintsPolicy** indicates how we'll treat node taints in pod topology spreading.
If `Honor`, only tainted nodes for which the incoming pod has a toleration will be included in the calculation of spreading skew.
If `Ignore`, kube-scheduler will not consider the node taints at all in the calculation of spreading skew, so a node with
a taint that the pod doesn't tolerate will also be included.

For backwards-compatibility, `nodeTaintsPolicy` defaults to `Ignore`.

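And a hedged sketch for the taints side (the taint key and toleration here are illustrative assumptions): a Pod that only counts nodes whose taints it tolerates toward the skew calculation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    nodeTaintsPolicy: Honor # only nodes whose taints this Pod tolerates are counted
    labelSelector:
      matchLabels:
        foo: bar
  tolerations:
  - key: example.com/dedicated # illustrative taint key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9 # illustrative image
```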

The feature was introduced in v1.25 at alpha level. By default, it was disabled, so if you want to use this feature in v1.25,
you have to explicitly enable the feature gate `NodeInclusionPolicyInPodTopologySpread`. In v1.26, we graduated
this feature to beta, and it has been enabled by default ever since.

## KEP-3243: Respect PodTopologySpread after rolling upgrades

Pod Topology Spread uses the field `labelSelector` to identify the group of pods over which
spreading will be calculated. When using topology spreading with Deployments, it is common
practice to use the `labelSelector` of the Deployment as the `labelSelector` in the topology
spread constraints. However, this implies that all pods of a Deployment are part of the spreading
calculation, regardless of whether they belong to different revisions. As a result, when a new revision
is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the
time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading
we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause
skewed distribution for the remaining pods. To avoid this problem, in the past users needed to add a
revision label to the Deployment and update it manually at each rolling upgrade (both the label on the
pod template and the `labelSelector` in the `topologySpreadConstraints`).

To solve this problem with a simpler API, we added a new field named
`matchLabelKeys` to `topologySpreadConstraints`. `matchLabelKeys` is a list of pod label keys to select
the pods over which spreading will be calculated. The keys are used to look up values from the labels of
the Pod being scheduled; those key-value labels are ANDed with `labelSelector` to select the group of
existing pods over which spreading will be calculated for the incoming pod.

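For example (the hash value here is made up for illustration): if the incoming Pod carries the label `pod-template-hash: 5d8f9c6b7d` and `matchLabelKeys` lists `pod-template-hash`, the scheduler behaves as if the constraint's `labelSelector` additionally required `pod-template-hash=5d8f9c6b7d`, so only pods of the same revision are counted in the spreading calculation.
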
With `matchLabelKeys`, you don't need to update the `pod.spec` between different revisions.
The controller or operator managing rollouts just needs to set different values to the same label key for different revisions.
The scheduler will assume the values automatically based on `matchLabelKeys`.
For example, if you are configuring a Deployment, you can use the label keyed with
[pod-template-hash](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template-hash-label),
which is added automatically by the Deployment controller, to distinguish between different
revisions in a single Deployment.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: foo
  matchLabelKeys:
  - pod-template-hash
```

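Putting it together, here is a minimal Deployment sketch (the name and image are illustrative assumptions) with the constraint embedded in the Pod template; note that the `pod-template-hash` label itself never appears in the manifest, because the Deployment controller injects it:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo # illustrative name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: foo
        matchLabelKeys:
        - pod-template-hash # value resolved from the incoming Pod's labels at scheduling time
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9 # illustrative image
```
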
## Getting involved

These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).

Please join us and share your feedback. We look forward to hearing from you!

## How can I learn more?

- [Pod Topology Spread Constraints | Kubernetes](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/)
- [KEP-3022: min domains in Pod Topology Spread](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3022-min-domains-in-pod-topology-spread)
- [KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3094-pod-topology-spread-considering-taints)
- [KEP-3243: Respect PodTopologySpread after rolling upgrades](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades)