Support inter-Pod affinity to one or more Pods #68701
Comments
It is worth noting that this will apply only to inter-pod affinity, not inter-pod anti-affinity. Inter-pod anti-affinity is considered "violated" if there is a pod that matches ANY term of the anti-affinity. So, matching against a group of Pods does not make sense for anti-affinity.
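To make the "ANY term" semantics concrete, here is a minimal sketch (the label keys/values and image below are illustrative assumptions, not taken from this thread): with anti-affinity, a node's topology is rejected as soon as any existing pod matches any single term, so there is never a question of a group of pods jointly satisfying the terms.

```yaml
# Illustrative only -- any existing pod matching EITHER term on the
# same host makes that host infeasible for this pod.
apiVersion: v1
kind: Pod
metadata:
  name: anti-affinity-example
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: foo
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: bar
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
```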
This question was probably answered elsewhere, but could the change in behavior disrupt existing clusters? e.g. a canary workload is launched with pod-affinity for labels {app="foo", env="canary"}. That workload could end up in a topology containing {app="foo", env="prod"} & {app="bar", env="canary"} after this change.
Re @misterikkit:
If the goal is to gather pods with labels {app="foo", env="canary"}, the workload should have them defined within the same affinityTerm, as separate expressions. If app="foo" and env="canary" are instead defined in two different affinityTerms, then yes, after the change a topology containing {app="foo", env="prod"} & {app="bar", env="canary"} can be a fit (see the sketch below).
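A minimal sketch of the two spellings described above (only the affinity fragment is shown; the zone topology key is an assumption for illustration). In the first variant both requirements sit in one affinityTerm, so a single existing pod must carry both labels; in the second they are split into two affinityTerms, which, under the proposed behavior, could be satisfied by two different pods in the same topology.

```yaml
# Variant 1: one term, two expressions -- a single existing pod must
# have app=foo AND env=canary.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - {key: app, operator: In, values: ["foo"]}
        - {key: env, operator: In, values: ["canary"]}
      topologyKey: topology.kubernetes.io/zone
---
# Variant 2: two terms -- with the proposed change, one pod matching
# app=foo and a different pod matching env=canary in the same zone
# would suffice.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - {key: app, operator: In, values: ["foo"]}
      topologyKey: topology.kubernetes.io/zone
    - labelSelector:
        matchExpressions:
        - {key: env, operator: In, values: ["canary"]}
      topologyKey: topology.kubernetes.io/zone
```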
Curious: reading the design proposal, shouldn't this already be possible? But it seems not, so here is my use case:
We're running cilium in chaining mode with the AKS CSI. Since the middle of last year, removal of taints is only possible via the Azure API, which breaks cilium node readiness. Following this comment from Microsoft, we took the advice and created a mutating webhook (instead of calling the API) that adds a podAffinity so that pods only get scheduled on nodes where the cilium agent is already running.
I can't speak for the original intent of the design, as it predates my time here, but the reality is that it's not implemented like that, and we can't change the behavior now because it would be backwards-incompatible. So, two things:
Thanks for the feedback. The idea is/was to make sure that the cilium agent (a DaemonSet) is scheduled before any other pod. Our mutating webhook simply merged this podAffinity into every pod created, but this now breaks any other podAffinity already set on the pod itself, resulting in this:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          example.com/name: myservice
      topologyKey: kubernetes.io/hostname
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - cilium
      namespaces:
      - kube-system
      topologyKey: kubernetes.io/hostname
```
Sorry, a bit of confusion here: doesn't this work as expected? Unless the cilium pod launches, the newly created pod will be stuck in Pending?
@kerthcet the problem is that kube-scheduler looks for a single pod that satisfies all the affinity terms. Even if we add the feature requested here, it would only be available in 1.28 at the earliest (potentially 1.29, as the feature would have to be disabled by default first), so I would suggest you open an issue against Azure support. Other than that, you are welcome to work on this feature.
In the current implementation of inter-Pod affinity, the scheduler looks for a single existing pod that can satisfy all the terms of inter-pod affinity of an incoming pod.
With the recent changes made to the implementation of inter-Pod affinity, we can now support multiple pods satisfying inter-pod affinity. One of the main reasons we didn't pursue the idea before was that the inter-pod affinity feature was very slow (three orders of magnitude slower than other scheduler predicates), and we didn't want to add more complexity to an already slow predicate. However, we can now think about adding the feature.
With this feature, a pod can have multiple affinity terms satisfied by a group of pods, as opposed to only a single pod. For example:
With our current (K8s 1.12) implementation, Pod3 is not schedulable, because there is no single pod that satisfies all of its affinity terms. However, if we support multiple pods satisfying the affinity terms, Pod3 can be scheduled on nodeB. Pod1 satisfies the first term of its affinity in region1 and Pod2 satisfies its second term in zone2. So, any node in zone2/region1 will be feasible for Pod3.
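The manifests for that example aren't reproduced in this thread; a minimal sketch consistent with the description might look like the following, where the label keys/values and topology keys are assumptions for illustration. Assume Pod1 (labeled service=s1) runs somewhere in region1, and Pod2 (labeled service=s2) runs on nodeB in zone2, which belongs to region1.

```yaml
# Illustrative reconstruction -- not the original example's manifests.
apiVersion: v1
kind: Pod
metadata:
  name: pod3
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Term 1: satisfied by Pod1 at region scope (region1).
      - labelSelector:
          matchLabels:
            service: s1
        topologyKey: topology.kubernetes.io/region
      # Term 2: satisfied by Pod2 at zone scope (zone2).
      - labelSelector:
          matchLabels:
            service: s2
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
```

Under the single-pod interpretation neither Pod1 nor Pod2 alone matches both selectors, so Pod3 stays Pending; under the group interpretation any node in zone2/region1, such as nodeB, becomes feasible.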
Given the current implementation of inter-pod affinity and its use of "Topology Pair Maps", I believe implementing this feature requires only small changes and won't have a noticeable performance impact.
/kind feature
/sig scheduling
cc/ @Huang-Wei @ahmad-diaa