Support for topology spread constraints with cluster autoscaler #2849
Hi martin-adema, AKS bot here 👋 I might be just a bot, but I'm told my suggestions are normally quite good, as such:
Triage required from @Azure/aks-pm
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@justindavies could you help take a look?
Hi, my customer has the same issue using podAntiAffinity: the CA never triggers when pods cannot be scheduled. To be sure, is it linked to this issue?
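For context, a minimal sketch of the kind of spec the commenter describes (the name, image, and replica count are illustrative assumptions): a required pod anti-affinity that allows at most one pod per node, so once replicas exceed the node count the remaining pods stay Pending.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anti-affinity-demo        # hypothetical name for illustration
spec:
  replicas: 6                     # more replicas than current nodes, to force Pending pods
  selector:
    matchLabels:
      app: anti-affinity-demo
  template:
    metadata:
      labels:
        app: anti-affinity-demo
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: anti-affinity-demo
            topologyKey: kubernetes.io/hostname   # at most one matching pod per node
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9          # placeholder workload
```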
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
Issue needing attention of @Azure/aks-leads
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
Hi, requesting the same feature.
Hello, we are suffering from a "thundering herd" problem. We were looking to use topology spread constraints to partially mitigate it, but found that the autoscaler does not work when they're applied, despite the fact that the unschedulable pods metric was significantly greater than 0.
@mmisztal1980 - Ack, and apologies for the delay. As a best practice we generally advise using three nodepools, one for each zone, when using the cluster autoscaler with multiple zones. If this setup doesn't work, please file a new support ticket and update it here and we will investigate further, especially since you are also running a newer version of the Cluster Autoscaler.
Hi @kevinkrp93, thanks for the response. Before we apply your suggestion, we're more interested in understanding why the autoscaler does not react to the Pending pods which are the result of topology spread constraints in
@mmisztal1980 - The autoscaler does not make zone-based scale-out calls. Zone balancing is taken care of by the compute provider, so if the autoscaler thinks it can't fit the pod on a node, it will not issue a scale-out call to compute. Also, the autoscaler treats all the nodes in a nodepool equally (and uses the nodepool template to scale up). Hence we advise one zone per nodepool when using the cluster autoscaler with AZs.
This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.
This issue will now be closed because it hasn't had any activity for 7 days after stale. martin-adema, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
When a deployment is applied using topology spread constraints with a maxSkew of 1 and the topology key "topology.kubernetes.io/zone", the cluster autoscaler scales up one zone with too many nodes, and after the scale-down-unneeded-time has passed they are removed again.
There is a nodepool for each zone (3) and balance-similar-node-groups is set to true.
I would expect nodes to be added to each zone in similar numbers, not extra unneeded nodes being added and then removed again after the scale-down-unneeded-time timeout.
The issue can be reproduced by applying a deployment with resource requests sized at about half the size of the nodes, about 30 pods, and with topology spread constraints configured:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
Cluster setup with autoscaled nodepools per zone and with balance-similar-node-groups set to true.
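For reference, a minimal sketch of such a reproduction deployment, assuming Standard_D16as_v4 nodes (16 vCPU / 64 GiB); the name, image, and exact request sizes are illustrative assumptions, with requests chosen at roughly half a node:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-repro              # hypothetical name for illustration
spec:
  replicas: 30
  selector:
    matchLabels:
      app: spread-repro
  template:
    metadata:
      labels:
        app: spread-repro
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # placeholder workload
        resources:
          requests:
            cpu: "7"        # roughly half of a 16-vCPU node
            memory: 28Gi    # roughly half of 64 GiB
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: spread-repro
```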
MS support ticket 2112070050001650 was opened for this issue. We were told there is no special integration between pod topology spreading and the CA, i.e. this behavior is expected, and were advised to open an issue requesting integration of topology spread constraints.
Kubernetes 1.21.9
1 system nodepool (Standard_D16as_v4) with 3 nodes (no autoscaling)
3 user nodepools (1 per zone, Standard_D16as_v4) and cluster autoscaling (3 - 30)