
Support for topology spread constraints with cluster autoscaler #2849

Closed
martin-adema opened this issue Mar 17, 2022 · 44 comments


martin-adema commented Mar 17, 2022

When a deployment is applied with topology spread constraints using a maxSkew of 1 and topology key "topology.kubernetes.io/zone", the cluster autoscaler scales up a single zone with more nodes than needed; after scale-down-unneeded-time has passed, the extra nodes are removed again.
There is a nodepool for each zone (3) and balance-similar-node-groups is set to true.

I would expect a similar number of nodes to be added in each zone, rather than extra unneeded nodes being added and then removed again after the scale-down-unneeded-time timeout.

The issue can be reproduced by applying a deployment of about 30 pods, with resource requests sized at roughly half a node, and with topology spread constraints configured:

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule

The cluster is set up with autoscaled nodepools per zone and with balance-similar-node-groups set to true.

MS support ticket 2112070050001650 was opened for this issue. I was told there is no special integration between pod topology spread and the cluster autoscaler, so this behavior is expected, and was advised to open an issue requesting integration of topology spread constraints.

Kubernetes 1.21.9
1 system nodepool (Standard_D16as_v4) with 3 nodes (no autoscaling)
3 user nodepools (1 per zone, Standard_D16as_v4) and cluster autoscaling (3 - 30)
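
For reference, a minimal deployment that reproduces this setup might look like the sketch below. The name, label, image, and request sizes are illustrative assumptions (a Standard_D16as_v4 has 16 vCPU / 64 GiB, so requests of roughly half a node are about 7 CPU / 28Gi), and a labelSelector is added so the spread constraint actually counts the deployment's own pods:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: spread-repro                 # illustrative name
      spec:
        replicas: 30
        selector:
          matchLabels:
            app: spread-repro
        template:
          metadata:
            labels:
              app: spread-repro
          spec:
            topologySpreadConstraints:
              - maxSkew: 1
                topologyKey: topology.kubernetes.io/zone
                whenUnsatisfiable: DoNotSchedule
                labelSelector:
                  matchLabels:
                    app: spread-repro
            containers:
              - name: app
                image: registry.k8s.io/pause:3.9   # placeholder workload
                resources:
                  requests:
                    cpu: "7"      # roughly half of a Standard_D16as_v4 (16 vCPU)
                    memory: 28Gi  # roughly half of 64 GiB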


ghost commented Mar 17, 2022

Hi martin-adema, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!


ghost commented Mar 19, 2022

Triage required from @Azure/aks-pm


ghost commented Mar 24, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Mar 24, 2022

ghost commented Apr 9, 2022

Issue needing attention of @Azure/aks-leads

10 similar comments

@wangyira
Contributor

@justindavies could you help take a look?

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Sep 21, 2022
@wangyira wangyira added cluster-autoscaler action-required Needs Attention 👋 Issues needs attention/assignee/owner and removed triage action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Sep 21, 2022
@lgmorand

Hi, my customer has the same issue using podAntiAffinity: the CA never triggers when pods cannot be scheduled.

Just to be sure, is that linked to this issue?
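
For context, a hard anti-affinity rule of the kind that typically leaves pods Pending looks roughly like the sketch below (the label and topology key are illustrative assumptions, not the customer's actual manifest):

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app                      # illustrative label
              topologyKey: kubernetes.io/hostname  # at most one such pod per node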

@ghost ghost added the action-required label Nov 4, 2022

ghost commented Nov 9, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Nov 9, 2022

ghost commented Apr 10, 2023

Issue needing attention of @Azure/aks-leads

7 similar comments

@microsoft-github-policy-service microsoft-github-policy-service bot added the stale Stale issue label Feb 2, 2024

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.


Issue needing attention of @Azure/aks-leads


This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@anthonynguyen394

Hi, requesting the same feature.
We are using AKS 1.28.5 with one nodepool spread across 3 zones, and the CA didn't trigger new node creation.
Topology settings:

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: app-name
          minDomains: 3

@microsoft-github-policy-service microsoft-github-policy-service bot removed the stale Stale issue label May 5, 2024
@mmisztal1980

Hello, we are suffering from a "thundering herd" problem. We were looking to use topology spread constraints to partially mitigate it, but found that the autoscaler does not work when they're applied, despite the unschedulable pods metric being significantly greater than 0.

@kevinkrp93
Contributor

@mmisztal1980 - Ack, and apologies for the delay. As a best practice we generally advise using three nodepools, one for each zone, when using the cluster autoscaler with multiple zones. If this setup doesn't work, please file a new support ticket and update it here and we will investigate further, especially since you are also running a newer version of Cluster Autoscaler.

@mmisztal1980

Hi @kevinkrp93, thanks for the response. Before we apply your suggestion, we're more interested in understanding why the autoscaler does not react to the Pending pods that result from topology spread constraints in DoNotSchedule mode. Is this a bug or a feature?

@kevinkrp93
Contributor

kevinkrp93 commented Jun 6, 2024

@mmisztal1980 - The autoscaler does not make zone-based scale-out calls. Zone balancing is handled by the compute provider, so if the autoscaler thinks it can't fit the pod on a node, it will not issue a scale-out call to compute. Also, the autoscaler treats all nodes in a nodepool equally (and uses the nodepool template to scale up). Hence we advise one zone per nodepool when using the cluster autoscaler with availability zones.

Please refer to https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.28/cluster-autoscaler/FAQ.md#i-have-a-couple-of-pending-pods-but-there-was-no-scale-up
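
To make that concrete (the pool name and zone value below are hypothetical): with one nodepool pinned to a single zone, every node the autoscaler creates from that pool carries a fixed zone label, so the template node it simulates has a definite zone and zone-spread constraints can be evaluated during scale-up:

      # labels on nodes from a hypothetical single-zone user pool
      kubernetes.azure.com/agentpool: userpool1
      topology.kubernetes.io/zone: westeurope-1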

@microsoft-github-policy-service microsoft-github-policy-service bot added the stale Stale issue label Jun 28, 2024

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.


This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. martin-adema, feel free to comment again in the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.
