
[BUG] - Nodes don't scale down on GKE and AKS #2507

Open
Adam-D-Lewis opened this issue Jun 11, 2024 · 6 comments · May be fixed by #2605

Comments

@Adam-D-Lewis
Member

Adam-D-Lewis commented Jun 11, 2024

Describe the bug

I noticed that GKE won't autoscale all nodes down to 0 in some cases. The metrics-server deployment and the event-exporter-gke replicaset only have the following nodeSelector:

nodeSelector:
  kubernetes.io/os: linux

meaning those pods can be scheduled on any node, which prevents those nodes from scaling down.
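
For context, every Linux node carries the kubernetes.io/os: linux label, so that selector matches every node group. For illustration only (the GKE-managed manifests aren't meant to be edited directly), a selector that would keep such a pod on the general node group would look roughly like this (cloud.google.com/gke-nodepool is a label GKE sets per node pool; the pool name "general" is an assumption based on Nebari's default node groups):

  nodeSelector:
    kubernetes.io/os: linux
    # assumed pool name; GKE labels every node with the name of its node pool
    cloud.google.com/gke-nodepool: general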

Options to fix this might be:

  1. Disable metrics collection - https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#enable_components
  2. Set a taint on the user and worker nodes (and any custom node groups created) to force the metrics-server pod to run on the general node group (see the sketch below)

I don't think AWS has metrics-server enabled by default, so I think it's reasonable to disable it.
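
For option 2, a taint such as dedicated=user:NoSchedule on the user/worker node pools (key and value here are only placeholders) would appear in each node's spec roughly as below; pods without a matching toleration, like metrics-server, would then only fit on the general node group:

  spec:
    taints:
      - key: dedicated    # placeholder key
        value: user       # placeholder value
        effect: NoSchedule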

Expected behavior

Nodes should autoscale down.

OS and architecture in which you are running Nebari

Linux x86-64

How to Reproduce the problem?

see above

Command output

No response

Versions and dependencies used.

No response

Compute environment

GCP

Integrations

No response

Anything else?

No response

Adam-D-Lewis added the type: bug 🐛 and needs: triage 🚦 labels on Jun 11, 2024
@Adam-D-Lewis
Member Author

While I don't think this is the issue, it occurs to me that the other nodes might be scaling up because we have more pods than the general node has CPU/memory for.

Adam-D-Lewis removed the needs: triage 🚦 label on Jun 18, 2024
@viniciusdc
Contributor

viniciusdc commented Jun 21, 2024

While I don't think this is the issue, it occurs to me that the other nodes might be scaling up because we have more pods than the general node has CPU/memory for.

That's a good point; we really need to check out those taints.

@viniciusdc
Contributor

As an overall change, I think your two points seem reasonable (for all providers). For AWS specifically, I think metrics collection is a service you need to enable if you want to use it, and it costs extra to keep running. I also agree with disabling it in that case, or making it optional.

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Jun 27, 2024

I think the GKE-deployed kube-dns replicaset also has the same issue. I think the solution is to put taints on the user nodes and worker nodes.

@Adam-D-Lewis
Member Author

I also saw the metrics server and JupyterHub's user-scheduler cause the same problem on AKS.

Adam-D-Lewis linked a pull request (#2605) on Aug 1, 2024 that will close this issue
Adam-D-Lewis changed the title from "[BUG] - Nodes don't scale down on GKE" to "[BUG] - Nodes don't scale down on GKE and AKS" on Aug 1, 2024
Adam-D-Lewis self-assigned this on Aug 1, 2024
Adam-D-Lewis moved this from New 🚦 to In progress 🏗 in 🪴 Nebari Project Management on Aug 1, 2024
@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Aug 16, 2024

The solution I propose is to add a taints section to each node group class. You could then specify a taint on the user node group via something like the following:

  node_groups:
    user:
      instance: Standard_D4_v3
      taints:
        - dedicated=user:NoSchedule

Then we make sure the corresponding toleration is added to the JupyterHub user pod so that those pods can run on the user node group. This should also work for pods started via argo-jupyter-scheduler.
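
As a sketch of the toleration side, assuming it is wired through the JupyterHub Helm chart's singleuser values (not necessarily the final implementation), it could look roughly like:

  singleuser:
    extraTolerations:
      - key: dedicated
        operator: Equal
        value: user
        effect: NoSchedule

KubeSpawner then attaches these tolerations to each user pod, so the pods can land on the tainted user nodes while the GKE/AKS system pods stay off them.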

This would not be supported for local deployments, since local deployments only deploy a single-node cluster at the moment. For existing deployments, it wouldn't affect the node group, but we would apply the specified toleration to the JupyterLab user pod.
