Add taint to user and worker nodes #2605

Adam-D-Lewis · 2024-08-01T14:58:13Z

Reference Issues or PRs

Fixes #2507

I need to test running pods with Argo Workflow through Nebari Workflow Controller before merging this PR

What does this implement/fix?

Put an x in the boxes that apply

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds a feature)
Breaking change (fix or feature that would cause existing features not to work as expected)
Documentation Update
Code style update (formatting, renaming)
Refactoring (no functional changes, no API changes)
Build related changes
Other (please describe):

Testing

Did you test the pull request locally?
Did you add new tests?

How to test this PR

A few possible ways to test:

Deploy without specifying taints, then add taints, redeploy, and make sure things are still working.
Deploy with taints specified from the start and make sure things work.
Do a local deployment, confirm it deploys, and make sure things work.

Things working is defined as:

A JupyterLab server spins up for the user.
A Dask scheduler and workers spin up for the user (when using the built-in nebari-git-dask env). Use some code like this
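
A minimal sketch of such a check (not the original snippet), assuming the standard dask-gateway client methods `cluster_options`, `new_cluster`, `scale`, `get_client`, and `shutdown`. The helper name `smoke_test_dask_cluster` is hypothetical, and the gateway object is passed in so the helper does not depend on a particular deployment:

```python
def smoke_test_dask_cluster(gateway, n_workers=1):
    """Start a cluster, run one trivial task on it, then shut it down.

    `gateway` is expected to behave like a `dask_gateway.Gateway` instance.
    """
    options = gateway.cluster_options()
    cluster = gateway.new_cluster(options)
    try:
        # Scaling up forces worker pods to be scheduled on the worker nodes.
        cluster.scale(n_workers)
        client = cluster.get_client()
        # The task only completes if both scheduler and a worker are running.
        return client.submit(sum, [1, 2, 3]).result()
    finally:
        # Always release the cluster, even if the task fails.
        cluster.shutdown()
```

From a JupyterLab session this could be called as `smoke_test_dask_cluster(Gateway())` after `from dask_gateway import Gateway`. A return value of 6 suggests the scheduler and at least one worker pod were scheduled.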

Any other comments?

Adam-D-Lewis · 2024-08-19T22:17:05Z

src/_nebari/stages/infrastructure/__init__.py

@@ -41,10 +41,33 @@ class ExistingInputVars(schema.Base):
    kube_context: str


-class DigitalOceanNodeGroup(schema.Base):


Duplicate class, so I deleted it

Adam-D-Lewis · 2024-08-19T22:18:58Z

This method works as intended when tested on GCP. However, one issue is that certain daemonsets won't run on the tainted nodes. I saw the issue with the rook ceph csi-cephfsplugin daemonset from my rook PR, but I expect it would also be an issue for the monitoring daemonset pods. So we'd likely need to add the appropriate toleration to those daemonsets.

Adam-D-Lewis · 2024-08-21T22:14:22Z

src/_nebari/stages/kubernetes_services/template/rook-ceph.tf

@@ -45,6 +45,13 @@ resource "helm_release" "rook-ceph" {
      },
      csi = {
        enableRbdDriver = false, # necessary to provision block storage, but saves some cpu and memory if not needed
+        provisionerReplicas : 1, # default is 2 on different nodes
+        pluginTolerations = [


Runs the CSI driver on all nodes, even those with NoSchedule taints. It doesn't run on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does, so I copied it here.

Adam-D-Lewis · 2024-08-21T22:15:11Z

...bari/stages/kubernetes_services/template/modules/kubernetes/services/monitoring/loki/main.tf

+          effect   = "NoSchedule"
+        },
+        {
+          operator = "Exists"


Runs promtail on all nodes, even those with NoSchedule taints. It doesn't run on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does, so I copied it here. Promtail is what exports logs from the node, so we still want it to run on the user and worker nodes.
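
To make the NoSchedule/NoExecute distinction concrete, here is a rough sketch of the toleration matching rule as described in the Kubernetes documentation. The `tolerates` helper, the assumed shape of the toleration, and the example `dedicated=user` taint are illustrative, not code from this PR:

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if `toleration` matches `taint` (simplified Kubernetes rule)."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (only valid with operator Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    # Equal is the default operator and additionally compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


# Assumed shape of the toleration discussed above: any key, NoSchedule only.
any_no_schedule = {"operator": "Exists", "effect": "NoSchedule"}

# Example taints (hypothetical key and value).
user_taint = {"key": "dedicated", "value": "user", "effect": "NoSchedule"}
evicting_taint = {"key": "dedicated", "value": "user", "effect": "NoExecute"}

print(tolerates(any_no_schedule, user_taint))      # True
print(tolerates(any_no_schedule, evicting_taint))  # False
```

Under this rule a daemonset pod with that toleration can be scheduled on a node tainted `NoSchedule`, but is not protected from a `NoExecute` taint.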

Adam-D-Lewis · 2024-08-21T22:15:40Z

...bari/stages/kubernetes_services/template/modules/kubernetes/services/monitoring/loki/main.tf

+        {
+          key      = "node-role.kubernetes.io/master"
+          operator = "Exists"
+          effect   = "NoSchedule"
+        },
+        {
+          key      = "node-role.kubernetes.io/control-plane"
+          operator = "Exists"
+          effect   = "NoSchedule"
+        },


The top two tolerations are the default values for this Helm chart.

Adam-D-Lewis · 2024-08-21T23:30:12Z

Okay, so things are working for the user node group. I tried adding a taint to the worker node group, but the dask scheduler won't run on the tainted worker node group. See this commit to see what I tried in a quick test. I do see the new scheduler_pod_extra_config value in /var/lib/dask-gateway/config.json in the dask gateway pod, but the scheduler tolerations look like

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

so I think possibly the merge isn't going as expected, but I need to verify. The docs say that "This dict will be deep merged with the scheduler pod spec (a V1PodSpec object) before submission. Keys should match those in the kubernetes spec, and should be camelCase."
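
A small sketch of what a deep merge of this kind could look like. This is not dask-gateway's implementation: `deep_merge` is a hypothetical helper, concatenating lists (rather than replacing them) is an assumption, and the `dedicated=worker` taint is an example value:

```python
import copy


def deep_merge(base, extra):
    """Recursively merge `extra` into a copy of `base`."""
    if isinstance(base, dict) and isinstance(extra, dict):
        merged = copy.deepcopy(base)
        for key, value in extra.items():
            if key in merged:
                merged[key] = deep_merge(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
        return merged
    if isinstance(base, list) and isinstance(extra, list):
        # Assumption: lists are concatenated instead of overwritten.
        return copy.deepcopy(base) + copy.deepcopy(extra)
    # Scalars and mismatched types: the extra value wins.
    return copy.deepcopy(extra)


# Simplified stand-in for a generated scheduler pod spec (camelCase keys).
pod_spec = {"restartPolicy": "Never", "volumes": []}
extra_pod_config = {
    "tolerations": [
        {"key": "dedicated", "operator": "Equal", "value": "worker", "effect": "NoSchedule"}
    ]
}

merged = deep_merge(pod_spec, extra_pod_config)
print(merged["tolerations"][0]["key"])  # dedicated
```

If the merge behaved this way, the submitted scheduler pod would carry the extra toleration alongside the `not-ready` and `unreachable` tolerations that Kubernetes adds on its own.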

Adam-D-Lewis · 2024-10-25T21:46:20Z

I managed to get the tolerations applied to the scheduler pod in this commit. I would have expected c.KubeClusterConfig.scheduler_extra_pod_config to get merged with the options returned by the function passed to c.Backend.cluster_options, but it wasn't.

I should verify this and maybe submit an issue to dask-gateway.

I still need to apply the toleration to the dask workers.
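
One way to picture the approach is for the options handler itself to return the tolerations for both pod types, rather than relying on a separate config value being merged in. The sketch below is illustrative only: `pod_tolerations` and `cluster_profile` are hypothetical names and the `dedicated=worker` taint is an example value, while `scheduler_extra_pod_config` and `worker_extra_pod_config` are the dask-gateway `KubeClusterConfig` fields:

```python
def pod_tolerations(key, value, effect="NoSchedule"):
    """Build a pod-spec fragment that tolerates a single taint."""
    return {
        "tolerations": [
            {"key": key, "operator": "Equal", "value": value, "effect": effect}
        ]
    }


def cluster_profile(options, user):
    """Return per-cluster config covering both scheduler and worker pods.

    `options` and `user` mirror the arguments an options handler receives;
    they are unused in this simplified sketch.
    """
    return {
        # Built separately so the two fragments never share mutable state.
        "scheduler_extra_pod_config": pod_tolerations("dedicated", "worker"),
        "worker_extra_pod_config": pod_tolerations("dedicated", "worker"),
    }
```

Returning both keys from one function keeps the scheduler and worker tolerations in sync when the taint changes.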

Adam-D-Lewis · 2024-10-31T22:22:44Z

...ubernetes_services/template/modules/kubernetes/services/dask-gateway/files/gateway_config.py

@@ -227,18 +229,23 @@ def base_username_mount(username, uid=1000, gid=100):
    }


-def worker_profile(options, user):


I renamed this function since it affects the scheduler as well and not just the worker

Adam-D-Lewis · 2024-10-31T22:43:40Z

Okay, things were working as expected for the JupyterLab pod and the Dask worker and scheduler pods on GKE. I need to test on:

AWS
Azure.

I also need to test:

running an Argo Workflows pod. (Update: This worked. The tolerations were copied over when run with jupyterflow-override.)

Adam-D-Lewis · 2024-11-04T18:21:32Z

At a minimum, I want to create an issue about prompting users on upgrade to ask whether they want to add the taints for potential cost reductions.

Update: done now - #2824

Adam-D-Lewis · 2024-11-04T18:26:05Z

src/_nebari/stages/infrastructure/__init__.py

@@ -150,6 +201,22 @@ class AWSNodeGroupInputVars(schema.Base):
    permissions_boundary: Optional[str] = None
    ami_type: Optional[AWSAmiTypes] = None
    launch_template: Optional[AWSNodeLaunchTemplate] = None
+    node_taints: list[dict]
+
+    @field_validator("node_taints", mode="before")


This code is repeated (see line 233 in this file) for the GCP and AWS NodeGroupInputVars classes, but only because the taint format expected by the GCP and AWS Terraform modules happens to be the same. The required formats for the two modules could evolve separately, so I chose to duplicate the code in this case.
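
For illustration, a validator of this kind might turn taint strings such as `dedicated=user:NoSchedule` into dictionaries that a Terraform module can consume. The sketch below is plain Python rather than the pydantic `field_validator` in the diff, and the function name and output keys (`key`, `value`, `effect`) are assumptions:

```python
def parse_taint(taint: str) -> dict:
    """Parse 'key=value:Effect' (or 'key:Effect') into a dict."""
    # Split on the last colon so the effect is always the final segment.
    key_value, sep, effect = taint.rpartition(":")
    if not sep or not key_value or not effect:
        raise ValueError(f"expected 'key=value:Effect', got {taint!r}")
    # The value is optional; a bare key yields an empty value.
    key, _, value = key_value.partition("=")
    return {"key": key, "value": value, "effect": effect}


print(parse_taint("dedicated=user:NoSchedule"))
# {'key': 'dedicated', 'value': 'user', 'effect': 'NoSchedule'}
```

Keeping one such function per provider, even when the bodies are identical, lets each output format change without affecting the other.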

Adam-D-Lewis · 2024-11-06T15:39:06Z

We should add some instructions to the docs about adding other node groups (e.g. GPU node groups). Users should add the user taint to other user node profiles in order to prevent the same issue this PR prevents.

Adam-D-Lewis · 2024-12-10T15:25:39Z

Rather than requiring the user to put a consistent taint on each node group, maybe we should just have a "type" field on node groups to simplify this. It's less flexible, but I think it's flexible enough for the use cases we expect.
This would replace taints: [dedicated=user:NoSchedule].

google_cloud_platform:
  project: qhub-279316
  region: us-central1
  kubernetes_version: 1.28.9-gke.1289000
  tags:
  - "nebari-quansight-dev"
  node_groups:
    general:
      instance: n1-standard-8
      min_nodes: 1
      max_nodes: 1
      type: general   # NEW

    user:
      instance: n1-standard-4
      min_nodes: 0
      max_nodes: 200
      type: user   # NEW

    large:
      instance: n1-standard-8
      min_nodes: 0
      max_nodes: 200
      type: user   # NEW

    worker:
      instance: n1-standard-4
      min_nodes: 0
      max_nodes: 1000
      type: worker   # NEW

^ We discussed this in a group meeting. This is not the ideal solution. Instead, we should just:

add a default value for the taints so if unspecified the taint is applied.
use a single taint for user and worker nodes.
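
A rough sketch of that defaulting behavior, assuming node groups are plain dictionaries keyed by name. The default taint value and the set of node group names that receive it are placeholders, and `apply_default_taints` is a hypothetical helper rather than the method in this PR:

```python
DEFAULT_TAINT = "dedicated=nebari:NoSchedule"  # placeholder value
DEFAULT_TAINTED_GROUPS = {"user", "worker"}     # assumed group names


def apply_default_taints(node_groups: dict) -> dict:
    """Return a copy of `node_groups` with missing taints filled in."""
    result = {}
    for name, group in node_groups.items():
        group = dict(group)  # shallow copy so the input is left untouched
        if "taints" not in group:
            # Only user and worker groups get the default taint.
            group["taints"] = [DEFAULT_TAINT] if name in DEFAULT_TAINTED_GROUPS else []
        result[name] = group
    return result


node_groups = {
    "general": {"instance": "n1-standard-8"},
    "user": {"instance": "n1-standard-4"},
    "worker": {"instance": "n1-standard-4", "taints": []},  # explicit opt-out
}
print(apply_default_taints(node_groups)["user"]["taints"])
# ['dedicated=nebari:NoSchedule']
```

In this sketch an explicit `taints: []` is respected, so only node groups that leave the field unspecified pick up the default.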

Adam-D-Lewis · 2024-12-30T21:10:07Z

Okay, this PR is ready for review!

viniciusdc · 2025-01-06T15:48:48Z

Hi, @Adam-D-Lewis. Were the changes we discussed last week applied? Regarding the default taints and overrides?

Adam-D-Lewis · 2025-01-06T15:54:17Z

Hi, @Adam-D-Lewis. Were the changes we discussed last week applied? Regarding the default taints and overrides?

Yes, they were in this commit. See the set_missing_taints_to_default_taints method as a starting point. Happy to answer any other questions or walk through it if needed.

Adam-D-Lewis · 2025-01-07T15:28:18Z

Looks like merging in main broke many tests.

viniciusdc · 2025-01-07T16:51:50Z

It looks like 'provider_enum_name_map' is not recognized, though it seems it was not removed...

Adam-D-Lewis · 2025-01-14T19:00:10Z

Failing test appears unrelated to this PR since it's a playwright test and this PR makes no changes to the UI/UX other than the command line when running nebari init.

viniciusdc · 2025-01-14T20:36:36Z

Thanks @Adam-D-Lewis, that's related to the recent conda-store update.

marcelovilla · 2025-01-15T07:09:36Z

@Adam-D-Lewis the failing tests have been addressed in #2911

Adam-D-Lewis added 6 commits June 26, 2024 09:59

save progress

5000f06

Merge branch 'develop' into node-taint

7ce8555

fix node taint check

a661514

Merge branch 'develop' into node-taint

fb55fab

fix node taints on gcp

7f1800d

add latest changes

40940f6

Adam-D-Lewis added 2 commits August 21, 2024 12:11

merge develop

cdac5c6

allow daemonsets to run on user node group

6382c7b

Adam-D-Lewis added 2 commits August 21, 2024 18:23

recreate node groups when taints change

e9d9dd9

quick attempt to get scheduler running on tanted worker node group

c55cd5f

Adam-D-Lewis added 2 commits October 25, 2024 14:50

Merge branch 'main' into node-taint

57e6e09

add default options to options_handler

a1370c9

Adam-D-Lewis added 8 commits October 28, 2024 09:33

add comments

0e7e11c

rename variable

adb9d74

add comment

7944071

make work for all providers

fa81fb9

move var back

da9fd82

move var back

6a1f81d

move var back

b4c08f3

move var back

9bae2a1

add reference

b3dbeda

Adam-D-Lewis added 4 commits November 4, 2024 12:35

more clean up

e05f143

Merge branch 'main' into node-taint

3a4ae6b

fix test

f3cb2e9

fix test error

b125e8c

Adam-D-Lewis mentioned this pull request Nov 4, 2024

Ask users if they'd like to have default taints added to the user and worker node groups of Nebari for potential cost savings #2824

Open

Adam-D-Lewis marked this pull request as ready for review November 4, 2024 19:51

Adam-D-Lewis requested review from dcmcand, viniciusdc and marcelovilla November 4, 2024 19:55

Adam-D-Lewis added this to the 2024.11.2 release milestone Nov 7, 2024

Merge branch 'main' into node-taint

8f9f846

viniciusdc removed this from the 2024.12.2 release milestone Dec 12, 2024

Adam-D-Lewis added 3 commits December 30, 2024 14:06

merge main

2264558

add test

747a293

Merge branch 'main' into node-taint

964f360

Merge branch 'main' into node-taint

4f48462

small cleanup

459ac01

Merge branch 'main' into node-taint

03c7c4a
