
Set min nodes to 0 for worker and user. #2168

Merged
108 commits, merged Feb 10, 2024

Conversation

@pt247 pt247 commented Dec 22, 2023

Reference Issues or PRs

#2154

What does this implement/fix?

Reduces the cost of the default deployment.

Change

An AWS node group can be associated with several Auto Scaling groups (ASGs), but in our case it's just one. To scale from zero in AWS, you need to follow these steps:

  1. Define the service node selector, such as dedicated=user.
  2. Ensure that the node groups have a corresponding label, such as dedicated=user, so the scheduler knows which node group the pod should be placed on.
  3. Tag the ASG to give cluster-autoscaler a hint about the labels its nodes will carry, e.g. k8s.io/cluster-autoscaler/node-template/label/dedicated=user (see the sketch below).
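
To make step 3 concrete, here is a minimal sketch, not the code in this PR, of how that hint tag could be applied with boto3; the ASG name and region are placeholders:

```python
import boto3

# Placeholder values -- the real ASG backing the user node group is created by the
# Nebari deployment; the name and region here are illustrative only.
ASG_NAME = "eks-user-node-group-asg"
REGION = "us-east-1"

autoscaling = boto3.client("autoscaling", region_name=REGION)

# cluster-autoscaler reads this node-template tag to learn which labels nodes in the
# (currently empty) group would have, so it knows the group can satisfy a pending pod
# with nodeSelector dedicated=user and can scale the group up from zero.
autoscaling.create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/node-template/label/dedicated",
            "Value": "user",
            "PropagateAtLaunch": True,
        }
    ]
)
```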

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):
    As part of this change, the user-scheduler is moved to the general node group. This is needed to ensure the user node group scales down to zero when the user is not running notebooks or Dask clusters (a quick way to verify the scheduler's placement is sketched below).
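
For reference, a rough sketch of how to confirm where the user-scheduler pods actually landed; the namespace and label selector are assumptions based on common Nebari/JupyterHub conventions, not taken from this PR:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# "dev" is the usual Nebari namespace and "component=user-scheduler" follows the
# zero-to-jupyterhub labelling convention -- adjust both for your deployment.
pods = v1.list_namespaced_pod("dev", label_selector="component=user-scheduler")
for pod in pods.items:
    print(pod.metadata.name, "->", pod.spec.node_name)
```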

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@pt247 pt247 closed this Dec 29, 2023

pt247 commented Dec 29, 2023

This needs further testing, as scaling up from 0 nodes requires more than just setting min nodes to 0.

@pt247 pt247 reopened this Jan 3, 2024

pt247 commented Jan 3, 2024

I have now tested scaling the user and worker nodes up from and down to 0. I did this with the following steps:

  1. Starting a JupyterHub session for the user.
  2. Creating a Dask cluster for the user.
  3. Deleting the Dask cluster.
  4. Stopping the user's JupyterHub session so the node is released.

I had to move the user-scheduler to the general node group, though. This was needed because the user-scheduler running on the user node will always trigger a user node scale-up.
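
For anyone repeating this test, here is a rough sketch of how the scale-down can be watched from the AWS side; the ASG name and region are placeholders:

```python
import time

import boto3

# Placeholder name -- use the ASG that backs the user (or worker) node group.
ASG_NAME = "eks-user-node-group-asg"
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Poll the ASG until cluster-autoscaler has scaled the node group back down to zero.
while True:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    print(f"desired={group['DesiredCapacity']} running={len(group['Instances'])}")
    if group["DesiredCapacity"] == 0 and not group["Instances"]:
        break
    time.sleep(60)
```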

@kcpevey kcpevey linked an issue Jan 4, 2024 that may be closed by this pull request

pt247 commented Jan 5, 2024

Just a couple of changes are needed:

  1. Moving the tagging of the ASG to a later stage, since the ASG needs to exist before it can be tagged (see the sketch below).
  2. Reverting changes to retain the original node selectors.
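
On point 1, the ASG name is only known after the EKS node group exists, which is why the tagging has to move to a later stage. A rough sketch of that lookup with boto3; the cluster and node group names are placeholders:

```python
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# Placeholder names -- in Nebari these come from the deployment configuration.
nodegroup = eks.describe_nodegroup(
    clusterName="nebari-aws", nodegroupName="user"
)["nodegroup"]

# The backing ASG only exists once the node group has been created, so this lookup
# (and the create_or_update_tags call sketched earlier) has to run in a later stage.
asg_names = [asg["name"] for asg in nodegroup["resources"]["autoScalingGroups"]]
print(asg_names)
```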


pt247 commented Jan 5, 2024

I need to figure out why the Local Integration Tests are failing:

[terraform]: │ Error: failed to refresh cached credentials, no EC2 IMDS role found,
[terraform]: │ operation error ec2imds: GetMetadata, http response error StatusCode: 404,
[terraform]: │ request to EC2 IMDS failed

I'm still trying to figure out what the root cause is. If anyone has suggestions, please let me know.
I will work on this later today.

@pt247 pt247 marked this pull request as draft January 17, 2024 17:32

pt247 commented Jan 18, 2024

Update

After moving the tagging logic to the post_deploy stage of Phase 2, we got past the Deploy Nebari phase in test-local-integration. However, we are now hitting a failure at the Cypress Run stage, specifically during the first test, "Check Nebari login and start JupyterLab". Reviewing the saved video, the login works, but the JupyterLab server does not start.

Next steps

I have set up an EC2 instance and installed the local cluster on it. I will attempt to log in by forwarding the display over SSH.

Please let me know if there is a better way.

@pt247 pt247 marked this pull request as ready for review January 18, 2024 17:06

pt247 commented Jan 18, 2024

I have modified the code to make sure the changes apply only to AWS and don't affect other deployments. With this, the tests are working as expected, except for the Playwright tests, which are also failing in development, as far as I know.

@costrouc costrouc left a comment


Looks great to me!

@pt247 pt247 closed this Feb 9, 2024
@pt247 pt247 reopened this Feb 9, 2024

pt247 commented Feb 9, 2024

All checks passed. If you are happy with the changes, please feel free to merge, @costrouc.

Successfully merging this pull request may close these issues.

[ENH] - Set minimum nodes to 0 for AWS deployment