
Set min nodes to 0 for worker and user. #2168

Merged
108 commits, merged Feb 10, 2024

Conversation

@pt247 pt247 commented Dec 22, 2023

Reference Issues or PRs

#2154

What does this implement/fix?

Reduces the cost of the default deployment.

Change

An AWS node group can be associated with several Auto Scaling groups (ASGs), but in our case it's just one. To scale from zero in AWS, you need to follow these steps:

  1. Define the service node selector, such as dedicated=user.
  2. Ensure that the node groups have a corresponding label, such as dedicated=user, so the scheduler knows which node group the pod should be placed on.
  3. Tag the ASG to give cluster-autoscaler a hint about the labels its nodes will carry, e.g. k8s.io/cluster-autoscaler/node-template/label/dedicated=user (see the sketch below).
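
To make step 3 concrete, here is a minimal sketch, not the code in this PR, of how that hint tag could be applied with boto3; the ASG name and region are placeholders:

```python
import boto3

# Placeholder values -- the real ASG backing the user node group is created by the
# Nebari deployment; the name and region here are illustrative only.
ASG_NAME = "eks-user-node-group-asg"
REGION = "us-east-1"

autoscaling = boto3.client("autoscaling", region_name=REGION)

# cluster-autoscaler reads this node-template tag to learn which labels nodes in the
# (currently empty) group would have, so it knows the group can satisfy a pending pod
# with nodeSelector dedicated=user and can scale the group up from zero.
autoscaling.create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/node-template/label/dedicated",
            "Value": "user",
            "PropagateAtLaunch": True,
        }
    ]
)
```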

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):
    As part of this change, the user-scheduler is moved to the general node group. This is needed to ensure the user node group scales down to zero when the user is not running notebooks or Dask clusters (a quick way to verify the scheduler's placement is sketched below).
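
For reference, a rough sketch of how to confirm where the user-scheduler pods actually landed; the namespace and label selector are assumptions based on common Nebari/JupyterHub conventions, not taken from this PR:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# "dev" is the usual Nebari namespace and "component=user-scheduler" follows the
# zero-to-jupyterhub labelling convention -- adjust both for your deployment.
pods = v1.list_namespaced_pod("dev", label_selector="component=user-scheduler")
for pod in pods.items:
    print(pod.metadata.name, "->", pod.spec.node_name)
```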

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@pt247 pt247 closed this Dec 29, 2023

pt247 commented Dec 29, 2023

This needs further testing, as scaling up from 0 nodes requires more than just setting min nodes to 0.

@pt247 pt247 reopened this Jan 3, 2024

pt247 commented Jan 3, 2024

I have now tested scaling the user and worker nodes up from and down to 0. I did this with the following steps:

  1. Starting a JupyterHub session for the user.
  2. Creating a Dask cluster for the user.
  3. Deleting the Dask cluster.
  4. Stopping the user's JupyterHub session so the node is released.

I had to move the user-scheduler to the general node group, though. This was needed because the user-scheduler running on the user node will always trigger a user node scale-up.
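
For anyone repeating this test, here is a rough sketch of how the scale-down can be watched from the AWS side; the ASG name and region are placeholders:

```python
import time

import boto3

# Placeholder name -- use the ASG that backs the user (or worker) node group.
ASG_NAME = "eks-user-node-group-asg"
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Poll the ASG until cluster-autoscaler has scaled the node group back down to zero.
while True:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    print(f"desired={group['DesiredCapacity']} running={len(group['Instances'])}")
    if group["DesiredCapacity"] == 0 and not group["Instances"]:
        break
    time.sleep(60)
```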

@kcpevey kcpevey linked an issue Jan 4, 2024 that may be closed by this pull request

pt247 commented Jan 5, 2024

Just a couple of changes are needed:

  1. Moving the tagging of the ASG to a later stage, since the ASG needs to exist before it can be tagged (see the sketch below).
  2. Reverting changes to retain the original node selectors.
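
On point 1, the ASG name is only known after the EKS node group exists, which is why the tagging has to move to a later stage. A rough sketch of that lookup with boto3; the cluster and node group names are placeholders:

```python
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# Placeholder names -- in Nebari these come from the deployment configuration.
nodegroup = eks.describe_nodegroup(
    clusterName="nebari-aws", nodegroupName="user"
)["nodegroup"]

# The backing ASG only exists once the node group has been created, so this lookup
# (and the create_or_update_tags call sketched earlier) has to run in a later stage.
asg_names = [asg["name"] for asg in nodegroup["resources"]["autoScalingGroups"]]
print(asg_names)
```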


pt247 commented Jan 5, 2024

I need to figure out why the Local Integration Tests are failing:

[terraform]: │ Error: failed to refresh cached credentials, no EC2 IMDS role found,
[terraform]: │ operation error ec2imds: GetMetadata, http response error StatusCode: 404,
[terraform]: │ request to EC2 IMDS failed

I'm still trying to figure out what the root cause is. If anyone has suggestions, please let me know.
I will work on this later today.

@pt247 pt247 marked this pull request as draft January 17, 2024 17:32

pt247 commented Jan 18, 2024

Update

After moving the tagging logic to the post_deploy stage of Phase 2, we got past the Deploy Nebari phase in test-local-integration. However, we are now hitting a failure at the Cypress Run stage, specifically during the first test, "Check Nebari login and start JupyterLab". Reviewing the saved video, the login works, but the JupyterLab server does not start.

Next steps

I have set up an EC2 instance and installed the local cluster on it. I will attempt to log in by forwarding the display over SSH.

Please let me know if there is a better way.

@pt247 pt247 marked this pull request as ready for review January 18, 2024 17:06

pt247 commented Jan 18, 2024

I have modified the code to make sure the changes apply only to AWS and don't affect other deployments. With this, the tests are working as expected, except for the Playwright tests, which are also failing in development, as far as I know.

@costrouc costrouc left a comment


Looks great to me!

@pt247 pt247 closed this Feb 9, 2024
@pt247 pt247 reopened this Feb 9, 2024

pt247 commented Feb 9, 2024

All checks passed. If you are happy with the changes, please feel free to merge, @costrouc.

Successfully merging this pull request may close these issues.

[ENH] - Set minimum nodes to 0 for AWS deployment