[UX - SkyServe] user now is able to select a LB policy from a range of options #4061

Merged
merged 87 commits into from
Nov 11, 2024

Conversation

AlexCuadron
Contributor

@AlexCuadron AlexCuadron commented Oct 10, 2024

I added the option to specify different LB policies; by default, RoundRobin is used. This PR is intended to enable easy switching between LB policies to facilitate their development.

The default behaviour (without user interaction) doesn't modify the execution flow, and the user cannot select any LB policy other than round-robin unless it has been added explicitly.
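
A minimal sketch of the idea described above, assuming a registry keyed by policy name with round-robin as the default. The names `LoadBalancingPolicy`, `RoundRobinPolicy`, `POLICIES`, `make_policy`, and `select_replica` are hypothetical and used only for illustration; they are not necessarily what `sky/serve/load_balancing_policies.py` defines.

```python
from typing import Dict, List, Optional, Type

# Hypothetical registry: policy name -> policy class.
POLICIES: Dict[str, Type["LoadBalancingPolicy"]] = {}
DEFAULT_POLICY_NAME = "round_robin"


class LoadBalancingPolicy:
    """Base class; subclasses register themselves under a name."""

    def __init_subclass__(cls, name: str, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        POLICIES[name] = cls

    def __init__(self) -> None:
        self.ready_replicas: List[str] = []

    def set_ready_replicas(self, replicas: List[str]) -> None:
        self.ready_replicas = list(replicas)

    def select_replica(self) -> Optional[str]:
        raise NotImplementedError


class RoundRobinPolicy(LoadBalancingPolicy, name="round_robin"):
    """Cycles through the ready replicas in order."""

    def __init__(self) -> None:
        super().__init__()
        self._index = 0

    def select_replica(self) -> Optional[str]:
        if not self.ready_replicas:
            return None
        replica = self.ready_replicas[self._index % len(self.ready_replicas)]
        self._index += 1
        return replica


def make_policy(name: Optional[str] = None) -> LoadBalancingPolicy:
    """Builds the named policy, falling back to round-robin when unset."""
    return POLICIES[name or DEFAULT_POLICY_NAME]()
```

In this sketch, adding a new policy amounts to defining another subclass with its own `name`, and asking for a name that was never registered raises `KeyError`.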

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@cblmemo cblmemo left a comment

This is awesome @AlexCuadron ! It will greatly expand our customizability. Left some discussion ;) We might also think of how to expose this feature to our end user, in our Service YAML - maybe add a load_balancing_policy section under service?

sky/serve/load_balancer.py (outdated, resolved)
sky/serve/load_balancer.py (outdated, resolved)
@AlexCuadron
Contributor Author

@cblmemo Done! PTAL again :D

Collaborator

@cblmemo cblmemo left a comment

Thanks @AlexCuadron for the awesome work! It mostly looks good to me. Left some discussions ;)

examples/serve/gorilla/gorilla.yaml (outdated, resolved)
sky/serve/load_balancer.py (outdated, resolved)
sky/serve/load_balancer.py (outdated, resolved)
sky/serve/load_balancer.py (outdated, resolved)
sky/serve/load_balancing_policies.py (outdated, resolved)
sky/serve/load_balancing_policies.py (outdated, resolved)
sky/serve/service_spec.py (outdated, resolved)
@AlexCuadron
Contributor Author

Thanks for the comments @cblmemo, fixed and ready for next round 💪

Collaborator

@cblmemo cblmemo left a comment

Thanks for the quick fix @AlexCuadron ! Mostly looks good to me. Left some discussions.

Btw, I created a branch here; let's merge our PR into that branch first. Merging into master might need more time, and we want to move fast ;)

https://github.com/skypilot-org/skypilot/tree/heterogeneous-lb

sky/serve/load_balancing_policies.py (resolved)
@AlexCuadron AlexCuadron changed the base branch from master to heterogeneous-lb October 20, 2024 06:09
@AlexCuadron
Contributor Author

I changed the base branch and updated based on comments :)
PTAL @cblmemo

Collaborator

@cblmemo cblmemo left a comment

Thanks for the prompt fix @AlexCuadron ! Mostly looks good to me. Left some nits :))

examples/serve/misc/cancel/service.yaml (outdated, resolved)
examples/serve/llama2/llama2.yaml (outdated, resolved)
sky/serve/load_balancer.py (outdated, resolved)
sky/serve/service_spec.py (outdated, resolved)
sky/serve/service_spec.py (outdated, resolved)
@AlexCuadron
Contributor Author

Done! PTAL @cblmemo :D

Collaborator

@cblmemo cblmemo left a comment

Thanks for the prompt fix @AlexCuadron ! Left some final nits 🚀

examples/serve/load_balancing_policies_example.yaml (outdated, resolved)
sky/serve/load_balancer.py (outdated, resolved)
sky/serve/__init__.py (outdated, resolved)
AlexCuadron and others added 2 commits October 23, 2024 13:38
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [Catalog] Silently ignore TPU price not found.

* assert for non tpu v6e

* format
Michaelvll and others added 7 commits November 7, 2024 15:55
…ing (skypilot-org#4264)

* fix race condition for setting job status to FAILED during INIT

* Fix

* fix

* format

* Add smoke tests

* revert pending submit

* remove update entirely for the job schedule step

* wait for job 32 to finish

* fix smoke

* move and rename

* Add comment

* minor
* [docs] use k8s instead of kubernetes in the CLI

* fix docs build script for linux

* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [jobs] autodown managed job clusters

If all goes correctly, the managed job controller should tear down a managed job
cluster once the managed job completes. However, if the controller fails somehow
(e.g. crashes, is terminated, etc), we don't want to leak resources.

As a failsafe, set autodown on the job cluster. This is not foolproof, since the
skylet on the cluster can also crash, but it's likely to catch many cases.

* add comment about autodown duration

* add leading _
* Update cloud_vm_ray_backend.py

* Update cloud_vm_ray_backend.py

* format
@AlexCuadron
Contributor Author

PTAL @Michaelvll

Collaborator

@cblmemo cblmemo left a comment

Hi @AlexCuadron ! Thanks for adding this. I tried this PR and got the following:

$ sky serve up examples/serve/load_balancing_policies_example.yaml
Service from YAML spec: examples/serve/load_balancing_policies_example.yaml
ValueError: Invalid service YAML: Found unsupported field 'load_balancing_policy'.

Should we also update sky/utils/schemas.py?
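
The error above is the kind a strict schema produces when a field is not listed. A rough sketch of that behaviour, assuming the service schema rejects unknown keys: `SERVICE_FIELDS` and `validate_service` are hypothetical stand-ins rather than the contents of `sky/utils/schemas.py`, and the field list is illustrative (only `load_balancing_policy` and the error wording come from this thread).

```python
from typing import Any, Dict, Set

# Illustrative allow-list; only 'load_balancing_policy' is taken from the thread.
SERVICE_FIELDS: Set[str] = {"readiness_probe", "replicas"}


def validate_service(service: Dict[str, Any], allowed: Set[str]) -> None:
    """Rejects any key that the allow-list does not mention."""
    for field in service:
        if field not in allowed:
            raise ValueError(
                f"Invalid service YAML: Found unsupported field '{field}'.")


service = {"replicas": 2, "load_balancing_policy": "round_robin"}

# With the original allow-list, the new field is rejected.
try:
    validate_service(service, SERVICE_FIELDS)
    rejected = False
    message = ""
except ValueError as e:
    rejected = True
    message = str(e)

# Once the field is added to the allow-list, the same spec passes.
validate_service(service, SERVICE_FIELDS | {"load_balancing_policy"})
```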

sky/serve/load_balancer.py (outdated, resolved)
sky/serve/service_spec.py (outdated, resolved)
sky/serve/service_spec.py (resolved)
AlexCuadron and others added 4 commits November 7, 2024 22:07
Co-authored-by: Tian Xia <cblmemo@gmail.com>
1. Add available policies to schema validation
2. Show available policies in error message when invalid policy is specified
3. Display load balancing policy in service spec repr when explicitly set
…policies

Only 'round_robin' is currently implemented in LoadBalancingPolicy class
sky/utils/schemas.py (outdated, resolved)
Move policy validation to code to avoid duplication and make it easier to maintain when adding new policies
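
The commit notes above (checking the policy in code, listing the available policies in the error, and showing the policy in the spec repr only when it is set) could look roughly like the following. `ServiceSpec` here is a toy stand-in, and `AVAILABLE_POLICIES` and the message wording are hypothetical; this is a sketch of the approach, not the code in `sky/serve/service_spec.py`.

```python
from typing import List, Optional

# Assumed single source of truth for policy names (hypothetical).
AVAILABLE_POLICIES: List[str] = ["round_robin"]


class ServiceSpec:
    """Toy stand-in for a service spec that carries an optional LB policy."""

    def __init__(self, replicas: int,
                 load_balancing_policy: Optional[str] = None) -> None:
        # Validate in code so the list of names lives in one place.
        if (load_balancing_policy is not None and
                load_balancing_policy not in AVAILABLE_POLICIES):
            raise ValueError(
                f"Unknown load balancing policy: {load_balancing_policy!r}. "
                f"Available policies: {AVAILABLE_POLICIES}")
        self.replicas = replicas
        self.load_balancing_policy = load_balancing_policy

    def __repr__(self) -> str:
        lines = [f"Replicas: {self.replicas}"]
        # Only mention the policy when the user set it explicitly.
        if self.load_balancing_policy is not None:
            lines.append(
                f"Load balancing policy: {self.load_balancing_policy}")
        return "\n".join(lines)
```

Keeping the names in code means a schema would only need to check that the field is a string, so adding a policy would not require touching the schema as well.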
@AlexCuadron
Contributor Author

PTAL @cblmemo

Collaborator

@cblmemo cblmemo left a comment

Thanks for the quick fix @AlexCuadron ! Left 2 final nits ;)

sky/utils/schemas.py (outdated, resolved)
sky/serve/service_spec.py (outdated, resolved)
@AlexCuadron
Contributor Author

Oops, sorry for the circular import 😅
Should be gtg @cblmemo 👍

Collaborator

@cblmemo cblmemo left a comment

Thanks for the prompt fix @AlexCuadron ! LGTM.

@cblmemo cblmemo added this pull request to the merge queue Nov 11, 2024
Merged via the queue into skypilot-org:master with commit 3bfc29e Nov 11, 2024
20 checks passed
@cblmemo cblmemo deleted the user_select_policy branch November 11, 2024 03:22