[train] Update vicuna release test example to use V2 #57767
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This PR updates the cluster configuration for the Vicuna example to use 16 worker GPUs for training, which is a valid change. However, this introduces an inconsistency with the documentation in the accompanying Jupyter Notebook, which has not been updated. I've left a specific comment on the configuration file change. Please update the notebook documentation to match the new cluster setup to avoid user confusion.
  instance_type: m5.4xlarge
  worker_node_types:
  - name: worker_node
    instance_type: g5.4xlarge
-   min_workers: 15
-   max_workers: 15
+   min_workers: 16
+   max_workers: 16
These changes correctly adjust the cluster to use 16 worker GPUs for training. However, this makes the documentation in the corresponding notebook (vicuna_13b_lightning_deepspeed_finetune.ipynb) outdated and misleading.
The notebook's "Cluster Setting" section needs to be updated to reflect:
- The head node is now m5.4xlarge (a CPU instance).
- There are now 16 worker nodes.
- The tip about using a GPU head node for inference is no longer accurate and should be revised, as inference will now run on a worker node.
Please update the notebook to ensure the example remains consistent and clear for users.
Looks good to me! I kicked off the release test so we should make sure it passes there too.
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-15 15:50:45,333\tINFO worker.py:1833 -- Connecting to existing Ray cluster at address: 10.0.171.127:6379...\n",
We should delete these output cells here and elsewhere so they don't show up in the docs.
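Since `.ipynb` files are plain JSON, the saved outputs can be stripped with a small standard-library script. This is a minimal sketch, not the project's actual cleanup tooling; the notebook structure below is an illustrative fragment:

```python
import json

def clear_outputs(nb: dict) -> dict:
    """Remove saved outputs and execution counts from every code cell."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Illustrative notebook fragment with one executed code cell.
nb = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "outputs": [
                {"name": "stderr", "output_type": "stream", "text": ["..."]}
            ],
            "source": ["print('hi')"],
        }
    ]
}

cleaned = clear_outputs(nb)
print(json.dumps(cleaned["cells"][0]["outputs"]))  # → []
```

For a whole file, `jupyter nbconvert --clear-output --inplace <notebook>.ipynb` accomplishes the same thing without custom code.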
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Updates the vicuna lightning deepspeed example to run w/ Train V2.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
Updates the vicuna lightning deepspeed example to run w/ Train V2.
Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s)
Documentation:
doc/source/ (if applicable)
Additional context